Graph Databases: Neo4j and Graph Traversal Patterns
Learn Neo4j graph database modeling with Cypher. Covers nodes, edges, social networks, recommendation engines, fraud detection, and when graphs are not the right fit.
Graph Databases: Neo4j and Graph Traversal Patterns
Most databases store relationships as foreign keys and join them at query time. Graph databases store relationships as first-class citizens, making traversal across connections fast regardless of the total dataset size.
This changes what queries are practical. When you need to answer “find all friends of friends who bought this product,” relational databases slow down as the depth increases. Graph databases traverse relationships in time proportional to the hops you traverse, not the total data.
The Graph Model
A graph database has three core elements:
Nodes represent entities. They have properties (key-value pairs) and labels (type markers).
Edges (or relationships) connect nodes. They have direction, type, and properties. In Neo4j, edges are always directed but you can query in either direction.
Properties store data on both nodes and edges.
// A simple social graph
CREATE (alice:Person {name: 'Alice', age: 30})
CREATE (bob:Person {name: 'Bob', age: 28})
CREATE (carol:Person {name: 'Carol', age: 32})
CREATE (dave:Person {name: 'Dave', age: 27})
CREATE (alice)-[:FRIEND {since: 2020}]->(bob)
CREATE (alice)-[:FRIEND {since: 2019}]->(carol)
CREATE (bob)-[:FRIEND {since: 2021}]->(dave)
CREATE (carol)-[:FRIEND {since: 2018}]->(dave)
Visually:
[Alice] --FRIEND--> [Bob]
| |
FRIEND FRIEND
| |
v v
[Carol] <--FRIEND-- [Dave]
flowchart LR
App["Application<br/>(Cypher Query)"]
Traversal["Traversal<br/>Engine"]
Store["Property<br/>Store"]
RelStore["Relationship<br/>Store"]
Index["Index<br/>(Lucene)"]
Result["Result<br/>Set"]
App --> Traversal
Traversal --> Store
Traversal --> RelStore
Traversal --> Index
RelStore --> Result
Index --> Result
Neo4j’s traversal engine walks the graph by following relationships from starting nodes. It consults the property store for node/relationship data and the Lucene index for label and property lookups. Traversal time depends on the number of hops, not the total graph size.
Cypher: Neo4j’s Query Language
Cypher uses ASCII-art-like syntax to represent patterns. Nodes are in parentheses, relationships are arrows.
// Find all of Alice's friends
MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(friend)
RETURN friend.name
// Find mutual friends of Alice and Bob
MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(mutual)
MATCH (bob:Person {name: 'Bob'})-[:FRIEND]->(mutual)
RETURN mutual.name
// Find friends of friends (2nd degree connections)
MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->()-[:FRIEND]->(fof)
RETURN fof.name
// Exclude already-friends
MATCH (alice:Person {name: 'Alice'})-[:FRIEND*2]->(potential_friend)
WHERE NOT (alice)-[:FRIEND]->(potential_friend)
RETURN potential_friend.name
Pattern Matching
// Find users who bought products in the same category as their friends
MATCH (person:Person)-[:PURCHASED]->(product:Product)-[:IN_CATEGORY]->(category:Category)<-[:IN_CATEGORY]-(friend:Person)-[:PURCHASED]->(rec:Product)
WHERE NOT (person)-[:PURCHASED]->(rec)
RETURN rec.name AS recommendation, count(*) AS frequency
ORDER BY frequency DESC
LIMIT 10
// Find the shortest path between two people
MATCH path = shortestPath(
(alice:Person {name: 'Alice'})-[*]-(dave:Person {name: 'Dave'})
)
RETURN path
// Find all relationships in a network path
MATCH path = (a:Person)-[:FRIEND|PURCHASED|REVIEWED*1..3]-(b:Person)
WHERE a.name = 'Alice' AND b.name = 'Dave'
RETURN path
Aggregation and Analysis
// Count friends by person
MATCH (p:Person)-[:FRIEND]->(friend)
RETURN p.name, count(friend) AS friend_count
ORDER BY friend_count DESC
// Find influential users (most connections)
MATCH (p:Person)
RETURN p.name, size((p)--()) AS connections
ORDER BY connections DESC
LIMIT 10
// Collaborative filtering: "users who liked X also liked Y"
MATCH (user:Person)-[:LIKED]->(product:Product)<-[:LIKED]-()-[:LIKED]->(rec)
WHERE (user)-[:LIKED]->(product) AND not(user)-[:LIKED]->(rec)
RETURN product.name AS liked, rec.name AS recommended, count(*) AS strength
ORDER BY strength DESC
Where Graphs Actually Shine
Social Networks
Social networks are the canonical graph use case. Friends, followers, groups, likes, comments — all form a natural graph structure.
// Find friend recommendations (friends of friends who aren't friends yet)
MATCH (me:Person {name: 'Alice'})
MATCH (me)-[:FRIEND]->(friend)-[:FRIEND]->(suggestion)
WHERE NOT (me)-[:FRIEND]->(suggestion) AND suggestion <> me
RETURN suggestion.name, count(*) AS mutual_friend_count
ORDER BY mutual_friend_count DESC
LIMIT 5
// Find communities using weakly connected components
CALL algo.labelPropagation.stream('Person', 'FRIEND', {})
YIELD nodeId, label
RETURN label AS community, collect(members.name) AS members
// Note: above is pseudocode - actual implementation varies by Neo4j version
Recommendation Engines
Graphs power real-time recommendations by finding patterns in user behavior.
// "Users who viewed X also viewed Y"
MATCH (p:Person)-[:VIEWED]->(product:Product)<-[:VIEWED]-(other)-[:VIEWED]->(rec)
WHERE p.id = $current_user AND rec <> product
RETURN rec.name, count(*) AS views_together
ORDER BY views_together DESC
// Content-based recommendations
MATCH (user:Person {id: $user_id})-[:PURCHASED]->(owned:Product)-[:IN_CATEGORY|hasTag|TAGGED_WITH]->(attr)
WHERE NOT (user)-[:PURCHASED]->(rec:Product)-[:IN_CATEGORY|hasTag|TAGGED_WITH]->(attr)
AND NOT (user)-[:PURCHASED]->(rec)
RETURN rec, count(attr) AS matching_attributes
ORDER BY matching_attributes DESC
// "Popular in your network"
MATCH (user:Person {id: $user_id})-[:FRIEND*1..2]->(friend)-[:PURCHASED]->(product)
WHERE NOT (user)-[:PURCHASED]->(product)
RETURN product.name, count(DISTINCT friend) AS friend_count
ORDER BY friend_count DESC
Fraud Detection
Graphs reveal fraud patterns invisible to row-by-row analysis.
// Detect accounts sharing the same identifying info
MATCH (a:Account)-[:HAS_EMAIL|HAS_PHONE|HAS_IP]->(info)<-[:HAS_EMAIL|HAS_PHONE|HAS_IP]-(b:Account)
WHERE a <> b
WITH info, collect(a.id) AS accounts, collect(b.id) AS others
WHERE size(accounts) + size(others) > 5 // many accounts sharing info
RETURN info, accounts + others AS suspicious_accounts
// Find circular transactions (potential money laundering)
MATCH path = (a:Account)-[:SENT_TO*5..10]->(a)
WHERE a <> head(nodes(path))
RETURN path
// Detect velocity patterns (many small transactions)
MATCH (a:Account)-[t:TRANSACTION]->()
WHERE t.timestamp > datetime() - duration('P1D')
WITH a, count(t) AS tx_count, sum(t.amount) AS total_amount
WHERE tx_count > 100 AND total_amount < tx_count * 10 // many small amounts
RETURN a.id AS account, tx_count, total_amount
Network and IT Operations
// Find dependencies between services
MATCH (a:Service)-[:DEPENDS_ON]->(b:Service)
RETURN a.name AS service, collect(b.name) AS dependencies
// Find critical paths (no redundancy)
MATCH path = (a:Service)-[:DEPENDS_ON*]->(b:Service)
WHERE NOT exists(()-[:DEPENDS_ON]->(a)) // no incoming dependencies
AND NOT exists((b)-[:DEPENDS_ON]->()) // no outgoing dependencies
RETURN a.name AS source, b.name AS sink
// Root cause analysis
MATCH path = (failure:Incident)-[:AFFECTS*0..3]-(service:Service)
WHERE failure.severity = 'critical'
RETURN service.name, length(path) AS distance_from_incident
ORDER BY distance_from_incident
Modeling: Nodes vs Edges
The choice of what becomes a node vs an edge shapes query capabilities.
Things that work well as nodes:
- Entities with independent existence (Person, Product, Company)
- Abstract concepts that participate in multiple relationships
- Records you need to query directly
Things that work well as edges:
- Relationships between entities
- Events that have no identity beyond the connection
- Attributes that describe a relationship (date, weight, role)
// Good: Purchase as an edge (an event, not an entity)
MATCH (customer)-[p:PURCHASED]->(product)
WHERE p.date > '2024-01-01'
RETURN product.name, sum(p.amount) AS revenue
// Consider: Purchase as a node when you need to query purchases independently
// e.g., "find all purchases with these exact items"
CREATE (p:Purchase {id: 'pur-123', date: date('2024-03-15')})
CREATE (customer)-[:MADE_PURCHASE]->(p)
CREATE (p)-[:INCLUDED_ITEM]->(product)
Practical Neo4j Operations
Python Driver
from neo4j import GraphDatabase
class Neo4jConnection:
def __init__(self, uri, user, password):
self.driver = GraphDatabase.driver(uri, auth=(user, password))
def query(self, cypher, parameters=None):
with self.driver.session() as session:
result = session.run(cypher, parameters)
return [dict(record) for record in result]
def close(self):
self.driver.close()
# Usage
conn = Neo4jConnection('bolt://localhost:7687', 'neo4j', 'password')
# Create nodes
conn.query('''
CREATE (p:Person {name: $name, email: $email})
''', {'name': 'Alice', 'email': 'alice@example.com'})
# Query with parameters
friends = conn.query('''
MATCH (me:Person {name: $name})-[:FRIEND]->(friend)
RETURN friend.name, friend.email
''', {'name': 'Alice'})
# Find recommendations
recommendations = conn.query('''
MATCH (me:Person {name: $name})-[:PURCHASED]->(product)-[:IN_CATEGORY]->(cat)<-[:IN_CATEGORY]-(rec)
WHERE NOT (me)-[:PURCHASED]->(rec)
RETURN rec.name, count(*) AS score
ORDER BY score DESC
LIMIT 5
''', {'name': 'Alice'})
Importing Data
// Import from CSV
LOAD CSV FROM 'file:///users.csv' AS row
CREATE (u:User {
id: toInteger(row.id),
name: row.name,
email: row.email,
created_at: datetime(row.created_at)
});
// Import relationships
LOAD CSV FROM 'file:///friendships.csv' AS row
MATCH (a:User {id: toInteger(row.user_a)})
MATCH (b:User {id: toInteger(row.user_b)})
CREATE (a)-[:FRIEND {since: toInteger(row.year)}]->(b);
// Batch import with periodic commits
:auto USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///events.csv' AS row
CREATE (:Event {
id: row.event_id,
type: row.type,
timestamp: datetime(row.timestamp)
});
When to Use / When Not to Use Graph Databases
Graph databases shine when relationships are first-class data, not afterthoughts.
Use graph databases when:
- You need multi-hop traversals (friends-of-friends, recommendation chains, dependency graphs)
- Relationship patterns drive your core business logic (fraud rings, supply chains, social graphs)
- Your queries depend on the topology of connections, not just attribute values
- You need real-time path finding or shortest-path queries
- You are modeling domains where context and connection history directly inform outcomes
Do not use graph databases when:
- Your data is mostly independent records with few meaningful relationships
- Your primary access pattern is full-text search — use Elasticsearch instead
- You need heavy OLAP aggregation (sums, group-bys across millions of rows) — columnar databases win here
- Your schema is stable and relational with straightforward joins — the graph model adds complexity you will not use
- Writes heavily outnumber reads and the writes are not relationship-driven — a write-optimized store like Cassandra handles that better
Hybrid Approaches
Many systems combine graph with other databases.
Application Layer
|
+---> Graph DB (Neo4j) for social, recommendations
|
+---> Relational DB (PostgreSQL) for transactional
|
+---> Search (Elasticsearch) for full-text
|
+---> Cache (Redis) for hot data
# Example: hybrid architecture for e-commerce
def get_product_recommendations(user_id):
# Fast filter with search
candidates = elasticsearch.search(
query='laptop',
filters={'category': 'electronics', 'price_range': '500-2000'}
)
# Use graph for collaborative filtering
graph_result = neo4j.query('''
MATCH (u:User {id: $user_id})-[:PURCHASED]->()-[:IN_CATEGORY]->(cat)
WHERE u.id IN $candidates
WITH cat, count(*) AS score
MATCH (cat)<-[:IN_CATEGORY]-(rec)
WHERE NOT (u)-[:PURCHASED]->(rec)
RETURN rec.id, score
ORDER BY score DESC
LIMIT 10
''', {'user_id': user_id, 'candidates': [c['id'] for c in candidates]})
return merge_search_and_graph(candidates, graph_result)
Graph Database Comparison
| Feature | Neo4j | Amazon Neptune | TigerGraph |
|---|---|---|---|
| Query language | Cypher (openCypher) | SPARQL, Gremlin, OpenCypher | GSQL |
| License | GPL Community, commercial | AWS managed service | Open source, commercial |
| Native traversal | Yes — purpose-built | Yes | Yes |
| Scalability | Single-machine to cluster | Fully managed, multi-AZ | Horizontally scalable |
| OLAP / Analytics | Limited (via Apache Spark connector) | Analytics via Gremlin | Built-in parallel analytics |
| Cloud offering | Aura (managed), self-hosted | Neptune Serverless, provisioned | TigerGraph Cloud |
| Best for | General-purpose graphs, operational | AWS integration, SPARQL workloads | Large-scale analytics, real-time deep link analysis |
| Learning curve | Moderate (Cypher is intuitive) | Moderate (multiple query APIs) | Steeper (GSQL is powerful but verbose) |
Capacity Estimation: Node and Edge Storage in Neo4j
Neo4j stores each node, relationship, and property as separate records on disk. Node records are 15 bytes each plus property references. Relationship records are 34 bytes each plus property references. Property records vary by type: 1 byte for a small integer, n+1 bytes for a string (n bytes plus length byte), 8 bytes for a floating-point number.
For a social graph with 10 million users and 100 million friendships, the base storage is straightforward: 10 million nodes at roughly 15 bytes each is 150MB, and 100 million relationships at 34 bytes each is 3.4GB. Properties add more: if each user has 5 properties averaging 50 bytes each, that is 2.5GB. The total working set for this graph is roughly 6GB before indexes.
Indexes in Neo4j are separate from the graph store. The default index type is a B-tree, and it consumes roughly 30% of the node/relationship data size for highly selective indexes. Property existence constraints and range indexes add more overhead. For the 10 million user graph above, adding indexes on Person(name) and Person(email) adds roughly 1-2GB of index storage.
The practical implication: Neo4j on a single machine handles graphs in the hundreds of millions of edges comfortably if your working set fits in RAM. Beyond that, you need clustering. Neo4j Enterprise clusters data across machines using causal clustering, which splits the graph into distinct cores. Each core is a raft group of 3 or more replicas, and reads get distributed across replicas.
For memory sizing, Neo4j recommends your page cache (which caches the graph store on disk) be large enough to hold your hot data. With 6GB of graph data and 30% hot, aim for 2GB page cache minimum. Add heap for query execution: simple traversals need 512MB to 1GB heap, complex aggregations need 2GB+.
Observability Hooks: Neo4j DBMS Metrics and Query Profiling
Neo4j exposes DBMS metrics via the dbms.metrics namespace. The JMX endpoint (dbms.metrics.prometheus) or the Neo4j HTTP endpoint (/db/neo4j/metrics) provides access to key operational metrics.
Page cache hit ratio is the most important metric: page_cache_hits divided by page_cache_hits + page_cache_misses. Above 95% and Neo4j barely touches the disk. Below 80% and you are latency-bound on page faults. If page cache hit ratio drops, increase dbms.memory.pagecache.size or reduce the graph working set.
Transaction latency: transaction_started, transaction_committed, and transaction_rollbacks counters tell you throughput. transaction_time histograms give you p50/p95/p99 latency. If p99 latency spikes while throughput stays flat, you have slow queries holding connections. db.checking.time tracks constraint checking time separately from transaction execution.
For query profiling, use EXPLAIN before running a query to see the query plan without executing it, and PROFILE to execute and return actual runtime statistics. The hits column in a profile shows how many times each operator ran, and estimatedRows shows the planner’s cardinality estimate. A large gap between estimated and actual rows means stale statistics — run CALL db.stats.collect() to update.
For Cypher query monitoring, Neo4j Enterprise tracks query runtime via dbms.query_tracking.enabled. The slow query log (dbms.logs.query.enabled) logs queries exceeding dbms.logs.query.threshold (default 0, set to something like 100ms). Monitor dbms.transaction.timeout — if set, queries exceeding this limit are killed automatically.
Real-World Case Study: LinkedIn’s Graph Search
LinkedIn’s graph is the social network itself — 900+ million members, billions of connections, and search across professional relationships, company pages, job listings, and content. Their search infrastructure evolved through multiple generations, and understanding what they tried explains why graph search is harder than it looks.
Early LinkedIn search was keyword-based: you typed “Python developer” and got resumes with those keywords, ranked by some scoring function. This worked but ignored professional context. A Python developer at Google is different from one at a startup, and keyword search treated them the same.
Their solution was a graph-based approach that modeled professional identity as nodes and connections as edges: Person nodes, Company nodes, School nodes, Job nodes, all connected by edges like WORKS_AT, STUDIED_AT, CONNECTED_TO. Searching “people who worked at Google and now at a Series B startup” became a graph traversal, not a keyword search.
The operational challenge was latency. A graph traversal across LinkedIn’s scale takes time — traversing 2nd-degree connections (friends of friends) across 900 million members meant checking billions of potential paths. LinkedIn’s fix was pre-computation for common traversal patterns: they materialized common graph neighborhoods and cached results for frequently-searched relationship patterns. This is the graph equivalent of denormalization — you trade write amplification for read speed.
The lesson: graph databases make traversals possible, but at scale you still need to think about pre-computation, caching, and the cost of deep traversals. The graph model is expressive, but you cannot escape the fundamental tradeoff between flexibility and performance.
Interview Questions
Model email and phone as separate nodes or as properties on Account nodes. The query matches accounts sharing an email via HAS_EMAIL but diverging on phone via HAS_PHONE. The Cypher pattern finds accounts that share info (email) but have different phone nodes. The WHERE size(phones) > 0 filters out accounts without a recorded phone, and size(shared_phones) < size(phones) confirms at least one phone is not shared — meaning the same email is used by multiple accounts with different phone contact info.
The first thing to check is whether the query has an explicit depth limit. A pattern like (a)-[:FRIEND*]->(b) with no upper bound is the most common cause — it tries to traverse the entire reachable graph. Add a maximum depth: (a)-[:FRIEND*1..3]->(b). Then check whether indexes exist on the starting label and property — if Neo4j must scan all Person nodes to find Alice, that is expensive before the traversal even starts. Use PROFILE to see the operator tree, looking for NodeByLabelScan or AllNodesScan, which indicate missing index-based start. Finally, check if the starting node is a high-degree hub — if Alice has a million friends, expanding to all of them is slow regardless of indexes.
Choose a graph database when your core queries are traversals: friends of friends, shortest path between two people, multi-hop relationship-based recommendations. Relational databases can model social graphs and run these queries, but as traversal depth increases, join performance degrades because each join is a new table scan. A graph database traverses relationships directly, and traversal time depends on the depth of the traversal, not the total graph size. If your social feature mostly involves simple lookups by user ID or aggregate counts (how many friends does Alice have?), a relational database is simpler and sufficient. If it involves paths, recommendations based on connection patterns, or fraud detection across relationship graphs, a graph database pays off.
A single-server Neo4j deployment is straightforward — one machine, ACID transactions, no network overhead. It handles graphs up to whatever fits on that machine. Causal clustering splits the graph into replica cores using Raft consensus, providing fault tolerance and read scaling. Writes go through the leader replica, reads can be distributed to followers. The tradeoff is operational complexity: you now manage multiple servers, the consensus protocol adds latency to writes, and replica lag means you can read stale data on followers if you are not careful. Causal consistency in Neo4j guarantees you read your own writes on followers, but it does not guarantee all followers are up-to-date at any given moment.
Cypher MATCH queries are declarative — you specify the pattern, not the traversal method. The Cypher planner chooses the execution strategy. Traversal queries (using the Traversal API in Java or similar) are imperative — you explicitly define the traversal description (where to start, which relationships to follow, how deep, what to prune). Use Cypher MATCH for most queries — it is expressive and the planner is efficient. Use the Traversal API when you need fine-grained control over expansion order (e.g., breadth-first with depth cutoff), custom pruning logic (stop after finding N results), or when you are writing custom procedures in Java. Traversal API queries bypass the Cypher planner entirely, giving you explicit control at the cost of portability.
High-degree nodes are vertices with thousands or millions of relationships — a celebrity user, a popular product, a major hub in a network. When Neo4j traverses from a high-degree node, it must expand all of its relationships at once, consuming significant heap and CPU. The mitigation: always filter before expanding. Instead of MATCH (p:Person)-[:FRIEND]->(friend), add a label filter: MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(friend:Friend) — filtering on friend label before expansion reduces the expansion scope. Use relationship-type filtering: MATCH (p)-[r:FRIEND]->(friend) without a label filter still requires expanding all FRIEND relationships. For queries that must process all relationships of a high-degree node, paginate the results using SKIP/LIMIT or use the ScrollableResult API in the Java driver to stream results rather than loading all at once.
Category
Related Posts
Document Databases: MongoDB and CouchDB Data Modeling
Learn MongoDB and CouchDB data modeling, embedding vs referencing, schema validation, and when document stores fit better than relational databases.
Column-Family Databases: Cassandra and HBase Architecture
Cassandra and HBase data storage explained. Learn partition key design, column families, time-series modeling, and consistency tradeoffs.
Denormalization
When to intentionally duplicate data for read performance. Tradeoffs with normalization, update anomalies, and application-level denormalization strategies.