Graph Databases: Neo4j and Graph Traversal Patterns

Learn Neo4j graph database modeling with Cypher. Covers nodes, edges, social networks, recommendation engines, fraud detection, and when graphs are not the right fit.

published: March 26, 2026 reading time: 17 min read author: GeekWorkBench

Graph Databases: Neo4j and Graph Traversal Patterns

Most databases store relationships as foreign keys and join them at query time. Graph databases store relationships as first-class citizens, making traversal across connections fast regardless of the total dataset size.

This changes what queries are practical. When you need to answer “find all friends of friends who bought this product,” relational databases slow down as the depth increases. Graph databases traverse relationships in time proportional to the hops you traverse, not the total data.

The Graph Model

A graph database has three core elements:

Nodes represent entities. They have properties (key-value pairs) and labels (type markers).

Edges (or relationships) connect nodes. They have direction, type, and properties. In Neo4j, edges are always directed but you can query in either direction.

Properties store data on both nodes and edges.

// A simple social graph
CREATE (alice:Person {name: 'Alice', age: 30})
CREATE (bob:Person {name: 'Bob', age: 28})
CREATE (carol:Person {name: 'Carol', age: 32})
CREATE (dave:Person {name: 'Dave', age: 27})

CREATE (alice)-[:FRIEND {since: 2020}]->(bob)
CREATE (alice)-[:FRIEND {since: 2019}]->(carol)
CREATE (bob)-[:FRIEND {since: 2021}]->(dave)
CREATE (carol)-[:FRIEND {since: 2018}]->(dave)

Visually:

    [Alice] --FRIEND--> [Bob]
      |                |
   FRIEND            FRIEND
      |                |
      v                v
    [Carol] <--FRIEND-- [Dave]

flowchart LR
    App["Application<br/>(Cypher Query)"]
    Traversal["Traversal<br/>Engine"]
    Store["Property<br/>Store"]
    RelStore["Relationship<br/>Store"]
    Index["Index<br/>(Lucene)"]
    Result["Result<br/>Set"]

    App --> Traversal
    Traversal --> Store
    Traversal --> RelStore
    Traversal --> Index
    RelStore --> Result
    Index --> Result

Neo4j’s traversal engine walks the graph by following relationships from starting nodes. It consults the property store for node/relationship data and the Lucene index for label and property lookups. Traversal time depends on the number of hops, not the total graph size.

Cypher: Neo4j’s Query Language

Cypher uses ASCII-art-like syntax to represent patterns. Nodes are in parentheses, relationships are arrows.

// Find all of Alice's friends
MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(friend)
RETURN friend.name

// Find mutual friends of Alice and Bob
MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(mutual)
MATCH (bob:Person {name: 'Bob'})-[:FRIEND]->(mutual)
RETURN mutual.name

// Find friends of friends (2nd degree connections)
MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->()-[:FRIEND]->(fof)
RETURN fof.name

// Exclude already-friends
MATCH (alice:Person {name: 'Alice'})-[:FRIEND*2]->(potential_friend)
WHERE NOT (alice)-[:FRIEND]->(potential_friend)
RETURN potential_friend.name

Pattern Matching

// Find users who bought products in the same category as their friends
MATCH (person:Person)-[:PURCHASED]->(product:Product)-[:IN_CATEGORY]->(category:Category)<-[:IN_CATEGORY]-(friend:Person)-[:PURCHASED]->(rec:Product)
WHERE NOT (person)-[:PURCHASED]->(rec)
RETURN rec.name AS recommendation, count(*) AS frequency
ORDER BY frequency DESC
LIMIT 10

// Find the shortest path between two people
MATCH path = shortestPath(
  (alice:Person {name: 'Alice'})-[*]-(dave:Person {name: 'Dave'})
)
RETURN path

// Find all relationships in a network path
MATCH path = (a:Person)-[:FRIEND|PURCHASED|REVIEWED*1..3]-(b:Person)
WHERE a.name = 'Alice' AND b.name = 'Dave'
RETURN path

Aggregation and Analysis

// Count friends by person
MATCH (p:Person)-[:FRIEND]->(friend)
RETURN p.name, count(friend) AS friend_count
ORDER BY friend_count DESC

// Find influential users (most connections)
MATCH (p:Person)
RETURN p.name, size((p)--()) AS connections
ORDER BY connections DESC
LIMIT 10

// Collaborative filtering: "users who liked X also liked Y"
MATCH (user:Person)-[:LIKED]->(product:Product)<-[:LIKED]-()-[:LIKED]->(rec)
WHERE (user)-[:LIKED]->(product) AND not(user)-[:LIKED]->(rec)
RETURN product.name AS liked, rec.name AS recommended, count(*) AS strength
ORDER BY strength DESC

Where Graphs Actually Shine

Social networks are the canonical graph use case. Friends, followers, groups, likes, comments — all form a natural graph structure.

// Find friend recommendations (friends of friends who aren't friends yet)
MATCH (me:Person {name: 'Alice'})
MATCH (me)-[:FRIEND]->(friend)-[:FRIEND]->(suggestion)
WHERE NOT (me)-[:FRIEND]->(suggestion) AND suggestion <> me
RETURN suggestion.name, count(*) AS mutual_friend_count
ORDER BY mutual_friend_count DESC
LIMIT 5

// Find communities using weakly connected components
CALL algo.labelPropagation.stream('Person', 'FRIEND', {})
YIELD nodeId, label
RETURN label AS community, collect(members.name) AS members

// Note: above is pseudocode - actual implementation varies by Neo4j version

Recommendation Engines

Graphs power real-time recommendations by finding patterns in user behavior.

// "Users who viewed X also viewed Y"
MATCH (p:Person)-[:VIEWED]->(product:Product)<-[:VIEWED]-(other)-[:VIEWED]->(rec)
WHERE p.id = $current_user AND rec <> product
RETURN rec.name, count(*) AS views_together
ORDER BY views_together DESC

// Content-based recommendations
MATCH (user:Person {id: $user_id})-[:PURCHASED]->(owned:Product)-[:IN_CATEGORY|hasTag|TAGGED_WITH]->(attr)
WHERE NOT (user)-[:PURCHASED]->(rec:Product)-[:IN_CATEGORY|hasTag|TAGGED_WITH]->(attr)
AND NOT (user)-[:PURCHASED]->(rec)
RETURN rec, count(attr) AS matching_attributes
ORDER BY matching_attributes DESC

// "Popular in your network"
MATCH (user:Person {id: $user_id})-[:FRIEND*1..2]->(friend)-[:PURCHASED]->(product)
WHERE NOT (user)-[:PURCHASED]->(product)
RETURN product.name, count(DISTINCT friend) AS friend_count
ORDER BY friend_count DESC

Fraud Detection

Graphs reveal fraud patterns invisible to row-by-row analysis.

// Detect accounts sharing the same identifying info
MATCH (a:Account)-[:HAS_EMAIL|HAS_PHONE|HAS_IP]->(info)<-[:HAS_EMAIL|HAS_PHONE|HAS_IP]-(b:Account)
WHERE a <> b
WITH info, collect(a.id) AS accounts, collect(b.id) AS others
WHERE size(accounts) + size(others) > 5  // many accounts sharing info
RETURN info, accounts + others AS suspicious_accounts

// Find circular transactions (potential money laundering)
MATCH path = (a:Account)-[:SENT_TO*5..10]->(a)
WHERE a <> head(nodes(path))
RETURN path

// Detect velocity patterns (many small transactions)
MATCH (a:Account)-[t:TRANSACTION]->()
WHERE t.timestamp > datetime() - duration('P1D')
WITH a, count(t) AS tx_count, sum(t.amount) AS total_amount
WHERE tx_count > 100 AND total_amount < tx_count * 10  // many small amounts
RETURN a.id AS account, tx_count, total_amount

Network and IT Operations

// Find dependencies between services
MATCH (a:Service)-[:DEPENDS_ON]->(b:Service)
RETURN a.name AS service, collect(b.name) AS dependencies

// Find critical paths (no redundancy)
MATCH path = (a:Service)-[:DEPENDS_ON*]->(b:Service)
WHERE NOT exists(()-[:DEPENDS_ON]->(a))  // no incoming dependencies
AND NOT exists((b)-[:DEPENDS_ON]->())   // no outgoing dependencies
RETURN a.name AS source, b.name AS sink

// Root cause analysis
MATCH path = (failure:Incident)-[:AFFECTS*0..3]-(service:Service)
WHERE failure.severity = 'critical'
RETURN service.name, length(path) AS distance_from_incident
ORDER BY distance_from_incident

Modeling: Nodes vs Edges

The choice of what becomes a node vs an edge shapes query capabilities.

Things that work well as nodes:

Entities with independent existence (Person, Product, Company)
Abstract concepts that participate in multiple relationships
Records you need to query directly

Things that work well as edges:

Relationships between entities
Events that have no identity beyond the connection
Attributes that describe a relationship (date, weight, role)

// Good: Purchase as an edge (an event, not an entity)
MATCH (customer)-[p:PURCHASED]->(product)
WHERE p.date > '2024-01-01'
RETURN product.name, sum(p.amount) AS revenue

// Consider: Purchase as a node when you need to query purchases independently
// e.g., "find all purchases with these exact items"
CREATE (p:Purchase {id: 'pur-123', date: date('2024-03-15')})
CREATE (customer)-[:MADE_PURCHASE]->(p)
CREATE (p)-[:INCLUDED_ITEM]->(product)

Practical Neo4j Operations

Python Driver

from neo4j import GraphDatabase

class Neo4jConnection:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def query(self, cypher, parameters=None):
        with self.driver.session() as session:
            result = session.run(cypher, parameters)
            return [dict(record) for record in result]

    def close(self):
        self.driver.close()

# Usage
conn = Neo4jConnection('bolt://localhost:7687', 'neo4j', 'password')

# Create nodes
conn.query('''
CREATE (p:Person {name: $name, email: $email})
''', {'name': 'Alice', 'email': 'alice@example.com'})

# Query with parameters
friends = conn.query('''
MATCH (me:Person {name: $name})-[:FRIEND]->(friend)
RETURN friend.name, friend.email
''', {'name': 'Alice'})

# Find recommendations
recommendations = conn.query('''
MATCH (me:Person {name: $name})-[:PURCHASED]->(product)-[:IN_CATEGORY]->(cat)<-[:IN_CATEGORY]-(rec)
WHERE NOT (me)-[:PURCHASED]->(rec)
RETURN rec.name, count(*) AS score
ORDER BY score DESC
LIMIT 5
''', {'name': 'Alice'})

Importing Data

// Import from CSV
LOAD CSV FROM 'file:///users.csv' AS row
CREATE (u:User {
    id: toInteger(row.id),
    name: row.name,
    email: row.email,
    created_at: datetime(row.created_at)
});

// Import relationships
LOAD CSV FROM 'file:///friendships.csv' AS row
MATCH (a:User {id: toInteger(row.user_a)})
MATCH (b:User {id: toInteger(row.user_b)})
CREATE (a)-[:FRIEND {since: toInteger(row.year)}]->(b);

// Batch import with periodic commits
:auto USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///events.csv' AS row
CREATE (:Event {
    id: row.event_id,
    type: row.type,
    timestamp: datetime(row.timestamp)
});

When to Use / When Not to Use Graph Databases

Graph databases shine when relationships are first-class data, not afterthoughts.

Use graph databases when:

You need multi-hop traversals (friends-of-friends, recommendation chains, dependency graphs)
Relationship patterns drive your core business logic (fraud rings, supply chains, social graphs)
Your queries depend on the topology of connections, not just attribute values
You need real-time path finding or shortest-path queries
You are modeling domains where context and connection history directly inform outcomes

Do not use graph databases when:

Your data is mostly independent records with few meaningful relationships
Your primary access pattern is full-text search — use Elasticsearch instead
You need heavy OLAP aggregation (sums, group-bys across millions of rows) — columnar databases win here
Your schema is stable and relational with straightforward joins — the graph model adds complexity you will not use
Writes heavily outnumber reads and the writes are not relationship-driven — a write-optimized store like Cassandra handles that better

Hybrid Approaches

Many systems combine graph with other databases.

Application Layer
        |
        +---> Graph DB (Neo4j) for social, recommendations
        |
        +---> Relational DB (PostgreSQL) for transactional
        |
        +---> Search (Elasticsearch) for full-text
        |
        +---> Cache (Redis) for hot data

# Example: hybrid architecture for e-commerce
def get_product_recommendations(user_id):
    # Fast filter with search
    candidates = elasticsearch.search(
        query='laptop',
        filters={'category': 'electronics', 'price_range': '500-2000'}
    )

    # Use graph for collaborative filtering
    graph_result = neo4j.query('''
    MATCH (u:User {id: $user_id})-[:PURCHASED]->()-[:IN_CATEGORY]->(cat)
    WHERE u.id IN $candidates
    WITH cat, count(*) AS score
    MATCH (cat)<-[:IN_CATEGORY]-(rec)
    WHERE NOT (u)-[:PURCHASED]->(rec)
    RETURN rec.id, score
    ORDER BY score DESC
    LIMIT 10
    ''', {'user_id': user_id, 'candidates': [c['id'] for c in candidates]})

    return merge_search_and_graph(candidates, graph_result)

Graph Database Comparison

Feature	Neo4j	Amazon Neptune	TigerGraph
Query language	Cypher (openCypher)	SPARQL, Gremlin, OpenCypher	GSQL
License	GPL Community, commercial	AWS managed service	Open source, commercial
Native traversal	Yes — purpose-built	Yes	Yes
Scalability	Single-machine to cluster	Fully managed, multi-AZ	Horizontally scalable
OLAP / Analytics	Limited (via Apache Spark connector)	Analytics via Gremlin	Built-in parallel analytics
Cloud offering	Aura (managed), self-hosted	Neptune Serverless, provisioned	TigerGraph Cloud
Best for	General-purpose graphs, operational	AWS integration, SPARQL workloads	Large-scale analytics, real-time deep link analysis
Learning curve	Moderate (Cypher is intuitive)	Moderate (multiple query APIs)	Steeper (GSQL is powerful but verbose)

Capacity Estimation: Node and Edge Storage in Neo4j

Neo4j stores each node, relationship, and property as separate records on disk. Node records are 15 bytes each plus property references. Relationship records are 34 bytes each plus property references. Property records vary by type: 1 byte for a small integer, n+1 bytes for a string (n bytes plus length byte), 8 bytes for a floating-point number.

For a social graph with 10 million users and 100 million friendships, the base storage is straightforward: 10 million nodes at roughly 15 bytes each is 150MB, and 100 million relationships at 34 bytes each is 3.4GB. Properties add more: if each user has 5 properties averaging 50 bytes each, that is 2.5GB. The total working set for this graph is roughly 6GB before indexes.

Indexes in Neo4j are separate from the graph store. The default index type is a B-tree, and it consumes roughly 30% of the node/relationship data size for highly selective indexes. Property existence constraints and range indexes add more overhead. For the 10 million user graph above, adding indexes on Person(name) and Person(email) adds roughly 1-2GB of index storage.

The practical implication: Neo4j on a single machine handles graphs in the hundreds of millions of edges comfortably if your working set fits in RAM. Beyond that, you need clustering. Neo4j Enterprise clusters data across machines using causal clustering, which splits the graph into distinct cores. Each core is a raft group of 3 or more replicas, and reads get distributed across replicas.

For memory sizing, Neo4j recommends your page cache (which caches the graph store on disk) be large enough to hold your hot data. With 6GB of graph data and 30% hot, aim for 2GB page cache minimum. Add heap for query execution: simple traversals need 512MB to 1GB heap, complex aggregations need 2GB+.

Observability Hooks: Neo4j DBMS Metrics and Query Profiling

Neo4j exposes DBMS metrics via the dbms.metrics namespace. The JMX endpoint (dbms.metrics.prometheus) or the Neo4j HTTP endpoint (/db/neo4j/metrics) provides access to key operational metrics.

Page cache hit ratio is the most important metric: page_cache_hits divided by page_cache_hits + page_cache_misses. Above 95% and Neo4j barely touches the disk. Below 80% and you are latency-bound on page faults. If page cache hit ratio drops, increase dbms.memory.pagecache.size or reduce the graph working set.

Transaction latency: transaction_started, transaction_committed, and transaction_rollbacks counters tell you throughput. transaction_time histograms give you p50/p95/p99 latency. If p99 latency spikes while throughput stays flat, you have slow queries holding connections. db.checking.time tracks constraint checking time separately from transaction execution.

For query profiling, use EXPLAIN before running a query to see the query plan without executing it, and PROFILE to execute and return actual runtime statistics. The hits column in a profile shows how many times each operator ran, and estimatedRows shows the planner’s cardinality estimate. A large gap between estimated and actual rows means stale statistics — run CALL db.stats.collect() to update.

For Cypher query monitoring, Neo4j Enterprise tracks query runtime via dbms.query_tracking.enabled. The slow query log (dbms.logs.query.enabled) logs queries exceeding dbms.logs.query.threshold (default 0, set to something like 100ms). Monitor dbms.transaction.timeout — if set, queries exceeding this limit are killed automatically.

Real-World Case Study: LinkedIn’s Graph Search

LinkedIn’s graph is the social network itself — 900+ million members, billions of connections, and search across professional relationships, company pages, job listings, and content. Their search infrastructure evolved through multiple generations, and understanding what they tried explains why graph search is harder than it looks.

Early LinkedIn search was keyword-based: you typed “Python developer” and got resumes with those keywords, ranked by some scoring function. This worked but ignored professional context. A Python developer at Google is different from one at a startup, and keyword search treated them the same.

Their solution was a graph-based approach that modeled professional identity as nodes and connections as edges: Person nodes, Company nodes, School nodes, Job nodes, all connected by edges like WORKS_AT, STUDIED_AT, CONNECTED_TO. Searching “people who worked at Google and now at a Series B startup” became a graph traversal, not a keyword search.

The operational challenge was latency. A graph traversal across LinkedIn’s scale takes time — traversing 2nd-degree connections (friends of friends) across 900 million members meant checking billions of potential paths. LinkedIn’s fix was pre-computation for common traversal patterns: they materialized common graph neighborhoods and cached results for frequently-searched relationship patterns. This is the graph equivalent of denormalization — you trade write amplification for read speed.

The lesson: graph databases make traversals possible, but at scale you still need to think about pre-computation, caching, and the cost of deep traversals. The graph model is expressive, but you cannot escape the fundamental tradeoff between flexibility and performance.

Interview Questions

1. You need to find all pairs of people in your database who share the same email address but have different phone numbers — for fraud detection. How do you model and query this in Neo4j?

Model email and phone as separate nodes or as properties on Account nodes. The query matches accounts sharing an email via HAS_EMAIL but diverging on phone via HAS_PHONE. The Cypher pattern finds accounts that share info (email) but have different phone nodes. The WHERE size(phones) > 0 filters out accounts without a recorded phone, and size(shared_phones) < size(phones) confirms at least one phone is not shared — meaning the same email is used by multiple accounts with different phone contact info.

2. Your Neo4j query for "friends of friends of friends" is timing out after 30 seconds. What do you check?

The first thing to check is whether the query has an explicit depth limit. A pattern like (a)-[:FRIEND*]->(b) with no upper bound is the most common cause — it tries to traverse the entire reachable graph. Add a maximum depth: (a)-[:FRIEND*1..3]->(b). Then check whether indexes exist on the starting label and property — if Neo4j must scan all Person nodes to find Alice, that is expensive before the traversal even starts. Use PROFILE to see the operator tree, looking for NodeByLabelScan or AllNodesScan, which indicate missing index-based start. Finally, check if the starting node is a high-degree hub — if Alice has a million friends, expanding to all of them is slow regardless of indexes.

3. When would you choose a graph database over a relational database for a social network feature?

Choose a graph database when your core queries are traversals: friends of friends, shortest path between two people, multi-hop relationship-based recommendations. Relational databases can model social graphs and run these queries, but as traversal depth increases, join performance degrades because each join is a new table scan. A graph database traverses relationships directly, and traversal time depends on the depth of the traversal, not the total graph size. If your social feature mostly involves simple lookups by user ID or aggregate counts (how many friends does Alice have?), a relational database is simpler and sufficient. If it involves paths, recommendations based on connection patterns, or fraud detection across relationship graphs, a graph database pays off.

4. What is the difference between Neo4j causal clustering and a single-server deployment, and what are the operational tradeoffs?

A single-server Neo4j deployment is straightforward — one machine, ACID transactions, no network overhead. It handles graphs up to whatever fits on that machine. Causal clustering splits the graph into replica cores using Raft consensus, providing fault tolerance and read scaling. Writes go through the leader replica, reads can be distributed to followers. The tradeoff is operational complexity: you now manage multiple servers, the consensus protocol adds latency to writes, and replica lag means you can read stale data on followers if you are not careful. Causal consistency in Neo4j guarantees you read your own writes on followers, but it does not guarantee all followers are up-to-date at any given moment.

5. What is the difference between Neo4j's traversal query and a Cypher MATCH query? When would you use each?

Cypher MATCH queries are declarative — you specify the pattern, not the traversal method. The Cypher planner chooses the execution strategy. Traversal queries (using the Traversal API in Java or similar) are imperative — you explicitly define the traversal description (where to start, which relationships to follow, how deep, what to prune). Use Cypher MATCH for most queries — it is expressive and the planner is efficient. Use the Traversal API when you need fine-grained control over expansion order (e.g., breadth-first with depth cutoff), custom pruning logic (stop after finding N results), or when you are writing custom procedures in Java. Traversal API queries bypass the Cypher planner entirely, giving you explicit control at the cost of portability.

6. How do you handle high-degree nodes (celebrity nodes) in Neo4j, and what impact do they have on query performance?

High-degree nodes are vertices with thousands or millions of relationships — a celebrity user, a popular product, a major hub in a network. When Neo4j traverses from a high-degree node, it must expand all of its relationships at once, consuming significant heap and CPU. The mitigation: always filter before expanding. Instead of MATCH (p:Person)-[:FRIEND]->(friend), add a label filter: MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(friend:Friend) — filtering on friend label before expansion reduces the expansion scope. Use relationship-type filtering: MATCH (p)-[r:FRIEND]->(friend) without a label filter still requires expanding all FRIEND relationships. For queries that must process all relationships of a high-degree node, paginate the results using SKIP/LIMIT or use the ScrollableResult API in the Java driver to stream results rather than loading all at once.

Graph Databases: Neo4j and Graph Traversal Patterns

The Graph Model

Cypher: Neo4j’s Query Language

Pattern Matching

Aggregation and Analysis

Where Graphs Actually Shine

Social Networks

Recommendation Engines

Fraud Detection

Network and IT Operations

Modeling: Nodes vs Edges

Practical Neo4j Operations

Python Driver

Importing Data

When to Use / When Not to Use Graph Databases

Hybrid Approaches

Graph Database Comparison

Capacity Estimation: Node and Edge Storage in Neo4j

Observability Hooks: Neo4j DBMS Metrics and Query Profiling

Real-World Case Study: LinkedIn’s Graph Search

Interview Questions

Category

Tags

Related Posts

Document Databases: MongoDB and CouchDB Data Modeling

Column-Family Databases: Cassandra and HBase Architecture

Denormalization