Apache Solr: Enterprise Search Platform

Explore Apache Solr's powerful search capabilities including faceted search, relevance tuning, indexing strategies, and how it compares to Elasticsearch.

Published: March 22, 2026 | Updated: May 17, 2026 | Author: GeekWorkBench | Reading time: 22 min

Quick Summary

Provides a practical guide to Apache Solr as an enterprise search platform, covering document indexing with XML/JSON/CSV, explicit schema.xml configuration, Tika-based binary format extraction (PDF, DOCX), and faceted search including pivot facets. The guide explains relevance tuning via edismax, function queries, and BM25 similarity, with a detailed comparison against Elasticsearch covering schema control, ZooKeeper dependency, and faceting depth. Production topics include cache warming strategies, ZooKeeper failure handling, SolrJ connection pooling, and security configuration. After reading, you can deploy, configure, and troubleshoot Solr in production, and make an informed decision between Solr and Elasticsearch for a given use case.

Apache Solr: Enterprise Search Platform

Apache Solr is an open-source search platform built on Apache Lucene, first released in 2007. Elasticsearch came along a few years later and took most of the new-project market, but Solr still runs plenty of production systems — particularly in enterprises that already have Solr infrastructure or need specific security and operational features that Solr handles well.

Indexing in Solr

Solr indexes data as documents, which are similar to Elasticsearch documents. Each document contains fields, and each field has a type that determines how it is analyzed and stored.

Document Structure

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Getting Started with Solr</field>
    <field name="content">Solr is a search platform built on Lucene</field>
    <field name="category">tutorials</field>
    <field name="publish_date">2024-01-15T00:00:00Z</field>
  </doc>
</add>

Solr accepts data in multiple formats: XML, JSON, CSV, and binary. The ExtractingRequestHandler (SolrCell) can also extract content from PDFs, Word documents, and other binary formats using Apache Tika.
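The XML document above can also be sent as JSON. The following is a minimal Python sketch that only builds the request for the `/update` handler without sending it. The collection name `articles`, the local base URL, and the helper name `build_update_request` are assumptions made for illustration.

```python
import json

def build_update_request(base_url, collection, docs, commit_within_ms=1000):
    """Return (url, body) for posting a list of documents to Solr's /update handler."""
    # commitWithin asks Solr to make the documents searchable within the given window.
    url = f"{base_url}/solr/{collection}/update?commitWithin={commit_within_ms}"
    body = json.dumps(docs)  # Solr accepts a JSON array of documents
    return url, body

# Same fields as the XML example; the collection name is hypothetical.
doc = {
    "id": "1",
    "title": "Getting Started with Solr",
    "content": "Solr is a search platform built on Lucene",
    "category": "tutorials",
    "publish_date": "2024-01-15T00:00:00Z",
}
url, body = build_update_request("http://localhost:8983", "articles", [doc])
print(url)
print(body)
```

The body would be posted with `Content-Type: application/json` by whatever HTTP client the pipeline already uses.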

Schema Configuration

Solr defines field types and fields in a schema file: a hand-edited schema.xml, or a managed schema modified through the Schema API. Unlike Elasticsearch’s automatic mapping, Solr’s schema is typically managed explicitly, which gives you precise control over how data is indexed.

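<schema name="example" version="1.6">
  <!-- Untokenized type used for the unique key -->
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Porter stemmer for English -->
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text_en" indexed="true" stored="true"/>
  <!-- Searchable but not returned in results -->
  <field name="content" type="text_en" indexed="true" stored="false"/>
  <uniqueKey>id</uniqueKey>
</schema>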

The explicit schema means you get predictable, reproducible index behavior across deployments.
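When the schema is managed rather than hand-edited, the same field definitions can be expressed as Schema API commands posted to `/solr/<collection>/schema`. This sketch only builds the `add-field` payloads. The helper name `add_field_command` is an assumption, and the field type `text_en` must already exist in the target schema.

```python
import json

def add_field_command(name, field_type, indexed=True, stored=True):
    """Build one Schema API 'add-field' command as a JSON string."""
    return json.dumps({
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
        }
    })

title_cmd = add_field_command("title", "text_en")
# content is searchable but not stored, matching the schema shown above
content_cmd = add_field_command("content", "text_en", stored=False)
print(title_cmd)
print(content_cmd)
```

Keeping these commands in version control gives the same reproducibility as a checked-in schema.xml.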

Tika Extraction Optimization

Apache Tika handles binary format extraction — PDFs, DOCX, PPTX, images with OCR — via the ExtractingRequestHandler. Out-of-the-box settings work fine for low-volume workloads, but production pipelines need tuning.

Tika configuration to set in production:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">_text_</str>
    <str name="fmap.meta">_meta</str>
    <str name="uprefix">ignored_</str>
    <str name="lowernames">true</str>
    <!-- Keep indexing available metadata when Tika fails to parse the body -->
    <str name="ignoreTikaException">true</str>
  </lst>
</requestHandler>

<!-- Cap upload size (about 50 MB); this element belongs in the requestDispatcher section -->
<requestParsers multipartUploadLimitInKB="51200"/>

Tika extraction failure handling:

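Problem	Symptom	Solution
Password-protected PDFs	No text extracted, no error	Pass `resource.password` with the request, or configure a `passwordsFile`
Corrupt files	Tika throws exception	Set `ignoreTikaException=true` so available metadata is still indexed, or catch the error client-side and fall back to raw text
Very large files	OOM or timeout	Cap upload size and run extraction outside Solr (for example in a standalone Tika process), then index the extracted text
OCR failure	Image searches return empty	Configure Tika's `TesseractOCRConfig` through `parseContext.config` and confirm Tesseract is installed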

Metadata extraction for search filtering:

{
  "extracted_metadata": {
    "title": "Document Title",
    "author": "John Doe",
    "created": "2024-03-15T10:00:00Z",
    "keywords": ["search", "solr", "tika"]
  }
}

Extract metadata fields and map them to schema fields for faceting:

<field name="author" type="string" stored="true" indexed="true"/>
<field name="doc_date" type="date" stored="true" indexed="true"/>

Tika Extraction Key Takeaways

Enforce a timeout on extraction requests (client-side or at the HTTP layer) to prevent pipeline stalls
Cap upload size to protect against memory exhaustion
Map Tika metadata fields to schema fields for facetable attributes
Supply passwords for encrypted corporate documents via `resource.password` or a `passwordsFile`
Fall back to raw text extraction when Tika parsing fails
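To make these takeaways concrete, here is a hedged Python sketch that builds an extraction request URL for the `/update/extract` handler. The collection name `docs`, the document id, and the helper name `build_extract_url` are hypothetical. The timeout is left to the HTTP client that sends the file, for example `urllib.request.urlopen(request, timeout=30)`.

```python
from urllib.parse import urlencode

def build_extract_url(base_url, collection, doc_id, password=None):
    """Build the URL for posting one binary file to the extracting handler."""
    params = {
        "literal.id": doc_id,     # supplies the unique key for the extracted document
        "uprefix": "ignored_",    # prefix for metadata fields not present in the schema
        "lowernames": "true",     # normalize Tika metadata names
        "commit": "true",
    }
    if password is not None:
        # Password for an encrypted file, passed per request
        params["resource.password"] = password
    return f"{base_url}/solr/{collection}/update/extract?{urlencode(params)}"

plain_url = build_extract_url("http://localhost:8983", "docs", "report-42")
secret_url = build_extract_url("http://localhost:8983", "docs", "report-43", password="s3cret")
print(plain_url)
print(secret_url)
```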

Search Features

Faceted Search

Solr’s faceted search implementation is one of its strongest features. It categorizes search results by field values, counts, and ranges, letting users narrow down results by selecting facets.

{
  "query": "search",
  "facet": {
    "categories": {
      "type": "terms",
      "field": "category",
      "mincount": 1
    },
    "date_ranges": {
      "type": "range",
      "field": "publish_date",
      "start": "2020-01-01T00:00:00Z",
      "end": "2026-12-31T23:59:59Z",
      "gap": "+1YEAR"
    },
    "top_authors": {
      "type": "terms",
      "field": "author",
      "limit": 5
    }
  }
}

The response includes counts for each facet value, letting you display something like “Category: Tutorials (42), Technical (38), Opinion (15)” alongside search results.
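As an illustration of turning facet counts into that display string, the sketch below formats the `buckets` list that the JSON Facet API returns for a terms facet. The sample response and the helper name `format_facet` are made up for the example.

```python
def format_facet(label, facet_block):
    """Render a terms facet as 'Label: Value (count), ...'."""
    parts = [f"{b['val'].title()} ({b['count']})" for b in facet_block["buckets"]]
    return f"{label}: " + ", ".join(parts)

# Hypothetical slice of a JSON Facet API response
response = {"facets": {"count": 95, "categories": {"buckets": [
    {"val": "tutorials", "count": 42},
    {"val": "technical", "count": 38},
    {"val": "opinion", "count": 15},
]}}}

line = format_facet("Category", response["facets"]["categories"])
print(line)  # Category: Tutorials (42), Technical (38), Opinion (15)
```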

Pivot facets let you nest facets. For example, you can find counts for “category x author” combinations:

{
  "query": "*:*",
  "params": {
    "facet": true,
    "facet.pivot": "category,author"
  }
}

This is useful for analytics dashboards where you want to see not just counts per category, but counts per author within each category.
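A dashboard usually wants the nested pivot result as flat rows. The sketch below assumes the classic `facet.pivot` response shape, where each entry carries `field`, `value`, `count`, and an optional nested `pivot` list. The sample data and the helper name `flatten_pivot` are invented for illustration.

```python
def flatten_pivot(pivot_rows):
    """Return (category, author, count) rows from a two-level facet.pivot result."""
    rows = []
    for outer in pivot_rows:
        for inner in outer.get("pivot", []):
            rows.append((outer["value"], inner["value"], inner["count"]))
    return rows

# Hypothetical content of facet_counts.facet_pivot["category,author"]
pivot = [
    {"field": "category", "value": "technical", "count": 57, "pivot": [
        {"field": "author", "value": "author_a", "count": 42},
        {"field": "author", "value": "author_b", "count": 15},
    ]},
    {"field": "category", "value": "opinion", "count": 15, "pivot": [
        {"field": "author", "value": "author_a", "count": 15},
    ]},
]

rows = flatten_pivot(pivot)
for row in rows:
    print(row)
```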

Relevance Tuning

Solr offers extensive relevance controls. The edismax query parser is more forgiving than the standard parser and allows fine-tuning through parameters.

Boosting — You can boost fields or documents at query time:

q=search&defType=edismax&qf=title^2 content&pf=title^3

This sets:

qf: Query fields with title weighted 2x over content
pf: Phrase fields (title) with 3x boost for phrase matches
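Because `qf` and `pf` contain characters such as `^` and spaces, they should be URL-encoded when the request is built in code. A small sketch using only the standard library follows. The helper name `edismax_params` is an assumption.

```python
from urllib.parse import urlencode

def edismax_params(user_query):
    """Encode an edismax request with the field and phrase boosts described above."""
    return urlencode({
        "q": user_query,
        "defType": "edismax",
        "qf": "title^2 content",  # title matches weigh twice as much as content
        "pf": "title^3",          # extra boost when the whole phrase appears in title
    })

encoded = edismax_params("search")
print(encoded)
```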

Function Queries — Solr’s function queries let you incorporate mathematical expressions into relevance scoring:

{
  "query": "search",
  "params": {
    "defType": "edismax",
    "qf": "title content",
    "bf": "product(rord(popularity),0.1) recip(rord(last_modified),1,1000,1000)"
  }
}

Functions like rord, recip, linear, and max give you precise control over how documents are ranked.
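To see what `recip(rord(last_modified),1,1000,1000)` contributes, it helps to evaluate the formula by hand. Solr defines `recip(x,m,a,b)` as `a / (m*x + b)`. The sketch below reimplements that arithmetic in Python for a few reverse-ordinal ranks, where rank 1 is the most recently modified document.

```python
def recip(x, m, a, b):
    """Same arithmetic as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)

# Boost for the newest document (rank 1), the 1000th, and the 10000th
boosts = {rank: round(recip(rank, 1, 1000, 1000), 3) for rank in (1, 1000, 10000)}
for rank, boost in boosts.items():
    print(rank, boost)
```

The boost starts near 1 for the newest documents, halves by rank 1000, and decays smoothly after that, so recency nudges ranking without overwhelming text relevance.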

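Term Frequency Normalization: In fields that hold very long documents, raw term frequency becomes less meaningful because a term can repeat many times without the document being more relevant. Solr’s BM25 similarity (the default in current releases) saturates term frequency and normalizes it by field length; a custom similarity can apply a different normalization. To declare it explicitly: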

<similarity class="solr.BM25SimilarityFactory"/>
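The effect of length normalization can be sketched with the textbook BM25 term-frequency component. This is an illustration of the formula, not Lucene's exact implementation, and the parameter values `k1=1.2` and `b=0.75` are the commonly cited defaults.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component with length normalization."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)

# Same term frequency, average-length document vs one ten times longer
short_doc = bm25_tf(tf=5, doc_len=100, avg_doc_len=100)
long_doc = bm25_tf(tf=5, doc_len=1000, avg_doc_len=100)
# Saturation: even a very high tf stays below k1 + 1
saturated = bm25_tf(tf=50, doc_len=100, avg_doc_len=100)
print(round(short_doc, 3), round(long_doc, 3), round(saturated, 3))
```

With identical term counts, the longer document scores lower, and repeating a term many times yields diminishing returns.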

Solr vs Elasticsearch

Both Solr and Elasticsearch sit on Apache Lucene, but they have different philosophies.

Aspect	Solr	Elasticsearch
Schema	Explicit schema.xml	Dynamic mapping with explicit override
Configuration	XML files, more verbose	JSON-based REST API
Distributed search	SolrCloud (ZooKeeper required)	Built-in, no ZooKeeper
Faceted search	More mature faceting	Good faceting, less flexible
Operational complexity	Higher	Lower
Enterprise adoption	Legacy systems	Modern cloud-native

Solr’s explicit schema and XML configuration appeal to teams that want declarative control and reproducibility. Elasticsearch’s REST-first approach and dynamic mapping make it more developer-friendly.

If you need deep faceting for e-commerce or analytics, Solr is the better choice. For simple full-text search with easier operations, Elasticsearch is usually the default.

When to Use / When Not to Use

When to Use Apache Solr

E-commerce faceted search with complex hierarchies (category, brand, price range, attributes)
Enterprise search with strict security requirements and legacy system integration
Data preprocessing pipelines needing ETL-style document enrichment via Update Request Processors
Deep analytics with pivot facets, function queries, and complex aggregations
Document-centric search where schema control and reproducibility matter
Strict consistency requirements where SolrCloud’s ZooKeeper-based leader election fits the use case

When Not to Use Apache Solr

Cloud-native microservices where the ZooKeeper dependency adds operational overhead
Real-time log analysis (use Elasticsearch with Beats/Logstash instead)
Simple CRUD with search where a database full-text search feature suffices
Teams without Solr expertise — the learning curve is steeper than Elasticsearch
Rapid prototyping — Solr’s XML configuration slows down initial development
Kubernetes-first deployments where stateless pods and auto-scaling are priorities

Trade-off Analysis

When choosing Solr over alternatives, understanding the key tradeoffs helps make informed architectural decisions.

Schema Control vs Development Speed

Factor	Solr (Explicit Schema)	Elasticsearch (Dynamic)
Initial setup time	Longer: requires field definitions	Faster: auto-creates mappings
Type safety	Strong: mismatches rejected	Weak: silent conversions
Operational predictability	High: reproducible behavior	Medium: varies by data shape
Refactoring effort	Higher: must update schema	Lower: mappings adapt automatically
Performance tuning	Precise: known field types	Requires profiling per field type

Recommendation: Use Solr when schema stability and type safety are priorities. Use Elasticsearch when rapid prototyping and schema flexibility are needed.

ZooKeeper Dependency vs Built-in Coordination: SolrCloud depends on an external ZooKeeper ensemble (typically 3 or 5 nodes) for cluster state management, whereas Elasticsearch ships its own cluster coordination layer and needs no extra infrastructure. ZooKeeper adds operational overhead but delivers strong consistency guarantees for cluster state.

XML Configuration vs REST API — Solr uses verbose but explicit XML configs ideal for auditability and reproducibility; Elasticsearch’s JSON REST API enables faster iteration and programmatic management.

Faceted Search Depth vs Operational Complexity — Solr’s faceted search is superior for complex e-commerce and analytics use cases; Elasticsearch handles standard faceting adequately.

Indexing Throughput vs Resource Consumption — Solr has a higher memory footprint (JVM + file handles) but excels at bulk indexing and includes native Tika integration for binary formats. Elasticsearch offers lower per-shard resource consumption.

Backup and Recovery Trade-offs

Aspect	Solr	Elasticsearch
Snapshot method	Collections API `BACKUP` / `RESTORE`	Snapshot / restore API
Storage	Shared filesystem (NFS) or HDFS	Shared bucket (S3, GCS, Azure)
Incremental backup	Incremental in recent releases, full copies in older ones	Native with repository plugins

Production Failure Scenarios

Failure	Impact	Mitigation
ZooKeeper connection loss	Cluster goes read-only, leader election stalls	Deploy a ZooKeeper ensemble (3+ nodes), monitor ZooKeeper connectivity, tune `zkClientTimeout`
Leader shard replica failure	Write operations halt until failover completes	Configure `replicationFactor >= 2`, use SolrCloud auto-recovery
Index corruption	Queries return empty results or throw exceptions	Run Lucene's `CheckIndex` against the affected core, then recover from a healthy replica or a backup
Tika extraction failure	Binary attachments (PDF, DOCX) not indexed	Enforce a request timeout on extraction calls, fall back to raw text extraction
Facet overflow (too many facet buckets)	OOM or slow queries	Set `facet.enum.cache.minDf` to reduce enum iterations, paginate facets
Config drift across nodes	Inconsistent analyzers, different scoring	Store config in ZooKeeper (`configsets`), push updates atomically

Common Pitfalls / Anti-Patterns

Over-Configuring `solrconfig.xml`

Solr’s XML configuration is powerful but verbose. Teams often copy-paste configurations with unused components, causing memory bloat and slower query parsing.

Fix: Audit your solrconfig.xml quarterly. Remove unused request handlers, caches, and query parsers.

Ignoring the Overseer Bottleneck

The Overseer handles cluster state management. In large clusters with frequent shard moves, the Overseer queue backs up, causing delayed state propagation.

Fix: Avoid unnecessary shard moves. If you need frequent rebalancing, consider splitting collections instead.

Not Warming Caches Properly

Solr’s caches (filterCache, queryResultCache, documentCache) start empty. A cold cache after restart causes slow initial queries.

Fix: Configure autowarming on the caches, and register warming queries with `firstSearcher` / `newSearcher` listeners (`QuerySenderListener`).

<!-- Configure autowarming for filter cache -->
<filterCache class="solr.CaffeineCache"
             size="10000"
             initialSize="5000"
             autowarmCount="2000"/>

Using the Wrong Parser for User Input

The standard query parser (defType=lucene) is strict and returns an error on syntax mistakes such as an unbalanced quote or parenthesis. User-facing queries should use edismax, which is more forgiving.

Fix: Use edismax (or dismax) for user-generated query strings, and reserve the standard parser for queries your own code constructs:

q=user_input&defType=edismax&qf=title content
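When user text has to be embedded in a query that is parsed strictly (for example a filter query built by the application), the special characters of the Lucene query syntax should be escaped first. The sketch below is a hand-written helper, similar in spirit to SolrJ's `ClientUtils.escapeQueryChars`. The function name `escape_query` is an assumption.

```python
import re

# Characters with special meaning in Lucene query syntax
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')

def escape_query(text):
    """Prefix every special character with a backslash so it is matched literally."""
    return _SPECIAL.sub(r"\\\1", text)

escaped = escape_query("c++ (beta)")
print(escaped)
print(escape_query("title:solr"))
```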

Skipping Schema Validation

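With an explicit schema, Solr rejects values that do not fit the field type, such as a non-numeric string sent to an int field. The silent failures come from schemaless mode and catch-all dynamic fields: the type is guessed from the first value seen, so a numeric-looking ID can be indexed as a number, or a number as text, and range queries and sorting then behave unexpectedly.

Fix: Define field types explicitly in schema.xml, mark mandatory fields with required="true", set multiValued deliberately, and turn off field guessing in production.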

Quick Recap

Key bullets:

Solr’s explicit schema.xml gives precise control over field types and analyzers
Faceted search is one of Solr’s strongest features — pivot facets are useful for analytics dashboards
The edismax query parser is more forgiving than the standard (lucene) parser for user input
SolrCloud requires ZooKeeper for distributed coordination; Elasticsearch does not
Function queries enable mathematical expressions in relevance scoring
Authentication via BasicAuthPlugin, with rule-based authorization for access control
Tika integration handles PDF, DOCX, and other binary formats out of the box

Copy/Paste Checklist:

# Check cluster health
GET /admin/collections?action=CLUSTERSTATUS

# Create a collection with replication
GET /admin/collections?action=CREATE&name=my-collection&numShards=3&replicationFactor=2

# Reload collection after config change
GET /admin/collections?action=RELOAD&name=my-collection

# Force commit after bulk indexing
GET /my-collection/update?wt=json&commit=true

# Check facet counts
GET /my-collection/select?q=*:*&facet=true&facet.field=category&facet.limit=10

# Export schema field names
GET /my-collection/schema/fields

# Reload a core (autowarming and newSearcher warming queries run when the new searcher opens)
GET /admin/cores?action=RELOAD&core=my-collection
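The CLUSTERSTATUS call at the top of the checklist returns nested JSON that is easy to check in a script. The sketch below assumes the usual response layout (`cluster` > `collections` > collection name > `shards` > shard name > `replicas`) and reports shards that have no active leader. The sample payload and the helper name `shards_without_leader` are invented.

```python
def shards_without_leader(cluster_status, collection):
    """Return the names of shards that have no replica marked as active leader."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    missing = []
    for shard_name, shard in shards.items():
        has_leader = any(
            replica.get("leader") == "true" and replica.get("state") == "active"
            for replica in shard["replicas"].values()
        )
        if not has_leader:
            missing.append(shard_name)
    return sorted(missing)

# Hypothetical, heavily trimmed CLUSTERSTATUS response
status = {"cluster": {"collections": {"my-collection": {"shards": {
    "shard1": {"replicas": {
        "core_node1": {"state": "active", "leader": "true"},
        "core_node2": {"state": "active"},
    }},
    "shard2": {"replicas": {
        "core_node3": {"state": "down"},
        "core_node4": {"state": "recovering"},
    }},
}}}}}

leaderless = shards_without_leader(status, "my-collection")
print(leaderless)  # ['shard2']
```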

Observability

Solr metrics to track:

{
  "cluster_metrics": {
    "cluster_status": "active/standby/degraded",
    "num_shards": "total across collection",
    "num_nodes": "live nodes count",
    "overseer_queue_size": "< 100 typically"
  },
  "collection_metrics": {
    "docs_count": "total indexed documents",
    "size_in_bytes": "index size on disk",
    "num_segments": "< 50 per shard",
    "deletion_ratio": "< 10% ideal"
  },
  "query_metrics": {
    "QTime_avg": "< 100ms for interactive",
    "QTime_p99": "< 500ms for complex faceted queries",
    "request_times": "track by handler endpoint"
  }
}

Solr logs: logs/solr.log — shard leader elections, distributed search failures
GC logs: Solr is Java-heavy; long GC pauses cause query timeouts
Audit logs: security events, authentication failures, authorization denials
Slow query logs: set slowQueryThresholdMillis in solrconfig.xml to log queries exceeding the threshold
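Two of the derived metrics above are simple arithmetic over values Solr already reports. The sketch below computes a deletion ratio from `numDocs` and `maxDoc` (as shown in core status output) and a nearest-rank p99 over collected QTime samples. The function names and sample numbers are assumptions for illustration.

```python
def deletion_ratio(num_docs, max_doc):
    """Share of index slots held by deleted documents.

    maxDoc counts deleted-but-not-yet-merged documents, numDocs does not.
    """
    return 0.0 if max_doc == 0 else (max_doc - num_docs) / max_doc

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]

ratio = deletion_ratio(num_docs=900, max_doc=1000)
qtimes_ms = list(range(1, 101))  # stand-in for sampled QTime values
p99 = percentile(qtimes_ms, 99)
print(ratio, p99)
```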

Cache Warming

Solr uses multiple caches that are cold on startup or after eviction. Proper warming prevents slow queries on cache misses.

Cache Types:

Cache	Purpose	Size Guidance
`filterCache`	Stores document IDs matching filters	`numDocs * avgFiltersPerQuery`
`queryResultCache`	Caches document ID sets per query	`maxResults * numQueries`
`documentCache`	Stores stored fields per document	`maxDoc`
`fieldValueCache`	Caches terms for faceting/sorting	`numUniqueFields * avgValuesPerField`

Autowarming Configuration: When a new searcher opens after a commit or reload, autowarming re-executes the most recently used keys from the old cache against the new searcher to populate the new cache:

<!-- filterCache with autowarming -->
<filterCache class="solr.CaffeineCache"
             size="10000"
             initialSize="5000"
             autowarmCount="2000"/>

<!-- queryResultCache with autowarming -->
<queryResultCache class="solr.CaffeineCache"
                  size="1000"
                  initialSize="500"
                  autowarmCount="500"/>

Warm Queries Strategy: Define a set of representative queries that run when the first searcher opens at startup. Register the same queries under a newSearcher listener to warm after commits:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">featured:true</str>
      <str name="rows">10</str>
    </lst>
    <lst>
      <str name="q">category:technical</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
      <str name="rows">50</str>
    </lst>
  </arr>
</listener>

Continuous Warming via Scheduled Searches — For always-warm caches, schedule common queries via an external job:

# Warm queries executed after a reload; timeAllowed caps each query at 5 seconds
curl -s -o /dev/null "http://solr-node:8983/solr/my-collection/select?q=category:tutorials&rows=10&timeAllowed=5000"
curl -s -o /dev/null "http://solr-node:8983/solr/my-collection/select?q=*:*&rows=0&facet=true&facet.field=category&timeAllowed=5000"
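The same warming job can be written as a small script for a scheduler. This is a sketch under assumptions: the host `solr-node`, the collection name, and the query list are placeholders, and `warm()` is defined but not called here because it needs a reachable Solr node.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Representative queries; replace with patterns taken from production traffic
WARM_QUERIES = [
    {"q": "category:tutorials", "rows": 10},
    {"q": "*:*", "rows": 0, "facet": "true", "facet.field": "category"},
]

def warm_urls(base_url, collection, time_allowed_ms=5000):
    """Build one select URL per warming query, capped by timeAllowed."""
    urls = []
    for params in WARM_QUERIES:
        query = urlencode({**params, "timeAllowed": time_allowed_ms})
        urls.append(f"{base_url}/solr/{collection}/select?{query}")
    return urls

def warm(base_url, collection):
    """Run every warming query and discard the response body."""
    for url in warm_urls(base_url, collection):
        with urlopen(url, timeout=10) as response:
            response.read()

urls = warm_urls("http://solr-node:8983", "my-collection")
for url in urls:
    print(url)
```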

Cache Warming Benchmarks

Scenario	Cold Cache QTime	Warm Cache QTime	Improvement
Simple term query	450ms	12ms	~38x
Faceted query (5 fields)	1200ms	45ms	~27x
Join query	800ms	80ms	~10x

Cache Warming Key Takeaways

Configure autowarmCount to at least 50% of cache size for faster recovery
Define warming queries (firstSearcher / newSearcher listeners) using your top 10 most frequent query patterns
Monitor cache warmup time in metrics — target under 30 seconds
Use queryResultCache autowarming to pre-populate paginated result sets
Schedule background warming jobs during low-traffic windows

<!-- Enable slow query logging in solrconfig.xml (inside the <query> section) -->
<slowQueryThresholdMillis>5000</slowQueryThresholdMillis>

Alerts to Configure

Alert	Condition	Severity
ZooKeeper down	`zk_connections == 0`	Critical
No leader for shard	`shard leader == null` for > 1 min	Critical
Disk usage high	`disk usage > 80%`	Warning
GC pause	`GC pause > 1s`	Warning
Query timeout rate	`timeout_rate > 5%` over 5 min	Warning
Index replication lag	`replication_failed_count > 0`	Warning
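The threshold-based rows of this table can be prototyped as plain predicates before they are wired into a monitoring system. The metric keys below (`zk_connections`, `disk_usage_pct`, and so on) are illustrative names, not fields that Solr emits under those names, and the duration-based leader alert is left out because it needs state across samples.

```python
# (alert name, severity, predicate over a metrics snapshot)
RULES = [
    ("ZooKeeper down", "critical", lambda m: m["zk_connections"] == 0),
    ("Disk usage high", "warning", lambda m: m["disk_usage_pct"] > 80),
    ("GC pause", "warning", lambda m: m["gc_pause_s"] > 1.0),
    ("Query timeout rate", "warning", lambda m: m["timeout_rate_pct"] > 5),
    ("Index replication lag", "warning", lambda m: m["replication_failed_count"] > 0),
]

def evaluate(metrics):
    """Return (name, severity) for every rule whose condition holds."""
    return [(name, severity) for name, severity, condition in RULES if condition(metrics)]

# Hypothetical snapshot collected by a metrics exporter
snapshot = {
    "zk_connections": 3,
    "disk_usage_pct": 85,
    "gc_pause_s": 0.2,
    "timeout_rate_pct": 1,
    "replication_failed_count": 0,
}
firing = evaluate(snapshot)
print(firing)  # [('Disk usage high', 'warning')]
```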

Security Checklist

Enable Solr’s authentication via BasicAuth or Kerberos plugin
Configure authorization using rule-based or permission-based security
Use ZooKeeper ACLs to protect cluster configuration from unauthorized access
Enable TLS for all node-to-node and client-to-node communication
Restrict JMX/RMI endpoints; expose only on internal networks
Validate input in UpdateRequestProcessor chains to prevent injection
Restrict the Admin UI in production: require authentication and expose it only on internal networks
Use separate configsets for multi-tenant deployments to isolate schemas
Audit security events — log authentication failures and permission denials

Example: enable BasicAuth in security.json (the credential value is a placeholder for Solr's salted password hash and salt):

{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "blockUnknown": true,
    "credentials": {
      "admin": "HASHED_PASSWORD SALT"
    }
  }
}

Interview Questions

1. How does Solr's explicit schema.xml differ from Elasticsearch's dynamic mapping, and what are the advantages of each approach?

Expected answer points:

Solr requires explicit field definitions in schema.xml before indexing; Elasticsearch creates mappings automatically from documents
Solr's advantage: precise control over field types, analyzers, and indexing behavior; predictable, reproducible across deployments
Elasticsearch's advantage: faster initial setup, schema flexibility for varying document structures
Trade-off: Solr requires upfront schema design but prevents unexpected type conversions; Elasticsearch is more agile but silent type mismatches can cause issues

2. Explain how ZooKeeper coordinates leader election in a SolrCloud cluster.

Expected answer points:

ZooKeeper maintains cluster state, tracks shard-to-node mapping, and manages leader election
Each shard has one leader and N replica nodes
When a leader fails, ZooKeeper detects node loss and triggers election among replicas
Each replica registers as a candidate in ZooKeeper; the first candidate in the election queue becomes leader after syncing with its peers, and the result is published in the cluster state
Cluster state updates are atomic in ZooKeeper, ensuring consistency

3. What is faceted search in Solr, and how do pivot facets differ from regular field facets?

Expected answer points:

Faceted search categorizes results by field values, counts, or ranges — users filter by facets
Regular facets: single field categorization (e.g., count by category)
Pivot facets: nested categorization (e.g., category x author combinations)
Example: "Show me 42 technical posts by author A, 15 opinion posts by author A"
Useful for analytics dashboards requiring multi-dimensional analysis

4. How do you prevent cache misses from causing slow queries after a Solr restart?

Expected answer points:

Configure cache autowarming in solrconfig.xml to pre-populate caches on reload
Set `autowarmCount` to at least 50% of cache size for faster recovery
Define warming queries in `firstSearcher` / `newSearcher` listeners using representative query patterns from production traffic
Schedule background warming jobs to continuously refresh warm caches
Monitor `cache warmup time` metric — target under 30 seconds

5. What is the difference between `edismax` and standard query parsers in Solr?

Expected answer points:

Standard parser (`lucene`): strict syntax, returns errors on parse failures
`edismax` (extended dismax): more forgiving, handles malformed queries gracefully
`edismax` supports field boosting (`qf`), phrase boosting (`pf`), and minimum match (`mm`)
Recommendation: always use `edismax` for user-facing search interfaces
Standard parser is appropriate for controlled query construction by developers

6. How does Solr handle document extraction from binary formats like PDF and DOCX?

Expected answer points:

Solr uses Apache Tika via the ExtractingRequestHandler (SolrCell)
Tika extracts text and metadata from PDF, DOCX, PPTX, images (OCR), and other binary formats
Configuration includes field mapping (`fmap.*`), `uprefix` for unknown fields, and `lowernames`; timeouts and upload size limits protect the pipeline
Metadata fields can be mapped to schema fields for faceting
Failure handling: set timeouts, fallback to raw text, skip password-protected files

7. What are the key differences between Solr and Elasticsearch for distributed search?

Expected answer points:

SolrCloud requires ZooKeeper for cluster coordination; Elasticsearch has built-in coordination with no ZooKeeper dependency
Solr uses explicit XML configuration; Elasticsearch uses JSON REST API
Solr has more mature faceting, including pivot facets; Elasticsearch covers similar needs with its aggregations framework
Solr's operational complexity is higher; Elasticsearch is easier to operate
Both sit on Apache Lucene — core search functionality is similar

8. How do function queries work in Solr, and when would you use them?

Expected answer points:

Function queries incorporate mathematical expressions into relevance scoring
Functions include: `rord()`, `recip()`, `linear()`, `max()`, `product()`, `div()`
Example: `product(rord(popularity),0.1)` folds a popularity-based rank into the score
Use cases: social signals (popularity decay), geographic distance scoring, custom business rules
Can be used in `bf` (boost functions) parameter with edismax

9. What monitoring metrics are critical for a production Solr cluster?

Expected answer points:

QTime metrics: avg < 100ms for interactive, p99 < 500ms for complex faceted
Cluster health: numNodes, numShards, overseer_queue_size (target < 100)
Index metrics: docs_count, size_in_bytes, deletion_ratio (target < 10%)
GC pauses: monitor for pauses > 1 second
Disk usage: alert at > 80%
Replication lag: replication_failed_count > 0 indicates problems

10. What security measures should you implement for a public-facing Solr deployment?

Expected answer points:

Enable authentication: BasicAuth plugin or Kerberos
Configure authorization: rule-based or permission-based security
Enable TLS for all node-to-node and client-to-node communication
Protect ZooKeeper with ACLs to prevent cluster configuration access
Restrict JMX/RMI endpoints to internal networks only
Disable Admin UI in production or restrict access
Validate all input in UpdateRequestProcessor chains to prevent injection
Audit security events (authentication failures, permission denials)

11. How does shard routing work in SolrCloud, and what are the trade-offs between doc-based and custom routing?

Expected answer points:

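Doc-based routing (the default `compositeId` router): hashes the document ID into a hash range owned by a shard; automatic, but scaling out means splitting shards (`SPLITSHARD`)
Custom routing: use the `implicit` router and name the target shard yourself via the `_route_` parameter or a `router.field`
Composite ID routing: an ID prefix such as `tenant1!doc42` co-locates all documents that share the prefix on the same shard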
Trade-off: doc-based is simpler but inflexible; custom routing adds complexity but enables tenant isolation or geo-partitioning
Shard moves trigger re-routing overhead — avoid frequent rebalancing
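The idea behind hash-range routing with composite IDs can be sketched in a few lines. This is a simplified model, not Solr's implementation: it uses `zlib.crc32` as a stand-in for the MurmurHash3 function Solr uses, treats the hash space as unsigned, and uses a power-of-two shard count so that a prefix never straddles a shard boundary.

```python
import zlib

def _h32(text):
    # Stand-in 32-bit hash; Solr itself uses MurmurHash3
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

def composite_hash(doc_id):
    """Upper 16 bits come from the prefix, lower 16 bits from the rest of the id."""
    if "!" in doc_id:
        prefix, rest = doc_id.split("!", 1)
        return (_h32(prefix) & 0xFFFF0000) | (_h32(rest) & 0x0000FFFF)
    return _h32(doc_id)

def shard_for(doc_id, num_shards):
    """Each shard owns one contiguous slice of the 32-bit hash space."""
    slice_size = 2**32 // num_shards
    return min(composite_hash(doc_id) // slice_size, num_shards - 1)

# Documents sharing the 'acme!' prefix land on the same shard
acme_shards = {shard_for(f"acme!{n}", 4) for n in range(100)}
print(acme_shards)
```

Because the prefix fixes the upper bits of the hash, every document of one tenant falls into the same narrow range, which is what makes tenant-scoped queries cheap to route.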

12. Explain the difference between `softCommit` and `hardCommit` in Solr indexing.

Expected answer points:

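`hardCommit`: flushes segments and fsyncs them to disk, making the data durable; slower, and it only makes changes visible when `openSearcher=true`
`softCommit`: opens a new searcher without fsync, giving fast NRT (near-real-time) visibility; durability comes from the transaction log until the next hard commit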
Use `softCommit` for NRT search when durability isn't critical (e.g., user-generated content)
Use `hardCommit` for truly persistent data that must survive crashes
Production often uses both: frequent `softCommit` for NRT, periodic `hardCommit` for durability

13. What is the purpose of the Overseer in SolrCloud, and how can its queue backlog affect cluster performance?

Expected answer points:

The Overseer handles cluster state management: shard creation, replica assignment, leader election
Every state change (shard move, node join/leave) goes through the Overseer queue
`overseer_queue_size` backlog signals cluster instability or too many concurrent operations
Large backlog causes delayed state propagation — nodes may serve stale state
Fix: avoid unnecessary shard moves; split collections instead of frequent rebalancing

14. How do you configure and optimize Solr's filterCache for high-throughput filtering?

Expected answer points:

Filter cache stores document IDs matching filter queries — cached independently of scoring
Size: `numDocs * avgFiltersPerQuery` — monitor hit ratio, target > 70%
Autowarm: copy top N entries from old cache on reload to maintain hit rate
`filterCache` with `initialSize` pre-allocates to avoid lazy initialization stalls
For high-cardinality facets, consider `facet.enum.cache.minDf` to limit enum iterations

15. What are configsets in SolrCloud, and why are they important for preventing configuration drift?

Expected answer points:

Configsets store `solrconfig.xml` and `schema.xml` in ZooKeeper — shared across cluster nodes
Uploaded with `bin/solr zk upconfig` or the Configsets API (`UPLOAD` action, or the `/api/cluster/configs` v2 endpoint)
When a collection is created, it references a configset — all nodes use the same configuration
Prevents drift: changes pushed atomically via ZooKeeper instead of manual file sync
Isolate configs per environment (dev, staging, prod) using separate configset names

16. How does Solr's term frequency normalization work, and when would you customize the similarity implementation?

Expected answer points:

Classic Lucene similarity scores with TF (term frequency) and IDF (inverse document frequency); current Solr releases default to BM25
Long documents: naturally high TF, which can drown out relevance signals
BM25SimilarityFactory: saturates TF and normalizes it by document length, tunable through `k1` and `b`
Custom similarity useful for: legal/document search (length normalization), e-commerce (boost by sales velocity), or custom scoring based on business signals
Configure in `schema.xml` via the `<similarity>` element, globally or per field type

17. What monitoring and alerting should be configured for a production Solr deployment?

Expected answer points:

Cluster health: `zk_connections`, `overseer_queue_size` (critical alerts)
Leader status: alert if any shard has no leader for > 1 minute
Disk usage: alert at > 80%, critical at > 90%
Query latency: QTime_avg < 100ms, QTime_p99 < 500ms for interactive queries
GC pauses: alert if > 1 second (Solr is Java-heavy)
Replication lag: `replication_failed_count > 0` indicates problems
Index metrics: deletion_ratio < 10%, numSegments < 50 per shard

18. How does SolrJ connection pooling work, and what are the recommended settings for a production Solr cluster?

Expected answer points:

Use a single shared SolrJ client (`CloudSolrClient` for SolrCloud, `HttpSolrClient` for a single node) backed by `HttpClient` connection pooling
`maxConnPerRoute = numShards * replicationFactor * 2`
`maxConnTotal = maxConnPerRoute * numNodes`
Set `connectionTimeout` to 5000ms and `socketTimeout` to 30000ms
Close client in `@PreDestroy` or try-with-resources to prevent leaks
Monitor connection pool exhaustion via metrics — signs of overload
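The two sizing rules of thumb above are plain arithmetic. The sketch below evaluates them for a hypothetical cluster of 3 shards, replication factor 2, and 4 nodes. The formulas are this article's heuristics, not SolrJ defaults.

```python
def pool_sizes(num_shards, replication_factor, num_nodes):
    """Apply the sizing heuristics: per-route first, then the total across nodes."""
    per_route = num_shards * replication_factor * 2
    return {"maxConnPerRoute": per_route, "maxConnTotal": per_route * num_nodes}

sizes = pool_sizes(num_shards=3, replication_factor=2, num_nodes=4)
print(sizes)  # {'maxConnPerRoute': 12, 'maxConnTotal': 48}
```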

19. Describe how to implement document-level security in Solr for multi-tenant deployments.

Expected answer points:

Use `BasicAuthPlugin` for authentication with salted, hashed passwords
Configure `RuleBasedAuthorizationPlugin`: define roles and map them to permissions
Document-level security is not built in: have a trusted application layer or a custom search component add a filter query based on user identity
Example: append `fq=tenant_id:<tenant of the authenticated user>` to every query
Use separate configsets per tenant to isolate schemas and request handlers
Enable ZooKeeper ACLs to protect configuration from cross-tenant access
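A tenant filter of this kind is typically added by the application that fronts Solr, never taken from the end user. The sketch below builds such a request and uses the `{!term}` query parser so the tenant value is matched literally without escaping. The field name `tenant_id` follows the example above, and the helper name `tenant_scoped_params` is an assumption.

```python
from urllib.parse import parse_qs, urlencode

def tenant_scoped_params(user_query, tenant_id):
    """Encode a user query plus a server-side filter restricting results to one tenant."""
    return urlencode([
        ("q", user_query),
        ("defType", "edismax"),
        ("qf", "title content"),
        # fq is set by the trusted application layer, based on the authenticated user
        ("fq", "{!term f=tenant_id}" + tenant_id),
    ])

encoded = tenant_scoped_params("quarterly report", "acme")
decoded = parse_qs(encoded)
print(encoded)
print(decoded["fq"])
```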

20. What are the key differences between Solr's standard query parser, dismax, and edismax?

Expected answer points:

Standard parser: strict Lucene syntax, throws exceptions on parse errors, supports all Lucene features
`dismax`: simpler, handles user input more gracefully, no advanced Lucene features
`edismax` (extended dismax): combines dismax simplicity with advanced features
Key `edismax` parameters: `qf` (query fields), `pf` (phrase boost), `mm` (minimum match), `bf` (boost functions)
Recommendation: use `edismax` for all user-facing search interfaces
Use standard parser for programmatic, developer-controlled queries

Conclusion

Solr has been around since 2007, and it still holds up. The explicit schema appeals to teams that want declarative control. Faceted search is mature and flexible in ways that Elasticsearch’s aggregations are still catching up to.

The tradeoff is operational complexity. SolrCloud with ZooKeeper adds overhead that Elasticsearch sidesteps entirely. If you already have Solr infrastructure or you need deep faceting for e-commerce or analytics, Solr is worth keeping. For new projects without existing Solr expertise, Elasticsearch’s easier operations and broader ecosystem usually win out.
