Apache Solr: Enterprise Search Platform
Explore Apache Solr's powerful search capabilities including faceted search, relevance tuning, indexing strategies, and how it compares to Elasticsearch.
Apache Solr is an open-source search platform built on Apache Lucene, first released in 2007. Elasticsearch came along a few years later and took most of the new-project market, but Solr still runs plenty of production systems — particularly in enterprises that already have Solr infrastructure or need specific security and operational features that Solr handles well.
Indexing in Solr
Solr indexes data as documents, which are similar to Elasticsearch documents. Each document contains fields, and each field has a type that determines how it is analyzed and stored.
Document Structure
<add>
<doc>
<field name="id">1</field>
<field name="title">Getting Started with Solr</field>
<field name="content">Solr is a search platform built on Lucene</field>
<field name="category">tutorials</field>
<field name="publish_date">2024-01-15T00:00:00Z</field>
</doc>
</add>
Solr accepts data in multiple formats: XML, JSON, CSV, and binary. The ExtractingRequestHandler (SolrCell) can also extract content from PDFs, Word documents, and other binary formats using Apache Tika.
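As a rough sketch, the XML update message shown above could be generated with Python's standard library. The helper name `build_add_xml` is hypothetical, and sending the result to a collection's update handler is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc: dict) -> str:
    """Serialize one document as a Solr XML <add> update message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # Multi-valued fields are sent as repeated <field> elements.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml({"id": 1, "title": "Getting Started with Solr"}))
```

Building the message with an XML library rather than string concatenation means field values containing `<` or `&` are escaped correctly.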
Schema Configuration
Solr defines field types and fields in a schema file: schema.xml when it is edited by hand, or managed-schema when changes go through the Schema API. Unlike Elasticsearch’s automatic mapping, Solr’s schema is typically managed explicitly, which gives you precise control over how data is indexed.
<schema name="example" version="1.6">
<!-- "string" must be declared before fields can reference it -->
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text_en" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- PorterStemFilterFactory replaces the removed EnglishPorterFilterFactory -->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="title" type="text_en" indexed="true" stored="true"/>
<field name="content" type="text_en" indexed="true" stored="false"/>
</schema>
The explicit schema means you get predictable, reproducible index behavior across deployments.
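When the schema is managed through Solr's Schema API instead of a hand-edited file, the same field definitions can be expressed as JSON commands. A minimal sketch, assuming a managed schema and a POST to the collection's `/schema` endpoint; the helper name `add_field_command` is hypothetical:

```python
import json


def add_field_command(name: str, field_type: str,
                      indexed: bool = True, stored: bool = True) -> str:
    """Build the JSON body for a Schema API add-field command."""
    command = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
        }
    }
    return json.dumps(command, indent=2)


if __name__ == "__main__":
    # Mirrors the "content" field from the schema above.
    print(add_field_command("content", "text_en", stored=False))
```

Keeping these commands in version control gives roughly the same reproducibility as a checked-in schema.xml.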
Tika Extraction Optimization
Apache Tika handles binary format extraction — PDFs, DOCX, PPTX, images with OCR — via the ExtractingRequestHandler. Out-of-the-box settings work fine for low-volume workloads, but production pipelines need tuning.
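A starting point for Tika configuration in production. The field-mapping entries (fmap.*, uprefix, lowernames) are standard ExtractingRequestHandler parameters; the timeout and maxExtractedBytes entries are not listed in the Solr Reference Guide, so confirm that your Solr version honors them before relying on them: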
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.content">_text_</str>
<str name="fmap.meta">_meta</str>
<str name="uprefix">ignored_</str>
<str name="lowernames">true</str>
<!-- Timeout in milliseconds -->
<int name="timeout">30000</int>
<!-- Max bytes to extract (50MB) -->
<long name="maxExtractedBytes">52428800</long>
</lst>
</requestHandler>
Tika extraction failure handling:
| Problem | Symptom | Solution |
|---|---|---|
| Password-protected PDFs | No text extracted, no error | Pass the password with the resource.password parameter, or map filenames to passwords with passwordsFile |
| Corrupt files | Tika throws exception | Catch the error in the indexing client (or set ignoreTikaException=true) and fall back to raw text |
| Very large files | OOM or timeout | Run Tika outside Solr for large files and send the extracted text in segments |
| OCR failure | Image searches return empty | Configure TesseractOCRConfig through the parseContext.config parameter |
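The "fall back to raw text" row can be sketched on the client side as follows. This is an illustration under assumptions: `tika_extract` is a hypothetical callable standing in for whatever performs the real extraction (for example a call to `/update/extract` with `extractOnly=true`), and the fallback is a lossy UTF-8 decode.

```python
from typing import Callable


def extract_text(raw: bytes, tika_extract: Callable[[bytes], str]) -> str:
    """Try rich extraction first; fall back to a lossy raw-text decode."""
    try:
        return tika_extract(raw)
    except Exception:
        # Fallback: keep whatever decodes as UTF-8 and drop the rest,
        # so the document is still searchable instead of being skipped.
        return raw.decode("utf-8", errors="ignore")
```

In a real pipeline it is worth logging the failure as well, so that corrupt files can be found and reprocessed later.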
Metadata extraction for search filtering:
{
"extracted_metadata": {
"title": "Document Title",
"author": "John Doe",
"created": "2024-03-15T10:00:00Z",
"keywords": ["search", "solr", "tika"]
}
}
Extract metadata fields and map them to schema fields for faceting:
<field name="author" type="string" stored="true" indexed="true"/>
<field name="doc_date" type="date" stored="true" indexed="true"/>
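One way to wire extracted metadata to those schema fields is through `fmap.*` request parameters on the extract call. The sketch below only builds the query string; the helper name and the example metadata names are assumptions, and the actual Tika metadata keys depend on the file type.

```python
from urllib.parse import urlencode


def extract_params(doc_id: str, field_map: dict, commit: bool = False) -> str:
    """Build a query string for /update/extract with fmap.* field mappings."""
    params = {
        "literal.id": doc_id,      # supply the unique key explicitly
        "lowernames": "true",      # normalize Tika metadata names
        "uprefix": "ignored_",     # prefix for metadata with no schema field
    }
    for tika_name, schema_field in field_map.items():
        params[f"fmap.{tika_name}"] = schema_field
    if commit:
        params["commit"] = "true"
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical mapping: Tika's "created" metadata into the doc_date field.
    print(extract_params("doc1", {"created": "doc_date", "author": "author"}))
```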
Tika Extraction Key Takeaways
- Set `timeout` on ExtractingRequestHandler to prevent pipeline stalls
- Configure `maxExtractedBytes` to protect against memory exhaustion
- Map Tika metadata fields to schema for facetable attributes
- Use password override headers for encrypted corporate documents
- Fallback to raw text extraction when Tika parsing fails
Search Features
Faceted Search
Solr’s faceted search implementation is one of its strongest features. It categorizes search results by field values, counts, and ranges, letting users narrow down results by selecting facets.
{
"query": "search",
"facet": {
"categories": {
"type": "terms",
"field": "category",
"mincount": 1
},
"date_ranges": {
"type": "range",
"field": "publish_date",
"start": "2020-01-01T00:00:00Z",
"end": "2026-12-31T23:59:59Z",
"gap": "+1YEAR"
},
"top_authors": {
"type": "terms",
"field": "author",
"limit": 5
}
}
}
The response includes counts for each facet value, letting you display something like “Category: Tutorials (42), Technical (38), Opinion (15)” alongside search results.
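To make that concrete, here is a small sketch that turns the `buckets` of a JSON facet response into a display string. The response dictionary is made-up sample data in the shape the JSON Facet API returns, and `format_facet` is a hypothetical helper.

```python
def format_facet(label: str, facet: dict) -> str:
    """Render the buckets of one JSON facet as 'Label: value (count), ...'."""
    parts = [f"{b['val']} ({b['count']})" for b in facet.get("buckets", [])]
    return f"{label}: " + ", ".join(parts)


# Sample data only; real counts come from the "facets" key of the response.
response = {
    "facets": {
        "count": 95,
        "categories": {
            "buckets": [
                {"val": "tutorials", "count": 42},
                {"val": "technical", "count": 38},
                {"val": "opinion", "count": 15},
            ]
        },
    }
}

print(format_facet("Category", response["facets"]["categories"]))
# Category: tutorials (42), technical (38), opinion (15)
```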
Pivot Facets
Pivot facets let you nest facets. For example, you can find counts for “category x author” combinations:
{
"query": "*:*",
"params": {
"facet": true,
"facet.pivot": "category,author"
}
}
This is useful for analytics dashboards where you want to see not just counts per category, but counts per author within each category.
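Pivot results come back as a tree under `facet_counts.facet_pivot`, which a dashboard usually needs flattened into rows. A minimal sketch with made-up sample data; `flatten_pivot` is a hypothetical helper:

```python
def flatten_pivot(nodes: list, path: tuple = ()) -> list:
    """Walk a facet_pivot tree and return (path, count) rows for every node."""
    rows = []
    for node in nodes:
        current = path + (node["value"],)
        rows.append((current, node["count"]))
        # Child levels, if any, live under the "pivot" key.
        rows.extend(flatten_pivot(node.get("pivot", []), current))
    return rows


# Sample data only, shaped like a facet.pivot=category,author response.
sample = {
    "facet_counts": {
        "facet_pivot": {
            "category,author": [
                {"field": "category", "value": "technical", "count": 42,
                 "pivot": [
                     {"field": "author", "value": "Author A", "count": 30},
                     {"field": "author", "value": "Author B", "count": 12},
                 ]},
            ]
        }
    }
}

for path, count in flatten_pivot(sample["facet_counts"]["facet_pivot"]["category,author"]):
    print(" > ".join(path), count)
# technical 42
# technical > Author A 30
# technical > Author B 12
```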
Relevance Tuning
Solr offers extensive relevance controls. The edismax query parser is more forgiving than the standard parser and allows fine-tuning through parameters.
Boosting — You can boost fields or documents at query time:
q=search&defType=edismax&qf=title^2 content&pf=title^3
This sets:
- qf: Query fields, with title weighted 2x over content
- pf: Phrase fields, with a 3x boost when the query terms match title as a phrase
Function Queries — Solr’s function queries let you incorporate mathematical expressions into relevance scoring:
{
"query": "search",
"params": {
"defType": "edismax",
"qf": "title content",
"bf": "product(rord(popularity),0.1)",
"boost": "recip(rord(last_modified),1,1000,1000)"
}
}
Functions like rord, recip, linear, and max give you precise control over how documents are ranked.
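To see what a boost such as `recip(rord(last_modified),1,1000,1000)` does numerically, recall that `recip(x,m,a,b)` evaluates to `a / (m*x + b)`. The sketch below just reproduces that arithmetic in Python; the rank values are illustrative, with 1 standing for the most recently modified document.

```python
def recip(x: float, m: float, a: float, b: float) -> float:
    """Mirror of Solr's recip(x, m, a, b) function: a / (m * x + b)."""
    return a / (m * x + b)


# rord() gives the newest document rank 1, so the boost decays with age rank.
for rank in (1, 1000, 9000):
    print(rank, round(recip(rank, 1, 1000, 1000), 3))
# 1 0.999
# 1000 0.5
# 9000 0.1
```

Larger values of `a` and `b` flatten the curve, so the boost fades more slowly across the ranking.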
Term Frequency Normalization: If a field has very long documents, raw term frequency becomes less meaningful. Solr’s BM25 similarity (the default since Solr 6) normalizes term frequency by document length, and custom similarity implementations can normalize it differently:
<similarity class="solr.BM25SimilarityFactory"/>
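The length normalization that BM25 applies can be illustrated with the textbook form of its term-frequency component. This is a sketch of the classic formula, not Lucene's exact implementation (which differs in constant factors); `k1 = 1.2` and `b = 0.75` are the commonly cited defaults.

```python
def bm25_tf(tf: float, doc_len: float, avg_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Classic BM25 term-frequency component with length normalization."""
    # Documents longer than average get a larger norm, which lowers the score.
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)


if __name__ == "__main__":
    # Same term frequency, average-length document vs one ten times longer.
    print(round(bm25_tf(5, doc_len=100, avg_len=100), 3))
    print(round(bm25_tf(5, doc_len=1000, avg_len=100), 3))
```

The same five occurrences count for less in the longer document, and no matter how large `tf` grows the component stays below `k1 + 1`.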
Solr vs Elasticsearch
Both Solr and Elasticsearch sit on Apache Lucene, but they have different philosophies.
| Aspect | Solr | Elasticsearch |
|---|---|---|
| Schema | Explicit schema.xml | Dynamic mapping with explicit override |
| Configuration | XML files, more verbose | JSON-based REST API |
| Distributed search | SolrCloud (ZooKeeper required) | Built-in, no ZooKeeper |
| Faceted search | More mature faceting | Good faceting, less flexible |
| Operational complexity | Higher | Lower |
| Enterprise adoption | Legacy systems | Modern cloud-native |
Solr’s explicit schema and XML configuration appeal to teams that want declarative control and reproducibility. Elasticsearch’s REST-first approach and dynamic mapping make it more developer-friendly.
If you need deep faceting for e-commerce or analytics, Solr is the better choice. For simple full-text search with easier operations, Elasticsearch is usually the default.
When to Use / When Not to Use
When to Use Apache Solr
- E-commerce faceted search with complex hierarchies (category, brand, price range, attributes)
- Enterprise search with strict security requirements and legacy system integration
- Data preprocessing pipelines needing ETL-style document enrichment via Update Request Processors
- Deep analytics with pivot facets, function queries, and complex aggregations
- Document-centric search where schema control and reproducibility matter
- Strict consistency requirements where SolrCloud’s ZooKeeper-based leader election fits the use case
When Not to Use Apache Solr
- Cloud-native microservices where the ZooKeeper dependency adds operational overhead
- Real-time log analysis (use Elasticsearch with Beats/Logstash instead)
- Simple CRUD with search where a database full-text search feature suffices
- Teams without Solr expertise — the learning curve is steeper than Elasticsearch
- Rapid prototyping — Solr’s XML configuration slows down initial development
- Kubernetes-first deployments where stateless pods and auto-scaling are priorities
Trade-off Analysis
When choosing Solr over alternatives, understanding the key tradeoffs helps make informed architectural decisions.
Schema Control vs Development Speed
| Factor | Solr (Explicit Schema) | Elasticsearch (Dynamic) |
|---|---|---|
| Initial setup time | Longer: requires field definitions | Faster: auto-creates mappings |
| Type safety | Strong: mismatches rejected | Weak: silent conversions |
| Operational predictability | High: reproducible behavior | Medium: varies by data shape |
| Refactoring effort | Higher: must update schema | Lower: mappings adapt automatically |
| Performance tuning | Precise: known field types | Requires profiling per field type |
Recommendation: Use Solr when schema stability and type safety are priorities. Use Elasticsearch when rapid prototyping and schema flexibility are needed.
ZooKeeper Dependency vs Built-in Coordination: SolrCloud requires a ZooKeeper ensemble for cluster state management (typically 3 or 5 nodes in production), whereas Elasticsearch ships its own cluster coordination layer and needs no extra infrastructure. ZooKeeper adds operational overhead but delivers strong consistency guarantees.
XML Configuration vs REST API — Solr uses verbose but explicit XML configs ideal for auditability and reproducibility; Elasticsearch’s JSON REST API enables faster iteration and programmatic management.
Faceted Search Depth vs Operational Complexity — Solr’s faceted search is superior for complex e-commerce and analytics use cases; Elasticsearch handles standard faceting adequately.
Indexing Throughput vs Resource Consumption — Solr has a higher memory footprint (JVM + file handles) but excels at bulk indexing and includes native Tika integration for binary formats. Elasticsearch offers lower per-shard resource consumption.
Backup and Recovery Trade-offs
| Aspect | Solr | Elasticsearch |
|---|---|---|
| Snapshot method | Collections API BACKUP / RESTORE actions | snapshot / restore API |
| Storage | Shared filesystem (NFS) or HDFS | Shared bucket (S3, GCS, Azure) |
| Incremental backup | Manual via tiered snapshots | Native with repository plugins |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| ZooKeeper connection loss | Cluster goes read-only, leader election stalls | Deploy ZooKeeper ensemble (3+ nodes), monitor zk_connections, tune zkClientTimeout |
| Leader shard replica failure | Write operations halt until failover completes | Configure replicationFactor >= 2, use SolrCloud auto-recovery |
| Index corruption | Queries return empty results or throw exceptions | Run Lucene’s CheckIndex tool against the affected core, then recover the replica from its leader or restore from backup |
| Tika extraction failure | Binary attachments (PDF, DOCX) not indexed | Set timeout on ExtractingRequestHandler, fallback to raw text extraction |
| Facet overflow (too many facet buckets) | OOM or slow queries | Set facet.enum.cache.minDf to reduce enum iterations, paginate facets |
| Config drift across nodes | Inconsistent analyzers, different scoring | Store config in ZooKeeper (configsets), push updates atomically |
Common Pitfalls / Anti-Patterns
Over-Configuring solrconfig.xml
Solr’s XML configuration is powerful but verbose. Teams often copy-paste configurations with unused components, causing memory bloat and slower query parsing.
Fix: Audit your solrconfig.xml quarterly. Remove unused request handlers, caches, and query parsers.
Ignoring the Overseer Bottleneck
The Overseer handles cluster state management. In large clusters with frequent shard moves, the Overseer queue backs up, causing delayed state propagation.
Fix: Avoid unnecessary shard moves. If you need frequent rebalancing, consider splitting collections instead.
Not Warming Caches Properly
Solr’s caches (filterCache, queryResultCache, documentCache) start empty. A cold cache after restart causes slow initial queries.
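Fix: Register warming queries with firstSearcher and newSearcher listeners (solr.QuerySenderListener), and set autowarmCount on the caches so entries carry over when a new searcher opens.
<!-- Configure autowarming for filter cache -->
<!-- solr.CaffeineCache replaces solr.LRUCache, which was removed in Solr 9 -->
<filterCache class="solr.CaffeineCache"
size="10000"
initialSize="5000"
autowarmCount="2000"/>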
Using the Wrong Parser for User Input
The standard query parser (defType=lucene) is strict and throws exceptions on syntax errors. User-facing queries should use edismax, which is more forgiving.
Fix: Always use edismax (or dismax) for user-generated query strings:
q=user_input&defType=edismax&qf=title content
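A sketch of how a client might assemble that request safely. `build_search_params` URL-encodes raw user input for edismax; `escape_query_chars` is modeled on the character set SolrJ's `ClientUtils.escapeQueryChars` escapes and is only needed when user text is embedded inside a query for the strict standard parser. Both helper names are illustrative.

```python
from urllib.parse import urlencode

# Characters with special meaning in Lucene query syntax.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text: str) -> str:
    """Backslash-escape query syntax characters and whitespace."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARS or ch.isspace() else ch
        for ch in text
    )


def build_search_params(user_input: str, fields=("title", "content")) -> str:
    """URL-encode an edismax request for raw user input."""
    return urlencode({
        "q": user_input,
        "defType": "edismax",
        "qf": " ".join(fields),
    })


if __name__ == "__main__":
    print(build_search_params("c++ & solr"))
    print(escape_query_chars("title:(draft)"))
```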
Skipping Schema Validation
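Solr rejects values that cannot be parsed for a field’s declared type (a non-numeric string sent to an int field fails with an error), but that protection only applies to fields you have declared. In schemaless mode, field guessing picks a type from the first value it sees, which can lock a field into the wrong type and cause later documents to fail or range queries to behave unexpectedly.
Fix: Define fields explicitly in schema.xml, turn off field guessing in production, and set required="true" and multiValued deliberately.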
Quick Recap
Key bullets:
- Solr’s explicit `schema.xml` gives precise control over field types and analyzers
- Faceted search is one of Solr’s strongest features; pivot facets are useful for analytics dashboards
- The `edismax` query parser is more forgiving than the standard `lucene` parser for user input
- SolrCloud requires ZooKeeper for distributed coordination; Elasticsearch does not
- Function queries enable mathematical expressions in relevance scoring
- Access control via `BasicAuthPlugin` authentication and rule-based authorization; document-level filtering has to be added with filter queries
- Tika integration handles PDF, DOCX, and other binary formats out of the box
Copy/Paste Checklist:
# Check cluster health
GET /admin/collections?action=CLUSTERSTATUS
# Create a collection with replication
GET /admin/collections?action=CREATE&name=my-collection&numShards=3&replicationFactor=2
# Reload collection after config change
GET /admin/collections?action=RELOAD&name=my-collection
# Force commit after bulk indexing
GET /my-collection/update?wt=json&commit=true
# Check facet counts
GET /my-collection/select?q=*:*&facet=true&facet.field=category&facet.limit=10
# Export schema field names
GET /my-collection/schema/fields
# Reload a single core (opening the new searcher runs any configured warming listeners)
GET /admin/cores?action=RELOAD&core=my-collection
Observability
Solr metrics to track:
{
"cluster_metrics": {
"cluster_status": "active/standby/degraded",
"num_shards": "total across collection",
"num_nodes": "live nodes count",
"overseer_queue_size": "< 100 typically"
},
"collection_metrics": {
"docs_count": "total indexed documents",
"size_in_bytes": "index size on disk",
"num_segments": "< 50 per shard",
"deletion_ratio": "< 10% ideal"
},
"query_metrics": {
"QTime_avg": "< 100ms for interactive",
"QTime_p99": "< 500ms for complex faceted queries",
"request_times": "track by handler endpoint"
}
}
- Solr logs: `logs/solr.log` for shard leader elections and distributed search failures
- GC logs: Solr is Java-heavy; long GC pauses cause query timeouts
- Audit logs: security events, authentication failures, authorization denials
- Slow query logs: set `slowQueryThresholdMillis` and log queries exceeding the threshold
Cache Warming
Solr uses multiple caches that are cold on startup or after eviction. Proper warming prevents slow queries on cache misses.
Cache Types:
| Cache | Purpose | Size Guidance |
|---|---|---|
| filterCache | Stores document IDs matching filters | numDocs * avgFiltersPerQuery |
| queryResultCache | Caches document ID sets per query | maxResults * numQueries |
| documentCache | Stores stored fields per document | maxDoc |
| fieldValueCache | Caches terms for faceting/sorting | numUniqueFields * avgValuesPerField |
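When sizing the filterCache it helps to estimate memory as well as entry counts. As a worst-case sketch, assume every cached filter is held as a bitset with one bit per document in the index; sparse filters are stored more compactly, so real usage is normally lower. The function name and the example numbers are illustrative.

```python
def filter_cache_worst_case_bytes(max_doc: int, entries: int) -> int:
    """Upper bound if every cached filter is a bitset of max_doc bits."""
    bytes_per_entry = (max_doc + 7) // 8  # one bit per document, rounded up
    return bytes_per_entry * entries


# 10 million documents and a filterCache of 10,000 entries, in gigabytes.
print(filter_cache_worst_case_bytes(10_000_000, 10_000) / 1e9)
# 12.5
```

A bound like this is a useful sanity check before raising `size` on a large index, because the cache lives on the JVM heap.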
Autowarming Configuration: When a new searcher opens, autowarming repopulates its caches by re-running the most recently used keys from the old caches against the new index view:
<!-- filterCache with autowarming -->
<!-- solr.CaffeineCache replaces solr.LRUCache, which was removed in Solr 9 -->
<filterCache class="solr.CaffeineCache"
size="10000"
initialSize="5000"
autowarmCount="2000"/>
<!-- queryResultCache with autowarming -->
<queryResultCache class="solr.CaffeineCache"
size="1000"
initialSize="500"
autowarmCount="500"/>
Warm Queries Strategy — Define a set of representative queries to run on startup:
<!-- Runs on startup; add a second listener with event="newSearcher" to warm after commits -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">featured:true</str>
<str name="rows">10</str>
</lst>
<lst>
<str name="q">category:technical</str>
<str name="facet">true</str>
<str name="facet.field">category</str>
<str name="rows">50</str>
</lst>
</arr>
</listener>
Continuous Warming via Scheduled Searches — For always-warm caches, schedule common queries via an external job:
# Warm queries executed after a reload or restart
# timeAllowed caps each query at 5 seconds
curl -G "http://solr-node:8983/solr/my-collection/select" \
--data-urlencode "q=category:tutorials" -d "rows=10" -d "timeAllowed=5000"
curl -G "http://solr-node:8983/solr/my-collection/select" \
--data-urlencode "q=*:*" -d "rows=0" -d "facet=true" -d "facet.field=category" -d "timeAllowed=5000"
Cache Warming Benchmarks
| Scenario | Cold Cache QTime | Warm Cache QTime | Improvement |
|---|---|---|---|
| Simple term query | 450ms | 12ms | ~38x |
| Faceted query (5 fields) | 1200ms | 45ms | ~27x |
| Join query | 800ms | 80ms | ~10x |
Cache Warming Key Takeaways
- Configure `autowarmCount` to at least 50% of cache size for faster recovery
- Define warming queries (firstSearcher and newSearcher listeners) using your top 10 most frequent query patterns
- Monitor cache warmup time in metrics, targeting under 30 seconds
- Use `queryResultCache` autowarming to pre-populate paginated result sets
- Schedule background warming jobs during low-traffic windows
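A background warming job of the kind mentioned in the last takeaway could be as small as the sketch below. It assumes the `solr-node:8983` host and `my-collection` name used in this article, uses Solr's `timeAllowed` parameter to cap each query, and is meant to be triggered by cron or a similar scheduler. The query list and function names are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder queries; in practice take these from your most frequent traffic.
WARM_QUERIES = [
    {"q": "category:tutorials", "rows": 10},
    {"q": "*:*", "rows": 0, "facet": "true", "facet.field": "category"},
]


def warm_urls(base_url: str, queries: list, time_allowed_ms: int = 5000) -> list:
    """Build one /select URL per warming query."""
    urls = []
    for query in queries:
        params = dict(query, timeAllowed=time_allowed_ms)
        urls.append(f"{base_url}/select?{urlencode(params)}")
    return urls


def run_warming(base_url: str, queries: list) -> None:
    """Issue each warming query and discard the response body."""
    for url in warm_urls(base_url, queries):
        with urlopen(url, timeout=10) as resp:
            resp.read()


if __name__ == "__main__":
    run_warming("http://solr-node:8983/solr/my-collection", WARM_QUERIES)
```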
<!-- Enable slow query logging in solrconfig.xml (inside the <query> section) -->
<query>
<slowQueryThresholdMillis>5000</slowQueryThresholdMillis>
</query>
Alerts to Configure
| Alert | Condition | Severity |
|---|---|---|
| ZooKeeper down | zk_connections == 0 | Critical |
| No leader for shard | shard leader == null for > 1 min | Critical |
| Disk usage high | disk usage > 80% | Warning |
| GC pause | GC pause > 1s | Warning |
| Query timeout rate | timeout_rate > 5% over 5 min | Warning |
| Index replication lag | replication_failed_count > 0 | Warning |
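The alert table translates naturally into a small rule list. This is a sketch only: the metric keys (`disk_usage_pct`, `gc_pause_s` and so on) are hypothetical names for values your monitoring system would supply, and a real deployment would express these rules in its alerting tool instead.

```python
# (alert name, severity, condition over a metrics dict)
ALERT_RULES = [
    ("ZooKeeper down", "Critical", lambda m: m["zk_connections"] == 0),
    ("No leader for shard", "Critical", lambda m: m["leaderless_seconds"] > 60),
    ("Disk usage high", "Warning", lambda m: m["disk_usage_pct"] > 80),
    ("GC pause", "Warning", lambda m: m["gc_pause_s"] > 1),
    ("Query timeout rate", "Warning", lambda m: m["timeout_rate_pct"] > 5),
    ("Index replication lag", "Warning", lambda m: m["replication_failed_count"] > 0),
]


def evaluate(metrics: dict) -> list:
    """Return (name, severity) for every rule whose condition holds."""
    return [(name, sev) for name, sev, cond in ALERT_RULES if cond(metrics)]


if __name__ == "__main__":
    snapshot = {
        "zk_connections": 3, "leaderless_seconds": 0, "disk_usage_pct": 85,
        "gc_pause_s": 0.2, "timeout_rate_pct": 1, "replication_failed_count": 0,
    }
    print(evaluate(snapshot))
```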
Security Checklist
- Enable Solr’s authentication via the `BasicAuth` or `Kerberos` plugin
- Configure authorization using `rule-based` or `permission-based` security
- Use ZooKeeper ACLs to protect cluster configuration from unauthorized access
- Enable TLS for all node-to-node and client-to-node communication
- Restrict JMX/RMI endpoints; expose only on internal networks
- Validate input in UpdateRequestProcessor chains to prevent injection
- Restrict the Admin UI in production: require authentication and expose it only on internal networks
- Use separate configsets for multi-tenant deployments to isolate schemas
- Audit security events — log authentication failures and permission denials
Example: enable BasicAuth in security.json (blockUnknown rejects unauthenticated requests):
{
"authentication": {
"blockUnknown": true,
"class": "solr.BasicAuthPlugin",
"credentials": {
"admin": "<base64 password hash> <base64 salt>"
}
}
}
Interview Questions
Expected answer points:
- Solr requires explicit field definitions in schema.xml before indexing; Elasticsearch creates mappings automatically from documents
- Solr's advantage: precise control over field types, analyzers, and indexing behavior; predictable, reproducible across deployments
- Elasticsearch's advantage: faster initial setup, schema flexibility for varying document structures
- Trade-off: Solr requires upfront schema design but prevents unexpected type conversions; Elasticsearch is more agile but silent type mismatches can cause issues
Expected answer points:
- ZooKeeper maintains cluster state, tracks shard-to-node mapping, and manages leader election
- Each shard has one leader and N replica nodes
- When a leader fails, ZooKeeper detects node loss and triggers election among replicas
- Replicas compete based on sync policy (minority vs majority) and ZooKeeper assigns leader
- Cluster state updates are atomic in ZooKeeper, ensuring consistency
Expected answer points:
- Faceted search categorizes results by field values, counts, or ranges — users filter by facets
- Regular facets: single field categorization (e.g., count by category)
- Pivot facets: nested categorization (e.g., category x author combinations)
- Example: "Show me 42 technical posts by author A, 15 opinion posts by author A"
- Useful for analytics dashboards requiring multi-dimensional analysis
Expected answer points:
- Configure cache autowarming in solrconfig.xml to pre-populate caches on reload
- Set `autowarmCount` to at least 50% of cache size for faster recovery
- Define `warmQueries` using representative query patterns from production traffic
- Schedule background warming jobs to continuously refresh warm caches
- Monitor `cache warmup time` metric — target under 30 seconds
Expected answer points:
- Standard parser (`lucene`): strict syntax, throws exceptions on parse errors
- `edismax` (extended dismax): more forgiving, handles malformed queries gracefully
- `edismax` supports field boosting (`qf`), phrase boosting (`pf`), and minimum match (`mm`)
- Recommendation: always use `edismax` for user-facing search interfaces
- Standard parser is appropriate for controlled query construction by developers
Expected answer points:
- Solr uses Apache Tika via the ExtractingRequestHandler (SolrCell)
- Tika extracts text and metadata from PDF, DOCX, PPTX, images (OCR), and other binary formats
- Configuration includes timeout (default 30s), max bytes extracted (50MB)
- Metadata fields can be mapped to schema fields for faceting
- Failure handling: set timeouts, fallback to raw text, skip password-protected files
Expected answer points:
- SolrCloud requires ZooKeeper for cluster coordination; Elasticsearch has built-in coordination with no ZooKeeper dependency
- Solr uses explicit XML configuration; Elasticsearch uses JSON REST API
- Solr has more mature faceting; Elasticsearch has more flexible aggregations
- Solr's operational complexity is higher; Elasticsearch is easier to operate
- Both sit on Apache Lucene — core search functionality is similar
Expected answer points:
- Function queries incorporate mathematical expressions into relevance scoring
- Functions include: `rord()`, `recip()`, `linear()`, `max()`, `product()`, `div()`
- Example: `product(rord(popularity),0.1)` scales the reverse ordinal of the popularity field so it can feed into ranking
- Use cases: social signals (popularity decay), geographic distance scoring, custom business rules
- Can be used in `bf` (boost functions) parameter with edismax
Expected answer points:
- QTime metrics: avg < 100ms for interactive, p99 < 500ms for complex faceted
- Cluster health: numNodes, numShards, overseer_queue_size (target < 100)
- Index metrics: docs_count, size_in_bytes, deletion_ratio (target < 10%)
- GC pauses: monitor for pauses > 1 second
- Disk usage: alert at > 80%
- Replication lag: replication_failed_count > 0 indicates problems
Expected answer points:
- Enable authentication: BasicAuth plugin or Kerberos
- Configure authorization: rule-based or permission-based security
- Enable TLS for all node-to-node and client-to-node communication
- Protect ZooKeeper with ACLs to prevent cluster configuration access
- Restrict JMX/RMI endpoints to internal networks only
- Disable Admin UI in production or restrict access
- Validate all input in UpdateRequestProcessor chains to prevent injection
- Audit security events (authentication failures, permission denials)
Expected answer points:
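- Doc-based routing (the default `compositeId` router): a hash of the document ID is mapped onto per-shard hash ranges; automatic, but growing the cluster means splitting shards
- Custom routing: create the collection with the `implicit` router and choose the target shard per document with the `_route_` parameter or a router field
- Composite ID routing: a prefix such as `tenantA!doc42` co-locates all documents that share the prefix on the same shard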
- Trade-off: doc-based is simpler but inflexible; custom routing adds complexity but enables tenant isolation or geo-partitioning
- Shard moves trigger re-routing overhead — avoid frequent rebalancing
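The idea behind hash-range and composite ID routing can be sketched in a few lines. This is a simplified illustration, not Solr's implementation: `zlib.crc32` stands in for the MurmurHash3 function Solr uses, and the ranges are plain equal slices of an unsigned 32-bit space.

```python
import zlib


def hash32(text: str) -> int:
    """Stand-in 32-bit hash (Solr itself uses MurmurHash3)."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def route_hash(doc_id: str) -> int:
    """Composite IDs ('prefix!id') take their upper 16 bits from the prefix."""
    if "!" in doc_id:
        prefix, rest = doc_id.split("!", 1)
        return (hash32(prefix) & 0xFFFF0000) | (hash32(rest) & 0x0000FFFF)
    return hash32(doc_id)


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map the hash onto num_shards equal, contiguous hash ranges."""
    range_size = 2**32 // num_shards
    return min(route_hash(doc_id) // range_size, num_shards - 1)


if __name__ == "__main__":
    for doc_id in ("acme!1", "acme!999", "doc-42"):
        print(doc_id, "-> shard", shard_for(doc_id, 4))
```

Because the upper bits come from the prefix, every `acme!...` document lands in the same range, which is what makes tenant-level co-location possible.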
Expected answer points:
- `hardCommit`: fsyncs segment files to disk — durable, visible to all replicas, slower
- `softCommit`: opens new searchers without fsync — faster NRT (near-real-time), not durable
- Use `softCommit` for NRT search when durability isn't critical (e.g., user-generated content)
- Use `hardCommit` for truly persistent data that must survive crashes
- Production often uses both: frequent `softCommit` for NRT, periodic `hardCommit` for durability
Expected answer points:
- The Overseer handles cluster state management: shard creation, replica assignment, leader election
- Every state change (shard move, node join/leave) goes through the Overseer queue
- `overseer_queue_size` backlog signals cluster instability or too many concurrent operations
- Large backlog causes delayed state propagation — nodes may serve stale state
- Fix: avoid unnecessary shard moves; split collections instead of frequent rebalancing
Expected answer points:
- Filter cache stores document IDs matching filter queries — cached independently of scoring
- Size: `numDocs * avgFiltersPerQuery` — monitor hit ratio, target > 70%
- Autowarm: copy top N entries from old cache on reload to maintain hit rate
- `filterCache` with `initialSize` pre-allocates to avoid lazy initialization stalls
- For high-cardinality facets, consider `facet.enum.cache.minDf` to limit enum iterations
Expected answer points:
- Configsets store `solrconfig.xml` and `schema.xml` in ZooKeeper — shared across cluster nodes
- Uploaded via the Configsets API `UPLOAD` action, `bin/solr zk upconfig`, or the `api/cluster/configs` endpoint
- When a collection is created, it references a configset — all nodes use the same configuration
- Prevents drift: changes pushed atomically via ZooKeeper instead of manual file sync
- Isolate configs per environment (dev, staging, prod) using separate configset names
Expected answer points:
- Classic TF-IDF similarity (the default before Solr 6) scores by TF (term frequency) and IDF (inverse document frequency)
- Long documents: high TF naturally, which can drown out relevance signals
- BM25SimilarityFactory (the default in current versions): saturates TF and normalizes it by document length through the `k1` and `b` parameters
- Custom similarity useful for: legal/document search (length normalization), e-commerce (boost by sales velocity), or custom scoring based on business signals
- Configure in `schema.xml` via the `<similarity>` element, globally or per field type
Expected answer points:
- Cluster health: `zk_connections`, `overseer_queue_size` (critical alerts)
- Leader status: alert if any shard has no leader for > 1 minute
- Disk usage: alert at > 80%, critical at > 90%
- Query latency: QTime_avg < 100ms, QTime_p99 < 500ms for interactive queries
- GC pauses: alert if > 1 second (Solr is Java-heavy)
- Replication lag: `replication_failed_count > 0` indicates problems
- Index metrics: deletion_ratio < 10%, numSegments < 50 per shard
Expected answer points:
- Use a single shared `HttpSolrClient` with `HttpClient` connection pooling
- `maxConnPerRoute = numShards * replicationFactor * 2`
- `maxConnTotal = maxConnPerRoute * numNodes`
- Set `connectionTimeout` to 5000ms and `socketTimeout` to 30000ms
- Close client in `@PreDestroy` or try-with-resources to prevent leaks
- Monitor connection pool exhaustion via metrics — signs of overload
Expected answer points:
- Use `BasicAuthPlugin` for authentication with salted, hashed passwords
- Configure `RuleBasedAuthorizationPlugin`: define roles and map them to permissions
- Document-level security is not built in: append a filter query (`fq`) based on user identity in the application or a custom search component
- Example: field `tenant_id` filtered by `tenant_id:currentUser.tenantId`
- Use separate configsets per tenant to isolate schemas and request handlers
- Enable ZooKeeper ACLs to protect configuration from cross-tenant access
Expected answer points:
- Standard parser: strict Lucene syntax, throws exceptions on parse errors, supports all Lucene features
- `dismax`: simpler, handles user input more gracefully, no advanced Lucene features
- `edismax` (extended dismax): combines dismax simplicity with advanced features
- Key `edismax` parameters: `qf` (query fields), `pf` (phrase boost), `mm` (minimum match), `bf` (boost functions)
- Recommendation: use `edismax` for all user-facing search interfaces
- Use standard parser for programmatic, developer-controlled queries
Conclusion
Solr has been around since 2007, and it still holds up. The explicit schema appeals to teams that want declarative control. Faceted search is mature and flexible in ways Elasticsearch’s aggregations are still catching up to.
The tradeoff is operational complexity. SolrCloud with ZooKeeper adds overhead that Elasticsearch sidesteps entirely. If you already have Solr infrastructure or you need deep faceting for e-commerce or analytics, Solr is worth keeping. For new projects without existing Solr expertise, Elasticsearch’s easier operations and broader ecosystem usually win out.