Apache Solr: Enterprise Search Platform

Explore Apache Solr's powerful search capabilities including faceted search, relevance tuning, indexing strategies, and how it compares to Elasticsearch.

Apache Solr is an open-source search platform built on Apache Lucene. It has been around since 2007, predating Elasticsearch by a few years. While Elasticsearch dominates newer projects, Solr still holds its own, especially in enterprises with existing infrastructure or specific requirements around security, stability, and operational familiarity.

This post covers Solr’s core features: indexing, faceted search, relevance tuning, and how it compares to Elasticsearch in practice.

Indexing in Solr

Solr indexes data as documents, which are similar to Elasticsearch documents. Each document contains fields, and each field has a type that determines how it is analyzed and stored.

Document Structure

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Getting Started with Solr</field>
    <field name="content">Solr is a search platform built on Lucene</field>
    <field name="category">tutorials</field>
    <field name="publish_date">2024-01-15T00:00:00Z</field>
  </doc>
</add>

Solr accepts data in multiple formats: XML, JSON, CSV, and binary. The ExtractingRequestHandler (SolrCell) can also extract content from PDFs, Word documents, and other binary formats using Apache Tika.
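As a rough sketch of the JSON path, the snippet below builds (but does not send) an update request for the document shown above. The base URL and collection name are assumptions, not values from this post.

```python
import json
import urllib.request

# Assumed values: adjust the base URL and collection name for your cluster.
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "my-collection"


def build_update_request(docs, commit=False):
    """Build (but do not send) a JSON update request for a list of documents."""
    url = f"{SOLR_URL}/{COLLECTION}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [{
    "id": "1",
    "title": "Getting Started with Solr",
    "content": "Solr is a search platform built on Lucene",
    "category": "tutorials",
    "publish_date": "2024-01-15T00:00:00Z",
}]
request = build_update_request(docs, commit=True)
print(request.full_url)
# To actually index the documents: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the payload easy to inspect and test without a running Solr node.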

Schema Configuration

Solr defines field types and fields in a schema file: schema.xml when you edit it by hand, or managed-schema when you change it through the Schema API. Unlike Elasticsearch’s automatic mapping, Solr’s schema is typically managed explicitly, which gives you precise control over how data is indexed.

<schema name="example" version="1.6">
  <!-- Exact-match type used by the id field -->
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Porter stemmer for English -->
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_en" indexed="true" stored="true"/>
  <field name="content" type="text_en" indexed="true" stored="false"/>
</schema>

The explicit schema means you get predictable, reproducible index behavior across deployments.
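When a collection uses a managed schema rather than a hand-edited file, fields can also be added through the Schema API. The sketch below only builds the JSON body for an add-field command; the helper name and its defaults are illustrative.

```python
import json


def add_field_command(name, field_type, stored=True, indexed=True, multi_valued=False):
    """Return the JSON body for a Schema API add-field command."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
            "multiValued": multi_valued,
        }
    }


payload = add_field_command("category", "string")
print(json.dumps(payload, indent=2))
# POST this body to /solr/<collection>/schema on a collection with a managed schema.
```

Keeping these commands in version control gives a managed schema much of the reproducibility of a checked-in schema.xml.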

Faceted Search

Faceted search is where Solr really shines. It groups search results by field values or ranges and returns a count for each group. Users can then filter by these facets to narrow their search.

{
  "query": "search",
  "facet": {
    "categories": {
      "type": "terms",
      "field": "category",
      "mincount": 1
    },
    "date_ranges": {
      "type": "range",
      "field": "publish_date",
      "start": "2020-01-01T00:00:00Z",
      "end": "2027-01-01T00:00:00Z",
      "gap": "+1YEAR"
    },
    "top_authors": {
      "type": "terms",
      "field": "author",
      "limit": 5
    }
  }
}

The response includes counts for each facet value, letting you display something like “Category: Tutorials (42), Technical (38), Opinion (15)” alongside search results.
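A minimal sketch of that display step, assuming the JSON Facet API response shape (a facets object whose terms facets carry a buckets list of val/count pairs). The sample response is illustrative data, not output from a real index.

```python
def format_facet(label, buckets):
    """Render facet buckets as 'Label: Value (count), ...' for display."""
    parts = [f"{str(bucket['val']).title()} ({bucket['count']})" for bucket in buckets]
    return f"{label}: " + ", ".join(parts)


# Illustrative response fragment.
sample_response = {
    "facets": {
        "count": 95,
        "categories": {
            "buckets": [
                {"val": "tutorials", "count": 42},
                {"val": "technical", "count": 38},
                {"val": "opinion", "count": 15},
            ]
        },
    }
}

line = format_facet("Category", sample_response["facets"]["categories"]["buckets"])
print(line)
```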

Pivot Facets

Pivot facets let you nest facets. For example, you can find counts for “category x author” combinations:

q=*:*&facet=true&facet.pivot=category,author

This is useful for analytics dashboards where you want to see not just counts per category, but counts per author within each category.
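Assuming the request uses the classic facet.pivot=category,author parameter, the response nests each child level under a pivot key. The sketch below flattens that tree into rows a dashboard could render; the sample data is made up for illustration.

```python
def flatten_pivot(entries, path=()):
    """Yield (path, count) tuples for every level of a facet.pivot result."""
    for entry in entries:
        current = path + (entry["value"],)
        yield current, entry["count"]
        # Child levels, if any, sit under the "pivot" key.
        yield from flatten_pivot(entry.get("pivot", []), current)


# Illustrative response fragment.
sample = {
    "facet_counts": {
        "facet_pivot": {
            "category,author": [
                {
                    "field": "category", "value": "tutorials", "count": 3,
                    "pivot": [
                        {"field": "author", "value": "alice", "count": 2},
                        {"field": "author", "value": "bob", "count": 1},
                    ],
                },
            ]
        }
    }
}

rows = list(flatten_pivot(sample["facet_counts"]["facet_pivot"]["category,author"]))
for path, count in rows:
    print(" > ".join(path), count)
```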

Relevance Tuning

Solr offers extensive relevance controls. The edismax query parser is more forgiving than the standard parser and allows fine-tuning through parameters.

Boosting

You can boost fields or documents at query time:

q=search&defType=edismax&qf=title^2 content&pf=title^3

This sets:

  • qf: Query fields with title weighted 2x over content
  • pf: Phrase fields (title) with 3x boost for phrase matches
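Because boosts contain characters such as ^ and spaces, it is safer to let a library encode the parameters than to concatenate them by hand. A small sketch using only the standard library follows; the helper names are illustrative.

```python
from urllib.parse import parse_qs, urlencode


def boosted(fields):
    """Turn {'title': 2, 'content': 1} into 'title^2 content^1'."""
    return " ".join(f"{name}^{boost}" for name, boost in fields.items())


def edismax_query(text, qf, pf=None):
    """Return a URL-encoded edismax query string."""
    params = {"q": text, "defType": "edismax", "qf": boosted(qf)}
    if pf:
        params["pf"] = boosted(pf)
    return urlencode(params)


query_string = edismax_query("search", {"title": 2, "content": 1}, {"title": 3})
print(query_string)
# Decoding shows the values Solr will actually receive.
print(parse_qs(query_string)["qf"])
```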

Function Queries

Solr’s function queries let you incorporate mathematical expressions into relevance scoring:

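{
  "query": "search",
  "params": {
    "defType": "edismax",
    "qf": "title content",
    "boost": "recip(ms(NOW,last_modified),3.16e-11,1,1)"
  }
}

Here recip and ms turn document age into a multiplicative recency boost. Functions like recip, product, linear, and max give you precise control over how documents are ranked.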
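To get a feel for what recip does to a score, it helps to evaluate it outside Solr. recip(x, m, a, b) computes a / (m*x + b). The sketch below reproduces that formula in Python and applies it to document age in milliseconds; the constants in recency_boost are a commonly used choice, not values from this post.

```python
def recip(x, m, a, b):
    """Python equivalent of Solr's recip(x, m, a, b) = a / (m * x + b)."""
    return a / (m * x + b)


MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10


def recency_boost(age_ms):
    # m = 3.16e-11 is roughly 1 / MS_PER_YEAR, so the boost halves after about a year.
    return recip(age_ms, 3.16e-11, 1, 1)


for years in (0, 1, 5):
    print(years, round(recency_boost(years * MS_PER_YEAR), 3))
```

The curve starts at 1.0 for a brand-new document and decays smoothly, so older documents are demoted without ever being excluded.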

Term Frequency Normalization

If a field holds very long text, raw term frequency becomes less meaningful. Solr’s BM25 similarity (the default in recent versions) or a custom implementation can normalize term frequencies by document length:

<similarity class="solr.BM25SimilarityFactory"/>
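The effect of length normalization is easier to see with numbers. The sketch below implements the textbook BM25 term-frequency component with the usual defaults k1=1.2 and b=0.75; Lucene’s internal implementation differs in constant factors, so treat this as an illustration of the shape rather than of exact scores.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component with length normalization."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)


# Same term frequency, very different document lengths.
short_doc = bm25_tf(tf=3, doc_len=100, avg_doc_len=500)
long_doc = bm25_tf(tf=3, doc_len=5000, avg_doc_len=500)
print(round(short_doc, 3), round(long_doc, 3))

# Term frequency saturates: it can never exceed k1 + 1.
print(round(bm25_tf(tf=100, doc_len=500, avg_doc_len=500), 3))
```

Raising b penalizes long documents more strongly, and raising k1 lets repeated terms keep contributing for longer before saturating.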

Solr vs Elasticsearch

Both Solr and Elasticsearch sit on Apache Lucene, but they have different philosophies.

| Aspect | Solr | Elasticsearch |
| --- | --- | --- |
| Schema | Explicit schema.xml | Dynamic mapping with explicit override |
| Configuration | XML files, more verbose | JSON-based REST API |
| Distributed search | SolrCloud (ZooKeeper required) | Built-in, no ZooKeeper |
| Faceted search | More mature faceting | Good faceting, less flexible |
| Operational complexity | Higher | Lower |
| Enterprise adoption | Legacy systems | Modern cloud-native |

Solr’s explicit schema and XML configuration appeal to teams that want declarative control and reproducibility. Elasticsearch’s REST-first approach and dynamic mapping make it more developer-friendly.

If you need deep faceting for e-commerce or analytics, Solr is often the stronger choice. For simple full-text search with easier operations, Elasticsearch is usually the default.

When to Use / When Not to Use

When to Use Apache Solr

  • E-commerce faceted search with complex hierarchies (category, brand, price range, attributes)
  • Enterprise search with strict security requirements and legacy system integration
  • Data preprocessing pipelines needing ETL-style document enrichment via Update Request Processors
  • Deep analytics with pivot facets, function queries, and complex aggregations
  • Document-centric search where schema control and reproducibility matter
  • Strict consistency requirements where SolrCloud’s ZooKeeper-based leader election fits the use case

When Not to Use Apache Solr

  • Cloud-native microservices where ZooKeeper dependency is operational overhead
  • Real-time log analysis (use Elasticsearch with Beats/Logstash instead)
  • Simple CRUD with search where a database with full-text search suffices
  • Teams without Solr expertise — the learning curve is steeper than Elasticsearch
  • Rapid prototyping — Solr’s XML configuration slows down initial development
  • Kubernetes-first deployments where stateless pods and auto-scaling are priorities

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| ZooKeeper connection loss | Cluster goes read-only, leader election stalls | Deploy a ZooKeeper ensemble (3+ nodes), monitor zk_connections, tune zkClientTimeout |
| Leader shard replica failure | Write operations halt until failover completes | Configure replicationFactor >= 2, use SolrCloud auto-recovery |
| Index corruption | Queries return empty results or throw exceptions | Use solr-admin UI to verify index integrity, run IndexFingerprint comparison |
| Tika extraction failure | Binary attachments (PDF, DOCX) not indexed | Set timeout on ExtractingRequestHandler, fall back to raw text extraction |
| Facet overflow (too many facet buckets) | OOM or slow queries | Set facet.enum.cache.minDf to reduce enum iterations, paginate facets |
| Config drift across nodes | Inconsistent analyzers, different scoring | Store config in ZooKeeper (configsets), push updates atomically |

Observability Checklist

Metrics to Monitor

{
  "cluster_metrics": {
    "cluster_status": "active/standby/degraded",
    "num_shards": "total across collection",
    "num_nodes": "live nodes count",
    "overseer_queue_size": "< 100 typically"
  },
  "collection_metrics": {
    "docs_count": "total indexed documents",
    "size_in_bytes": "index size on disk",
    "num_segments": "< 50 per shard",
    "deletion_ratio": "< 10% ideal"
  },
  "query_metrics": {
    "QTime_avg": "< 100ms for interactive",
    "QTime_p99": "< 500ms for complex faceted queries",
    "request_times": "track by handler endpoint"
  }
}
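Two of these thresholds, segment count and deletion ratio, can be derived from per-core index statistics. The sketch below assumes a dictionary with numDocs, maxDoc, and segmentCount keys, which is how the Luke request handler is expected to report them; verify the key names against your Solr version. The sample numbers are illustrative.

```python
def index_health(index_stats, max_segments=50, max_deletion_ratio=0.10):
    """Return (deletion_ratio, warnings) for one core's index statistics."""
    max_doc = index_stats["maxDoc"]
    deleted = max_doc - index_stats["numDocs"]
    ratio = deleted / max_doc if max_doc else 0.0
    warnings = []
    if index_stats["segmentCount"] >= max_segments:
        warnings.append("too many segments")
    if ratio >= max_deletion_ratio:
        warnings.append("high deletion ratio")
    return ratio, warnings


# Illustrative numbers, not taken from a real index.
healthy = {"numDocs": 990, "maxDoc": 1000, "segmentCount": 12}
degraded = {"numDocs": 850, "maxDoc": 1000, "segmentCount": 64}

print(index_health(healthy))
print(index_health(degraded))
```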

Key Logs to Capture

  • Solr logs: logs/solr.log — shard leader elections, distributed search failures
  • GC logs: Solr is Java-heavy; long GC pauses cause query timeouts
  • Audit logs: security events, authentication failures, authorization denials
  • Slow query logs: set slowQueryThresholdMillis so that queries exceeding the threshold are logged as warnings

<!-- Log slow queries: goes inside the <query> section of solrconfig.xml -->
<query>
  <slowQueryThresholdMillis>5000</slowQueryThresholdMillis>
</query>

Alerts to Configure

| Alert | Condition | Severity |
| --- | --- | --- |
| ZooKeeper down | zk_connections == 0 | Critical |
| No leader for shard | shard leader == null for > 1 min | Critical |
| Disk usage high | disk usage > 80% | Warning |
| GC pause | GC pause > 1s | Warning |
| Query timeout rate | timeout_rate > 5% over 5 min | Warning |
| Index replication lag | replication_failed_count > 0 | Warning |
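The "no leader for shard" condition can be checked from a CLUSTERSTATUS response. The sketch below assumes the usual layout (cluster > collections > shards > replicas, with a leader flag on the leader replica) and uses a hand-written sample rather than output from a real cluster.

```python
def shards_without_leader(cluster_status):
    """Return (collection, shard) pairs that have no active leader replica."""
    missing = []
    for coll_name, coll in cluster_status["cluster"]["collections"].items():
        for shard_name, shard in coll["shards"].items():
            replicas = shard.get("replicas", {}).values()
            has_leader = any(
                str(r.get("leader", "")).lower() == "true" and r.get("state") == "active"
                for r in replicas
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return missing


# Illustrative response fragment.
sample_status = {
    "cluster": {
        "collections": {
            "my-collection": {
                "shards": {
                    "shard1": {"replicas": {
                        "core_node1": {"state": "active", "leader": "true"},
                        "core_node2": {"state": "active"},
                    }},
                    "shard2": {"replicas": {
                        "core_node3": {"state": "down"},
                    }},
                }
            }
        }
    }
}

leaderless = shards_without_leader(sample_status)
print(leaderless)
```

An alerting job would run this on a schedule and fire only if the same shard stays leaderless for longer than the one-minute window in the table.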

Security Checklist

  • Enable authentication with the BasicAuth or Kerberos plugin
  • Configure authorization with the Rule-Based Authorization plugin, mapping permissions to roles
  • Use ZooKeeper ACLs to protect cluster configuration from unauthorized access
  • Enable TLS for all node-to-node and client-to-node communication
  • Restrict JMX/RMI endpoints; expose only on internal networks
  • Validate input in UpdateRequestProcessor chains to prevent injection
  • Restrict the Admin UI in production: require authentication for every request (blockUnknown: true) and expose it only on internal networks
  • Use separate configsets for multi-tenant deployments to isolate schemas
  • Audit security events: log authentication failures and permission denials

Example: enable BasicAuth in security.json:

{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "blockUnknown": true,
    "credentials": {
      "admin": "<base64 password hash> <base64 salt>"
    }
  }
}
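The credentials value is a salted hash, not a plain password. As a sketch, and assuming the BasicAuth plugin’s format of base64(sha256(sha256(salt + password))) followed by a space and base64(salt), the value can be generated as below. Check this against the reference guide for your Solr version before relying on it.

```python
import base64
import hashlib
import os


def solr_basic_auth_credential(password, salt=None):
    """Build a 'hash salt' credential string (assumed BasicAuthPlugin format)."""
    if salt is None:
        salt = os.urandom(32)  # random salt per user
    first = hashlib.sha256(salt + password.encode("utf-8")).digest()
    second = hashlib.sha256(first).digest()
    encoded_hash = base64.b64encode(second).decode("ascii")
    encoded_salt = base64.b64encode(salt).decode("ascii")
    return f"{encoded_hash} {encoded_salt}"


# A fixed salt is used here only to make the example repeatable.
credential = solr_basic_auth_credential("example-password", salt=b"\x00" * 32)
print(credential)
```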

Common Pitfalls / Anti-Patterns

Over-Configuring solrconfig.xml

Solr’s XML configuration is powerful but verbose. Teams often copy-paste configurations with unused components, causing memory bloat and slower query parsing.

Fix: Audit your solrconfig.xml quarterly. Remove unused request handlers, caches, and query parsers.

Ignoring the Overseer Bottleneck

The Overseer handles cluster state management. In large clusters with frequent shard moves, the Overseer queue backs up, causing delayed state propagation.

Fix: Avoid unnecessary shard moves. If you need frequent rebalancing, consider splitting collections instead.

Not Warming Caches Properly

Solr’s caches (filterCache, queryResultCache, documentCache) start empty. A cold cache after restart causes slow initial queries.

Fix: Enable autowarming with autowarmCount on each cache, and register firstSearcher and newSearcher listeners (QuerySenderListener) that replay representative queries.

<!-- Configure autowarming for filter cache -->
<filterCache class="solr.CaffeineCache"
             size="10000"
             initialSize="5000"
             autowarmCount="2000"/>

Using the Wrong Parser for User Input

The standard query parser (defType=lucene) is strict and throws exceptions on syntax errors such as unbalanced quotes or parentheses. User-facing queries should use edismax, which is more forgiving.

Fix: Use edismax (or dismax) for user-generated query strings:

q=user_input&defType=edismax&qf=title content
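When user text does have to pass through the strict parser, for example inside a filter query, escaping its special characters avoids syntax errors. The sketch below mirrors the character list used by SolrJ’s ClientUtils.escapeQueryChars; the Python function itself is an illustration, not part of any Solr client.

```python
# Characters with special meaning in the standard query syntax.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text):
    """Backslash-escape query syntax characters and whitespace."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARS or ch.isspace() else ch
        for ch in text
    )


print(escape_query_chars("title:(foo"))
print(escape_query_chars("C++ guide"))
```

Escaping treats the whole input as literal text, so it also disables intentional operators such as quotes or wildcards. That trade-off is usually acceptable for filter values but not for a main search box.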

Skipping Schema Validation

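An explicit schema rejects values that cannot be parsed for a field’s type, but schemaless mode guesses each new field’s type from the first value it sees. A numeric field that was first indexed with text ends up as a text type, and range queries and sorting on it then behave unexpectedly.

Fix: Define fields explicitly and turn off field guessing in production. Set required="true" and multiValued deliberately for each field.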
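One way to catch mismatches before they reach Solr is a small client-side check in the indexing pipeline. The sketch below is illustrative: the expected types, the required set, and the view_count field are hypothetical and would normally be derived from your own schema.

```python
# Hypothetical expectations; in practice derive these from the schema.
EXPECTED_TYPES = {"id": str, "title": str, "category": str, "view_count": int}
REQUIRED_FIELDS = {"id", "title"}


def validate_document(doc):
    """Return a list of problems found in a document, empty if it looks valid."""
    errors = [f"missing required field: {name}"
              for name in sorted(REQUIRED_FIELDS - doc.keys())]
    for name, value in doc.items():
        expected = EXPECTED_TYPES.get(name)
        if expected is None:
            errors.append(f"unknown field: {name}")
        elif type(value) is not expected:
            # An exact type check also rejects bool where int is expected.
            errors.append(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
    return errors


good = {"id": "1", "title": "Getting Started with Solr", "view_count": 10}
bad = {"id": "2", "view_count": "ten", "colour": "red"}

print(validate_document(good))
print(validate_document(bad))
```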

Quick Recap

Key Bullets

  • Solr’s explicit schema.xml gives precise control over field types and analyzers
  • Faceted search is Solr’s strength — use pivot facets for analytics dashboards
  • edismax query parser is more forgiving than the standard (lucene) parser for user input
  • SolrCloud requires ZooKeeper for distributed coordination; Elasticsearch does not
  • Function queries enable mathematical expressions in relevance scoring
  • Secure clusters with the BasicAuth or Kerberos authentication plugins plus rule-based authorization
  • Tika integration handles PDF, DOCX, and other binary formats out of the box

Copy/Paste Checklist

# Paths are relative to the Solr base URL, for example http://localhost:8983/solr

# Check cluster health
GET /admin/collections?action=CLUSTERSTATUS

# Create a collection with replication
GET /admin/collections?action=CREATE&name=my-collection&numShards=3&replicationFactor=2

# Reload collection after config change
GET /admin/collections?action=RELOAD&name=my-collection

# Force commit after bulk indexing
GET /my-collection/update?wt=json&commit=true

# Check facet counts
GET /my-collection/select?q=*:*&facet=true&facet.field=category&facet.limit=10

# Export schema field names
GET /my-collection/schema/fields

# Reload a single core in standalone mode (warming queries come from solrconfig.xml listeners)
GET /admin/cores?action=RELOAD&core=my-collection

Conclusion

Apache Solr remains a capable search platform despite its age. Its explicit schema gives you control by default, where Elasticsearch starts from dynamic mapping. Faceted search is mature and flexible. The tradeoff is operational complexity: SolrCloud with ZooKeeper adds overhead that Elasticsearch avoids.

If you are working in an enterprise environment with existing Solr investments, or if you need deep faceting capabilities, Solr is worth a serious look. Otherwise, Elasticsearch’s easier operations and broader ecosystem make more sense for new projects.
