Apache Solr as a NoSQL Database: When Search Beats Storage

Most engineers reach for MongoDB when they need a flexible, document-oriented NoSQL database. That's a reasonable default. But for read-heavy workloads where search, filtering, and faceting are core operations, Apache Solr is often the faster choice, and the architectural reason why is worth understanding.

I ran 150+ Solr nodes in production at TransUnion CIBIL for several years. The workload: credit bureau data at scale, under RBI regulatory compliance, with 99%+ uptime requirements. That experience taught me clearly where Solr wins and where it doesn't.

The Core Architectural Difference

MongoDB uses B-tree indexes — the same fundamental data structure as most relational databases. B-trees are excellent for point lookups and range queries on indexed fields. They're the right choice when you need to find a specific document by ID or retrieve documents where amount > 10000.

Solr uses inverted indexes (via Lucene). An inverted index maps terms to the documents containing them. For full-text search, this is dramatically more efficient — instead of scanning documents to find which ones contain "credit default swap," you look up the term and get a pre-built list of document IDs. Faceting, ranking, and relevance scoring are all native operations that inverted indexes handle with minimal overhead.
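To make the term-to-documents mapping concrete, here is a minimal Python sketch of an inverted index. It is a toy under stated assumptions, not how Lucene is implemented: the whitespace tokenizer, the sample documents, and the function names are all made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return IDs of documents that contain every term in the query."""
    term_sets = [set(index.get(term, [])) for term in tokenize(query)]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


# Illustrative documents, not real data.
docs = {
    1: "credit default swap pricing",
    2: "credit card default rates",
    3: "interest rate swap basics",
}
index = build_inverted_index(docs)
print(index["credit"])
print(search_all_terms(index, "credit default"))
```

The lookup never touches the document text: it intersects pre-built posting lists, which is why the cost depends on how many documents match rather than on how many exist.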

The practical implication: if your primary access pattern is "find documents matching these criteria, ranked by relevance, with counts by category," Solr is usually the better fit, because its index structure is built for exactly that query shape.

Schema Flexibility in Solr

A common misconception is that Solr requires a rigid schema. In reality, Solr supports two mechanisms for flexible schemas:

Dynamic fields: Define a pattern and Solr automatically handles new fields matching it:

<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="pdate" indexed="true" stored="true"/>

Any field ending in _txt gets text analysis. Any field ending in _i gets integer handling. You can add new fields to documents without schema migrations.
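The suffix matching can be sketched in a few lines of Python. This only illustrates the idea using the three patterns above. It is not Solr's matching code, and the `resolve_field_type` helper, the explicit `id` field, and the sample document are assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Suffix patterns copied from the dynamicField declarations above.
DYNAMIC_FIELDS = [
    ("*_txt", "text_general"),
    ("*_i", "pint"),
    ("*_dt", "pdate"),
]


def resolve_field_type(field_name, explicit_fields=None):
    """Return the field type for a name, or None if nothing matches.

    Explicitly declared fields win; otherwise the patterns are checked
    in declaration order (a simplification for this sketch).
    """
    explicit_fields = explicit_fields or {}
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


# Hypothetical document with fields that were never declared one by one.
doc = {
    "id": "doc-1",
    "summary_txt": "quarterly report",
    "score_i": 720,
    "opened_dt": "2024-01-01T00:00:00Z",
    "notes": "no matching pattern",
}
types = {name: resolve_field_type(name, {"id": "string"}) for name in doc}
print(types)
```

A field such as `notes` that matches no pattern resolves to `None` here; in a schema without schemaless mode, a document carrying such a field would be rejected.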

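Schemaless mode (built on the managed schema): Solr guesses a field's type from the first value it sees for that field and adds the field to the schema automatically. Useful for prototyping, though I'd recommend explicitly defined schemas in production for anything you care about, since a wrong first guess is awkward to undo.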

SolrCloud Setup and Configuration

SolrCloud is Solr's distributed mode, built on ZooKeeper for coordination. For a production cluster:

# Start ZooKeeper on each node of the ensemble (3 nodes for HA)
bin/zkServer.sh start
 
# Start Solr in SolrCloud mode
bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181 -p 8983
 
# Create a collection with sharding and replication
bin/solr create -c my_collection \
  -shards 4 \
  -replicationFactor 2 \
  -confdir _default
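The same collection can also be requested over HTTP through the Collections API. The Python sketch below only builds the request URL. The base URL `http://localhost:8983/solr` is an assumption, and config set handling may differ in detail from what `bin/solr create` does, so treat it as a rough counterpart rather than an exact equivalent.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE request URL (does not send it)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Assumed local node; substitute any live Solr node in the cluster.
url = create_collection_url(
    "http://localhost:8983/solr", "my_collection", 4, 2, "_default"
)
print(url)
```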

Key SolrCloud concepts:

Collection: The logical index (roughly equivalent to a table, or a MongoDB collection)
Shard: A horizontal partition of the collection
Replica: A copy of a shard for redundancy
Leader: The replica that accepts writes for a shard

For the TransUnion CIBIL deployment, we ran a 4-shard, 3-replica configuration across 12 physical nodes — giving us both horizontal scale and the ability to lose an entire rack without data loss or downtime.
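The rack-loss claim follows from how replicas are spread out. The Python sketch below checks the arithmetic under one loudly stated assumption: three racks with each shard keeping exactly one replica per rack. The rack and node names are hypothetical, and the actual placement used in that deployment is not described here.

```python
from itertools import product

SHARDS = 4
REPLICAS = 3
RACKS = ["rack-a", "rack-b", "rack-c"]  # hypothetical: three racks assumed


def place_replicas(shards, replicas, racks):
    """Put replica r of every shard on rack r, one replica per node."""
    placement = []
    for shard, replica in product(range(1, shards + 1), range(replicas)):
        rack = racks[replica % len(racks)]
        node = f"{rack}-node{shard}"
        placement.append({"shard": f"shard{shard}", "node": node, "rack": rack})
    return placement


def surviving_replicas(placement, failed_rack):
    """Count the replicas per shard that remain after one rack is lost."""
    counts = {}
    for entry in placement:
        counts.setdefault(entry["shard"], 0)
        if entry["rack"] != failed_rack:
            counts[entry["shard"]] += 1
    return counts


placement = place_replicas(SHARDS, REPLICAS, RACKS)
print(len({entry["node"] for entry in placement}))  # distinct nodes used
for rack in RACKS:
    print(rack, surviving_replicas(placement, rack))
```

With this layout, losing any single rack removes one replica of each shard and leaves two, so every shard can still elect a leader and serve traffic.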

Query Patterns That Shine

Where Solr's query language earns its reputation:

# Full-text search with field boosting
q=credit+default&defType=edismax&qf=description^2+tags^1.5+content^1

# Faceted search — counts by category in a single query
q=*:*&facet=true&facet.field=category&facet.field=status&rows=20

# Range filter with sorting (every match scores the same under q=*:*, so sort on a field)
q=*:*&fq=amount:[10000 TO *]&fq=date:[2024-01-01T00:00:00Z TO NOW]&sort=amount desc

# Geospatial search
q=*:*&fq={!geofilt sfield=location pt=28.6139,77.2090 d=10}

The faceting query deserves particular attention. Returning 20 search results plus counts across multiple categories in a single network round-trip is something MongoDB needs either multiple queries or an aggregation pipeline (for example, a $facet stage) to achieve. At scale, that difference matters.
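From the client side, the faceted query can be sketched in Python as follows. The code builds the query string and unpacks the `facet_fields` section, which Solr's JSON response returns by default as a flat list alternating value and count. The sample response is made-up illustration data, and the helper names are assumptions.

```python
from urllib.parse import urlencode


def facet_query_string(query, facet_fields, rows=20):
    """Build the query string for a faceted search (repeats facet.field)."""
    params = [("q", query), ("facet", "true"), ("rows", rows)]
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


def parse_facet_counts(response):
    """Turn flat [value, count, value, count, ...] lists into dicts."""
    parsed = {}
    fields = response.get("facet_counts", {}).get("facet_fields", {})
    for field, flat in fields.items():
        parsed[field] = dict(zip(flat[0::2], flat[1::2]))
    return parsed


query_string = facet_query_string("*:*", ["category", "status"])

# Made-up response fragment, shaped like Solr's default JSON facet output.
sample_response = {
    "response": {"numFound": 42, "docs": []},
    "facet_counts": {
        "facet_fields": {
            "category": ["loans", 30, "cards", 12],
            "status": ["active", 40, "closed", 2],
        }
    },
}
facets = parse_facet_counts(sample_response)
print(query_string)
print(facets["category"])
```

The documents and all facet counts arrive in the same response body, which is the single round-trip the paragraph above describes.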

Production Scale: 150+ Nodes, RBI Compliance

At TransUnion CIBIL, we operated under Reserve Bank of India data handling requirements — which meant strict controls on data residency, access logging, and availability guarantees. Solr's architecture accommodated this well:

Zero-downtime rolling updates: We could update Solr versions and configuration by taking replicas offline one at a time. Solr's leader election, coordinated through ZooKeeper, handled failover whenever the replica being restarted was a shard leader. We ran planned maintenance windows with no user-visible downtime.

Audit logging: Solr's request logging captured every query with timestamp, user context, and response time — satisfying regulatory audit trail requirements.

Horizontal scale without application changes: Adding capacity meant adding nodes and re-balancing shards. The application layer saw no changes — it talked to the same collection endpoint.

We sustained 99.2% uptime over a 3-year period on that deployment, including several major Solr version upgrades.

When NOT to Use Solr

Solr is not a general-purpose database. Don't use it when:

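Write-heavy workloads: Solr's indexing pipeline is not optimized for high write throughput, and frequent commits make sustained ingest expensive. If you're ingesting 10,000 documents per second, consider Elasticsearch, which is often regarded as easier to tune for heavy ingest even though both engines are built on Lucene, or put Kafka in front and index in batches.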
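The batch-indexing idea can be sketched in Python like this. It is a minimal illustration: `send_batch` is a hypothetical callback standing in for whatever posts documents to Solr's update handler, and the batch size of 1,000 is an arbitrary choice, not a recommendation from this deployment.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(docs, batch_size, send_batch):
    """Send documents in batches and return how many were sent."""
    sent = 0
    for batch in batched(docs, batch_size):
        send_batch(batch)  # hypothetical: one update request per batch
        sent += len(batch)
    return sent


# Stand-in for a consumer loop: collect the batches instead of sending them.
collected = []
stream = ({"id": str(i)} for i in range(2500))
total = index_in_batches(stream, 1000, collected.append)
print(total, [len(batch) for batch in collected])
```

Grouping documents this way means one request and one commit decision per batch instead of per document, which is where most of the per-write overhead goes.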

ACID transactions: Solr does not support multi-document transactions. If you need to update multiple documents atomically, Solr is the wrong tool.

Simple key-value lookups: If your access pattern is "get document by ID," MongoDB or even Redis is faster and simpler. Don't bring in Solr's complexity for use cases it's not suited for.

Small datasets: The operational complexity of SolrCloud is only justified at meaningful scale. For a few million documents with simple search requirements, Elasticsearch's easier onboarding may be preferable, or even PostgreSQL's full-text search capabilities.

The Right Question

The question isn't "Solr or MongoDB?" It's "what is my primary access pattern?" If search, faceting, and ranked retrieval are core, and you need to operate at scale, Solr's inverted index architecture gives you performance that a document store's B-tree indexes are not designed to match.

If you're designing a data architecture where search is critical and you want to talk through the trade-offs from production experience, get in touch.
