Zero-Downtime Apache Solr in Production: VIP + Keepalived Architecture
When your Solr cluster powers a national credit reporting platform regulated by the Reserve Bank of India, downtime is not an option. Not "we prefer no downtime" — it's a hard regulatory requirement. Every consumer credit query, every background check, every loan application flows through this infrastructure.
This is how we architected for zero-downtime patching, version upgrades, and node recovery across 150+ Solr nodes.
The Challenge
Standard Solr deployments use SolrCloud with ZooKeeper for coordination. SolrCloud handles shard routing, leader elections, and replication — but it doesn't protect you from the operational realities:
- OS patching requires node restarts
- Solr version upgrades require rolling restarts at minimum
- JVM tuning changes require restarts
- Config changes via ConfigSet API can cascade
At scale, "rolling restart" means hours of risk window where any one node failure could cascade.
The Pattern: VIP + Keepalived in Front of SolrCloud
The core insight is to decouple traffic routing from Solr's own cluster coordination.
┌─────────────────────┐
│ Virtual IP (VIP) │
│ 192.168.1.100 │
└──────────┬──────────┘
│
┌────────────────┴────────────────┐
│ Keepalived │
│ (VRRP-based VIP failover) │
└────────────────┬────────────────┘
│
┌────────────────┴────────────────┐
│ HAProxy │
│ (health-checked load balancer) │
└────┬────────────────────┬────────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Solr Node │ │ Solr Node │
│ (Leader) │ │ (Replica) │
└─────────────┘ └─────────────┘
A VIP is a floating IP address owned by exactly one server at a time. If that server goes down, Keepalived's VRRP protocol moves the VIP to a standby server in under a second, invisibly to clients.
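A quick way to see which node currently owns the VIP is to check whether the address is bound on the VRRP interface. A minimal sketch, assuming the address and interface used throughout this article (192.168.1.100 on eth0):

```shell
# Which node holds the VIP right now? The VRRP master has the address
# bound on the tracked interface; the standby does not.
VIP="192.168.1.100"
IFACE="eth0"
if ip addr show "$IFACE" 2>/dev/null | grep -q "inet ${VIP}/"; then
  echo "this node holds the VIP (MASTER)"
else
  echo "VIP is elsewhere (BACKUP or fault)"
fi
```

Run it on both nodes during a failover drill: the answers should swap within about a second of killing haproxy on the master.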
Keepalived Configuration
vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101              # higher = preferred master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_here
    }
    virtual_ipaddress {
        192.168.1.100/24      # the VIP
    }
    track_script {
        chk_haproxy
    }
}

On the standby node, set state BACKUP and priority 100. Keepalived handles promotion automatically.
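For completeness, the only lines that differ on the standby node look like this (the rest of the vrrp_instance block is identical to the master's):

```
vrrp_instance VI_1 {
    state BACKUP
    priority 100      # lower than the master's 101
    # ...remaining settings identical to the master...
}
```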
HAProxy Health Checks for Solr
HAProxy sits behind the VIP and distributes queries across Solr nodes, with active health checking:
frontend solr_frontend
    bind 192.168.1.100:8983
    default_backend solr_backend

backend solr_backend
    balance leastconn
    option httpchk GET /solr/admin/ping
    http-check expect status 200
    server solr-01 10.0.1.1:8983 check inter 5s fall 2 rise 3
    server solr-02 10.0.1.2:8983 check inter 5s fall 2 rise 3
    server solr-03 10.0.1.3:8983 check inter 5s fall 2 rise 3 backup

The /solr/admin/ping endpoint returns 200 only when the node is healthy and connected to ZooKeeper. HAProxy removes unhealthy nodes before they become a problem.
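Before and after any maintenance, it is worth confirming what HAProxy actually thinks about each backend server. A quick check via the runtime API; the socket path here is an assumption, so match it to the stats socket line in your haproxy.cfg:

```shell
# Dump per-server state for the solr backend via the HAProxy runtime API.
# The output includes the admin state (READY/DRAIN/MAINT) and check status.
HAPROXY_SOCK="/var/run/haproxy/admin.sock"
echo "show servers state solr_backend" | socat stdio "${HAPROXY_SOCK}" \
  || echo "could not reach HAProxy admin socket at ${HAPROXY_SOCK}"
```

You can also hit the health check the same way HAProxy does: curl -s http://10.0.1.1:8983/solr/admin/ping should report an OK status on a healthy, ZooKeeper-connected node.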
The Zero-Downtime Patching Procedure
With this architecture, patching becomes a safe, scripted procedure:
#!/bin/bash
# Safe rolling Solr node maintenance
set -euo pipefail

HAPROXY_SOCK="/var/run/haproxy/admin.sock"
NODE="solr-01"
HAPROXY_SERVER="solr_backend/${NODE}"

# Step 1: Remove node from HAProxy rotation (DRAIN state)
echo "Draining ${NODE} from load balancer..."
echo "set server ${HAPROXY_SERVER} state drain" | socat stdio "${HAPROXY_SOCK}"

# Step 2: Wait for in-flight requests to complete
echo "Waiting 30s for connections to drain..."
sleep 30

# Step 3: Report remaining active connections before proceeding
ACTIVE=$(echo "show info" | socat stdio "${HAPROXY_SOCK}" | grep "CurrConns" | awk '{print $2}')
echo "Active connections: ${ACTIVE}"

# Step 4: Patch the OS / upgrade Solr
systemctl stop solr
apt-get install --only-upgrade -y solr
systemctl start solr

# Step 5: Wait for Solr to rejoin ZooKeeper and report healthy
./wait-for-solr-healthy.sh "${NODE}"

# Step 6: Re-enable in HAProxy
echo "Re-enabling ${NODE}..."
echo "set server ${HAPROXY_SERVER} state ready" | socat stdio "${HAPROXY_SOCK}"
echo "Done. ${NODE} is back in rotation."

Observability: Grafana + Prometheus
No HA story is complete without observability. We scrape Solr's built-in metrics via the Prometheus exporter:
scrape_configs:
  - job_name: 'solr'
    static_configs:
      - targets:
          - 'solr-01:9854'
          - 'solr-02:9854'
          - 'solr-03:9854'
    metrics_path: /metrics

Key dashboards we monitor:
- Query latency p99 — alert if > 500ms sustained
- Index merge time — spikes signal resource pressure
- ZooKeeper session timeouts — early warning for split-brain
- GC pause duration — JVM tuning signal
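The latency alert above can be encoded as a Prometheus alerting rule. This is a sketch only: the metric name in the expr is a placeholder, so substitute the actual name your exporter exposes (check its /metrics output):

```yaml
groups:
  - name: solr-alerts
    rules:
      - alert: SolrQueryP99LatencyHigh
        # placeholder metric name; use the real one from the exporter's /metrics
        expr: solr_query_latency_p99_ms > 500
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Solr p99 query latency above 500ms on {{ $labels.instance }}"
```

The for: 5m clause implements the "sustained" qualifier: a single slow scrape does not page anyone.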
What We Achieved
After full implementation across the 150+ node cluster:
- Zero unplanned downtime over 18 months of operation
- Patch cycles reduced from 4-hour maintenance windows to < 15 minutes per node
- L0/L1 toil reduced by ~60% — automated restart/sanity scripts replaced manual intervention
- Formal commendation from India Head VP for eliminating the "rolling restart risk window"
The combination of VIP + Keepalived + HAProxy health checking + Jenkins automation creates a platform that is far more resilient than SolrCloud alone — and makes the ops team's life significantly more manageable.
Questions about Solr HA patterns or this architecture? Connect on LinkedIn.