Zero-Downtime Apache Solr in Production: VIP + Keepalived Architecture
When your Solr cluster powers a national credit reporting platform regulated by the Reserve Bank of India, downtime is not an option. Not "we prefer no downtime" — it's a hard regulatory requirement. Every consumer credit query, every background check, every loan application flows through this infrastructure.
This is how we architected for zero-downtime patching, version upgrades, and node recovery across 150+ Solr nodes.
The Challenge
Standard Solr deployments use SolrCloud with ZooKeeper for coordination. SolrCloud handles shard routing, leader elections, and replication — but it doesn't protect you from the operational realities:
- OS patching requires node restarts
- Solr version upgrades require rolling restarts at minimum
- JVM tuning changes require restarts
- Config changes via ConfigSet API can cascade
At scale, "rolling restart" means hours of risk window where any one node failure could cascade.
The Pattern: VIP + Keepalived in Front of SolrCloud
The core insight is to decouple traffic routing from Solr's own cluster coordination.
┌─────────────────────┐
│ Virtual IP (VIP) │
│ 192.168.1.100 │
└──────────┬──────────┘
│
┌────────────────┴────────────────┐
│ Keepalived │
│ (VRRP-based VIP failover) │
└────────────────┬────────────────┘
│
┌────────────────┴────────────────┐
│ HAProxy │
│ (health-checked load balancer) │
└────┬────────────────────┬────────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Solr Node │ │ Solr Node │
│ (Leader) │ │ (Replica) │
└─────────────┘ └─────────────┘
A VIP is a floating IP address owned by exactly one server at a time. If that server goes down, Keepalived's VRRP protocol moves the VIP to a standby server in under a second, invisibly to clients.
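A quick way to see which node currently owns the VIP is to check whether the address is bound on the VRRP interface. A minimal sketch, assuming the address and interface used throughout this article (192.168.1.100 on eth0):

```shell
# Which node holds the VIP right now? The VRRP master has the address
# bound on the tracked interface; the standby does not.
VIP="192.168.1.100"
IFACE="eth0"
if ip addr show "$IFACE" 2>/dev/null | grep -q "inet ${VIP}/"; then
  echo "this node holds the VIP (MASTER)"
else
  echo "VIP is elsewhere (BACKUP or fault)"
fi
```

Run it on both nodes during a failover drill: the answers should swap within about a second of killing haproxy on the master.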
Keepalived Configuration
vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101              # higher = preferred master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_here
    }
    virtual_ipaddress {
        192.168.1.100/24      # the VIP
    }
    track_script {
        chk_haproxy
    }
}

On the standby node, set state BACKUP and priority 100. Keepalived handles promotion automatically.
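For completeness, the only lines that differ on the standby node look like this (the rest of the vrrp_instance block is identical to the master's):

```
vrrp_instance VI_1 {
    state BACKUP
    priority 100      # lower than the master's 101
    # ...remaining settings identical to the master...
}
```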
HAProxy Health Checks for Solr
HAProxy sits behind the VIP and distributes queries across Solr nodes, with active health checking:
frontend solr_frontend
    bind 192.168.1.100:8983
    default_backend solr_backend

backend solr_backend
    balance leastconn
    option httpchk GET /solr/admin/ping
    http-check expect status 200
    server solr-01 10.0.1.1:8983 check inter 5s fall 2 rise 3
    server solr-02 10.0.1.2:8983 check inter 5s fall 2 rise 3
    server solr-03 10.0.1.3:8983 check inter 5s fall 2 rise 3 backup

The /solr/admin/ping endpoint returns 200 only when the node is healthy and connected to ZooKeeper. HAProxy removes unhealthy nodes before they become a problem.
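Before and after any maintenance, it is worth confirming what HAProxy actually thinks about each backend server. A quick check via the runtime API; the socket path here is an assumption, so match it to the stats socket line in your haproxy.cfg:

```shell
# Dump per-server state for the solr backend via the HAProxy runtime API.
# The output includes the admin state (READY/DRAIN/MAINT) and check status.
HAPROXY_SOCK="/var/run/haproxy/admin.sock"
echo "show servers state solr_backend" | socat stdio "${HAPROXY_SOCK}" \
  || echo "could not reach HAProxy admin socket at ${HAPROXY_SOCK}"
```

You can also hit the health check the same way HAProxy does: curl -s http://10.0.1.1:8983/solr/admin/ping should report an OK status on a healthy, ZooKeeper-connected node.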
The Zero-Downtime Patching Procedure
With this architecture, patching becomes a safe, scripted procedure:
#!/bin/bash
# Safe rolling Solr node maintenance
set -euo pipefail

HAPROXY_SOCK="/var/run/haproxy/admin.sock"
NODE="solr-01"
HAPROXY_SERVER="solr_backend/${NODE}"

# Step 1: Remove node from HAProxy rotation (DRAIN state)
echo "Draining ${NODE} from load balancer..."
echo "set server ${HAPROXY_SERVER} state drain" | socat stdio "${HAPROXY_SOCK}"

# Step 2: Wait for in-flight requests to complete
echo "Waiting 30s for connections to drain..."
sleep 30

# Step 3: Report remaining active connections before proceeding
ACTIVE=$(echo "show info" | socat stdio "${HAPROXY_SOCK}" | grep "CurrConns" | awk '{print $2}')
echo "Active connections: ${ACTIVE}"

# Step 4: Patch the OS / upgrade Solr
systemctl stop solr
apt-get install --only-upgrade -y solr
systemctl start solr

# Step 5: Wait for Solr to rejoin ZooKeeper and report healthy
./wait-for-solr-healthy.sh "${NODE}"

# Step 6: Re-enable in HAProxy
echo "Re-enabling ${NODE}..."
echo "set server ${HAPROXY_SERVER} state ready" | socat stdio "${HAPROXY_SOCK}"
echo "Done. ${NODE} is back in rotation."

Observability: Grafana + Prometheus
No HA story is complete without observability. We scrape Solr's built-in metrics via the Prometheus exporter:
scrape_configs:
  - job_name: 'solr'
    static_configs:
      - targets:
          - 'solr-01:9854'
          - 'solr-02:9854'
          - 'solr-03:9854'
    metrics_path: /metrics

Key dashboards we monitor:
- Query latency p99 — alert if > 500ms sustained
- Index merge time — spikes signal resource pressure
- ZooKeeper session timeouts — early warning for split-brain
- GC pause duration — JVM tuning signal
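The latency alert above can be encoded as a Prometheus alerting rule. This is a sketch only: the metric name in the expr is a placeholder, so substitute the actual name your exporter exposes (check its /metrics output):

```yaml
groups:
  - name: solr-alerts
    rules:
      - alert: SolrQueryP99LatencyHigh
        # placeholder metric name; use the real one from the exporter's /metrics
        expr: solr_query_latency_p99_ms > 500
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Solr p99 query latency above 500ms on {{ $labels.instance }}"
```

The for: 5m clause implements the "sustained" qualifier: a single slow scrape does not page anyone.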
What We Achieved
After full implementation across the 150+ node cluster:
- Zero unplanned downtime over 18 months of operation
- Patch cycles reduced from 4-hour maintenance windows to < 15 minutes per node
- L0/L1 toil reduced by ~60% — automated restart/sanity scripts replaced manual intervention
- Formal commendation from India Head VP for eliminating the "rolling restart risk window"
The combination of VIP + Keepalived + HAProxy health checking + Jenkins automation creates a platform that is far more resilient than SolrCloud alone — and makes the ops team's life significantly more manageable.
Questions about Solr HA patterns or this architecture? Connect on LinkedIn.