High Availability (HA)
UVP
Most embedded databases force a choice: stay simple but single-node, or pay the operational cost of a full distributed cluster. HeliosDB Nano ships three composable HA tiers as Cargo features, all in the same binary. Tier 1 (warm standby) is on by default: async WAL replication with automatic failover, no extra flag needed. Tier 2 adds branch-based active-active with vector-clock conflict resolution. Tier 3 adds consistent-hash sharding. Combine them or stay on Tier 1; for connection routing and failover at the wire layer, point your clients at the standalone HeliosProxy binary. Pay only for the HA you use.
Tier Matrix
| Tier | Cargo feature | Default? | Architecture | Conflict resolution | Best for |
|---|---|---|---|---|---|
| Tier 1 | ha-tier1 | yes | Warm standby (async/sync WAL) | N/A (single primary) | DR, read replicas, blue-green |
| Tier 2 | ha-tier2 | no | Branch-based active-active | Vector clock + field-level merge | Multi-region writes, edge sync |
| Tier 3 | ha-tier3 | no | Consistent hash ring sharding | Per-shard | Horizontal scaling beyond one host |
Optional add-ons
| Feature | Adds | Implies |
|---|---|---|
| ha-dedup | Content-addressed deduplication across nodes | — |
| ha-ab-testing | Branch-based experiment routing | — |
| ha-branch-replication | Selective branch sync to remote servers | ha-tier2 |
| ha-full | Bundle of all of the above | ha-tier1+2+3 + add-ons |
Build recipes
```bash
# Default — Tier 1 only (warm standby + DR)
cargo build --release

# Add Tier 2 (multi-primary)
cargo build --release --features ha-tier2

# Full HA stack
cargo build --release --features ha-full
```

```toml
[dependencies]
heliosdb-nano = { version = "3.19", features = ["ha-tier2"] }
```

Tier 1: Warm Standby
Active-passive replication with automatic failover.
Architecture
```
┌─────────────┐    WAL Stream    ┌─────────────┐
│   Primary   │ ───────────────→ │   Standby   │
│  (Active)   │                  │  (Passive)  │
└─────────────┘                  └─────────────┘
       ↓                                ↓
  Read/Write                        Read-Only
```

Components
| Component | Description |
|---|---|
| WalReplicator | Streams WAL from primary |
| WalApplicator | Applies WAL on standby |
| FailoverWatcher | Monitors primary health |
| LsnManager | Tracks replication position |
| SplitBrainProtector | Prevents dual-primary scenarios |
Configuration
```rust
use heliosdb_nano::replication::{ReplicationConfig, SyncMode};

let config = ReplicationConfig::builder()
    .primary_endpoint("primary.example.com:5432")
    .sync_mode(SyncMode::Synchronous) // or Asynchronous
    .build();
```

Sync Modes
| Mode | Description | Durability | Latency |
|---|---|---|---|
| Synchronous | Wait for standby ACK | Strong | Higher |
| Asynchronous | Fire-and-forget | Eventual | Lower |
| Quorum | Wait for N/2+1 ACKs | Configurable | Medium |
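Quorum mode is listed above but not shown in the configuration example. A minimal sketch, assuming SyncMode exposes a Quorum variant matching the table (the variant name is an assumption, not confirmed API):

```rust
use heliosdb_nano::replication::{ReplicationConfig, SyncMode};

// Assumption: SyncMode::Quorum exists as the sync-mode table suggests.
// With 5 standbys, a commit waits for 5/2 + 1 = 3 ACKs before returning.
let config = ReplicationConfig::builder()
    .primary_endpoint("primary.example.com:5432")
    .sync_mode(SyncMode::Quorum)
    .build();
```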
Failover
```rust
use heliosdb_nano::replication::FailoverWatcher;

let watcher = FailoverWatcher::new(config);

watcher.on_failover(|event| {
    println!("Failover triggered: {:?}", event);
    // Promote standby to primary
});
```

Split-Brain Protection
```rust
use heliosdb_nano::replication::{SplitBrainProtector, ObserverConfig};

let protector = SplitBrainProtector::new(ObserverConfig {
    observers: vec!["observer1.example.com", "observer2.example.com"],
    quorum_size: 2,
});

protector.start();
```

Tier 2: Multi-Primary
Active-active replication with conflict resolution.
Architecture
```
┌─────────────┐   Branch Sync   ┌─────────────┐
│  Region A   │ ←─────────────→ │  Region B   │
│  (Primary)  │                 │  (Primary)  │
└─────────────┘                 └─────────────┘
   ↓      ↓                        ↓      ↓
Writes  Reads                   Writes  Reads
```

Components
| Component | Description |
|---|---|
| MultiPrimarySyncManager | Coordinates multi-region sync |
| ConflictMergeEngine | Resolves write conflicts |
| RegionCoordinator | Manages region topology |
Conflict Resolution Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Last-Write-Wins | Timestamp-based; newest write overwrites | Simple cases where silent overwrites are acceptable |
| Branch-Wins | Prefer local changes | Low-latency local writes |
| Merge | Combine changes | Collaborative editing |
| Custom | User-defined logic | Complex business rules |
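To make the Merge strategy concrete, here is a standalone sketch (plain Rust, not the HeliosDB API) of a field-level merge: each column keeps the version with the newer logical timestamp, so two regions editing different columns of the same row both survive the sync:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct FieldVersion {
    value: String,
    timestamp: u64, // logical timestamp of the write that set this field
}

// Merge two conflicting row versions column by column.
fn merge_rows(
    local: &HashMap<String, FieldVersion>,
    remote: &HashMap<String, FieldVersion>,
) -> HashMap<String, FieldVersion> {
    let mut merged = local.clone();
    for (field, theirs) in remote {
        merged
            .entry(field.clone())
            // Field exists on both sides: keep the newer write.
            .and_modify(|ours| {
                if theirs.timestamp > ours.timestamp {
                    *ours = theirs.clone();
                }
            })
            // Field only set remotely: take it as-is.
            .or_insert_with(|| theirs.clone());
    }
    merged
}
```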
Configuration
```rust
use heliosdb_nano::replication::{
    MultiPrimarySyncManager,
    ConflictResolution,
};

let sync = MultiPrimarySyncManager::new()
    .add_region("us-east", "us-east.example.com:5432")
    .add_region("eu-west", "eu-west.example.com:5432")
    .conflict_resolution(ConflictResolution::LastWriteWins)
    .build();
```

Branch-Based Replication
Multi-primary uses HeliosDB Nano’s branching for conflict-free merges:
```sql
-- Each region maintains its own branch
-- Sync merges branches across regions

-- Region A writes
INSERT INTO orders (id, total) VALUES (1, 100);

-- Region B writes (concurrent)
INSERT INTO orders (id, total) VALUES (2, 200);

-- After sync: both rows present in all regions
```
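The tier matrix above names vector clocks as Tier 2's conflict detector. A standalone sketch (not the HeliosDB API) of the comparison involved: a write is only a conflict when neither version's clock dominates the other:

```rust
use std::collections::HashMap;

type VectorClock = HashMap<String, u64>; // region id → events seen from it

#[derive(Debug, PartialEq)]
enum CausalOrder {
    Before,     // a happened before b
    After,      // b happened before a
    Equal,
    Concurrent, // neither saw the other: run conflict resolution
}

fn compare(a: &VectorClock, b: &VectorClock) -> CausalOrder {
    let (mut a_ahead, mut b_ahead) = (false, false);
    for region in a.keys().chain(b.keys()) {
        let ca = *a.get(region).unwrap_or(&0);
        let cb = *b.get(region).unwrap_or(&0);
        if ca > cb { a_ahead = true; }
        if cb > ca { b_ahead = true; }
    }
    match (a_ahead, b_ahead) {
        (false, false) => CausalOrder::Equal,
        (true, false) => CausalOrder::After,
        (false, true) => CausalOrder::Before,
        (true, true) => CausalOrder::Concurrent,
    }
}
```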
Tier 3: Sharding
Horizontal scaling with consistent hashing.
Architecture
```
               ┌─────────────┐
               │   Router    │
               └──────┬──────┘
      ┌───────────────┼───────────────┐
      ↓               ↓               ↓
┌──────────┐    ┌──────────┐    ┌──────────┐
│  Shard 1 │    │  Shard 2 │    │  Shard 3 │
│  (0-33%) │    │ (34-66%) │    │ (67-100%)│
└──────────┘    └──────────┘    └──────────┘
```

Components
| Component | Description |
|---|---|
| HashRing | Consistent hashing for key distribution |
| ShardRouter | Routes queries to correct shard |
| ReshardManager | Online resharding with minimal downtime |
| VectorPartitioner | Special partitioning for vector data |
Sharding Strategies
| Strategy | Description | Best For |
|---|---|---|
| Hash | Consistent hash of shard key | Even distribution |
| Range | Key ranges per shard | Time-series data |
| Geographic | Location-based routing | Multi-region |
| Vector | Centroid-based partitioning | Vector search |
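To illustrate why the Hash strategy distributes evenly and reshards cheaply, here is a self-contained consistent hash ring in plain Rust (independent of the HeliosDB API). Each node owns many virtual points on a ring; a key routes to the first point clockwise from its hash, so adding a node only moves the keys between it and its neighbors:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// Minimal consistent hash ring: hash point → node name.
struct Ring {
    points: BTreeMap<u64, String>,
}

impl Ring {
    fn new(nodes: &[&str], vnodes: u32) -> Self {
        let mut points = BTreeMap::new();
        for node in nodes {
            // Many virtual points per node smooth out the key distribution.
            for i in 0..vnodes {
                points.insert(hash_of(&format!("{node}#{i}")), node.to_string());
            }
        }
        Ring { points }
    }

    // Route a key to the first virtual point clockwise from its hash.
    fn route(&self, key: &str) -> &str {
        let h = hash_of(&key);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap past the top of the ring
            .map(|(_, node)| node.as_str())
            .unwrap() // ring is non-empty by construction
    }
}

fn main() {
    let ring = Ring::new(&["shard1", "shard2", "shard3"], 100);
    println!("tenant_42 → {}", ring.route("tenant_42"));
}
```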
Configuration
```rust
use heliosdb_nano::replication::{HashRing, ShardRouter};

let ring = HashRing::new()
    .add_node("shard1.example.com:5432", 100) // weight: 100
    .add_node("shard2.example.com:5432", 100)
    .add_node("shard3.example.com:5432", 100)
    .build();

let router = ShardRouter::new(ring)
    .shard_key("tenant_id") // Shard by tenant
    .build();
```

Vector Partitioning
Special support for vector workloads:
```rust
use heliosdb_nano::replication::{VectorPartitioner, CentroidManager};

let partitioner = VectorPartitioner::new()
    .dimensions(768)
    .num_centroids(16) // 16 partitions based on vector similarity
    .build();

// Vectors routed to shard containing nearest centroid
```
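A standalone sketch (not the HeliosDB API) of the routing rule that comment describes: pick the partition whose centroid minimizes squared Euclidean distance to the incoming vector:

```rust
// Squared Euclidean distance; the square root is monotone, so it can be skipped.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

// Index of the partition whose centroid is nearest to the incoming vector.
fn nearest_centroid(vector: &[f32], centroids: &[Vec<f32>]) -> usize {
    centroids
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            dist2(a, vector)
                .partial_cmp(&dist2(b, vector))
                .expect("distances are finite")
        })
        .map(|(i, _)| i)
        .expect("at least one centroid")
}

fn main() {
    let centroids = vec![vec![0.0, 0.0], vec![1.0, 1.0]];
    assert_eq!(nearest_centroid(&[0.9, 0.8], &centroids), 1);
}
```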
Resharding
Online resharding without downtime:
```rust
use heliosdb_nano::replication::ReshardManager;

let reshard = ReshardManager::new(ring)
    .target_shards(6) // Scale from 3 to 6 shards
    .parallel_streams(4)
    .build();

reshard.execute().await?; // Non-blocking migration
```

Logical Replication
For selective table replication:
```rust
use heliosdb_nano::replication::{
    LogicalReplicationPipeline,
    TableFilter,
    ColumnMapping,
};

let pipeline = LogicalReplicationPipeline::new()
    .source("source.example.com:5432")
    .destination("dest.example.com:5432")
    .table_filter(TableFilter::include(&["users", "orders"]))
    .column_mapping(ColumnMapping::new()
        .rename("old_name", "new_name")
        .exclude("sensitive_column"))
    .build();

pipeline.start().await?;
```

CLI Options
Start HeliosDB Nano in HA mode:
```bash
# Primary mode
heliosdb-nano server --ha-mode primary --ha-bind 0.0.0.0:5433

# Standby mode
heliosdb-nano server --ha-mode standby --ha-primary primary.example.com:5433

# Multi-primary mode
heliosdb-nano server --ha-mode multi-primary \
  --ha-region us-east \
  --ha-peers eu-west.example.com:5433
```

Docker Support
Docker Compose for HA cluster:
```yaml
version: '3.8'
services:
  primary:
    image: heliosdb/heliosdb-nano:latest
    command: server --ha-mode primary
    ports:
      - "5432:5432"
      - "5433:5433"
    environment:
      - HA_SYNC_MODE=synchronous

  standby:
    image: heliosdb/heliosdb-nano:latest
    command: server --ha-mode standby --ha-primary primary:5433
    depends_on:
      - primary
```

Transparent Write Routing (TWR)
HeliosDB Nano implements Transparent Write Routing (TWR): applications can connect to any node, primary or standby, and writes are automatically routed to the primary while reads execute locally.
How It Works
```
Application → Standby → (DML/DDL forwarded) → Primary
                 ↓
       (SELECT executed locally)
```

Behavior by Sync Mode
| Sync Mode | DQL (SELECT) | DML (INSERT/UPDATE/DELETE) |
|---|---|---|
| sync | Execute locally on standby | Forward to primary, return result |
| semi-sync | Execute locally on standby | Forward to primary, return result |
| async | Execute locally on standby | Reject (traditional read-only) |
Operations Subject to Routing
When connected to a standby in sync/semi-sync mode:
| Operation | Behavior |
|---|---|
| SELECT | Execute locally (DQL) |
| INSERT | Forward to primary (DML) |
| UPDATE | Forward to primary (DML) |
| DELETE | Forward to primary (DML) |
| CREATE | Forward to primary (DDL) |
| DROP | Forward to primary (DDL) |
| ALTER | Forward to primary (DDL) |
| TRUNCATE | Forward to primary (DDL) |
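A simplified illustration of the rule the tables above describe (not the actual implementation; the real server classifies statements in its SQL parser, not by keyword sniffing):

```rust
#[derive(Debug, PartialEq)]
enum Route {
    Local,            // run on the connected standby
    ForwardToPrimary, // ship to the primary, return its result
    Reject,           // async standbys stay strictly read-only
}

fn route_on_standby(sql: &str, sync_mode: &str) -> Route {
    let keyword = sql
        .trim_start()
        .split_whitespace()
        .next()
        .unwrap_or("")
        .to_uppercase();
    match keyword.as_str() {
        // DQL always executes locally.
        "SELECT" => Route::Local,
        // DML and DDL are forwarded only when replication is sync/semi-sync.
        "INSERT" | "UPDATE" | "DELETE" | "CREATE" | "DROP" | "ALTER" | "TRUNCATE" => {
            match sync_mode {
                "sync" | "semi-sync" => Route::ForwardToPrimary,
                _ => Route::Reject,
            }
        }
        _ => Route::Local,
    }
}

fn main() {
    assert_eq!(route_on_standby("INSERT INTO users VALUES (1)", "sync"),
               Route::ForwardToPrimary);
    assert_eq!(route_on_standby("SELECT * FROM users", "async"), Route::Local);
}
```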
Example: Transparent Routing
```sql
-- Connect to STANDBY and execute INSERT (forwarded to primary)
INSERT INTO users VALUES (3, 'Charlie');
-- Result: INSERT 0 1 (success - executed on primary)

-- SELECT always executes locally on the connected standby
SELECT * FROM users;
```

Benefits
- Load Distribution: Applications can connect to any node; reads distributed, writes auto-routed
- Simplified Application Logic: No need for separate read/write connection strings
- High Availability: Applications keep working even when connected to a standby
- Transparent Failover: Combined with connection pooling, provides seamless failover
Monitoring
HA System Views
HeliosDB Nano provides SQL system views for monitoring HA configuration and replication metrics.
pg_replication_status
View node configuration and role:
```sql
SELECT * FROM pg_replication_status;
```

| Column | Description |
|---|---|
| node_id | Unique identifier for this node |
| role | primary, standby, observer, or standalone |
| sync_mode | async, semi-sync, or sync |
| listen_address | Host and port |
| replication_port | WAL streaming port |
| current_lsn | Current log sequence number |
| is_read_only | true/false |
| standby_count | Number of connected standbys (primary only) |
| uptime_seconds | Time since node started |
pg_replication_standbys (Primary Only)
View connected standbys:
```sql
SELECT * FROM pg_replication_standbys;
```

| Column | Description |
|---|---|
| node_id | Standby’s unique identifier |
| address | Standby’s connection address |
| sync_mode | Replication mode for this standby |
| state | connecting, streaming, catching_up, synced, disconnected |
| current_lsn | Standby’s current LSN position |
| flush_lsn | Flushed LSN |
| apply_lsn | Applied LSN |
| lag_bytes | Replication lag in bytes |
| lag_ms | Replication lag in milliseconds |
| connected_at | Connection timestamp |
| last_heartbeat | Last heartbeat received |
pg_replication_primary (Standby Only)
View primary connection status:
```sql
SELECT * FROM pg_replication_primary;
```

| Column | Description |
|---|---|
| node_id | Primary’s unique identifier |
| address | Primary’s address |
| state | disconnected, connecting, connected, streaming, error |
| primary_lsn | Primary’s current LSN |
| local_lsn | Local LSN position |
| lag_bytes | Replication lag in bytes |
| lag_ms | Replication lag in milliseconds |
| fencing_token | Split-brain protection token |
| connected_at | Connection timestamp |
| last_heartbeat | Last heartbeat received |
pg_replication_metrics
View performance metrics:
```sql
SELECT * FROM pg_replication_metrics;
```

| Column | Description |
|---|---|
| wal_writes | Total WAL write operations |
| wal_bytes_written | Total WAL bytes written |
| records_replicated | Records sent to standbys |
| bytes_replicated | Bytes sent to standbys |
| heartbeats_sent | Health-check heartbeats sent |
| heartbeats_received | Health-check heartbeats received |
| reconnect_count | Number of reconnections |
| last_wal_write | Timestamp of last WAL write |
| last_replication | Timestamp of last replication |
Monitoring Examples
```sql
-- Check if standbys are in sync
SELECT node_id,
       CASE
         WHEN lag_ms < 1000  THEN 'IN_SYNC'
         WHEN lag_ms < 60000 THEN 'CATCHING_UP'
         ELSE 'LAGGING'
       END AS status,
       lag_ms
FROM pg_replication_standbys;

-- View all nodes in cluster
SELECT node_id, role, current_lsn
FROM pg_replication_status;
```
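The same views can feed an external alerting job. A hedged sketch using the community postgres crate (the connection string, monitor user, and i64 type of lag_ms are assumptions for illustration):

```rust
use postgres::{Client, NoTls};

// Flag standbys whose replication lag exceeds a threshold, in milliseconds.
fn check_lag(threshold_ms: i64) -> Result<(), postgres::Error> {
    let mut client = Client::connect(
        "host=primary.example.com port=5432 user=monitor",
        NoTls,
    )?;
    for row in client.query(
        "SELECT node_id, lag_ms FROM pg_replication_standbys",
        &[],
    )? {
        let node: String = row.get("node_id");
        let lag: i64 = row.get("lag_ms");
        if lag > threshold_ms {
            eprintln!("ALERT: standby {node} lagging by {lag} ms");
        }
    }
    Ok(())
}
```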
Best Practices
- Network: Use a dedicated replication network
- Monitoring: Alert on replication lag > threshold
- Testing: Regularly test failover procedures
- Backups: Continue point-in-time backups even with HA
- Quorum: Use odd number of nodes for consensus
HeliosProxy — Wire-Level Routing & Failover
For PostgreSQL-wire connection routing, read/write splitting, and transparent failover between nodes, deploy the standalone HeliosProxy binary in front of your cluster:
- Repo: github.com/dimensigon/heliosdb-proxy
- Topology: sits between clients and the Nano fleet; speaks the PostgreSQL wire protocol to clients, routes writes to the current primary and reads to standbys.
- Failover: detects primary loss, promotes a standby, retargets active sessions.
- Compatible with: every Tier (Tier 1 standby promotion, Tier 2 region pinning, Tier 3 shard fan-out).
Inside the database, Transparent Write Routing (TWR) covers the same need at the protocol layer when you’re connecting directly without a proxy. Use TWR for simple deployments, HeliosProxy for production fleets.
See Also
- Configuration
- Deployment Modes
- Multi-Tenancy
- BACKUP_RESTORE_TUTORIAL — keep PITR running alongside HA
- FIPS_COMPLIANCE_TUTORIAL — HA in regulated deployments