High Availability (HA)
UVP
Most embedded databases force a choice: stay simple but single-node, or pay the operational cost of a full distributed cluster. HeliosDB Nano ships three composable HA tiers as Cargo features, all in the same binary. Tier 1 (warm standby) is on by default: async WAL replication with automatic failover, no extra flag needed. Tier 2 adds branch-based active-active with vector-clock conflict resolution. Tier 3 adds consistent-hash sharding. Combine them or stay on Tier 1; for connection routing and failover at the wire layer, point your clients at the standalone HeliosProxy binary. Pay only for the HA you use.
Tier Matrix
| Tier | Cargo feature | Default? | Architecture | Conflict resolution | Best for |
|---|---|---|---|---|---|
| Tier 1 | ha-tier1 | yes | Warm standby (async/sync WAL) | N/A (single primary) | DR, read replicas, blue-green |
| Tier 2 | ha-tier2 | no | Branch-based active-active | Vector clock + field-level merge | Multi-region writes, edge sync |
| Tier 3 | ha-tier3 | no | Consistent hash ring sharding | Per-shard | Horizontal scaling beyond one host |
Optional add-ons
| Feature | Adds | Implies |
|---|---|---|
| ha-dedup | Content-addressed deduplication across nodes | — |
| ha-ab-testing | Branch-based experiment routing | — |
| ha-branch-replication | Selective branch sync to remote servers | ha-tier2 |
| ha-full | Bundle of all of the above | ha-tier1+2+3 + add-ons |
Build recipes
```bash
# Default — Tier 1 only (warm standby + DR)
cargo build --release

# Add Tier 2 (multi-primary)
cargo build --release --features ha-tier2

# Full HA stack
cargo build --release --features ha-full
```

```toml
[dependencies]
heliosdb-nano = { version = "3.19", features = ["ha-tier2"] }
```

Tier 1: Warm Standby
Active-passive replication with automatic failover.
Architecture
```
┌─────────────┐    WAL Stream    ┌─────────────┐
│   Primary   │ ───────────────→ │   Standby   │
│  (Active)   │                  │  (Passive)  │
└─────────────┘                  └─────────────┘
       ↓                                ↓
  Read/Write                        Read-Only
```

Components
| Component | Description |
|---|---|
| WalReplicator | Streams WAL from primary |
| WalApplicator | Applies WAL on standby |
| FailoverWatcher | Monitors primary health |
| LsnManager | Tracks replication position |
| SplitBrainProtector | Prevents dual-primary scenarios |
Configuration
```rust
use heliosdb_nano::replication::{ReplicationConfig, SyncMode};

let config = ReplicationConfig::builder()
    .primary_endpoint("primary.example.com:5432")
    .sync_mode(SyncMode::Synchronous) // or Asynchronous
    .build();
```

Sync Modes
| Mode | Description | Durability | Latency |
|---|---|---|---|
| Synchronous | Wait for standby ACK | Strong | Higher |
| Asynchronous | Fire-and-forget | Eventual | Lower |
| Quorum | Wait for N/2+1 ACKs | Configurable | Medium |
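Quorum mode is listed above but not shown in the configuration example. A minimal sketch, assuming SyncMode exposes a Quorum variant matching the table (the variant name is an assumption, not confirmed API):

```rust
use heliosdb_nano::replication::{ReplicationConfig, SyncMode};

// Assumption: SyncMode::Quorum exists as the sync-mode table suggests.
// With 5 standbys, a commit waits for 5/2 + 1 = 3 ACKs before returning.
let config = ReplicationConfig::builder()
    .primary_endpoint("primary.example.com:5432")
    .sync_mode(SyncMode::Quorum)
    .build();
```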
Failover
```rust
use heliosdb_nano::replication::FailoverWatcher;

let watcher = FailoverWatcher::new(config);

watcher.on_failover(|event| {
    println!("Failover triggered: {:?}", event);
    // Promote standby to primary
});
```

Split-Brain Protection
```rust
use heliosdb_nano::replication::{SplitBrainProtector, ObserverConfig};

let protector = SplitBrainProtector::new(ObserverConfig {
    observers: vec!["observer1.example.com", "observer2.example.com"],
    quorum_size: 2,
});

protector.start();
```

Tier 2: Multi-Primary
Active-active replication with conflict resolution.
Architecture
```
┌─────────────┐   Branch Sync   ┌─────────────┐
│  Region A   │ ←─────────────→ │  Region B   │
│  (Primary)  │                 │  (Primary)  │
└─────────────┘                 └─────────────┘
   ↓      ↓                        ↓      ↓
Writes  Reads                   Writes  Reads
```

Components
| Component | Description |
|---|---|
| MultiPrimarySyncManager | Coordinates multi-region sync |
| ConflictMergeEngine | Resolves write conflicts |
| RegionCoordinator | Manages region topology |
Conflict Resolution Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Last-Write-Wins | Timestamp-based; newest write overwrites | Simple cases where silent overwrites are acceptable |
| Branch-Wins | Prefer local changes | Low-latency local writes |
| Merge | Combine changes | Collaborative editing |
| Custom | User-defined logic | Complex business rules |
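To make the Merge strategy concrete, here is a standalone sketch (plain Rust, not the HeliosDB API) of a field-level merge: each column keeps the version with the newer logical timestamp, so two regions editing different columns of the same row both survive the sync:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct FieldVersion {
    value: String,
    timestamp: u64, // logical timestamp of the write that set this field
}

// Merge two conflicting row versions column by column.
fn merge_rows(
    local: &HashMap<String, FieldVersion>,
    remote: &HashMap<String, FieldVersion>,
) -> HashMap<String, FieldVersion> {
    let mut merged = local.clone();
    for (field, theirs) in remote {
        merged
            .entry(field.clone())
            // Field exists on both sides: keep the newer write.
            .and_modify(|ours| {
                if theirs.timestamp > ours.timestamp {
                    *ours = theirs.clone();
                }
            })
            // Field only set remotely: take it as-is.
            .or_insert_with(|| theirs.clone());
    }
    merged
}
```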
Configuration
```rust
use heliosdb_nano::replication::{
    MultiPrimarySyncManager,
    ConflictResolution,
};

let sync = MultiPrimarySyncManager::new()
    .add_region("us-east", "us-east.example.com:5432")
    .add_region("eu-west", "eu-west.example.com:5432")
    .conflict_resolution(ConflictResolution::LastWriteWins)
    .build();
```

Branch-Based Replication
Multi-primary uses HeliosDB Nano’s branching for conflict-free merges:
```sql
-- Each region maintains its own branch
-- Sync merges branches across regions

-- Region A writes
INSERT INTO orders (id, total) VALUES (1, 100);

-- Region B writes (concurrent)
INSERT INTO orders (id, total) VALUES (2, 200);

-- After sync: both rows present in all regions
```
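The tier matrix above names vector clocks as Tier 2's conflict detector. A standalone sketch (not the HeliosDB API) of the comparison involved: a write is only a conflict when neither version's clock dominates the other:

```rust
use std::collections::HashMap;

type VectorClock = HashMap<String, u64>; // region id → events seen from it

#[derive(Debug, PartialEq)]
enum CausalOrder {
    Before,     // a happened before b
    After,      // b happened before a
    Equal,
    Concurrent, // neither saw the other: run conflict resolution
}

fn compare(a: &VectorClock, b: &VectorClock) -> CausalOrder {
    let (mut a_ahead, mut b_ahead) = (false, false);
    for region in a.keys().chain(b.keys()) {
        let ca = *a.get(region).unwrap_or(&0);
        let cb = *b.get(region).unwrap_or(&0);
        if ca > cb { a_ahead = true; }
        if cb > ca { b_ahead = true; }
    }
    match (a_ahead, b_ahead) {
        (false, false) => CausalOrder::Equal,
        (true, false) => CausalOrder::After,
        (false, true) => CausalOrder::Before,
        (true, true) => CausalOrder::Concurrent,
    }
}
```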
Tier 3: Sharding
Horizontal scaling with consistent hashing.
Architecture
```
               ┌─────────────┐
               │   Router    │
               └──────┬──────┘
      ┌───────────────┼───────────────┐
      ↓               ↓               ↓
┌──────────┐    ┌──────────┐    ┌──────────┐
│  Shard 1 │    │  Shard 2 │    │  Shard 3 │
│  (0-33%) │    │ (34-66%) │    │ (67-100%)│
└──────────┘    └──────────┘    └──────────┘
```

Components
| Component | Description |
|---|---|
| HashRing | Consistent hashing for key distribution |
| ShardRouter | Routes queries to correct shard |
| ReshardManager | Online resharding with minimal downtime |
| VectorPartitioner | Special partitioning for vector data |
Sharding Strategies
| Strategy | Description | Best For |
|---|---|---|
| Hash | Consistent hash of shard key | Even distribution |
| Range | Key ranges per shard | Time-series data |
| Geographic | Location-based routing | Multi-region |
| Vector | Centroid-based partitioning | Vector search |
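To illustrate why the Hash strategy distributes evenly and reshards cheaply, here is a self-contained consistent hash ring in plain Rust (independent of the HeliosDB API). Each node owns many virtual points on a ring; a key routes to the first point clockwise from its hash, so adding a node only moves the keys between it and its neighbors:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// Minimal consistent hash ring: hash point → node name.
struct Ring {
    points: BTreeMap<u64, String>,
}

impl Ring {
    fn new(nodes: &[&str], vnodes: u32) -> Self {
        let mut points = BTreeMap::new();
        for node in nodes {
            // Many virtual points per node smooth out the key distribution.
            for i in 0..vnodes {
                points.insert(hash_of(&format!("{node}#{i}")), node.to_string());
            }
        }
        Ring { points }
    }

    // Route a key to the first virtual point clockwise from its hash.
    fn route(&self, key: &str) -> &str {
        let h = hash_of(&key);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap past the top of the ring
            .map(|(_, node)| node.as_str())
            .unwrap() // ring is non-empty by construction
    }
}

fn main() {
    let ring = Ring::new(&["shard1", "shard2", "shard3"], 100);
    println!("tenant_42 → {}", ring.route("tenant_42"));
}
```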
Configuration
```rust
use heliosdb_nano::replication::{HashRing, ShardRouter};

let ring = HashRing::new()
    .add_node("shard1.example.com:5432", 100) // weight: 100
    .add_node("shard2.example.com:5432", 100)
    .add_node("shard3.example.com:5432", 100)
    .build();

let router = ShardRouter::new(ring)
    .shard_key("tenant_id") // Shard by tenant
    .build();
```

Vector Partitioning
Special support for vector workloads:
```rust
use heliosdb_nano::replication::{VectorPartitioner, CentroidManager};

let partitioner = VectorPartitioner::new()
    .dimensions(768)
    .num_centroids(16) // 16 partitions based on vector similarity
    .build();

// Vectors routed to shard containing nearest centroid
```
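A standalone sketch (not the HeliosDB API) of the routing rule that comment describes: pick the partition whose centroid minimizes squared Euclidean distance to the incoming vector:

```rust
// Squared Euclidean distance; the square root is monotone, so it can be skipped.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

// Index of the partition whose centroid is nearest to the incoming vector.
fn nearest_centroid(vector: &[f32], centroids: &[Vec<f32>]) -> usize {
    centroids
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            dist2(a, vector)
                .partial_cmp(&dist2(b, vector))
                .expect("distances are finite")
        })
        .map(|(i, _)| i)
        .expect("at least one centroid")
}

fn main() {
    let centroids = vec![vec![0.0, 0.0], vec![1.0, 1.0]];
    assert_eq!(nearest_centroid(&[0.9, 0.8], &centroids), 1);
}
```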
Resharding
Online resharding without downtime:
```rust
use heliosdb_nano::replication::ReshardManager;

let reshard = ReshardManager::new(ring)
    .target_shards(6) // Scale from 3 to 6 shards
    .parallel_streams(4)
    .build();

reshard.execute().await?; // Non-blocking migration
```

Logical Replication
For selective table replication:
```rust
use heliosdb_nano::replication::{
    LogicalReplicationPipeline,
    TableFilter,
    ColumnMapping,
};

let pipeline = LogicalReplicationPipeline::new()
    .source("source.example.com:5432")
    .destination("dest.example.com:5432")
    .table_filter(TableFilter::include(&["users", "orders"]))
    .column_mapping(ColumnMapping::new()
        .rename("old_name", "new_name")
        .exclude("sensitive_column"))
    .build();

pipeline.start().await?;
```

CLI Options
Start HeliosDB Nano in HA mode:
```bash
# Primary mode
heliosdb-nano server --ha-mode primary --ha-bind 0.0.0.0:5433

# Standby mode
heliosdb-nano server --ha-mode standby --ha-primary primary.example.com:5433

# Multi-primary mode
heliosdb-nano server --ha-mode multi-primary \
  --ha-region us-east \
  --ha-peers eu-west.example.com:5433
```

Docker Support
Docker Compose for HA cluster:
```yaml
version: '3.8'
services:
  primary:
    image: heliosdb/heliosdb-nano:latest
    command: server --ha-mode primary
    ports:
      - "5432:5432"
      - "5433:5433"
    environment:
      - HA_SYNC_MODE=synchronous

  standby:
    image: heliosdb/heliosdb-nano:latest
    command: server --ha-mode standby --ha-primary primary:5433
    depends_on:
      - primary
```

Transparent Write Routing (TWR)
HeliosDB Nano implements Transparent Write Routing (TWR): applications can connect to any node, primary or standby, and writes are automatically routed to the primary while reads execute locally.
How It Works
```
Application → Standby → (DML/DDL forwarded) → Primary
                 ↓
       (SELECT executed locally)
```

Behavior by Sync Mode
| Sync Mode | DQL (SELECT) | DML (INSERT/UPDATE/DELETE) |
|---|---|---|
| sync | Execute locally on standby | Forward to primary, return result |
| semi-sync | Execute locally on standby | Forward to primary, return result |
| async | Execute locally on standby | Reject (traditional read-only) |
Operations Subject to Routing
When connected to a standby in sync/semi-sync mode:
| Operation | Behavior |
|---|---|
| SELECT | Execute locally (DQL) |
| INSERT | Forward to primary (DML) |
| UPDATE | Forward to primary (DML) |
| DELETE | Forward to primary (DML) |
| CREATE | Forward to primary (DDL) |
| DROP | Forward to primary (DDL) |
| ALTER | Forward to primary (DDL) |
| TRUNCATE | Forward to primary (DDL) |
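A simplified illustration of the rule the tables above describe (not the actual implementation; the real server classifies statements in its SQL parser, not by keyword sniffing):

```rust
#[derive(Debug, PartialEq)]
enum Route {
    Local,            // run on the connected standby
    ForwardToPrimary, // ship to the primary, return its result
    Reject,           // async standbys stay strictly read-only
}

fn route_on_standby(sql: &str, sync_mode: &str) -> Route {
    let keyword = sql
        .trim_start()
        .split_whitespace()
        .next()
        .unwrap_or("")
        .to_uppercase();
    match keyword.as_str() {
        // DQL always executes locally.
        "SELECT" => Route::Local,
        // DML and DDL are forwarded only when replication is sync/semi-sync.
        "INSERT" | "UPDATE" | "DELETE" | "CREATE" | "DROP" | "ALTER" | "TRUNCATE" => {
            match sync_mode {
                "sync" | "semi-sync" => Route::ForwardToPrimary,
                _ => Route::Reject,
            }
        }
        _ => Route::Local,
    }
}

fn main() {
    assert_eq!(route_on_standby("INSERT INTO users VALUES (1)", "sync"),
               Route::ForwardToPrimary);
    assert_eq!(route_on_standby("SELECT * FROM users", "async"), Route::Local);
}
```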
Example: Transparent Routing
```sql
-- Connect to STANDBY and execute INSERT (forwarded to primary)
INSERT INTO users VALUES (3, 'Charlie');
-- Result: INSERT 0 1 (success - executed on primary)

-- SELECT always executes locally on the connected standby
SELECT * FROM users;
```

Benefits
- Load Distribution: Applications can connect to any node; reads distributed, writes auto-routed
- Simplified Application Logic: No need for separate read/write connection strings
- High Availability: Applications keep working even when connected to a standby
- Transparent Failover: Combined with connection pooling, provides seamless failover
Monitoring
HA System Views
HeliosDB Nano provides SQL system views for monitoring HA configuration and replication metrics.
pg_replication_status
View node configuration and role:
```sql
SELECT * FROM pg_replication_status;
```

| Column | Description |
|---|---|
| node_id | Unique identifier for this node |
| role | primary, standby, observer, or standalone |
| sync_mode | async, semi-sync, or sync |
| listen_address | Host and port |
| replication_port | WAL streaming port |
| current_lsn | Current log sequence number |
| is_read_only | true/false |
| standby_count | Number of connected standbys (primary only) |
| uptime_seconds | Time since node started |
pg_replication_standbys (Primary Only)
View connected standbys:
```sql
SELECT * FROM pg_replication_standbys;
```

| Column | Description |
|---|---|
| node_id | Standby’s unique identifier |
| address | Standby’s connection address |
| sync_mode | Replication mode for this standby |
| state | connecting, streaming, catching_up, synced, disconnected |
| current_lsn | Standby’s current LSN position |
| flush_lsn | Flushed LSN |
| apply_lsn | Applied LSN |
| lag_bytes | Replication lag in bytes |
| lag_ms | Replication lag in milliseconds |
| connected_at | Connection timestamp |
| last_heartbeat | Last heartbeat received |
pg_replication_primary (Standby Only)
View primary connection status:
```sql
SELECT * FROM pg_replication_primary;
```

| Column | Description |
|---|---|
| node_id | Primary’s unique identifier |
| address | Primary’s address |
| state | disconnected, connecting, connected, streaming, error |
| primary_lsn | Primary’s current LSN |
| local_lsn | Local LSN position |
| lag_bytes | Replication lag in bytes |
| lag_ms | Replication lag in milliseconds |
| fencing_token | Split-brain protection token |
| connected_at | Connection timestamp |
| last_heartbeat | Last heartbeat received |
pg_replication_metrics
View performance metrics:
```sql
SELECT * FROM pg_replication_metrics;
```

| Column | Description |
|---|---|
| wal_writes | Total WAL write operations |
| wal_bytes_written | Total WAL bytes written |
| records_replicated | Records sent to standbys |
| bytes_replicated | Bytes sent to standbys |
| heartbeats_sent | Health-check heartbeats sent |
| heartbeats_received | Health-check heartbeats received |
| reconnect_count | Number of reconnections |
| last_wal_write | Timestamp of last WAL write |
| last_replication | Timestamp of last replication |
Monitoring Examples
```sql
-- Check if standbys are in sync
SELECT node_id,
       CASE
         WHEN lag_ms < 1000  THEN 'IN_SYNC'
         WHEN lag_ms < 60000 THEN 'CATCHING_UP'
         ELSE 'LAGGING'
       END AS status,
       lag_ms
FROM pg_replication_standbys;

-- View all nodes in cluster
SELECT node_id, role, current_lsn
FROM pg_replication_status;
```
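The same views can feed an external alerting job. A hedged sketch using the community postgres crate (the connection string, monitor user, and i64 type of lag_ms are assumptions for illustration):

```rust
use postgres::{Client, NoTls};

// Flag standbys whose replication lag exceeds a threshold, in milliseconds.
fn check_lag(threshold_ms: i64) -> Result<(), postgres::Error> {
    let mut client = Client::connect(
        "host=primary.example.com port=5432 user=monitor",
        NoTls,
    )?;
    for row in client.query(
        "SELECT node_id, lag_ms FROM pg_replication_standbys",
        &[],
    )? {
        let node: String = row.get("node_id");
        let lag: i64 = row.get("lag_ms");
        if lag > threshold_ms {
            eprintln!("ALERT: standby {node} lagging by {lag} ms");
        }
    }
    Ok(())
}
```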
Best Practices
- Network: Use a dedicated replication network
- Monitoring: Alert on replication lag > threshold
- Testing: Regularly test failover procedures
- Backups: Continue point-in-time backups even with HA
- Quorum: Use odd number of nodes for consensus
HeliosProxy — Wire-Level Routing & Failover
For PostgreSQL-wire connection routing, read/write splitting, and transparent failover between nodes, deploy the standalone HeliosProxy binary in front of your cluster:
- Repo: github.com/dimensigon/heliosdb-proxy
- Topology: sits between clients and the Nano fleet; speaks the PostgreSQL wire protocol to clients, routes writes to the current primary and reads to standbys.
- Failover: detects primary loss, promotes a standby, retargets active sessions.
- Compatible with: every Tier (Tier 1 standby promotion, Tier 2 region pinning, Tier 3 shard fan-out).
Inside the database, Transparent Write Routing (TWR) covers the same need at the protocol layer when you’re connecting directly without a proxy. Use TWR for simple deployments, HeliosProxy for production fleets.
See Also
- Configuration
- Deployment Modes
- Multi-Tenancy
- BACKUP_RESTORE_TUTORIAL — keep PITR running alongside HA
- FIPS_COMPLIANCE_TUTORIAL — HA in regulated deployments