Multi-User Transactions Architecture
Multi-User Transactions Architecture
Version: 3.4.0 Status: Production-Ready Last Updated: 2026-01-04
Table of Contents
- System Architecture Overview
- Session Lifecycle
- Transaction Flow with Locking
- Lock Manager Deadlock Detection
- Isolation Level Implementation
- Dump/Restore Data Flow
- Memory Layout and Data Structures
- Performance Optimization Strategies
System Architecture Overview
HeliosDB Nano v3.1.0 implements a layered architecture for multi-user ACID transactions:
┌────────────────────────────────────────────────────────────────┐│ Client Applications ││ (psql, JDBC, Python drivers, HTTP API, CLI tools) │└─────────────────────────────┬──────────────────────────────────┘ │ │ PostgreSQL Wire Protocol / HTTP │┌─────────────────────────────▼──────────────────────────────────┐│ Protocol Layer (PgServer) ││ ┌───────────────┐ ┌───────────────┐ ┌──────────────────┐ ││ │ Connection │ │ Session │ │ Authentication │ ││ │ Handler │ │ Manager │ │ (Trust/Pass/JWT)│ ││ └───────────────┘ └───────────────┘ └──────────────────┘ │└─────────────────────────────┬──────────────────────────────────┘ │ │ Session API │┌─────────────────────────────▼──────────────────────────────────┐│ Session Management Layer ││ ┌────────────────────┐ ┌──────────────┐ ┌────────────────┐ ││ │ SessionManager │ │ UserManager │ │ ResourceQuota │ ││ │ - create() │ │ - auth() │ │ Manager │ ││ │ - begin_txn() │ │ - authorize()│ │ - check() │ ││ │ - commit() │ │ - audit() │ │ - enforce() │ ││ └────────────────────┘ └──────────────┘ └────────────────┘ │└─────────────────────────────┬──────────────────────────────────┘ │ │ Transaction API │┌─────────────────────────────▼──────────────────────────────────┐│ Transaction Management Layer ││ ┌────────────────────┐ ┌────────────────┐ ┌──────────────┐ ││ │ Transaction │ │ LockManager │ │MVCC Snapshot │ ││ │ - get() │ │ - acquire() │ │ Manager │ ││ │ - put() │ │ - release() │ │ - create() │ ││ │ - commit() │ │ - detect_dl() │ │ - read_at() │ ││ └────────────────────┘ └────────────────┘ └──────────────┘ │└─────────────────────────────┬──────────────────────────────────┘ │ │ Storage API │┌─────────────────────────────▼──────────────────────────────────┐│ Storage Engine Layer ││ ┌────────────────────┐ ┌────────────────┐ ┌──────────────┐ ││ │ StorageEngine │ │ DumpManager │ │ DirtyTracker │ ││ │ - insert() │ │ - dump() │ │ - is_dirty() │ ││ │ - scan() │ │ - restore() │ │ - mark() │ ││ │ - update() │ │ - history() │ │ - clear() │ ││ └────────────────────┘ └────────────────┘ └──────────────┘ ││ │ ││ ▼ ││ ┌──────────────────────────────────────────────────────────┐ ││ │ RocksDB (In-Memory Backend) │ ││ │ - MemTable (hash-based, lock-free) │ ││ │ - No SST files (pure RAM) │ ││ │ - Optional persistence to dump files │ ││ └──────────────────────────────────────────────────────────┘ │└─────────────────────────────────────────────────────────────────┘Component Responsibilities
| Component | Responsibility | Thread-Safety |
|---|---|---|
| PgServer | Handle PostgreSQL wire protocol, parse queries | Per-connection thread |
| SessionManager | Create/destroy sessions, transaction lifecycle | Arc<DashMap> |
| LockManager | Acquire/release locks, deadlock detection | Arc<DashMap> |
| Transaction | Execute operations, maintain MVCC snapshot | Isolated per session |
| StorageEngine | Read/write data, manage persistence | Thread-safe (RocksDB) |
| DumpManager | Export/import database state | Read-write locks |
Session Lifecycle
Session State Machine
┌─────────────┐│ Disconnected│└──────┬──────┘ │ connect() ▼┌─────────────┐ authenticate() ┌──────────────┐│ Connecting │────────────────────────>│ Authenticated│└─────────────┘ └──────┬───────┘ │ │ │ auth_failed() │ create_session() ▼ ▼┌─────────────┐ ┌──────────────┐│ Closed │<────────────────────────│ Active │└─────────────┘ timeout/close └──────┬───────┘ │ │ begin_transaction() ▼ ┌──────────────┐ │ In Transaction│ └──────┬───────┘ │ │ commit/rollback ▼ ┌──────────────┐ │ Active │ └──────────────┘Session Creation Flow
Client SessionManager UserManager ResourceQuota │ │ │ │ │──create_session()────────>│ │ │ │ │ │ │ │ │──authenticate()────────>│ │ │ │ │ │ │ │<──auth_result──────────│ │ │ │ │ │ │ │──check_quota()────────────────────────────>│ │ │ │ │ │ │<──quota_ok()───────────────────────────────│ │ │ │ │ │ │──allocate_session()────│ │ │ │ │ │ │<──SessionId──────────────│ │ │ │ │ │ │Session Data Structure
pub struct Session { /// Unique session identifier pub id: SessionId,
/// User who owns this session pub user_id: UserId,
/// Transaction isolation level pub isolation_level: IsolationLevel,
/// Currently active transaction (if any) pub active_txn: Option<TransactionId>,
/// Session creation timestamp pub created_at: u64,
/// Last activity timestamp (for timeout detection) pub last_activity: u64,
/// Session statistics pub stats: SessionStats,}
pub struct SessionStats { /// Number of transactions committed pub transactions_committed: u64,
/// Number of transactions rolled back pub transactions_rolled_back: u64,
/// Total queries executed pub queries_executed: u64,
/// Total rows read pub rows_read: u64,
/// Total rows written pub rows_written: u64,
/// Memory allocated by this session (bytes) pub memory_bytes: u64,}Transaction Flow with Locking
Transaction Lifecycle
┌─────────────────────────────────────────────────────────────────┐│ 1. BEGIN TRANSACTION │├─────────────────────────────────────────────────────────────────┤│ ││ SessionManager::begin_transaction(session_id) ││ │ ││ ├─> Validate session exists and is authenticated ││ ├─> Check no active transaction ││ ├─> Create MVCC snapshot based on isolation level: ││ │ - READ COMMITTED: latest committed snapshot ││ │ - REPEATABLE READ: consistent point-in-time snapshot ││ │ - SERIALIZABLE: serializable snapshot with tracking ││ ├─> Allocate Transaction context ││ └─> Return TransactionId ││ │└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐│ 2. EXECUTE QUERIES (SELECT, INSERT, UPDATE, DELETE) │├─────────────────────────────────────────────────────────────────┤│ ││ For SELECT (read operation): ││ │ ││ ├─> LockManager::acquire_lock(txn_id, key, SHARED) ││ │ │ ││ │ ├─> Check lock compatibility with existing locks ││ │ ├─> If compatible: grant immediately ││ │ ├─> If not compatible: wait or timeout ││ │ └─> Detect deadlock (DFS on wait-for graph) ││ │ ││ ├─> Transaction::get(key) ││ │ │ ││ │ ├─> Read from MVCC snapshot ││ │ ├─> Apply visibility rules based on isolation level ││ │ └─> Return value (if visible) ││ │ ││ └─> Return result set to client ││ ││ For INSERT/UPDATE/DELETE (write operation): ││ │ ││ ├─> LockManager::acquire_lock(txn_id, key, EXCLUSIVE) ││ │ │ ││ │ ├─> Check lock compatibility (X locks are exclusive) ││ │ ├─> Wait for conflicting locks to release ││ │ └─> Grant lock or timeout/deadlock ││ │ ││ ├─> Transaction::put(key, value) / delete(key) ││ │ │ ││ │ ├─> Buffer write in transaction-local write set ││ │ ├─> Do NOT write to storage yet (2PC) ││ │ └─> Mark key as modified ││ │ ││ └─> Return affected rows count ││ │└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐│ 3. COMMIT TRANSACTION │├─────────────────────────────────────────────────────────────────┤│ ││ SessionManager::commit_transaction(session_id) ││ │ ││ ├─> Get active transaction ││ ├─> If SERIALIZABLE: validate_serializability() ││ │ │ ││ │ ├─> Check for read-write conflicts ││ │ ├─> Check for write-write conflicts ││ │ └─> Abort if conflict detected ││ │ ││ ├─> Acquire commit locks (brief exclusive lock) ││ ├─> Apply buffered writes to storage atomically ││ ├─> Update MVCC version numbers ││ ├─> Release all locks held by transaction ││ ├─> Mark transaction as committed ││ ├─> Update dirty state tracker ││ └─> Free transaction resources ││ │└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐│ 4. ROLLBACK TRANSACTION (on error or explicit ROLLBACK) │├─────────────────────────────────────────────────────────────────┤│ ││ SessionManager::rollback_transaction(session_id) ││ │ ││ ├─> Get active transaction ││ ├─> Discard buffered writes (no storage changes) ││ ├─> Release all locks held by transaction ││ ├─> Mark transaction as rolled back ││ └─> Free transaction resources ││ │└─────────────────────────────────────────────────────────────────┘Lock Acquisition Algorithm
fn acquire_lock( txn_id: TransactionId, key: Key, mode: LockMode) -> Result<(), Error> { let start_time = now();
loop { // 1. Try to acquire lock match try_acquire_lock(txn_id, key.clone(), mode) { Ok(true) => return Ok(()), // Lock acquired Ok(false) => { // Lock not available, check for issues } Err(e) => return Err(e), }
// 2. Detect deadlock using wait-for graph if detect_deadlock(txn_id)? { return Err(Error::Deadlock( format!("Deadlock detected for txn {}", txn_id) )); }
// 3. Check timeout if now() - start_time > LOCK_TIMEOUT_MS { return Err(Error::LockTimeout); }
// 4. Wait briefly and retry sleep_ms(10); }}
fn try_acquire_lock( txn_id: TransactionId, key: Key, mode: LockMode) -> Result<bool, Error> { let mut lock_table = LOCK_TABLE.entry(key).or_default();
// Check compatibility with all existing locks for existing_lock in lock_table.iter() { if existing_lock.txn_id == txn_id { // Same transaction already holds lock if existing_lock.mode.is_compatible(mode) { return Ok(true); // Already have compatible lock } else { // Need to upgrade lock continue; } }
if !mode.is_compatible(existing_lock.mode) { // Conflict with other transaction WAIT_FOR_GRAPH .entry(txn_id) .or_default() .insert(existing_lock.txn_id); return Ok(false); // Cannot acquire yet } }
// No conflicts - acquire lock lock_table.push(LockEntry { txn_id, mode, acquired_at: now(), });
Ok(true)}Lock Manager Deadlock Detection
Wait-For Graph
The lock manager maintains a directed graph where:
- Nodes: Active transactions
- Edges: Transaction A → Transaction B means A is waiting for B to release a lock
Deadlock: Cycle in the wait-for graph
Example: Three transactions waiting on each other
Txn 101 holds lock on Row A, waits for Row B (held by Txn 102)Txn 102 holds lock on Row B, waits for Row C (held by Txn 103)Txn 103 holds lock on Row C, waits for Row A (held by Txn 101)
Wait-For Graph: 101 ──> 102 ▲ │ │ ▼ 103 <──
Cycle detected: 101 -> 102 -> 103 -> 101 (DEADLOCK!)Deadlock Detection Algorithm (DFS)
fn detect_deadlock(txn_id: TransactionId) -> Result<bool, Error> { let mut visited = HashSet::new(); let mut stack = vec![txn_id]; let mut rec_stack = HashSet::new(); // Recursion stack
while let Some(current) = stack.pop() { if rec_stack.contains(¤t) { // Cycle detected - current transaction in recursion stack return Ok(true); }
if visited.contains(¤t) { continue; }
visited.insert(current); rec_stack.insert(current);
// Explore all transactions that current is waiting for if let Some(waiting_on) = WAIT_FOR_GRAPH.get(¤t) { for &next_txn in waiting_on.iter() { stack.push(next_txn); } } }
Ok(false) // No cycle}Deadlock Resolution
When a deadlock is detected:
- Choose Victim: Select transaction to abort (youngest transaction)
- Abort Transaction: Release all locks, rollback changes
- Notify Client: Return error
Error::Deadlock - Client Retries: Application retries transaction
Txn 101 ──X──> Txn 102 ▲ │ │ ▼Txn 103 <──────
Deadlock detected!Abort Txn 101 (youngest): - Release lock on Row A - Rollback changes - Return Error::Deadlock to client
Now Txn 103 can acquire lock on Row A and proceed.Isolation Level Implementation
READ COMMITTED
Guarantees:
- No dirty reads (uncommitted data invisible)
- Fresh snapshot per SQL statement
- May see non-repeatable reads and phantom rows
Implementation:
impl Transaction { pub fn execute_statement(&mut self, stmt: &Statement) -> Result<ResultSet, Error> { // Create fresh snapshot for each statement self.snapshot = self.snapshot_manager.latest_snapshot();
// Execute statement using fresh snapshot self.execute_with_snapshot(stmt, &self.snapshot) }}Visibility Rules:
fn is_visible(tuple_version: &TupleVersion, snapshot: &Snapshot) -> bool { // Tuple created by committed transaction before snapshot? if tuple_version.created_by_txn < snapshot.xmin { // Check if not deleted, or deleted after snapshot return tuple_version.deleted_by_txn == 0 || tuple_version.deleted_by_txn > snapshot.xmax; }
// Tuple created by this transaction? if tuple_version.created_by_txn == snapshot.txn_id { return tuple_version.deleted_by_txn == 0 || tuple_version.deleted_by_txn != snapshot.txn_id; }
false // Not visible}REPEATABLE READ
Guarantees:
- Consistent snapshot throughout transaction
- No dirty reads, no non-repeatable reads
- May still see phantom rows
Implementation:
impl Transaction { pub fn begin(&mut self, isolation: IsolationLevel) -> Result<(), Error> { if isolation == IsolationLevel::RepeatableRead { // Create snapshot once at transaction start self.snapshot = self.snapshot_manager.create_snapshot(); self.snapshot.freeze(); // Immutable for entire transaction } Ok(()) }}SERIALIZABLE
Guarantees:
- Full serializability
- Transactions appear to execute sequentially
- No anomalies (dirty read, non-repeatable read, phantom, serialization)
Implementation:
struct SerializableContext { /// Read set: keys read during transaction read_set: HashSet<Key>,
/// Write set: keys written during transaction write_set: HashSet<Key>,
/// Snapshot at transaction start snapshot: Snapshot,}
impl Transaction { pub fn commit_serializable(&self) -> Result<(), Error> { // Check for read-write conflicts (other transactions wrote to our read set) for key in &self.read_set { if let Some(latest_version) = self.storage.get_latest_version(key) { if latest_version.created_by_txn > self.snapshot.xmax { // Someone wrote to a key we read - conflict! return Err(Error::SerializationFailure( format!("Read-write conflict on key {:?}", key) )); } } }
// Check for write-write conflicts (handled by exclusive locks) // Locks ensure only one transaction can write to a key at a time
// Commit atomically self.apply_writes()?; Ok(()) }}Dump/Restore Data Flow
Dump Operation Flow
┌──────────────────────────────────────────────────────────────┐│ Dump Process (Non-Blocking) │├──────────────────────────────────────────────────────────────┤│ ││ DumpManager::dump(options) ││ │ ││ ├─> Create dump metadata ││ │ - dump_id (UUID) ││ │ - version (3.1.0) ││ │ - mode (Full/Incremental) ││ │ - current LSN ││ │ ││ ├─> Open dump file writer ││ │ - Write magic header "HELIODMP" ││ │ - Write metadata ││ │ ││ ├─> For each table: ││ │ │ ││ │ ├─> Acquire read-only snapshot (no locks) ││ │ ├─> Write table schema ││ │ │ - Table name ││ │ │ - Column definitions ││ │ │ - Indexes ││ │ │ - Constraints ││ │ │ ││ │ ├─> Scan table rows in batches (10,000 rows) ││ │ │ │ ││ │ │ ├─> Read batch from storage ││ │ │ ├─> Serialize to binary format ││ │ │ ├─> Compress if enabled (zstd/lz4) ││ │ │ ├─> Write compressed batch to file ││ │ │ └─> Update statistics (rows, bytes) ││ │ │ ││ │ └─> Release snapshot ││ │ ││ ├─> Write footer ││ │ - CRC32 checksum ││ │ - Statistics (tables, rows, bytes) ││ │ ││ ├─> Flush and close file ││ ├─> Record dump in history ││ ├─> Clear dirty state (mark LSN as persisted) ││ └─> Return DumpReport ││ │└──────────────────────────────────────────────────────────────┘Restore Operation Flow
┌──────────────────────────────────────────────────────────────┐│ Restore Process (Transactional) │├──────────────────────────────────────────────────────────────┤│ ││ DumpManager::restore(options) ││ │ ││ ├─> Open dump file reader ││ ├─> Read and validate metadata ││ │ - Check version compatibility ││ │ - Verify checksum ││ │ ││ ├─> Begin global transaction ││ │ ││ ├─> For each table in dump: ││ │ │ ││ │ ├─> Read table schema ││ │ ├─> If mode == Clean: DROP TABLE IF EXISTS ││ │ ├─> CREATE TABLE with schema ││ │ │ ││ │ ├─> Read row batches: ││ │ │ │ ││ │ │ ├─> Read compressed batch ││ │ │ ├─> Decompress if needed ││ │ │ ├─> Deserialize rows ││ │ │ ├─> INSERT rows into table ││ │ │ │ - Handle conflicts per options: ││ │ │ │ * Error: abort on duplicate ││ │ │ │ * Skip: ignore duplicates ││ │ │ │ * Update: upsert (ON CONFLICT UPDATE) ││ │ │ │ ││ │ │ └─> Update statistics ││ │ │ ││ │ ├─> Recreate indexes ││ │ └─> Recreate constraints ││ │ ││ ├─> COMMIT transaction (all-or-nothing) ││ ├─> Update current LSN to dump LSN ││ └─> Return RestoreReport ││ │└──────────────────────────────────────────────────────────────┘Memory Layout and Data Structures
SessionManager Data Structure
pub struct SessionManager { /// Map of session_id -> Session /// DashMap for lock-free concurrent access sessions: Arc<DashMap<SessionId, Arc<RwLock<Session>>>>,
/// User manager for authentication user_manager: Arc<UserManager>,
/// Lock manager (shared across all sessions) lock_manager: Arc<LockManager>,
/// Resource quota enforcement resource_quotas: Arc<ResourceQuotaManager>,
/// Session timeout in seconds session_timeout_secs: u64,
/// Next session ID (atomic counter) next_session_id: Arc<AtomicU64>,}
// Estimated memory per session: ~2-4 KB// For 10,000 sessions: ~20-40 MBLockManager Data Structure
pub struct LockManager { /// Lock table: Key -> Vec<LockEntry> /// DashMap for lock-free concurrent access locks: Arc<DashMap<Key, Vec<LockEntry>>>,
/// Wait-for graph: TransactionId -> Set<TransactionId> wait_for: Arc<DashMap<TransactionId, HashSet<TransactionId>>>,
/// Lock timeout configuration (milliseconds) lock_timeout_ms: u64,}
pub struct LockEntry { pub txn_id: TransactionId, pub mode: LockMode, pub acquired_at: u64,}
// Memory per lock entry: ~32 bytes// For 100,000 active locks: ~3.2 MBTransaction Data Structure
pub struct Transaction { /// Transaction ID id: TransactionId,
/// MVCC snapshot (point-in-time view of database) snapshot: Snapshot,
/// Storage engine reference storage: Arc<StorageEngine>,
/// Lock manager reference lock_manager: Arc<LockManager>,
/// Write buffer (changes not yet committed) write_buffer: HashMap<Vec<u8>, Option<Vec<u8>>>, // None = delete
/// Read set (for SERIALIZABLE isolation) read_set: HashSet<Key>,
/// Write set (for SERIALIZABLE isolation) write_set: HashSet<Key>,
/// Isolation level isolation_level: IsolationLevel,
/// Transaction start time start_time: u64,}
// Memory per transaction: ~500 KB - 2 MB (depends on write buffer size)MVCC Snapshot Structure
pub struct Snapshot { /// Snapshot transaction ID pub txn_id: TransactionId,
/// Minimum active transaction (xmin) pub xmin: TransactionId,
/// Maximum transaction + 1 (xmax) pub xmax: TransactionId,
/// Set of active transactions at snapshot creation pub active_txns: HashSet<TransactionId>,
/// Frozen flag (immutable for REPEATABLE READ) pub frozen: bool,}
// Memory per snapshot: ~1-2 KBPerformance Optimization Strategies
1. Lock-Free Data Structures
DashMap instead of Mutex<HashMap>:
// ❌ Slow: Global lock contentionlet sessions = Arc::new(Mutex::new(HashMap::new()));
// ✅ Fast: Lock-free concurrent hash maplet sessions = Arc::new(DashMap::new());Benefits:
- No global lock contention
- Read operations don’t block each other
- Write operations lock only specific shards
- 10-100x faster for highly concurrent workloads
2. Adaptive Lock Acquisition
fn acquire_lock_adaptive( txn_id: TransactionId, key: Key, mode: LockMode) -> Result<(), Error> { let mut backoff = 10; // Start with 10ms
for attempt in 0..MAX_RETRIES { match try_acquire_lock(txn_id, key.clone(), mode) { Ok(true) => return Ok(()), Ok(false) => { // Exponential backoff: 10ms, 20ms, 40ms, 80ms, ... sleep_ms(backoff); backoff = std::cmp::min(backoff * 2, 1000); // Cap at 1 second } Err(e) => return Err(e), } }
Err(Error::LockTimeout)}3. Batch Operations
// ❌ Slow: Individual insertsfor row in rows { db.execute(&format!("INSERT INTO table VALUES ({}, {})", row.id, row.value))?;}
// ✅ Fast: Batch insertdb.execute_batch("INSERT INTO table VALUES ($1, $2)", &rows)?;Benefits:
- Amortize lock acquisition overhead
- Reduce network roundtrips
- Better CPU cache utilization
- 10-50x faster for bulk operations
4. Early Lock Release
impl Transaction { pub fn commit(&mut self) -> Result<(), Error> { // Apply writes self.apply_writes()?;
// Release locks ASAP (before cleanup) self.lock_manager.release_all_locks(self.id)?;
// Do expensive cleanup after releasing locks self.cleanup()?;
Ok(()) }}5. Read-Heavy Optimization
For 80%+ read workloads:
// Use MVCC snapshot reads (no locks needed)let snapshot = snapshot_manager.create_snapshot();let value = storage.get_at_snapshot(&key, &snapshot)?;Benefits:
- Readers never block writers
- Writers never block readers
- No lock contention for read-only queries
- Scales linearly with cores
6. Memory Pool Allocation
pub struct TransactionPool { pool: Vec<Transaction>, allocated: Arc<AtomicUsize>,}
impl TransactionPool { pub fn acquire(&self) -> Transaction { // Reuse pre-allocated transaction self.pool.pop().unwrap_or_else(|| Transaction::new()) }
pub fn release(&self, txn: Transaction) { // Return to pool for reuse if self.pool.len() < POOL_SIZE { self.pool.push(txn); } }}Benefits:
- Reduce allocation overhead
- Better memory locality
- Predictable memory usage
- 2-5x faster transaction creation
Performance Benchmarks
| Operation | Throughput | Latency (p99) | Notes |
|---|---|---|---|
| Read-only query | 50,000 QPS | 0.5ms | No lock contention |
| Single row update | 20,000 TPS | 1.2ms | Exclusive lock |
| Multi-row update | 8,000 TPS | 5.0ms | Multiple locks |
| Serializable TXN | 3,000 TPS | 15ms | Conflict detection |
| Session create | 10,000/sec | 0.1ms | DashMap insert |
| Lock acquire (no conflict) | 100,000/sec | 0.05ms | Lock-free path |
| Deadlock detection | 50,000/sec | 0.2ms | DFS on wait-for graph |
See Also
- In-Memory Mode Guide - User guide
- Session Management API - API documentation
- Configuration Reference - Configuration options
- Upgrade Plan - Implementation plan
Version: 3.4.0 Last Updated: 2026-01-04 Maintained by: HeliosDB Team