Skip to content

Per-Column Storage & Compression

Per-Column Storage & Compression

HeliosDB Nano supports per-column compression configuration, allowing you to optimize storage efficiency for different data types.

Overview

Per-column storage enables:

  • Type-Specific Compression: Apply optimal algorithms per column type
  • Storage Efficiency: Reduce disk usage by 50-90%
  • Query Performance: Compressed data requires less I/O
  • Flexible Configuration: Column-level or table-level settings

Storage Modes

HeliosDB Nano supports four per-column storage modes that can be specified inline at table creation:

ModeBest ForBenefit
DEFAULTGeneral dataRocksDB LZ4 compression (3-7x)
DICTIONARYLow-cardinality strings50-95% space reduction
CONTENT_ADDRESSEDLarge duplicate values~100% deduplication
COLUMNARNumeric sequences20-50% compression + faster aggregates

Inline STORAGE Syntax

Specify storage mode directly in CREATE TABLE:

CREATE TABLE table_name (
column_name TYPE STORAGE MODE,
...
);

Dictionary Encoding Example

Best for columns with limited unique values (status codes, categories, countries):

CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
status TEXT STORAGE DICTIONARY,
shipping_country TEXT STORAGE DICTIONARY,
payment_method TEXT STORAGE DICTIONARY,
order_total FLOAT8,
created_at INT8
);

Content-Addressed Storage Example

Best for large values (>1KB) that repeat across rows (templates, JSON payloads):

CREATE TABLE email_templates (
template_id INT PRIMARY KEY,
template_name TEXT,
html_content TEXT STORAGE CONTENT_ADDRESSED,
category TEXT STORAGE DICTIONARY
);

Columnar Storage Example

Best for numeric columns used in aggregations (timestamps, metrics, measurements):

CREATE TABLE sensor_readings (
reading_id INT PRIMARY KEY,
sensor_id INT,
sensor_type TEXT STORAGE DICTIONARY,
location TEXT STORAGE DICTIONARY,
timestamp INT8 STORAGE COLUMNAR,
temperature FLOAT8 STORAGE COLUMNAR,
humidity FLOAT8 STORAGE COLUMNAR,
pressure FLOAT8 STORAGE COLUMNAR
);

Mixed Modes Example

Combine storage modes for optimal performance:

CREATE TABLE events (
event_id INT PRIMARY KEY,
-- Dictionary for low-cardinality strings
event_type TEXT STORAGE DICTIONARY,
source_app TEXT STORAGE DICTIONARY,
user_country TEXT STORAGE DICTIONARY,
device_type TEXT STORAGE DICTIONARY,
-- Columnar for numeric aggregations
timestamp INT8 STORAGE COLUMNAR,
response_time_ms INT STORAGE COLUMNAR,
payload_size INT STORAGE COLUMNAR,
-- Content-addressed for large duplicated values
error_stack TEXT STORAGE CONTENT_ADDRESSED,
-- Default for unique/mixed data
user_id INT,
session_id TEXT,
request_path TEXT
);

ALTER TABLE Migration

Change storage mode on existing columns:

-- Migrate to dictionary encoding
ALTER TABLE products ALTER COLUMN category SET STORAGE DICTIONARY;
-- Migrate to columnar storage
ALTER TABLE products ALTER COLUMN price SET STORAGE COLUMNAR;
-- Revert to default
ALTER TABLE products ALTER COLUMN category SET STORAGE DEFAULT;

Compression Codecs

FSST (Fast Static Symbol Table)

Optimized for string/text columns:

FeatureDescription
TypeDictionary-based compression
Best ForVARCHAR, TEXT, CHAR columns
Ratio2-10x compression
Speed~1GB/s encode, ~2GB/s decode

ALP (Adaptive Lossless Floating-Point)

Optimized for numeric columns:

FeatureDescription
TypeFloating-point compression
Best ForFLOAT4, FLOAT8, NUMERIC
Ratio2-4x compression
Speed~800MB/s encode, ~1.5GB/s decode

LZ4 (Default)

General-purpose fast compression:

FeatureDescription
TypeBlock compression
Best ForMixed data, general use
Ratio2-3x compression
Speed~400MB/s encode, ~2GB/s decode

ZSTD

High-ratio compression:

FeatureDescription
TypeDictionary compression
Best ForCold data, archives
Ratio3-5x compression
Speed~200MB/s encode, ~500MB/s decode

Configuration

Table-Level Compression

-- Set default compression for new tables
SET default_compression = 'fsst';
CREATE TABLE logs (
id INT,
message TEXT,
level VARCHAR(10)
);

Per-Column Configuration

-- Via table options (proposed syntax)
CREATE TABLE sensors (
id INT,
timestamp TIMESTAMP,
temperature FLOAT8, -- Will use ALP compression
description TEXT -- Will use FSST compression
) WITH (
compression = 'auto' -- Auto-select per column type
);

Session Settings

-- Configure compression globally
SET compression_level = 'high'; -- low, medium, high
SET fsst_enabled = true; -- Enable FSST for strings
SET alp_enabled = true; -- Enable ALP for floats

Codec Selection by Type

Data TypeRecommended CodecCompression Ratio
TEXT, VARCHARFSST3-10x
FLOAT4, FLOAT8ALP2-4x
INT, BIGINTDelta + LZ42-3x
TIMESTAMPDelta + LZ43-5x
BYTEALZ4/ZSTD2-3x
JSONBFSST3-8x
VECTORProduct Quantization4-32x

Auto-Detection

HeliosDB Nano automatically selects compression based on:

  1. Column data type: Maps type to optimal codec
  2. Data patterns: Analyzes sample data
  3. Configuration: Respects user overrides
-- Enable automatic codec selection
SET compression_auto_detect = true;

Compression Statistics

View Compression Ratios

-- Check compression stats via system views
SELECT
table_name,
column_name,
codec,
raw_bytes,
compressed_bytes,
ROUND(raw_bytes::float / compressed_bytes, 2) as ratio
FROM pg_compression_stats();

Storage Analysis

-- Analyze table storage
SELECT * FROM pg_table_stats() WHERE table_name = 'logs';

Performance Tuning

FSST Tuning

-- Symbol table size (affects compression ratio)
SET fsst_symbol_table_size = 256; -- Default: 256
-- Training sample size
SET fsst_training_samples = 1000; -- Default: 1000

ALP Tuning

-- Exception threshold (tradeoff ratio vs speed)
SET alp_exception_threshold = 0.05; -- Default: 5%
-- Adaptive mode
SET alp_adaptive_mode = true; -- Default: true

Block Configuration

-- Compression applies at block level
SET storage_block_size = 16384; -- Default: 16KB
-- Larger blocks = better compression, worse random access
SET storage_block_size = 65536; -- 64KB for analytics

Use Cases

Time-Series Data

CREATE TABLE metrics (
timestamp TIMESTAMP, -- Delta compression
sensor_id INT, -- Dictionary encoding
value FLOAT8, -- ALP compression
tags TEXT -- FSST compression
);
-- Achieves 5-10x overall compression

Log Storage

CREATE TABLE application_logs (
id BIGINT,
timestamp TIMESTAMP,
level VARCHAR(10), -- FSST: high repetition
message TEXT, -- FSST: string patterns
metadata JSONB -- FSST: JSON strings
);
-- Achieves 4-8x overall compression

Vector Embeddings

CREATE TABLE documents (
id INT,
content TEXT, -- FSST compression
embedding VECTOR(768) -- Product Quantization
);
CREATE INDEX ON documents USING hnsw (embedding)
WITH (quantization = 'product');
-- PQ reduces vector storage by 32x

Compression Pipeline

Write Path:
Raw Data → Type Codec (FSST/ALP) → Block Codec (LZ4) → Storage
Read Path:
Storage → Block Decode → Type Decode → Raw Data

Benchmarks

FSST Performance (TEXT columns)

DatasetRaw SizeCompressedRatioEncode Speed
Logs1 GB120 MB8.3x950 MB/s
URLs500 MB80 MB6.3x1.1 GB/s
JSON2 GB300 MB6.7x850 MB/s

ALP Performance (FLOAT columns)

DatasetRaw SizeCompressedRatioEncode Speed
Sensors800 MB200 MB4.0x780 MB/s
Financial1 GB350 MB2.9x820 MB/s
Scientific2 GB550 MB3.6x750 MB/s

Integration

With Branching

Compression is preserved across branches:

CREATE BRANCH dev FROM main AS OF NOW;
-- Dev branch inherits compression settings

With Time-Travel

Historical queries decompress on-the-fly:

SELECT * FROM logs
AS OF TIMESTAMP '2025-01-01 00:00:00';
-- Transparent decompression

With SMFI

Compression works with storage-level filtering:

-- Zone maps and bloom filters work on compressed blocks
SELECT * FROM logs WHERE timestamp > '2025-01-01';

See Also