
Big JSON in 2026: Compression, Data Lakes, and the 'Token Tax' of Large-Scale Storage

Learn the JSON storage hierarchy: NDJSON for hot data, Zstandard compression, Parquet for analytics, and how to avoid the 'Token Tax' in AI-first systems.

#compression #zstd #parquet #data-lake #ndjson #ai-storage

As of 2026, we're producing more JSON than ever. Every AI agent interaction, every microservice trace, every IoT heartbeat is a JSON blob. At terabyte-per-day scale, the "verbosity" of JSON becomes a multi-million dollar storage and compute problem. Here's how to solve it.

TL;DR

  • JSON is the "raw" tier standard: Almost all data starts as JSON, but it shouldn't stay as raw JSON long-term
  • Zstandard (Zstd) is the winner: Best balance of compression ratio and decompression speed for 2026 workloads
  • The "Token Tax": In AI-first systems, JSON's structural overhead wastes storage and LLM context tokens
  • Parquet/Iceberg for Analytics: Convert your "hot" JSON into columnar formats for 10-100x faster queries

The JSON Lifecycle: Ingest → Store → Query

Senior engineers are now focusing on the JSON Lifecycle: how to ingest JSON quickly, store it efficiently, and query it cheaply. The key insight is that different stages of the lifecycle call for different formats.

[Figure: the JSON storage hierarchy. Hot tier: NDJSON/JSONL with Zstd dictionary compression; Warm tier: CBOR/MessagePack for caches and service-to-service storage; Cold tier: Parquet/Iceberg with Snappy or Zstd; Archive: Glacier/cold storage after 30 days. Inset: compression comparison for 100MB of JSON (Gzip ~14MB, Zstd ~10MB, Zstd+Dict ~6MB, Parquet ~4MB) and the "Token Tax" callout for AI storage.]
The JSON storage hierarchy: Choose format and compression based on access patterns and query requirements

The Storage Hierarchy: Hot → Warm → Cold

1. Hot Tier: NDJSON (Newline Delimited JSON)

For real-time streams and logs, NDJSON (or JSONL) remains the champion. It's append-friendly and easy to parse in chunks.

events.ndjson
text
{"event":"click","userId":"u123","timestamp":"2026-01-04T10:00:00Z"}
{"event":"purchase","userId":"u123","amount":99.99,"timestamp":"2026-01-04T10:01:00Z"}
{"event":"click","userId":"u456","timestamp":"2026-01-04T10:02:00Z"}

Optimization: Use Zstd dictionary compression. If your JSON objects share the same keys (which they usually do), a trained Zstd dictionary can cut size by an extra 30-50% over generic compression. Keep the dictionary with the data: the exact same dictionary is required to decompress.

zstd-dictionary.sh
bash
# Train a dictionary on sample data
zstd --train samples/*.json -o events.dict

# Compress with the dictionary
zstd -D events.dict events.ndjson -o events.ndjson.zst

# Result: 30-50% smaller than generic zstd compression

2. Warm Tier: Compact Binary (CBOR / MessagePack)

For internal service-to-service storage or temporary caches, binary formats reduce the parsing overhead.

  • CBOR (RFC 8949): Highly efficient for storage on constrained devices and high-throughput caches
  • MessagePack: Popular in gaming and real-time systems
cbor-encoding.ts
typescript
import { encode, decode } from 'cbor-x';

const jsonData = {
  event: 'purchase',
  userId: 'u123',
  amount: 99.99,
  items: ['item1', 'item2', 'item3'],
};

// JSON: ~85 bytes
const jsonSize = JSON.stringify(jsonData).length;

// CBOR: typically 20-30% smaller for payloads like this, and faster to decode
const cborData = encode(jsonData);
const cborSize = cborData.length;

console.log(`JSON: ${jsonSize} bytes, CBOR: ${cborSize} bytes`);

// Round trip: decode() restores the original structure
console.log('Decoded:', decode(cborData));

3. Cold Tier: Columnar (Parquet / Iceberg)

Raw JSON is terrible for analytical queries like SELECT AVG(amount) FROM events.

  • The Pattern: Keep raw JSON only as a fallback (a "Dead Letter Queue" for records that fail conversion) and ingest the "Happy Path" data into Apache Parquet or Apache Iceberg tables
  • Compression: Use Snappy or Zstd inside your Parquet files
json-to-parquet.py
python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read NDJSON
df = pd.read_json('events.ndjson', lines=True)

# Convert to Parquet with Zstd compression
table = pa.Table.from_pandas(df)
pq.write_table(
    table, 
    'events.parquet',
    compression='zstd',
    compression_level=3
)

# Result: 10-20x smaller than raw JSON
# Query performance: 10-100x faster for analytics
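
Once the data is columnar, a query engine only reads the columns it touches. A quick sketch of the payoff, assuming the duckdb Node.js package (any Parquet-aware engine behaves similarly):

query-parquet.ts
typescript
import duckdb from 'duckdb';

// DuckDB reads the Parquet file directly and scans only the `event`
// and `amount` columns referenced by the query.
const db = new duckdb.Database(':memory:');

db.all(
  "SELECT event, AVG(amount) AS avg_amount FROM 'events.parquet' GROUP BY event",
  (err, rows) => {
    if (err) throw err;
    console.table(rows);
  }
);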

The "Token Tax" and AI Storage

In 2026, AI agents are the primary consumers of our data. JSON's structural overhead ({}, "", :) wastes precious context window tokens.

token-tax-example.json
json
{"name": "Alice", "age": 25, "city": "Tokyo"}
// ↑ roughly 8 tokens spent purely on syntax: braces, quotes, colons, commas

Solutions for RAG Storage

  • Store minified JSON: No whitespace, no pretty-printing
  • Consider YAML-lite for retrieval: Fewer tokens for the same data
  • Use structured summaries: Don't feed raw JSON to LLMs when a summary works
token-efficient-storage.ts
typescript
// For RAG: Store token-efficient representations

// Hypothetical record shape used by the compact encoder below
type User = { name: string; age: number; city: string };

const tokenEfficientFormat = {
  // Instead of verbose JSON keys:
  // {"user_name": "Alice", "user_age": 25, "user_city": "Tokyo"}

  // use a compact positional format:
  // "Alice,25,Tokyo" with the schema stored separately
  compact: (user: User) => `${user.name},${user.age},${user.city}`,

  // Or use a schema-aware minifier (JSON.stringify already emits no whitespace)
  minified: (data: object) => JSON.stringify(data),
};

// Token savings: 30-50% fewer tokens for the same semantic content

Security & Reliability at Scale

1. Schema Evolution in Data Lakes

What happens when you change a JSON key in your 2PB data lake?

⚠️ Recommendation: Always store the JSON Schema ID or Version in the metadata of the storage blob. Never trust that the JSON "stays the same."
versioned-storage.json
json
{
  "_meta": {
    "schemaId": "events.v3",
    "schemaVersion": "3.2.1",
    "producedAt": "2026-01-04T10:00:00Z"
  },
  "event": "purchase",
  "userId": "u123",
  "amount": 99.99
}
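
The same principle applies at the object-store level: attach the schema identity as blob metadata so consumers can pick the right reader without parsing the payload first. A sketch using the AWS SDK v3 (bucket name and key layout are placeholders):

tag-schema-version.ts
typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

// Persist a blob with its schema identity attached as object metadata,
// so readers can resolve the correct schema before parsing a single byte.
async function putWithSchema(key: string, body: Buffer, schemaId: string, schemaVersion: string) {
  await s3.send(
    new PutObjectCommand({
      Bucket: 'my-data-lake-raw', // placeholder bucket
      Key: key,
      Body: body,
      ContentType: 'application/x-ndjson',
      ContentEncoding: 'zstd',
      Metadata: { 'schema-id': schemaId, 'schema-version': schemaVersion },
    })
  );
}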

2. PII Masking at the Source

Don't wait until the data is in the lake to mask PII.

Strategy: Use a "Validation & Redaction" proxy that parses the incoming JSON stream, masks fields based on schema annotations, and then persists it.
pii-masking-proxy.ts
typescript
import crypto from 'node:crypto';
import { z } from 'zod';

// Define schema with PII annotations
const EventSchema = z.object({
  event: z.string(),
  userId: z.string().transform(maskPII), // Mask before storage
  email: z.string().email().transform(hashPII), // Hash PII
  amount: z.number(),
});

function maskPII(value: string): string {
  return value.slice(0, 2) + '***' + value.slice(-2);
}

function hashPII(value: string): string {
  return crypto.createHash('sha256').update(value).digest('hex').slice(0, 16);
}

// `store` below stands in for whatever persistence layer you use
// (queue producer, object-store writer, ...).
declare function store(event: z.infer<typeof EventSchema>): Promise<void>;

// Process incoming JSON
function processAndStore(rawJson: string) {
  const parsed = JSON.parse(rawJson);
  const masked = EventSchema.parse(parsed);
  return store(masked); // PII already masked/hashed at this point
}

Implementation Checklist for Large-Scale JSON

  • Zstd Everywhere: Switch from Gzip to Zstd for all JSON compression
  • Dictionary Training: If your JSON logs are high-volume, train a Zstd dictionary on a sample of your data
  • Partitioning: Partition your JSON blobs by dt (date) and event_type to avoid scanning the whole bucket (see the key-builder sketch after this list)
  • Lifecycle Rules: Automatically move raw JSON to "Archive" (Glacier/Cold storage) after 30 days once it's been converted to Parquet
  • Schema Registry: Use a centralized registry (Confluent, AWS Glue, or a git-based repo) to version your JSON storage formats
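
A partition layout is only a naming convention, but keeping it consistent is what lets query engines prune whole prefixes instead of scanning the bucket. A hypothetical key builder for the partitioning item above (Hive-style dt= / event_type= paths):

partition-key.ts
typescript
// Build Hive-style object keys such as:
//   events/dt=2026-01-04/event_type=purchase/part-0001.ndjson.zst
// Engines like Athena, Spark, or DuckDB can then skip entire prefixes
// when a query filters on dt or event_type.
function partitionKey(event: { event: string; timestamp: string }, part: number): string {
  const dt = event.timestamp.slice(0, 10); // "2026-01-04" from the ISO-8601 timestamp
  const eventType = event.event.toLowerCase().replace(/[^a-z0-9_]/g, '_');
  return `events/dt=${dt}/event_type=${eventType}/part-${String(part).padStart(4, '0')}.ndjson.zst`;
}

// partitionKey({ event: 'purchase', timestamp: '2026-01-04T10:01:00Z' }, 1)
// -> "events/dt=2026-01-04/event_type=purchase/part-0001.ndjson.zst"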

Common Pitfalls

The "JSON-in-JSON" Disaster

Escaping JSON strings inside another JSON object roughly doubles the parsing work (you parse the outer document, then parse the escaped inner string) and bloats storage with escape characters.

json-in-json-bad.json
json
{
  "event": "api_response",
  "payload": "{\"status\":\"ok\",\"data\":{\"id\":123}}"
}
// ❌ Don't do this - escaped JSON is a nightmare
json-in-json-good.json
json
{
  "event": "api_response",
  "payload": {
    "status": "ok",
    "data": { "id": 123 }
  }
}
// ✅ Do this - nested JSON, not escaped strings

Missing Timestamps

Always include a standardized UTC ISO-8601 timestamp at the top level of your storage JSON.
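
In practice this means stamping every record at ingest, before batching or compression. A tiny sketch (the field name timestamp is a convention, not a requirement):

stamp-timestamp.ts
typescript
// Stamp each record with a canonical UTC ISO-8601 ingest timestamp.
// Date.prototype.toISOString() always returns UTC (note the trailing "Z").
function stampRecord<T extends object>(record: T): T & { timestamp: string } {
  return { ...record, timestamp: new Date().toISOString() };
}

// stampRecord({ event: 'click', userId: 'u123' })
// -> { event: 'click', userId: 'u123', timestamp: '2026-01-04T10:00:00.000Z' }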

The "Small File" Problem

Storing millions of 1KB JSON files will kill your storage performance. Batch them into 128MB+ compressed chunks.

batch-json-files.ts
typescript
// Batch small JSON objects into larger files.
// Assumes a zstd binding such as @mongodb-js/zstd, which exposes an async compress().
import { writeFile } from 'node:fs/promises';
import { compress } from '@mongodb-js/zstd';

const BATCH_SIZE = 10_000;
const MAX_FILE_SIZE = 128 * 1024 * 1024; // 128MB (uncompressed)

class JsonBatcher {
  private buffer: string[] = [];
  private bytes = 0;

  async add(obj: object) {
    const line = JSON.stringify(obj);
    this.buffer.push(line);
    this.bytes += Buffer.byteLength(line) + 1; // +1 for the newline

    if (this.buffer.length >= BATCH_SIZE || this.bytes >= MAX_FILE_SIZE) {
      await this.flush();
    }
  }

  async flush() {
    if (this.buffer.length === 0) return;

    const ndjson = this.buffer.join('\n');
    const compressed = await compress(Buffer.from(ndjson));

    // Write as a single file: events-<epoch-ms>.ndjson.zst
    await writeFile(`events-${Date.now()}.ndjson.zst`, compressed);

    this.buffer = [];
    this.bytes = 0;
  }
}

Compression Comparison (100MB JSON)

Method               Compressed Size   Reduction   Decompression Speed
Gzip                 ~14 MB            86%         Medium
Zstd (level 3)       ~10 MB            90%         Fast
Zstd + Dictionary    ~6 MB             94%         Fast
Parquet + Zstd       ~4 MB             96%         Very fast (columnar)


About the Author

Adam Tse

Founder & Lead Developer · 10+ years experience

Full-stack engineer with 10+ years of experience building developer tools and APIs. Previously worked on data infrastructure at scale, processing billions of JSON documents daily. Passionate about creating privacy-first tools that don't compromise on functionality.
