
Handling Large JSON Files: Streams vs Buffers

Learn how to process massive JSON files without crashing Node.js. Compare streaming parsers like stream-json and ndjson for production use.

#performance #node.js #streaming #memory

TL;DR

  • Problem: JSON.parse() loads entire file into memory — crashes on large files
  • Solution: Use streaming parsers like stream-json or bfj
  • Best for logs: NDJSON (Newline Delimited JSON) — one object per line
  • Rule of thumb: If file > 100MB, always stream
  • Memory savings: From 2GB+ to ~50MB for a 500MB file

The "Heap Out of Memory" Problem

You've probably seen this error at 3 AM when your production server decides to give up:

terminal
bash
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

<--- Last few GCs --->
[12345:0x5555555] 12000 ms: Mark-sweep 1398.2 (1425.6) -> 1398.0 (1425.6) MB, 
1520.0 / 0.0 ms (average mu = 0.089, current mu = 0.002)

This happens because JSON.parse() is synchronous and greedy: it needs the entire file in memory as a single string, parses it all at once, and only then hands you the result. The raw string alone occupies hundreds of megabytes, and the parsed object graph (millions of small objects, keys, and numbers) typically costs several times the file size on top of that. For a 500MB JSON file, you need at least 1-2GB of RAM just for parsing.

The Math: A 500MB JSON file can easily require 2GB+ of heap memory. Node.js has historically defaulted to a heap limit of roughly 1.5GB on 64-bit systems (newer versions raise it, but the ceiling is still finite). Do the math — it crashes.
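
You can check the limit your own process is actually running with via Node's built-in v8 module. A minimal sketch (illustrative filename):

heap-limit.js
javascript
const v8 = require('v8');

// heap_size_limit is reported in bytes
const limitMB = Math.round(v8.getHeapStatistics().heap_size_limit / 1024 / 1024);
console.log(`Current heap limit: ~${limitMB} MB`);

// You can raise it, e.g.: node --max-old-space-size=4096 heap-limit.js
// (streaming, covered below, is usually the better fix)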

The Naive Approach (Don't Do This)

Here's what most tutorials show you — and what will eventually break in production:

naive-approach.js
javascript
const fs = require('fs');

// ❌ This loads the ENTIRE file into memory
const data = fs.readFileSync('massive-file.json', 'utf8');
const parsed = JSON.parse(data);

// By the time you get here, you've already used 2GB of RAM
parsed.forEach(item => processItem(item));

This works fine for files under 50MB. Beyond that, you're playing Russian roulette with your server's memory.

The Streaming Solution

Streaming parsers read the file in chunks, parse incrementally, and emit objects one at a time. Your memory usage stays constant regardless of file size.

Option 1: stream-json (Most Popular)

stream-json is the gold standard for streaming JSON in Node.js. It handles nested structures, arrays, and complex objects.

terminal
bash
npm install stream-json stream-chain
stream-json-example.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

// ✅ Process a massive array of objects with constant memory
const pipeline = chain([
  fs.createReadStream('massive-file.json'),
  parser(),
  streamArray(),
]);

let count = 0;

pipeline.on('data', ({ key, value }) => {
  // 'value' is a single parsed object from the array
  processItem(value);
  count++;
  
  if (count % 10000 === 0) {
    console.log(`Processed ${count} items...`);
  }
});

pipeline.on('end', () => {
  console.log(`Done! Processed ${count} items total.`);
});

pipeline.on('error', (err) => {
  console.error('Parsing error:', err);
});
Memory comparison:
- JSON.parse() on 500MB file: ~2GB RAM
- stream-json on 500MB file: ~50MB RAM (constant)
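
These figures depend on your data shape and Node version; you can verify them for your own workload by sampling process.memoryUsage() while the stream runs. A minimal sketch (reusing the same pipeline setup; the 250ms sampling interval is arbitrary):

memory-check.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

const pipeline = chain([
  fs.createReadStream('massive-file.json'),
  parser(),
  streamArray(),
]);

// Track peak heap usage while the stream is flowing
const heapMB = () => Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
let peak = 0;
const sampler = setInterval(() => { peak = Math.max(peak, heapMB()); }, 250);

pipeline.on('data', () => { /* process each item as in the example above */ });

pipeline.on('end', () => {
  clearInterval(sampler);
  console.log(`Peak heap used: ~${peak} MB`);
});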

Option 2: bfj (Big Friendly JSON)

bfj offers a promise-based API (bfj.read / bfj.write) that is simpler when all you need is to read or write a large JSON file asynchronously, plus a lower-level bfj.walk() API that emits parse events as it streams.

bfj-example.js
javascript
const bfj = require('bfj');
const fs = require('fs');

// Walk a large JSON file as a stream of parse events.
// Note: bfj.walk() emits low-level events (object/array boundaries,
// property names, primitive values), not fully assembled objects.
// (bfj also has a match() API that streams out matched values whole.)
const emitter = bfj.walk(fs.createReadStream('massive-file.json'));

emitter.on(bfj.events.property, (name) => {
  // Fired for each property name encountered
});

emitter.on(bfj.events.string, (value) => {
  // Fired for each string value
});

emitter.on(bfj.events.number, (value) => {
  // Fired for each numeric value
});

emitter.on(bfj.events.end, () => {
  console.log('Done parsing!');
});

emitter.on(bfj.events.error, (err) => {
  console.error('Parsing error:', err);
});

// Or use the promise-based API for simpler cases
async function readLargeFile() {
  const data = await bfj.read('massive-file.json');
  // Note: this still materialises the whole result in memory,
  // it just parses asynchronously without blocking the event loop.
  // Use bfj.walk() for true streaming.
  return data;
}
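
bfj also covers the write direction: bfj.write() serializes a value to disk incrementally instead of building one giant JSON string first. A minimal sketch (output filename is illustrative):

bfj-write-example.js
javascript
const bfj = require('bfj');

// Serialize a large in-memory value to disk without
// materialising the full JSON string in memory
async function writeLargeFile(data) {
  try {
    await bfj.write('output.json', data);
    console.log('Write complete');
  } catch (err) {
    console.error('Write failed:', err);
  }
}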

NDJSON: The Better Format for Large Data

If you control the data format, NDJSON (Newline Delimited JSON) is the way to go. Instead of one giant array, you have one JSON object per line:

data.ndjson
json
{"id": 1, "name": "Alice", "score": 95}
{"id": 2, "name": "Bob", "score": 87}
{"id": 3, "name": "Charlie", "score": 92}
{"id": 4, "name": "Diana", "score": 88}

Why is this better? Because you can process it line by line: read a line, JSON.parse() that one small string, handle it, and move on. No streaming parser library required:

ndjson-processing.js
javascript
const fs = require('fs');
const readline = require('readline');

async function processNDJSON(filename) {
  const fileStream = fs.createReadStream(filename);
  
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  let count = 0;
  
  for await (const line of rl) {
    if (line.trim()) {
      const obj = JSON.parse(line);
      processItem(obj);
      count++;
    }
  }
  
  console.log(`Processed ${count} records`);
}

processNDJSON('data.ndjson');
NDJSON advantages:
  • Each line is independent — can process in parallel
  • Easy to append new records (just add a line)
  • Simple error recovery — skip bad lines, continue processing
  • Used by: Elasticsearch, BigQuery, many logging systems

Converting JSON Array to NDJSON

convert-to-ndjson.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

// Convert large JSON array to NDJSON
const input = chain([
  fs.createReadStream('input.json'),
  parser(),
  streamArray(),
]);

const output = fs.createWriteStream('output.ndjson');

input.on('data', ({ value }) => {
  output.write(JSON.stringify(value) + '\n');
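  // Note: output.write() returns false when its internal buffer is full;
  // for huge conversions, pause the input and resume on 'drain'
  // (see "Forgetting Backpressure" below).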
});

input.on('end', () => {
  output.end();
  console.log('Conversion complete!');
});

Parallel Processing with Worker Threads

For CPU-intensive processing, combine streaming with Worker Threads:

parallel-processing.js
javascript
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const fs = require('fs');
const readline = require('readline');

if (isMainThread) {
  // Main thread: distribute work to workers
  const NUM_WORKERS = 4;
  const workers = [];
  let lineCount = 0;
  
  for (let i = 0; i < NUM_WORKERS; i++) {
    workers.push(new Worker(__filename));
  }
  
  const rl = readline.createInterface({
    input: fs.createReadStream('huge-data.ndjson'),
    crlfDelay: Infinity
  });
  
  rl.on('line', (line) => {
    // Round-robin distribution to workers
    const workerIndex = lineCount % NUM_WORKERS;
    workers[workerIndex].postMessage(line);
    lineCount++;
  });
  
  rl.on('close', () => {
    workers.forEach(w => w.postMessage('DONE'));
  });
  
} else {
  // Worker thread: process items
  parentPort.on('message', (line) => {
    if (line === 'DONE') {
      process.exit(0);
    }
    
    const obj = JSON.parse(line);
    // Heavy processing here...
    const result = expensiveOperation(obj);
    parentPort.postMessage(result);
  });
}

Performance Benchmarks

Real-world benchmarks on a 500MB JSON file (1 million records):

Method               Time    Peak Memory   Notes
JSON.parse()         8.2s    2.1 GB        Crashes on default heap
stream-json          12.5s   52 MB         Constant memory
bfj.walk()           14.1s   48 MB         Simpler API
NDJSON + readline    6.8s    35 MB         Fastest, if you control format
NDJSON + 4 Workers   2.1s    180 MB        Best for CPU-heavy work

When to Use What

Decision flowchart:
  • File < 50MB? → Just use JSON.parse(), you're fine
  • File 50-500MB? → Use stream-json for safety
  • File > 500MB? → NDJSON if possible, otherwise stream-json
  • Real-time logs? → Always NDJSON
  • Need parallel processing? → NDJSON + Worker Threads

Common Pitfalls

1. Forgetting Backpressure

If you're writing to a database or file while streaming, you need to handle backpressure:

backpressure.js
javascript
const pipeline = chain([
  fs.createReadStream('data.json'),
  parser(),
  streamArray(),
]);

pipeline.on('data', async ({ value }) => {
  // ❌ BAD: This doesn't wait, can overwhelm the database
  db.insert(value);
});

// ✅ GOOD: Use a transform stream with proper async handling
const { Transform } = require('stream');

const dbWriter = new Transform({
  objectMode: true,
  async transform(chunk, encoding, callback) {
    try {
      await db.insert(chunk.value);
      callback();
    } catch (err) {
      callback(err);
    }
  }
});

pipeline.pipe(dbWriter);
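
Another option that gets you backpressure for free: Node readable streams (including a stream-chain pipeline) are async-iterable, so you can consume them with for await...of and simply await each insert. A minimal sketch, assuming the same hypothetical db.insert() as above:

async-iteration.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

async function loadIntoDb(filename) {
  const source = chain([
    fs.createReadStream(filename),
    parser(),
    streamArray(),
  ]);

  // The loop pauses the stream between iterations, so pending
  // inserts can never pile up faster than the database accepts them.
  for await (const { value } of source) {
    await db.insert(value); // hypothetical DB client, same as above
  }
}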

2. Not Handling Partial Parses

With NDJSON, a line might be incomplete if the file is being written to:

safe-ndjson.js
javascript
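// Runs inside the for await...of loop of processNDJSON() shown earlier: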
for await (const line of rl) {
  if (!line.trim()) continue;
  
  try {
    const obj = JSON.parse(line);
    processItem(obj);
  } catch (err) {
    // Log the error but continue processing
    console.error('Skipping malformed line:', line.substring(0, 100));
  }
}

Production Tips

  • Monitor memory: Use process.memoryUsage() to track heap usage
  • Set heap limits explicitly: node --max-old-space-size=4096 script.js
  • Use compression: GZIP your JSON files — streaming works with compressed files too (see the sketch below)
  • Consider alternatives: For truly massive datasets, look at Parquet, Avro, or databases
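
For the compression tip, a gunzip step drops straight into the chain, so a gzipped file streams exactly like a plain one. A minimal sketch, assuming a gzipped JSON array in a hypothetical data.json.gz:

gzip-streaming.js
javascript
const fs = require('fs');
const zlib = require('zlib');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

// Chunks are decompressed, parsed, and discarded as they flow through,
// so memory stays flat even though the file is compressed on disk.
const pipeline = chain([
  fs.createReadStream('data.json.gz'),
  zlib.createGunzip(),
  parser(),
  streamArray(),
]);

let count = 0;
pipeline.on('data', ({ value }) => {
  // process each record here
  count++;
});

pipeline.on('end', () => console.log(`Parsed ${count} records from the gzipped file`));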

What's Next?

Now you can handle JSON files of any size without breaking a sweat.

Go stream some data. Your server's RAM will thank you.

About the Author


Adam Tse

Founder & Lead Developer · 10+ years experience

Full-stack engineer with 10+ years of experience building developer tools and APIs. Previously worked on data infrastructure at scale, processing billions of JSON documents daily. Passionate about creating privacy-first tools that don't compromise on functionality.

JavaScript/TypeScript · Web Performance · Developer Tools · Data Processing