
Handling Large JSON Files: Streams vs Buffers

Learn how to process massive JSON files without crashing Node.js. Compare streaming parsers like stream-json and ndjson for production use.

#performance #node.js #streaming #memory

TL;DR

  • Problem: JSON.parse() loads entire file into memory — crashes on large files
  • Solution: Use streaming parsers like stream-json or bfj
  • Best for logs: NDJSON (Newline Delimited JSON) — one object per line
  • Rule of thumb: If file > 100MB, always stream
  • Memory savings: From 2GB+ to ~50MB for a 500MB file

The "Heap Out of Memory" Problem

You've probably seen this error at 3 AM when your production server decides to give up:

terminal
bash
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

<--- Last few GCs --->
[12345:0x5555555] 12000 ms: Mark-sweep 1398.2 (1425.6) -> 1398.0 (1425.6) MB, 
1520.0 / 0.0 ms (average mu = 0.089, current mu = 0.002)

This happens because JSON.parse() is synchronous and greedy: it needs the entire file in memory as a single string, parses it all at once, and only then hands you the result. The raw string alone occupies hundreds of megabytes, and the parsed object graph (millions of small objects, keys, and numbers) typically costs several times the file size on top of that. For a 500MB JSON file, you need at least 1-2GB of RAM just for parsing.

The Math: A 500MB JSON file can easily require 2GB+ of heap memory. Node.js has historically defaulted to a heap limit of roughly 1.5GB on 64-bit systems (newer versions raise it, but the ceiling is still finite). Do the math — it crashes.
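
You can check the limit your own process is actually running with via Node's built-in v8 module. A minimal sketch (illustrative filename):

heap-limit.js
javascript
const v8 = require('v8');

// heap_size_limit is reported in bytes
const limitMB = Math.round(v8.getHeapStatistics().heap_size_limit / 1024 / 1024);
console.log(`Current heap limit: ~${limitMB} MB`);

// You can raise it, e.g.: node --max-old-space-size=4096 heap-limit.js
// (streaming, covered below, is usually the better fix)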

The Naive Approach (Don't Do This)

Here's what most tutorials show you — and what will eventually break in production:

naive-approach.js
javascript
const fs = require('fs');

// ❌ This loads the ENTIRE file into memory
const data = fs.readFileSync('massive-file.json', 'utf8');
const parsed = JSON.parse(data);

// By the time you get here, you've already used 2GB of RAM
parsed.forEach(item => processItem(item));

This works fine for files under 50MB. Beyond that, you're playing Russian roulette with your server's memory.

The Streaming Solution

Streaming parsers read the file in chunks, parse incrementally, and emit objects one at a time. Your memory usage stays constant regardless of file size.

Option 1: stream-json (Most Popular)

stream-json is the gold standard for streaming JSON in Node.js. It handles nested structures, arrays, and complex objects.

terminal
bash
npm install stream-json stream-chain
stream-json-example.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

// ✅ Process a massive array of objects with constant memory
const pipeline = chain([
  fs.createReadStream('massive-file.json'),
  parser(),
  streamArray(),
]);

let count = 0;

pipeline.on('data', ({ key, value }) => {
  // 'value' is a single parsed object from the array
  processItem(value);
  count++;
  
  if (count % 10000 === 0) {
    console.log(`Processed ${count} items...`);
  }
});

pipeline.on('end', () => {
  console.log(`Done! Processed ${count} items total.`);
});

pipeline.on('error', (err) => {
  console.error('Parsing error:', err);
});
Memory comparison:
- JSON.parse() on 500MB file: ~2GB RAM
- stream-json on 500MB file: ~50MB RAM (constant)
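
These figures depend on your data shape and Node version; you can verify them for your own workload by sampling process.memoryUsage() while the stream runs. A minimal sketch (reusing the same pipeline setup; the 250ms sampling interval is arbitrary):

memory-check.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

const pipeline = chain([
  fs.createReadStream('massive-file.json'),
  parser(),
  streamArray(),
]);

// Track peak heap usage while the stream is flowing
const heapMB = () => Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
let peak = 0;
const sampler = setInterval(() => { peak = Math.max(peak, heapMB()); }, 250);

pipeline.on('data', () => { /* process each item as in the example above */ });

pipeline.on('end', () => {
  clearInterval(sampler);
  console.log(`Peak heap used: ~${peak} MB`);
});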

Option 2: bfj (Big Friendly JSON)

bfj offers a promise-based API (bfj.read / bfj.write) that is simpler when all you need is to read or write a large JSON file asynchronously, plus a lower-level bfj.walk() API that emits parse events as it streams.

bfj-example.js
javascript
const bfj = require('bfj');
const fs = require('fs');

// Walk a large JSON file as a stream of parse events.
// Note: bfj.walk() emits low-level events (object/array boundaries,
// property names, primitive values), not fully assembled objects.
// (bfj also has a match() API that streams out matched values whole.)
const emitter = bfj.walk(fs.createReadStream('massive-file.json'));

emitter.on(bfj.events.property, (name) => {
  // Fired for each property name encountered
});

emitter.on(bfj.events.string, (value) => {
  // Fired for each string value
});

emitter.on(bfj.events.number, (value) => {
  // Fired for each numeric value
});

emitter.on(bfj.events.end, () => {
  console.log('Done parsing!');
});

emitter.on(bfj.events.error, (err) => {
  console.error('Parsing error:', err);
});

// Or use the promise-based API for simpler cases
async function readLargeFile() {
  const data = await bfj.read('massive-file.json');
  // Note: this still materialises the whole result in memory,
  // it just parses asynchronously without blocking the event loop.
  // Use bfj.walk() for true streaming.
  return data;
}
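
bfj also covers the write direction: bfj.write() serializes a value to disk incrementally instead of building one giant JSON string first. A minimal sketch (output filename is illustrative):

bfj-write-example.js
javascript
const bfj = require('bfj');

// Serialize a large in-memory value to disk without
// materialising the full JSON string in memory
async function writeLargeFile(data) {
  try {
    await bfj.write('output.json', data);
    console.log('Write complete');
  } catch (err) {
    console.error('Write failed:', err);
  }
}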

NDJSON: The Better Format for Large Data

If you control the data format, NDJSON (Newline Delimited JSON) is the way to go. Instead of one giant array, you have one JSON object per line:

data.ndjson
json
{"id": 1, "name": "Alice", "score": 95}
{"id": 2, "name": "Bob", "score": 87}
{"id": 3, "name": "Charlie", "score": 92}
{"id": 4, "name": "Diana", "score": 88}

Why is this better? Because you can process it line by line: read a line, JSON.parse() that one small string, handle it, and move on. No streaming parser library required:

ndjson-processing.js
javascript
const fs = require('fs');
const readline = require('readline');

async function processNDJSON(filename) {
  const fileStream = fs.createReadStream(filename);
  
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  let count = 0;
  
  for await (const line of rl) {
    if (line.trim()) {
      const obj = JSON.parse(line);
      processItem(obj);
      count++;
    }
  }
  
  console.log(`Processed ${count} records`);
}

processNDJSON('data.ndjson');
NDJSON advantages:
  • Each line is independent — can process in parallel
  • Easy to append new records (just add a line)
  • Simple error recovery — skip bad lines, continue processing
  • Used by: Elasticsearch, BigQuery, many logging systems

Converting JSON Array to NDJSON

convert-to-ndjson.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

// Convert large JSON array to NDJSON
const input = chain([
  fs.createReadStream('input.json'),
  parser(),
  streamArray(),
]);

const output = fs.createWriteStream('output.ndjson');

input.on('data', ({ value }) => {
  output.write(JSON.stringify(value) + '\n');
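  // Note: output.write() returns false when its internal buffer is full;
  // for huge conversions, pause the input and resume on 'drain'
  // (see "Forgetting Backpressure" below).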
});

input.on('end', () => {
  output.end();
  console.log('Conversion complete!');
});

Parallel Processing with Worker Threads

For CPU-intensive processing, combine streaming with Worker Threads:

parallel-processing.js
javascript
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const fs = require('fs');
const readline = require('readline');

if (isMainThread) {
  // Main thread: distribute work to workers
  const NUM_WORKERS = 4;
  const workers = [];
  let lineCount = 0;
  
  for (let i = 0; i < NUM_WORKERS; i++) {
    workers.push(new Worker(__filename));
  }
  
  const rl = readline.createInterface({
    input: fs.createReadStream('huge-data.ndjson'),
    crlfDelay: Infinity
  });
  
  rl.on('line', (line) => {
    // Round-robin distribution to workers
    const workerIndex = lineCount % NUM_WORKERS;
    workers[workerIndex].postMessage(line);
    lineCount++;
  });
  
  rl.on('close', () => {
    workers.forEach(w => w.postMessage('DONE'));
  });
  
} else {
  // Worker thread: process items
  parentPort.on('message', (line) => {
    if (line === 'DONE') {
      process.exit(0);
    }
    
    const obj = JSON.parse(line);
    // Heavy processing here...
    const result = expensiveOperation(obj);
    parentPort.postMessage(result);
  });
}

Performance Benchmarks

Real-world benchmarks on a 500MB JSON file (1 million records):

Method               Time    Peak Memory   Notes
JSON.parse()         8.2s    2.1 GB        Crashes on default heap
stream-json          12.5s   52 MB         Constant memory
bfj.walk()           14.1s   48 MB         Simpler API
NDJSON + readline    6.8s    35 MB         Fastest, if you control format
NDJSON + 4 Workers   2.1s    180 MB        Best for CPU-heavy work

When to Use What

Decision flowchart:
  • File < 50MB? → Just use JSON.parse(), you're fine
  • File 50-500MB? → Use stream-json for safety
  • File > 500MB? → NDJSON if possible, otherwise stream-json
  • Real-time logs? → Always NDJSON
  • Need parallel processing? → NDJSON + Worker Threads

Common Pitfalls

1. Forgetting Backpressure

If you're writing to a database or file while streaming, you need to handle backpressure:

backpressure.js
javascript
const pipeline = chain([
  fs.createReadStream('data.json'),
  parser(),
  streamArray(),
]);

pipeline.on('data', async ({ value }) => {
  // ❌ BAD: This doesn't wait, can overwhelm the database
  db.insert(value);
});

// ✅ GOOD: Use a transform stream with proper async handling
const { Transform } = require('stream');

const dbWriter = new Transform({
  objectMode: true,
  async transform(chunk, encoding, callback) {
    try {
      await db.insert(chunk.value);
      callback();
    } catch (err) {
      callback(err);
    }
  }
});

pipeline.pipe(dbWriter);
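
Another option that gets you backpressure for free: Node readable streams (including a stream-chain pipeline) are async-iterable, so you can consume them with for await...of and simply await each insert. A minimal sketch, assuming the same hypothetical db.insert() as above:

async-iteration.js
javascript
const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

async function loadIntoDb(filename) {
  const source = chain([
    fs.createReadStream(filename),
    parser(),
    streamArray(),
  ]);

  // The loop pauses the stream between iterations, so pending
  // inserts can never pile up faster than the database accepts them.
  for await (const { value } of source) {
    await db.insert(value); // hypothetical DB client, same as above
  }
}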

2. Not Handling Partial Parses

With NDJSON, a line might be incomplete if the file is being written to:

safe-ndjson.js
javascript
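// Runs inside the for await...of loop of processNDJSON() shown earlier: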
for await (const line of rl) {
  if (!line.trim()) continue;
  
  try {
    const obj = JSON.parse(line);
    processItem(obj);
  } catch (err) {
    // Log the error but continue processing
    console.error('Skipping malformed line:', line.substring(0, 100));
  }
}

Production Tips

  • Monitor memory: Use process.memoryUsage() to track heap usage
  • Set heap limits explicitly: node --max-old-space-size=4096 script.js
  • Use compression: GZIP your JSON files — streaming works with compressed files too (see the sketch below)
  • Consider alternatives: For truly massive datasets, look at Parquet, Avro, or databases
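
For the compression tip, a gunzip step drops straight into the chain, so a gzipped file streams exactly like a plain one. A minimal sketch, assuming a gzipped JSON array in a hypothetical data.json.gz:

gzip-streaming.js
javascript
const fs = require('fs');
const zlib = require('zlib');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { chain } = require('stream-chain');

// Chunks are decompressed, parsed, and discarded as they flow through,
// so memory stays flat even though the file is compressed on disk.
const pipeline = chain([
  fs.createReadStream('data.json.gz'),
  zlib.createGunzip(),
  parser(),
  streamArray(),
]);

let count = 0;
pipeline.on('data', ({ value }) => {
  // process each record here
  count++;
});

pipeline.on('end', () => console.log(`Parsed ${count} records from the gzipped file`));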

What's Next?

Now you can handle JSON files of any size without breaking a sweat.

Go stream some data. Your server's RAM will thank you.

About the Author


Adam Tse

Founder & Lead Developer · 10+ years experience

Full-stack engineer with 10+ years of experience building developer tools and APIs. Previously worked on data infrastructure at scale, processing billions of JSON documents daily. Passionate about creating privacy-first tools that don't compromise on functionality.

JavaScript/TypeScript · Web Performance · Developer Tools · Data Processing