Dead Letter Notes
Notes on data pipelines, message queues, and the failure modes in between.

The small files problem, and four ways I've dealt with it

If you run a data lake long enough, you meet the small files problem. A streaming job or a per-event writer produces millions of tiny objects — a few KB each — and suddenly a query that should take seconds takes minutes. The data didn't grow; the number of files did.

Why it hurts

Every file carries fixed overhead: a metadata lookup, an object-store request, a task to schedule. On columnar formats like Parquet you also lose the whole point of the format — a 4 KB Parquet file has row groups too small to skip or vectorize efficiently. Ten thousand 4 KB files are dramatically slower to scan than one 40 MB file with the same rows, and your object store bill quietly reflects all those extra requests.
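To put rough numbers on that comparison, here is a toy cost model. It is a sketch, not a benchmark: the per-file overhead and the read throughput are assumed values picked for illustration, and it models a single reader with no parallelism.

```python
# Toy cost model: scan time = per-file fixed overhead + bytes / throughput.
# Both constants are assumptions for illustration, not measurements.
PER_FILE_OVERHEAD_S = 0.02       # assumed: metadata lookup + request + task scheduling
THROUGHPUT_BYTES_PER_S = 200e6   # assumed: sustained read throughput of one reader


def scan_seconds(file_count: int, file_size_bytes: int) -> float:
    """Estimated time for one reader to scan file_count files of equal size."""
    overhead = file_count * PER_FILE_OVERHEAD_S
    transfer = file_count * file_size_bytes / THROUGHPUT_BYTES_PER_S
    return overhead + transfer


small = scan_seconds(10_000, 4_000)    # ten thousand 4 KB files
large = scan_seconds(1, 40_000_000)    # one 40 MB file holding the same bytes
print(f"small files: {small:.1f} s, one large file: {large:.2f} s")
```

Under these assumptions the transfer time is the same 0.2 s in both cases; everything else in the small-file number is per-file overhead. Parallel readers shrink the gap in wall-clock terms, but the total work (and the request count on your bill) stays the same.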

Four things that have worked

1. Compaction jobs. The blunt, reliable fix: a scheduled job that reads a partition's small files and rewrites them into a few large ones. Target 128–512 MB per file. Boring, effective, and easy to reason about.

2. Bigger write batches. Often the small files come from the writer flushing too eagerly. Increasing the batch size or trigger interval on a streaming sink — say from every 30 seconds to every 5 minutes — cuts the file count by an order of magnitude before compaction ever runs. Fix the source before you clean up after it.

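3. Table formats that compact for you. Iceberg, Delta, and Hudi all ship built-in file compaction and metadata management, though how automatic it is varies: Iceberg exposes a rewrite_data_files procedure that you still have to schedule, Delta has an OPTIMIZE command, and Hudi can run compaction and clustering as table services. Letting the table format handle it removes a whole class of bespoke jobs, at the cost of adopting the format.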

4. Sensible partitioning. Over-partitioning is a small files factory. Partitioning by day/hour/minute when you only ever query by day guarantees tiny files. Partition at the granularity you actually filter on, and no finer.
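The arithmetic behind items 1, 2, and 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production compactor: the function names, the 256 MB target, and the 10 GiB/day volume are all hypothetical, and the planner works on a plain list of file sizes rather than on a real object store listing.

```python
import math
from typing import List

TARGET_BYTES = 256 * 1024 * 1024  # assumed target, inside the 128-512 MB range


def files_per_day(trigger_seconds: int, files_per_trigger: int = 1) -> int:
    """Files written per day if every trigger flushes files_per_trigger files."""
    return math.ceil(86_400 / trigger_seconds) * files_per_trigger


def partition_bytes(daily_bytes: int, partitions_per_day: int) -> int:
    """Average bytes per partition; an upper bound on file size after compaction."""
    return daily_bytes // partitions_per_day


def plan_compaction(sizes: List[int], target: int = TARGET_BYTES) -> List[List[int]]:
    """Greedy first-fit-decreasing: group file sizes into bins of at most target bytes.

    A file larger than the target ends up alone in its own bin.
    """
    bins: List[List[int]] = []
    totals: List[int] = []
    for size in sorted(sizes, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
            totals.append(size)
    return bins


# Item 2: a 30 s trigger versus a 5 min trigger.
print(files_per_day(30), files_per_day(300))  # 2880 288

# Item 4: 10 GiB/day (assumed volume) split by day, hour, and minute.
daily = 10 * 1024**3
for name, count in [("day", 1), ("hour", 24), ("minute", 1440)]:
    print(name, partition_bytes(daily, count) // 1024**2, "MiB")

# Item 1: ten thousand 4 KB files fit in a single output file.
plan = plan_compaction([4_096] * 10_000)
print(len(plan), "output file(s)")
```

The partition figures are the part worth staring at: with minute-level partitions at this volume, each partition holds about 7 MiB, so no amount of compaction inside a partition can get near the target size. First-fit-decreasing is only one simple way to group files, and its nested loop is fine for a sketch but would need a smarter structure for partitions with millions of files.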

The cheapest small file is the one you never wrote. Fix the writer first, compact second, re-partition third.