Squishy

One number for how well a compressor does on real data.

Squishy is a fixed set of real, freely shareable files (prose, code, logs, genomes, tables, images, binaries) picked to cover the range of things people actually compress, from a few megabytes to several gigabytes. Run your tool over it and you get a single Squishy Score you can cite and compare. It's the 2026 successor to Silesia.

Score your tool

One command. Hand it your compressor as a plain stdin → stdout command — it streams the corpus, runs your tool over every file, and prints your score:

uv run squishy-calculate --cmd "zstd -19 -c"

Works with any codec the same way: --cmd "xz -9 -c", --cmd "brotli -q 11 -c", or your own --cmd "./mytool -c". Add --verify --decompress "zstd -dc" to prove it's lossless; use --cmd "mytool -o {out} {in}" for tools that read/write files instead of pipes. It caches as it goes, so re-runs are instant.
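To make the stdin → stdout contract concrete, here is a minimal Python sketch of what one measurement could look like for a single file. It is not the squishy-calculate implementation: the function names run_filter and measure are hypothetical, and the whole input is held in memory for brevity, where a real harness would stream.

```python
import shlex
import subprocess
from typing import Optional


def run_filter(cmd: str, data: bytes) -> bytes:
    """Pipe data through a stdin -> stdout command and return what it wrote."""
    result = subprocess.run(
        shlex.split(cmd), input=data, stdout=subprocess.PIPE, check=True
    )
    return result.stdout


def measure(cmd: str, data: bytes, decompress_cmd: Optional[str] = None) -> dict:
    """Compress data with cmd and, if asked, check the round trip is lossless."""
    packed = run_filter(cmd, data)
    stats = {
        "ratio": len(data) / len(packed),    # the "×" figure: original / compressed
        "bpb": 8 * len(packed) / len(data),  # bits of output per byte of input
    }
    if decompress_cmd is not None:
        stats["lossless"] = run_filter(decompress_cmd, packed) == data
    return stats
```

With zstd installed, measure("zstd -19 -c", payload, "zstd -dc") would report the ratio for payload and confirm that decompression returns the original bytes.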

The coverage map

Each dot is one artifact, placed by properties of its bytes that are measured directly, never inferred from how a compressor performs: how random the bytes are (entropy), how repetitive they are, and how far back the repeats sit (local vs long-range). Dot size is file size. The files are a sparse sample, not a dense grid, but they are chosen to be representative of the whole. These are the dimensions along which compressors are known to behave differently, so spanning them gives a principled reason for each file's inclusion; the axes describe coverage and do not predict a ratio.
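The page does not say exactly how these byte properties are computed. As a rough illustration only, the sketch below measures order-0 entropy and a simple repeat profile based on 4-byte windows. The function names, the window length k, and the choice of "distance to the previous identical window" are all assumptions, not the map's actual definitions.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte (0 = constant, 8 = uniform)."""
    if not data:
        return 0.0
    n = len(data)
    return sum(c / n * math.log2(n / c) for c in Counter(data).values())


def repeat_profile(data: bytes, k: int = 4):
    """Return (fraction of k-byte windows seen before, distances back to the last copy)."""
    last_seen = {}  # window bytes -> most recent start offset
    distances = []
    positions = max(len(data) - k + 1, 0)
    for i in range(positions):
        gram = data[i:i + k]
        if gram in last_seen:
            distances.append(i - last_seen[gram])
        last_seen[gram] = i
    fraction = len(distances) / positions if positions else 0.0
    return fraction, distances
```

Short distances indicate local repetition that a small window can exploit, while long distances only pay off for compressors with a large window. The dictionary grows with the number of distinct windows, so a real tool would sample or hash rather than store every window of a multi-gigabyte file.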

Reference board — draft (partial)

tool                Squishy Score (×)  corpus bpb  Prose   Code & Web  Structured  Tabular / DB  Binary & Media
zpaq v7.15          7.85×              2.620       5.32×   7.34×       19.26×      8.37×         4.74×
xz -9 v5.8.3        5.76×              2.977       4.15×   5.37×       11.91×      5.74×         4.15×
brotli -11 v1.2.0   5.69×              3.021       4.11×   5.40×       12.60×      5.32×         4.01×
zstd -22 v1.5.7     5.46×              3.092       4.10×   5.27×       11.96×      4.99×         3.78×
zstd -19 v1.5.7     5.40×              3.106       4.07×   5.22×       11.56×      4.95×         3.76×
bzip2 -9 v1.0.8     5.08×              3.278       4.02×   5.18×       12.21×      4.11×         3.24×
gzip -9             3.99×              3.495       2.84×   4.00×       8.38×       3.53×         3.00×

Draft, partial: these results cover only the small members of the corpus. The large rungs are pending, so this is not yet a Squishy Score. The board scales to any number of tool versions.
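The page does not state how per-file ratios roll up into the board's columns. One reading that reproduces the zpaq row to within rounding is a geometric mean within each category, then a geometric mean across the five categories, with only exe counted under Binary & Media. The sketch below checks that reading against the per-file figures listed under "Every dataset". It is an inference from the published draft numbers, not a documented formula, and the helper names are hypothetical.

```python
from math import prod

# zpaq v7.15 per-file ratios, copied from the dataset cards on this page.
ZPAQ = {
    "Prose": [4.96, 5.71],                 # dickens, aozora
    "Code & Web": [10.79, 5.32, 6.90],     # monorepo, minjs, markup
    "Structured": [24.16, 30.67, 9.64],    # json, log, genome
    "Tabular / DB": [26.11, 1.94, 11.57],  # csv, parquet, sqlite
    "Binary & Media": [4.74],              # exe only (assumption: photo, movie, weights excluded)
}


def geomean(values):
    """Geometric mean: the n-th root of the product of n ratios."""
    return prod(values) ** (1 / len(values))


def aggregate(per_category):
    """Return (per-category geometric means, geometric mean across categories)."""
    categories = {name: geomean(ratios) for name, ratios in per_category.items()}
    return categories, geomean(list(categories.values()))


categories, overall = aggregate(ZPAQ)
for name, value in categories.items():
    print(f"{name:16s}{value:6.2f}×")
print(f"{'overall':16s}{overall:6.2f}×")
```

A geometric mean keeps one highly compressible file, such as the log at 30.67×, from dominating the result the way it would in an arithmetic mean.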

Every dataset

Prose

1

dickens

12.2 MB · Public-Domain

Nine novels by Charles Dickens — English prose.

source ↗
compression — 7 tools
zpaq v7.15          4.96×
bzip2 -9 v1.0.8     3.73×
xz -9 v5.8.3        3.66×
brotli -11 v1.2.0   3.66×
zstd -22 v1.5.7     3.63×
zstd -19 v1.5.7     3.63×
gzip -9             2.69×
2

aozora

12.0 MB · Public-Domain

Collected works of Natsume Sōseki — Japanese literary prose.

source ↗
compression — 7 tools
zpaq v7.15          5.71×
xz -9 v5.8.3        4.69×
zstd -22 v1.5.7     4.62×
brotli -11 v1.2.0   4.62×
zstd -19 v1.5.7     4.56×
bzip2 -9 v1.0.8     4.34×
gzip -9             3.00×

Code & Web

3

monorepo

50.9 MB · Apache-2.0-LLVM

The lib/ source tree of the LLVM Clang C++ compiler.

source ↗
compression — 7 tools
zpaq v7.15          10.79×
xz -9 v5.8.3        7.54×
brotli -11 v1.2.0   7.52×
zstd -22 v1.5.7     7.47×
zstd -19 v1.5.7     7.29×
bzip2 -9 v1.0.8     6.80×
gzip -9             5.18×
4

minjs

3.6 MB · MIT

The minified Plotly.js charting library — one big line of JavaScript.

source ↗
compression — 7 tools
zpaq v7.15          5.32×
brotli -11 v1.2.0   4.29×
xz -9 v5.8.3        4.26×
zstd -19 v1.5.7     4.06×
zstd -22 v1.5.7     4.06×
bzip2 -9 v1.0.8     3.81×
gzip -9             3.31×
5

markup

8.0 MB · Freely-distributable

Shakespeare's plays, marked up in XML.

source ↗
compression — 7 tools
zpaq v7.15          6.90×
bzip2 -9 v1.0.8     5.36×
brotli -11 v1.2.0   4.87×
xz -9 v5.8.3        4.83×
zstd -19 v1.5.7     4.82×
zstd -22 v1.5.7     4.82×
gzip -9             3.71×

Structured

6

json

14.2 MB · Public-Domain

20,000 magnitude-4.5+ earthquakes, 2010–2024 (USGS GeoJSON).

source ↗
compression — 7 tools
zpaq v7.15          24.16×
bzip2 -9 v1.0.8     14.83×
brotli -11 v1.2.0   13.90×
zstd -22 v1.5.7     13.23×
zstd -19 v1.5.7     13.21×
xz -9 v5.8.3        12.82×
gzip -9             9.49×
7

log

26.2 MB · Public-Domain

A NASA web server's access log from July 1995.

source ↗
compression — 7 tools
zpaq v7.15          30.67×
bzip2 -9 v1.0.8     16.97×
brotli -11 v1.2.0   16.09×
zstd -22 v1.5.7     15.71×
zstd -19 v1.5.7     15.56×
xz -9 v5.8.3        15.00×
gzip -9             10.40×
8

genome

26.2 MB · Public-Domain

Sequencing reads from an E. coli genome (FASTQ).

source ↗
compression — 7 tools
zpaq v7.15          9.64×
brotli -11 v1.2.0   8.94×
xz -9 v5.8.3        8.78×
zstd -22 v1.5.7     8.23×
zstd -19 v1.5.7     7.52×
bzip2 -9 v1.0.8     7.25×
gzip -9             5.97×

Tabular / DB

9

csv

26.5 MB · Public-Domain-USGov

Daily weather observations from NOAA's global climate network, 2024 (CSV).

source ↗
compression — 7 tools
zpaq v7.15          26.11×
xz -9 v5.8.3        13.54×
brotli -11 v1.2.0   13.33×
zstd -22 v1.5.7     12.45×
zstd -19 v1.5.7     12.20×
bzip2 -9 v1.0.8     9.69×
gzip -9             7.96×
10

parquet

20.9 MB · Public-Domain-USGov

U.S. airline on-time flight records (Bureau of Transportation Statistics) — stored column-wise as Apache Parquet.

source ↗
compression — 7 tools
zpaq v7.15          1.94×
brotli -11 v1.2.0   1.87×
xz -9 v5.8.3        1.85×
zstd -19 v1.5.7     1.81×
zstd -22 v1.5.7     1.81×
bzip2 -9 v1.0.8     1.51×
gzip -9             1.48×
11

sqlite

48.3 MB · Public-Domain-USGov

USDA's nutrition database — foods, nutrients, and portions across 17 related tables (SR Legacy).

source ↗
compression — 7 tools
zpaq v7.15          11.57×
xz -9 v5.8.3        7.57×
brotli -11 v1.2.0   6.06×
zstd -22 v1.5.7     5.52×
zstd -19 v1.5.7     5.51×
bzip2 -9 v1.0.8     4.73×
gzip -9             3.75×

Binary & Media

12

exe

62.5 MB · Apache-2.0

A compiled Linux executable — the Hugo static-site generator.

source ↗
compression — 7 tools
zpaq v7.15          4.74×
xz -9 v5.8.3        4.15×
brotli -11 v1.2.0   4.01×
zstd -22 v1.5.7     3.78×
zstd -19 v1.5.7     3.76×
bzip2 -9 v1.0.8     3.24×
gzip -9             3.00×
13

photo

6.5 MB · Public-Domain

NASA's “Blue Marble” — Earth photographed from Apollo 17.

source ↗
compression — 7 tools
zpaq v7.15          1.06×
bzip2 -9 v1.0.8     1.02×
brotli -11 v1.2.0   1.01×
zstd -19 v1.5.7     1.01×
zstd -22 v1.5.7     1.01×
xz -9 v5.8.3        1.01×
gzip -9             1.00×
14

movie

12.9 MB · CC-BY-3.0

A clip from the open film Big Buck Bunny (H.264 video).

source ↗
compression — 7 tools
zpaq v7.15          1.02×
brotli -11 v1.2.0   1.01×
zstd -19 v1.5.7     1.01×
zstd -22 v1.5.7     1.01×
xz -9 v5.8.3        1.00×
gzip -9             1.00×
bzip2 -9 v1.0.8     1.00×
15

weights

90.9 MB · Apache-2.0

The trained weights of a small neural network (safetensors).

source ↗
compression — 7 tools
zpaq v7.15          1.23×
xz -9 v5.8.3        1.12×
brotli -11 v1.2.0   1.11×
zstd -22 v1.5.7     1.10×
zstd -19 v1.5.7     1.09×
gzip -9             1.09×
bzip2 -9 v1.0.8     1.07×

Scale tier — the large members

Large files spanning the data kinds and the size axis (~0.3–4 GB). The GB rungs of compressible kinds (csv, columnar, genome, text) are scored members of the corpus; the model-weights ladder (135M → 0.5B → 1.5B params) and large media are near-incompressible throughput and behavior diagnostics, not scored. This tier is still being assembled; see the readiness plan.

scale

weights-smollm2-135m.safetensors

269 MB · Apache-2.0

SmolLM2-135M: a small (135M-parameter) language model's weights (Apache-2.0). The first rung of the weights size-ladder.

sha256 5af571cbf074e6d2… · source ↗
scale

nasa-http-jul-aug-1995.log

373 MB · Public-Domain

Scale-tier file — for throughput / large-window testing (not scored).

sha256 35c38d9465a8ed27… · source ↗
scale

big-buck-bunny-1080p.mov

725 MB · CC-BY-3.0

Scale-tier file — for throughput / large-window testing (not scored).

sha256 dc2146a2b1172def… · source ↗
scale

bts-ontime-2022-2024.parquet

772 MB · Public-Domain-USGov

Scale-tier file — for throughput / large-window testing (not scored).

sha256 acb6eeb73e9c4449… · source ↗
scale

weights-qwen2.5-0.5b.safetensors

988 MB · Apache-2.0

Qwen2.5-0.5B — a 0.5B-parameter language model's weights (Apache-2.0). The second rung of the weights size-ladder.

sha256 fdf756fa7fcbe740… · source ↗
scale

enwik9.txt

1000 MB · CC-BY-SA-3.0

Scale-tier file — for throughput / large-window testing (not scored).

sha256 159b85351e5f76e6… · source ↗
scale

ecoli-DRR002013-full.fastq

1073 MB · Public-Domain

Scale-tier file — for throughput / large-window testing (not scored).

sha256 ff3de7024de4f45e… · source ↗
scale

noaa-ghcn-daily-2024-full.csv

1331 MB · Public-Domain-USGov

Scale-tier file — for throughput / large-window testing (not scored).

sha256 70baf8b1fe829889… · source ↗
scale

clang-releases-16-17-18-19.tar

1504 MB · Apache-2.0-LLVM

Scale-tier file — for throughput / large-window testing (not scored).

sha256 e8518848a41185c7… · source ↗
scale

llvm-project-19.1.0.src.tar

1772 MB · Apache-2.0-LLVM

Scale-tier file — for throughput / large-window testing (not scored).

sha256 bb4ae7add97894e6… · source ↗
scale

weights-qwen2.5-1.5b.safetensors

3087 MB · Apache-2.0

Qwen2.5-1.5B — a larger (1.5B-parameter) language model's weights (Apache-2.0). The top rung of the ladder; multi-GB, for large-window and throughput work.

sha256 dd924a11b4c220f3… · source ↗
scale

noaa-ghcn-daily-2021-2023.csv

4069 MB · Public-Domain-USGov

Scale-tier file — for throughput / large-window testing (not scored).

sha256 9111537b27d9ed83… · source ↗
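
Each scale-tier entry lists a truncated SHA-256 digest. A minimal sketch for checking a downloaded copy against the visible prefix is below. The function names are hypothetical, and a prefix match is weaker than comparing the full digest, which the page does not show.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a multi-GB file never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_prefix(path: str, listed_prefix: str) -> bool:
    """Compare a file's digest with the truncated hex prefix shown on the page."""
    return sha256_of(path).startswith(listed_prefix.lower())
```

For example, matches_listed_prefix("enwik9.txt", "159b85351e5f76e6") would check a local copy of enwik9.txt, assuming it was saved under that name in the current directory.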