Squishy


A corpus of realistic stuff you might compress. Squishy is a fixed set of real files: novels, source code, server logs, genome reads, weather tables, databases, executables, a photo, a film clip, the weights of a neural network. They range from a few megabytes to a few gigabytes, and every one is freely redistributable, with its source and checksum published. Run any compressor over the set and you get a single Squishy Score you can cite and compare. It's a successor to the Silesia corpus.

The shape of the corpus

Each dot is one file, placed by three measurements of its bytes: entropy (in bits per byte), coverage, and match distance. No compressor is involved in the placement.

Color is the kind of data. The corpus is chosen so the dots spread across the whole space instead of piling up in one corner.


View the data as a table (no 3D required)
category | file | entropy (bpb) | coverage | match distance | size
Binary & Media | weights | 7.36 | 0% | 224 B | 90.9 MB
Binary & Media | exe | 6.35 | 26% | 3 KB | 62.5 MB
Binary & Media | movie | 7.98 | 0% | 25 KB | 12.9 MB
Binary & Media | photo | 7.95 | 0% | 16 B | 6.5 MB
Binary & Media | winexe | 6.26 | 21% | 1 KB | 4.1 MB
Binary & Media | armexe | 6.58 | 11% | 1 KB | 1.2 MB
Binary & Media | symbols | 4.54 | 10% | 3 KB | 1.1 MB
Binary & Media | wasm | 5.68 | 2% | 272 B | 0.9 MB
Code & Web | llvm-project-19.1.0.src.tar | 5.69 | 68% | 10 KB | 1.8 GB
Code & Web | clang-releases-16-17-18-19.tar | 5.60 | 87% | 21 KB | 1.5 GB
Code & Web | enwik9.txt | 5.16 | 30% | 1.1 MB | 1.0 GB
Code & Web | monorepo | 5.15 | 34% | 20 KB | 50.9 MB
Code & Web | markup | 5.22 | 34% | 14 KB | 8.0 MB
Code & Web | minjs | 5.47 | 9% | 33 KB | 3.6 MB
Prose | dickens | 4.65 | 2% | 860 KB | 12.2 MB
Prose | aozora | 4.80 | 10% | 1.1 MB | 12.0 MB
Scale tier | noaa-ghcn-daily-2024-full.csv | 4.07 | 78% | 1.5 MB | 1.3 GB
Scale tier | big-buck-bunny-1080p.mov | 7.99 | 0% | 53 KB | 725.1 MB
Structured | ecoli-DRR002013-full.fastq | 3.31 | 67% | 1 KB | 1.1 GB
Structured | nasa-http-jul-aug-1995.log | 5.24 | 83% | 37 KB | 373.1 MB
Structured | log | 5.22 | 73% | 21 KB | 26.2 MB
Structured | genome | 3.28 | 49% | 16 B | 26.2 MB
Structured | json | 5.05 | 70% | 12 KB | 14.2 MB
Tabular / DB | noaa-ghcn-daily-2021-2023.csv | 4.05 | 80% | 2.9 MB | 4.1 GB
Tabular / DB | bts-ontime-2022-2024 | 7.49 | 8% | 227 KB | 772.3 MB
Tabular / DB | sqlite | 5.63 | 9% | 4 KB | 48.3 MB
Tabular / DB | csv | 4.03 | 57% | 152 KB | 26.5 MB
Tabular / DB | parquet | 7.47 | 8% | 293 KB | 20.9 MB
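
The page does not say exactly how the entropy (bpb) column is computed. As a rough illustration, here is a minimal sketch that assumes it is plain order-0 Shannon entropy over byte values; the function name entropy_bpb is made up for this example, and the corpus's own measurement may differ.

```python
import math
from collections import Counter


def entropy_bpb(data: bytes) -> float:
    """Order-0 Shannon entropy of a byte string, in bits per byte.

    Assumption: this is one plausible reading of "entropy (bpb)",
    not necessarily the definition the corpus uses.
    """
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    print(entropy_bpb(b"aaaaaaaa"))        # a single repeated byte
    print(entropy_bpb(bytes(range(256))))  # every byte value exactly once
```

A single repeated byte scores 0 and a uniform spread over all 256 values scores 8, which is why already-compressed files such as photo and movie sit near 8 in the table. For gigabyte files the counts would be accumulated chunk by chunk rather than from one in-memory byte string.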

The Squishy Score is the geometric mean of the compression ratio across all files: a single, stable reference number.

Score your tool

Give the runner any compressor that reads stdin and writes stdout, such as "xz -9 -c" or your own "./mytool -c". It streams the corpus, runs your tool on every file, and prints the score:

uv run squishy-calculate --cmd "zstd -19 -c"
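
To make the stdin/stdout contract concrete, here is a hedged sketch of what measuring one file could look like. It is not the squishy-calculate implementation; the function name compression_ratio and the file path in the usage note are assumptions for illustration.

```python
import shlex
import subprocess
from pathlib import Path


def compression_ratio(cmd: str, path: Path) -> float:
    """Pipe one file through `cmd` and return original size / compressed size.

    Hypothetical helper: the real runner may stream and count bytes differently.
    """
    original_size = path.stat().st_size
    with path.open("rb") as src:
        # The compressor reads the file on stdin; its stdout is captured in memory.
        result = subprocess.run(
            shlex.split(cmd), stdin=src, stdout=subprocess.PIPE, check=True
        )
    return original_size / len(result.stdout)
```

With a local copy of a corpus file, compression_ratio("zstd -19 -c", Path("dickens")) would return that file's ratio. Capturing stdout in memory is fine for the small files; for the multi-gigabyte ones a real runner would count bytes as they stream past instead.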

The Squishy Score is the geometric mean of each file's compression ratio (original size ÷ compressed size). Every file counts once; nothing is weighted, excluded, or tuned. Beside it the runner prints corpus bpb, the plain bits-per-byte over all bytes, where the big files dominate.
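
The two numbers described above can be sketched in a few lines. The ratios are the xz -9 figures copied from the per-file listings on this page; the function names and the demo sizes are made up for illustration, and this is not the runner's own code.

```python
import math

# xz -9 ratios for the 19 small files, copied from the per-file listings on this page.
XZ_RATIOS = [3.66, 4.69, 7.54, 4.26, 4.83, 12.82, 15.00, 8.78, 13.54, 1.85,
             7.57, 4.15, 1.01, 1.00, 1.12, 3.78, 2.54, 3.61, 2.69]


def squishy_score(ratios):
    """Unweighted geometric mean, computed in log space."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def corpus_bpb(sizes):
    """Bits per byte over all bytes; `sizes` holds (original, compressed) byte counts."""
    original = sum(o for o, _ in sizes)
    compressed = sum(c for _, c in sizes)
    return 8 * compressed / original


if __name__ == "__main__":
    print(f"{squishy_score(XZ_RATIOS):.2f}")
    # Made-up sizes: a small file that shrinks 10x and a large one that does not shrink.
    demo = [(1_000_000, 100_000), (100_000_000, 100_000_000)]
    print(f"{squishy_score([o / c for o, c in demo]):.2f}")  # 3.16
    print(f"{corpus_bpb(demo):.2f}")                         # 7.93
```

The first result should land close to the 4.07× the reference board reports for xz -9; any small gap comes from the listings being rounded to two decimals. The demo shows the contrast between the two metrics: the geometric mean gives both files equal say (about 3.16×), while corpus bpb is dominated by the large incompressible file (about 7.93 bits per byte).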

Reference board · draft

These are familiar tools run on the small files only, so the numbers are not official Squishy Scores yet.

tool | Squishy Score (×) | corpus bpb | Prose | Code & Web | Structured | Tabular / DB | Binary & Media
xz -9 v5.8.3 | 4.07× | 2.955 | 4.15× | 5.37× | 11.91× | 5.74× | 2.14×
brotli -11 v1.2.0 | 4.02× | 3.000 | 4.11× | 5.40× | 12.60× | 5.32× | 2.10×
zstd -22 v1.5.7 | 3.87× | 3.073 | 4.10× | 5.27× | 11.96× | 4.99× | 2.02×
zstd -19 v1.5.7 | 3.83× | 3.086 | 4.07× | 5.22× | 11.56× | 4.95× | 2.02×
bzip2 -9 v1.0.8 | 3.62× | 3.263 | 4.02× | 5.18× | 12.21× | 4.11× | 1.87×
gzip -9 | 3.04× | 3.479 | 2.84× | 4.00× | 8.38× | 3.53× | 1.80×

Every file

What each file is, a peek inside, and how each tool compresses it.

Prose

1

dickens

12.2 MB · Public-Domain

Nine novels by Charles Dickens — English prose.

download dickens · source ↗
compression — 6 tools
bzip2 -9 v1.0.8 | 3.73×
xz -9 v5.8.3 | 3.66×
brotli -11 v1.2.0 | 3.66×
zstd -22 v1.5.7 | 3.63×
zstd -19 v1.5.7 | 3.63×
gzip -9 | 2.69×
2

aozora

12.0 MB · Public-Domain

Collected works of Natsume Sōseki — Japanese literary prose.

download aozora.txt · source ↗
compression — 6 tools
xz -9 v5.8.3 | 4.69×
zstd -22 v1.5.7 | 4.62×
brotli -11 v1.2.0 | 4.62×
zstd -19 v1.5.7 | 4.56×
bzip2 -9 v1.0.8 | 4.34×
gzip -9 | 3.00×

Code & Web

3

monorepo

50.9 MB · Apache-2.0-LLVM

The lib/ source tree of the LLVM Clang C++ compiler.

download monorepo.tar · source ↗
compression — 6 tools
xz -9 v5.8.3 | 7.54×
brotli -11 v1.2.0 | 7.52×
zstd -22 v1.5.7 | 7.47×
zstd -19 v1.5.7 | 7.29×
bzip2 -9 v1.0.8 | 6.80×
gzip -9 | 5.18×
4

minjs

3.6 MB · MIT

The minified Plotly.js charting library — one big line of JavaScript.

download minjs.min.js · source ↗
compression — 6 tools
brotli -11 v1.2.0 | 4.29×
xz -9 v5.8.3 | 4.26×
zstd -19 v1.5.7 | 4.06×
zstd -22 v1.5.7 | 4.06×
bzip2 -9 v1.0.8 | 3.81×
gzip -9 | 3.31×
5

markup

8.0 MB · Freely-distributable

Shakespeare's plays, marked up in XML.

download markup.xml · source ↗
compression — 6 tools
bzip2 -9 v1.0.8 | 5.36×
brotli -11 v1.2.0 | 4.87×
xz -9 v5.8.3 | 4.83×
zstd -19 v1.5.7 | 4.82×
zstd -22 v1.5.7 | 4.82×
gzip -9 | 3.71×

Structured

6

json

14.2 MB · Public-Domain

20,000 magnitude-4.5+ earthquakes, 2010–2024 (USGS GeoJSON).

download data.json · source ↗
compression — 6 tools
bzip2 -9 v1.0.8 | 14.83×
brotli -11 v1.2.0 | 13.90×
zstd -22 v1.5.7 | 13.23×
zstd -19 v1.5.7 | 13.21×
xz -9 v5.8.3 | 12.82×
gzip -9 | 9.49×
7

log

26.2 MB · Public-Domain

A NASA web server's access log from July 1995.

download access.log · source ↗
compression — 6 tools
bzip2 -9 v1.0.8 | 16.97×
brotli -11 v1.2.0 | 16.09×
zstd -22 v1.5.7 | 15.71×
zstd -19 v1.5.7 | 15.56×
xz -9 v5.8.3 | 15.00×
gzip -9 | 10.40×
8

genome

26.2 MB · Public-Domain

Sequencing reads from an E. coli genome (FASTQ).

download ecoli.fastq · source ↗
compression — 6 tools
brotli -11 v1.2.0 | 8.94×
xz -9 v5.8.3 | 8.78×
zstd -22 v1.5.7 | 8.23×
zstd -19 v1.5.7 | 7.52×
bzip2 -9 v1.0.8 | 7.25×
gzip -9 | 5.97×

Tabular / DB

9

csv

26.5 MB · Public-Domain-USGov

Daily weather observations from NOAA's global climate network, 2024 (CSV).

download data.csv · source ↗
compression — 6 tools
xz -9 v5.8.3 | 13.54×
brotli -11 v1.2.0 | 13.33×
zstd -22 v1.5.7 | 12.45×
zstd -19 v1.5.7 | 12.20×
bzip2 -9 v1.0.8 | 9.69×
gzip -9 | 7.96×
10

parquet

20.9 MB · Public-Domain-USGov

U.S. airline on-time flight records (Bureau of Transportation Statistics) — stored column-wise as Apache Parquet.

download data.parquet · source ↗
compression — 6 tools
brotli -11 v1.2.0 | 1.87×
xz -9 v5.8.3 | 1.85×
zstd -19 v1.5.7 | 1.81×
zstd -22 v1.5.7 | 1.81×
bzip2 -9 v1.0.8 | 1.51×
gzip -9 | 1.48×
11

sqlite

48.3 MB · Public-Domain-USGov

USDA's nutrition database — foods, nutrients, and portions across 17 related tables (SR Legacy).

download data.sqlite · source ↗
compression — 6 tools
xz -9 v5.8.3 | 7.57×
brotli -11 v1.2.0 | 6.06×
zstd -22 v1.5.7 | 5.52×
zstd -19 v1.5.7 | 5.51×
bzip2 -9 v1.0.8 | 4.73×
gzip -9 | 3.75×

Binary & Media

12

exe

62.5 MB · Apache-2.0

A compiled Linux executable — the Hugo static-site generator.

download tool.bin · source ↗
compression — 6 tools
xz -9 v5.8.3 | 4.15×
brotli -11 v1.2.0 | 4.01×
zstd -22 v1.5.7 | 3.78×
zstd -19 v1.5.7 | 3.76×
bzip2 -9 v1.0.8 | 3.24×
gzip -9 | 3.00×
13

photo

6.5 MB · Public-Domain

NASA's “Blue Marble” — Earth photographed from Apollo 17.

download photo.jpg · source ↗
compression — 6 tools
bzip2 -9 v1.0.8 | 1.02×
brotli -11 v1.2.0 | 1.01×
zstd -19 v1.5.7 | 1.01×
zstd -22 v1.5.7 | 1.01×
xz -9 v5.8.3 | 1.01×
gzip -9 | 1.00×
14

movie

12.9 MB · CC-BY-3.0

A clip from the open film Big Buck Bunny (H.264 video).

download movie.mp4 · source ↗
compression — 6 tools
brotli -11 v1.2.0 | 1.01×
zstd -19 v1.5.7 | 1.01×
zstd -22 v1.5.7 | 1.01×
xz -9 v5.8.3 | 1.00×
gzip -9 | 1.00×
bzip2 -9 v1.0.8 | 1.00×
15

weights

90.9 MB · Apache-2.0

The trained weights of a small neural network (safetensors).

download weights.safetensors · source ↗
compression — 6 tools
xz -9 v5.8.3 | 1.12×
brotli -11 v1.2.0 | 1.11×
zstd -22 v1.5.7 | 1.10×
zstd -19 v1.5.7 | 1.09×
gzip -9 | 1.09×
bzip2 -9 v1.0.8 | 1.07×
16

symbols

1.1 MB · MIT

DWARF debug symbols from a Lua 5.4.8 build compiled with -g (a debug-info file, not a runnable program).

download symbols.dwarf · source ↗
compression — 6 tools
xz -9 v5.8.3 | 3.78×
brotli -11 v1.2.0 | 3.66×
zstd -19 v1.5.7 | 3.50×
zstd -22 v1.5.7 | 3.50×
bzip2 -9 v1.0.8 | 2.96×
gzip -9 | 2.91×
17

wasm

868 KB · Public-Domain

The SQLite engine compiled to WebAssembly — stack-machine bytecode.

download engine.wasm · source ↗
compression — 6 tools
xz -9 v5.8.3 | 2.54×
brotli -11 v1.2.0 | 2.49×
zstd -19 v1.5.7 | 2.40×
zstd -22 v1.5.7 | 2.40×
bzip2 -9 v1.0.8 | 2.35×
gzip -9 | 2.16×
18

winexe

4.1 MB · MIT OR Apache-2.0

The fd file-finder as a Windows PE executable.

download winexe.exe · source ↗
compression — 6 tools
xz -9 v5.8.3 | 3.61×
brotli -11 v1.2.0 | 3.52×
zstd -22 v1.5.7 | 3.31×
zstd -19 v1.5.7 | 3.31×
bzip2 -9 v1.0.8 | 2.80×
gzip -9 | 2.67×
19

armexe

1.2 MB · MIT

The hyperfine benchmarking tool as an ARM64 Linux executable.

download armexe.elf · source ↗
compression — 6 tools
xz -9 v5.8.3 | 2.69×
brotli -11 v1.2.0 | 2.56×
zstd -19 v1.5.7 | 2.34×
zstd -22 v1.5.7 | 2.34×
bzip2 -9 v1.0.8 | 2.15×
gzip -9 | 2.02×

The big ones

The same kinds of data at gigabyte scale, where long-range matching and big compression windows start to matter. Still being assembled.

scale

weights-smollm2-135m.safetensors

269.1 MB · Apache-2.0

SmolLM2-135M — a small (135M-parameter) language model's weights (Apache-2.0). The middle rung of the weights size-ladder.

download ↓ · sha256 5af571cbf074e6d2… · source ↗
scale

nasa-http-jul-aug-1995.log

373.1 MB · Public-Domain

Scale-tier file — a large rung for large-window and throughput testing.

download ↓ · sha256 35c38d9465a8ed27… · source ↗
scale

big-buck-bunny-1080p.mov

725.1 MB · CC-BY-3.0

The full open film Big Buck Bunny in 1080p H.264 video.

download ↓ · sha256 dc2146a2b1172def… · source ↗
scale

bts-ontime-2022-2024.parquet

772.3 MB · Public-Domain-USGov

Scale-tier file — a large rung for large-window and throughput testing.

download ↓ · sha256 acb6eeb73e9c4449… · source ↗
scale

weights-qwen2.5-0.5b.safetensors

988.1 MB · Apache-2.0

Qwen2.5-0.5B — a 0.5B-parameter language model's weights (Apache-2.0). The second rung of the weights size-ladder.

download ↓ · sha256 fdf756fa7fcbe740… · source ↗
scale

enwik9.txt

1.00 GB · CC-BY-SA-3.0

The first billion bytes of an English Wikipedia XML dump (the Hutter-Prize text).

download ↓ · sha256 159b85351e5f76e6… · source ↗
scale

ecoli-DRR002013-full.fastq

1.07 GB · Public-Domain

Scale-tier file — a large rung for large-window and throughput testing.

download ↓ · sha256 ff3de7024de4f45e… · source ↗
scale

noaa-ghcn-daily-2024-full.csv

1.33 GB · Public-Domain-USGov

Scale-tier file — a large rung for large-window and throughput testing.

download ↓ · sha256 70baf8b1fe829889… · source ↗
scale

clang-releases-16-17-18-19.tar

1.50 GB · Apache-2.0-LLVM

Four LLVM/Clang release source trees concatenated — a real software archive.

download ↓ · sha256 e8518848a41185c7… · source ↗
scale

llvm-project-19.1.0.src.tar

1.77 GB · Apache-2.0-LLVM

Scale-tier file — a large rung for large-window and throughput testing.

download ↓ · sha256 bb4ae7add97894e6… · source ↗
scale

weights-qwen2.5-1.5b.safetensors

3.09 GB · Apache-2.0

Qwen2.5-1.5B — a larger (1.5B-parameter) language model's weights (Apache-2.0). The top rung of the ladder; multi-GB, for large-window and throughput work.

download ↓ · sha256 dd924a11b4c220f3… · source ↗
scale

noaa-ghcn-daily-2021-2023.csv

4.07 GB · Public-Domain-USGov

Scale-tier file — a large rung for large-window and throughput testing.

download ↓ · sha256 9111537b27d9ed83… · source ↗
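
Each scale-tier entry above lists a SHA-256 value, shown here only as a truncated prefix. A minimal sketch of checking a download against such a value, hashing in chunks so a multi-gigabyte file never has to fit in memory; the helper names sha256_of and matches_prefix are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: Path, published_prefix: str) -> bool:
    """Compare against a published hex digest, which may be truncated."""
    return sha256_of(path).startswith(published_prefix.lower())
```

For example, matches_prefix(Path("enwik9.txt"), "159b85351e5f76e6") checks a local copy against the 16 hex digits shown for enwik9.txt. A prefix match is a weaker check than a full comparison, so the complete published checksum is the better input when it is available.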