A corpus of realistic stuff you might compress. Squishy is a fixed set of real files: novels, source code, server logs, genome reads, weather tables, databases, executables, a photo, a film clip, the weights of a neural network. They range from a few megabytes to a few gigabytes, and every one is freely redistributable, with its source and checksum published. Run any compressor over the set and you get a single Squishy Score you can cite and compare. It's a successor to the Silesia corpus.
Each dot is one file, placed by three measurements of its bytes. No compressor is involved in the placement.
Color is the kind of data. The corpus is chosen so the dots spread across the whole space instead of piling up in one corner.
| category | file | entropy (bpb) | coverage | match distance | size |
|---|---|---|---|---|---|
| Binary & Media | weights | 7.36 | 0% | 224 B | 90.9 MB |
| Binary & Media | exe | 6.35 | 26% | 3 KB | 62.5 MB |
| Binary & Media | movie | 7.98 | 0% | 25 KB | 12.9 MB |
| Binary & Media | photo | 7.95 | 0% | 16 B | 6.5 MB |
| Binary & Media | winexe | 6.26 | 21% | 1 KB | 4.1 MB |
| Binary & Media | armexe | 6.58 | 11% | 1 KB | 1.2 MB |
| Binary & Media | symbols | 4.54 | 10% | 3 KB | 1.1 MB |
| Binary & Media | wasm | 5.68 | 2% | 272 B | 0.9 MB |
| Code & Web | llvm-project-19.1.0.src.tar | 5.69 | 68% | 10 KB | 1.8 GB |
| Code & Web | clang-releases-16-17-18-19.tar | 5.60 | 87% | 21 KB | 1.5 GB |
| Code & Web | enwik9.txt | 5.16 | 30% | 1.1 MB | 1.0 GB |
| Code & Web | monorepo | 5.15 | 34% | 20 KB | 50.9 MB |
| Code & Web | markup | 5.22 | 34% | 14 KB | 8.0 MB |
| Code & Web | minjs | 5.47 | 9% | 33 KB | 3.6 MB |
| Prose | dickens | 4.65 | 2% | 860 KB | 12.2 MB |
| Prose | aozora | 4.80 | 10% | 1.1 MB | 12.0 MB |
| Scale tier | noaa-ghcn-daily-2024-full.csv | 4.07 | 78% | 1.5 MB | 1.3 GB |
| Scale tier | big-buck-bunny-1080p.mov | 7.99 | 0% | 53 KB | 725.1 MB |
| Structured | ecoli-DRR002013-full.fastq | 3.31 | 67% | 1 KB | 1.1 GB |
| Structured | nasa-http-jul-aug-1995.log | 5.24 | 83% | 37 KB | 373.1 MB |
| Structured | log | 5.22 | 73% | 21 KB | 26.2 MB |
| Structured | genome | 3.28 | 49% | 16 B | 26.2 MB |
| Structured | json | 5.05 | 70% | 12 KB | 14.2 MB |
| Tabular / DB | noaa-ghcn-daily-2021-2023.csv | 4.05 | 80% | 2.9 MB | 4.1 GB |
| Tabular / DB | bts-ontime-2022-2024 | 7.49 | 8% | 227 KB | 772.3 MB |
| Tabular / DB | sqlite | 5.63 | 9% | 4 KB | 48.3 MB |
| Tabular / DB | csv | 4.03 | 57% | 152 KB | 26.5 MB |
| Tabular / DB | parquet | 7.47 | 8% | 293 KB | 20.9 MB |
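The page does not say how the entropy column is computed. One plausible reading, and only an assumption here, is the order-0 Shannon entropy of the byte histogram, expressed in bits per byte (bpb). A minimal sketch under that assumption, with a hypothetical function name:

```python
import math
from collections import Counter


def byte_entropy_bpb(data: bytes) -> float:
    """Order-0 Shannon entropy of a byte string, in bits per byte.

    Assumption: this mirrors the "entropy (bpb)" column; the corpus
    may measure it differently.
    """
    if not data:
        return 0.0
    counts = Counter(data)  # histogram of the byte values that occur
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this definition the value is bounded by 8 bits per byte, which would fit the already-compressed photo and movie sitting just below 8 and the four-letter genome data sitting near 3.3.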
The Squishy Score is the geometric mean of the compression ratios across all files: a single, stable reference number.
Give the runner any compressor that reads stdin and writes stdout, such as
"xz -9 -c" or your own "./mytool -c". It streams the corpus,
runs your tool on every file, and prints the score:
uv run squishy-calculate --cmd "zstd -19 -c"
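The stdin-to-stdout contract can be sketched in a few lines. This is an illustration, not the runner's code: the function names are hypothetical, and it holds the whole file in memory rather than streaming it as the real runner is said to do. It optionally decompresses again to check that the round trip is lossless.

```python
import shlex
import subprocess
from typing import Optional


def run_filter(cmd: str, data: bytes) -> bytes:
    """Feed data to cmd on stdin and return whatever it writes to stdout."""
    proc = subprocess.run(
        shlex.split(cmd), input=data, stdout=subprocess.PIPE, check=True
    )
    return proc.stdout


def compression_ratio(
    compress_cmd: str, data: bytes, decompress_cmd: Optional[str] = None
) -> float:
    """Return original size divided by compressed size for one file.

    If decompress_cmd is given, also verify that decompressing the output
    reproduces the input byte for byte.
    """
    compressed = run_filter(compress_cmd, data)
    if decompress_cmd is not None:
        if run_filter(decompress_cmd, compressed) != data:
            raise ValueError("round trip is not lossless")
    return len(data) / len(compressed)
```

With zstd installed and a local copy of a corpus file, a call might look like `compression_ratio("zstd -19 -c", open("dickens", "rb").read(), "zstd -dc")`.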
The Squishy Score is the geometric mean of each file's compression ratio (original size ÷ compressed size). Every file counts once; nothing is weighted, excluded, or tuned. Beside it, the runner prints corpus bpb, the plain bits-per-byte over all bytes, where the big files dominate.
"mytool -o {out} {in}".
Add --verify --decompress "zstd -dc" to prove the round trip is lossless.
The results below cover familiar tools on the small files only, so they are not official Squishy Scores yet.
| tool | Squishy Score (×) | corpus bpb | Prose | Code & Web | Structured | Tabular / DB | Binary & Media |
|---|---|---|---|---|---|---|---|
| xz -9 v5.8.3 | 4.07× | 2.955 | 4.15× | 5.37× | 11.91× | 5.74× | 2.14× |
| brotli -11 v1.2.0 | 4.02× | 3.000 | 4.11× | 5.40× | 12.60× | 5.32× | 2.10× |
| zstd -22 v1.5.7 | 3.87× | 3.073 | 4.10× | 5.27× | 11.96× | 4.99× | 2.02× |
| zstd -19 v1.5.7 | 3.83× | 3.086 | 4.07× | 5.22× | 11.56× | 4.95× | 2.02× |
| bzip2 -9 v1.0.8 | 3.62× | 3.263 | 4.02× | 5.18× | 12.21× | 4.11× | 1.87× |
| gzip -9 | 3.04× | 3.479 | 2.84× | 4.00× | 8.38× | 3.53× | 1.80× |
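The two headline columns can be reproduced from per-file sizes alone. A minimal sketch, assuming you already have each file's original and compressed size in bytes; the function names are hypothetical and not taken from the Squishy runner:

```python
import math
from typing import List, Tuple

Sizes = List[Tuple[int, int]]  # (original_bytes, compressed_bytes) per file


def squishy_score(sizes: Sizes) -> float:
    """Geometric mean of per-file compression ratios; every file counts once."""
    log_sum = sum(math.log(orig / comp) for orig, comp in sizes)
    return math.exp(log_sum / len(sizes))


def corpus_bpb(sizes: Sizes) -> float:
    """Compressed bits per original byte over the whole corpus.

    Sizes are summed first, so large files dominate this number.
    """
    total_orig = sum(orig for orig, _ in sizes)
    total_comp = sum(comp for _, comp in sizes)
    return 8 * total_comp / total_orig
```

For two made-up files compressed 2× and 8×, the geometric mean is 4×, whatever their sizes. The corpus bpb instead follows whichever file is larger, which is why the two numbers can rank tools differently.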
What each file is, a peek inside, and how each tool compresses it.
dickens: Nine novels by Charles Dickens (English prose).
A TALE OF TWO CITIES
A STORY OF THE FRENCH REVOLUTION
By Charles Dickens
CONTENTS
Book the First--Recalled to Life
CHAPTER I The Period
CHAPTER II The Mail
CHAPTER III The Night Shadows
CHAPTER IV The Preparation
CHAPTER V The Wine-shop
CHAPTER VI The Shoemaker
Book the Second--the Golden Thread
CHAPTER I Five Years Later
CHAPTER II A Sight
aozora: Collected works of Natsume Sōseki (Japanese literary prose).
夏目漱石 カーライル博物館 カーライル博物館 夏目漱石 公園の片隅に通りがかりの人を相手に演説をしている者がある。向うから来た釜形の尖った帽子を被ずいて古ぼけた外套を猫背に着た爺さんがそこへ歩みを佇めて演説者を見る。演説者はぴたりと演説をやめてつかつかとこの村夫子のたたずめる前に出て来る。二人の視線がひたと行き当る。演説者は濁りたる田舎調子にて御前はカ 余は晩餐前に公園を散歩するたびに川縁の椅子に腰を卸して向側を眺める。倫敦に固有なる濃霧はことに岸辺に多い。余が桜の杖に頤を支えて真正面を見ていると、遥かに対岸の往来を這い廻る霧の影は次第に濃くなって五階立の町続きの下からぜんぜんこの揺曳くものの裏に薄れ去って来る。しまいには遠き未来の世を眼前に引き カーライルはおらぬ。演説者も死んだであろう。しかしチェルシーは以前のごとく存在している。否彼の多年住み古した家屋敷さえ今なお儼然と保存せられてある。千七百八年チェイン・ロウが出来てより以来幾多の主人を迎え幾多の主人を送ったかは知らぬがとにかく今日まで昔のままで残っている。カーライルの歿後は有志家の 文学者でチェルシーに縁故のあるものを挙げると昔しはトマス・モア、下ってスモレット、なお下ってカーライルと同時代にはリ・ハントなどがもっとも著名である。ハントの家はカーライルの直近傍で、現にカーライルがこの家に引き移った晩尋ねて来たという事がカーライルの記録に書いてある。またハントがカーライルの細君 チェイン・ローは河岸端の往来を南に折れる小路でカーライルの家はその右側の中頃に在る。番地は二十四番地だ。 毎日のように川を隔てて霧の中にチェルシーを眺めた余はある朝ついに橋を渡ってその有名なる庵りを叩いた。 庵りというと物寂びた感じがある。少なくとも瀟洒とか風流とかいう念と伴う。しかしカーライルの庵はそんな脂っこい華奢なものではない。往来から直ちに戸が敲けるほどの道傍に建てられた四階造の真四角な家である。 出張った所も引き込んだ所もないのべつに真直に立っている。まるで大製造場の煙突の根本を切ってきてこれに天井を張って窓をつけたように見える。 これが彼が北の田舎から始めて倫敦へ出て来て探しに探し抜いて漸々の事で探し宛てた家である。彼は西を探し南を探しハンプステッドの北まで探してついに恰好の家を探し出す事が出来ず、最後にチェイン・ローへ来てこの家を見てもまだすぐに取きめるほどの勇気はなかったのである。四千万の愚物と天下を罵った彼も住家には 余は今この四角な家の石階の上に立って鬼の面のノッカーをコツコツと敲く。しばらくすると内から五十恰好の肥った婆さんが出て来て御這入りと云う。最初から見物人と思っているらしい。婆さんはやがて名簿のようなものを出して御名前をと云う。余は倫敦滞留中四たびこの家に入り四たびこの名簿に余が名を記録した覚えがあ 案内者はいずれの国でも同じものと見える。先っきから婆さんは室内の絵画器具について一々説明を与える。五十年間案内者を専門に修業したものでもあるまいが非常に熟練したものである。何年何月何日にどうしたこうしたとあたかも口から出任せに喋舌っているようである。しかもその流暢な弁舌に抑揚があり節奏がある。調子
monorepo: The lib/ source tree of the LLVM Clang C++ compiler.
lib/CMakeLists.txt lib/CIR/CMakeLists.txt lib/CIR/Dialect/CMakeLists.txt lib/CIR/Dialect/IR/CMakeLists.txt lib/CIR/Dialect/IR/CIRDialect.cpp lib/CrossTU/CrossTranslationUnit.cpp lib/CrossTU/CMakeLists.txt lib/Index/IndexBody.cpp lib/Index/CMakeLists.txt lib/Index/IndexingContext.cpp lib/Index/IndexingAction.cpp lib/Index/CommentToXML.cpp … 1433 files total
//===- CIRDialect.cpp - MLIR CIR ops implementation -----------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file implements the CIR dialect and its operations.
//
//===----------------------------------------------------------------------===//
#include <clang/CIR/Dialect/IR/CIRDialect.h>
minjs: The minified Plotly.js charting library, one big line of JavaScript.
/**
* plotly.js v2.27.0
* Copyright 2012-2023, Plotly, Inc.
* All rights reserved.
* Licensed under the MIT license
*/
/*! For license information please see plotly.min.js.LICENSE.txt */
!function(t,e){"object"==typeof exports&&"object"==typeof module?module.exports=e():"function"==typeof define&&define.amd?define([],e):"object"==typeo
markup: Shakespeare's plays, marked up in XML.
a_and_c.xml all_well.xml as_you.xml catalog com_err.xml coriolan.xml cymbelin.xml dream.xml dsssl.dtd fot.dtd hamlet.xml hen_iv_1.xml … 48 files total
<?xml version="1.0"?>
<!DOCTYPE PLAY SYSTEM "play.dtd">
<PLAY>
<TITLE>The Tragedy of Antony and Cleopatra</TITLE>
<FM>
<P>ASCII text placed in the public domain by Moby Lexical Tools, 1992.</P>
<P>SGML markup by Jon Bosak, 1992-1994.</P>
<P>XML version by Jon Bosak, 1996-1999.</P>
<P>The XML markup in this version is Copyright © 1999 Jon Bosak. This work may freely be distributed on condition that it not be modified or altered in any way.</P>
</FM>
<PERSONAE>
json: 20,000 magnitude-4.5+ earthquakes, 2010–2024 (USGS GeoJSON).
{"type":"FeatureCollection","metadata":{"generated":1780043074000,"url":"https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2010-01-01&endtime=2024-01-01&minmagnitude=4.5&orderb
{"type":"Feature","properties":{"mag":4.6,"place":"south of the Fiji Islands","time":1704042288597,"updated":1709415575040,"tz":null,"url":"https://earthquake.usgs.gov/earthquakes/eventpage/us6000m0ur
log: A NASA web server's access log from July 1995.
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0
205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074
unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310
unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
d104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310
genome: Sequencing reads from an E. coli genome (FASTQ).
@DRR002013.1 HWUSI-EAS679_0026:1:1:5823:1110/1
CATCGCGATCCACGCTCGCTGGCGTTGTCCGCCAGAAAGGGTATCCACGCTTTGANCTGCCAGATGAGTTATTCCCGCGGNCTGCNTTGCTTTCGTTACC
+
????????????????????????????????????????????????????????????????????????????????????????????????????
@DRR002013.2 HWUSI-EAS679_0026:1:1:6733:1111/1
ATTGCGCAACTGCCATCACCACCGTGCATGTCAGCGATCGTGGTCACGCTGGATTNGTCACCCTCGCCGCAGAAGATTACNAACTNGGCGCCCGGGGGCG
+
????????????????????????????????????????????????????????????????????????????????????????????????????
@DRR002013.3 HWUSI-EAS679_0026:1:1:7437:1115/1
TTTGTAACAGAATACCATAATGTTGGTGTGTGTGTTCTTATCTGGTTAAGAGAAAGTGAAAAAAACACAGCGAAAAGAAANCGAANATGTGACAAATATC
+
????????????????????????????????????????????????????????????????????????????????????????????????????
@DRR002013.4 HWUSI-EAS679_0026:1:1:8755:1109/1
GTGAAGATTCAGTTTCAGTCCTTCATCCTGCTCTGCACACCAGGCTTCCAGATCCNTCGCTGGACGGATTTCCGGCACCCNGTTANGACCACACTGCTCA
csv: Daily weather observations from NOAA's global climate network, 2024 (CSV).
| STATION | DATE | ELEMENT | VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME |
|---|---|---|---|---|---|---|---|
| ASN00009647 | 20240101 | PRCP | 30 | | | a | |
| ASN00009678 | 20240101 | PRCP | 0 | | | a | |
| ASN00009692 | 20240101 | PRCP | 0 | | | a | |
| ASN00009710 | 20240101 | PRCP | 0 | | | a | |
| ASN00009714 | 20240101 | PRCP | 0 | | | a | |
| ASN00009738 | 20240101 | PRCP | 0 | | | a | |
| ASN00009741 | 20240101 | TMAX | 209 | | | S | |
| ASN00009741 | 20240101 | PRCP | 0 | | | S | |
parquet: U.S. airline on-time flight records (Bureau of Transportation Statistics), stored column-wise as Apache Parquet.
| Year | Quarter | Month | DayofMonth | DayOfWeek | FlightDate | Reporting_Airline |
|---|---|---|---|---|---|---|
| 2024 | 1 | 1 | 8 | 1 | 2024-01-08 | 9E |
| 2024 | 1 | 1 | 9 | 2 | 2024-01-09 | 9E |
| 2024 | 1 | 1 | 10 | 3 | 2024-01-10 | 9E |
| 2024 | 1 | 1 | 11 | 4 | 2024-01-11 | 9E |
| 2024 | 1 | 1 | 12 | 5 | 2024-01-12 | 9E |
| 2024 | 1 | 1 | 15 | 1 | 2024-01-15 | 9E |
sqlite: USDA's nutrition database, with foods, nutrients, and portions across 17 related tables (SR Legacy).
food (5 columns)
| fdc_id | data_type | description | food_category_id | publication_date |
|---|---|---|---|---|
| 167512 | sr_legacy_food | Pillsbury Golden L | 18 | 2019-04-01 |
| 167513 | sr_legacy_food | Pillsbury, Cinnamo | 18 | 2019-04-01 |
| 167514 | sr_legacy_food | Kraft Foods, Shake | 18 | 2019-04-01 |
| 167515 | sr_legacy_food | George Weston Bake | 18 | 2019-04-01 |
| 167516 | sr_legacy_food | Waffles, buttermil | 18 | 2019-04-01 |
| 167517 | sr_legacy_food | Waffle, buttermilk | 18 | 2019-04-01 |
exe: A compiled Linux executable, the Hugo static-site generator.
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
00000010: 0200 3e00 0100 0000 405b 4400 0000 0000 ..>.....@[D.....
00000020: 4000 0000 0000 0000 e055 b903 0000 0000 @........U......
00000030: 0000 0000 4000 3800 0e00 4000 2800 2700 ....@.8...@.(.'.
00000040: 0600 0000 0400 0000 4000 0000 0000 0000 ........@.......
00000050: 4000 4000 0000 0000 4000 4000 0000 0000 @.@.....@.@.....
00000060: 1003 0000 0000 0000 1003 0000 0000 0000 ................
00000070: 0800 0000 0000 0000 0300 0000 0400 0000 ................
00000080: 5003 0000 0000 0000 5003 4000 0000 0000 P.......P.@.....
00000090: 5003 4000 0000 0000 1c00 0000 0000 0000 P.@.............
photo: NASA's “Blue Marble”, Earth photographed from Apollo 17.

movie: A clip from the open film Big Buck Bunny (H.264 video).

weights: The trained weights of a small neural network (safetensors).
| tensor | dtype | shape |
|---|---|---|
| embeddings.position_ids | I64 | 1×512 |
| embeddings.LayerNorm.bias | F32 | 384 |
| embeddings.LayerNorm.weight | F32 | 384 |
| embeddings.position_embeddings.weight | F32 | 512×384 |
| embeddings.token_type_embeddings.weight | F32 | 2×384 |
| embeddings.word_embeddings.weight | F32 | 30522×384 |
| encoder.layer.0.attention.output.LayerNorm.bias | F32 | 384 |
| encoder.layer.0.attention.output.LayerNorm.weight | F32 | 384 |
00000000: 902c 0000 0000 0000 7b22 5f5f 6d65 7461 .,......{"__meta
00000010: 6461 7461 5f5f 223a 7b22 666f 726d 6174 data__":{"format
00000020: 223a 2270 7422 7d2c 2265 6d62 6564 6469 ":"pt"},"embeddi
00000030: 6e67 732e 706f 7369 7469 6f6e 5f69 6473 ngs.position_ids
00000040: 223a 7b22 6474 7970 6522 3a22 4936 3422 ":{"dtype":"I64"
00000050: 2c22 7368 6170 6522 3a5b 312c 3531 325d ,"shape":[1,512]
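The hex dump above follows the safetensors layout: an 8-byte little-endian length, then a JSON header that describes each tensor. A minimal sketch of reading that header, which is enough to rebuild a tensor table like the one above. The function name is made up for illustration, and this is not necessarily how the page produces its table.

```python
import json
import struct


def read_safetensors_header(blob: bytes) -> dict:
    """Return the per-tensor entries from the start of a safetensors file.

    Layout: 8-byte little-endian unsigned header length, then that many
    bytes of UTF-8 JSON. The optional "__metadata__" key is dropped.
    """
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)
    return header
```

Each remaining entry maps a tensor name to its `dtype`, `shape`, and `data_offsets`. Applied to the first eight bytes shown in the dump (90 2c 00 00 00 00 00 00), the length field decodes to 11408 bytes of JSON.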
symbols: DWARF debug symbols from a Lua 5.4.8 build compiled with -g (a debug-info file, not a runnable program).
00000000: cffa edfe 0700 0001 0300 0000 0a00 0000 ................
00000010: 0900 0000 3808 0000 0000 0000 0000 0000 ....8...........
00000020: 1b00 0000 1800 0000 9805 68b6 c829 3ba7 ..........h..);.
00000030: af32 1008 12e6 c0e2 3200 0000 1800 0000 .2......2.......
00000040: 0100 0000 0000 0d00 0005 1a00 0000 0000 ................
00000050: 0200 0000 1800 0000 0010 0000 db02 0000 ................
00000060: b03d 0000 9127 0000 1900 0000 4800 0000 .=...'......H...
00000070: 5f5f 5041 4745 5a45 524f 0000 0000 0000 __PAGEZERO......
00000080: 0000 0000 0000 0000 0000 0000 0100 0000 ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
wasm: The SQLite engine compiled to WebAssembly (stack-machine bytecode).
00000000: 0061 736d 0100 0000 01db 055e 6002 7f7f .asm.......^`...
00000010: 017f 6001 7f01 7f60 037f 7f7f 017f 6001 ..`....`......`.
00000020: 7f00 6003 7f7f 7f00 6002 7f7f 0060 047f ..`.....`....`..
00000030: 7f7f 7f01 7f60 057f 7f7f 7f7f 017f 6004 .....`........`.
00000040: 7f7f 7f7f 0060 047f 7f7f 7e01 7f60 067f .....`....~..`..
00000050: 7f7f 7f7f 7f01 7f60 027f 7e01 7f60 0000 .......`..~..`..
00000060: 6005 7f7f 7f7f 7f00 6001 7c01 7c60 0001 `.......`.|.|`..
00000070: 7f60 017f 017e 6007 7f7f 7f7f 7f7f 7f01 .`...~`.........
00000080: 7f60 067f 7f7f 7f7f 7f00 6002 7f7e 0060 .`........`..~.`
00000090: 087f 7f7f 7f7f 7f7f 7f01 7f60 037f 7e7f ...........`..~.
winexe: The fd file-finder as a Windows PE executable.
00000000: 4d5a 9000 0300 0000 0400 0000 ffff 0000 MZ..............
00000010: b800 0000 0000 0000 4000 0000 0000 0000 ........@.......
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 0000 0000 0000 0000 f000 0000 ................
00000040: 0e1f ba0e 00b4 09cd 21b8 014c cd21 5468 ........!..L.!Th
00000050: 6973 2070 726f 6772 616d 2063 616e 6e6f is program canno
00000060: 7420 6265 2072 756e 2069 6e20 444f 5320 t be run in DOS
00000070: 6d6f 6465 2e0d 0d0a 2400 0000 0000 0000 mode....$.......
00000080: ecdd 9412 a8bc fa41 a8bc fa41 a8bc fa41 .......A...A...A
00000090: d13d ff40 25bc fa41 d13d fe40 a4bc fa41 .=.@%..A.=.@...A
armexe: The hyperfine benchmarking tool as an ARM64 Linux executable.
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
00000010: 0300 b700 0100 0000 accf 0100 0000 0000 ................
00000020: 4000 0000 0000 0000 30ec 1100 0000 0000 @.......0.......
00000030: 0000 0000 4000 3800 0a00 4000 1f00 1e00 ....@.8...@.....
00000040: 0600 0000 0500 0000 4000 0000 0000 0000 ........@.......
00000050: 4000 0000 0000 0000 4000 0000 0000 0000 @.......@.......
00000060: 3002 0000 0000 0000 3002 0000 0000 0000 0.......0.......
00000070: 0800 0000 0000 0000 0300 0000 0400 0000 ................
00000080: 7002 0000 0000 0000 7002 0000 0000 0000 p.......p.......
00000090: 7002 0000 0000 0000 1b00 0000 0000 0000 p...............
The same kinds of data at gigabyte scale, where long-range matching and big compression windows start to matter. Still being assembled.
weights-smollm2-135m.safetensors: SmolLM2-135M, a small (135M-parameter) language model's weights (Apache-2.0). The middle rung of the weights size-ladder.
nasa-http-jul-aug-1995.log: Scale-tier file, a large rung for large-window and throughput testing.
big-buck-bunny-1080p.mov: The full open film Big Buck Bunny in 1080p H.264 video.
bts-ontime-2022-2024.parquet: Scale-tier file, a large rung for large-window and throughput testing.
weights-qwen2.5-0.5b.safetensors: Qwen2.5-0.5B, a 0.5B-parameter language model's weights (Apache-2.0). The second rung of the weights size-ladder.
enwik9.txt: The first billion bytes of an English Wikipedia XML dump (the Hutter-Prize text).
ecoli-DRR002013-full.fastq: Scale-tier file, a large rung for large-window and throughput testing.
noaa-ghcn-daily-2024-full.csv: Scale-tier file, a large rung for large-window and throughput testing.
clang-releases-16-17-18-19.tar: Four LLVM/Clang release source trees concatenated, a real software archive.
llvm-project-19.1.0.src.tar: Scale-tier file, a large rung for large-window and throughput testing.
weights-qwen2.5-1.5b.safetensors: Qwen2.5-1.5B, a larger (1.5B-parameter) language model's weights (Apache-2.0). The top rung of the ladder; multi-GB, for large-window and throughput work.
noaa-ghcn-daily-2021-2023.csv: Scale-tier file, a large rung for large-window and throughput testing.