openalex bucket, under the data/ prefix.
Two formats
The snapshot is published in two formats, each a complete copy of the same data:

| Format | Compression | One record per | Prefix |
|---|---|---|---|
| JSON Lines | gzip (.gz) | line | /data/jsonl/ |
| Apache Parquet | snappy (.parquet) | row | /data/parquet/ |
Parquet is rolling out to the free public snapshot with the June 2026 quarterly release. Enterprise API-key users already receive both formats in the daily staging snapshot (see Download to your machine).
Bucket structure
Under each format prefix there is one folder per entity type, plus a combined manifest.json:
| Core entities | Topic hierarchy | Lookup / aggregation entities |
|---|---|---|
| works | topics | keywords, concepts |
| authors | subfields | continents, countries |
| institutions | fields | institution-types, source-types, work-types |
| sources | domains | languages, licenses, sdgs |
| publishers, funders, awards | | |
Pre-2026 snapshots used a flat data/{entity}/ layout (JSON Lines only) with no jsonl/ and parquet/ split. That older layout, along with the merged_ids/ directory, is preserved under the legacy-data/ prefix.

Within each entity folder, records are partitioned by updated_date. For example, a Work with updated_date of 2026-06-24 lives in the updated_date=2026-06-24 partition of the works folder.

Each partition is split into files (part_0000.*, part_0001.*, …) of up to 400,000 records each.
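As a rough illustration of the 400,000-record cap, the sketch below estimates the minimum number of part files a partition needs and lists their names. It assumes files are filled to the cap before a new one is started, which the text above does not state, so a real partition may contain more, smaller files. The helper name `part_file_stems` is hypothetical, and file extensions are left out on purpose.

```python
MAX_RECORDS_PER_FILE = 400_000  # cap stated in the snapshot docs


def part_file_stems(record_count: int) -> list[str]:
    """Return the fewest part-file stems that can hold record_count records.

    Assumption: each file is filled to the cap before the next one starts.
    """
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    file_count = (record_count + MAX_RECORDS_PER_FILE - 1) // MAX_RECORDS_PER_FILE
    # Stems follow the part_0000, part_0001, ... pattern (four-digit, zero-padded).
    return [f"part_{i:04d}" for i in range(file_count)]


if __name__ == "__main__":
    print(part_file_stems(1_000_000))
```

Treat the result as a lower bound when budgeting parallel downloads. The manifest remains the authoritative list of files.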
Size
The gzip-compressed JSON Lines snapshot is approximately 330 GB and decompresses to about 1.6 TB. The Parquet copy is provided alongside it, so downloading both formats roughly doubles the transfer. If you only need one format, download a single format prefix.
Entity schemas
The structure of each entity type matches the API response format:
Works
Authors
Sources
Institutions
Topics
Publishers
API-only fields: Some Work properties are only available through the API and are not included in the snapshot. For example, for content_url, use the content endpoint directly with work IDs from the snapshot.
Keeping your snapshot up to date
The updated_date partitions make incremental updates straightforward. Unlike dated snapshots that each contain the full dataset, each partition contains only the records that last changed on that date.
How partitions work
Imagine launching OpenAlex with 1,000 Authors, all created on 2024-01-01:
updated_date=2024-01-15 to get everything that changed or was added since then. The Parquet copy under /data/parquet/authors/ is partitioned exactly the same way.
You never need to re-download a partition you already have. Anything that changed has moved to a newer partition.
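Because a changed record reappears in a newer partition, a local copy has to replace its older version of that record when the newer partition is loaded. The sketch below is one way to do that with an in-memory upsert keyed on each record's `id`. It assumes the partitions have already been downloaded and parsed into lists of dicts. The function name `apply_partitions` and the short sample IDs are hypothetical.

```python
def apply_partitions(local: dict, partitions: dict) -> dict:
    """Upsert records from newly downloaded partitions into a local store.

    local:      maps record id -> record (the copy you already have)
    partitions: maps partition name ('updated_date=YYYY-MM-DD') -> list of records
    """
    # ISO dates sort lexicographically, so sorting the names applies the
    # oldest partition first and lets the newest version of a record win.
    for name in sorted(partitions):
        for record in partitions[name]:
            local[record["id"]] = record
    return local


if __name__ == "__main__":
    store = {
        "A1": {"id": "A1", "display_name": "Old name"},
        "A2": {"id": "A2", "display_name": "Unchanged"},
    }
    downloaded = {
        "updated_date=2024-01-15": [
            {"id": "A1", "display_name": "New name"},  # changed record
            {"id": "A3", "display_name": "Added"},      # new record
        ],
    }
    apply_partitions(store, downloaded)
    print(sorted(store))
```

A production loader would more likely upsert into a database table than a dict, but the ordering rule is the same: apply partitions from oldest to newest.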
The manifest files
Every format has a combined manifest.json listing all data files across all entities (/data/{format}/manifest.json), and every entity has its own manifest.json (/data/{format}/{entity}/manifest.json). The manifest is written last, after all data files are uploaded, so if it’s present the data for that format is complete.
A per-entity manifest looks like this:
entities array, with a top-level meta carrying the totals.
To use a manifest for incremental updates:
Check for new partitions
Get the file list from the url property of each item in the files list. Identify any updated_date partitions you haven’t seen before.
Verify consistency
Download the manifest again. If it hasn’t changed since step 1, no records moved between partitions during your download.
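The steps above can be sketched as a few small helpers that work on an already-parsed manifest. This is a minimal sketch, not the official client: it relies only on the `files` list and the `url` property named above, and assumes each URL contains an `updated_date=YYYY-MM-DD` path segment. The sample URLs and the helper names are hypothetical.

```python
import re

# Matches the partition segment inside a file URL.
PARTITION_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})")


def partitions_in_manifest(manifest: dict) -> set[str]:
    """Collect the updated_date values that appear in the manifest's file URLs."""
    found = set()
    for item in manifest["files"]:
        match = PARTITION_RE.search(item["url"])
        if match:
            found.add(match.group(1))
    return found


def new_partitions(manifest: dict, seen: set[str]) -> list[str]:
    """Return partitions not yet downloaded, oldest first."""
    return sorted(partitions_in_manifest(manifest) - set(seen))


def manifest_unchanged(before: dict, after: dict) -> bool:
    """True if the manifest fetched after the download equals the one fetched before."""
    return before == after


if __name__ == "__main__":
    # Illustrative manifest; real entries carry more properties than url.
    manifest = {
        "files": [
            {"url": "s3://openalex/data/jsonl/authors/updated_date=2024-01-01/part_0000.gz"},
            {"url": "s3://openalex/data/jsonl/authors/updated_date=2024-01-15/part_0000.gz"},
        ]
    }
    print(new_partitions(manifest, seen={"2024-01-01"}))
```

If `manifest_unchanged` returns False after a download, repeating the new-partition check against the fresh manifest picks up anything that moved in the meantime.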