All OpenAlex data is stored in Amazon S3 in the openalex bucket, under the data/ prefix.

Two formats

The snapshot is published in two formats, each a complete copy of the same data:
| Format | Compression | One record per | Prefix |
|---|---|---|---|
| JSON Lines | gzip (.gz) | line | /data/jsonl/ |
| Apache Parquet | snappy (.parquet) | row | /data/parquet/ |
Parquet is rolling out to the free public snapshot with the June 2026 quarterly release. Enterprise API-key users already receive both formats in the daily staging snapshot (see Download to your machine).

Bucket structure

Under each format prefix there is one folder per entity type, plus a combined manifest.json:
s3://openalex/data/
├── jsonl/
│   ├── manifest.json          # all entities, this format
│   ├── works/
│   │   ├── manifest.json       # works only
│   │   └── updated_date=2026-06-24/
│   │       ├── part_0000.gz
│   │       └── part_0001.gz
│   ├── authors/
│   └── ...
└── parquet/
    ├── manifest.json
    ├── works/
    │   ├── manifest.json
    │   └── updated_date=2026-06-24/
    │       ├── part_0000.parquet
    │       └── part_0001.parquet
    └── ...
The entity folders under each format are:
| Core entities | Topic hierarchy | Lookup / aggregation entities |
|---|---|---|
| works | topics | keywords, concepts |
| authors | subfields | continents, countries |
| institutions | fields | institution-types, source-types, work-types |
| sources | domains | languages, licenses, sdgs |
| publishers, funders, awards | | |
You can browse the bucket at openalex.s3.amazonaws.com/browse.html.
Pre-2026 snapshots used a flat data/{entity}/ layout (JSON Lines only) with no jsonl/ or parquet/ split. That older layout, along with the merged_ids/ directory, is preserved under the legacy-data/ prefix.
Records are partitioned by updated_date. For example, a Work with updated_date of 2026-06-24 lives under:
/data/jsonl/works/updated_date=2026-06-24/      # JSON Lines
/data/parquet/works/updated_date=2026-06-24/    # Parquet
Each partition holds one or more part files (part_0000.*, part_0001.*, …) of up to 400,000 records each.
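If you need to recover the format, entity, and partition date from an object key, a small parser is usually enough. The following is a minimal sketch: the regular expression is inferred from the example paths above, and the names `PartFile` and `parse_part_key` are hypothetical helpers, not part of any OpenAlex tooling.

```python
import re
from datetime import date
from typing import NamedTuple

# Assumption: keys follow the layout shown in the examples above, i.e.
# data/{format}/{entity}/updated_date=YYYY-MM-DD/part_NNNN.{gz|parquet}
_PART_RE = re.compile(
    r"data/(?P<fmt>jsonl|parquet)/(?P<entity>[^/]+)/"
    r"updated_date=(?P<updated_date>\d{4}-\d{2}-\d{2})/"
    r"(?P<part>part_\d+\.(?:gz|parquet))$"
)


class PartFile(NamedTuple):
    """Hypothetical container for the pieces of a part-file key."""
    fmt: str
    entity: str
    updated_date: date
    part: str


def parse_part_key(key: str) -> PartFile:
    """Split an S3 key or s3:// URL into format, entity, date, and file name."""
    m = _PART_RE.search(key)
    if m is None:
        raise ValueError(f"not a snapshot part file: {key!r}")
    return PartFile(
        fmt=m.group("fmt"),
        entity=m.group("entity"),
        updated_date=date.fromisoformat(m.group("updated_date")),
        part=m.group("part"),
    )


info = parse_part_key(
    "s3://openalex/data/jsonl/works/updated_date=2026-06-24/part_0000.gz"
)
print(info.entity, info.updated_date.isoformat(), info.part)
```

Parsing the date into a `datetime.date` rather than keeping it as a string makes later comparisons against your last sync date unambiguous.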

Size

The gzip-compressed JSON Lines snapshot is approximately 330 GB and decompresses to about 1.6 TB. The Parquet copy is provided alongside it, so downloading both formats roughly doubles the transfer. If you only need one format, download a single format prefix.

Entity schemas

The structure of each entity type matches the API response format:

Works

Authors

Sources

Institutions

Topics

Publishers

API-only fields: Some Work properties are available only through the API and are not included in the snapshot. One example is content_url; to retrieve content, call the content endpoint directly with work IDs from the snapshot.

Keeping your snapshot up to date

The free public snapshot is refreshed quarterly. A daily-refreshed snapshot and daily change files (via the Changefiles API) require a paid plan. Contact sales@openalex.org.
The updated_date partitions make incremental updates straightforward. Unlike dated snapshots that each contain the full dataset, each partition contains only the records that last changed on that date.

How partitions work

Imagine launching OpenAlex with 1,000 Authors, all created on 2024-01-01:
/data/jsonl/authors/
├── manifest.json
└── updated_date=2024-01-01 [1000 Authors]
    ├── part_0000.gz
    └── ...
If we update 50 of those Authors on 2024-01-15, they move out of the old partition and into the new one:
/data/jsonl/authors/
├── manifest.json
├── updated_date=2024-01-01 [950 Authors]
│   └── ...
└── updated_date=2024-01-15 [50 Authors]
    └── ...
If we also discover 50 new Authors, they go into the same new partition:
/data/jsonl/authors/
├── manifest.json
├── updated_date=2024-01-01 [950 Authors]
│   └── ...
└── updated_date=2024-01-15 [100 Authors]
    └── ...
So if you made your snapshot copy on 2024-01-01, you only need to download updated_date=2024-01-15 to get everything that changed or was added since then. The Parquet copy under /data/parquet/authors/ is partitioned exactly the same way.
To update a snapshot copy that you created or updated on date X, insert or update the records in partitions where updated_date > X.
You never need to re-download a partition you already have. Anything that changed has moved to a newer partition.
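The Authors walkthrough above can be modelled in a few lines. This is a toy sketch, not how OpenAlex builds the snapshot: the record IDs (`A0`, `A1`, ...) and the helper `partitions_to_fetch` are hypothetical, and each record is represented only by the date it last changed.

```python
from collections import Counter
from datetime import date

# Toy model: each record ID maps to the date it last changed.
# A record lives in exactly one partition, the one named by that date.
authors = {f"A{i}": date(2024, 1, 1) for i in range(1000)}

# On 2024-01-15, 50 existing Authors are updated and 50 new ones are added.
for i in range(50):
    authors[f"A{i}"] = date(2024, 1, 15)       # moves out of the old partition
for i in range(1000, 1050):
    authors[f"A{i}"] = date(2024, 1, 15)       # newly discovered

# Number of records per updated_date partition.
partition_sizes = Counter(authors.values())


def partitions_to_fetch(partition_dates, last_sync):
    """Return partitions strictly newer than last_sync, oldest first."""
    return sorted(d for d in set(partition_dates) if d > last_sync)


to_fetch = partitions_to_fetch(partition_sizes, last_sync=date(2024, 1, 1))
for d in to_fetch:
    print(f"updated_date={d.isoformat()}: {partition_sizes[d]} Authors")
# prints "updated_date=2024-01-15: 100 Authors"
```

The comparison is strict (`>`), matching the rule above: a copy made on date X already contains the partition for X.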

The manifest files

Every format has a combined manifest.json listing all data files across all entities (/data/{format}/manifest.json), and every entity has its own manifest.json (/data/{format}/{entity}/manifest.json). The manifest is written last, after all data files are uploaded, so if it’s present the data for that format is complete. A per-entity manifest looks like this:
{
  "date": "2026-06-25",
  "format": "jsonl",
  "entity": "works",
  "record_count": 264000000,
  "content_length": 350000000000,
  "files": [
    {
      "url": "s3://openalex/data/jsonl/works/updated_date=2026-06-24/part_0000.gz",
      "meta": { "content_length": 936733, "record_count": 499 }
    }
  ]
}
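A manifest shaped like the example above is easy to sanity-check once loaded. The sketch below sums the per-file `meta` values and compares them with the top-level totals. It assumes the top-level `record_count` and `content_length` equal the sums over `files`, which the example suggests but does not state; the sample values are illustrative, and `file_totals` and `totals_match` are hypothetical helper names.

```python
import json

# Illustrative manifest in the per-entity shape shown above (values made up).
SAMPLE = json.loads("""
{
  "date": "2026-06-25",
  "format": "jsonl",
  "entity": "works",
  "record_count": 1000,
  "content_length": 1876733,
  "files": [
    {
      "url": "s3://openalex/data/jsonl/works/updated_date=2026-06-24/part_0000.gz",
      "meta": {"content_length": 936733, "record_count": 499}
    },
    {
      "url": "s3://openalex/data/jsonl/works/updated_date=2026-06-24/part_0001.gz",
      "meta": {"content_length": 940000, "record_count": 501}
    }
  ]
}
""")


def file_totals(manifest):
    """Sum record_count and content_length over every entry in files."""
    files = manifest["files"]
    records = sum(f["meta"]["record_count"] for f in files)
    size = sum(f["meta"]["content_length"] for f in files)
    return records, size


def totals_match(manifest):
    """Assumption: top-level totals are the sums of the per-file meta values."""
    expected = (manifest["record_count"], manifest["content_length"])
    return file_totals(manifest) == expected


print(file_totals(SAMPLE), totals_match(SAMPLE))
```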
The combined manifest has the same shape but nests one entry per entity under an entities array, with a top-level meta carrying the totals. To use a manifest for incremental updates:
1. Download the manifest
   aws s3 cp s3://openalex/data/jsonl/works/manifest.json ./manifest.json --no-sign-request

2. Check for new partitions
   Get the file list from the url property of each item in the files list. Identify any updated_date partitions you haven’t seen before.

3. Download new partitions
   Download objects with new updated_date values.

4. Verify consistency
   Download the manifest again. If it hasn’t changed since step 1, no records moved between partitions during your download.

5. Load the data
   Read each file (decompress and parse one JSON entity per line for JSON Lines; read columns directly for Parquet). Insert or update into your database using each entity’s ID as the primary key.
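The non-download parts of the steps above (2, 4, and 5) can be sketched as plain functions over an already-downloaded manifest and part files. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the in-memory `dict` stands in for your database, and each JSON record is assumed to carry its ID in an `id` field. Downloading itself is left to the aws CLI as in step 1.

```python
import gzip
import json
import re

# Assumption: partition dates appear in URLs as updated_date=YYYY-MM-DD/.
_DATE_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})/")


def partition_of(url):
    """Extract the updated_date value (as a string) from a file URL."""
    m = _DATE_RE.search(url)
    if m is None:
        raise ValueError(f"no updated_date partition in {url!r}")
    return m.group(1)


def new_files(manifest, seen_partitions):
    """Step 2: URLs whose updated_date partition has not been seen before."""
    return [
        f["url"]
        for f in manifest["files"]
        if partition_of(f["url"]) not in seen_partitions
    ]


def manifest_unchanged(before, after):
    """Step 4: compare the manifest fetched before and after downloading."""
    return before == after


def upsert_jsonl_gz(path, table):
    """Step 5: read one JSON entity per line and upsert it keyed by ID.

    `table` is a dict standing in for a database table.
    Assumption: every record has an "id" field.
    Returns the number of records read.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            table[record["id"]] = record  # insert, or overwrite an older copy
            count += 1
    return count
```

Keying the upsert on the entity ID is what makes re-running a load safe: a record that appears again in a newer partition simply overwrites the older copy.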
If you have an existing data pipeline, these details may be all you need. For step-by-step download instructions, see Download to your machine.