openalex bucket. The data files are gzip-compressed JSON Lines — one entity per line.
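As a minimal sketch of what "gzip-compressed JSON Lines" means in practice, the helper below streams one parsed entity per line from a data file that has already been downloaded. The function name `iter_entities` and the local file path are illustrative assumptions, not part of the snapshot tooling.

```python
import gzip
import json
from typing import Iterator


def iter_entities(path: str) -> Iterator[dict]:
    """Yield one parsed entity per non-empty line of a gzip-compressed JSON Lines file."""
    # Text mode ("rt") lets gzip handle decompression and decoding in one step.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

Because the function is a generator, it never holds more than one entity in memory, which matters for files of this size.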
Bucket structure
The bucket contains one prefix (folder) for each entity type:

| Entity | S3 prefix | Browse |
|---|---|---|
| Works | /data/works/ | Browse |
| Authors | /data/authors/ | Browse |
| Sources | /data/sources/ | Browse |
| Institutions | /data/institutions/ | Browse |
| Topics | /data/topics/ | Browse |
| Domains | /data/domains/ | Browse |
| Fields | /data/fields/ | Browse |
| Subfields | /data/subfields/ | Browse |
| Publishers | /data/publishers/ | Browse |
| Funders | /data/funders/ | Browse |
| Concepts | /data/concepts/ | Browse |
Within each entity type, files are further prefixed by updated_date. For example, an Author with an updated_date of 2024-01-15 lives under the updated_date=2024-01-15 prefix inside /data/authors/.
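The date prefixing can be sketched as a small helper. This assumes the partition folder is named `updated_date=YYYY-MM-DD` and sits directly under the entity prefix from the table; the function name `partition_prefix` is hypothetical.

```python
from datetime import date


def partition_prefix(entity_prefix: str, updated: date) -> str:
    """Build the key prefix for one updated_date partition (layout is an assumption)."""
    # Normalise the trailing slash so both "/data/authors" and "/data/authors/" work.
    return f"{entity_prefix.rstrip('/')}/updated_date={updated.isoformat()}/"


print(partition_prefix("/data/authors/", date(2024, 1, 15)))
# prints: /data/authors/updated_date=2024-01-15/
```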
Size
The gzip-compressed snapshot is approximately 330 GB and decompresses to about 1.6 TB.
Entity schemas
The structure of each entity type matches the API response format.
API-only fields: Some Work properties are only available through the API and are not included in the snapshot. For example, content_url is API-only; use the content endpoint directly with work IDs from the snapshot.
Keeping your snapshot up to date
The updated_date partitions make incremental updates straightforward. Unlike dated snapshots that each contain the full dataset, each partition contains only the records that last changed on that date.
How partitions work
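Imagine launching OpenAlex with 1,000 Authors, all created on 2024-01-01: every one of them sits in the updated_date=2024-01-01 partition. If some of those Authors are updated, or new Authors are added, on 2024-01-15, those records land in the updated_date=2024-01-15 partition. If you already downloaded the 2024-01-01 partition, you only need to download updated_date=2024-01-15 to get everything that changed or was added since then.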
You never need to re-download a partition you already have. Anything that changed has moved to a newer partition.
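One way to picture this is as an upsert into a local store: applying partitions from oldest to newest lets a changed record overwrite its earlier copy. The sketch below assumes each record carries an `id` field and uses made-up Authors; `apply_partition` is a hypothetical helper, not an OpenAlex tool.

```python
from typing import Dict, Iterable


def apply_partition(store: Dict[str, dict], records: Iterable[dict]) -> None:
    """Upsert records from one partition into a local store keyed by entity ID."""
    for record in records:
        store[record["id"]] = record


# Hypothetical data: two Authors loaded from the 2024-01-01 partition.
store: Dict[str, dict] = {}
apply_partition(store, [
    {"id": "A1", "display_name": "Old name", "updated_date": "2024-01-01"},
    {"id": "A2", "display_name": "Unchanged", "updated_date": "2024-01-01"},
])

# Later, the 2024-01-15 partition carries the changed A1 plus a new A3.
apply_partition(store, [
    {"id": "A1", "display_name": "New name", "updated_date": "2024-01-15"},
    {"id": "A3", "display_name": "Added", "updated_date": "2024-01-15"},
])
```

After both calls the store holds three Authors, with A1 in its updated form, even though the 2024-01-01 partition was never fetched a second time.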
Merged entities
Alongside the entity type folders, you’ll find a merged_ids folder. This contains the IDs of entities that have been merged, along with the IDs they were merged into.
Merging is a way of deleting an entity while preserving its ID. In practice, you can just delete the entity. You don’t need to track the merge date or target.
For example, if A2257618939 is listed as a merged ID, you can simply delete A2257618939 from your copy. The effects of merging (like crediting the target author with works) are already reflected in the affected entities.
Like the updated_date partitions, you only need to download merged_ids files that are new to you.
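A minimal sketch of applying merges to a local copy follows. It assumes the merged_ids files are gzip-compressed CSV with a header row that includes an `id` column for the merged-away entity, and that the local store is keyed by the same ID form; both are assumptions to check against the actual files. The helper names are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterable, Set


def merged_ids_from_csv(path: str, id_column: str = "id") -> Set[str]:
    """Read the merged-away IDs from one gzip-compressed CSV file (format assumed)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return {row[id_column] for row in csv.DictReader(handle)}


def drop_merged(store: Dict[str, dict], merged: Iterable[str]) -> int:
    """Delete merged-away entities from the local store; return how many were removed."""
    removed = 0
    for entity_id in merged:
        # pop() with a default ignores IDs that were never stored locally.
        if store.pop(entity_id, None) is not None:
            removed += 1
    return removed
```

The merge target is deliberately ignored, in line with the text above: deleting the merged-away entity is all that is needed.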
The manifest file
Each entity type has a manifest file that lists all data files. When we start writing a new updated_date partition, we delete the manifest. When we finish, we recreate it with the new files included. So if the manifest is present, all data files are complete.
The file uses Redshift manifest format. To use it for incremental updates:
1. Download the manifest.
2. Check for new partitions: get the file list from the url property of each item in the entries list. Identify any updated_date partitions you haven’t seen before.
3. Download the new partitions.
4. Verify consistency: download the manifest again. If it hasn’t changed since step 1, no records moved between partitions during your download.
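The manifest-based checks can be sketched as follows, assuming the manifest has already been fetched as raw bytes by whatever download tool you use. The `entries` list and `url` property come from the description above; the regular expression for partition names, the function names, and the sample bucket URLs are illustrative assumptions.

```python
import json
import re
from typing import List, Set

# Assumed partition naming: updated_date=YYYY-MM-DD somewhere in each file URL.
PARTITION_RE = re.compile(r"updated_date=\d{4}-\d{2}-\d{2}")


def partitions_in_manifest(manifest_bytes: bytes) -> Set[str]:
    """Collect the distinct updated_date partitions named in the manifest's entries."""
    manifest = json.loads(manifest_bytes)
    found = set()
    for entry in manifest["entries"]:
        match = PARTITION_RE.search(entry["url"])
        if match:
            found.add(match.group(0))
    return found


def new_partitions(manifest_bytes: bytes, seen: Set[str]) -> List[str]:
    """Return partitions not yet downloaded, oldest first (ISO dates sort correctly)."""
    return sorted(partitions_in_manifest(manifest_bytes) - seen)


def download_was_consistent(before: bytes, after: bytes) -> bool:
    """True if the manifest fetched after the download matches the one fetched before."""
    return before == after


# Hypothetical manifest content, for illustration only.
sample = json.dumps({"entries": [
    {"url": "s3://example-bucket/data/authors/updated_date=2024-01-01/part_000.gz"},
    {"url": "s3://example-bucket/data/authors/updated_date=2024-01-15/part_000.gz"},
]}).encode("utf-8")

print(new_partitions(sample, seen={"updated_date=2024-01-01"}))
# prints: ['updated_date=2024-01-15']
```

Comparing raw bytes is the strictest form of the final check; if it fails, a reasonable response is to repeat the procedure from the first step.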