The OpenAlex snapshot is hosted on Amazon S3 and is free to download. You don’t need an AWS account.
Many thanks to the AWS Open Data program, which covers the data-transfer fees (about $70 per download) so users don’t have to.

Prerequisites

Install the AWS CLI. All commands in this guide use it. No account or credentials are needed — the --no-sign-request flag provides anonymous access. You can also browse the snapshot in your browser: openalex.s3.amazonaws.com/browse.html

Download the full snapshot

This copies everything in the openalex S3 bucket to a local folder.
aws s3 sync "s3://openalex" "openalex-snapshot" --no-sign-request
This pulls both formats (JSON Lines under data/jsonl/ and Parquet under data/parquet/) plus the legacy-data/ prefix, so it’s well over 660 GB. Most users want a single format — see Download a single format or entity type below.
If you’re downloading into a folder that already holds a previous snapshot, add the --delete flag so that sync removes local files that no longer exist in the bucket. Without it, stale files from old partitions stay on disk, and entities that have moved between partitions will appear twice.
aws s3 sync "s3://openalex" "openalex-snapshot" --no-sign-request --delete
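Because --delete removes local files, it can be worth previewing the changes first. The sketch below wraps the same command with --dryrun, a standard aws s3 sync flag that prints the planned actions without performing them. The helper name preview_sync is hypothetical and not part of the AWS CLI or OpenAlex tooling.

```shell
# Preview a --delete sync without touching any local files.
# preview_sync is a hypothetical helper name used only for illustration.
preview_sync() {
  dest="${1:-openalex-snapshot}"   # local folder; defaults to the name used in this guide
  aws s3 sync "s3://openalex" "$dest" --no-sign-request --delete --dryrun
}

# Usage (needs network access):
#   preview_sync openalex-snapshot | grep -c '^(dryrun) delete:'
```

The grep in the usage line assumes the CLI prefixes planned deletions with "(dryrun) delete:"; check the raw output if the count looks wrong.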

Check the current size

The snapshot size changes over time. Check before downloading:
aws s3 ls --summarize --human-readable --no-sign-request --recursive "s3://openalex/"
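The recursive listing prints one line per object before the summary, which is a lot of output if you only want the total. As a rough sketch, the filter below keeps just the size figure. It assumes the summary contains a line of the form "Total Size: ...", and total_size is a hypothetical helper name.

```shell
# Print only the "Total Size" value from `aws s3 ls --summarize` output on stdin.
# total_size is a hypothetical helper name used only for illustration.
total_size() {
  awk -F': ' '/Total Size:/ { print $2 }'
}

# Usage (needs network access), here limited to the Parquet prefix:
#   aws s3 ls --summarize --human-readable --no-sign-request --recursive \
#     "s3://openalex/data/parquet/" | total_size
```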

File structure

After downloading, you’ll have a structure like this. Under data/, the snapshot is split into two format folders, jsonl/ and parquet/, each holding a complete copy of every entity:
openalex-snapshot/
├── LICENSE.txt
├── RELEASE_NOTES.txt
├── legacy-data/              # pre-2026 flat layout + merged_ids (kept for back-compat)
└── data
    ├── jsonl
    │   ├── manifest.json      # all entities
    │   ├── works
    │   │   ├── manifest.json
    │   │   └── updated_date=2026-06-24
    │   │       ├── part_0000.gz
    │   │       └── part_0001.gz
    │   ├── authors
    │   │   └── ...
    │   ├── institutions
    │   ├── sources
    │   ├── publishers
    │   ├── funders
    │   ├── awards
    │   └── ...               # topics, keywords, concepts, countries, etc.
    └── parquet
        ├── manifest.json
        ├── works
        │   ├── manifest.json
        │   └── updated_date=2026-06-24
        │       ├── part_0000.parquet
        │       └── part_0001.parquet
        └── ...
See Snapshot data format for the full entity list, the partition structure, and how to keep your copy up to date.
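Once files are on disk, a quick sanity check is to look inside a part file. The sketch below assumes the part_*.gz files under data/jsonl/ are gzip-compressed JSON Lines with one record per line, as the layout above suggests. The helper names peek_part and count_records are hypothetical.

```shell
# Print the first N records (default 1) of a gzipped JSON Lines part file.
peek_part() {
  gzip -dc "$1" | head -n "${2:-1}"
}

# Count records across every part file under a directory (one record per line).
count_records() {
  find "$1" -type f -name 'part_*.gz' -exec gzip -dc {} + | awk 'END { print NR }'
}

# Usage, with the example partition shown in the tree above:
#   peek_part "openalex-snapshot/data/jsonl/works/updated_date=2026-06-24/part_0000.gz"
#   count_records "openalex-snapshot/data/jsonl/works"
```

Counting decompresses every file under the directory, so on a large entity type such as works it will take a while.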

Download a single format or entity type

If you only need one format, sync its prefix:
# JSON Lines only
aws s3 sync "s3://openalex/data/jsonl" "openalex-snapshot/data/jsonl" --no-sign-request

# Parquet only
aws s3 sync "s3://openalex/data/parquet" "openalex-snapshot/data/parquet" --no-sign-request
To narrow further to a single entity type, add it to the prefix:
aws s3 sync "s3://openalex/data/jsonl/works" "openalex-snapshot/data/jsonl/works" --no-sign-request
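If you need a handful of entity types rather than one, the same prefix pattern can be looped. This is a minimal sketch that assumes every entity folder follows the data/FORMAT/ENTITY layout shown in the file structure. sync_entity is a hypothetical helper name.

```shell
# Sync one entity type in one format.
# sync_entity is a hypothetical helper name used only for illustration.
sync_entity() {
  format="$1"                      # jsonl or parquet
  entity="$2"                      # e.g. works, authors, sources
  dest="${3:-openalex-snapshot}"   # local root folder
  aws s3 sync "s3://openalex/data/${format}/${entity}" \
    "${dest}/data/${format}/${entity}" --no-sign-request
}

# Usage (needs network access):
#   for entity in authors institutions sources; do
#     sync_entity parquet "$entity"
#   done
```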

Alternatives to local download

If you don’t want to download files locally, some services can read directly from S3:
  • Amazon Redshift: Load from S3 using the manifest files
  • ETL tools with S3 connectors (Xplenty, Airbyte, etc.)
For these approaches, the snapshot data format documentation should have enough detail to get started.

Download with an enterprise API key

Enterprise users can download a daily-refreshed snapshot. Each day’s full snapshot is published to dated folders in the openalex-snapshots staging bucket, in both JSON Lines and Parquet.
  1. Add this to ~/.aws/config (replace YOUR_KEY with your OpenAlex API key):
[profile openalex]
credential_process = curl -sf -X POST "https://api.openalex.org/snapshots/credentials?api_key=YOUR_KEY"
The AWS CLI will fetch and refresh credentials automatically.
  2. Browse and download:
aws s3 ls s3://openalex-snapshots/full/ --profile openalex

aws s3 sync s3://openalex-snapshots/full/2026-04-29/jsonl/ ./openalex-snapshot-jsonl --profile openalex

aws s3 sync s3://openalex-snapshots/full/2026-04-29/parquet/ ./openalex-snapshot-parquet --profile openalex
Each dated folder under full/ is a complete snapshot built that day, so you can pull a fresh full copy daily rather than waiting for the quarterly public release. Both formats are included; Parquet is also being added to the free quarterly public snapshot beginning June 2026. If you’re interested in a daily-refreshed enterprise snapshot, contact sales@openalex.org.
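To pull the newest daily build without hard-coding a date, you can pick the latest folder from the listing. The sketch below assumes folders under full/ are named as ISO dates (as in the 2026-04-29 example), so a plain sort is chronological, and that aws s3 ls marks folders with a PRE prefix. latest_snapshot_date is a hypothetical helper name.

```shell
# Read `aws s3 ls` output on stdin and print the most recent dated folder name.
# latest_snapshot_date is a hypothetical helper name used only for illustration.
latest_snapshot_date() {
  awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }' | sort | tail -n 1
}

# Usage (needs the openalex profile configured as described above):
#   latest="$(aws s3 ls s3://openalex-snapshots/full/ --profile openalex | latest_snapshot_date)"
#   aws s3 sync "s3://openalex-snapshots/full/${latest}/parquet/" \
#     ./openalex-snapshot-parquet --profile openalex
```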