| Format | Files | Size |
|---|---|---|
| PDF | ~60M | ~250 TB |
| TEI XML | ~43M | ~20 TB |
Download options
Option 1: API (up to ~10K files)
Download files one at a time from the content API. Each download costs $0.01.
| Extension | Format |
|---|---|
| .pdf | PDF (default) |
| .grobid-xml | TEI XML |
Use the has_content filter to find works with available content, or check the content_url field on any work object to see if content is available.
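The content_url check can be sketched in Python as follows. This is a minimal sketch, not an official client. It assumes that the download URL is formed by appending one of the extensions from the table above to content_url, and that the key is passed as an api_key query parameter; both are assumptions to verify against the API reference. The helper names and the example.org host are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Extensions taken from the Extension/Format table above.
EXTENSIONS = {"pdf": ".pdf", "tei": ".grobid-xml"}


def has_downloadable_content(work: dict) -> bool:
    """Return True when the work object carries a non-empty content_url."""
    return bool(work.get("content_url"))


def build_download_url(
    work: dict, fmt: str = "pdf", api_key: Optional[str] = None
) -> Optional[str]:
    """Build a download URL for one work, or None if no content is listed.

    Assumption: the URL is content_url plus the format extension, and the
    key travels as an `api_key` query parameter.
    """
    if not has_downloadable_content(work):
        return None
    url = work["content_url"].rstrip("/") + EXTENSIONS[fmt]
    if api_key:
        url += "?" + urlencode({"api_key": api_key})
    return url


# Hypothetical work object, trimmed to the fields used here.
example_work = {
    "id": "https://openalex.org/W0000000000",
    "content_url": "https://content.example.org/works/W0000000000",
}
print(build_download_url(example_work, "tei"))
```

Keeping the URL construction separate from the HTTP request makes it easy to filter a batch of work objects first and pay only for the downloads you actually need.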
A free API key includes $1 of usage per day, so at $0.01 per download you can fetch about 100 files per day. This is a good fit for research projects, building small corpora, or sampling.
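The daily limit follows directly from the two prices quoted above. A small sketch of the arithmetic, using integer cents to avoid floating-point rounding (the function names are illustrative only):

```python
def files_per_day(daily_budget_cents: int = 100, cost_per_file_cents: int = 1) -> int:
    """Number of downloads a daily budget covers ($1/day at $0.01 each by default)."""
    return daily_budget_cents // cost_per_file_cents


def days_needed(n_files: int, per_day: int) -> int:
    """Days required to fetch n_files at per_day downloads a day (rounded up)."""
    return -(-n_files // per_day)  # ceiling division


per_day = files_per_day()            # 100 files per day
print(per_day, days_needed(10_000, per_day))  # a 10K-file corpus takes 100 days
```

At the free rate, the ~10K-file ceiling for this option corresponds to roughly 100 days of downloading, which is why larger corpora are better served by the other options.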
Option 2: OpenAlex CLI (up to a few million files)
For larger downloads, use the OpenAlex CLI. It handles parallel downloads, retries, checkpointing, and resume automatically.
Option 3: Complete archive sync
For the complete archive (all 60M files), we provide direct access to the storage bucket via time-limited credentials. Files are stored on Cloudflare R2, which is fully S3-compatible.
One-time download: 30-day R2 read access to sync the complete archive.
Ongoing sync: Persistent R2 read access is included with the enterprise subscription.
See the pricing page for details, or contact us to get started.
How it works:
At typical network speeds, expect 1–2 weeks to download the full archive.
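To see what the 1–2 week estimate implies for your own connection, the transfer time can be approximated from the ~250 TB archive size. This is a back-of-the-envelope sketch that assumes decimal terabytes (10^12 bytes) and a sustained rate with no protocol overhead or throttling; real syncs will be somewhat slower.

```python
SECONDS_PER_DAY = 86_400


def days_to_transfer(size_tb: float, gbit_per_s: float) -> float:
    """Days needed to move size_tb terabytes at a sustained gbit_per_s."""
    bits = size_tb * 1e12 * 8
    return bits / (gbit_per_s * 1e9) / SECONDS_PER_DAY


def required_gbit_per_s(size_tb: float, days: float) -> float:
    """Sustained rate needed to move size_tb terabytes within the given days."""
    bits = size_tb * 1e12 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


for days in (7, 14):
    print(f"{days} days -> {required_gbit_per_s(250, days):.2f} Gbit/s sustained")
print(f"1 Gbit/s -> {days_to_transfer(250, 1):.1f} days")
```

Under these assumptions, finishing in one to two weeks requires a sustained rate of roughly 1.7 to 3.3 Gbit/s, while a 1 Gbit/s link needs a little over three weeks, which still fits inside the 30-day access window.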
File naming
Files are named by UUID, not work ID. The locations array in each work contains pdf_url fields that include the UUID.
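One way to map downloaded files back to works is to pull the UUID out of each pdf_url. The sketch below assumes the UUID appears in the canonical 8-4-4-4-12 hexadecimal form somewhere in the URL; the exact URL layout is not specified here, so the example URL and the function name are hypothetical.

```python
import re
from typing import Dict

# Canonical UUID shape: 8-4-4-4-12 hexadecimal characters.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def uuid_to_work_id(work: dict) -> Dict[str, str]:
    """Map every UUID found in the work's pdf_url fields to the work's ID."""
    mapping: Dict[str, str] = {}
    for location in work.get("locations") or []:
        pdf_url = location.get("pdf_url") or ""  # pdf_url may be null
        match = UUID_RE.search(pdf_url)
        if match:
            mapping[match.group(0).lower()] = work["id"]
    return mapping


# Hypothetical work object, trimmed to the fields used here.
example_work = {
    "id": "https://openalex.org/W0000000000",
    "locations": [
        {"pdf_url": "https://files.example.org/123e4567-e89b-12d3-a456-426614174000.pdf"},
        {"pdf_url": None},
    ],
}
print(uuid_to_work_id(example_work))
```

Merging the dictionaries from many works gives a lookup table from file name to work ID that you can apply after a bulk download.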
Licensing
The PDFs retain their original copyright. OpenAlex does not grant any additional rights to the content. To check the license for a specific work, use the best_oa_location.license field in the API.
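Reading that field from a work object can be sketched as follows. The sketch treats both best_oa_location and license as possibly null, and fetches a single work from the api.openalex.org works endpoint; the helper names are illustrative, and the work ID passed to fetch_work is whatever ID you are checking.

```python
import json
from typing import Optional
from urllib.request import urlopen


def best_oa_license(work: dict) -> Optional[str]:
    """Return best_oa_location.license, or None when either level is missing."""
    location = work.get("best_oa_location") or {}
    return location.get("license")


def fetch_work(work_id: str) -> dict:
    """Fetch one work object by its OpenAlex ID (performs a network request)."""
    with urlopen(f"https://api.openalex.org/works/{work_id}") as response:
        return json.load(response)


# Offline examples with hand-written work objects.
print(best_oa_license({"best_oa_location": {"license": "cc-by"}}))
print(best_oa_license({"best_oa_location": None}))
```

A None result means no license is recorded for the work, which should not be read as permission to reuse the content.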