OpenAlex has cached full-text content for about 60 million works:
| Format | Files | Size |
|---|---|---|
| PDF | ~60M | ~250 TB |
| TEI XML | ~43M | ~20 TB |
TEI XML files are machine-readable structured text, produced by parsing the PDFs with GROBID.
Download options
Option 1: API (up to ~10K files)
Download files one at a time from the content API. Each download costs $0.01.
https://content.openalex.org/works/{work_id}.pdf?api_key=YOUR_KEY
| Extension | Format |
|---|---|
| .pdf | PDF (default) |
| .grobid-xml | TEI XML |
curl "https://content.openalex.org/works/W3038568908.pdf?api_key=YOUR_KEY"
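As a rough sketch of how the URL pattern and the extension table combine, the helper below builds a content URL for one work. The function name `content_url` and the short format labels are illustrative assumptions, not part of any OpenAlex client; only the URL pattern and the two extensions come from the text above.

```python
from typing import Optional
from urllib.parse import urlencode

CONTENT_BASE = "https://content.openalex.org/works"

# Extensions from the table above; the short keys are our own labels.
EXTENSIONS = {"pdf": ".pdf", "xml": ".grobid-xml"}


def content_url(work_id: str, fmt: str = "pdf", api_key: Optional[str] = None) -> str:
    """Build the content API URL for one work (hypothetical helper)."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(EXTENSIONS)}")
    url = f"{CONTENT_BASE}/{work_id}{EXTENSIONS[fmt]}"
    if api_key:
        # Append the key as a query parameter, as in the curl example.
        url += "?" + urlencode({"api_key": api_key})
    return url
```

Pass the resulting URL to whatever HTTP client you already use. Keep in mind that each successful download is billed.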
Finding downloadable works:
Use the has_content filter to find works with available content:
# Works with PDFs from 2024
https://api.openalex.org/works?filter=has_content.pdf:true,publication_year:2024
# CC-BY works with PDFs
https://api.openalex.org/works?filter=has_content.pdf:true,best_oa_location.license:cc-by
You can also check the content_urls field on any work object to see if content is available.
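If you assemble these filter URLs in code, a small helper along these lines may be handy. It is only a sketch: `works_filter_url` is a hypothetical name, and it assumes the comma-separated `key:value` form shown in the examples above is all you need.

```python
from urllib.parse import urlencode

WORKS_ENDPOINT = "https://api.openalex.org/works"


def works_filter_url(filters: dict) -> str:
    """Join key:value pairs with commas, as in the example URLs above."""
    parts = []
    for key, value in filters.items():
        # Booleans are written lowercase in the filter syntax shown above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    # Keep ':' and ',' unescaped so the URL matches the documented form.
    return WORKS_ENDPOINT + "?" + urlencode({"filter": ",".join(parts)}, safe=":,")
```

For example, `works_filter_url({"has_content.pdf": True, "publication_year": 2024})` reproduces the first URL above.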
A free API key includes $1 of credit per day, which covers about 100 files per day at $0.01 each. That is enough for research projects, building small corpora, or sampling.
Option 2: OpenAlex CLI (up to a few million files)
For larger downloads, use the OpenAlex CLI. It handles parallel downloads, retries, checkpointing, and resume automatically.
pip install openalex-official
Download PDFs for a specific topic:
openalex download \
--api-key YOUR_KEY \
--output ./climate-pdfs \
--filter "topics.id:T10325,has_content.pdf:true" \
--content pdf
Download works with Creative Commons licenses:
openalex download \
--api-key YOUR_KEY \
--output ./cc-pdfs \
--filter "best_oa_location.license:cc-by,has_content.pdf:true" \
--content pdf
Download metadata + PDFs + TEI XML:
openalex download \
--api-key YOUR_KEY \
--output ./my-corpus \
--filter "topics.id:T10325,has_content.pdf:true" \
--content pdf,xml
Standard rates apply ($0.01 per content file; metadata is free). At full speed, you can download a few million files in a few days.
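To budget a download before starting it, the arithmetic is simple enough to sketch. The helper names below are hypothetical; the only input taken from this page is the $0.01 per-file rate.

```python
PRICE_CENTS_PER_FILE = 1  # $0.01 per content file, per the rates above


def download_cost_usd(n_files: int) -> float:
    """Total cost in dollars, computed from integer cents."""
    return n_files * PRICE_CENTS_PER_FILE / 100


def files_per_day(daily_budget_usd: float) -> int:
    """How many content files a given daily budget covers."""
    return int(round(daily_budget_usd * 100)) // PRICE_CENTS_PER_FILE
```

At this rate, 10,000 files cost $100 and one million files cost $10,000, while a $1 daily budget covers 100 files.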
Option 3: Complete archive sync
For the complete archive (all 60M files), we provide direct access to the storage bucket via time-limited credentials. Files are stored on Cloudflare R2, which is fully S3-compatible.
One-time download: 30-day R2 read access to sync the complete archive.
Ongoing sync: Persistent R2 read access is included with the enterprise subscription.
See the pricing page for details, or contact us to get started.
How it works:
Get credentials
We generate R2 API credentials with read-only access.
Sync with AWS CLI
aws s3 sync s3://openalex-pdfs ./pdfs \
--endpoint-url https://a452eddbbe06eb7d02f4879cee70d29c.r2.cloudflarestorage.com
Stay current
For ongoing sync, re-run the same sync command periodically to pick up new files.
At typical network speeds, expect 1–2 weeks to download the full archive.
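What counts as a typical speed varies a lot between environments, so it can help to work out the sustained throughput a given time window implies. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and uses the approximate archive sizes from the table at the top of this page. The function name is our own.

```python
def required_gbps(total_tb: float, days: float) -> float:
    """Sustained throughput (gigabits/s) needed to move total_tb in `days`.

    Assumes decimal units: 1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s.
    """
    total_bits = total_tb * 10**12 * 8
    seconds = days * 86_400
    return total_bits / seconds / 10**9
```

Under these assumptions, the ~250 TB of PDFs alone works out to roughly 1.7 Gbps sustained over 14 days, or about 3.3 Gbps over 7 days.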
The archive is production-ready but still early, and we add and clean files continuously, so treat it as a living dataset you re-sync periodically rather than a frozen snapshot. As a sensible default, validate each file on ingest (check the leading %PDF magic bytes and skip implausibly small objects) so your pipeline is robust to any in-flight changes.
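One way to implement that ingest check is sketched below. The 1024-byte minimum is an assumed placeholder for "implausibly small", not a threshold we publish, so tune it for your corpus. The function name is likewise hypothetical.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF"
MIN_BYTES = 1024  # assumed threshold for "implausibly small"; tune as needed


def looks_like_pdf(path: Path, min_bytes: int = MIN_BYTES) -> bool:
    """Return True if the file is large enough and starts with %PDF."""
    if path.stat().st_size < min_bytes:
        return False
    with path.open("rb") as fh:
        # Only the first four bytes are read, so this stays cheap at scale.
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```

This only checks the file header. It does not guarantee that the PDF is complete or parseable, so a downstream parser should still handle failures.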
Mapping work IDs to files
Files in the archive bucket are named by UUID, not work ID. A work’s PDF and its TEI XML are stored as separate objects with different UUIDs — the file names alone won’t tell you which PDF and which XML belong to the same work:
openalex-pdfs/
├── 3a07228e-de2a-4c37-955d-b1411a498328.pdf
├── 7b12f4a1-8c9d-4e5f-a2b3-c4d5e6f7a8b9.pdf
└── ...
openalex-grobid-xml/
├── 5e6ea859-62f6-40ab-9f7e-6472aab3a19d.xml.gz
├── 523b7cb3-17f3-4e2f-b81c-867fd8802182.xml.gz
└── ...
The manifest (for archive sync)
If you’ve synced the bucket, use the manifest we publish in the openalex-pdfs bucket itself:
s3://openalex-pdfs/_manifest/content_index/
It’s a set of Parquet files, regenerated in full once a day, that maps every work to its file UUIDs. A normal aws s3 sync of the bucket pulls the manifest down alongside the content (into a local _manifest/content_index/ folder), so you typically already have it.
| Column | Example | Description or object key |
|---|---|---|
| openalex_id | W2741809807 | the OpenAlex work ID |
| pdf_uuid | 3a07228e-de2a-4c37-955d-b1411a498328 | openalex-pdfs/{pdf_uuid}.pdf |
| grobid_xml_id | 5e6ea859-62f6-40ab-9f7e-6472aab3a19d | openalex-grobid-xml/{grobid_xml_id}.xml.gz |
| updated_date | 2026-06-20 | when this work’s content last changed |
UUIDs are stored without their file extension — append .pdf and .xml.gz to build the object keys. A work may have a PDF but no XML (GROBID can’t parse every PDF), so grobid_xml_id is sometimes empty.
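A minimal sketch of that key construction, assuming each manifest row has been loaded as a plain dict keyed by the column names above (the helper name `object_keys` is our own):

```python
def object_keys(row: dict) -> dict:
    """Map one manifest row to its object keys (None when a file is absent)."""
    pdf_uuid = row.get("pdf_uuid")
    xml_uuid = row.get("grobid_xml_id")  # may be empty when GROBID had no parse
    return {
        "pdf": f"openalex-pdfs/{pdf_uuid}.pdf" if pdf_uuid else None,
        "grobid_xml": f"openalex-grobid-xml/{xml_uuid}.xml.gz" if xml_uuid else None,
    }
```

Returning `None` for a missing XML keeps the empty case explicit, so callers do not accidentally build a key like `openalex-grobid-xml/.xml.gz`.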
Query it directly with any Parquet reader — no need to load it into a database:
import duckdb
# Look up one work
duckdb.sql("""
SELECT * FROM '_manifest/content_index/*.parquet'
WHERE openalex_id = 'W2741809807'
""")
The join is then entirely local, with no API calls:
- Join openalex_id to your works snapshot to attach metadata.
- Join pdf_uuid / grobid_xml_id to your synced files to attach the PDF and TEI XML.
Because the manifest is a full daily replacement, re-pull the _manifest/ prefix whenever you re-sync the archive.
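The local join described above might look roughly like this in plain Python. It is a sketch under several assumptions: manifest rows are already loaded as dicts (however you read the Parquet files), `works_by_id` is your own lookup built from the works snapshot, and synced PDFs sit flat in one local directory named `{pdf_uuid}.pdf`. The function name is hypothetical.

```python
from pathlib import Path


def attach_local_files(manifest_rows, works_by_id, pdf_dir):
    """Yield one merged record per manifest row whose PDF exists locally."""
    pdf_dir = Path(pdf_dir)
    for row in manifest_rows:
        if not row.get("pdf_uuid"):
            continue
        pdf_path = pdf_dir / f"{row['pdf_uuid']}.pdf"
        if not pdf_path.exists():
            continue  # not synced yet, or removed since the manifest was built
        yield {
            "openalex_id": row["openalex_id"],
            "pdf_path": pdf_path,
            # None when the work is missing from your snapshot lookup.
            "metadata": works_by_id.get(row["openalex_id"]),
        }
```

At archive scale you would more likely express the same join in SQL over the Parquet files. The generator form is shown here only because it makes each step easy to see.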
The content endpoint (for per-work fetches)
If you’re fetching files one at a time through the content API rather than syncing the bucket, you don’t deal with UUIDs at all — the endpoint resolves the work ID to the underlying object server-side:
https://content.openalex.org/works/{work_id}.pdf
Work objects from the API expose this directly via has_content and content_urls:
"has_content": { "pdf": true, "grobid_xml": true },
"content_urls": { "pdf": "https://content.openalex.org/works/W1775749144.pdf",
"grobid_xml": "https://content.openalex.org/works/W1775749144.grobid-xml" }
Filter your candidate set with has_content.pdf:true (available in the snapshot and the API; content_urls is API-only), then fetch each work by ID. This is convenient for small-to-medium sets, but at archive scale prefer the manifest above — the join is offline and costs nothing.
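Given a work object shaped like the excerpt above, picking out the fetchable URLs can be sketched as follows. The helper name `available_content` is an assumption of ours, and the code relies only on the two fields shown in the excerpt.

```python
def available_content(work: dict) -> dict:
    """Return {format: url} for every format flagged true in has_content."""
    flags = work.get("has_content") or {}
    urls = work.get("content_urls") or {}
    # Require both the flag and a URL, in case one is missing from the record.
    return {fmt: urls[fmt] for fmt, ok in flags.items() if ok and fmt in urls}
```

A work with no cached content simply yields an empty dict, which makes it easy to skip in a download loop.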
Do not map work IDs to archive objects via locations[].pdf_url. That field is the original publisher URL, not the archive object — keying off it produces a lossy, inferred match. Use the manifest or the content endpoint instead.
TEI XML quality and limitations
GROBID is the state of the art for converting scholarly PDFs to structured TEI XML. But PDF parsing is genuinely hard, and a meaningful share of files will contain errors: missing or duplicated references, occasional self-references (the paper’s own DOI picked up from its header or footer), wrong or partial header metadata, and the odd truncated section. You can see how GROBID performs field-by-field in the project’s own benchmarks.
GROBID also can’t parse every PDF. It doesn’t do OCR, so scanned or image-only PDFs produce little or nothing useful, and unusual or malformed PDFs can fail outright. Filter on has_content.grobid_xml:true to limit your set to works where we have a parse.
We pass GROBID’s output through unchanged, including any errors it made. You can catch some with simple guard clauses — for example, drop any reference whose DOI matches the work’s own DOI, and cross-check GROBID’s header metadata against the OpenAlex Work record (which draws on Crossref and other sources independently of GROBID).
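The self-reference guard could be sketched like this. It assumes you have already extracted references from the TEI XML into dicts with a `doi` key. That shape, and both function names, are hypothetical. DOIs are compared case-insensitively after stripping common URL prefixes.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip common URL/scheme prefixes; None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def drop_self_references(work_doi, references):
    """Remove references whose DOI equals the work's own DOI."""
    own = normalize_doi(work_doi)
    if own is None:
        return list(references)  # nothing to compare against
    return [r for r in references if normalize_doi(r.get("doi")) != own]
```

References without a DOI are kept, since there is nothing to compare. Deduplicating them would need a separate heuristic, such as title matching.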
A lot of people today actually want Markdown rather than XML for LLM and RAG pipelines. We don’t have Markdown parses yet, though they’re on the roadmap. In the meantime you can roll your own by downloading the PDFs we provide and running them through one of the newer PDF→Markdown tools like Marker, Docling, or MinerU.
Licensing
The PDFs retain their original copyright. OpenAlex does not grant any additional rights to the content.
To check the license for a specific work, use the best_oa_location.license field in the API:
https://api.openalex.org/works?filter=has_content.pdf:true,best_oa_location.license:cc-by
This returns works that have downloadable PDFs and are licensed under CC BY.