OpenAlex has cached full-text content for about 60 million works:
| Format | Files | Size |
|---|---|---|
| PDF | ~60M | ~250 TB |
| TEI XML | ~43M | ~20 TB |
Download options
Option 1: API (up to ~10K files)
Download files one at a time from the content API. Each download costs $0.01.
| Extension | Format |
|---|---|
| .pdf | PDF (default) |
| .grobid-xml | TEI XML |
Use the has_content filter to find works with available content, or check the content_url field on any work object to see if content is available.
With a free API key ($1/day), you can download about 100 files per day. Good for research projects, building small corpora, or sampling.
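The per-file price and the daily credit translate directly into a download budget. The sketch below only does that arithmetic, using the $0.01 per download and $1/day figures quoted above; the function names are hypothetical, and amounts are kept in whole cents to avoid floating-point rounding.

```python
import math

COST_PER_FILE_CENTS = 1    # $0.01 per download, as quoted above
DAILY_CREDIT_CENTS = 100   # $1/day with a free API key, as quoted above


def files_per_day(daily_credit_cents: int = DAILY_CREDIT_CENTS) -> int:
    """How many downloads one day's credit covers."""
    return daily_credit_cents // COST_PER_FILE_CENTS


def download_plan(n_files: int, daily_credit_cents: int = DAILY_CREDIT_CENTS) -> dict:
    """Total cost of n_files, and days needed if you only spend the daily credit."""
    per_day = files_per_day(daily_credit_cents)
    return {
        "total_cost_usd": n_files * COST_PER_FILE_CENTS / 100,
        "days_on_daily_credit": math.ceil(n_files / per_day),
    }


if __name__ == "__main__":
    print(files_per_day())        # 100
    print(download_plan(10_000))  # {'total_cost_usd': 100.0, 'days_on_daily_credit': 100}
```

At the ~10K-file upper end of this option, that is about $100 in total, or roughly 100 days if you rely on the free daily credit alone.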
Option 2: OpenAlex CLI (up to a few million files)
For larger downloads, use the OpenAlex CLI. It handles parallel downloads, retries, checkpointing, and resume automatically.
Option 3: Complete archive sync
For the complete archive (all 60M files), we provide direct access to the storage bucket via time-limited credentials. Files are stored on Cloudflare R2, which is fully S3-compatible.
One-time download: 30-day R2 read access to sync the complete archive.
Ongoing sync: Persistent R2 read access is included with the enterprise subscription.
See the pricing page for details, or contact us to get started.
How it works:
At typical network speeds, expect 1–2 weeks to download the full archive.
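To check the 1 to 2 week estimate against your own link, a back-of-the-envelope calculation helps. This is a rough sketch, not a benchmark: it assumes the ~250 TB PDF size from the table above, decimal units (1 TB = 10^12 bytes), a perfectly sustained transfer rate, and no protocol or request overhead. The function name is hypothetical.

```python
def transfer_days(size_tb: float, sustained_gbit_per_s: float) -> float:
    """Days needed to move size_tb terabytes at a sustained rate in Gbit/s."""
    size_bits = size_tb * 1e12 * 8                      # decimal TB -> bits
    seconds = size_bits / (sustained_gbit_per_s * 1e9)  # Gbit/s -> bit/s
    return seconds / 86_400                             # seconds per day


if __name__ == "__main__":
    for rate in (1.0, 2.0, 4.0):
        print(f"{rate:.0f} Gbit/s: {transfer_days(250, rate):.1f} days")
```

Under these assumptions, 250 TB at a sustained 2 Gbit/s takes about 11.6 days, so finishing within two weeks requires a link that can hold that kind of rate for the whole sync.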
The archive is production-ready but still early, and we add and clean files continuously, so treat it as a living dataset you re-sync periodically rather than a frozen snapshot. As a sensible default, validate each file on ingest (check the leading %PDF magic bytes and skip implausibly small objects) so your pipeline is robust to any in-flight changes.
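The ingest check described above can be sketched in a few lines. The 1 KiB minimum size is an assumed placeholder, not an OpenAlex recommendation, and both function names are hypothetical; tune the threshold to your corpus.

```python
import os

PDF_MAGIC = b"%PDF"
MIN_PLAUSIBLE_BYTES = 1024  # assumed threshold for "implausibly small"


def looks_like_pdf(data: bytes, min_bytes: int = MIN_PLAUSIBLE_BYTES) -> bool:
    """True if the payload is large enough and starts with the %PDF magic bytes."""
    if len(data) < min_bytes:
        return False
    return data.startswith(PDF_MAGIC)


def file_looks_like_pdf(path: str, min_bytes: int = MIN_PLAUSIBLE_BYTES) -> bool:
    """Same check for a file on disk, reading only the first few bytes."""
    if os.path.getsize(path) < min_bytes:
        return False
    with open(path, "rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
```

Files that fail the check can be skipped and retried on the next sync rather than treated as fatal errors.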
Mapping work IDs to files
Files in the archive bucket are named by UUID, not work ID.
To fetch by work ID, use https://content.openalex.org/works/{work_id}.pdf, an addressable endpoint that resolves the work ID to the underlying archive object server-side. Filter your candidate set with has_content.pdf:true (available in the snapshot and the API; content_urls is API-only), then fetch each work by ID.
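A minimal sketch of building content URLs from work IDs follows. It assumes the .grobid-xml extension from the table in Option 1 works on the same endpoint pattern as .pdf, and that your work IDs arrive either bare or as full https://openalex.org/... URLs. The helper names and the example ID are illustrative only, and authentication is left out.

```python
CONTENT_BASE = "https://content.openalex.org/works"
EXTENSIONS = {"pdf": ".pdf", "grobid_xml": ".grobid-xml"}

# Candidate set: works that have a cached PDF (filter named in the text above).
CANDIDATES_URL = "https://api.openalex.org/works?filter=has_content.pdf:true"


def short_work_id(work_id: str) -> str:
    """Reduce 'https://openalex.org/W123' (or a bare 'W123') to 'W123'."""
    return work_id.rstrip("/").rsplit("/", 1)[-1]


def content_url(work_id: str, fmt: str = "pdf") -> str:
    """Build the content endpoint URL for a work ID and format."""
    return f"{CONTENT_BASE}/{short_work_id(work_id)}{EXTENSIONS[fmt]}"


if __name__ == "__main__":
    print(content_url("https://openalex.org/W2741809807"))
    print(content_url("W2741809807", "grobid_xml"))
```

Because the endpoint resolves the ID server-side, your pipeline never needs to know the UUID of the underlying archive object.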
TEI XML quality and limitations
GROBID is the state of the art for converting scholarly PDFs to structured TEI XML. But PDF parsing is genuinely hard, and a meaningful share of files will contain errors: missing or duplicated references, occasional self-references (the paper’s own DOI picked up from its header or footer), wrong or partial header metadata, and the odd truncated section. You can see how GROBID performs field-by-field in the project’s own benchmarks.
GROBID also can’t parse every PDF. It doesn’t do OCR, so scanned or image-only PDFs produce little or nothing useful, and unusual or malformed PDFs can fail outright. Filter on has_content.grobid_xml:true to limit your set to works where we have a parse.
We pass GROBID’s output through unchanged, including any errors it made. You can catch some with simple guard clauses — for example, drop any reference whose DOI matches the work’s own DOI, and cross-check GROBID’s header metadata against the OpenAlex Work record (which draws on Crossref and other sources independently of GROBID).
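The self-reference guard can be sketched with the standard library alone. This assumes the usual GROBID TEI layout, with references as biblStruct elements inside listBibl and DOIs in idno elements with type="DOI"; check this against your own files before relying on it. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def reference_dois(tei_xml: str, own_doi: str) -> list[str]:
    """Return reference DOIs from a TEI document, dropping the work's own DOI."""
    root = ET.fromstring(tei_xml)
    own = normalize_doi(own_doi)
    kept = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        for idno in bibl.iterfind(".//tei:idno[@type='DOI']", TEI_NS):
            doi = normalize_doi(idno.text or "")
            if doi and doi != own:  # guard clause: skip empty and self-references
                kept.append(doi)
    return kept
```

Pass the DOI from the OpenAlex Work record as own_doi, so the comparison uses metadata that does not come from GROBID itself.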
A lot of people today actually want Markdown rather than XML for LLM and RAG pipelines. We don’t have Markdown parses yet, though they’re on the roadmap. In the meantime you can roll your own by downloading the PDFs we provide and running them through one of the newer PDF→Markdown tools like Marker, Docling, or MinerU.
Licensing
The PDFs retain their original copyright. OpenAlex does not grant any additional rights to the content. To check the license for a specific work, use the best_oa_location.license field in the API.
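As a small sketch, the license can be read from a work object you have already fetched and parsed into a dict. The helper name is hypothetical, and it assumes best_oa_location may be missing or null, in which case no license is reported.

```python
from typing import Optional


def work_license(work: dict) -> Optional[str]:
    """Return best_oa_location.license, or None if it is absent."""
    location = work.get("best_oa_location") or {}
    return location.get("license")


if __name__ == "__main__":
    example = {"best_oa_location": {"license": "cc-by"}}  # illustrative record
    print(work_license(example))
```

Treat a None result as "license unknown" rather than as permission to reuse the file.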