OpenAlex has cached full-text content for about 60 million works:
Format    Files   Size
PDF       ~60M    ~250 TB
TEI XML   ~43M    ~20 TB
TEI XML files are machine-readable structured text that GROBID produces by parsing the PDFs.

Download options

Option 1: API (up to ~10K files)

Download files one at a time from the content API. Each download costs $0.01.
https://content.openalex.org/works/{work_id}.pdf?api_key=YOUR_KEY
Extension     Format
.pdf          PDF (default)
.grobid-xml   TEI XML
curl "https://content.openalex.org/works/W3038568908.pdf?api_key=YOUR_KEY"
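The same request can be scripted. The sketch below is a minimal Python illustration that assumes only the URL pattern and extensions shown above. The helper names content_url and download_work, and the short format keys "pdf" and "xml", are hypothetical labels chosen for this example. Error handling is left to the standard library's exceptions.

```python
import shutil
import urllib.parse
import urllib.request

CONTENT_BASE = "https://content.openalex.org/works"  # from the URL pattern above

# Extensions from the table above; the dictionary keys are our own labels.
EXTENSIONS = {"pdf": ".pdf", "xml": ".grobid-xml"}


def content_url(work_id: str, fmt: str = "pdf", api_key: str = "YOUR_KEY") -> str:
    """Build the content API URL for one work (hypothetical helper)."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{CONTENT_BASE}/{work_id}{EXTENSIONS[fmt]}?{query}"


def download_work(work_id: str, dest_path: str, fmt: str = "pdf",
                  api_key: str = "YOUR_KEY") -> None:
    """Stream one file to disk. Per the text above, each download costs $0.01."""
    url = content_url(work_id, fmt, api_key)
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)


# Building the URL makes no network request:
print(content_url("W3038568908"))
# https://content.openalex.org/works/W3038568908.pdf?api_key=YOUR_KEY
```

Calling download_work("W3038568908", "paper.pdf", api_key=...) with a real key would fetch the file; it is not invoked here because it is a billed request.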
Finding downloadable works: Use the has_content filter to find works with available content:
# Works with PDFs from 2024
https://api.openalex.org/works?filter=has_content.pdf:true,publication_year:2024

# CC-BY works with PDFs
https://api.openalex.org/works?filter=has_content.pdf:true,best_oa_location.license:cc-by
You can also check the content_urls field on any work object to see whether content is available.
With a free API key ($1/day), you can download about 100 files per day at $0.01 each. This option suits research projects, small corpora, or sampling.
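To turn a filter into a list of work IDs you can feed to the content API, something like the following Python sketch may help. It assumes details this page does not state: that a works response is a JSON object with a results list whose items carry an id such as https://openalex.org/W123, and that paging uses cursor and per-page parameters with the next cursor in meta.next_cursor. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.openalex.org/works"


def works_query_url(filters: dict, cursor: str = "*", per_page: int = 200) -> str:
    """Build a works query URL from filter key/value pairs (hypothetical helper)."""
    filter_str = ",".join(f"{key}:{value}" for key, value in filters.items())
    # Keep ':' ',' and '*' unescaped so the URL reads like the examples above.
    query = urllib.parse.urlencode(
        {"filter": filter_str, "per-page": per_page, "cursor": cursor},
        safe=":,*",
    )
    return f"{API_BASE}?{query}"


def work_ids(page: dict) -> list:
    """Extract short IDs (e.g. 'W123') from one page of results.

    Assumes each result has an 'id' like 'https://openalex.org/W123'.
    """
    return [work["id"].rsplit("/", 1)[-1] for work in page.get("results", [])]


def iter_work_ids(filters: dict):
    """Yield work IDs across all pages (assumes cursor paging via meta.next_cursor)."""
    cursor = "*"
    while cursor:
        with urllib.request.urlopen(works_query_url(filters, cursor)) as resp:
            page = json.load(resp)
        yield from work_ids(page)
        cursor = page.get("meta", {}).get("next_cursor")


print(works_query_url({"has_content.pdf": "true", "publication_year": 2024}))
```

iter_work_ids is defined but not called here, since it performs live requests. Separating URL construction and response parsing from the network loop keeps the parts that are easy to get wrong testable offline.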

Option 2: OpenAlex CLI (up to a few million files)

For larger downloads, use the OpenAlex CLI. It handles parallel downloads, retries, checkpointing, and resume automatically.
pip install openalex-official
Download PDFs for a specific topic:
openalex download \
  --api-key YOUR_KEY \
  --output ./climate-pdfs \
  --filter "topics.id:T10325,has_content.pdf:true" \
  --content pdf
Download works with Creative Commons licenses:
openalex download \
  --api-key YOUR_KEY \
  --output ./cc-pdfs \
  --filter "best_oa_location.license:cc-by,has_content.pdf:true" \
  --content pdf
Download metadata + PDFs + TEI XML:
openalex download \
  --api-key YOUR_KEY \
  --output ./my-corpus \
  --filter "topics.id:T10325,has_content.pdf:true" \
  --content pdf,xml
Standard rates apply ($0.01 per content file; metadata is free). At full speed, you can download a few million files in a few days.
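For budgeting, the arithmetic is simple enough to sketch. The example below uses only the two figures quoted on this page ($0.01 per content file and the $1/day free allowance mentioned under Option 1); the function names are hypothetical, and it ignores anything the pricing page may add, such as volume discounts.

```python
PRICE_CENTS_PER_FILE = 1        # $0.01 per content file (from the text above)
FREE_DAILY_BUDGET_CENTS = 100   # $1/day free allowance (from Option 1)


def download_cost_usd(n_files: int) -> float:
    """Cost in US dollars of downloading n_files content files."""
    return n_files * PRICE_CENTS_PER_FILE / 100


def days_on_free_key(n_files: int) -> int:
    """Days needed to fetch n_files using only the free daily allowance."""
    files_per_day = FREE_DAILY_BUDGET_CENTS // PRICE_CENTS_PER_FILE
    return -(-n_files // files_per_day)  # ceiling division


print(download_cost_usd(10_000))     # 100.0
print(days_on_free_key(10_000))      # 100
print(download_cost_usd(2_000_000))  # 20000.0
```

Working in integer cents avoids floating-point drift when multiplying by 0.01.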

Option 3: Complete archive sync

For the complete archive (all 60M files), we provide direct access to the storage bucket via time-limited credentials. Files are stored on Cloudflare R2, which is fully S3-compatible.
One-time download: 30-day R2 read access to sync the complete archive.
Ongoing sync: Persistent R2 read access is included with the enterprise subscription.
See the pricing page for details, or contact us to get started.
How it works:
1. Get credentials
We generate R2 API credentials with read-only access.
2. Sync with AWS CLI
aws s3 sync s3://openalex-pdfs ./pdfs \
  --endpoint-url https://a452eddbbe06eb7d02f4879cee70d29c.r2.cloudflarestorage.com
3. Stay current
For ongoing sync, rerun the same sync command periodically to get new files.
At typical network speeds, expect 1–2 weeks to download the full archive. The archive is production-ready but still early, and we add and clean files continuously, so treat it as a living dataset you re-sync periodically rather than a frozen snapshot. As a sensible default, validate each file on ingest (check the leading %PDF magic bytes and skip implausibly small objects) so your pipeline is robust to any in-flight changes.
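The ingest check described above could look roughly like this in Python. It is a sketch, not a prescribed validator: the 1024-byte minimum is an assumed threshold for "implausibly small" that you should tune for your corpus, and the function names are hypothetical.

```python
import os

PDF_MAGIC = b"%PDF"
MIN_BYTES = 1024  # assumption: anything smaller is treated as implausibly small


def plausible_pdf(head: bytes, size: int, min_bytes: int = MIN_BYTES) -> bool:
    """Check the leading bytes and total size of a candidate PDF."""
    return size >= min_bytes and head.startswith(PDF_MAGIC)


def looks_like_pdf(path: str, min_bytes: int = MIN_BYTES) -> bool:
    """Apply plausible_pdf to a file on disk, reading only its first bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(len(PDF_MAGIC))
    return plausible_pdf(head, size, min_bytes)


print(plausible_pdf(b"%PDF-1.7", 50_000))  # True
print(plausible_pdf(b"<html>", 50_000))    # False
print(plausible_pdf(b"%PDF-1.7", 10))      # False
```

This only confirms that a file starts like a PDF and is not trivially small. It does not prove the file is complete or parseable, so a stricter pipeline might also try opening each file with a PDF library.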

Mapping work IDs to files

Files in the archive bucket are named by UUID, not work ID:
openalex-pdfs/
├── 3a07228e-de2a-4c37-955d-b1411a498328.pdf
├── 7b12f4a1-8c9d-4e5f-a2b3-c4d5e6f7a8b9.pdf
└── ...

openalex-grobid-xml/
├── 3a07228e-de2a-4c37-955d-b1411a498328.xml.gz
├── 7b12f4a1-8c9d-4e5f-a2b3-c4d5e6f7a8b9.xml.gz
└── ...
You don’t need to reconstruct the UUID join yourself. Every work object exposes an authoritative mapping through two fields:
"has_content":  { "pdf": true, "grobid_xml": true },
"content_urls": { "pdf": "https://content.openalex.org/works/W1775749144.pdf",
                  "grobid_xml": "https://content.openalex.org/works/W1775749144.grobid-xml" }
https://content.openalex.org/works/{work_id}.pdf is an addressable endpoint that resolves the work ID to the underlying archive object server-side. Filter your candidate set with has_content.pdf:true (available in the snapshot and the API; content_urls is API-only), then fetch each work by ID.
Do not map work IDs to archive objects via locations[].pdf_url. That field is the original publisher URL, not the archive object — keying off it produces a lossy, inferred match. Use content_urls / the content endpoint instead.
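As an illustration of reading the mapping from a work object, the Python sketch below combines the two fields shown above. The helper name content_links is hypothetical, and the sample work is a made-up variation of the example above in which only the PDF is available.

```python
def content_links(work: dict) -> dict:
    """Map each available format to its content URL for one work object.

    Relies only on the 'has_content' and 'content_urls' fields shown above,
    and tolerates either field being missing or null.
    """
    available = work.get("has_content") or {}
    urls = work.get("content_urls") or {}
    return {
        fmt: urls[fmt]
        for fmt, present in available.items()
        if present and fmt in urls
    }


# Hypothetical work object: PDF cached, no GROBID parse.
work = {
    "has_content": {"pdf": True, "grobid_xml": False},
    "content_urls": {"pdf": "https://content.openalex.org/works/W1775749144.pdf"},
}
print(content_links(work))
# {'pdf': 'https://content.openalex.org/works/W1775749144.pdf'}
```

Note that locations[].pdf_url is deliberately never consulted, in line with the warning above.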

TEI XML quality and limitations

GROBID (project, docs) is the state of the art for converting scholarly PDFs to structured TEI XML. But PDF parsing is genuinely hard, and a meaningful share of files will contain errors: missing or duplicated references, occasional self-references (the paper’s own DOI picked up from its header or footer), wrong or partial header metadata, and the odd truncated section. You can see how GROBID performs field-by-field in the project’s own benchmarks.
GROBID also can’t parse every PDF. It doesn’t do OCR, so scanned or image-only PDFs produce little or nothing useful, and unusual or malformed PDFs can fail outright. Filter on has_content.grobid_xml:true to limit your set to works where we have a parse.
We pass GROBID’s output through unchanged, including any errors it made. You can catch some with simple guard clauses: for example, drop any reference whose DOI matches the work’s own DOI, and cross-check GROBID’s header metadata against the OpenAlex Work record (which draws on Crossref and other sources independently of GROBID).
A lot of people today actually want Markdown rather than XML for LLM and RAG pipelines. We don’t have Markdown parses yet, though they’re on the roadmap. In the meantime you can roll your own by downloading the PDFs we provide and running them through one of the newer PDF→Markdown tools like Marker, Docling, or MinerU.
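The self-reference guard mentioned above could be sketched in Python as follows. This assumes GROBID's usual TEI layout, which this page does not spell out: the TEI namespace http://www.tei-c.org/ns/1.0, references as biblStruct elements inside listBibl, and DOIs in idno elements with type="DOI". The function names and the sample document (with made-up DOIs) are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # assumed GROBID/TEI namespace


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def reference_dois(tei_xml: str, own_doi: str) -> list:
    """Return DOIs from the reference list, dropping the work's own DOI."""
    own = normalize_doi(own_doi)
    root = ET.fromstring(tei_xml)
    dois = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        for idno in bibl.iterfind(".//tei:idno[@type='DOI']", TEI_NS):
            doi = normalize_doi(idno.text or "")
            if doi and doi != own:
                dois.append(doi)
    return dois


# Made-up TEI fragment: the first reference is a self-reference.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><back>
<div type="references"><listBibl>
<biblStruct><analytic><idno type="DOI">10.1000/Self</idno></analytic></biblStruct>
<biblStruct><analytic><idno type="DOI">10.1000/other</idno></analytic></biblStruct>
</listBibl></div></back></text></TEI>"""

print(reference_dois(SAMPLE, "https://doi.org/10.1000/self"))
# ['10.1000/other']
```

Archive objects are stored as .xml.gz, so decompress them (for example with Python's gzip module) before parsing. DOIs are compared case-insensitively here because GROBID may extract them in a different case than the Work record stores.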

Licensing

The PDFs retain their original copyright. OpenAlex does not grant any additional rights to the content. To check the license for a specific work, use the best_oa_location.license field in the API:
https://api.openalex.org/works?filter=has_content.pdf:true,best_oa_location.license:cc-by
This returns works that have downloadable PDFs and are licensed under CC BY.