
# Full-text PDFs

> Download full-text PDFs and TEI XML for millions of works

OpenAlex has cached full-text content for about 60 million works:

| Format  | Files | Size     |
| ------- | ----- | -------- |
| PDF     | \~60M | \~250 TB |
| TEI XML | \~43M | \~20 TB  |

TEI XML files are machine-readable structured text parsed by [GROBID](https://github.com/kermitt2/grobid).

## Download options

### Option 1: API (up to \~10K files)

Download files one at a time from the content API. Each download costs **\$0.01**.

```
https://content.openalex.org/works/{work_id}.pdf?api_key=YOUR_KEY
```

| Extension     | Format        |
| ------------- | ------------- |
| `.pdf`        | PDF (default) |
| `.grobid-xml` | TEI XML       |

<CodeGroup>
  ```bash PDF theme={"dark"}
  curl "https://content.openalex.org/works/W3038568908.pdf?api_key=YOUR_KEY"
  ```

  ```bash TEI XML theme={"dark"}
  curl "https://content.openalex.org/works/W3038568908.grobid-xml?api_key=YOUR_KEY"
  ```
</CodeGroup>
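The same request can be made from Python. This is a minimal sketch using only the standard library; the helper names `content_url` and `download_content` are illustrative and not part of any OpenAlex client. The URL pattern and extensions are taken from the table above.

```python
import urllib.parse
import urllib.request

CONTENT_BASE = "https://content.openalex.org/works"
EXTENSIONS = {"pdf": ".pdf", "grobid_xml": ".grobid-xml"}


def content_url(work_id: str, fmt: str = "pdf", api_key: str = "YOUR_KEY") -> str:
    """Build the content API URL for one work (fmt is 'pdf' or 'grobid_xml')."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unknown format: {fmt!r}")
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{CONTENT_BASE}/{work_id}{EXTENSIONS[fmt]}?{query}"


def download_content(work_id: str, dest: str, fmt: str = "pdf", api_key: str = "YOUR_KEY") -> None:
    """Fetch one file and write it to dest (each download is billed at the rate above)."""
    url = content_url(work_id, fmt, api_key)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())
```

Calling `download_content("W3038568908", "paper.pdf", api_key="...")` would save the PDF locally; error handling and retries are left out for brevity.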

**Finding downloadable works:**

Use the `has_content` filter to find works with available content:

```bash theme={"dark"}
# Works with PDFs from 2024
curl "https://api.openalex.org/works?filter=has_content.pdf:true,publication_year:2024"

# CC-BY works with PDFs
curl "https://api.openalex.org/works?filter=has_content.pdf:true,best_oa_location.license:cc-by"
```
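If you assemble these filter URLs in code, a small helper keeps the comma-separated syntax in one place. This is a sketch; `works_url` is a hypothetical name, and only the filter keys shown above are taken from this page.

```python
import urllib.parse

API_BASE = "https://api.openalex.org/works"


def works_url(filters: dict) -> str:
    """Join filter key/value pairs into the comma-separated `filter` parameter."""
    filter_value = ",".join(f"{key}:{value}" for key, value in filters.items())
    # Keep ':' and ',' unescaped so the URL matches the documented form.
    query = urllib.parse.urlencode({"filter": filter_value}, safe=":,")
    return f"{API_BASE}?{query}"


print(works_url({"has_content.pdf": "true", "publication_year": "2024"}))
```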

You can also check the `content_urls` field on any work object to see if content is available.

With a free API key (\$1/day), you can download about 100 files per day at \$0.01 each. That suits research projects, building small corpora, or sampling.
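The arithmetic behind these figures is simple enough to script when planning a download. The sketch below works in whole cents to avoid floating-point rounding; the function names are illustrative.

```python
def files_for_budget(budget_cents: int, cost_per_file_cents: int = 1) -> int:
    """Number of content downloads a budget covers, in whole files."""
    return budget_cents // cost_per_file_cents


def cost_usd(n_files: int, cost_per_file_cents: int = 1) -> float:
    """Total cost in dollars for n_files content downloads."""
    return n_files * cost_per_file_cents / 100


print(files_for_budget(100))  # files covered by $1/day
print(cost_usd(10_000))       # cost of the ~10K-file upper end of this option
```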

### Option 2: OpenAlex CLI (up to a few million files)

For larger downloads, use the [OpenAlex CLI](/download/openalex-cli). It handles parallel downloads, retries, checkpointing, and resume automatically.

```bash theme={"dark"}
pip install openalex-official
```

**Download PDFs for a specific topic:**

```bash theme={"dark"}
openalex download \
  --api-key YOUR_KEY \
  --output ./climate-pdfs \
  --filter "topics.id:T10325,has_content.pdf:true" \
  --content pdf
```

**Download works with Creative Commons licenses:**

```bash theme={"dark"}
openalex download \
  --api-key YOUR_KEY \
  --output ./cc-pdfs \
  --filter "best_oa_location.license:cc-by,has_content.pdf:true" \
  --content pdf
```

**Download metadata + PDFs + TEI XML:**

```bash theme={"dark"}
openalex download \
  --api-key YOUR_KEY \
  --output ./my-corpus \
  --filter "topics.id:T10325,has_content.pdf:true" \
  --content pdf,xml
```

Standard rates apply (\$0.01 per content file; metadata is free). At full speed, you can download a few million files in a few days.

### Option 3: Complete archive sync

For the complete archive (all 60M files), we provide direct access to the storage bucket via time-limited credentials. Files are stored on [Cloudflare R2](https://developers.cloudflare.com/r2/), which is fully S3-compatible.

**One-time download:** 30-day R2 read access to sync the complete archive.

**Ongoing sync:** Persistent R2 read access is included with the enterprise subscription.

See the [pricing page](https://openalex.org/pricing) for details, or [contact us](mailto:steve@ourresearch.org) to get started.

**How it works:**

<Steps>
  <Step title="Get credentials">
    We generate R2 API credentials with read-only access.
  </Step>

  <Step title="Sync with AWS CLI">
    ```bash theme={"dark"}
    aws s3 sync s3://openalex-pdfs ./pdfs \
      --endpoint-url https://a452eddbbe06eb7d02f4879cee70d29c.r2.cloudflarestorage.com
    ```
  </Step>

  <Step title="Stay current">
    For ongoing sync, re-run the same command periodically; `aws s3 sync` copies only new or changed files.
  </Step>
</Steps>

At typical network speeds, expect **1–2 weeks** to download the full archive.
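As a back-of-envelope check on that estimate, you can compute transfer time from your own sustained throughput. The sketch assumes decimal terabytes, the \~250 TB PDF figure from the table at the top of this page, and a link that stays saturated for the whole transfer. Under those assumptions, 1–2 weeks corresponds to roughly 1.7 to 3.3 Gbit/s sustained.

```python
def transfer_days(size_tb: float, throughput_gbps: float) -> float:
    """Days to move size_tb (decimal terabytes) at a sustained throughput in Gbit/s."""
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (throughput_gbps * 1e9)
    return seconds / 86_400


for gbps in (1, 2, 5):
    print(f"{gbps} Gbit/s: {transfer_days(250, gbps):.1f} days")
```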

The archive is production-ready but still early, and we add and clean files continuously, so treat it as a living dataset you re-sync periodically rather than a frozen snapshot. As a sensible default, validate each file on ingest (check the leading `%PDF` magic bytes and skip implausibly small objects) so your pipeline is robust to any in-flight changes.
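A minimal sketch of that ingest check is shown below. The 1 KiB size floor is a hypothetical threshold, not an OpenAlex recommendation, so tune it for your corpus.

```python
from pathlib import Path

MIN_PDF_BYTES = 1024  # hypothetical threshold; adjust for your corpus


def looks_like_pdf(path: Path, min_bytes: int = MIN_PDF_BYTES) -> bool:
    """Cheap ingest check: plausible size and the %PDF signature at byte 0."""
    if path.stat().st_size < min_bytes:
        return False
    with path.open("rb") as fh:
        return fh.read(4) == b"%PDF"
```

This only rejects obviously bad objects. It does not prove that a file is a complete or renderable PDF.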

## Mapping work IDs to files

Files in the archive bucket are named by UUID, not work ID. A work's PDF and its TEI XML are stored as **separate objects with different UUIDs** — the file names alone won't tell you which PDF and which XML belong to the same work:

```
openalex-pdfs/
├── 3a07228e-de2a-4c37-955d-b1411a498328.pdf
├── 7b12f4a1-8c9d-4e5f-a2b3-c4d5e6f7a8b9.pdf
└── ...

openalex-grobid-xml/
├── 5e6ea859-62f6-40ab-9f7e-6472aab3a19d.xml.gz
├── 523b7cb3-17f3-4e2f-b81c-867fd8802182.xml.gz
└── ...
```

### The manifest (for archive sync)

If you've synced the bucket, use the **manifest** we publish in the `openalex-pdfs` bucket itself:

```
s3://openalex-pdfs/_manifest/content_index/
```

It's a set of Parquet files, regenerated in full once a day, that maps every work to its file UUIDs. A normal `aws s3 sync` of the bucket pulls the manifest down alongside the content (into a local `_manifest/content_index/` folder), so you typically already have it.

| Column          | Example                                | Meaning or object key                        |
| --------------- | -------------------------------------- | -------------------------------------------- |
| `openalex_id`   | `W2741809807`                          | the OpenAlex work ID                         |
| `pdf_uuid`      | `3a07228e-de2a-4c37-955d-b1411a498328` | `openalex-pdfs/{pdf_uuid}.pdf`               |
| `grobid_xml_id` | `5e6ea859-62f6-40ab-9f7e-6472aab3a19d` | `openalex-grobid-xml/{grobid_xml_id}.xml.gz` |
| `updated_date`  | `2026-06-20`                           | when this work's content last changed        |

UUIDs are stored without their file extension — append `.pdf` and `.xml.gz` to build the object keys. A work may have a PDF but no XML (GROBID can't parse every PDF), so `grobid_xml_id` is sometimes empty.
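Building the keys from a manifest row could look like the sketch below. It assumes the row is available as a plain dict, as most Parquet readers can provide, and treats both `None` and an empty string as "no file". The function name is illustrative.

```python
from typing import Optional


def object_keys(row: dict) -> dict[str, Optional[str]]:
    """Map one manifest row to its object keys; a missing UUID yields None."""
    pdf_uuid = row.get("pdf_uuid")
    xml_uuid = row.get("grobid_xml_id")
    return {
        "pdf": f"openalex-pdfs/{pdf_uuid}.pdf" if pdf_uuid else None,
        "grobid_xml": f"openalex-grobid-xml/{xml_uuid}.xml.gz" if xml_uuid else None,
    }
```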

Query it directly with any Parquet reader — no need to load it into a database:

```python theme={"dark"}
import duckdb

# Look up one work (run from the directory you synced the bucket into)
duckdb.sql("""
  SELECT * FROM '_manifest/content_index/*.parquet'
  WHERE openalex_id = 'W2741809807'
""").show()
```

The join is then entirely local, with no API calls:

1. Join `openalex_id` to your [works snapshot](/download/snapshot-format) to attach metadata.
2. Join `pdf_uuid` / `grobid_xml_id` to your synced files to attach the PDF and TEI XML.

Because the manifest is a full daily replacement, re-pull the `_manifest/` prefix whenever you re-sync the archive.
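The two-step join can be sketched in plain Python as follows. It assumes manifest rows are already loaded as dicts, that `works_by_id` is a dict of snapshot metadata keyed by work ID, and that the PDFs and XML were synced into two local directories. All of these names and the directory layout are hypothetical.

```python
from pathlib import Path


def attach_files(manifest_rows, works_by_id, pdf_dir, xml_dir):
    """Yield one record per manifest row with metadata and local file paths attached."""
    for row in manifest_rows:
        pdf = Path(pdf_dir) / f"{row['pdf_uuid']}.pdf" if row.get("pdf_uuid") else None
        xml = Path(xml_dir) / f"{row['grobid_xml_id']}.xml.gz" if row.get("grobid_xml_id") else None
        yield {
            "openalex_id": row["openalex_id"],
            "metadata": works_by_id.get(row["openalex_id"]),
            # Keep a path only if the file is actually present in the local sync.
            "pdf_path": pdf if pdf and pdf.exists() else None,
            "xml_path": xml if xml and xml.exists() else None,
        }
```

For tens of millions of rows, the same join is better expressed as a query in a Parquet-aware engine. The loop above is only meant to make the key relationships explicit.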

### The content endpoint (for per-work fetches)

If you're fetching files one at a time through the [content API](/download/full-text-pdfs#option-1-api-up-to-10k-files) rather than syncing the bucket, you don't deal with UUIDs at all — the endpoint resolves the work ID to the underlying object server-side:

```
https://content.openalex.org/works/{work_id}.pdf
```

Work objects from the API expose this directly via `has_content` and `content_urls`:

```json theme={"dark"}
"has_content":  { "pdf": true, "grobid_xml": true },
"content_urls": { "pdf": "https://content.openalex.org/works/W1775749144.pdf",
                  "grobid_xml": "https://content.openalex.org/works/W1775749144.grobid-xml" }
```

Filter your candidate set with `has_content.pdf:true` (available in the [snapshot](/download/snapshot-format) and the API; `content_urls` is API-only), then fetch each work by ID. This is convenient for small-to-medium sets, but at archive scale prefer the manifest above — the join is offline and costs nothing.
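Reading these two fields from a work object might look like the sketch below. It assumes the work has already been parsed from JSON into a dict with the shape shown above; `available_content` is an illustrative name.

```python
def available_content(work: dict) -> dict:
    """Return {format: url} for formats flagged in has_content and present in content_urls."""
    flags = work.get("has_content") or {}
    urls = work.get("content_urls") or {}
    return {fmt: urls[fmt] for fmt, present in flags.items() if present and fmt in urls}


work = {
    "has_content": {"pdf": True, "grobid_xml": False},
    "content_urls": {"pdf": "https://content.openalex.org/works/W1775749144.pdf"},
}
print(available_content(work))
```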

<Warning>
  Do **not** map work IDs to archive objects via `locations[].pdf_url`. That field is the *original publisher* URL, not the archive object — keying off it produces a lossy, inferred match. Use the manifest or the content endpoint instead.
</Warning>

## TEI XML quality and limitations

GROBID ([project](https://github.com/kermitt2/grobid), [docs](https://grobid.readthedocs.io/)) is the state of the art for converting scholarly PDFs to structured TEI XML. But PDF parsing is genuinely hard, and a meaningful share of files will contain errors — missing or duplicated references, occasional self-references (the paper's own DOI picked up from its header or footer), wrong or partial header metadata, and the odd truncated section. You can see how GROBID performs field-by-field in [the project's own benchmarks](https://grobid.readthedocs.io/en/latest/Benchmarking/).

GROBID also can't parse every PDF. It doesn't do OCR, so scanned or image-only PDFs produce little or nothing useful, and unusual or malformed PDFs can fail outright. Filter on `has_content.grobid_xml:true` to limit your set to works where we have a parse.

We pass GROBID's output through unchanged, including any errors it made. You can catch some with simple guard clauses — for example, **drop any reference whose DOI matches the work's own DOI**, and **cross-check GROBID's header metadata against the OpenAlex Work record** (which draws on Crossref and other sources independently of GROBID).
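The self-reference guard can be sketched as below. It assumes you have already extracted reference DOIs from the TEI XML into a list of strings (extraction is not shown), and that the two sources may spell DOIs differently, for example as a bare `10.…` string versus a `https://doi.org/…` URL. The function names and the prefix list are illustrative.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or prefix forms so spellings compare equal."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):].strip()
    return doi


def drop_self_references(reference_dois, work_doi):
    """Remove references whose DOI equals the work's own DOI; keep DOI-less entries."""
    own = normalize_doi(work_doi) if work_doi else None
    return [d for d in reference_dois if not d or own is None or normalize_doi(d) != own]
```

Entries without a DOI are kept because there is nothing to compare. Whether to keep or drop them is a separate decision for your pipeline.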

Many people building LLM and RAG pipelines want Markdown rather than XML. We don't have Markdown parses yet, though they're on the roadmap. In the meantime, you can roll your own by downloading the PDFs we provide and running them through one of the newer PDF→Markdown tools such as [Marker](https://github.com/datalab-to/marker), [Docling](https://github.com/docling-project/docling), or [MinerU](https://github.com/opendatalab/MinerU).

## Licensing

The PDFs retain their original copyright. OpenAlex does not grant any additional rights to the content.

To check the license for a specific work, read the `best_oa_location.license` field on its work object. To find works by license, filter on the same field in the API:

```
https://api.openalex.org/works?filter=has_content.pdf:true,best_oa_location.license:cc-by
```

This returns works that have downloadable PDFs and are licensed under [CC BY](https://creativecommons.org/licenses/by/4.0/).
