The complete OpenAlex database, available in two formats: gzip-compressed JSON Lines and Apache Parquet. The free public snapshot is updated quarterly; paid plans get daily-refreshed snapshots and daily change files. Includes works, authors, sources, institutions, topics, publishers, funders, awards, and more.
Best for: Full database replication, data warehousing, comprehensive analysis
Size: The JSON Lines snapshot is ~330 GB compressed (~1.6 TB decompressed). Parquet is provided alongside it as a separate copy of the same data, so downloading both roughly doubles the transfer.
Learn more about the snapshot format
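Because the JSON Lines snapshot is gzip-compressed and expands to roughly five times its download size, it is usually easier to stream each file than to decompress it to disk first. The following is a minimal sketch using only the Python standard library. The function name `iter_records` is hypothetical, and the sketch assumes only that each non-empty line holds one JSON object; it makes no assumption about the snapshot's directory layout or field names.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip-compressed
    JSON Lines file, without decompressing the whole file to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count the records in one file while holding only one line in memory."""
    return sum(1 for _ in iter_records(path))
```

Calling `count_records` on a single downloaded `.gz` file is a cheap way to check that the file is intact before loading it into a warehouse.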
Download PDFs and TEI XML for about 60 million works. Requires an API key; content downloads cost $0.01 per file.
Best for: Text mining, content analysis, building corpora
Full-text PDF documentation
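At $0.01 per file, costs add up quickly for large corpora, so it can help to estimate the bill before starting. The sketch below is simple arithmetic on the price stated above. The function name is hypothetical, and it assumes every downloaded PDF or TEI XML file is billed as one file at the same flat rate, with no discounts or free tier.

```python
from decimal import Decimal

# Price per content file in USD, as stated above.
PRICE_PER_FILE_USD = Decimal("0.01")


def estimate_download_cost(n_files: int) -> Decimal:
    """Return the estimated cost in USD of downloading n_files content files.

    Assumes a flat per-file price (an assumption of this sketch)."""
    if n_files < 0:
        raise ValueError("n_files must be non-negative")
    return PRICE_PER_FILE_USD * n_files


if __name__ == "__main__":
    # One file for each of ~60 million works: 60,000,000 * $0.01 = $600,000.
    print(estimate_download_cost(60_000_000))
    # A 10,000-file corpus: $100.
    print(estimate_download_cost(10_000))
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding.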
Do you need the complete database?
├── Yes → Download the snapshot
│         (/download/snapshot-format)
└── No
    ├── Do you need filtered metadata or content files?
    │   ├── Yes → Use the OpenAlex CLI
    │   │         (/download/openalex-cli)
    │   └── No → Use the REST API
    │            (/api-reference/introduction)
    └── Do you need bulk full-text PDFs?
        ├── Yes → See full-text PDF options
        │         (/download/full-text-pdfs)
        └── No → Use the REST API
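The decision tree above can also be read as a short chain of checks. The following sketch is one possible reading, not an official tool: the function and parameter names are hypothetical, and it assumes the two questions under "No" are asked in the order shown, falling back to the REST API when neither applies.

```python
def recommend_access_method(
    need_complete_database: bool,
    need_filtered_metadata_or_content: bool = False,
    need_bulk_fulltext_pdfs: bool = False,
) -> str:
    """Return the documentation path of the suggested access method."""
    if need_complete_database:
        return "/download/snapshot-format"
    if need_filtered_metadata_or_content:
        return "/download/openalex-cli"
    if need_bulk_fulltext_pdfs:
        return "/download/full-text-pdfs"
    # Default: the REST API covers most remaining use cases.
    return "/api-reference/introduction"
```

For example, `recommend_access_method(False, need_bulk_fulltext_pdfs=True)` returns `/download/full-text-pdfs`.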
For the CLI: Install with pip install openalex-official and run openalex download --help
For PDFs: See full-text PDFs for the three download options
Working with the full snapshot is challenging. The dataset is large (~330 GB compressed and ~1.6 TB decompressed for the JSON Lines copy alone) and complex. If you're unsure, start with the REST API, which can answer most questions with much less setup.