# Snapshot data format

> Where the OpenAlex data lives and how it's structured

All OpenAlex data is stored in [Amazon S3](https://aws.amazon.com/s3/) in the [`openalex`](https://openalex.s3.amazonaws.com/browse.html) bucket, under the `data/` prefix.

## Two formats

The snapshot is published in two formats, each a complete copy of the same data:

| Format                                        | Compression         | One record per | Prefix           |
| --------------------------------------------- | ------------------- | -------------- | ---------------- |
| [JSON Lines](https://jsonlines.org/)          | gzip (`.gz`)        | line           | `/data/jsonl/`   |
| [Apache Parquet](https://parquet.apache.org/) | snappy (`.parquet`) | row            | `/data/parquet/` |

<Info>
  Parquet is rolling out to the free public snapshot with the **June 2026 quarterly** release. Enterprise API-key users already receive both formats in the daily staging snapshot (see [Download to your machine](/download/download-to-machine#download-with-an-enterprise-api-key)).
</Info>

## Bucket structure

Under each format prefix there is one folder per entity type, plus a combined `manifest.json`:

```
s3://openalex/data/
├── jsonl/
│   ├── manifest.json          # all entities, this format
│   ├── works/
│   │   ├── manifest.json       # works only
│   │   └── updated_date=2026-06-24/
│   │       ├── part_0000.gz
│   │       └── part_0001.gz
│   ├── authors/
│   └── ...
└── parquet/
    ├── manifest.json
    ├── works/
    │   ├── manifest.json
    │   └── updated_date=2026-06-24/
    │       ├── part_0000.parquet
    │       └── part_0001.parquet
    └── ...
```

The entity folders under each format are:

| Core entities                     | Topic hierarchy | Lookup / aggregation entities                     |
| --------------------------------- | --------------- | ------------------------------------------------- |
| `works`                           | `topics`        | `keywords`, `concepts`                            |
| `authors`                         | `subfields`     | `continents`, `countries`                         |
| `institutions`                    | `fields`        | `institution-types`, `source-types`, `work-types` |
| `sources`                         | `domains`       | `languages`, `licenses`, `sdgs`                   |
| `publishers`, `funders`, `awards` |                 |                                                   |

You can browse the bucket at [openalex.s3.amazonaws.com/browse.html](https://openalex.s3.amazonaws.com/browse.html#data/).

<Note>
  Pre-2026 snapshots used a flat `data/{entity}/` layout (JSON Lines only) with no `jsonl/`/`parquet/` split. That older layout, along with the `merged_ids/` directory, is preserved under the `legacy-data/` prefix.
</Note>

Records are partitioned by `updated_date`. For example, a Work with `updated_date` of `2026-06-24` lives under:

```
/data/jsonl/works/updated_date=2026-06-24/      # JSON Lines
/data/parquet/works/updated_date=2026-06-24/    # Parquet
```

Each partition holds one or more part files (`part_0000.*`, `part_0001.*`, …) of up to 400,000 records each.
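The layout above can be read back out of any object key. The following is a minimal sketch, assuming keys always follow the `data/{format}/{entity}/updated_date=YYYY-MM-DD/part_NNNN.{gz,parquet}` shape shown in the examples; the `parse_key` helper and its regular expression are illustrative, not part of any OpenAlex tooling.

```python
import re
from datetime import date

# Pattern inferred from the example paths on this page (an assumption, not a spec).
_KEY_RE = re.compile(
    r"data/(?P<format>jsonl|parquet)/(?P<entity>[^/]+)/"
    r"updated_date=(?P<updated_date>\d{4}-\d{2}-\d{2})/"
    r"(?P<part>part_\d+\.(?:gz|parquet))$"
)


def parse_key(key: str) -> dict:
    """Split an object key or s3:// URL into format, entity, updated_date and part file."""
    match = _KEY_RE.search(key)
    if match is None:
        raise ValueError(f"not a partitioned data file: {key}")
    info = match.groupdict()
    # Convert the partition value to a real date so it can be compared later.
    info["updated_date"] = date.fromisoformat(info["updated_date"])
    return info


info = parse_key("s3://openalex/data/jsonl/works/updated_date=2026-06-24/part_0000.gz")
print(info["format"], info["entity"], info["updated_date"], info["part"])
# prints: jsonl works 2026-06-24 part_0000.gz
```

Parsing the date into a `date` object, rather than comparing strings, keeps later "newer than" checks explicit.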

## Size

The gzip-compressed JSON Lines snapshot is approximately **330 GB** and decompresses to about **1.6 TB**. The Parquet copy is provided alongside it, so downloading both formats roughly doubles the transfer. If you only need one format, [download a single format prefix](/download/download-to-machine#download-a-single-format-or-entity-type).

## Entity schemas

The structure of each entity type matches the API response format:

<CardGroup cols={3}>
  <Card title="Works" href="/api-reference/works" />

  <Card title="Authors" href="/api-reference/authors" />

  <Card title="Sources" href="/api-reference/sources" />

  <Card title="Institutions" href="/api-reference/institutions" />

  <Card title="Topics" href="/api-reference/topics" />

  <Card title="Publishers" href="/api-reference/publishers" />
</CardGroup>

<Info>
  **API-only fields:** Some Work properties are available only through the API and are not included in the snapshot. One example is `content_url`: to fetch content, use the [content endpoint](/download/full-text-pdfs) directly with work IDs from the snapshot.
</Info>

## Keeping your snapshot up to date

<Warning>
  The free public snapshot is refreshed **quarterly**. A **daily-refreshed snapshot** and **daily change files** (via the [Changefiles API](/download/changefiles)) require a [paid plan](https://openalex.org/pricing). Contact [sales@openalex.org](mailto:sales@openalex.org).
</Warning>

The `updated_date` partitions make incremental updates straightforward. The partitions are not dated full copies of the dataset: each one contains only the records that **last changed** on that date.

### How partitions work

Imagine launching OpenAlex with 1,000 Authors, all created on 2024-01-01:

```
/data/jsonl/authors/
├── manifest.json
└── updated_date=2024-01-01 [1000 Authors]
    ├── part_0000.gz
    └── ...
```

If we update 50 of those Authors on 2024-01-15, they **move out of** the old partition and **into** the new one:

```
/data/jsonl/authors/
├── manifest.json
├── updated_date=2024-01-01 [950 Authors]
│   └── ...
└── updated_date=2024-01-15 [50 Authors]
    └── ...
```

If we also discover 50 new Authors, they go into the same new partition:

```
/data/jsonl/authors/
├── manifest.json
├── updated_date=2024-01-01 [950 Authors]
│   └── ...
└── updated_date=2024-01-15 [100 Authors]
    └── ...
```

So if you made your snapshot copy on 2024-01-01, you only need to download `updated_date=2024-01-15` to get everything that changed or was added since then. The Parquet copy under `/data/parquet/authors/` is partitioned exactly the same way.
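The walkthrough above can be reproduced with a small in-memory model. This is only a sketch of the bookkeeping, assuming each record carries a single `updated_date`; the author IDs (`A0`, `A1`, …) and the `partition_sizes` helper are made up for illustration.

```python
from collections import Counter
from datetime import date

# author ID -> updated_date; every Author starts in the 2024-01-01 partition.
authors = {f"A{i}": date(2024, 1, 1) for i in range(1000)}


def partition_sizes(records):
    """Count how many records sit in each updated_date partition."""
    return dict(Counter(records.values()))


# 50 existing Authors are updated on 2024-01-15, so they leave the old partition.
for i in range(50):
    authors[f"A{i}"] = date(2024, 1, 15)

# 50 newly discovered Authors land in the same new partition.
for i in range(1000, 1050):
    authors[f"A{i}"] = date(2024, 1, 15)

sizes = partition_sizes(authors)
print(sizes[date(2024, 1, 1)], sizes[date(2024, 1, 15)])  # prints: 950 100
```

Because a record belongs to exactly one partition at a time, the partition sizes always sum to the total number of records (1,050 here).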

<Tip>
  To update a snapshot copy that you created or updated on date `X`, insert or update the records in partitions where `updated_date` > `X`.
</Tip>
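As a rough sketch of that rule, the function below picks out the partition folders newer than your last sync date. It assumes you already have a list of folder names such as `updated_date=2024-01-15`; the `partitions_after` name is hypothetical.

```python
from datetime import date


def partitions_after(partition_names, last_sync: date):
    """Return partition folder names whose updated_date is strictly after last_sync."""
    newer = []
    for name in partition_names:
        prefix, _, value = name.partition("=")
        # Skip anything that is not a partition folder, e.g. manifest.json.
        if prefix == "updated_date" and date.fromisoformat(value) > last_sync:
            newer.append(name)
    return sorted(newer)


names = ["manifest.json", "updated_date=2024-01-01", "updated_date=2024-01-15"]
print(partitions_after(names, date(2024, 1, 1)))
# prints: ['updated_date=2024-01-15']
```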

You never need to re-download a partition you already have. Anything that changed has moved to a newer partition.

## The manifest files

Every format has a combined `manifest.json` listing all data files across all entities (`/data/{format}/manifest.json`), and every entity has its own `manifest.json` (`/data/{format}/{entity}/manifest.json`). The manifest is written last, after all data files are uploaded, so if it's present the data for that format is complete.

A per-entity manifest looks like this:

```json theme={"dark"}
{
  "date": "2026-06-25",
  "format": "jsonl",
  "entity": "works",
  "record_count": 264000000,
  "content_length": 350000000000,
  "files": [
    {
      "url": "s3://openalex/data/jsonl/works/updated_date=2026-06-24/part_0000.gz",
      "meta": { "content_length": 936733, "record_count": 499 }
    }
  ]
}
```

The combined manifest has the same shape but nests one entry per entity under an `entities` array, with a top-level `meta` carrying the totals.
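A per-entity manifest like the one above can be summarized per partition with a few lines of Python. This sketch assumes the field names shown in the example (`files`, `url`, `meta.record_count`); the inline manifest is a trimmed copy of that example, and `records_per_partition` is an illustrative helper.

```python
import json
from collections import defaultdict

# Trimmed copy of the example manifest above, inlined so the sketch runs offline.
manifest_text = """
{
  "format": "jsonl",
  "entity": "works",
  "files": [
    {
      "url": "s3://openalex/data/jsonl/works/updated_date=2026-06-24/part_0000.gz",
      "meta": {"content_length": 936733, "record_count": 499}
    }
  ]
}
"""


def records_per_partition(manifest):
    """Sum record_count over the files of each updated_date partition."""
    totals = defaultdict(int)
    for entry in manifest["files"]:
        # The partition folder is the second-to-last segment of the URL.
        partition = entry["url"].split("/")[-2]
        totals[partition] += entry["meta"]["record_count"]
    return dict(totals)


manifest = json.loads(manifest_text)
print(records_per_partition(manifest))
# prints: {'updated_date=2026-06-24': 499}
```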

To use a manifest for incremental updates:

<Steps>
  <Step title="Download the manifest">
    ```bash theme={"dark"}
    aws s3 cp s3://openalex/data/jsonl/works/manifest.json ./manifest.json --no-sign-request
    ```
  </Step>

  <Step title="Check for new partitions">
    Read the `url` property of each item in the `files` list to get the full file list, then identify any `updated_date` partitions you haven't seen before.
  </Step>

  <Step title="Download new partitions">
    Download objects with new `updated_date` values.
  </Step>

  <Step title="Verify consistency">
    Download the manifest again. If it hasn't changed since step 1, no records moved between partitions during your download.
  </Step>

  <Step title="Load the data">
    Read each file (decompress and parse one JSON entity per line for JSON Lines; read columns directly for Parquet). Insert or update into your database using each entity's ID as the primary key.
  </Step>
</Steps>
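The local side of steps 2, 4 and 5 could be sketched as follows for the JSON Lines format. This is not an official client: the function names are hypothetical, the download itself is left to whatever tool you use, a plain `dict` stands in for your database, and the sketch assumes each entity stores its ID in an `id` field.

```python
import gzip
import io
import json


def new_file_urls(manifest, seen_partitions):
    """Step 2: URLs of files that live in partitions not yet in seen_partitions."""
    return [
        entry["url"]
        for entry in manifest["files"]
        if entry["url"].split("/")[-2] not in seen_partitions
    ]


def manifest_unchanged(before, after):
    """Step 4: True when the manifest fetched after the download equals the first one."""
    return before == after


def upsert_jsonl_gz(raw_bytes, table):
    """Step 5: decompress, parse one JSON entity per line, upsert keyed on `id`."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                record = json.loads(line)
                table[record["id"]] = record  # insert or overwrite
    return table


# Tiny made-up inputs, so the sketch runs without network access.
manifest = {"files": [
    {"url": "s3://openalex/data/jsonl/authors/updated_date=2024-01-01/part_0000.gz"},
    {"url": "s3://openalex/data/jsonl/authors/updated_date=2024-01-15/part_0000.gz"},
]}
urls = new_file_urls(manifest, seen_partitions={"updated_date=2024-01-01"})

payload = gzip.compress(
    b'{"id": "A1", "display_name": "updated"}\n{"id": "A2", "display_name": "new"}\n'
)
table = {"A1": {"id": "A1", "display_name": "stale"}}
upsert_jsonl_gz(payload, table)
print(len(urls), sorted(table), table["A1"]["display_name"])
# prints: 1 ['A1', 'A2'] updated
```

Keying the upsert on the entity ID is what makes re-processing a partition harmless: loading the same file twice leaves the table unchanged.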

If you have an existing data pipeline, these details may be all you need. For step-by-step download instructions, see [Download to your machine](/download/download-to-machine).
