# Cache Hydration

## Overview

Hydration pre-warms the fused Zarr cache for a region so that subsequent `/fuse` requests return instantly from cache instead of computing on the fly. Unlike `/fuse` (which is synchronous and returns data), `/hydrate` is an asynchronous workflow: it accepts a bounding box, splits it into 0.05° grid cells, processes each cell through the full fusion pipeline, writes results to `~/.cache/topobathysim/fused_zarr/`, and reports progress.

## Architecture

```{mermaid}
sequenceDiagram
    participant Client
    participant Server as FastAPI Server
    participant Sub as Hydration Subprocess
    participant FS as JSON State File
    participant Cache as Zarr Cache

    Client->>Server: POST /hydrate {bbox, resolution}
    Server->>FS: Write initial state (pending)
    Server->>Sub: Spawn subprocess
    Server-->>Client: {job_id, ws_url}

    loop For each grid cell (batched)
        Sub->>Cache: Check cell cache
        alt Cache miss
            Sub->>Cache: Fuse & write .zarr
        end
        Sub->>FS: Atomic write progress
    end
    Sub->>FS: Write final state (completed/failed)

    alt WebSocket monitoring
        Client->>Server: WS /hydrate/{job_id}/ws
        loop Until done
            Server->>FS: Read state
            Server-->>Client: JSON frame
        end
    else HTTP polling
        loop Until done
            Client->>Server: GET /hydrate/{job_id}
            Server->>FS: Read state
            Server-->>Client: JSON response
        end
    end
```

### Why a subprocess?

The hydration task loads large raster datasets (NCEI BAG tiles at 50 cm–2 m resolution, CUDEM DEMs, etc.) that can consume gigabytes of memory. Running it as a subprocess means:

- **OOM isolation** — if the subprocess is killed by the OS, the web server stays up and the job state file records the failure.
- **No shared mutable state** — all communication is via atomic JSON file writes (write `.tmp`, then `os.replace`), eliminating race conditions.
- **Multi-worker safe** — any uvicorn worker can serve `GET /hydrate/{job_id}` by reading the state file from disk.
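The write-`.tmp`-then-`os.replace` pattern can be sketched as follows. This is a minimal illustration, not the service's actual code; `write_state` is a hypothetical helper name.

```python
"""Sketch of an atomic JSON state-file write (write .tmp, then os.replace).

`write_state` is a hypothetical name for illustration only.
"""
import json
import os
import tempfile
from pathlib import Path


def write_state(path: Path, state: dict) -> None:
    """Write JSON to a temp file in the same directory, then atomically
    replace the target. Readers see either the old file or the new one,
    never a partially written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
```

Because `os.replace` is atomic on the same filesystem, a `GET /hydrate/{job_id}` handler in another worker can never observe a half-written JSON document.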
### Dead process detection

When a reader (GET or WebSocket) loads a state file that says `"status": "running"`, it checks whether the recorded PID is still alive via `os.kill(pid, 0)`. If the process is gone, the state is atomically updated to `"status": "failed"` with an error message indicating the likely cause (OOM).

## Usage

### CLI script

The `hydrate_cache` script provides two subcommands: `fuse` (zarr cells) and `tiles` (rendered PNGs).

```bash
# Hydrate fused zarr cells (30 m default)
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26

# 15 m resolution with custom policy
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 \
    -r 15 \
    -c config/topobathysim/policies/great_bay_estuary.yaml

# Hydrate rendered PNG tiles at zoom 13
python -m topobathysim.scripts.hydrate_cache tiles -70.99 42.89 -70.49 43.26 -z 13

# HTTP polling only (skip WebSocket)
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 --no-ws

# Custom service URL
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 --url http://remote:9595
```

**Arguments:**

| Argument                | Required | Default                 | Description                         |
| ----------------------- | -------- | ----------------------- | ----------------------------------- |
| `WEST SOUTH EAST NORTH` | yes      | —                       | Bounding box (WGS84)                |
| `-c, --config`          | no       | server default          | Path to YAML policy file            |
| `-r, --resolution`      | no       | 30.0                    | Output resolution in meters         |
| `--url`                 | no       | `http://localhost:9595` | Service base URL                    |
| `--no-ws`               | no       | false                   | Disable WebSocket, use HTTP polling |

### REST API

**Submit a job:**

```text
POST /hydrate
Content-Type: application/json

{
  "bbox": [-70.99, 42.89, -70.49, 43.26],
  "resolution": 15,
  "policy_yaml": "name: Custom\ncrs: EPSG:4326\nvariables: ..."
}
```

**Request fields:**

| Field         | Type           | Required | Default | Description                                 |
| ------------- | -------------- | -------- | ------- | ------------------------------------------- |
| `bbox`        | `[w, s, e, n]` | yes      | —       | Bounding box in WGS84 decimal degrees       |
| `resolution`  | float          | no       | 30.0    | Output resolution in meters                 |
| `policy_yaml` | string         | no       | null    | Raw YAML policy to override server default  |
| `max_workers` | int (1–16)     | no       | 2       | Thread pool parallelism for cell processing |

When `policy_yaml` is provided, the custom policy is used instead of the server's default. The cache key is a SHA256 hash of the policy content + cell bbox + resolution, so custom policies produce **separate cache entries** that never collide with the server default. See the "Cache behavior" section in {doc}`fuse_endpoint` for details.

**Response:**

```json
{
  "status": "accepted",
  "job_id": "a1b2c3d4-...",
  "ws": "ws://localhost:9595/hydrate/a1b2c3d4-.../ws",
  "bbox": [-70.99, 42.89, -70.49, 43.26],
  "resolution": 15
}
```

**Poll status:**

```text
GET /hydrate/{job_id}
```

```json
{
  "id": "a1b2c3d4-...",
  "status": "running",
  "total_cells": 48,
  "processed_cells": 12,
  "cached_cells": 30,
  "failed_cells": 0
}
```

**List recent jobs:**

```text
GET /hydrate
```

### WebSocket

Connect to `ws://host:port/hydrate/{job_id}/ws` to receive real-time JSON frames as each cell completes. The server only sends a frame when state changes (deduplicated). The connection closes automatically when the job reaches `completed` or `failed` status.

### Web UI

Navigate to the hydration page at `http://localhost:9595/static/hydrate.html`. The UI provides a Leaflet map for bbox selection, resolution controls, optional policy YAML upload, and live progress monitoring via WebSocket.

## Grid cell caching

Hydration splits the requested bbox into a fixed 0.05° grid.
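The split and the per-cell cache key might look like the sketch below. The helper names (`grid_cells`, `cell_cache_key`) and the snapping of cells to multiples of 0.05° are assumptions for illustration; snapping is implied by the fact that overlapping requests hit the same cached cells.

```python
"""Sketch of the fixed 0.05-degree grid split and per-cell cache key.

All names here are hypothetical; only the 0.05-degree pitch and the
SHA256(policy + cell bbox + resolution) key scheme come from the docs.
"""
import hashlib
import math

CELL_DEG = 0.05  # fixed grid pitch in degrees


def grid_cells(west, south, east, north):
    """Yield (w, s, e, n) cells of the fixed grid covering the bbox.

    Cells are snapped to multiples of CELL_DEG (an assumption) so that
    overlapping requests produce identical cells and cache keys."""
    w0 = math.floor(west / CELL_DEG) * CELL_DEG
    s0 = math.floor(south / CELL_DEG) * CELL_DEG
    nx = math.ceil((east - w0) / CELL_DEG)
    ny = math.ceil((north - s0) / CELL_DEG)
    for j in range(ny):
        for i in range(nx):
            w = w0 + i * CELL_DEG
            s = s0 + j * CELL_DEG
            yield (round(w, 6), round(s, 6),
                   round(w + CELL_DEG, 6), round(s + CELL_DEG, 6))


def cell_cache_key(policy_yaml: str, cell, resolution: float) -> str:
    """SHA256 over policy content + cell bbox + resolution -> zarr stem."""
    payload = f"{policy_yaml}|{cell}|{resolution}".encode()
    return hashlib.sha256(payload).hexdigest()
```

Because the resolution is part of the hash, hydrating the same region at 15 m and 30 m produces disjoint sets of `.zarr` entries.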
Each cell is cached independently at `~/.cache/topobathysim/fused_zarr/{hash}.zarr`, where the hash is derived from the policy content, cell bbox, and resolution. If you request a second bbox that overlaps a previously hydrated region, the overlapping cells are **cache hits** — only new cells are processed. This makes incremental hydration of adjacent regions efficient.

## Memory management

The hydration subprocess uses several strategies to limit peak memory:

- **Batched cell processing** — cells are submitted to a `ThreadPoolExecutor` in small batches (default: `max_workers * 2`) rather than all at once. `gc.collect()` runs between batches.
- **Per-cell cleanup** — after each cell is written to Zarr, dataset references are explicitly deleted and garbage collected.
- **BAG LRU cache cap** — the NCEI BAG provider's in-memory LRU cache is capped at 8 entries (tunable via `TOPOBATHY_BAG_LRU_SIZE`).
- **Worker recycling** — uvicorn workers auto-restart after 200 requests (tunable via `--limit-max-requests`).

## Job state file format

State files are stored at `~/.cache/topobathysim/hydration_jobs/{job_id}.json`:

```json
{
  "id": "a1b2c3d4-...",
  "status": "running",
  "pid": 12345,
  "bbox": [-70.99, 42.89, -70.49, 43.26],
  "resolution": 15,
  "submitted_at": "2026-03-22T22:19:40+00:00",
  "total_cells": 48,
  "processed_cells": 12,
  "cached_cells": 30,
  "failed_cells": 0
}
```

Valid statuses: `pending`, `running`, `completed`, `failed`. Completed and failed jobs are automatically pruned after 24 hours.
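The dead-process detection described earlier operates on this state file. A minimal sketch of the `os.kill(pid, 0)` liveness check, with hypothetical helper names (`pid_alive`, `reconcile`) rather than the service's actual API:

```python
"""Sketch of dead-process detection against a hydration state dict.

`pid_alive` and `reconcile` are hypothetical names for illustration.
"""
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists.

    Signal 0 delivers nothing; it only performs the existence/permission
    check, so this never disturbs a live subprocess."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def reconcile(state: dict) -> dict:
    """Mark a 'running' job as failed when its recorded PID has vanished."""
    if state.get("status") == "running" and not pid_alive(state["pid"]):
        state = {
            **state,
            "status": "failed",
            "error": "hydration subprocess exited unexpectedly (possible OOM)",
        }
    return state
```

A reader would run `reconcile` on every state load and atomically write the result back whenever the status changed, so the failure is recorded exactly once even with multiple pollers.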