Cache Hydration

Overview

Hydration pre-warms the fused Zarr cache for a region so that subsequent /fuse requests return instantly from cache instead of computing on the fly.

Unlike /fuse (which is synchronous and returns data), /hydrate is an asynchronous workflow: it accepts a bounding box, splits it into 0.05° grid cells, processes each cell through the full fusion pipeline, writes results to ~/.cache/topobathysim/fused_zarr/, and reports progress.
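
The cell decomposition can be sketched as follows. This is an illustrative reimplementation, not the library's actual code; it assumes cells are snapped to a global 0.05° grid so that overlapping requests produce identical cell bboxes (and therefore cache hits):

```python
import math

CELL_DEG = 0.05  # hydration grid cell size in degrees

def grid_cells(west, south, east, north, cell=CELL_DEG):
    """Split a WGS84 bbox into fixed-size cells aligned to a global grid.

    Snapping to grid indices (rather than offsetting from the request's
    own corner) means two overlapping requests share identical cells.
    """
    i0 = math.floor(west / cell)
    j0 = math.floor(south / cell)
    i1 = math.ceil(east / cell)
    j1 = math.ceil(north / cell)
    cells = []
    for j in range(j0, j1):
        for i in range(i0, i1):
            cells.append((round(i * cell, 6), round(j * cell, 6),
                          round((i + 1) * cell, 6), round((j + 1) * cell, 6)))
    return cells
```

Any bbox that partially covers a cell still claims the whole cell, which is what makes the cache reusable across requests.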

Architecture

sequenceDiagram
    participant Client
    participant Server as FastAPI Server
    participant Sub as Hydration Subprocess
    participant FS as JSON State File
    participant Cache as Zarr Cache

    Client->>Server: POST /hydrate {bbox, resolution}
    Server->>FS: Write initial state (pending)
    Server->>Sub: Spawn subprocess
    Server-->>Client: {job_id, ws_url}

    loop For each grid cell (batched)
        Sub->>Cache: Check cell cache
        alt Cache miss
            Sub->>Cache: Fuse & write .zarr
        end
        Sub->>FS: Atomic write progress
    end
    Sub->>FS: Write final state (completed/failed)

    alt WebSocket monitoring
        Client->>Server: WS /hydrate/{job_id}/ws
        loop Until done
            Server->>FS: Read state
            Server-->>Client: JSON frame
        end
    else HTTP polling
        loop Until done
            Client->>Server: GET /hydrate/{job_id}
            Server->>FS: Read state
            Server-->>Client: JSON response
        end
    end
    

Why a subprocess?

The hydration task loads large raster datasets (NCEI BAG tiles at 50cm–2m resolution, CUDEM DEMs, etc.) that can consume gigabytes of memory. Running this as a subprocess means:

  • OOM isolation — if the subprocess is killed by the OS, the web server stays up and the job state file records the failure.

  • No shared mutable state — all communication is via atomic JSON file writes (write .tmp then os.replace), eliminating race conditions.

  • Multi-worker safe — any uvicorn worker can serve GET /hydrate/{job_id} by reading the state file from disk.
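
The atomic-write pattern these bullets rely on looks roughly like this (a sketch, not the service's actual state writer):

```python
import json
import os
import tempfile

def write_state_atomic(path, state):
    """Write job state so readers never observe a partial file.

    Write to a temp file in the same directory, fsync it, then
    os.replace() it over the target: the rename is atomic, so any
    concurrent reader sees either the old state or the new state.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic; same filesystem since same dir
    except Exception:
        os.unlink(tmp)
        raise
```

Creating the temp file in the target directory (not the system temp dir) matters: os.replace is only atomic when source and destination are on the same filesystem.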

Dead process detection

When a reader (GET or WebSocket) loads a state file that says "status": "running", it checks whether the recorded PID is still alive via os.kill(pid, 0). If the process is gone, the state is atomically updated to "status": "failed" with an error message indicating the likely cause (OOM).
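
A minimal sketch of that liveness check and the resulting state transition (function names here are illustrative, not the service's actual API):

```python
import os

def pid_alive(pid):
    """Probe a PID without sending a signal.

    os.kill(pid, 0) performs only the existence and permission checks:
    ProcessLookupError means the process is gone; PermissionError means
    it exists but belongs to another user.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True

def reconcile(state):
    """Mark a 'running' job failed if its recorded PID is dead."""
    if state.get("status") == "running" and not pid_alive(state["pid"]):
        return {**state, "status": "failed",
                "error": "hydration subprocess died unexpectedly (possible OOM)"}
    return state
```

In the service, the reconciled state would then be written back with the same atomic-replace pattern used for progress updates.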

Usage

CLI script

The hydrate_cache script provides two subcommands: fuse (zarr cells) and tiles (rendered PNGs).

# Hydrate fused zarr cells (30m default)
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26

# 15m resolution with custom policy
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 \
  -r 15 \
  -c config/topobathysim/policies/great_bay_estuary.yaml

# Hydrate rendered PNG tiles at zoom 13
python -m topobathysim.scripts.hydrate_cache tiles -70.99 42.89 -70.49 43.26 -z 13

# HTTP polling only (skip WebSocket)
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 --no-ws

# Custom service URL
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 --url http://remote:9595

Arguments:

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| WEST SOUTH EAST NORTH | yes | — | Bounding box (WGS84) |
| -c, --config | no | server default | Path to YAML policy file |
| -r, --resolution | no | 30.0 | Output resolution in meters |
| --url | no | http://localhost:9595 | Service base URL |
| --no-ws | no | false | Disable WebSocket, use HTTP polling |
REST API

Submit a job:

POST /hydrate
Content-Type: application/json

{
  "bbox": [-70.99, 42.89, -70.49, 43.26],
  "resolution": 15,
  "policy_yaml": "name: Custom\ncrs: EPSG:4326\nvariables: ..."
}

Request fields:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| bbox | [w, s, e, n] | yes | — | Bounding box in WGS84 decimal degrees |
| resolution | float | no | 30.0 | Output resolution in meters |
| policy_yaml | string | no | null | Raw YAML policy to override server default |
| max_workers | int (1–16) | no | 2 | Thread pool parallelism for cell processing |
When policy_yaml is provided, the custom policy is used instead of the server’s default. The cache key is a SHA256 hash of the policy content + cell bbox + resolution, so custom policies produce separate cache entries that never collide with the server default. See the “Cache behavior” section in /fuse Endpoint — Zarr & GeoTIFF Output for details.
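
The exact serialization that feeds the hash is internal to the server, but the scheme can be sketched like this (the payload layout below is an assumption for illustration):

```python
import hashlib

def cache_key(policy_text, cell_bbox, resolution):
    """Derive a deterministic cache key from policy + cell bbox + resolution.

    Identical inputs always map to the same .zarr path; changing any of
    the three (even one character of the policy YAML) yields a different
    key, so custom policies never collide with the server default.
    """
    payload = "|".join([
        policy_text,
        ",".join(f"{c:.6f}" for c in cell_bbox),
        f"{resolution:g}",
    ])
    return hashlib.sha256(payload.encode()).hexdigest()
```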

Response:

{
  "status": "accepted",
  "job_id": "a1b2c3d4-...",
  "ws": "ws://localhost:9595/hydrate/a1b2c3d4-.../ws",
  "bbox": [-70.99, 42.89, -70.49, 43.26],
  "resolution": 15
}

Poll status:

GET /hydrate/{job_id}
{
  "id": "a1b2c3d4-...",
  "status": "running",
  "total_cells": 48,
  "processed_cells": 12,
  "cached_cells": 30,
  "failed_cells": 0
}
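
A client can drive this endpoint with a simple polling loop; the sketch below uses only the standard library and assumes the JSON shape shown above:

```python
import json
import time
import urllib.request

def poll_job(base_url, job_id, interval=2.0, timeout=3600.0):
    """Poll GET /hydrate/{job_id} until the job reaches a terminal status.

    Returns the final state dict, or raises TimeoutError if the job does
    not finish within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{base_url}/hydrate/{job_id}") as resp:
            state = json.load(resp)
        if state["status"] in ("completed", "failed"):
            return state
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```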

List recent jobs:

GET /hydrate

WebSocket

Connect to ws://host:port/hydrate/{job_id}/ws to receive real-time JSON frames as each cell completes. The server only sends a frame when state changes (deduplicated). The connection closes automatically when the job reaches completed or failed status.

Web UI

Navigate to the hydration page at http://localhost:9595/static/hydrate.html. The UI provides a Leaflet map for bbox selection, resolution controls, optional policy YAML upload, and live progress monitoring via WebSocket.

Grid cell caching

Hydration splits the requested bbox into a fixed 0.05° grid. Each cell is cached independently at ~/.cache/topobathysim/fused_zarr/{hash}.zarr, where the hash is derived from the policy content, cell bbox, and resolution.

If you request a second bbox that overlaps a previously hydrated region, the overlapping cells are cache hits — only new cells are processed. This makes incremental hydration of adjacent regions efficient.

Memory management

The hydration subprocess uses several strategies to limit peak memory:

  • Batched cell processing — cells are submitted to a ThreadPoolExecutor in small batches (default: max_workers * 2) rather than all at once. gc.collect() runs between batches.

  • Per-cell cleanup — after each cell is written to Zarr, dataset references are explicitly deleted and garbage collected.

  • BAG LRU cache cap — the NCEI BAG provider’s in-memory LRU cache is capped at 8 entries (tunable via TOPOBATHY_BAG_LRU_SIZE).

  • Worker recycling — uvicorn workers auto-restart after 200 requests (tunable via --limit-max-requests).
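
The batching strategy from the first bullet can be sketched as follows (illustrative; the real pipeline also records per-cell failures and writes progress between batches):

```python
import gc
from concurrent.futures import ThreadPoolExecutor

def process_in_batches(cells, work, max_workers=2):
    """Run `work` over cells in batches of max_workers * 2.

    Bounding the number of in-flight cells keeps only a few large
    datasets in memory at once; an explicit gc.collect() between
    batches releases per-cell references promptly.
    """
    batch_size = max_workers * 2
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(cells), batch_size):
            batch = cells[start:start + batch_size]
            results.extend(pool.map(work, batch))  # preserves input order
            gc.collect()
    return results
```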

Job state file format

State files are stored at ~/.cache/topobathysim/hydration_jobs/{job_id}.json:

{
  "id": "a1b2c3d4-...",
  "status": "running",
  "pid": 12345,
  "bbox": [-70.99, 42.89, -70.49, 43.26],
  "resolution": 15,
  "submitted_at": "2026-03-22T22:19:40+00:00",
  "total_cells": 48,
  "processed_cells": 12,
  "cached_cells": 30,
  "failed_cells": 0
}

Valid statuses: pending, running, completed, failed.

Completed and failed jobs are automatically pruned after 24 hours.
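
A pruning pass over that directory might look like this (a sketch of the retention rule, not the server's actual implementation):

```python
import json
import os
import time

TERMINAL = {"completed", "failed"}

def prune_jobs(jobs_dir, max_age_s=24 * 3600):
    """Delete terminal job state files older than max_age_s.

    Jobs still pending or running are kept regardless of age, so a
    long-running hydration is never pruned out from under its client.
    """
    now = time.time()
    removed = []
    for name in os.listdir(jobs_dir):
        if not name.endswith(".json"):
            continue
        path = os.path.join(jobs_dir, name)
        try:
            with open(path) as f:
                status = json.load(f).get("status")
        except (OSError, ValueError):
            continue  # unreadable or mid-write; leave for the next pass
        if status in TERMINAL and now - os.path.getmtime(path) > max_age_s:
            os.remove(path)
            removed.append(name)
    return removed
```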