Cache Hydration
Overview
Hydration pre-warms the fused Zarr cache for a region so that subsequent
/fuse requests are served from cache instead of being computed on the fly.
Unlike /fuse (which is synchronous and returns data), /hydrate is an
asynchronous workflow: it accepts a bounding box, splits it into 0.05° grid
cells, processes each cell through the full fusion pipeline, writes results to
~/.cache/topobathysim/fused_zarr/, and reports progress.
Architecture
sequenceDiagram
participant Client
participant Server as FastAPI Server
participant Sub as Hydration Subprocess
participant FS as JSON State File
participant Cache as Zarr Cache
Client->>Server: POST /hydrate {bbox, resolution}
Server->>FS: Write initial state (pending)
Server->>Sub: Spawn subprocess
Server-->>Client: {job_id, ws}
loop For each grid cell (batched)
Sub->>Cache: Check cell cache
alt Cache miss
Sub->>Cache: Fuse & write .zarr
end
Sub->>FS: Atomic write progress
end
Sub->>FS: Write final state (completed/failed)
alt WebSocket monitoring
Client->>Server: WS /hydrate/{job_id}/ws
loop Until done
Server->>FS: Read state
Server-->>Client: JSON frame
end
else HTTP polling
loop Until done
Client->>Server: GET /hydrate/{job_id}
Server->>FS: Read state
Server-->>Client: JSON response
end
end
Why a subprocess?
The hydration task loads large raster datasets (NCEI BAG tiles at 50 cm to 2 m resolution, CUDEM DEMs, etc.) that can consume gigabytes of memory. Running it as a subprocess means:
OOM isolation: if the subprocess is killed by the OS, the web server stays up and the job state file records the failure.
No shared mutable state: all communication is via atomic JSON file writes (write .tmp, then os.replace), eliminating race conditions.
Multi-worker safe: any uvicorn worker can serve GET /hydrate/{job_id} by reading the state file from disk.
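The write-then-replace pattern mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names write_state_atomic and read_state are hypothetical, and the fsync call is an extra durability step that the text does not mention.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state_atomic(path: Path, state: dict) -> None:
    """Write `state` as JSON so that readers never observe a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # assumption: flush to disk before the rename
        os.replace(tmp_name, path)  # atomically swaps the new file into place
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_state(path: Path) -> dict:
    """Load the most recently committed state."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Because os.replace swaps the whole file in one step, a reader in another uvicorn worker sees either the previous state or the new one, never a half-written document.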
Dead process detection
When a reader (GET or WebSocket) loads a state file that says "status": "running",
it checks whether the recorded PID is still alive via os.kill(pid, 0). If the
process is gone, the state is atomically updated to "status": "failed" with an
error message indicating the likely cause (OOM).
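A minimal sketch of this liveness check, assuming a POSIX host (on Windows, os.kill with signal 0 does not behave as a harmless probe). The function names and the "error" field are hypothetical; the source only states that an error message is recorded.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX semantics)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def reconcile(state: dict) -> dict:
    """Return `state`, marked as failed if its worker process has vanished."""
    if state.get("status") == "running" and not pid_alive(state["pid"]):
        return {
            **state,
            "status": "failed",
            # "error" is an assumed field name for the recorded message.
            "error": "hydration process exited unexpectedly (possibly OOM-killed)",
        }
    return state
```

A reader would call reconcile on the loaded state and, if the status changed, persist the result with the same atomic write used elsewhere.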
Usage
CLI script
The hydrate_cache script provides two subcommands: fuse (zarr cells) and
tiles (rendered PNGs).
# Hydrate fused zarr cells (30m default)
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26
# 15m resolution with custom policy
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 \
-r 15 \
-c config/topobathysim/policies/great_bay_estuary.yaml
# Hydrate rendered PNG tiles at zoom 13
python -m topobathysim.scripts.hydrate_cache tiles -70.99 42.89 -70.49 43.26 -z 13
# HTTP polling only (skip WebSocket)
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 --no-ws
# Custom service URL
python -m topobathysim.scripts.hydrate_cache fuse -70.99 42.89 -70.49 43.26 --url http://remote:9595
Arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| bbox (4 positional values) | yes | none | Bounding box (WGS84) |
| -c | no | server default | Path to YAML policy file |
| -r | no | 30.0 | Output resolution in meters |
| --url | no | | Service base URL |
| --no-ws | no | false | Disable WebSocket, use HTTP polling |
REST API
Submit a job:
POST /hydrate
Content-Type: application/json
{
"bbox": [-70.99, 42.89, -70.49, 43.26],
"resolution": 15,
"policy_yaml": "name: Custom\ncrs: EPSG:4326\nvariables: ..."
}
Request fields:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| bbox | array of 4 numbers | yes | none | Bounding box in WGS84 decimal degrees |
| resolution | float | no | 30.0 | Output resolution in meters |
| policy_yaml | string | no | null | Raw YAML policy to override server default |
| | int (1–16) | no | 2 | Thread pool parallelism for cell processing |
When policy_yaml is provided, the custom policy is used instead of the
server’s default. The cache key is a SHA256 hash of the policy content +
cell bbox + resolution, so custom policies produce separate cache entries
that never collide with the server default. See the “Cache behavior” section
in /fuse Endpoint — Zarr & GeoTIFF Output for details.
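To illustrate why a custom policy cannot collide with the server default, here is a sketch of a key built from the three inputs named above. The serialization (JSON with sorted keys, coordinates rounded to six decimals) and the function name cell_cache_key are assumptions; the real implementation may combine the inputs differently and would therefore produce different digests.

```python
import hashlib
import json


def cell_cache_key(policy_yaml: str, cell_bbox: tuple, resolution: float) -> str:
    """Hash policy content + cell bbox + resolution into a SHA-256 hex digest."""
    payload = json.dumps(
        {
            "policy": policy_yaml,
            # Rounding avoids distinct keys caused by float noise (assumed precision).
            "bbox": [round(float(v), 6) for v in cell_bbox],
            "resolution": float(resolution),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to the policy text, the cell, or the resolution yields a different digest, so each combination gets its own cache entry.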
Response:
{
"status": "accepted",
"job_id": "a1b2c3d4-...",
"ws": "ws://localhost:9595/hydrate/a1b2c3d4-.../ws",
"bbox": [-70.99, 42.89, -70.49, 43.26],
"resolution": 15
}
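A job can be submitted from Python with the standard library alone. The sketch below assumes a service at a base URL such as http://localhost:9595; the helper names build_hydrate_request and submit are hypothetical, and only the request fields shown above are used.

```python
import json
import urllib.request


def build_hydrate_request(base_url, bbox, resolution=30.0, policy_yaml=None):
    """Build the POST /hydrate request without sending it."""
    body = {"bbox": list(bbox), "resolution": resolution}
    if policy_yaml is not None:
        body["policy_yaml"] = policy_yaml  # omit to use the server default policy
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/hydrate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(base_url, bbox, **kwargs):
    """Send the request and return the decoded JSON response."""
    req = build_hydrate_request(base_url, bbox, **kwargs)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Calling submit("http://localhost:9595", [-70.99, 42.89, -70.49, 43.26], resolution=15) should return a body shaped like the response above, including the job_id needed for polling.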
Poll status:
GET /hydrate/{job_id}
{
"id": "a1b2c3d4-...",
"status": "running",
"total_cells": 48,
"processed_cells": 12,
"cached_cells": 30,
"failed_cells": 0
}
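A polling client only needs to repeat this request until the status becomes terminal. The following sketch is one way to structure it; the function names, the 2-second interval, and the poll limit are illustrative assumptions rather than values taken from the service.

```python
import json
import time
import urllib.request

TERMINAL = {"completed", "failed"}


def fetch_status(base_url: str, job_id: str) -> dict:
    """GET /hydrate/{job_id} and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/hydrate/{job_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(fetch, interval_s=2.0, max_polls=1000, sleep=time.sleep):
    """Call `fetch()` until the job is completed or failed; return the last state."""
    for _ in range(max_polls):
        state = fetch()
        if state["status"] in TERMINAL:
            return state
        sleep(interval_s)
    raise TimeoutError("job did not finish within max_polls polls")
```

Typical use would be wait_for_job(lambda: fetch_status("http://localhost:9595", job_id)). Passing the fetch function in keeps the loop easy to test without a running server.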
List recent jobs:
GET /hydrate
WebSocket
Connect to ws://host:port/hydrate/{job_id}/ws to receive real-time JSON
frames as each cell completes. The server only sends a frame when state
changes (deduplicated). The connection closes automatically when the job
reaches completed or failed status.
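The deduplication and auto-close behavior can be modeled as a small generator over successive state snapshots. This is a sketch of the idea only; frames_to_send is a hypothetical name and the real server loop (state-file reads, WebSocket sends) is not shown in the source.

```python
import json
from typing import Iterable, Iterator

TERMINAL = {"completed", "failed"}


def frames_to_send(snapshots: Iterable[dict]) -> Iterator[str]:
    """Yield a JSON frame only when the state changed; stop at a terminal status."""
    last = None
    for state in snapshots:
        frame = json.dumps(state, sort_keys=True)
        if frame != last:  # skip snapshots identical to the last frame sent
            yield frame
            last = frame
        if state.get("status") in TERMINAL:
            return  # the connection would be closed at this point
```

Comparing serialized frames rather than individual counters means any change in the state file, including a new error message, produces a frame.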
Web UI
Navigate to the hydration page at http://localhost:9595/static/hydrate.html.
The UI provides a Leaflet map for bbox selection, resolution controls, optional
policy YAML upload, and live progress monitoring via WebSocket.
Grid cell caching
Hydration splits the requested bbox into a fixed 0.05° grid. Each cell is
cached independently at ~/.cache/topobathysim/fused_zarr/{hash}.zarr, where
the hash is derived from the policy content, cell bbox, and resolution.
If you request a second bbox that overlaps a previously hydrated region, the overlapping cells are cache hits — only new cells are processed. This makes incremental hydration of adjacent regions efficient.
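The reason overlapping requests share cells is that cell boundaries depend only on the grid, not on the requested bbox. The sketch below assumes the grid is aligned to multiples of 0.05° from the origin and that a bbox is ordered (west, south, east, north); both are assumptions, as is the function name grid_cells.

```python
import math

CELL_DEG = 0.05  # fixed cell size in degrees


def grid_cells(bbox, cell=CELL_DEG):
    """Return the grid cells (west, south, east, north) that cover `bbox`."""
    west, south, east, north = bbox
    eps = 1e-9  # tolerance so edges lying exactly on the grid do not add a cell
    i0 = math.floor(west / cell + eps)
    i1 = math.ceil(east / cell - eps)
    j0 = math.floor(south / cell + eps)
    j1 = math.ceil(north / cell - eps)
    # Cells are derived from integer indices, so equal cells compare equal.
    return [
        (
            round(i * cell, 6),
            round(j * cell, 6),
            round((i + 1) * cell, 6),
            round((j + 1) * cell, 6),
        )
        for j in range(j0, j1)
        for i in range(i0, i1)
    ]
```

Two requests that overlap produce identical tuples for the shared area, so a key derived from the cell bbox is the same in both jobs and the second job finds those cells already cached.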
Memory management
The hydration subprocess uses several strategies to limit peak memory:
Batched cell processing: cells are submitted to a ThreadPoolExecutor in small batches (default: max_workers * 2) rather than all at once. gc.collect() runs between batches.
Per-cell cleanup: after each cell is written to Zarr, dataset references are explicitly deleted and garbage collected.
BAG LRU cache cap: the NCEI BAG provider’s in-memory LRU cache is capped at 8 entries (tunable via TOPOBATHY_BAG_LRU_SIZE).
Worker recycling: uvicorn workers auto-restart after 200 requests (tunable via --limit-max-requests).
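The batching strategy can be sketched as follows. This is an illustration under stated assumptions: process_in_batches and process_cell are hypothetical names, and tracking failures per cell is inferred from the failed_cells counter rather than taken from the actual code.

```python
import gc
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_batches(cells, process_cell, max_workers=2):
    """Run `process_cell` over `cells` in small batches; return (done, failed)."""
    batch_size = max_workers * 2  # batch size described in the text
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(cells), batch_size):
            batch = cells[start:start + batch_size]
            futures = {pool.submit(process_cell, cell): cell for cell in batch}
            for future in as_completed(futures):
                cell = futures[future]
                if future.exception() is None:
                    done.append(cell)
                else:
                    failed.append(cell)  # one bad cell does not abort the job
            # Collect garbage once the whole batch has finished.
            gc.collect()
    return done, failed
```

Submitting only one batch at a time bounds how many cells can hold raster data in memory at once, at the cost of some idle time while the slowest cell in a batch finishes.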
Job state file format
State files are stored at ~/.cache/topobathysim/hydration_jobs/{job_id}.json:
{
"id": "a1b2c3d4-...",
"status": "running",
"pid": 12345,
"bbox": [-70.99, 42.89, -70.49, 43.26],
"resolution": 15,
"submitted_at": "2026-03-22T22:19:40+00:00",
"total_cells": 48,
"processed_cells": 12,
"cached_cells": 30,
"failed_cells": 0
}
Valid statuses: pending, running, completed, failed.
Completed and failed jobs are automatically pruned after 24 hours.
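A pruning pass over the job directory might look like the sketch below. It assumes that a job's age is measured from the state file's modification time, which the source does not specify; prune_jobs is a hypothetical name.

```python
import json
import time
from pathlib import Path
from typing import List, Optional

TERMINAL = {"completed", "failed"}
MAX_AGE_S = 24 * 60 * 60  # 24 hours


def prune_jobs(jobs_dir: Path, now: Optional[float] = None) -> List[str]:
    """Delete state files of finished jobs older than 24 hours; return their ids."""
    now = time.time() if now is None else now
    removed = []
    for path in sorted(jobs_dir.glob("*.json")):
        try:
            state = json.loads(path.read_text(encoding="utf-8"))
            age_s = now - path.stat().st_mtime  # assumption: age is based on mtime
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or concurrently removed file: leave it alone
        if state.get("status") in TERMINAL and age_s > MAX_AGE_S:
            path.unlink(missing_ok=True)
            removed.append(path.stem)
    return removed
```

Pending and running jobs are never removed by this pass, regardless of age, so a long hydration keeps its state file until it finishes or is marked failed.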