Commit 5902faa

rdhyee and claude authored
Add Cloudflare Worker for data.isamples.org with immutable cache-control (#120)
* Add Cloudflare Worker for data.isamples.org with immutable cache-control

  Creates workers/data-isamples-org/ colocated with the site it serves. The
  Worker proxies the iSamples R2 bucket via an R2 bucket binding (env.BUCKET)
  and adds a Cache-Control header so Cloudflare's edge and the user's browser
  can treat filename-versioned parquets as immutable.

  Key behavior:
  - isamples_YYYYMM_*.parquet → Cache-Control: public, max-age=31536000, immutable
  - Anything else → Cache-Control: public, max-age=300
  - Range requests parsed and forwarded (required for DuckDB-WASM).
  - HEAD requests use BUCKET.head() and return headers only.
  - CORS: Access-Control-Allow-Origin: *, exposes Content-Range + friends.

  Does not deploy. Raymond needs to run `wrangler deploy` from the
  workers/data-isamples-org/ directory against his Cloudflare account. If
  another Worker currently owns the data.isamples.org/* route, this deploy
  replaces it; see README for the verification checklist.

  Addresses the free-speed finding from the perf strategy discussion:
  currently no cache-control is being set anywhere on data.isamples.org, so
  the CF edge does not cache and browsers re-fetch unpredictably.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix R2 bucket name: isamples-ry (verified via Cloudflare API)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent a189189 commit 5902faa

4 files changed

Lines changed: 259 additions & 0 deletions

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
node_modules/
.wrangler/
.dev.vars
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# `data.isamples.org` Worker

A Cloudflare Worker that proxies the iSamples R2 bucket at
`data.isamples.org` and, most importantly, adds a `Cache-Control` header
so Cloudflare's edge and the user's browser can cache immutable parquet
versions aggressively.

## Why this exists

The parquet files under `data.isamples.org` are filename-versioned
(`isamples_202601_wide.parquet`, `isamples_202601_h3_summary_res4.parquet`,
etc.): the month appears in the filename, so content at a given URL never
changes.

Without a `Cache-Control` header, Cloudflare's edge does **not** cache these
files, and browsers use unpredictable heuristic caching (often: re-fetch on
every visit). This Worker fixes that by emitting:

```
Cache-Control: public, max-age=31536000, immutable
```

…for any object key (the URL path without its leading slash) matching
`^isamples_\d{6}_.*\.parquet$`, and a short 5-minute fallback
(`max-age=300`) for anything else.
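
As a rough illustration of that rule, the sketch below applies the pattern to
a few keys. `cacheControlFor` is a hypothetical helper name used only here;
the pattern and the two `max-age` values are the ones quoted above. The
pattern is anchored at the start of the key, so a versioned parquet stored
under a prefix would fall through to the 5-minute fallback.

```javascript
// Values quoted in this README; the helper name is an assumption for illustration.
const IMMUTABLE_PATTERN = /^isamples_\d{6}_.*\.parquet$/;
const IMMUTABLE_MAX_AGE = 60 * 60 * 24 * 365; // one year, in seconds
const FALLBACK_MAX_AGE = 300; // five minutes, in seconds

// Pick the Cache-Control value for an object key (URL path without the leading slash).
function cacheControlFor(key) {
  return IMMUTABLE_PATTERN.test(key)
    ? `public, max-age=${IMMUTABLE_MAX_AGE}, immutable`
    : `public, max-age=${FALLBACK_MAX_AGE}`;
}

console.log(cacheControlFor('isamples_202601_wide.parquet'));
// public, max-age=31536000, immutable
console.log(cacheControlFor('archive/isamples_202601_wide.parquet'));
// public, max-age=300 (the key does not start with "isamples_")
console.log(cacheControlFor('index.json'));
// public, max-age=300
```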

## What it does

| Concern | How |
| --- | --- |
| R2 access | R2 bucket binding (`env.BUCKET`); no public `r2.dev` hop |
| Range requests | Parsed and forwarded; required for DuckDB-WASM |
| CORS | `Access-Control-Allow-Origin: *`, exposes `Content-Range` etc. |
| HEAD requests | Uses `BUCKET.head()` and returns headers only |
| Immutable cache | `public, max-age=31536000, immutable` for versioned parquets |
| Short cache fallback | `public, max-age=300` for anything else |
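
To make the range-request row concrete, here is a minimal sketch of how a
parsed range could map onto the `Content-Range` header of a `206` response.
`resolveRange` is a hypothetical helper, not a function exported by this
Worker. It assumes the `Range` header has already been parsed into R2's
`{ offset, length }` or `{ suffix }` shape and that the total object size is
known.

```javascript
// Hypothetical helper: resolve a parsed range against the full object size.
// Returns null when no byte of the object falls inside the range (HTTP 416).
function resolveRange(range, size) {
  const isSuffix = range.suffix !== undefined;
  // "bytes=-N" (suffix) means the last N bytes; otherwise start at the offset.
  const start = isSuffix ? Math.max(size - range.suffix, 0) : (range.offset ?? 0);
  // An explicit length is clamped to the end of the object.
  const end = !isSuffix && range.length !== undefined
    ? Math.min(start + range.length - 1, size - 1)
    : size - 1;
  if (start > end) return null;
  return {
    start,
    end,
    length: end - start + 1,
    contentRange: `bytes ${start}-${end}/${size}`,
  };
}

console.log(resolveRange({ offset: 0, length: 1024 }, 5000).contentRange);
// bytes 0-1023/5000
console.log(resolveRange({ suffix: 8 }, 5000).contentRange);
// bytes 4992-4999/5000
console.log(resolveRange({ offset: 6000 }, 5000));
// null
```

Handling the suffix form separately matters because a suffix range has no
offset: treating it as starting at byte 0 would report the wrong span.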

## Deploying

One-time setup (if not already done):

```bash
cd workers/data-isamples-org
npm install -g wrangler        # or: npx wrangler ...
wrangler login                 # opens browser, auth to isamples.org account
```

Verify the R2 bucket name in `wrangler.toml` matches your actual bucket
(Cloudflare dashboard → R2 → buckets). Update `bucket_name` if needed.

Deploy:

```bash
wrangler deploy
```

This publishes the Worker and installs the route `data.isamples.org/*`.

> ⚠️ If another Worker is already bound to `data.isamples.org/*` (e.g. a
> legacy proxy from the original setup), `wrangler deploy` will **replace**
> it. Check `wrangler deployments list` or the Cloudflare dashboard
> (Workers → Routes) before deploying if you want to be cautious.

## Verifying

After deploy:

```bash
curl -sI https://data.isamples.org/isamples_202601_h3_summary_res4.parquet \
  | grep -iE 'cache-control|cf-cache-status|etag'
```

You should see:

```
cache-control: public, max-age=31536000, immutable
etag: "..."
```

If the response passes through Cloudflare's edge cache, the first request
after deploy shows `cf-cache-status: MISS` and subsequent requests show
`HIT` (edge cache warmed). A response that a Worker builds directly from an
R2 binding is not stored in the edge cache by the header alone, so
`cf-cache-status` may be absent unless the Worker also uses the Cache API
(`caches.default`); the browser cache honors the header either way. Browser
refreshes on the Interactive Explorer (`?perf=1`) should drop phase 1 res4
duration toward zero on warm cache.
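
The same check can be scripted. The sketch below is one possible approach,
not part of this Worker: `checkImmutableHeaders` is a hypothetical helper
that takes any object with a `get(name)` method (such as the `headers` of a
`fetch` response) and lists what is missing for a versioned parquet.

```javascript
// Hypothetical helper: report problems with the caching headers of a versioned parquet.
function checkImmutableHeaders(headers) {
  const problems = [];
  const cacheControl = (headers.get('cache-control') ?? '').toLowerCase();
  const directives = cacheControl.split(',').map((d) => d.trim());
  for (const expected of ['public', 'max-age=31536000', 'immutable']) {
    if (!directives.includes(expected)) {
      problems.push(`Cache-Control is missing "${expected}"`);
    }
  }
  if (!headers.get('etag')) problems.push('ETag is missing');
  return problems; // an empty array means the headers look right
}

// Usage against the deployed Worker (needs network access):
//   const res = await fetch(
//     'https://data.isamples.org/isamples_202601_h3_summary_res4.parquet',
//     { method: 'HEAD' },
//   );
//   console.log(checkImmutableHeaders(res.headers));
```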

## Local dev

```bash
wrangler dev
```

Starts a local server on `http://localhost:8787/`. Wrangler 3 and later
simulate the R2 binding with local storage by default; run
`wrangler dev --remote` to proxy the live R2 bucket. Useful for testing
header logic without touching production.

## Future extensions

- Path-based routing (e.g. `/parquet/...`, `/record/<uuid>`) per issue #81
- Per-object cache hints via R2 custom metadata
- Index listing at `/` (currently just a plain-text stub)
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
/**
 * Cloudflare Worker: data.isamples.org
 *
 * Proxies the iSamples R2 bucket and adds cache-control headers so the
 * Cloudflare edge and the browser can cache immutable parquet versions
 * aggressively.
 *
 * Strategy:
 *   - Filename-versioned parquets (isamples_YYYYMM_*.parquet) are immutable
 *     by naming convention → cache one year + immutable.
 *   - Anything else falls back to a short TTL.
 *
 * Uses the R2 bucket binding (env.BUCKET) rather than fetching the r2.dev
 * public URL: fewer hops, lower latency, no need to expose the bucket
 * publicly.
 *
 * Range requests are supported so DuckDB-WASM's HTTP range fetches keep
 * working.
 */

const IMMUTABLE_PATTERN = /^isamples_\d{6}_.*\.parquet$/;
const IMMUTABLE_MAX_AGE = 60 * 60 * 24 * 365; // 1 year
const FALLBACK_MAX_AGE = 300; // 5 minutes

const CORS_HEADERS = {
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Methods': 'GET, HEAD, OPTIONS',
  'Access-Control-Allow-Headers': 'Range',
  'Access-Control-Expose-Headers': 'Content-Length, Content-Range, Accept-Ranges, ETag',
};

export default {
  async fetch(request, env) {
    if (request.method === 'OPTIONS') {
      return new Response(null, { status: 204, headers: CORS_HEADERS });
    }

    if (request.method !== 'GET' && request.method !== 'HEAD') {
      return new Response('Method not allowed', { status: 405, headers: CORS_HEADERS });
    }

    const url = new URL(request.url);
    let key;
    try {
      key = decodeURIComponent(url.pathname.replace(/^\/+/, ''));
    } catch {
      // decodeURIComponent throws URIError on malformed percent-encoding.
      return new Response('Bad request', { status: 400, headers: CORS_HEADERS });
    }

    if (!key) {
      // Simple root response; could be replaced with an index listing later.
      return new Response('data.isamples.org — R2 bucket proxy\n', {
        status: 200,
        headers: { 'content-type': 'text/plain; charset=utf-8', ...CORS_HEADERS },
      });
    }

    // Parse Range header if present. R2's get() accepts { offset, length } or
    // { suffix }, mirroring HTTP Range semantics.
    const rangeHeader = request.headers.get('range');
    const range = rangeHeader ? parseRange(rangeHeader) : undefined;

    const getOptions = range ? { range } : {};
    const object = request.method === 'HEAD'
      ? await env.BUCKET.head(key)
      : await env.BUCKET.get(key, getOptions);

    if (!object) {
      return new Response('Not found', { status: 404, headers: CORS_HEADERS });
    }

    const headers = new Headers();
    object.writeHttpMetadata(headers);
    headers.set('ETag', object.httpEtag);
    headers.set('Accept-Ranges', 'bytes');

    for (const [k, v] of Object.entries(CORS_HEADERS)) headers.set(k, v);

    // Cache-Control: this is the optimization.
    if (IMMUTABLE_PATTERN.test(key)) {
      headers.set('Cache-Control', `public, max-age=${IMMUTABLE_MAX_AGE}, immutable`);
    } else {
      headers.set('Cache-Control', `public, max-age=${FALLBACK_MAX_AGE}`);
    }

    if (request.method === 'HEAD') {
      headers.set('Content-Length', String(object.size));
      return new Response(null, { status: 200, headers });
    }

    // Range response: 206 + Content-Range. object.size is the size of the
    // whole object (not of the returned slice), so it supplies the total.
    if (range) {
      const size = object.size;
      // A suffix range ({ suffix: N }) means the last N bytes; otherwise the
      // range starts at offset and runs for length bytes or to the end.
      const start = range.suffix !== undefined
        ? Math.max(size - range.suffix, 0)
        : range.offset;
      const end = range.suffix === undefined && range.length !== undefined
        ? Math.min(start + range.length - 1, size - 1)
        : size - 1;
      if (start > end) {
        // Nothing to serve: offset at or past the end, or an empty object.
        headers.delete('Cache-Control');
        headers.set('Content-Range', `bytes */${size}`);
        return new Response(null, { status: 416, headers });
      }
      headers.set('Content-Range', `bytes ${start}-${end}/${size}`);
      headers.set('Content-Length', String(end - start + 1));
      return new Response(object.body, { status: 206, headers });
    }

    return new Response(object.body, { status: 200, headers });
  },
};

/**
 * Parse an HTTP Range header into the { offset, length } or { suffix } shape
 * R2 expects. Supports `bytes=START-END`, `bytes=START-` and `bytes=-SUFFIX`.
 * Returns undefined for anything we can't parse so the caller falls back to
 * a full-object fetch.
 */
function parseRange(header) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  if (!match) return undefined;
  const [, startStr, endStr] = match;
  if (startStr === '' && endStr === '') return undefined;
  if (startStr === '') {
    // Suffix: last N bytes
    const suffix = Number(endStr);
    if (!Number.isFinite(suffix) || suffix <= 0) return undefined;
    return { suffix };
  }
  const offset = Number(startStr);
  if (!Number.isFinite(offset) || offset < 0) return undefined;
  if (endStr === '') return { offset };
  const end = Number(endStr);
  if (!Number.isFinite(end) || end < offset) return undefined;
  return { offset, length: end - offset + 1 };
}
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Cloudflare Worker: data.isamples.org
#
# Proxies the iSamples R2 bucket and adds Cache-Control so edge + browser
# caches can treat versioned parquets as immutable. See README.md for deploy
# instructions.

name = "data-isamples-org"
main = "src/index.js"
compatibility_date = "2026-04-01"

# Route: everything under data.isamples.org goes through this Worker.
# Keep this top-level key above the first table header; after [[r2_buckets]]
# TOML would parse `routes` as a field of the bucket binding.
routes = [
  { pattern = "data.isamples.org/*", zone_name = "isamples.org" },
]

# Bind the R2 bucket that holds the iSamples parquet files.
# The binding name (BUCKET) matches env.BUCKET in src/index.js.
[[r2_buckets]]
binding = "BUCKET"
bucket_name = "isamples-ry"

# Observability: enable Worker logs in the Cloudflare dashboard.
[observability]
enabled = true
