|
10 | 10 | (ns ezmiller.relaunching-tablecloth-time |
11 | 11 | (:require [tablecloth.api :as tc] |
12 | 12 | [tablecloth.time.api :as tct] |
| 13 | + [tablecloth.time.column.api :as tctc] |
| 14 | + [tablecloth.time.parse :as tparse] |
13 | 15 | [scicloj.tableplot.v1.plotly :as plotly] |
14 | 16 | [tech.v3.datatype.functional :as dfn] |
15 | 17 | [scicloj.kindly.v4.kind :as kind])) |
|
26 | 28 | ;; had built this project around a dataset index mechanism that was |
27 | 29 | ;; built into tech.ml.dataset, but after that feature was removed in |
28 | 30 | ;; v7, the project required a rethink. This post walks through that |
29 | | -;; rethink and the projects core core primitives today using the |
| 31 | +;; rethink and the core primitives today, using the |
30 | 32 | ;; Victoria electricity demand dataset. |
31 | 33 |
|
32 | 34 | ;; ## Why No Index? |
|
63 | 65 | ;; on human codes.
64 | 66 |
|
65 | 67 | ;; Now let's dig into this library's primitives and basic functionality. |
| 68 | +;; Throughout these examples, `tc` refers to `tablecloth.api`, |
| 69 | +;; `tct` refers to `tablecloth.time.api`, and `tctc` refers to |
| 70 | +;; `tablecloth.time.column.api`. |
66 | 71 |
|
67 | 72 | ;; ## Loading the Data |
68 | 73 | ;; |
69 | 74 | ;; We'll use the `vic_elec` dataset: half-hourly electricity demand from Victoria, |
70 | | -;; Australia, spanning 2012-2014. Let's load it and take a look. Throughout these examples the tablecloth library is aliased as `tc`, following the common conventio, and tablecloth.time is aliased as `tct`. |
| 75 | +;; Australia, spanning 2012-2014. Strings are parsed to datetime types on load: |
71 | 76 |
|
72 | 77 | (def vic-elec |
73 | 78 | (-> (tc/dataset "https://gist.githubusercontent.com/ezmiller/6edf3e0f41848f532436c15bc94c2f4d/raw/vic_elec.csv" |
|
79 | 84 | ;; The dataset has half-hourly readings with `:Time`, `:Demand` (in MW), |
80 | 85 | ;; `:Temperature`, and other fields. |
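| | +
| | +;; A quick look at the column names and parsed datatypes, using plain
| | +;; tablecloth (`tc/info` with the `:columns` result type):
| | +(tc/info vic-elec :columns)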
81 | 86 |
|
82 | | -;; ## Extracting Time Components |
| 87 | +;; ## Time at the Column Level |
83 | 88 | ;; |
84 | | -;; The first primitive is `add-time-columns`. It extracts temporal fields from a |
85 | | -;; datetime column — day-of-week, month, hour, etc. — as new columns you can |
86 | | -;; group or filter on. Here's a quick look at what it produces: |
| 89 | +;; Before diving into the high-level API, it's worth understanding what's |
| 90 | +;; underneath. tablecloth.time mirrors tablecloth's two-level design: a |
| 91 | +;; dataset API and a column API. The column API is where the actual time |
| 92 | +;; manipulation happens, built on dtype-next's vectorized operations. |
| 93 | +;; |
| 94 | +;; Why does this matter? Because manipulating time data is notoriously fiddly. |
| 95 | +;; Java's `java.time` package is powerful but verbose. Working with columns |
| 96 | +;; of timestamps — converting, extracting, flooring — typically means writing |
| 97 | +;; loops or mapping functions over sequences. tablecloth.time's column API |
| 98 | +;; gives you operations that work on entire columns at once, using the same |
| 99 | +;; fast, primitive-backed machinery as the rest of tech.ml.dataset. |
| 100 | +;; |
| 101 | +;; The building blocks fall into three categories: |
| 102 | +;; |
| 103 | +;; **Parsing** — `tablecloth.time.parse/parse` handles ISO-8601 strings and |
| 104 | +;; custom formats with cached formatters for performance. For now this is |
| 105 | +;; scalar (single value), but bulk parsing happens automatically when loading |
| 106 | +;; datasets with `tc/convert-types`. |
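| | +
| | +;; For instance (hedged: this single-string arity is an assumption; `parse`
| | +;; may also accept a format pattern or options):
| | +(tparse/parse "2014-05-03T12:30:00")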
| 107 | +;; |
| 108 | +;; **Conversion** — `convert-time` moves between representations (Instants, |
| 109 | +;; LocalDateTimes, LocalDates, epoch milliseconds) with timezone awareness. |
| 110 | +;; This is the workhorse for preparing time columns for different operations. |
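| | +
| | +;; A purely illustrative call, assuming `convert-time` lives in the column
| | +;; API; the target keyword and option names here are guesses at the
| | +;; signature, shown only to convey the idea:
| | +(tctc/convert-time (:Time vic-elec) :instant {:zone "UTC"})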
| 111 | +;; |
| 112 | +;; **Flooring and extraction** — `down-to-nearest`, `floor-to-month`, and |
| 113 | +;; field extractors like `year`, `hour`, `day-of-week` operate on columns |
| 114 | +;; using dtype-next's vectorized arithmetic. These are **column in, column out**: |
| 115 | + |
| 116 | +;; Extract just the hour from the Time column: |
| 117 | +(tctc/hour (:Time vic-elec)) |
| 118 | + |
| 119 | +;; Floor timestamps to hour buckets: |
| 120 | +(tctc/down-to-nearest (:Time vic-elec) 1 :hours {:zone "UTC"}) |
| 121 | + |
| 122 | +;; The key thing to notice: no Clojure seqs, no explicit loops. These |
| 123 | +;; operations work on primitive arrays under the hood, just like dtype-next's |
| 124 | +;; numeric operations. The result is a column that can be added directly |
| 125 | +;; to a dataset. |
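| | +
| | +;; For example, attaching the extracted hour back to the dataset with plain
| | +;; tablecloth's `add-column`:
| | +(-> vic-elec
| | +    (tc/add-column :Hour #(tctc/hour (% :Time)))
| | +    (tc/head 5))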
| 126 | + |
| 127 | +;; ## Building Up: add-time-columns |
| 128 | +;; |
| 129 | +;; With these column-level tools in hand, the dataset-level API is just |
| 130 | +;; convenience. `add-time-columns` — the function that most users reach |
| 131 | +;; for first — is actually a thin wrapper around the extractors we just saw. |
| 132 | +;; |
| 133 | +;; Here's what it does internally: |
| 134 | +;; |
| 135 | +;; 1. Take the source time column from the dataset |
| 136 | +;; 2. Look up extractor functions from a map (`:year` → `tctc/year`, etc.) |
| 137 | +;; 3. Apply each extractor to produce new columns |
| 138 | +;; 4. Add those columns back to the dataset |
| 139 | +;; |
| 140 | +;; The "primitive" is just a composition of lower-level pieces. This matters
| 141 | +;; because it means you can drop down when the high-level API doesn't |
| 142 | +;; quite fit. Need a custom computed field? Build it from the column |
| 143 | +;; tools and add it yourself. |
| 144 | +;; |
| 145 | +;; Let's see it in action: |
87 | 146 |
|
88 | 147 | (-> vic-elec |
89 | 148 | (tct/add-time-columns :Time [:day-of-week :hour]) |
90 | 149 | (tc/head 10)) |
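| | +
| | +;; Under the hood that's roughly equivalent to composing the column-level
| | +;; extractors by hand (an approximation of the mechanism, not the library's
| | +;; actual source):
| | +(-> vic-elec
| | +    (tc/add-column :day-of-week #(tctc/day-of-week (% :Time)))
| | +    (tc/add-column :hour #(tctc/hour (% :Time)))
| | +    (tc/head 10))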
91 | 150 |
|
| 151 | +;; ## The Resampling Pattern |
| 152 | +;; |
| 153 | +;; With time fields extracted, standard tablecloth operations take |
| 154 | +;; over. Resampling, which in time series means aggregating to a coarser
| 155 | +;; time granularity, is just another pattern of composition: add time |
| 156 | +;; columns, group, aggregate, order. |
| 157 | +;; |
| 158 | +;; Let's break it into two steps. First, the data transformation: |
| 159 | + |
92 | 160 | (def demand-by-day |
93 | 161 | (-> vic-elec |
94 | 162 | (tct/add-time-columns :Time [:day-of-week]) |
|
99 | 167 | ;; Look at the aggregated data: |
100 | 168 | (tc/head demand-by-day 7) |
101 | 169 |
|
102 | | -;; Step 2: Visualize the result: |
| 170 | +;; Then visualize: |
103 | 171 | (plotly/layer-bar demand-by-day |
104 | 172 | {:=x :day-of-week :=y :Demand}) |
105 | 173 |
|
106 | 174 | ;; Weekends (days 6 and 7) clearly have lower demand. The `:day-of-week` field |
107 | 175 | ;; came from `add-time-columns`; the group-by, aggregate, and order-by are pure |
108 | | -;; tablecloth. The two libraries compose seamlessly. |
| 176 | +;; tablecloth. tablecloth.time provides the time-specific pieces, then gets |
| 177 | +;; out of the way. |
| 178 | +;; |
| 179 | +;; The same pattern scales to different granularities. Here are daily and |
| 180 | +;; monthly averages: |
| 181 | + |
| 182 | +;; Daily averages: |
| 183 | +(-> vic-elec |
| 184 | + (tct/add-time-columns :Time [:year :month :day]) |
| 185 | + (tc/group-by [:year :month :day]) |
| 186 | + (tc/aggregate {:Demand #(dfn/mean (:Demand %)) |
| 187 | + :Temperature #(dfn/mean (:Temperature %))}) |
| 188 | + (tc/order-by [:year :month :day]) |
| 189 | + (tc/head 10)) |
| 190 | + |
| 191 | +;; Monthly averages: |
| 192 | +(-> vic-elec |
| 193 | + (tct/add-time-columns :Time [:year :month]) |
| 194 | + (tc/group-by [:year :month]) |
| 195 | + (tc/aggregate {:Demand #(dfn/mean (:Demand %))}) |
| 196 | + (tc/order-by [:year :month]) |
| 197 | + (plotly/layer-bar {:=x :month :=y :Demand :=color :year})) |
| 198 | + |
| 199 | +;; Note that tablecloth.time is just a light layer here. You could do this |
| 200 | +;; with tablecloth alone by manually extracting datetime components. |
| 201 | +;; `add-time-columns` simply adds concision — it composes naturally with the
| 202 | +;; tablecloth operations you're already using. |
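| | +
| | +;; For instance, a month column with no tablecloth.time at all, assuming
| | +;; `:Time` was parsed as `java.time.LocalDateTime` (adjust the interop call
| | +;; if your column holds a different datetime type):
| | +(-> vic-elec
| | +    (tc/map-columns :month [:Time]
| | +                    (fn [^java.time.LocalDateTime t] (.getMonthValue t)))
| | +    (tc/head 5))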
109 | 203 |
|
110 | 204 | ;; ## Slicing Time Ranges |
111 | 205 | ;; |
112 | | -;; `slice` selects rows within a time range using binary search on sorted data. |
113 | | -;; It's fast even on large datasets. |
| 206 | +;; `slice` selects rows within a time range. This is where the old index
| 207 | +;; would have done the work; now it uses binary search on a sorted
| 208 | +;; column instead. It's fast even on large datasets: you get O(log n)
| 209 | +;; lookups without the overhead of maintaining a tree structure, though
| 210 | +;; the data may need to be sorted first if it isn't already.
114 | 212 |
|
115 | 213 | (-> vic-elec |
116 | 214 | (tct/slice :Time "2012-01-09" "2012-01-15") |
|
127 | 225 |
|
128 | 226 | ;; ## Lag and Lead Columns |
129 | 227 | ;; |
130 | | -;; `add-lag` shifts column values by a fixed number of rows — useful for |
| 228 | +;; `add-lag` shifts column values by a fixed number of rows — useful for |
131 | 229 | ;; autocorrelation analysis. Note this is row-based, not time-aware: you need |
132 | | -;; to know your data's frequency and calculate the offset. |
133 | | -;; |
134 | | -;; Since this dataset has half-hourly readings, a lag of 48 rows equals 24 hours: |
| 230 | +;; to know your data's frequency and calculate the offset. Since this dataset |
| 231 | +;; has half-hourly readings, a lag of 48 rows equals 24 hours: |
135 | 232 |
|
136 | 233 | (-> vic-elec |
137 | 234 | (tct/add-lag :Demand 48 :Demand_lag48) |
|
149 | 246 |
|
150 | 247 | ;; The tight diagonal shows strong positive correlation — demand at any given |
151 | 248 | ;; time is highly predictive of demand at the same time the previous day. |
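| | +
| | +;; To put a number on it, here is the lag-48 Pearson correlation computed
| | +;; by hand (a sketch; it assumes `add-lag` leaves the first 48 rows of the
| | +;; lag column missing, so we drop those rows before computing):
| | +(let [ds (-> vic-elec
| | +             (tct/add-lag :Demand 48 :Demand_lag48)
| | +             (tc/drop-missing [:Demand_lag48]))
| | +      x  (:Demand ds)
| | +      y  (:Demand_lag48 ds)
| | +      xc (dfn/- x (dfn/mean x))
| | +      yc (dfn/- y (dfn/mean y))]
| | +  ;; correlation = sum of cross-products / sqrt(product of sums of squares)
| | +  (/ (dfn/sum (dfn/* xc yc))
| | +     (Math/sqrt (* (dfn/sum (dfn/* xc xc))
| | +                   (dfn/sum (dfn/* yc yc))))))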
152 | | - |
153 | | -;; `add-lead` shifts values forward — current Demand aligns with Demand 24 hours |
154 | | -;; ahead. Let's see if today's demand predicts tomorrow's: |
| 249 | +;; |
| 250 | +;; `add-lead` works the same way but shifts values forward. Current demand
| 251 | +;; lines up with demand 24 hours ahead, which is useful when you need to
| 252 | +;; pair past observations with future outcomes for predictive modeling:
155 | 253 |
|
156 | 254 | (-> vic-elec |
157 | 255 | (tct/add-lead :Demand 48 :Demand_lead48) |
|
160 | 258 | :=y :Demand_lead48 |
161 | 259 | :=mark-opacity 0.3})) |
162 | 260 |
|
163 | | -;; ## Resampling as a Pattern |
164 | | -;; |
165 | | -;; We showed the resampling pattern above: extract time fields, group, aggregate, |
166 | | -;; order. The same pattern scales to different granularities. Here are daily and |
167 | | -;; monthly averages using the same building blocks: |
168 | | - |
169 | | -;; Daily averages: |
170 | | -(-> vic-elec |
171 | | - (tct/add-time-columns :Time [:year :month :day]) |
172 | | - (tc/group-by [:year :month :day]) |
173 | | - (tc/aggregate {:Demand #(dfn/mean (:Demand %)) |
174 | | - :Temperature #(dfn/mean (:Temperature %))}) |
175 | | - (tc/order-by [:year :month :day]) |
176 | | - (tc/head 10)) |
177 | | - |
178 | | -;; Monthly averages: |
179 | | -(-> vic-elec |
180 | | - (tct/add-time-columns :Time [:year :month]) |
181 | | - (tc/group-by [:year :month]) |
182 | | - (tc/aggregate {:Demand #(dfn/mean (:Demand %))}) |
183 | | - (tc/order-by [:year :month]) |
184 | | - (plotly/layer-bar {:=x :month :=y :Demand :=color :year})) |
185 | | - |
186 | | -;; Note that tablecloth.time is just a light layer in these |
187 | | -;; expressions. You could do this with tablecloth alone by manually |
188 | | -;; extracting datetime components. tablecloth.time's add-time-columns |
189 | | -;; just adds concision and expressiveness — it composes naturally with |
190 | | -;; the tablecloth operations. |
191 | | - |
192 | 261 | ;; ## Combining Primitives |
193 | 262 | ;; |
194 | 263 | ;; Let's do something more interesting: analyze the daily demand profile, |
|
207 | 276 | ;; Weekday demand shows the classic two-peak pattern (morning and evening), |
208 | 277 | ;; while weekend demand is flatter and lower overall. |
209 | 278 |
|
210 | | -;; ## Time Utilities (Column API) |
211 | | -;; |
212 | | -;; tablecloth.time mirrors tablecloth's structure: a dataset API (`tct`) |
213 | | -;; and a column API (`tablecloth.time.column.api`). The column API provides |
214 | | -;; lower-level utilities for working with time data directly — parsing, |
215 | | -;; conversion, flooring, extraction. These power the high-level functions |
216 | | -;; and are available when you need finer control. |
217 | | -;; |
218 | | -;; **Parsing** — `tablecloth.time.parse/parse` handles ISO-8601 strings and |
219 | | -;; custom formats with cached formatters for performance. |
220 | | -;; |
221 | | -;; **Conversion** — `convert-time` moves between representations (Instants, |
222 | | -;; LocalDateTimes, LocalDates, epoch milliseconds) with timezone awareness. |
223 | | -;; |
224 | | -;; **Flooring** — `down-to-nearest`, `floor-to-month`, `floor-to-quarter` bucket |
225 | | -;; timestamps to intervals. Useful for aggregating sub-daily data: |
226 | | - |
227 | | -(require '[tablecloth.time.column.api :as tctc]) |
228 | | - |
229 | | -(-> vic-elec |
230 | | - (tc/add-column :HourBucket |
231 | | - #(tctc/down-to-nearest (% :Time) 1 :hours {:zone "UTC"})) |
232 | | - (tc/head 5)) |
233 | | - |
234 | | -;; The column API parallels `tablecloth.column.api` — work with columns |
235 | | -;; directly, then add them to your dataset. The high-level dataset functions |
236 | | -;; are convenience wrappers built from these pieces. Manipulating time data |
237 | | -;; is notoriously fiddly; tablecloth.time tries to smooth the sharp edges |
238 | | -;; without hiding the underlying java.time power. |
239 | | - |
240 | 279 | ;; ## What's Next |
241 | 280 | ;; |
242 | | -;; tablecloth.time is experimental. Planned additions include rolling windows, |
243 | | -;; differencing, and higher-level patterns like `resample` that wrap the |
244 | | -;; composable building blocks. |
| 281 | +;; tablecloth.time is experimental. The current release provides |
| 282 | +;; focused primitives built on solid foundations: parsing, conversion, |
| 283 | +;; and field extraction at the column level; convenient dataset-level |
| 284 | +;; wrappers that compose with standard tablecloth operations. My hope is
| 285 | +;; that this gives a firm basis for building convenient abstractions that
| 286 | +;; are just patterns of composition.
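| | +
| | +;; As a small taste of that composition: a 24-hour difference column built
| | +;; from the existing primitives (a sketch, not a library function; it
| | +;; assumes `add-lag` marks the first 48 rows as missing):
| | +(-> vic-elec
| | +    (tct/add-lag :Demand 48 :Demand_lag48)
| | +    (tc/drop-missing [:Demand_lag48])
| | +    (tc/add-column :Demand-diff-24h #(dfn/- (% :Demand) (% :Demand_lag48)))
| | +    (tc/head 5))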
| 287 | +;; |
| 288 | +;; Planned additions include rolling windows, differencing, and higher-level |
| 289 | +;; patterns like `resample` that wrap the composable building blocks. |
245 | 290 | ;; |
246 | 291 | ;; The [repository is on GitHub](https://github.com/scicloj/tablecloth.time). |
247 | 292 | ;; For more worked examples, see the |
|