Commit d08aa7f

Updated metadata and did some cleaning
1 parent 7bb0e9d commit d08aa7f

6 files changed

Lines changed: 56 additions & 144 deletions

File tree

site/db.edn

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -116,6 +116,12 @@
116116
:email "edwardaw@connect.hku.hk"
117117
:affiliation [:cim]
118118
:links [{:icon "github" :href "https://github.com/W-Edward"}]}
119+
{:id :tombarys
120+
:name "Tomáš Baránek"
121+
:image "https://avatars.githubusercontent.com/u/5729352?v=4"
122+
:url "https://barys.me/#english-section"
123+
:email "tom@barys.me"
124+
:links [{:icon "github" :href "https://github.com/tombarys"}]}
119125
{:id :mattb
120126
:name "Matthias Buehlmaier"
121127
:affiliation [:cim]}
@@ -145,8 +151,7 @@
145151
:url "https://techascent.com"}
146152
{:id :cim
147153
:name "cim"
148-
:url "https://cim.hkubs.hku.hk/"
149-
}]
154+
:url "https://cim.hkubs.hku.hk/"}]
150155

151156
:topic
152157
[{:id :core

src/data_analysis/book_sales_analysis/about_apriori.clj

Lines changed: 39 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
^{:kindly/hide-code true ; don't render this code to the HTML document
22
:clay {:title "From Correlations to Recommendations"
33
:quarto {:author [:tombarys]
4+
:image "src/data_analysis/book_sales_analysis/graphviz.png"
5+
:description "A Publisher's Journey into Data-Driven Book Sales – exploring how association rule mining can transform business insights using the SciCloj stack."
46
:type :draft
57
:date "2025-10-13"
68
:category :data-analysis
@@ -24,7 +26,6 @@
2426

2527
;; When you run an indie publishing house with over 160 titles and sell thousands of books each month, one question keeps coming back: **Which books do our customers buy together?** This seemingly simple question led me down a fascinating path from basic correlation analysis to building a more robust recommendation system using association rule mining—all with Clojure and the SciCloj ecosystem.
2628

27-
;; > *Disclaimer: please consider that my experience in Clojure is intermediate and entry-level at SciCloj and data-science. I can still assure you my enthusiasm is high.*
2829

2930
;; ## The Starting Point: Understanding Our Data
3031

@@ -37,7 +38,7 @@
3738
(tc/random 5)))
3839

3940
^:kindly/hide-code
40-
(kind/table
41+
(kind/table
4142
orders-sample {:element/max-height 400})
4243

4344
;; *(for clarity, many columns were omitted here; rows were generated with `(tc/random 5)` from an anonymized dataset)*
@@ -50,18 +51,23 @@
5051

5152
;; ## The Transformation: Making Data Analysis-Ready
5253

53-
;; The transformation from raw orders to an analysis-ready format was crucial. Using Tablecloth, the transformation pipeline was surprisingly readable:
54-
55-
;; ❗ FIXME this is too simplified, I have to change this ❗
54+
;; The transformation from raw orders to an analysis-ready format was crucial. Using Tablecloth, the transformation pipeline was straightforward (and could be simplified further).
5655

57-
^:kindly/hide-code
56+
^:kindly/hide-code
5857
(kind/code
59-
";; From customer orders with book lists...
60-
(-> orders
61-
(tc/group-by :zakaznik) ;; Group by customer
62-
(tc/aggregate ;; Aggregate their purchases
63-
{:books #(distinct-books %)})
64-
;; ...to binary matrix where each column is a book")
58+
";; From customer orders with book lists...
59+
(map
60+
(fn [customer-row]
61+
(let [customer-name (:zakaznik customer-row)
62+
books-bought-set (set (parse-books-from-list (:all-products customer-row)))
63+
one-hot-map (reduce (fn [acc book]
64+
(assoc acc book (if (contains? books-bought-set book) 1 0)))
65+
{}
66+
all-titles)]
67+
(merge {:zakaznik customer-name}
68+
one-hot-map)))
69+
(tc/rows customer+orders :as-maps))
70+
;; ...to binary matrix where each column is a book")
6571

6672
;; After transformation, each customer became a row, and each book a column with 1 or 0:
6773

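Note on the hunk above: the new `kind/code` snippet depends on names defined elsewhere (`parse-books-from-list`, `all-titles`, `customer+orders`). A minimal, self-contained sketch of the same one-hot idea in plain Clojure might look like the following. The sample data and the names `customers` and `one-hot-row` are illustrative assumptions, not the project's actual code.

```clojure
;; Hypothetical sample data: one map per customer with the set of books bought.
(def customers
  [{:zakaznik "c-001" :books #{:book-a :book-c}}
   {:zakaznik "c-002" :books #{:book-b}}])

;; Hypothetical catalogue of all titles; each title becomes a 0/1 column.
(def all-titles [:book-a :book-b :book-c])

(defn one-hot-row
  "Returns a map with :zakaznik plus one 0/1 entry per title."
  [titles {:keys [zakaznik books]}]
  (into {:zakaznik zakaznik}
        (map (fn [title] [title (if (contains? books title) 1 0)]))
        titles))

;; One row per customer, one key per book.
(def onehot-rows
  (mapv #(one-hot-row all-titles %) customers))
```

Building each row with `into` and a transducer avoids the explicit `reduce`/`assoc` loop; the resulting sequence of maps could then be passed to `tc/dataset`.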
@@ -73,7 +79,7 @@
7379
(tc/reorder-columns [:zakaznik])
7480
(tc/random 5)
7581
(tc/head 4)
76-
(tc/select-columns #"^:zakaznik|^:book-00.*")))
82+
(tc/select-columns #"^:zakaznik|^:book-00.*"))) ;; quick way to reduce the table width
7783

7884
^:kindly/hide-code
7985
(kind/table onehot-sample)
@@ -176,23 +182,28 @@ scatter-plot
176182

177183
;; One of the most elegant aspects of the Apriori implementation is how it generates larger itemsets from smaller ones without creating duplicates. Consider generating 3-item sets from 2-item sets:
178184

185+
;; ```
186+
;; We want: [:book-a :book-b :book-c] ✓
187+
;; Not: [:book-a :book-c :book-b] ✗ (same set, different order)
188+
;; [:book-b :book-a :book-c] ✗ (same set, different order)
189+
;; ```
190+
179191
^:kindly/hide-code
180192
(kind/code
181-
";; We want: [:book-a :book-b :book-c] ✓
182-
;; Not: [:book-a :book-c :book-b] ✗ (same set, different order)
183-
;; [:book-b :book-a :book-c] ✗ (same set, different order)
184-
185-
(defn join-itemsets [frequent-itemsets k]
193+
"(defn join-itemsets
194+
[frequent-itemsets k]
186195
(let [k-1 (dec k)
196+
;; Only process itemsets of the correct size
187197
valid-sets (filter #(= (count %) k-1) frequent-itemsets)
198+
;; Group by prefix (first k-2 elements) for efficiency
188199
by-prefix (group-by #(vec (take (- k 2) %)) valid-sets)]
189200
(mapcat
190201
(fn [[prefix items]]
191202
(for [set1 items
192203
set2 items
193204
:let [last1 (last set1)
194205
last2 (last set2)]
195-
;; This line ensures canonical ordering
206+
;; Only join if last2 > last1 (enforces canonical order)
196207
:when (and (not= last1 last2)
197208
(pos? (compare last2 last1)))]
198209
(concat prefix [last1 last2])))
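Note on the hunk above: the diff cuts off before the function ends. A complete, runnable sketch of the same prefix-join idea is given below. It assumes each itemset is a vector of keywords already in sorted order, and the closing `by-prefix)))` plus the `vec` wrapper are assumptions, since the rest of the function is not visible in this diff.

```clojure
(defn join-itemsets
  "Generates candidate k-itemsets from frequent (k-1)-itemsets.
   Assumes every itemset is a sorted vector of keywords."
  [frequent-itemsets k]
  (let [k-1 (dec k)
        ;; Only process itemsets of the correct size
        valid-sets (filter #(= (count %) k-1) frequent-itemsets)
        ;; Group by prefix (first k-2 elements)
        by-prefix (group-by #(vec (take (- k 2) %)) valid-sets)]
    (mapcat
     (fn [[prefix items]]
       (for [set1 items
             set2 items
             :let [last1 (last set1)
                   last2 (last set2)]
             ;; Only join if last2 > last1 (enforces canonical order)
             :when (and (not= last1 last2)
                        (pos? (compare last2 last1)))]
         (vec (concat prefix [last1 last2]))))
     by-prefix)))

;; Example: three frequent pairs yield a single canonical triple.
(def triples
  (join-itemsets [[:book-a :book-b] [:book-a :book-c] [:book-b :book-c]] 3))
```

Only the pair sharing the prefix `[:book-a]` is joined, so the permutations `[:book-a :book-c :book-b]` and `[:book-b :book-a :book-c]` are never produced. This sketch covers candidate generation only; checking that every (k-1)-subset of a candidate is frequent would be a separate step.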
@@ -206,21 +217,21 @@ scatter-plot
206217

207218
;; **Support** measures how frequently an itemset appears:
208219

209-
^:kindly/hide-code
210-
(kind/tex
211-
"\\text{Support}(\\{A, B\\}) = \\dfrac{\\text{orders with A and B}}{\\text{total orders}}")
220+
;; $$
221+
;; \text{Support}(\{A, B\}) = \dfrac{\text{orders with A and B}}{\text{total orders}}
222+
;; $$
212223

213224
;; **Confidence** measures how often B appears when A is purchased:
214225

215-
^:kindly/hide-code
216-
(kind/tex
217-
"\\text{Confidence}(A \\rightarrow B) = \\dfrac{\\text{Support}(\\{A, B\\})}{\\text{Support}(\\{A\\})}")
226+
;; $$
227+
;; \text{Confidence}(A \rightarrow B) = \dfrac{\text{Support}(\{A, B\})}{\text{Support}(\{A\})}
228+
;; $$
218229

219230
;; **Lift** measures whether this happens more than random chance:
220231

221-
^:kindly/hide-code
222-
(kind/tex
223-
"\\text{Lift}(A \\rightarrow B) = \\dfrac{\\text{Confidence}(A \\rightarrow B)}{\\text{Support}(\\{B\\})}")
232+
;; $$
233+
;; \text{Lift}(A \rightarrow B) = \dfrac{\text{Confidence}(A \rightarrow B)}{\text{Support}(\{B\})}
234+
;; $$
224235

225236
;; A lift greater than 1 indicates positive association—the items are purchased together more often than if they were independent. A lift of 2.3 means the combination is 2.3 times more likely than chance.
226237

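Note on the three formulas in the hunk above: they can be checked on a toy example. The sketch below assumes orders are represented as plain sets of keywords; the `orders` data and the function names are hypothetical and differ from the project's dataset-based `calculate-support`.

```clojure
(require '[clojure.set :as set])

;; Hypothetical toy data: four orders, each a set of purchased books.
(def orders
  [#{:book-a :book-b}
   #{:book-a :book-b :book-c}
   #{:book-a}
   #{:book-c}])

(defn support
  "Fraction of orders that contain every item in `itemset`."
  [orders itemset]
  (if (empty? orders)
    0.0
    (double (/ (count (filter #(set/subset? (set itemset) %) orders))
               (count orders)))))

(defn confidence
  "Support of A and B together, divided by support of A."
  [orders a b]
  (/ (support orders (set/union (set a) (set b)))
     (support orders a)))

(defn lift
  "Confidence of A -> B, divided by support of B."
  [orders a b]
  (/ (confidence orders a b)
     (support orders b)))

(def lift-a->b (lift orders [:book-a] [:book-b]))
```

Here `:book-a` and `:book-b` appear together in 2 of 4 orders (support 0.5), `:book-a` appears in 3 of 4 (support 0.75), so confidence is about 0.667 and lift is about 1.33, which is above 1 and indicates a positive association.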
src/data_analysis/book_sales_analysis/core_helpers_v2.clj

Lines changed: 8 additions & 107 deletions
Original file line numberDiff line numberDiff line change
@@ -7,16 +7,12 @@
77
[clojure.string :as str]
88
[java-time.api :as jt]
99
[fastmath.stats :as stats]
10-
[scicloj.kindly.v4.kind :as kind]))
10+
[scicloj.kindly.v4.kind :as kind]
11+
[data-analysis.book-sales-analysis.data-sources-v2 :as data]))
1112

1213
;; ## Data Transformation Functions
1314
;; Common data processing functions used across multiple analysis files
1415

15-
;; ### Date Time
16-
17-
(def end-time
18-
(jt/local-date 2025 10 1))
19-
2016
;; ### Scicloj Helpers
2117

2218
(defn merge-csvs [file-list options]
@@ -118,37 +114,18 @@
118114
(map #(str/replace % #"\+" ""))
119115
(map #(str/trim %))
120116
(map sanitize-str)
121-
(map #(str/replace % #"\-\-.+$" "")) ;; doubled titles
122-
(map #(str/replace % #"\-+$" "")) ;; trailing dashes
117+
(map #(str/replace % #"\-\-.+$" ""))
118+
(map #(str/replace % #"\-+$" ""))
123119
(map #(str/replace % #"^3" "k3"))
124-
(map #(str/replace % #"^5" "k5"));; eliminate the digit 3 at the start of two book titles
120+
(map #(str/replace % #"^5" "k5"))
125121
(remove (fn [item] (some (fn [substr] (str/includes? (name item) substr))
126122
["balicek" "poukaz" "zapisnik" "limitovana-edice" "taska" "aktualizovane-vydani" "cd" "puvodni-vydani/neni-skladem"
127123
"merch"])))
128124
distinct
129125
(mapv keyword))
130126
nil))
131127

132-
;; ### Melvil Data Enriching and Convenience Functions
133-
134-
(defn months-between "Calculate how many months a product has been on market"
135-
[start-date end-date]
136-
(let [days (if (and start-date end-date)
137-
(jt/time-between start-date end-date :days)
138-
0)]
139-
(long (Math/round (/ days 30.4375)))))
140-
141-
(defn months-on-market
142-
"Months `book` is on a market. Zero if not at all."
143-
[books-ds book end-date]
144-
(let [date (try
145-
(-> books-ds
146-
(tc/select-columns [:titul :datum-zahajeni-prodeje])
147-
(tc/select-rows #(str/starts-with? (name (:titul %)) (name book)))
148-
(tc/get-entry :datum-zahajeni-prodeje 0))
149-
(catch Exception e nil))
150-
month (if (nil? date) 0 (months-between date end-date))]
151-
month))
128+
;; ### Metadata Enriching and Convenience Functions
152129

153130
(defn czech-author? [book-title]
154131
(let [czech-books #{:k30-hodin
@@ -182,71 +159,10 @@
182159
(rand-int 2)
183160
(if (contains? czech-books (keyword book-title)) 1 0))))
184161

185-
(def category-enrichments
186-
{:k30-hodin "podnikani,firemni-kultura"
187-
:k5-principu-rodicovstvi "vzdelavani-a-vychova,kariera,psychologie,mezilidska-komunikace"
188-
:autismus-bez-masky "psychologie,spolecnost,mezilidska-komunikace"
189-
:genialne-potraviny :zdravi
190-
:genialni-potraviny :zdravi
191-
:jak-zabranit-dalsi-pandemii "budoucnost,spolecnost,ekologie"
192-
:krvavy-uterek :historie
193-
:male-experimenty "kariera,produktivita,psychologie"
194-
:myty-a-nadeje-digitalniho-sveta "budoucnost,spolecnost,kariera"
195-
:nestekej-na-sveho-psa "vzdelavani-a-vychova,mezilidska-komunikace"
196-
:nove-zbrane-vlivu "podnikani,psychologie,kariera,mezilidska-komunikace"
197-
:pamet "zdravi,psychologie"
198-
:pomala-produktivita "produktivita,kariera"
199-
:poridte-si-druhy-mozek "produktivita,kariera"
200-
:prezit "zdravi,psychologie,mezilidska-komunikace"
201-
:stastnejsi "psychologie,zdravi,mezilidska-komunikace"
202-
:ultrazpracovani-lide :zdravi
203-
:vitamin-l "psychologie,mezilidska-komunikace"
204-
:zazracna-imunita :zdravi
205-
:heureka! "podnikani,firemni-kultura"
206-
:zrozeni-evropanu "historie,spolecnost"})
207-
208-
(defn enrich-metadata-csv
209-
"Takes a CSV `file` with book titles, subtitles, categories and technical parameters
210-
and enriches it with supplemental categories and info about author nationalities."
211-
[file]
212-
(let [summary-raw-ds (-> (tc/dataset
213-
file
214-
{:key-fn #(keyword (sanitize-column-name-str %))
215-
:separator \;
216-
:parser-fn {:datum-zahajeni-prodeje :string}
217-
:encoding :utf-8})
218-
(tc/update-columns :datum-zahajeni-prodeje
219-
#(map (fn [date-str]
220-
(when (not-empty date-str)
221-
(parse-csv-date date-str)))
222-
%))
223-
#_(tc/drop-missing [:edice :datum-zahajeni-prodeje]))
224-
sanitized-colnames-ds (-> summary-raw-ds
225-
(tc/rename-columns :all (fn [col] (if col (keyword (sanitize-column-name-str (name col))) col))))
226-
227-
sanitized-rows-ds (-> sanitized-colnames-ds
228-
(tc/update-columns [:titul :podtitul :vazba :barevnost :edice :cenova-kategorie :tloustka]
229-
(fn [column-data] (map sanitize-column-name-str column-data)))
230-
(tc/update-columns [:kategorie-na-e-shopu :kategorie-tema]
231-
(fn [column-data] (map sanitize-category-str column-data)))
232-
(tc/update-columns [:titul] #(map (comp keyword parse-book-name) %)))
233-
234-
enriched--categories-ds (tc/update-columns sanitized-rows-ds :kategorie-na-e-shopu
235-
(fn [categories]
236-
(map (fn [title category]
237-
(if-let [enriched-category (get category-enrichments title)]
238-
(name enriched-category)
239-
category))
240-
(sanitized-rows-ds :titul)
241-
categories)))
242-
enriched-ds (tc/add-column enriched--categories-ds
243-
:cesky-autor (fn [ds] (map czech-author? (ds :titul))))]
244-
(-> enriched-ds
245-
(tc/rename-columns {:column-0 :titul}))))
246-
247162
;; ### One-Hot Encoding Functions
248163

249-
(defn onehot-encode-by-customers
164+
165+
(defn onehot-encode-by-customers ;; FIXME needs refactor and simplification :)
250166
"One-hot encode dataset aggregated by customer.
251167
Each customer gets one row with 0/1 values for each book they bought.
252168
Used for market basket analysis, customer segmentation, etc."
@@ -296,21 +212,6 @@
296212
0.0
297213
(double (/ transactions-with-itemset total-transactions)))))
298214

299-
(defn build-popularity-index
300-
"Creates a popularity index (map in format `{:book :popularity}`) for all items in one-hot encoded dataset"
301-
[dataset]
302-
(let [items (-> dataset (tc/drop-columns :zakaznik) tc/column-names)]
303-
(reduce (fn [acc item]
304-
(assoc acc item (calculate-support dataset [item])))
305-
{}
306-
items)))
307-
308-
^:kindly/hide-code
309-
#_(-> (build-popularity-index (onehot-encode-by-customers data-sources/orders))
310-
(tc/dataset)
311-
(tc/pivot->longer)
312-
(tc/order-by :$value))
313-
314215

315216
^:kindly/hide-code
316217
(defn calculate-adaptive-coefficient

src/data_analysis/book_sales_analysis/data_sources_v2.clj

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -29,9 +29,6 @@
2929

3030
;; ### Book Metadata
3131

32-
#_(def enriched-book-metadata-ds
33-
(helpers/enrich-metadata-csv "data/summary-all-time-all-books-300725.csv"))
34-
3532
;; ## Quick Access
3633
;; Most commonly used datasets with short aliases
3734


src/data_analysis/book_sales_analysis/market_basket_analysis_v2.clj

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -12,8 +12,6 @@
1212
[clojure.math.combinatorics :as combo]
1313
[clojure.string :as str]
1414
[clojure.set]
15-
[scicloj.tableplot.v1.plotly :as plotly]
16-
[scicloj.kindly.v4.kind :as kind]
1715
[data-analysis.book-sales-analysis.data-sources-v2 :as data]
1816
[data-analysis.book-sales-analysis.core-helpers-v2 :as helpers]))
1917

@@ -403,7 +401,7 @@
403401
(comment
404402
;; Generate once - takes ~1 minute
405403
(def quick-analysis
406-
(generate-market-basket-analysis data/orders-slides
404+
(generate-market-basket-analysis data/orders-share
407405
:subset-size 6000
408406
:min-confidence 0.03))
409407

@@ -492,7 +490,7 @@
492490

493491
;; ### Prediction Demo
494492

495-
;; 🎯 **LIVE DEMO**: Show how the website will make recommendations
493+
;; 🎯 **LIVE DEMO**: Show how the website should make recommendations
496494

497495
^:kindly/hide-code
498496
(comment
