Skip to content

Add per-entity top-k query support#383

Draft
zipdoki wants to merge 1 commit into
feat/multi-edge-count-by-aggfrom
feat/per-entity-top-k
Draft

Add per-entity top-k query support#383
zipdoki wants to merge 1 commit into
feat/multi-edge-count-by-aggfrom
feat/per-entity-top-k

Conversation

@zipdoki

@zipdoki zipdoki commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Introduces the engine foundation for per-entity top-k queries (#369).

Score entries use a composite source key "{source}:{topk_name}", allowing all ranking variants to share a single EDGE table while remaining independently scannable per user per dimension. Top-k reads scan the score index in O(K).

The background job flow is: read aggregated counts from EdgeGroup via multiEdgeCount, then upsert into the score table. The engine automatically maintains the score index on each upsert, so rankings stay sorted without additional work.

Group metadata - declare a topk target on any group to opt in:

{
"groups": [
  {
    "group": "_count",
    "type": "COUNT",
    "fields": [
      {
        "name": "_target"
      }
    ],
    "directionType": "OUT",
    "ttl": 9223372036854776000,
    "topk": "top_purchased"
  },
  {
    "group": "_count_1y",
    "type": "COUNT",
    "fields": [
      {
        "name": "_target"
      },
      {
        "name": "day",
        "bucket": {
          "type": "date",
          "unit": "MILLISECOND",
          "timezone": "+09:00",
          "format": "yyyy-MM-dd"
        }
      }
    ],
    "directionType": "OUT",
    "ttl": 31536000000,
    "topk": "top_purchased_1y"
  }
]
}

Score table - a plain EDGE table with a score DESC index:

{
  "table": "_{table}_score",
  "schema": {
    "type": "EDGE",
    "source": {
      "type": "STRING",
      "comment": "{user}:{topk_name}"
    },
    "target": {
      "type": "STRING",
      "comment": "item_id"
    },
    "properties": [
      {
        "name": "score",
        "type": "LONG",
        "nullable": false
      }
    ],
    "direction": "OUT",
    "indexes": [
      {
        "name": "score",
        "fields": [
          {
            "field": "score",
            "order": "DESC"
          }
        ]
      }
    ]
  }
}

Top-k query

$ curl -X GET "/graph/v3/databases/{database}/tables/_{table}_score/edges/top-k/{topk}?start=user_A&direction=OUT&limit=10"

{
  "edges": [
    {
      "source": "user_A",
      "target": "item_X",
      "properties": {
        "score": 42
      }
    },
    {
      "source": "user_A",
      "target": "item_Y",
      "properties": {
        "score": 31
      }
    }
  ]
}

Changes

  • Add topk field to Group to declare which score table a group feeds
  • Add multiEdgeCount API — aggregated (source, target) pair count with optional time-window ranges
  • Add topk API — scans a score-indexed EDGE table in O(K) using composite source key "{source}:{topk_name}"
  • Add PerEntityTopKSpec covering metadata definition, background job simulation, and top-k query

How to Test

AI Assistance

  • This PR was written largely with AI assistance.
    • Tool / model:

Introduces the topk API that serves pre-ranked item lists per entity in O(K) by scanning a score index. The score table (EDGE with score DESC index) is kept in sync by a background job that reads aggregated counts from EdgeGroup and upserts composite-keyed score entries.

Group gains a topk field to declare which score table a group feeds, enabling the background job to discover targets automatically
@zipdoki zipdoki self-assigned this Jun 14, 2026
@em3s

em3s commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

@zipdoki Great start — looks like a lot of thought went into this. 👏

This is a complex flow, so before going into code-level details, I'd like to align on whether the current direction can support the scenarios we want to build.

Example scenario: "Top 10 most-purchased items over the past year (365-day rolling window)", where user_A purchases 12 items (Product_A ~ Product_L). Assume sweep → ingestion → query happens once a day.

2025-06-01
  sweep:     nothing to expire
  ingestion: A +10, B +9, C +8, D +7, E +6, F +5, G +4, H +3, I +2, J +1
  query:     [A:10, B:9, C:8, D:7, E:6, F:5, G:4, H:3, I:2, J:1]

2025-06-02
  sweep:     nothing to expire
  ingestion: J +15, I +10
  query:     [J:16, I:12, A:10, B:9, C:8, D:7, E:6, F:5, G:4, H:3]

2025-06-03
  sweep:     nothing to expire
  ingestion: N/A
  query:     [J:16, I:12, A:10, B:9, C:8, D:7, E:6, F:5, G:4, H:3]

2025-06-04
  sweep:     nothing to expire
  ingestion: A +5, K +4, L +3
  query:     [J:16, A:15, I:12, B:9, C:8, D:7, E:6, F:5, G:4, H:3]
             (K:4, L:3 fall outside the top 10)

... (2025-06-05 ~ 2026-05-31: no activity, nothing for sweep to expire)

2026-06-01
  sweep:     expire 2025-06-01 → A: 15→5, B~H: all 0, I: 12→10, J: 16→15
  ingestion: N/A
  query:     [J:15, I:10, A:5, K:4, L:3]
             (B~H drop to 0; K, L get promoted into the top 10)

2026-06-02
  sweep:     expire 2025-06-02 → J: 15→0, I: 10→0
  ingestion: N/A
  query:     [A:5, K:4, L:3]

2026-06-03
  sweep:     nothing to expire (no activity on 2025-06-03)
  ingestion: N/A
  query:     [A:5, K:4, L:3]

2026-06-04
  sweep:     expire 2025-06-04 → A: 5→0, K: 4→0, L: 3→0
  ingestion: N/A
  query:     []

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants