Optimize LiveJournal degree count queries #512

Open: adsharma wants to merge 1 commit into main from optimize-livejournal-degree-counts

adsharma commented May 23, 2026

What Changed

  • Added REL_DEGREE_TABLE logical/physical source operator.
  • q07 rewrite: MATCH (u)-[:follows]->(v) RETURN count(DISTINCT u.id) now becomes REL_DEGREE_TABLE in ACTIVE_BOUND_COUNT mode.
  • q06 rewrite: MATCH (u)-[:follows]->(v) RETURN u.id, count(v) AS deg ORDER BY deg DESC LIMIT 10 now becomes REL_DEGREE_TABLE in TOP_K_DEGREES mode after top-k optimization.
  • Added CSR degree summary methods for native, icebug-disk/parquet CSR, and Arrow CSR rel tables.
  • Kept q08 COUNT_REL_TABLE rewrite intact.

Why

LiveJournal benchmark q06 and q07 were still executing scan-based count plans. These rewrites let the planner answer the unfiltered degree/count shapes from CSR metadata instead of scanning all follows edges.

Validation

  • cmake --build build/release --target lbug_shell
  • LiveJournal icebug-disk benchmark smoke run with build/release/tools/shell/lbug :memory: -b -i ../live-journal-benchmark/icebug_disk/schema.cypher:
    • q06 returned the expected top 10 in ~48ms executing.
    • q07 returned 4004103 in ~0.94ms executing.
    • q08 returned 69362378 in ~0.09ms executing.
  • Native in-memory smoke test for q06/q07 rewrite behavior.

adsharma commented

How does it work?

COUNT_REL_TABLE

COUNT_REL_TABLE is the fast path for plain relationship cardinality.

It replaces plans like:

  SCAN_REL_TABLE -> PROJECT f._ID -> COUNT(f._ID)

with a direct count over the relationship table.

Instead of scanning every edge row, it asks the rel table for its row count. For icebug-disk/parquet-backed rel tables this comes from table/file metadata. For native CSR tables it sums CSR lengths, with handling for committed/uncommitted rows and deletions.

It returns one row: the total relationship count.

Example:

  MATCH (a:user)-[f:follows]->(b:user)
  RETURN count(f);

REL_DEGREE_TABLE

REL_DEGREE_TABLE is the fast path for degree-derived count queries.

It has two modes.

ACTIVE_BOUND_COUNT

Used for q07:

  MATCH (u:user)-[:follows]->(v)
  RETURN count(DISTINCT u.id);

This counts how many source CSR rows have at least one outgoing edge, instead of scanning all edges and deduplicating u.id.

For icebug-disk CSR, this is:

  count rows where indptr[i + 1] > indptr[i]
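
The indptr rule above can be sketched in a few lines of C++. This is only an illustration, not the PR's implementation: the function name `countActiveBoundNodes` is hypothetical, and it assumes the CSR offsets are available as a plain in-memory array of length numNodes + 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): counts bound nodes that have at
// least one edge, given a CSR indptr array with numNodes + 1 entries.
std::uint64_t countActiveBoundNodes(const std::vector<std::uint64_t>& indptr) {
    std::uint64_t active = 0;
    for (std::size_t i = 0; i + 1 < indptr.size(); ++i) {
        // A valid CSR indptr array is non-decreasing.
        assert(indptr[i + 1] >= indptr[i]);
        // Row i is non-empty exactly when its end offset exceeds its start.
        if (indptr[i + 1] > indptr[i]) {
            ++active;
        }
    }
    return active;
}
```

The loop touches one offset per node rather than one row per edge, which is where the saving over a scan-and-deduplicate plan comes from.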

TOP_K_DEGREES

Used for q06:

  MATCH (u:user)-[:follows]->(v)
  RETURN u.id, count(v) AS deg
  ORDER BY deg DESC
  LIMIT 10;

This computes each source node’s degree from CSR row lengths, keeps a top-k heap, and outputs (u.id, deg) directly.

For LiveJournal’s icebug-disk layout, this reads the CSR indptr degree summary instead of materializing ~69M follows rows, aggregating, sorting, and limiting.
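
A minimal sketch of the top-k idea, under stated assumptions: `topKDegrees` is a hypothetical name, the indptr array is assumed to be a plain in-memory vector, and the result carries node offsets rather than u.id values (mapping offsets to IDs is left out). Zero-degree nodes are skipped because they would not match the one-hop pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper (not from the PR): returns up to k (nodeOffset, degree)
// pairs with the largest degrees, ordered by degree descending.
std::vector<std::pair<std::uint64_t, std::uint64_t>> topKDegrees(
    const std::vector<std::uint64_t>& indptr, std::size_t k) {
    using Entry = std::pair<std::uint64_t, std::uint64_t>; // (degree, nodeOffset)
    // Min-heap on degree: the smallest of the current top k sits at the top.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; k > 0 && i + 1 < indptr.size(); ++i) {
        assert(indptr[i + 1] >= indptr[i]); // indptr must be non-decreasing
        const std::uint64_t degree = indptr[i + 1] - indptr[i];
        if (degree == 0) {
            continue; // nodes without edges do not match the pattern
        }
        if (heap.size() < k) {
            heap.emplace(degree, i);
        } else if (degree > heap.top().first) {
            heap.pop(); // evict the current smallest degree
            heap.emplace(degree, i);
        }
    }
    std::vector<std::pair<std::uint64_t, std::uint64_t>> result;
    while (!heap.empty()) {
        result.emplace_back(heap.top().second, heap.top().first);
        heap.pop();
    }
    // The heap pops in ascending degree order; reverse for ORDER BY deg DESC.
    std::reverse(result.begin(), result.end());
    return result;
}
```

The heap never holds more than k entries, so memory stays proportional to k while the pass over indptr is linear in the number of nodes.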

Summary

  • COUNT_REL_TABLE: “How many edges are in this rel table?”
  • REL_DEGREE_TABLE: “How many bound nodes have edges?” or “Which bound nodes have the largest degrees?”

adsharma force-pushed the optimize-livejournal-degree-counts branch from a9a95a1 to 8293095 on May 23, 2026, 17:06
adsharma marked this pull request as ready for review on May 23, 2026, 17:07
adsharma force-pushed the optimize-livejournal-degree-counts branch from 8293095 to 352cca4 on May 23, 2026, 17:09
adsharma requested a review from aheev on May 23, 2026, 17:13

aheev left a comment

Would love to see some tests involving active writes, multiple rel tables, and backward (bwd) rel counts.

      nodeKeyVector->state->getSelVectorUnsafe().setToUnfiltered(entries.size());
  }
  for (sel_t i = 0; i < entries.size(); ++i) {
      writeNodeKey(entries[i].first, i);

Shouldn't we sum the degrees for a node that appears in multiple tables?

Comment thread: src/storage/table/rel_table.cpp (outdated)

  if (auto* localTable = transaction->getLocalStorage()->getLocalTable(tableID)) {
      auto& localRelTable = localTable->cast<LocalRelTable>();
      for (const auto& [nodeOffset, rowIndices] : localRelTable.getCSRIndex(direction)) {
          pushDegree(nodeOffset, rowIndices.size());

What if a nodeOffset appears in both local and persistent storage?

Comment thread: src/storage/table/rel_table.cpp (outdated)

  if (auto* localTable = transaction->getLocalStorage()->getLocalTable(tableID)) {
      auto& localRelTable = localTable->cast<LocalRelTable>();
      for (const auto& [_, rowIndices] : localRelTable.getCSRIndex(direction)) {
          result += !rowIndices.empty();

Ditto.

Comment thread: src/storage/table/arrow_rel_table.cpp (outdated)

  std::vector<std::pair<offset_t, row_idx_t>> ArrowRelTable::getTopKDegreeEntries(
      const transaction::Transaction* transaction, RelDataDirection direction, idx_t k) const {
      if (layout != ArrowRelTableLayout::CSR || direction == RelDataDirection::BWD || k == 0) {
          return const_cast<ArrowRelTable*>(this)->RelTable::getTopKDegrees(transaction, direction,

This would result in a crash, right? Same for IceDiskRel tables.

adsharma force-pushed the optimize-livejournal-degree-counts branch from 352cca4 to b42aeda on May 24, 2026, 05:17

adsharma commented

  • Degree rows are now merged by bound node across persistent, committed in-memory, and local write-transaction
    data.
  • REL_DEGREE_TABLE now sums duplicate node degrees across multiple rel tables, and active-source counts
    deduplicate nodes across tables.
  • Arrow/IceDisk unsupported degree directions now return no fast-path rows instead of falling through into native
    CSR code.
  • Added optimizer coverage for active writes, multiple rel tables, and backward rel count.
  • Allowed the degree top-k rewrite for count(*) in the one-hop degree pattern.
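
The merge and deduplication behavior described in the first two bullets can be illustrated roughly as follows. This is a simplified sketch, not the PR's code: `mergeDegrees` and `countActiveNodes` are hypothetical names, each source (persistent data, local transaction data, or another rel table) is modeled as a list of (nodeOffset, degree) rows, all sources are assumed to share the same bound node table, and deletions are not modeled.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using DegreeRow = std::pair<std::uint64_t, std::uint64_t>; // (nodeOffset, degree)

// Hypothetical helper: sums degrees per bound node across all sources, so a
// node that appears in several sources yields one row with the total degree.
std::unordered_map<std::uint64_t, std::uint64_t> mergeDegrees(
    const std::vector<std::vector<DegreeRow>>& sources) {
    std::unordered_map<std::uint64_t, std::uint64_t> merged;
    for (const auto& source : sources) {
        for (const auto& [nodeOffset, degree] : source) {
            merged[nodeOffset] += degree;
        }
    }
    return merged;
}

// Hypothetical helper: counts distinct bound nodes with at least one edge in
// any source, so a node present in two sources is counted only once.
std::uint64_t countActiveNodes(const std::vector<std::vector<DegreeRow>>& sources) {
    std::unordered_set<std::uint64_t> active;
    for (const auto& source : sources) {
        for (const auto& [nodeOffset, degree] : source) {
            if (degree > 0) {
                active.insert(nodeOffset);
            }
        }
    }
    return active.size();
}
```

Without the merge step, a node with edges in both persistent and local storage would be emitted as two separate degree rows and counted twice as an active source, which is the concern raised in the review threads.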

aheev commented May 24, 2026

Oops, tests are failing.

adsharma force-pushed the optimize-livejournal-degree-counts branch from b42aeda to d5bbf26 on May 24, 2026, 16:14
adsharma force-pushed the optimize-livejournal-degree-counts branch from d5bbf26 to b022401 on May 24, 2026, 16:55