Skip to content

client.query.sql() silently truncates at 5,000 rows — should auto-paginate #157

@suyask-msft

Description

@suyask-msft

Problem

client.query.sql() sends a single GET /{entity_set}?sql=<query> and returns whatever the server responds with. The Dataverse Web API default page size is 5,000 rows. If the query matches more than 5,000 rows, results are silently truncated — no error, no warning.

This is inconsistent with the rest of the SDK:

  • client.records.get() — auto-paginates via @odata.nextLink, returns all results
  • client.dataframe.get() — auto-paginates, consolidates into a single DataFrame

client.query.sql() is the only read method that silently drops data.

Observed behavior

# Table has 136,590 rows
results = client.query.sql("SELECT srcid FROM prefix_ballbyball")
len(results)  # → 5000 (silently truncated, no error)

The user gets 5,000 rows and has no way to know 131,590 rows are missing.

Expected behavior

client.query.sql() should follow @odata.nextLink in the response and return all matching rows, consistent with records.get() and dataframe.get().

Proposed fix

Check for @odata.nextLink in the response from _query_sql() and continue fetching pages until exhausted:

def _query_sql(self, sql: str) -> list[dict[str, Any]]:
    # ... existing validation ...
    url = f"{self.api}/{entity_set}"
    params = {"sql": sql}
    all_rows = []
    
    r = self._request("get", url, params=params)
    body = r.json() if r.text else {}
    all_rows.extend(body.get("value", []))
    
    next_link = body.get("@odata.nextLink")
    while next_link:
        r = self._request("get", next_link)
        body = r.json() if r.text else {}
        all_rows.extend(body.get("value", []))
        next_link = body.get("@odata.nextLink")
    
    return all_rows

Alternatively, return a page iterator consistent with records.get() — but this would change the return type from list[Record] to Iterable[list[Record]].

Impact

Any user querying a table with >5,000 rows via client.query.sql() is getting silently truncated results today. The ?sql= parameter is commonly used for filtered reads and its utility would increase significantly if it could return full result sets.

Context

Discovered during end-to-end testing of a 21-table cricket dataset import (297K total records). After import, analytical queries via client.query.sql() returned only 5,000 of 136,590 rows from the largest table, producing incorrect aggregation results.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions