Skip to content

feat: cap directory listing at max_directory_count to avoid full scans#370

Merged
allison-truhlar merged 11 commits into
JaneliaSciComp:mainfrom
mkitti:mkitti-max-dir-entries
May 15, 2026
Merged

feat: cap directory listing at max_directory_count to avoid full scans#370
allison-truhlar merged 11 commits into
JaneliaSciComp:mainfrom
mkitti:mkitti-max-dir-entries

Conversation

@mkitti
Copy link
Copy Markdown
Contributor

@mkitti mkitti commented May 11, 2026

Summary

  • yield_file_infos_paginated previously called list(os.scandir(...)) which read every entry in a directory before paginating, causing slow responses for large directories
  • Now uses itertools.islice(scanner, max_count + 1) to read at most max_directory_count + 1 entries (default 10,000, configurable via Settings.max_directory_count / FGC_MAX_DIRECTORY_COUNT)
  • When truncated, entries are returned in filesystem order (sort skipped) and is_truncated: true is included in the API response (There's a complication here. See below.)
  • Frontend disables column sorting when isTruncated is true

Test plan

  • Run backend tests: pixi run -e test test-backend
  • Run frontend type check: pixi run node-check
  • Test with a large directory (>10,000 files) and verify fast first-page response
  • Verify is_truncated appears in the API response for large directories
  • Verify sort controls are disabled in the UI when a directory is truncated

🤖 Generated with Claude Code

mkitti and others added 4 commits May 11, 2026 17:15
Large directories caused slow responses because yield_file_infos_paginated
read and sorted all entries regardless of pagination. Now at most
max_directory_count+1 entries are read via itertools.islice; if truncated,
entries are returned in filesystem order (sort skipped) and is_truncated
is set in the API response. The frontend disables column sorting when
is_truncated is true. max_directory_count defaults to 10000 and is
configurable via settings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ount

Without sorting in the truncated case, entries were returned in arbitrary
filesystem order, breaking cursor-based pagination and causing each page
request to re-scan all max_count entries from scratch to locate the cursor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a cap on directory entry scanning for paginated listings to avoid full directory scans on very large folders, and surfaces truncation state to the API and UI.

Changes:

  • Backend: cap yield_file_infos_paginated scanning/counting via max_count (defaulting to Settings.max_directory_count) and return an is_truncated flag.
  • API + frontend: propagate is_truncated to the client and disable table sorting when a listing is truncated.
  • Tests: update pagination tests for the new return shape and add a truncation test.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
fileglancer/filestore.py Limits scandir reads to max_count + 1, caps total_count, and returns is_truncated.
fileglancer/settings.py Adds max_directory_count setting (default 10,000).
fileglancer/server.py Adds is_truncated to the paginated directory listing API response.
frontend/src/queries/fileQueries.ts Adds is_truncated to response typing and maps it to isTruncated in query data.
frontend/src/components/ui/BrowsePage/FileTable.tsx Disables column sorting when the listing is truncated.
tests/test_pagination.py Updates tuple unpacking for new return value and adds a truncation-focused test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread fileglancer/filestore.py Outdated
directory has more entries than max_count.
"""
if max_count is None:
max_count = get_settings().max_directory_count
Comment thread fileglancer/filestore.py
Comment on lines +529 to +538
# Read max_count + 1 entries: the extra one detects truncation without
# reading the entire directory.
with os.scandir(full_path) as scanner:
entries = list(itertools.islice(scanner, max_count + 1))
total_count = len(entries)
is_truncated = total_count > max_count
if is_truncated:
entries = entries[:max_count]
total_count = max_count
entries.sort(key=lambda e: (not e.is_dir(follow_symlinks=False), e.name))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sorting and cursor methods may need to be reevaluated in the case of truncation.

Comment thread tests/test_pagination.py
Comment on lines +198 to +208
def test_truncation(self, pagination_dir):
"""is_truncated is True and entries are in filesystem order (not sorted) when truncated."""
fsp = FileSharePath(zone="test", name="test", mount_path=pagination_dir)
store = Filestore(fsp)
# max_count=5 on a 13-entry directory
infos, has_more, next_cursor, total_count, is_truncated = (
store.yield_file_infos_paginated(None, limit=10, max_count=5)
)
assert total_count == 5
assert is_truncated is True
assert len(infos) == 5 # limit=10 > max_count=5, so all 5 returned
@mkitti
Copy link
Copy Markdown
Contributor Author

mkitti commented May 11, 2026

We may need to re-evaluated how sorting or cursors work when truncation is applied. The comment about is_dir is also intriguing regarding the performance impact.

The core requirement is a stable, consistent ordering so cursor-based pagination works without re-scanning on every page. There are a few alternatives:

  1. Sort by name only (drop is_dir) — The expensive part on NFS is is_dir(): when the NFS server
    returns DT_UNKNOWN for d_type, each call requires a stat RPC. Sorting by name alone gives stable
    pagination ordering without those extra network calls:
    entries.sort(key=lambda e: e.name)
  2. Directories and files would be interleaved rather than dirs-first, but pagination would work
    correctly and cheaply.
  3. Cache the sorted entry list — Scan and sort once, store the result in memory (e.g., keyed by (fsp,
    path, mtime)) with a short TTL. Subsequent page requests hit the cache instead of NFS. The first
    load is still slow, but pages 2–50 are instant.
  4. Offset-based pagination instead of cursor-based — Replace the cursor (filename string) with an
    integer offset. Each page does entries[offset : offset+limit] after the islice. Avoids searching for
    the cursor name on every page, but loses the resilience to insertions/deletions mid-browse that
    cursors provide.

Of these, sort by name only is the simplest change with the best NFS performance. Caching would be
the most impactful for repeated browsing of the same large directory. Want to pursue either of those?

@allison-truhlar
Copy link
Copy Markdown
Collaborator

I added a notification for when the folder is truncated - thoughts on the wording?
Screenshot 2026-05-13 at 2 26 54 PM

@krokicki
Copy link
Copy Markdown
Member

@allison-truhlar LGTM

type: 'warning',
title: 'Folder contents truncated',
message:
'Only showing the first 10,000 items in this folder. Reorganize contents into sub-folders to see more.'
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I have the maximum number of items configurable in settings.py. Would there be a better place to define this so that the notification matches the setting?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just pass the value through the response along with is_truncated?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commit a04628b?

@mkitti
Copy link
Copy Markdown
Contributor Author

mkitti commented May 13, 2026

I added a notification for when the folder is truncated - thoughts on the wording?

I would say the the following and avoid the recommendation regarding subfolder. The recommendation would likely corrupt the neuroglancer precomputed layout here.

This directory contains more than 10,000 items and only the first 10,000 items are shown.

Do we use the term "folder" consistently? I associate "folder" with the GUI appearance and directory as the technical term for the file system object.

@mkitti
Copy link
Copy Markdown
Contributor Author

mkitti commented May 13, 2026

I think chromatic.yml needs to specify repository. It does not work since I submitted this from my fork.

Copilot suggests the following.

- uses: actions/checkout@v5
  with:
    repository: ${{ github.event.pull_request.head.repo.full_name || github.repository }}
    ref: ${{ github.event.pull_request.head.sha || github.sha }}
    fetch-depth: 0

@allison-truhlar
Copy link
Copy Markdown
Collaborator

Do we use the term "folder" consistently? I associate "folder" with the GUI appearance and directory as the technical term for the file system object.

At some point we did talk about using "folder" rather than "directory", with the assumption that this was how most people outside SciComp would refer to a directory on their computer. However, looking through the UI, there are several places we do use the word "directory" instead - some steps in the "tours" (although other steps use "folder"), tooltip for the "favorite" button in the toolbar, change permissions dialog, the permissions table in the properties panel, Napari tab in the "how to use your data link" dialog, and the job options listing. I counted 10 places where we use "folder", so it's about 3:5 directory:folder. I will note that we should standardize the language going forward, but I lean towards keeping folder now since that was our intended language.

@allison-truhlar
Copy link
Copy Markdown
Collaborator

allison-truhlar commented May 13, 2026

I think chromatic.yml needs to specify repository. It does not work since I submitted this from my fork.

Copilot suggests the following.

- uses: actions/checkout@v5
  with:
    repository: ${{ github.event.pull_request.head.repo.full_name || github.repository }}
    ref: ${{ github.event.pull_request.head.sha || github.sha }}
    fetch-depth: 0

I think I will just disable this for external pull requests since it requires a secret to run - Claude suggested that no matter how it's configured, the secret won't be accessible to a pull_request from a fork. Chromatic suggests configuring the action to run on push, but I didn't want the action to run on every push. I think external pull requests are rare enough I think I will add a workflow_dispatch to manually trigger this with.

For this PR, I will just merge it because I know we're not changing any of the new design system components.

@allison-truhlar
Copy link
Copy Markdown
Collaborator

Revised wording:
Screenshot 2026-05-13 at 3 55 29 PM

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

frontend/src/components/ui/BrowsePage/FileTable.tsx:340

  • Placing the truncation warning inside FileTable means it only renders when the parent renders the table. FileBrowser renders this component only when the filtered displayFiles array is non-empty, so a truncated directory whose returned entries are all hidden by filters (for example dotfiles) will show the empty-state instead of warning that results were truncated; render the warning at the browser level or ensure the table is mounted for truncated results.
      {isTruncated ? (
        <div
          className={`${getNotificationStyles('warning').container} p-4 rounded-md mb-3`}
        >
          <NotificationItem
            notification={{
              id: 0,
              type: 'warning',
              title: 'Folder contents truncated',
              message: `This folder contains more than ${truncationLimit} items. Only ${truncationLimit} items are shown.`
            }}
            showDismissButton={false}
          />
        </div>
      ) : null}

Comment thread fileglancer/server.py Outdated
Comment thread fileglancer/settings.py
Comment thread tests/test_pagination.py
Comment thread frontend/src/components/ui/BrowsePage/FileTable.tsx Outdated
…ation

The truncated subset contains whatever max_count+1 entries os.scandir returned first; scandir order is filesystem-dependent, so the test's assertion that dirs appear first was a coincidence that held locally but not on CI. Assert what yield_file_infos_paginated actually guarantees: the returned subset is sorted, and repeat calls are stable.
@allison-truhlar
Copy link
Copy Markdown
Collaborator

Merging this after addressing all the Copilot review comments.

@allison-truhlar allison-truhlar merged commit 8b7045c into JaneliaSciComp:main May 15, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants