Add a memory bound FileStatisticsCache for the Listing Table#20047
mkleen wants to merge 40 commits into apache:main
Conversation
mkleen force-pushed from a66420a to 3b33739
mkleen force-pushed from 3b33739 to 8e5560b
mkleen force-pushed from e273afc to b297378
mkleen force-pushed from 59c6bce to 4542db8
@kosiew Thank you for the feedback!

@kosiew Anything else needed to get this merged? Another approval maybe?
mkleen force-pushed from 205f96c to 92899a7
datafusion/common/src/heap_size.rs
Outdated
impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}
This won't be accurate.
let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();
// this should be true because all `a`s point to the same object in memory
// but the current implementation does not detect this and counts them separately
assert_eq!(a4.heap_size(), a1.heap_size() + a2.heap_size() + a3.heap_size() + a4.heap_size());

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been "sized" and to ignore any Arc that points to an address which has been "sized" earlier.
Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102 . I would suggest a follow-up here as well; we are planning to restructure the whole heap-size estimation anyway.
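The deduplication idea suggested above could be sketched roughly as follows. This is a minimal illustration with hypothetical names (`DedupHeapSize`, `heap_size_dedup`), not DataFusion's actual `DFHeapSize` trait: the caller threads a set of already-counted allocation addresses through the traversal so clones of the same `Arc` are counted once.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Hypothetical trait for illustration only; DataFusion's real trait is
// `DFHeapSize`, which has no deduplication parameter.
trait DedupHeapSize {
    fn heap_size_dedup(&self, seen: &mut HashSet<usize>) -> usize;
}

impl DedupHeapSize for Vec<i32> {
    fn heap_size_dedup(&self, _seen: &mut HashSet<usize>) -> usize {
        self.capacity() * std::mem::size_of::<i32>()
    }
}

impl<T: DedupHeapSize> DedupHeapSize for Arc<T> {
    fn heap_size_dedup(&self, seen: &mut HashSet<usize>) -> usize {
        // Deduplicate by allocation address: clones of the same Arc
        // contribute their heap usage only once.
        let addr = Arc::as_ptr(self) as usize;
        if !seen.insert(addr) {
            return 0; // already counted via another clone
        }
        // Strong/weak counts live on the heap next to the T instance.
        2 * std::mem::size_of::<usize>()
            + std::mem::size_of::<T>()
            + self.as_ref().heap_size_dedup(seen)
    }
}

fn main() {
    let a1 = Arc::new(vec![1, 2, 3]);
    let a2 = a1.clone();
    let mut seen = HashSet::new();
    let total = a1.heap_size_dedup(&mut seen) + a2.heap_size_dedup(&mut seen);
    // The clone adds nothing: both Arcs point at the same allocation.
    assert_eq!(total, a1.heap_size_dedup(&mut HashSet::new()));
    println!("shared allocation counted once: {total} bytes");
}
```

The trade-off is that every sizing pass must start from a fresh `seen` set and pass it through the whole object graph, which is why doing this in a follow-up (or in the planned arrow-rs `HeapSize` crate) seems reasonable.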
@martin-g Thanks for this great review! I am on it.
mkleen force-pushed from 92899a7 to 2e3aff9
I've incorporated all reviewer feedback and addressed the issues raised. With these changes, all tests are now passing. However, I'm not fully confident that the overall design is ideal. I'd appreciate another round of review.

@martin-g Could you also do another review round please?
STORED AS PARQUET LOCATION 'test_files/scratch/encrypted_parquet/'

- query error DataFusion error: Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided
+ query error Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided
Error message changed from:
DataFusion error: Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided
to:
DataFusion error: Parquet error: Parquet error: Failed to fetch metadata for file Users/mkleen/datafusion/datafusion/sqllogictest/test_files/scratch/encrypted_parquet/YkbUUuKrhTO6FhwX_3.parquet: Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided
And sorry to everyone that this took so long. I was very busy and could not find the time.
    return None;
}

let old_value = self.lru_queue.put(key.clone(), value);
Nice change overall, but I think there is a bug in the replacement path for memory_used.
When an existing cache entry is overwritten, this path adds the new key size and value size, but if old_value is present it only subtracts old_entry.heap_size(). It does not subtract the old key's heap usage.
That means repeatedly refreshing the same file will slowly inflate memory_used, which can trigger eviction earlier than expected and make the memory bound inaccurate.
Could you update the overwrite path to subtract the previous key contribution as well? Please also add a regression test that does repeated put calls on the same TableScopedPath and verifies memory_used stays stable.
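The fix described above could look roughly like this. It is a minimal sketch with stand-in types (`SizedCache`, a plain `HashMap`, byte-length "heap sizes"), not the PR's actual `LruQueue`: on replacement, both the old key's and the old value's contributions are subtracted before the new entry is added, so repeatedly refreshing the same path keeps `memory_used` stable.

```rust
use std::collections::HashMap;

// Stand-in cache for illustration; the real implementation uses an
// LruQueue keyed by TableScopedPath with DFHeapSize-based accounting.
struct SizedCache {
    entries: HashMap<String, Vec<u8>>,
    memory_used: usize,
}

fn heap_size_key(k: &str) -> usize {
    k.len()
}

fn heap_size_val(v: &[u8]) -> usize {
    v.len()
}

impl SizedCache {
    fn put(&mut self, key: String, value: Vec<u8>) {
        if let Some((old_key, old_value)) = self.entries.remove_entry(&key) {
            // Subtract BOTH the previous key and the previous value;
            // subtracting only the value is the bug described above.
            self.memory_used -= heap_size_key(&old_key) + heap_size_val(&old_value);
        }
        self.memory_used += heap_size_key(&key) + heap_size_val(&value);
        self.entries.insert(key, value);
    }
}

fn main() {
    let mut cache = SizedCache {
        entries: HashMap::new(),
        memory_used: 0,
    };
    for _ in 0..10 {
        // Refreshing the same file repeatedly must not grow memory_used.
        cache.put("part-0.parquet".to_string(), vec![0u8; 64]);
    }
    assert_eq!(cache.memory_used, "part-0.parquet".len() + 64);
    println!("memory_used stable at {} bytes", cache.memory_used);
}
```

The loop in `main` doubles as the shape of the requested regression test: repeated puts on the same key leave the counter at exactly one key-plus-value contribution.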
@@ -1418,8 +1427,11 @@ impl SessionContext {
    schema.deregister_table(&table)?;
    if table_type == TableType::Base
        && let Some(lfc) = self.runtime_env().cache_manager.get_list_files_cache()
I think this invalidation logic now only runs when both caches are enabled, because get_list_files_cache() and get_file_statistic_cache() are chained in the same if let.
That creates a cleanup gap. For example, if a session disables the list-files cache but keeps the file-statistics cache enabled, deregistering a table will leave the statistics entries behind. That leaves stale session-scoped state around and can leak memory if the table is re-registered.
Can we make these invalidations independent so each cache is cleaned up whenever it is present?
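The independent cleanup could be sketched like this, using stand-in types rather than the real `CacheManager` API: each cache is invalidated in its own `if let` whenever it is enabled, instead of chaining both lookups so that one being `None` skips the other.

```rust
// Stand-in types for illustration; not DataFusion's actual CacheManager API.
struct CacheManager {
    list_files_cache: Option<Vec<String>>,      // listed table paths
    file_statistics_cache: Option<Vec<String>>, // per-file statistics keys
}

// Each cache is invalidated independently whenever it is present, so
// disabling one cache never leaves stale entries behind in the other.
fn deregister_table(manager: &mut CacheManager, table_path: &str) {
    if let Some(lfc) = manager.list_files_cache.as_mut() {
        lfc.retain(|p| p.as_str() != table_path);
    }
    if let Some(fsc) = manager.file_statistics_cache.as_mut() {
        fsc.retain(|p| !p.starts_with(table_path));
    }
}

fn main() {
    // The list-files cache is disabled, but the statistics cache is not.
    let mut m = CacheManager {
        list_files_cache: None,
        file_statistics_cache: Some(vec!["t1/part-0.parquet".to_string()]),
    };
    deregister_table(&mut m, "t1");
    // The statistics entries are still cleaned up.
    assert!(m.file_statistics_cache.as_ref().unwrap().is_empty());
    println!("statistics cache invalidated independently");
}
```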
/// Updates the cache with a new memory limit in bytes.
fn update_cache_limit(&self, limit: usize);

/// Retrieves the information about the entries currently cached.
Non-blocking thought: now that the file statistics cache key is TableScopedPath, list_entries() -> HashMap<Path, FileStatisticsCacheEntry> seems to flatten away the table dimension again.
That can make observability a bit confusing for helpers like statistics_cache(), especially if two tables point at the same object-store path.
It may be worth exposing TableScopedPath here, or otherwise carrying the table reference through the reported entry, so the debug and introspection surface matches the actual cache semantics.
Which issue does this PR close?
This change introduces a default `FileStatisticsCache` implementation for the Listing Table with a size limit, implementing the steps outlined in #19052 (comment):

- Add heap size estimation for file statistics and the relevant data types used in caching (this is temporary until "Add a crate for HeapSize trait" arrow-rs#9138 is resolved)
- Redesign `DefaultFileStatisticsCache` to use a `LruQueue` to make it memory-bound, following "Adds memory-bound DefaultListFilesCache" #18855
- Introduce a size limit and use it together with the heap size to bound the memory usage of the cache
- Move `FileStatisticsCache` creation into `CacheManager`, making it session-scoped and shared across statements and tables
- Disable caching in some of the SQL-logic tests where the change altered the output result, because the cache is now session-scoped rather than query-scoped
- Closes "Add a default `FileStatisticsCache` implementation for the `ListingTable`" #19217
- Closes "Add limit to `DefaultFileStatisticsCache`" #19052

Rationale for this change
See above.
What changes are included in this PR?
See above.
Are these changes tested?
Yes.
Are there any user-facing changes?
A new runtime setting: `datafusion.runtime.file_statistics.cache_limit`
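Assuming the new setting follows the `SET` syntax DataFusion already uses for runtime options, usage might look like the fragment below; the value format shown is an assumption, not taken from this PR:

```sql
-- Hypothetical usage; the accepted value syntax is assumed, not confirmed here.
SET datafusion.runtime.file_statistics.cache_limit = '100M';
```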