
Add a memory bound FileStatisticsCache for the Listing Table #20047

Open
mkleen wants to merge 40 commits into apache:main from mkleen:file-stats-cache

Conversation

@mkleen
Contributor

@mkleen mkleen commented Jan 28, 2026

Which issue does this PR close?

This change introduces a default FileStatisticsCache implementation for the ListingTable with a size limit, following the steps outlined in #19052 (comment).

Rationale for this change

See above.

What changes are included in this PR?

See above.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes: a new runtime setting, datafusion.runtime.file_statistics.cache_limit.

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) catalog Related to the catalog crate common Related to common crate execution Related to the execution crate labels Jan 28, 2026
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jan 28, 2026
@mkleen mkleen force-pushed the file-stats-cache branch 2 times, most recently from e273afc to b297378 on January 28, 2026 14:40
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 28, 2026
@mkleen mkleen marked this pull request as ready for review January 28, 2026 16:23
@mkleen mkleen changed the title Add a default FileStatisticsCache implementation for the ListingTable Add a default FileStatisticsCache with a size limit Jan 28, 2026
@mkleen mkleen changed the title Add a default FileStatisticsCache with a size limit Add a FileStatisticsCache with a size limit Jan 28, 2026
@mkleen mkleen changed the title Add a FileStatisticsCache with a size limit Add FileStatisticsCache with a size limit Jan 28, 2026
@mkleen mkleen changed the title Add FileStatisticsCache with a size limit Add a memory bound FileStatisticsCache with a size limit Jan 29, 2026
@mkleen mkleen changed the title Add a memory bound FileStatisticsCache with a size limit Add a memory bound FileStatisticsCache for the Listing Table Jan 31, 2026
Contributor

@kosiew kosiew left a comment


@mkleen

Thanks for working on this.

@mkleen
Contributor Author

mkleen commented Feb 4, 2026

@kosiew Thank you for the feedback!

@mkleen mkleen requested a review from kosiew February 4, 2026 12:10
Contributor

@kosiew kosiew left a comment


LGTM

@mkleen
Contributor Author

mkleen commented Feb 10, 2026

@kosiew Anything else needed to get this merged? Another approval maybe?

impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}
Member


This won't be accurate.

let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();

// All four Arcs point to the same allocation, so together they should
// account for that allocation only once, but the current implementation
// sizes each clone in full:
assert_eq!(a1.heap_size(), a4.heap_size());
// Summing heap_size() over a1..a4 therefore reports four times the actual usage.

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been "sized" and to ignore any Arc pointing to an address that was "sized" earlier.
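That deduplication could be sketched roughly as follows. This is a minimal illustration, not the PR's API: the function name arc_heap_size and the seen-set threading are assumptions, and the recursion into the inner value's own heap size is omitted for brevity.

```rust
use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

/// Sketch of the suggested fix: the caller threads a set of already-counted
/// allocation addresses through the traversal, so a shared Arc allocation is
/// sized only once no matter how many clones are visited.
fn arc_heap_size<T>(arc: &Arc<T>, seen: &mut HashSet<usize>) -> usize {
    let addr = Arc::as_ptr(arc) as usize;
    if !seen.insert(addr) {
        // This allocation was already counted via another clone.
        return 0;
    }
    // Strong and weak counts live on the heap next to the value itself.
    2 * size_of::<usize>() + size_of::<T>()
}

fn main() {
    let a1 = Arc::new(vec![1, 2, 3]);
    let a2 = a1.clone();
    let mut seen = HashSet::new();
    let total = arc_heap_size(&a1, &mut seen) + arc_heap_size(&a2, &mut seen);
    // The shared allocation is counted once, not twice.
    assert_eq!(total, 2 * size_of::<usize>() + size_of::<Vec<i32>>());
    println!("total = {total}");
}
```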

Contributor Author


Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102 . I would suggest a follow-up here as well. We are planning to restructure the whole heap size estimation anyway.

@mkleen
Contributor Author

mkleen commented Feb 10, 2026

@martin-g Thanks for this great review! I am on it.

@mkleen mkleen force-pushed the file-stats-cache branch from a50770d to fb78530 on April 9, 2026 10:54
@mkleen
Contributor Author

mkleen commented Apr 9, 2026

I’ve incorporated all reviewer feedback and addressed the issues raised.

  • Fixed the regression affecting cache hits by ensuring the listing table cache is correctly referenced via the CacheManager.
  • Scoped the cache by including the table name in the cache key, allowing entries to be removed when a table is dropped.
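The second point can be pictured with a self-contained sketch. The PR's actual key type is called TableScopedPath; the field names and the invalidate_table helper below are illustrative assumptions, with a plain HashMap standing in for the real LRU cache.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the table-scoped cache key: the table name is
/// part of the key, so entries can be found and removed per table.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct TableScopedPath {
    table: String,
    path: String,
}

/// Dropping a table removes every entry whose key carries that table's name.
fn invalidate_table(cache: &mut HashMap<TableScopedPath, u64>, table: &str) {
    cache.retain(|key, _| key.table != table);
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert(TableScopedPath { table: "t1".to_string(), path: "a.parquet".to_string() }, 1u64);
    cache.insert(TableScopedPath { table: "t2".to_string(), path: "a.parquet".to_string() }, 2u64);
    invalidate_table(&mut cache, "t1");
    // Only t2's entry survives, even though both tables share the same path.
    assert_eq!(cache.len(), 1);
    println!("remaining entries: {}", cache.len());
}
```

Scoping the key this way is what makes per-table invalidation possible when two tables point at the same object-store path.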

With these changes, all tests are now passing. However, I’m not fully confident that the overall design is ideal.

I’d appreciate another round of review.

@mkleen mkleen requested a review from nuno-faria April 9, 2026 12:28
@mkleen
Contributor Author

mkleen commented Apr 9, 2026

@martin-g Could you also do another review round please?

STORED AS PARQUET LOCATION 'test_files/scratch/encrypted_parquet/'

query error DataFusion error: Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided
query error Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided
Contributor Author


Error message changed from:

DataFusion error: Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided

to:

DataFusion error: Parquet error: Parquet error: Failed to fetch metadata for file Users/mkleen/datafusion/datafusion/sqllogictest/test_files/scratch/encrypted_parquet/YkbUUuKrhTO6FhwX_3.parquet: Parquet error: Parquet error: Parquet file has an encrypted footer but decryption properties were not provided

@mkleen
Contributor Author

mkleen commented Apr 10, 2026

And sorry to everyone that this took so long. I was very busy and could not find the time.

Contributor

@kosiew kosiew left a comment


@mkleen

Thanks for the update here. I like the direction overall, but I found two issues that should be fixed before this lands. I also left one non-blocking suggestion around cache introspection.

return None;
}

let old_value = self.lru_queue.put(key.clone(), value);
Contributor


Nice change overall, but I think there is a bug in the replacement path for memory_used.

When an existing cache entry is overwritten, this path adds the new key size and value size, but if old_value is present it only subtracts old_entry.heap_size(). It does not subtract the old key's heap usage.

That means repeatedly refreshing the same file will slowly inflate memory_used, which can trigger eviction earlier than expected and make the memory bound inaccurate.

Could you update the overwrite path to subtract the previous key contribution as well? Please also add a regression test that does repeated put calls on the same TableScopedPath and verifies memory_used stays stable.
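A toy model of the accounting fix and regression test being asked for (SizedCache and the string-length "heap size" are illustrative stand-ins for the real LRU queue and DFHeapSize, not the PR's code):

```rust
use std::collections::HashMap;

/// Toy cache that tracks memory_used; heap size is modeled as string length.
struct SizedCache {
    entries: HashMap<String, String>,
    memory_used: usize,
}

impl SizedCache {
    fn put(&mut self, key: String, value: String) {
        let added = key.len() + value.len();
        if let Some(old_value) = self.entries.insert(key.clone(), value) {
            // Overwrite path: subtract BOTH the old key's and the old value's
            // contribution, not just the old value's.
            self.memory_used -= key.len() + old_value.len();
        }
        self.memory_used += added;
    }
}

fn main() {
    let mut cache = SizedCache { entries: HashMap::new(), memory_used: 0 };
    cache.put("file.parquet".to_string(), "stats-v1".to_string());
    let after_first = cache.memory_used;
    // Refreshing the same key repeatedly must not inflate memory_used.
    cache.put("file.parquet".to_string(), "stats-v2".to_string());
    cache.put("file.parquet".to_string(), "stats-v3".to_string());
    assert_eq!(cache.memory_used, after_first);
    println!("memory_used stays at {}", cache.memory_used);
}
```

In this toy the old and new key have the same length, so only the value swap matters; in the real cache the key's heap contribution can be non-trivial, which is exactly why the overwrite path must subtract it too.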

@@ -1418,8 +1427,11 @@ impl SessionContext {
schema.deregister_table(&table)?;
if table_type == TableType::Base
&& let Some(lfc) = self.runtime_env().cache_manager.get_list_files_cache()
Contributor


I think this invalidation logic now only runs when both caches are enabled, because get_list_files_cache() and get_file_statistic_cache() are chained in the same if let.

That creates a cleanup gap. For example, if a session disables the list-files cache but keeps the file-statistics cache enabled, deregistering a table will leave the statistics entries behind. That leaves stale session-scoped state around and can leak memory if the table is re-registered.

Can we make these invalidations independent so each cache is cleaned up whenever it is present?
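The decoupling could be shaped like this sketch, where the Option-wrapped Vecs are stand-ins for whatever get_list_files_cache() and get_file_statistic_cache() return (the function and parameter names are illustrative):

```rust
/// Each cache is invalidated independently whenever it is present, so
/// disabling one cache no longer skips cleanup of the other.
fn invalidate_on_drop(
    list_files_cache: Option<&mut Vec<String>>,
    file_statistics_cache: Option<&mut Vec<String>>,
    table: &str,
) {
    if let Some(lfc) = list_files_cache {
        lfc.retain(|t| t.as_str() != table);
    }
    // Runs even when the list-files cache is disabled (None above).
    if let Some(fsc) = file_statistics_cache {
        fsc.retain(|t| t.as_str() != table);
    }
}

fn main() {
    let mut stats = vec!["t1".to_string(), "t2".to_string()];
    // List-files cache disabled, statistics cache still cleaned up.
    invalidate_on_drop(None, Some(&mut stats), "t1");
    assert_eq!(stats, vec!["t2".to_string()]);
    println!("stats entries left: {}", stats.len());
}
```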

/// Updates the cache with a new memory limit in bytes.
fn update_cache_limit(&self, limit: usize);

/// Retrieves the information about the entries currently cached.
Contributor


Non-blocking thought: now that the file statistics cache key is TableScopedPath, list_entries() -> HashMap<Path, FileStatisticsCacheEntry> seems to flatten away the table dimension again.

That can make observability a bit confusing for helpers like statistics_cache(), especially if two tables point at the same object-store path.

It may be worth exposing TableScopedPath here, or otherwise carrying the table reference through the reported entry, so the debug and introspection surface matches the actual cache semantics.

Development

Successfully merging this pull request may close these issues.

  • Add a default FileStatisticsCache implementation for the ListingTable
  • Add limit to DefaultFileStatisticsCache

6 participants