Skip to content

Add storage operations to et_def.proto#216

Open
danmih-ixia wants to merge 1 commit into
mlcommons:mainfrom
danmih-ixia:storage_proposal
Open

Add storage operations to et_def.proto#216
danmih-ixia wants to merge 1 commit into
mlcommons:mainfrom
danmih-ixia:storage_proposal

Conversation

@danmih-ixia

Copy link
Copy Markdown

##Summary

Extends the Chakra execution-trace schema (schema/protobuf/et_def.proto) with a typed model for storage operations so traces can represent I/O alongside compute and communication nodes.

  • Adds three NodeType values — STORAGE_NODE, BATCH_NODE, CHECKPOINT_NODE — and an optional Node.storage field carrying a new StorageInfo message.
  • StorageInfo is layered by StorageLevel (FS / PAGE / BLOCK / DEVICE) with a oneof detail { FsOp, PageOp, BlockOp }. The device layer intentionally has no typed detail and falls back to Node.attr.
  • FsOp covers the POSIX syscalls whose byte traffic reaches storage (open/close/read/write/pread/pwrite/fsync/fdatasync/sync_file_range); metadata-only syscalls and mmap are deliberately out of scope.
  • PageOp covers page-cache events that are causally attributable to an in-scope FS syscall (add_to_cache, dirty, sync/async readahead); background reclaim and mmap faults are out of scope.
  • BlockOp covers bio/request lifecycle (Q/G/I/D/C) for the ops that flow from in-scope FS syscalls (READ, WRITE, FLUSH); discard/secure-erase/zone-management are out of scope.

@danmih-ixia danmih-ixia requested a review from a team as a code owner May 28, 2026 12:07
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@Ricefrog

Copy link
Copy Markdown

recheck

COMM_RECV_NODE = 6;
COMM_COLL_NODE = 7;
STORAGE_NODE = 8;
BATCH_NODE = 9;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq why is the batch node separate? or is it better named as DATALOADER_NODE?

@Ricefrog Ricefrog Jun 8, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this internally and decided that we will remove the BATCH_NODE and CHECKPOINT_NODE types by consolidating these two types under STORAGE_NODE in the following way:

We want to add either STORAGE_LEVEL_FRAMEWORK or STORAGE_LEVEL_MODEL to the StorageLevel enum.

STORAGE_LEVEL_MODEL/FRAMEWORK will have sub-types to differentiate between BATCH, CHECKPOINT, or KV-CACHE operations:

  • This sub-type may be a part of the opaque Node.attr
  • The sub-type may be represented in a more structured way
  • This will be determined after getting feedback from all involved parties

Like STORAGE_LEVEL_DEVICE, we intend for there not to be a corresponding field in the detail oneof in StorageInfo. This implies load-bearing attributes... we can share the attributes that our replayer implementation converges on to aid in Oana's replayer development.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a separate node type for framework level ops like DATALOADER, CHECKPOINT, KVCACHE nodes. and STORAGE_NODE.
Reasoning being datal-oading / checkpointing could have different semantics than purely storage nodes. For example data loader node may have the batch item information and could represent some pre-processing.
I think different node types will make replay more easier as well.
Just my thought on this.

@briancoutinho briancoutinho left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, added some questions and suggestions

// int sync_file_range(int fd, off_t offset, off_t nbytes, unsigned int flags);

message FsOp {
enum Op {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call it OpType or so as it sounds close to FsOp

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agree with this change for FsOp, PageOp, and BlockOp.

int32 fd = 3;
uint64 offset = 4; // byte offset within the file
uint64 size_bytes = 5; // bytes requested / transferred
uint32 flags = 6;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wanted to ask if 32bit bit field will be sufficient for the future

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of 32bit fields was based on the data types of the corresponding parameters in the listed Linux syscalls. If those function signatures ever change and the capturing system uses that new version of Linux, we believe a breaking proto change is acceptable because it makes that difference in the underlying system more explicit.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I agree it is unlikely to change so should be fine

// Two readahead variants distinguished by who triggered them.
// SYNC: cache-miss-driven during a read syscall — the user needs these
// pages now, the kernel widens the request to also prefetch ahead.
PAGE_OP_READAHEAD_SYNC = 3; // kprobe: page_cache_sync_ra

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this closely tied to Linux Page cache management code today? Just curious how we will handle this on future changees. IMO should be ok to add more enums if there is a semantic change

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I would employ a similar philosophy as above; make the proto changes breaking if the underlying Linux page cache changes are breaking rather than try to maintain backwards compatibility. The tracing model is inherently tied to the underlying system, so trying to maintain proto compatibility only introduces the possibility for subtle bugs down the line.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Resolved, just getting an error when I try to resolve it)

string device = 6; // Tensor object device location.
}

// =============================================================================

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving the storage structure ops in storage.proto and include it in this file?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agree that this is a good idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants