Skip to content

feat: add byte tracking#745

Open
OGKevin wants to merge 2 commits into
servo:mainfrom
OGKevin:ogkevin/push/uttvvtlxonpy
Open

feat: add byte tracking#745
OGKevin wants to merge 2 commits into
servo:mainfrom
OGKevin:ogkevin/push/uttvvtlxonpy

Conversation

@OGKevin

@OGKevin OGKevin commented Jun 16, 2026

Copy link
Copy Markdown

Byte Tracking

This PR introduced the ability for TreeSink implementations to know byte positions of Tokens. This is done by making BufferQueue track byte consumption and expose methods for Tokenizer to get this info, which than passes it forwards to TreeSink.

BufferQueue

BufferQueue gets a bytes_consumed field that tracks bytes manipulated via:

  • pop_except_from
  • advance_bytes_consumed
  • retreat_bytes_consumed
  • eat
  • next

The "consuming" methods (pop, eat, next) automatically advance the counter. While push_back and push_front don't move the counter. This is due to some code paths pushing front without actually consuming bytes, so this would corrupt the counter. Hence it's the callers responsibility to retreat.

Tokenizer

The main use case of this PR is to know where a token began in bytes, not necessarily the raw byte position of a character (what BufferQueue::bytes_consumed tracks.

Since Tokenizer is aware of "Tokens" e.g. Tag, Comment and Character tokens, it is in a good position to dictate what the correct byte is that needs to be reported to TreeSink.

For Tag, Comment and Doctype, the byte offset is that start of the token. So for tag <some-tag></some-tag>, its offset would be 0. And for tag p in <some-tag><p></p></some-tag> it would be 10.

For character tokens, it wold be that start of the first character.
e.g. <p>abc</p> the text abcwould have offset3`.

This is done by keeping track of when we saw the last < when processing a token. This is stored as token_start_byte. As we continue to process, we always write to last_token_end_byte at the end of each token processing so it can be used when we get CharacterTokens. In the above example, it would contain the byte of a. As when we get the character token abc we would be at byte 6 but we want to communicate that the character token started at 3.

TokenSink

Token sink get's a new

    #[cfg(feature = "source-positions")]
    fn set_current_byte(&self, _byte_offset: u64) {}

Method instead of updating

    /// Process a token.
    fn process_token(&self, token: Token, line_number: u64) -> TokenSinkResult<Self::Handle>;

And breaking the signature.

TreeSink

For folks who implement their own Dom struct, TreeSink is able to also pass the byte offset forward.

An example of this can be seen in source-positions-integration.rs

    impl TreeSink for ByteCapturingDOM {

Conclusion

BufferQueue holds raw position, Tokenizer converts raw position to an offset, and this gets passed down to TokenSink and TreeSink.

This was manually written ⌨️. Mainly to proof that human 🧠 and hands were used .


Supporting changes

Cargo

I had to make some small changes to the cargo.yaml files so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28

xhtml-self-closing feature

This feature is needed to properly process self closing tags such as <title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.

xhtml-self-closing-integration.rs has test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.

Ref: baskerville/plato#426 (comment)


🤖 was used to generate some of the rustdocs and tests. Was also used to educate me on the project and brainstorm implementation path.


Closes: #734

Ref:

@github-actions github-actions Bot added the V-non-breaking A non-breaking change label Jun 16, 2026
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from a332771 to 0d70705 Compare June 16, 2026 13:01
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 16, 2026
@simonwuelker

Copy link
Copy Markdown
Member

🤖 was used to generate some of the rustdocs and tests. Was also used to educate me on the project and brainstorm implementation path.

Note that servo does not accept AI-generated contributions: https://book.servo.org/contributing/getting-started.html#ai-contributions.

Admittedly we should do a better job of communicating that this applies to all repositories under the servo org, not just servo/servo.

@OGKevin

OGKevin commented Jun 17, 2026

Copy link
Copy Markdown
Author

Ahh, that might make things interesting now. I wasn't aware, mybad.

What now? As I see 2 solutions:

  1. Remove the tests and rustdocs, as thats the easiest way to remove all 🤖 touched code, but would also significantly lower the quality of the PR.
  2. Re-write them, but I'm now biased to what I've seen, reviewed and agreed with, so my manual rewrite might not be significantly different. Unless you have some feedback on it that makes it look diff. However, how would I then proof that was scaffolded by 🤖 is now 100% manually typed?

The tests turned out to be useful as they caught some gaps in the implementation, but since they've "served their purpose" I'm fine with removing them. As In this specific situation, I've similar tests in Cadmus that covers similar cases.

Do note that tests are the biggest diff:
image


I'm not challenging the usage of 🤖, just not sure how to move forward. Unfortunately, my daily life has come to 🤖 code reviewer. So I take the small manual implementations where I can, was just lazy when it came to tests and rustdocs 🙉 😭.

@simonwuelker

simonwuelker commented Jun 17, 2026

Copy link
Copy Markdown
Member

I'd be fine with you rewriting the tests and rustdoc.

but I'm now biased to what I've seen, reviewed and agreed with, so my manual rewrite might not be significantly different.

That's okay.

However, how would I then proof that was scaffolded by 🤖 is now 100% manually typed?

You don't need to prove that. We rely on the honesty of contributors when rejecting AI generated patches (unless the patch is obviously slop). Thanks for being upfront about it.

@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 0d70705 to cdd361a Compare June 18, 2026 13:44
This commit introduced a new feature flag that enables the ability to
track byte offsets.

When the source-positions feature is enabled, the tokenizer will track
the number of UTF-8 bytes consumed from the input so far. This is done
by giving BufferQueue a `bytes_consumed` field, which is incremented
every time a character is consumed.

The changes in cargo.toml were needed to make this project load as a git
submodule in the Cadmus project.

The xhtml-self-closing feature was needed due to EPUBs using
XHTML-compatible self-closing on RCDATA/RAWTEXT elements.

Change-Id: 566446e2bca101b7fefdca639c1b4d26
Change-Id-Short: uttvvtlxonpy
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch 3 times, most recently from 1b0e47d to 16e417a Compare June 18, 2026 13:49
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 18, 2026
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 16e417a to 74a33c7 Compare June 18, 2026 14:00
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 18, 2026
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from 74a33c7 to b51978b Compare June 19, 2026 20:27
@github-actions github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels Jun 19, 2026
Change-Id: 1cd53437f8710e57e9a9bb17f38bc2fc
Change-Id-Short: ynmuwvwskrsy
@OGKevin OGKevin force-pushed the ogkevin/push/uttvvtlxonpy branch from b51978b to 56499d6 Compare June 19, 2026 20:47
@github-actions github-actions Bot removed the V-non-breaking A non-breaking change label Jun 19, 2026
@github-actions github-actions Bot added the V-non-breaking A non-breaking change label Jun 19, 2026
@OGKevin

OGKevin commented Jun 20, 2026

Copy link
Copy Markdown
Author

I think this is ready for review 🤞🏾. I believe I've touched up on all the parts that were fully generated.

So far, I haven't encountered any bugs 🙏🏾, but would like to keep this project as a submodule, so I always have the ability to make changes if needed, since its now an integral part of Cadmus. That being said:

The only thing I'm unsure about are the changes in cargo.toml. Not sure if there is a way to add this project as a submodule. I've read https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html 2 times now and the only solution that seems to work is by updating cargo.toml in this project.

I was hoping maybe this could work:

[patch.crates-io]
html5ever  = { path = "thirdparty/html5ever", package = "html5ever"}

But it doesn't.

error: failed to load source for dependency `html5ever`

Caused by:
  unable to update /home/kevin/code/github.com/ogkevin/cadmus/thirdparty/html5ever

Caused by:
  found a virtual manifest at `/home/kevin/code/github.com/ogkevin/cadmus/thirdparty/html5ever/Cargo.toml` instead of a package manifest
thirdparty/html5ever/
├── AUTHORS
├── Cargo.lock
├── Cargo.toml
├── COPYRIGHT
├── html5ever
├── LICENSE-APACHE
├── LICENSE-MIT
├── markup5ever
├── rcdom
├── README.md
├── RELEASING.MD
├── rustfmt.toml
├── target
├── tendril
├── web_atoms
└── xml5ever

8 directories, 9 files

Not sure if there is something I'm missing tho 🙈.

@OGKevin OGKevin marked this pull request as ready for review June 20, 2026 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

V-non-breaking A non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Byte-accurate source positions in TreeSink

2 participants