feat: add byte tracking#745
Conversation
a332771 to
0d70705
Compare
Note that servo does not accept AI-generated contributions: https://book.servo.org/contributing/getting-started.html#ai-contributions. Admittedly we should do a better job of communicating that this applies to all repositories under the servo org, not just |
|
I'd be fine with you rewriting the tests and rustdoc.
That's okay.
You don't need to prove that. We rely on the honesty of contributors when rejecting AI generated patches (unless the patch is obviously slop). Thanks for being upfront about it. |
0d70705 to
cdd361a
Compare
This commit introduced a new feature flag that enables the ability to track byte offsets. When the source-positions feature is enabled, the tokenizer will track the number of UTF-8 bytes consumed from the input so far. This is done by giving BufferQueue a `bytes_consumed` field, which is incremented every time a character is consumed. The changes in cargo.toml were needed to make this project load as a git submodule in the Cadmus project. The xhtml-self-closing feature was needed due to EPUBs using XHTML-compatible self-closing on RCDATA/RAWTEXT elements. Change-Id: 566446e2bca101b7fefdca639c1b4d26 Change-Id-Short: uttvvtlxonpy
1b0e47d to
16e417a
Compare
16e417a to
74a33c7
Compare
74a33c7 to
b51978b
Compare
Change-Id: 1cd53437f8710e57e9a9bb17f38bc2fc Change-Id-Short: ynmuwvwskrsy
b51978b to
56499d6
Compare
|
I think this is ready for review 🤞🏾. I believe I've touched up on all the parts that were fully generated. So far, I haven't encountered any bugs 🙏🏾, but would like to keep this project as a submodule, so I always have the ability to make changes if needed, since its now an integral part of Cadmus. That being said: The only thing I'm unsure about are the changes in cargo.toml. Not sure if there is a way to add this project as a submodule. I've read https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html 2 times now and the only solution that seems to work is by updating cargo.toml in this project. I was hoping maybe this could work: [patch.crates-io]
html5ever = { path = "thirdparty/html5ever", package = "html5ever"}But it doesn't. Not sure if there is something I'm missing tho 🙈. |

Byte Tracking
This PR introduced the ability for
TreeSinkimplementations to know byte positions of Tokens. This is done by makingBufferQueuetrack byte consumption and expose methods forTokenizerto get this info, which than passes it forwards toTreeSink.BufferQueue
BufferQueuegets abytes_consumedfield that tracks bytes manipulated via:pop_except_fromadvance_bytes_consumedretreat_bytes_consumedeatnextThe "consuming" methods (pop, eat, next) automatically advance the counter. While
push_backandpush_frontdon't move the counter. This is due to some code paths pushing front without actually consuming bytes, so this would corrupt the counter. Hence it's the callers responsibility to retreat.Tokenizer
The main use case of this PR is to know where a token began in bytes, not necessarily the raw byte position of a character (what
BufferQueue::bytes_consumedtracks.Since Tokenizer is aware of "Tokens" e.g. Tag, Comment and Character tokens, it is in a good position to dictate what the correct byte is that needs to be reported to
TreeSink.For Tag, Comment and Doctype, the byte offset is that start of the token. So for tag
<some-tag></some-tag>, its offset would be0. And for tagpin<some-tag><p></p></some-tag>it would be10.For character tokens, it wold be that start of the first character.
e.g.
<p>abc</p> the textabcwould have offset3`.This is done by keeping track of when we saw the last
<when processing a token. This is stored astoken_start_byte. As we continue to process, we always write tolast_token_end_byteat the end of each token processing so it can be used when we getCharacterTokens. In the above example, it would contain the byte ofa. As when we get the character tokenabcwe would be at byte6but we want to communicate that the character token started at3.TokenSink
Token sink get's a new
Method instead of updating
And breaking the signature.
TreeSink
For folks who implement their own Dom struct,
TreeSinkis able to also pass the byte offset forward.An example of this can be seen in
source-positions-integration.rsConclusion
BufferQueueholds raw position, Tokenizer converts raw position to an offset, and this gets passed down to TokenSink and TreeSink.This was manually written ⌨️. Mainly to proof that human 🧠 and hands were used .
Supporting changes
Cargo
I had to make some small changes to the
cargo.yamlfiles so that cargo would be happy to compile this project as a submodule: https://github.com/OGKevin/cadmus/blob/65cdaaabb5ab795c668b32a4206f8a49fc9e2661/Cargo.toml#L26-L28xhtml-self-closingfeatureThis feature is needed to properly process self closing tags such as
<title/>, which might not be valid html5, but seems to be valid xhtml. Without this feature, html5ever swallows everything while looking for the closing title tag.xhtml-self-closing-integration.rshas test case that showcases this. Well, not the "bug" per se, just how this feature fixes the issue.Ref: baskerville/plato#426 (comment)
🤖
was used to generate some of the rustdocs and tests.Was also used to educate me on the project and brainstorm implementation path.Closes: #734
Ref: