
Fix platform-dependent byte offset drift and Unicode alignment in tree-sitter-ng generators #444

Open
victorgveloso wants to merge 2 commits into GumTreeDiff:main from victorgveloso:fix/windows-offset


@victorgveloso

Root causes identified (see #439 for more details):

  1. Line-ending re-normalization: AbstractTreeSitterNgGenerator read files via BufferedReader.lines() (which strips line endings) and re-joined the lines with System.lineSeparator(). On Windows, this forced CRLF (2 bytes) onto files that originally used LF (1 byte), adding 1 byte of drift for every line in the file.
  2. Character vs. byte length: the calculateOffset helper used String.length() (UTF-16 code units) instead of UTF-8 byte lengths. This caused offsets to drift in files containing multi-byte Unicode characters (e.g., emoji or non-Latin scripts).
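Both drifts can be reproduced in isolation. The following is a minimal sketch, not code from GumTree: the class OffsetDriftDemo and its helpers renormalize and utf8Length are hypothetical names, and the separator is passed in explicitly to stand in for System.lineSeparator() so the effect is visible on any platform.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of GumTree.
class OffsetDriftDemo {

    // Mimics the old read path: lines() drops the terminators, then the
    // lines are glued back together with a caller-supplied separator
    // (standing in for System.lineSeparator()).
    static String renormalize(String content, String separator) {
        return new BufferedReader(new StringReader(content))
                .lines()
                .collect(Collectors.joining(separator));
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String onDisk = "a = 1\nb = 2\nc = 3\n";       // file stored with LF
        String rejoined = renormalize(onDisk, "\r\n"); // re-joined with CRLF

        // Start of the third line: one extra byte per preceding line.
        System.out.println(onDisk.indexOf("c"));   // 12
        System.out.println(rejoined.indexOf("c")); // 14

        // "x=" + U+1F600 + LF: 5 UTF-16 code units, but 7 UTF-8 bytes.
        String emojiLine = "x=\uD83D\uDE00\n";
        System.out.println(emojiLine.length());    // 5
        System.out.println(utf8Length(emojiLine)); // 7
    }
}
```

The 5-versus-7 gap in the second half has the same shape as the "expected: <7> but was: <5>" failure reported for the multi-byte test, although the actual test input is not shown in this PR description.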

The Fix:

  • Implemented a "raw read" approach that uses a StringBuilder to preserve the original file content and line endings exactly as they appear on disk.
  • Standardized on splitting at \n for internal line tracking, keeping any original carriage return (\r) inside the line strings.
  • Updated calculateOffset to use getBytes(StandardCharsets.UTF_8).length so that offsets align with Tree-sitter's internal byte-offset model.
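The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the code in this PR: RawReadSketch, readRaw, and lineStartByteOffset are hypothetical names, and the real calculateOffset signature is not shown here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names do not come from the GumTree code base.
class RawReadSketch {

    // Copies everything from the reader into a StringBuilder without
    // interpreting line endings, so \r\n and \n survive unchanged.
    static String readRaw(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[8192];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Byte offset at which line `row` (0-based) starts. `lines` must come
    // from splitting on "\n", so each line still carries its "\r" if any;
    // the +1 accounts for the "\n" removed by the split.
    static int lineStartByteOffset(String[] lines, int row) {
        int offset = 0;
        for (int i = 0; i < row; i++) {
            offset += lines[i].getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offset;
    }

    public static void main(String[] args) {
        // limit -1 keeps trailing empty lines, so row indices stay stable.
        String[] crlf = readRaw(new StringReader("hello\r\nworld")).split("\n", -1);
        String[] emoji = readRaw(new StringReader("x=\uD83D\uDE00\ny")).split("\n", -1);

        System.out.println(lineStartByteOffset(crlf, 1));  // 7 = 5 + CR + LF
        System.out.println(lineStartByteOffset(emoji, 1)); // 7 = 2 + 4 + LF
    }
}
```

Keeping the \r inside the line string means the CRLF case needs no special handling: the carriage return is counted as an ordinary byte of the preceding line.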

Verification:
We replicated the bug by comparing the official Tree-sitter CLI (v0.26.9) against GumTree on a large Python project (FastAPI). We verified the fix across Windows 10
(PowerShell) and Linux (WSL/Ubuntu) environments.

I have added three new test cases to AbstractTreeSitterNgGeneratorTest that specifically target these discrepancies:

  • OffsetConsistency_testLFOffsets
  • OffsetConsistency_testCRLFOffsets
  • OffsetConsistency_testMultiByteOffsets (Failed previously on both platforms)

Test report before fix (failing on Windows/WSL)

Task :gen.treesitter-ng:test

AbstractTreeSitterNgGeneratorTest > testMatchNodeOrAncestorTypes() PASSED

AbstractTreeSitterNgGeneratorTest > OffsetConsistency_testMultiByteOffsets() FAILED
org.opentest4j.AssertionFailedError: Line 2 should start at byte offset 7 after a 4-byte emoji and LF ==> expected: <7> but was: <5>
at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
at app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:563)
at app//com.github.gumtreediff.gen.treesitterng.AbstractTreeSitterNgGeneratorTest.OffsetConsistency_testMultiByteOffsets(AbstractTreeSitterNgGeneratorTest.java:71)

AbstractTreeSitterNgGeneratorTest > OffsetConsistency_testCRLFOffsets() FAILED
org.opentest4j.AssertionFailedError: Line 2 should start at byte offset 7 for CRLF content ==> expected: <7> but was: <6>
at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
at app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:563)
at app//com.github.gumtreediff.gen.treesitterng.AbstractTreeSitterNgGeneratorTest.OffsetConsistency_testCRLFOffsets(AbstractTreeSitterNgGeneratorTest.java:58)

AbstractTreeSitterNgGeneratorTest > OffsetConsistency_testLFOffsets() PASSED

4 tests completed, 2 failed

Task :gen.treesitter-ng:test FAILED

FAILURE: Build failed with an exception.

  • What went wrong:
    Execution failed for task ':gen.treesitter-ng:test'.

There were failing tests. See the results at: file:///mnt/c/Users/victor/Downloads/gumtree/gen.treesitter-ng/build/test-results/test/

  • Try:

Run with --scan to get full insights from a Build Scan (powered by Develocity).

BUILD FAILED in 11s
13 actionable tasks: 3 executed, 10 up-to-date

Test report after fix (Passing all 57 project tasks)

$ ./gradlew test
Starting a Gradle Daemon, 1 incompatible and 1 stopped Daemons could not be reused, use --status for details

Task :gen.jdt:compileJava
/mnt/c/Users/victor/Downloads/gumtree/gen.jdt/src/main/java/com/github/gumtreediff/gen/jdt/cd/CdJdtVisitor.java:553: warning: [deprecation] getExpression() in SwitchCase has been deprecated
pushNode(node, node.getExpression() != null ? node.getExpression().toString() : "default");
^
/mnt/c/Users/victor/Downloads/gumtree/gen.jdt/src/main/java/com/github/gumtreediff/gen/jdt/cd/CdJdtVisitor.java:553: warning: [deprecation] getExpression() in SwitchCase has been deprecated
pushNode(node, node.getExpression() != null ? node.getExpression().toString() : "default");
^
/mnt/c/Users/victor/Downloads/gumtree/gen.jdt/src/main/java/com/github/gumtreediff/gen/jdt/JdtVisitor.java:426: warning: [deprecation] TokenNameIdentifier in ITerminalSymbols has been deprecated
if (token == ITerminalSymbols.TokenNameIdentifier) {
^
Note: /mnt/c/Users/victor/Downloads/gumtree/gen.jdt/src/main/java/com/github/gumtreediff/gen/jdt/AbstractJdtVisitor.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
3 warnings

Task :benchmark:compileJava
Note: /mnt/c/Users/victor/Downloads/gumtree/benchmark/src/main/java/com/github/gumtree/benchmark/RunOnDataset.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

Task :client.diff:compileJava
Note: /mnt/c/Users/victor/Downloads/gumtree/client.diff/src/main/java/com/github/gumtreediff/client/diff/swingdiff/DirectoryPanel.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

Task :core:test

TestActionGenerator > testAlignChildren() PASSED

TestActionGenerator > testWithUnmappedRoot() PASSED

TestActionGenerator > testWithActionExample() PASSED

TestActionGenerator > testWithZsCustomExample() PASSED

TestActionGenerator > testWithActionExampleNoMove() PASSED

TestActionIoUtils > testActionsIoUtilsMove() PASSED

TestActionIoUtils > testActionsIoUtilsInsert() PASSED

TestActionIoUtils > testActionsIoUtilsUpdate() PASSED

TestAlgorithms > testLcs() PASSED

TestAlgorithms > testLcss() PASSED

TestAlgorithms > testHungarianAlgorithm() PASSED

TestAutoMatcher > testAutoMatcher() PASSED

TestCdMatcher > testLeafMatcher() PASSED

TestClassicGumtreeStability > testStability() PASSED

TestDefaultPriorityTreeQueue > testPopOpenWithHeight() PASSED

TestDefaultPriorityTreeQueue > testPopOpenWithSize() PASSED

TestDefaultPriorityTreeQueue > testPopOpenWithSizeAndMinPriority() PASSED

TestDefaultPriorityTreeQueue > testSynchronize() PASSED

TestDiff > testComputeWithReaders() PASSED

TestDirectoryComparator > testDirectoryComparatorOnTwoFolders() PASSED

TestDirectoryComparator > testPairFilesInvalidArguments() PASSED

TestDirectoryComparator > testPairAndUnpairFiles() PASSED

TestDirectoryComparator > testDirectoryComparatorOnFileAndFolder() PASSED

TestDirectoryComparator > testDirectoryComparatorOnNonExistentFiles() PASSED

TestDirectoryComparator > testDirectoryComparatorOnTwoFiles() PASSED

TestGumtreeMatcher > testMappingComparatorPosInParent() PASSED

TestGumtreeMatcher > testMinHeightThreshold() PASSED

TestGumtreeMatcher > testSimAndSizeThreshold() PASSED

TestGumtreeMatcher > testMappingComparatorPosInTree() PASSED

TestGumtreeProperties > testGreedyBottomUpMatcher() PASSED

TestGumtreeProperties > testAbstractSubtreeMatcher() PASSED

TestGumtreeProperties > testCompositeMatcher() PASSED

TestGumtreeProperties > testBottomUpMatcher() PASSED

TestGumtreeProperties > testChangeDistillerBottomUpMatcher() PASSED

TestGumtreeProperties > testChangeDistillerLeavesMatcher() PASSED

TestIdMatcher > testIdMatcher() PASSED

TestMappingComparators > testTwinMappings() PASSED

TestMappingStore > testMultiMappingStore() PASSED

TestMappingStore > testMappingStore() PASSED

TestMetadata > testExportInvalid1() PASSED

TestMetadata > testExportInvalid2() PASSED

TestMetadata > testPutNode() PASSED

TestMetadata > testExportCustom() PASSED

TestMetadata > testGlobalIterator() PASSED

TestMetadata > testLocalIterator() PASSED

TestOptimizedMatchers > testRtedThetaMatcher() PASSED

TestOptimizedMatchers > testChangeDistillerThetaParMatcher() PASSED

TestOptimizedMatchers > testClassicGumtreeThetaMatcher() PASSED

TestPair > testToString() PASSED

TestPair > testEquals() PASSED

TestRegistries > testTreeGenerators() PASSED

TestRegistries > testMatchers() PASSED

TestRtedMatcher > testRtedMatcher() PASSED

TestSequenceAlgorithms > testITreeLcssIsomorphism() PASSED

TestSequenceAlgorithms > testLcs() PASSED

TestSequenceAlgorithms > testITreeLcss() PASSED

TestSequenceAlgorithms > testStringLcss() PASSED

TestSimilarityMetrics > testOverlapSimilarity() PASSED

TestSimilarityMetrics > testChawatheSimilarity() PASSED

TestSimilarityMetrics > testJaccardSimilarity() PASSED

TestSimilarityMetrics > testDiceSimilarity() PASSED

TestTree > testGetDescendants() PASSED

TestTree > testImmutable() PASSED

TestTree > testChildUrl() PASSED

TestTree > testToString() PASSED

TestTree > testTypeThreading() PASSED

TestTree > testInsertChild() PASSED

TestTree > testGetParents() PASSED

TestTree > testIsomophism() PASSED

TestTree > testTypesAndLabels() PASSED

TestTree > testIsostructure() PASSED

TestTree > testTreesBetweenPositions() PASSED

TestTree > testDeepCopy() PASSED

TestTree > testSearchSubtree() PASSED

TestTree > testIsRoot() PASSED

TestTree > testFakeTree() PASSED

TestTree > testIsClone() PASSED

TestTree > testChildManipulation() PASSED

TestTreeClassifier > testOnlyRootsClassifier() PASSED

TestTreeClassifier > testAllNodesClassifier() PASSED

TestTreeIoUtils > testSerializeTree() PASSED

TestTreeIoUtils > testLineReader() PASSED

TestTreeIoUtils > testPrintTextTree() PASSED

TestTreeIoUtils > testDotFormatter() PASSED

TestTreeUtils > testPostOrder() PASSED

TestTreeUtils > testPreOrderList() PASSED

TestTreeUtils > testBfs() PASSED

TestTreeUtils > testDepth() PASSED

TestTreeUtils > testBfs2() PASSED

TestTreeUtils > testBfs3() PASSED

TestTreeUtils > testHash() PASSED

TestTreeUtils > testSize() PASSED

TestTreeUtils > testBfsList() PASSED

TestTreeUtils > testPostOrderNumbering() PASSED

TestTreeUtils > testLeafIterator() PASSED

TestTreeUtils > testPostOrder2() PASSED

TestTreeUtils > testPostOrder3() PASSED

TestTreeUtils > testHashValue() PASSED

TestTreeUtils > testHeight() PASSED

TestZsMatcher > testWithCustomExample() PASSED

TestZsMatcher > testWithSlideExample() PASSED

Task :gen.css:test

TestCssTreeGenerator > badSyntax() PASSED

TestCssTreeGenerator > testSimple() PASSED

Task :gen.javaparser:test

TestJavaParserGenerator > testBadSyntax() PASSED

TestJavaParserGenerator > testRange() PASSED

TestJavaParserGenerator > testSimpleSyntax(String, int, String) > [1] CompilationUnit, 12, package foo.bar; public class Foo { public int foo; } PASSED

TestJavaParserGenerator > testSimpleSyntax(String, int, String) > [2] CompilationUnit, 37, public class Foo { public List foo; public void foo() { for (A f : foo) { System.out.println(f); } } } PASSED

TestJavaParserGenerator > testSimpleSyntax(String, int, String) > [3] CompilationUnit, 23, public class Foo {
public void foo() {
new ArrayList().stream().forEach(a -> {});
}
} PASSED

Task :gen.jdt:test

TestJdtGenerator > testTagElement() PASSED

TestJdtGenerator > testPrefixExpression() PASSED

TestJdtGenerator > testArrayCreation() PASSED

TestJdtGenerator > testIds() PASSED

TestJdtGenerator > testComments2() PASSED

TestJdtGenerator > testClassReservedKeywords() PASSED

TestJdtGenerator > testMethodInvocation() PASSED

TestJdtGenerator > testJava8Syntax() PASSED

TestJdtGenerator > testSimpleSyntax() PASSED

TestJdtGenerator > testVarargs() PASSED

TestJdtGenerator > testComments() PASSED

TestJdtGenerator > testAssignment() PASSED

TestJdtGenerator > testGenericFunctionWithTypeParameter() PASSED

TestJdtGenerator > testTypeDefinition() PASSED

TestJdtGenerator > testClassReservedKeywords2() PASSED

TestJdtGenerator > testClassReservedKeywords3() PASSED

TestJdtGenerator > testJava5Syntax() PASSED

TestJdtGenerator > testPostfixExpression() PASSED

TestJdtGenerator > testEnumReservedKeywords() PASSED

TestJdtGenerator > badSyntax() PASSED

TestJdtGenerator > testRecordReservedKeywords() PASSED

TestJdtGenerator > testInfixOperator() PASSED

TestJdtMatching > testCase_1_20391Classic() SKIPPED

TestJdtMatching > testSpurious1WithClassicDefault() SKIPPED

TestJdtMatching > testSpurious1WithSimple() SKIPPED

TestJdtMatching > testCase_1_0007_Classic() SKIPPED

TestJdtMatching > testSpurious1WithClassic1_Default_0007d191fec7fe2d6a0c4e87594cb286a553f92c() SKIPPED

TestJdtMatching > testCase_1_0a66_Simple() SKIPPED

TestJdtMatching > testNotSpurious1() SKIPPED

TestJdtMatching > testSpurious1WithClassicConfiguredGreedySubtreeMatcher() SKIPPED

TestJdtMatching > testSpurious1WithClassicConfiguredGreedyBottomUpMatcher() SKIPPED

TestJdtMatching > testCase_1_0007_Simple() SKIPPED

TestJdtMatching > testSpurious1WithClassic_Configured_1_0007d191fec7fe2d6a0c4e87594cb286a553f92c() SKIPPED

TestJdtMatching > testSpurious1WithClassic_Configured4Passing_1_0007d191fec7fe2d6a0c4e87594cb286a553f92c() SKIPPED

TestJdtMatching > testSpurious1WithSimple_0007d191fec7fe2d6a0c4e87594cb286a553f92c() SKIPPED

Task :gen.js:test

TestJsGenerator > testStatement() PASSED

TestJsGenerator > testComment() PASSED

TestJsGenerator > testLambda() PASSED

TestJsGenerator > badSyntax() PASSED

Task :gen.json:test

TestJsonTreeGenerator > testSyntaxError1() PASSED

TestJsonTreeGenerator > testSyntaxError2() PASSED

TestJsonTreeGenerator > testSyntaxError3() PASSED

TestJsonTreeGenerator > testJsonArray() PASSED

TestJsonTreeGenerator > testJsonObject() PASSED
WARNING: A restricted method in java.lang.System has been called
WARNING: java.lang.System::load has been called by org.treesitter.utils.NativeUtils in an unnamed module (file:/home/victor/.gradle/caches/modules-2/files-2.1/io.github.bonede/tree-sitter/0.26.6/a06f40ff61859e602985bc8850ebe28d3f54ebd0/tree-sitter-0.26.6.jar)
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for callers in this module
WARNING: Restricted methods will be blocked in a future release unless native access is enabled

Task :gen.treesitter-ng:test

AbstractTreeSitterNgGeneratorTest > testMatchNodeOrAncestorTypes() PASSED

CMakeTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

CTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

CppTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

GoTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

HaskellTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

JavaScriptTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

JavaTreeSitterNgTreeGeneratorTest > testUnicodeInComment() PASSED

JavaTreeSitterNgTreeGeneratorTest > testUnicodeInString() PASSED

JavaTreeSitterNgTreeGeneratorTest > testCommentLine() PASSED

JavaTreeSitterNgTreeGeneratorTest > testAffectationOperatorChange() PASSED

JavaTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

KotlinTreeSitterNgTreeGeneratorTest > testUnicodeInString() PASSED

KotlinTreeSitterNgTreeGeneratorTest > testAffectationOperatorChange() PASSED

KotlinTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

OcamlTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

PhpTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

PythonTreeSitterNgTreeGeneratorTest > testComparisonOperators(String) > [1] < PASSED

PythonTreeSitterNgTreeGeneratorTest > testComparisonOperators(String) > [2] <= PASSED

PythonTreeSitterNgTreeGeneratorTest > testComparisonOperators(String) > [3] > PASSED

PythonTreeSitterNgTreeGeneratorTest > testComparisonOperators(String) > [4] >= PASSED

PythonTreeSitterNgTreeGeneratorTest > testComparisonOperators(String) > [5] == PASSED

PythonTreeSitterNgTreeGeneratorTest > testComparisonOperators(String) > [6] != PASSED

PythonTreeSitterNgTreeGeneratorTest > testBooleanOperators(String) > [1] and PASSED

PythonTreeSitterNgTreeGeneratorTest > testBooleanOperators(String) > [2] or PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [1] = PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [2] += PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [3] -= PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [4] *= PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [5] /= PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [6] //= PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [7] %= PASSED

PythonTreeSitterNgTreeGeneratorTest > testAssignmentOperators(String) > [8] **= PASSED

PythonTreeSitterNgTreeGeneratorTest > testString() PASSED

PythonTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

RTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

RubyTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

RustTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

SwiftTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

TsxTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

TypeScriptTreeSitterNgTreeGeneratorTest > testHelloWorld() PASSED

Task :gen.xml:test

TestXmlTreeGenerator > testSimpleSyntax() PASSED

TestXmlTreeGenerator > testXmlDeclaration() PASSED

Task :gen.yaml:test

TestYamlGenerator > testSyntaxError() PASSED

TestYamlGenerator > testSimpleSyntax() PASSED

[Incubating] Problems report is available at: file:///mnt/c/Users/victor/Downloads/gumtree/build/reports/problems/problems-report.html

BUILD SUCCESSFUL in 1m 52s
57 actionable tasks: 45 executed, 12 up-to-date
Consider enabling configuration cache to speed up this build: https://docs.gradle.org/9.5.1/userguide/configuration_cache_enabling.html

@tsantalis

@jrfaller
This PR fixes a critical bug that occurs when using gen.treesitter-ng:4.0.0-beta6 on Windows.
I would be grateful if you could merge this PR and publish a new Maven release as soon as possible.
