Skip to content

Core: Add V4 location relativization utilities#16174

Open
anoopj wants to merge 7 commits into
apache:mainfrom
anoopj:v4-location-util
Open

Core: Add V4 location relativization utilities#16174
anoopj wants to merge 7 commits into
apache:mainfrom
anoopj:v4-location-util

Conversation

@anoopj
Copy link
Copy Markdown
Contributor

@anoopj anoopj commented Apr 30, 2026

Add utility methods for converting between absolute and relative file paths. These will be used by v4 metadata to store locations relative to the table location.

Proposal: https://s.apache.org/iceberg-spec-relative-path
Spec PR: #15630

@github-actions github-actions Bot added the core label Apr 30, 2026
@anoopj anoopj moved this to In review in V4: metadata tree Apr 30, 2026
* hdfs://}). Per RFC 3986, a scheme ends at the first {@code :} and cannot contain {@code /}, so
* a colon that appears after a slash is not a scheme delimiter.
*/
static boolean isAbsolute(String path) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is the right way to check for the presence of a scheme. In the relative case, it could potentially do two full passes through the string, which is slow. It also will return true for invalid paths and it will return false for valid URIs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to do a faster check.

* against bare local paths.
*/
static String resolve(String path, String tableLocation) {
if (path == null || isAbsolute(path) || !isAbsolute(tableLocation)) {
Copy link
Copy Markdown
Contributor

@danielcweeks danielcweeks Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!isAbsolute(tableLocation) check doesn't make sense. By definition, the table location must be absolute or omitted, so this can only return false. At this point, the table location should have been provided (either from metadata or from the catalog). We should assert that it's not null, but probably not check at this point that it is a valid uri (I feel we should assume that check has already been performed).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

return path;
}

Preconditions.checkArgument(path.startsWith("/"), "Relative path must start with /: %s", path);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a requirement of a relative component. We don't define or enforce path separators, so we shouldn't require it here. For example with S3, there's no requirement that path separators exist in the keyspace.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to be a bit defensive, but I do agree that it deviates from the spec. Fixed.

Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
@anoopj anoopj force-pushed the v4-location-util branch from 1b8accb to ce7d435 Compare May 1, 2026 00:05
@anoopj anoopj requested a review from danielcweeks May 1, 2026 00:09
anoopj added 3 commits May 1, 2026 08:06
Add isAbsolute, resolve, and relativize methods for converting
between absolute and relative file paths. These will be used by
v4 metadata to store locations relative to the table location.
@anoopj anoopj force-pushed the v4-location-util branch from ce7d435 to cbfb44b Compare May 1, 2026 15:07

import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class RelativePathUtil {
Copy link
Copy Markdown
Contributor

@danielcweeks danielcweeks May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to change this to something more like LocationUtils, which unfortunately already exists under org.apache.iceberg.util. There's only one method in that class, so maybe we deprecate that and move it here to preserve the package private access for now add this functionality there and either leave it protected until we use it or just expose it as public.

We should avoid making this purely about relative operations and also we've tried to standardize more around location than path

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially added it to LocationUtil, but created a new class to keep these methods package-private for now. I will move it to LocationUtil and make it public.

@anoopj anoopj requested a review from danielcweeks May 6, 2026 21:59
Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
* characters, per <a href="https://datatracker.ietf.org/doc/html/rfc3986#section-3.1">RFC 3986
* section 3.1</a>.
*/
private static boolean hasScheme(String location) {
Copy link
Copy Markdown
Contributor

@rdblue rdblue May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to try to speed this up, but I'm concerned about being outside of the RFC. Section 3.1 says that the scheme can contain +, -, and .:

scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

Rather than only using isLetterOrDigit, I think this should also cover ch == '+' || ch == '-' || ch == '.'.

There is also no length check. I'm torn on this because I don't think we are effectively limiting many cases by choosing a limit. At the same time, how often are we going to hit paths of only alphanumeric characters without /?

The false-positive case is that we have a long relative path directory name that is not a scheme, since we think schemes are almost always < 10 chars. Initial directory names are almost always "data/" or "metadata/", which will exit the loop on / at index 4 or 8. By that logic, this only helps if we're checking very strange paths that are not absolute, not regular table paths, and have long alphanumeric directory names. Doesn't seem worth the potential bug for this optimization to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually agree. Simplified.

Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestRelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
@anoopj anoopj force-pushed the v4-location-util branch from 681076f to fa10500 Compare May 8, 2026 15:12
@anoopj anoopj requested review from nastra and rdblue May 8, 2026 15:23
Comment thread core/src/test/java/org/apache/iceberg/util/TestLocationUtil.java Outdated
}

@Test
public void testRelativizeLocationNotUnderTableLocation() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like for this to test both a table/path mismatch and a bucket mismatch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

Copy link
Copy Markdown
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code is correct. There are a couple minor comments on the tests, but otherwise I think this is good.

* location is returned as-is.
*/
public static String relativizeLocation(String tableLocation, String location) {
if (location.startsWith(tableLocation)) {
Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix-collision bug: startsWith matches sibling locations that share a prefix with the table location.

relativizeLocation("s3://bucket/table", "s3://bucket/table_v2/data/00000-0.parquet")
  -> "_v2/data/00000-0.parquet"   // wrong; this location is NOT under tableLocation

The new testRelativizeLocationNotUnderTableLocation covers different-bucket and different-path cases, but neither shares a prefix with table, so this case still slips through. Worth a test for the table vs table_v2 style collision.

One possible solution if the path separator is / char.

  if (location.startsWith(tableLocation)) {
    String rest = location.substring(tableLocation.length());
    if (rest.isEmpty() || rest.startsWith("/")) {
      return rest;
    }
  }

  return location;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this defensive check in the initial version of this PR, but removed it because I got feedback that we should not do assumptions about path separators. cc @danielcweeks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here isn't really about whether / is a separator — at the storage layer, S3 is a flat keyspace (bucket + opaque object key), and there's no folder concept in S3 itself. Requiring path-separator semantics at that level isn't right.

The actual problem with byte-prefix startsWith is relocation correctness, which is the whole purpose of supporting relative paths.

Same-location round-trip looks fine in isolation:

tableLocation = "s3://bucket/table"
file          = "s3://bucket/table_v2/file"   (a sibling of the table, not a child)
relativize    -> "_v2/file"
resolve back  -> "s3://bucket/table_v2/file"  ✓ round-trips

But move the table to s3://newbucket/newtable and resolve the same stored relative form:

resolve("s3://newbucket/newtable", "_v2/file")
  -> "s3://newbucket/newtable_v2/file"

The file that originally lived at s3://bucket/table_v2/file (a sibling of the old table) now resolves to s3://newbucket/newtable_v2/file, with no reason that key should exist or contain the right data. A file that isn't structurally a child of the table location cannot survive relocation when stored as a relative path, and the whole point of relative paths is to make relocation safe.

The spec PR's rule is "If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored." — but it doesn't pin down what "common prefix" means. Byte-prefix permits the case above; a boundary/segment-aware notion forbids it. The relocation argument points to the latter — otherwise relative paths are unsafe for any sibling-prefix layout.

cc @rdblue / @danielcweeks: should s3://bucket/table_v2/file be relativizable against s3://bucket/table? If not, the spec should make that explicit, and relativizeLocation should refuse to produce a relative form in that case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scenario Steven described is definitely possible. I had called out this in the proposal doc.

I agree that this is something we need to consider in the spec.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to add defensive checks if it can be done efficiently in the reference implementation, but I'm not convinced there's a way to cleanly solve for this in the spec. This scenario should resolve to an absolute path, but I don't think we can provide a concrete solution without relying on a separator character.

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scenario should resolve to an absolute path

Great. This is the first thing that we need to agree on.

but I don't think we can provide a concrete solution without relying on a separator character.

This is the tricky part. The location is an URI (not a s3 object key string). It might be reasonable to assume / as a separator character for hierarchies?

The spec already mentioned the path separator. Can we require that the relative portion must start with the path separator?

Paths used as prefixes must not end in a path separator. 
The relative portion is appended to the prefix without
 introduction of any additional separator characters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we require that the relative portion must start with the path separator

Alternatively, we can flip it require that the table location must end with a path separator. Both options solve this problem.

Implementors will find it a bit surprising that the relative path starts with a /. For instance, if you use the Rust Url libraries you get surprising results:

use url::Url;

fn main() {
      let base = Url::parse("s3://bucket/mytable").unwrap();

      let iceberg_url = base.join("/data/file.parquet").unwrap();
      # prints out: Iceberg absolute: s3://bucket/data/file.parquet (mytable is dropped)
      println!("Iceberg absolute: {iceberg_url}");
}

This can be avoided if we flip it and require the location must end with the separator instead. The downside of this approach is that we might lose the early termination done in the current version of this PR and may have to look at more characters for a colon instead. But honestly it may not make much of a difference in modern CPUs if JVM is using JIT compilation with SIMD instructions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was originally pushback against including the separator at the end of the table prefix (I don't remember the specific argument, but I'll look into that). I don't see a good reason at this point to not change the separator location. That would also align better with what people intuitively think of when looking at a path (relative paths typically don't start with a /).

I'll follow up on this.

Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/util/TestLocationUtil.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

5 participants