[SPEC] Add relative paths to v4 spec#15630
Conversation
steveloughran
left a comment
There was a problem hiding this comment.
Commented a bit on .. in path resolution.
it'd be a good test to submit a v3 manifest with file:/tables/table1/../../etc/passwd as a path and see if relativizing it detected the invalid path at that point
rambleraptor
left a comment
There was a problem hiding this comment.
I've got a couple stylistic things to help improve readability
stevenzwu
left a comment
There was a problem hiding this comment.
overall, it looks good to me. just some minor comments/questions
wypoon
left a comment
There was a problem hiding this comment.
I feel that you're trying to avoid mentioning path separators, and that that makes things unclear and confusing. I feel that it makes more sense to say that table location does not end in a path separator, that relative paths do not begin with a path separator, and when appending relative paths, we need to add the path separators in the appropriate places.
dedfe71 to
9749107
Compare
Co-authored-by: Talat Uyarer <talatuyarer@users.noreply.github.com> Co-authored-by: Daniel Weeks <danielcweeks@users.noreply.github.com>
9749107 to
68647e1
Compare
| * **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. | ||
| * **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. | ||
|
|
||
| Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. |
There was a problem hiding this comment.
This seems to imply that resolution was previously allowed, which isn't the case. I like the older wording that things like . and .. have no special meaning.
I would bring back the older version and then add the link to the RFC as a clarification: The relative resolution components defined by the RFC have no special meaning and are opaque to Iceberg.
There was a problem hiding this comment.
We never explicitly prohibited it and I don't feel like we should be adding requirements that weren't there previously.
I didn't change the wording, so I'm not sure what you're referring to. I added the reference to the URI that describes these behaviors since other people were confused about what kind path evaluations we were referring to.
| | Relative Path | v4 | s3://bucket/db/table | data/00000-0.parquet | s3://bucket/db/table/data/00000-0.parquet | Path parts are joined on `/` | | ||
| | Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/00000-0.parquet | hdfs://wh/db/table/data/00000-0.parquet | Absolute path is used | | ||
| | Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/00000-0.parquet | s3://bucket/db/table/data/00000-0.parquet | Fully-qualified path is used | | ||
| | Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/00000-0.parquet | hdfs:/wh/db/table/data/00000-0.parquet | Scheme is prepended for consistency | |
There was a problem hiding this comment.
Should this use file: or hdfs:? I'd lean toward file: because I think that was more common. But I could be wrong. I just haven't seen many paths that are //namenode:8020/wh/db/table.
There was a problem hiding this comment.
No, file: is not more common. The typical configuration for hdfs clusters will include the default fs config in the core-site.xml. I don't know of any practical situations where people are migrating from local file systems.
The example you give would typically be encoded in the cluster configuration like:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode-host:8020</value>
<description>The default filesystem URI</description>
</property>
</configuration>
References to files would just be encoded as: /wh/db/table
The default fs will be handled by the hdfs FileSystem implementation. The real use case we're covering is these types of deployments in hdfs.
| | _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). | | ||
| | | | _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). | | ||
| | | | _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. | | ||
| === "v4" |
There was a problem hiding this comment.
I didn't expect this to add a new table for v4. Is this the same as the old table, except with a v4 requirement column?
There was a problem hiding this comment.
The purpose of adding the tabs was to separate V4 as we go to new structure. Why would we update the old table and add tabs with not additional tabs if we're going to just remove that and add the tab later?
| * If `write.data.path` is an absolute path, it is used directly as the base for new data files. | ||
| * If `write.data.path` is a relative path, the base is the table location joined to the `write.data.path` value with a URI separator `/`. | ||
|
|
||
| When persisting paths into metadata, writers should relativize paths against the table location (see [Path Relativization](#path-relativization)). If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored. |
There was a problem hiding this comment.
If allowed by the table version?
| * If the path contains a URI scheme, it is absolute and is used without modification. | ||
| * If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. | ||
|
|
||
| Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. |
There was a problem hiding this comment.
Two issues with this paragraph after the latest edit:
-
must→shouldintroduces a real ambiguity. With the new join-with-/rule on line 202, a permitted table location ofs3://bucket/table/plus a relativedata/fileproducess3://bucket/table//data/file(double slash). The spec doesn't say whether implementations must reject, normalize, or accept that. Either keepmust, or add a normalization clause covering trailing-separator table locations. -
"without consideration of any additional separator characters" is a bit unclear. The new wording can be read as "any separators already present are ignored," which contradicts line 202 (the join adds a
/). Suggest rewriting to: "no additional separator characters are introduced beyond the single/join character."
| * If `write.data.path` is an absolute path, it is used directly as the base for new data files. | ||
| * If `write.data.path` is a relative path, the base is the table location joined to the `write.data.path` value with a URI separator `/`. | ||
|
|
||
| When persisting paths into metadata, writers should relativize paths against the table location (see [Path Relativization](#path-relativization)). If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored. |
There was a problem hiding this comment.
This recommendation wasn't updated to match the new boundary-aware rule in Path Relativization. "Shares a common prefix" is the original ambiguous phrasing — the same one that allows s3://bucket/table_v2/file to be "under" s3://bucket/table under a byte-prefix interpretation.
Suggested rewording to mirror the normative section:
When persisting paths into metadata, writers should relativize paths against the table location (see Path Relativization). If a file's absolute path starts with the table location followed by the URI separator
/, the relative portion (after removing the prefix and separator) should be stored. Otherwise, the absolute path should be stored.
| * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. | ||
| * **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. | ||
| * **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. | ||
|
|
There was a problem hiding this comment.
The spec defines Absolute path and Relative path in the bullets above, but fully-qualified path is used here (and on lines 136, 206, 1741, and as a row label in the example table at line 214) without a definition.
The term is doing work that the two defined terms can't quite cover. Line 206 says "Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths." That implies a fully-qualified path may not have a URI scheme — which conflicts with the v4 Absolute path definition (which requires one). The "Missing scheme" example row at line 215 reinforces this: a pre-v4 path can lack a URI scheme yet still be treated as a usable, non-relative path.
Two cleanup options:
-
Define the term, e.g.:
Fully-qualified path -- A path used in v3 and earlier metadata that is treated as ready-to-use without resolution against the table location. May contain a URI scheme; when omitted, an implementation-defined default scheme is prepended on read.
-
Drop the term and use Absolute path everywhere, with a clarifying clause that pre-v4 paths omitting a URI scheme are treated as absolute after the default scheme is prepended.
Option 2 feels cleaner — one fewer term, and the migration rule on lines 1741-1742 reads more naturally as "v3 paths are absolute paths (with the scheme prepended if missing)."
There was a problem hiding this comment.
I disagree with this (I'm not sure if this was AI generated or your opinion). Versions prior to V4 were defined in the spec already as either "URI with scheme" or "fully-qualified". Those were the existing terms in the spec.
I don't think we should go back to further define those terms as we may introduce new requirements on older versions. The new terms apply to V4 and the behaviors being introduced in this revision.
You need to take into consideration backward compatibility, which means that we cannot apply option 2 as it would redefine prior versions of the spec.
|
|
||
| Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. | ||
|
|
||
| Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. |
There was a problem hiding this comment.
this sentence reads a bit awkward. is this more clear?
"Paths in pre-v4 manifests are fully-qualified. When a pre-v4 path omits a URI scheme, readers must prepend a scheme to produce a v4-consistent absolute path."
|
|
||
| #### Table Location Specification | ||
|
|
||
| When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should and provide a table's location |
There was a problem hiding this comment.
Two issues in this paragraph:
- Grammatical error: "catalogs should and provide a table's location" is missing a verb (e.g. "manage and provide"), and the sentence has no closing period.
- Implicit actor: "the table location must be provided" doesn't say by whom, then the next clause silently assumes the catalog.
Suggested rewrite:
When the
locationfield is present in table metadata, it is used directly as the table's base location. When thelocationfield is not present (v4 and later), the catalog must provide the table location. How the catalog persists or determines a table's location is outside the scope of this spec.
Changes:
- Names the catalog as the actor in the second sentence.
- Folds "catalogs should and provide a table's location" into the new third sentence, replacing the broken clause and adding the missing period.
- "is not a table-level concern" → "is outside the scope of this spec" — clearer about what the spec is leaving unspecified.
There was a problem hiding this comment.
This is also probably AI picking up prior revisions of the spec (it's using exact wording from earlier that was already rejected).
I fixed the grammatical error, the other suggestions are the opposite of what the goal of this section is. We're not trying to say who provides the location, we also say the catalog should provide the location.
|
|
||
| ### Version 4 | ||
|
|
||
| Relative path support is added in v4. |
There was a problem hiding this comment.
I feel we probably need to introduce sub headers for Version 4, as we have so many changes scoped for V4. if each feature has multiple paragraphs, it is hard to track.
The alternative is that each feature only has bullet points. E.g. the row lineage in V3 section below has a large bullet list.
I am slightly favor the former.
| * When not present, the table location must be managed externally and provided when loading the metadata | ||
| * Location fields in all metadata structures may contain relative paths | ||
| * Writers should produce relative paths by default for files that reside under the table location | ||
| * Absolute paths must be used for files that do not share a common prefix with the table location |
There was a problem hiding this comment.
Same "shares a common prefix" ambiguity flagged in the inline at line 1905. Line 1750 above ("files that reside under the table location") is also informal terminology for what is now formally defined in Path Relativization.
Suggested rewording for lines 1750–1751:
- Writers should produce relative paths by default for files whose absolute path starts with the table location followed by the URI separator
/.- Absolute paths must be used for all other files.
This way the writer rule is exactly the Path Relativization rule applied — no chance of producing a relative form that doesn't round-trip safely under table relocation.
Adds text to the spec for relative paths.
See full proposal at #13141