fix(rest): skip Hadoop-only vended storage credentials during resolution#3241
fix(rest): skip Hadoop-only vended storage credentials during resolution#3241plusplusjiajia wants to merge 1 commit into
Conversation
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
d68ade8 to
a1315d1
Compare
a1315d1 to
4a4f18c
Compare
|
Hi @kevinjqliu — would you have a chance to take a look at this when you get cycles? Small fix to _resolve_storage_credentials (the function from #3042). |
| _PLANNING_RESPONSE_ADAPTER = TypeAdapter(PlanningResponse) | ||
|
|
||
|
|
||
| def _is_hadoop_only_config(config: Properties) -> bool: |
There was a problem hiding this comment.
Config is a dict[str, str] according to the OpenAPI doc
| # Java S3FileIO falls back to the "s3" ROOT_PREFIX credential; scope it to | ||
| # schemes pyarrow's S3FileSystem handles so non-S3 schemes (gs://, abfs://, | ||
| # etc.) don't get handed s3.* keys. | ||
| if best_match is None and location.startswith(("s3://", "s3a://", "s3n://", "oss://")): |
There was a problem hiding this comment.
Wouldn't s3:// get caught by line 477 already?
| # Java S3FileIO falls back to the "s3" ROOT_PREFIX credential; scope it to | ||
| # schemes pyarrow's S3FileSystem handles so non-S3 schemes (gs://, abfs://, | ||
| # etc.) don't get handed s3.* keys. | ||
| if best_match is None and location.startswith(("s3://", "s3a://", "s3n://", "oss://")): |
There was a problem hiding this comment.
I understand that we want s3 prefixed credentials to get mapped to s3a + s3n. What's oss here?
Rationale for this change
REST catalogs can return multiple StorageCredential entries per table to serve different client runtimes. A common pattern is one entry with Hadoop-style fs.* keys alongside a second entry with canonical s3.* / gs.* keys consumed by the cloud-native SDKs).
Java's FileIO implementations each filter vended credentials down to their own key namespace. S3FileIO.clientForStoragePath() only consumes entries with an s3-prefixed label (S3FileIO.java:413-414) and, when no URI prefix matches the storage path, falls back to the client keyed at the root "s3" prefix. pyiceberg has no HadoopFileIO, so Hadoop-style credential bundles have no consumer on the Python side; but _resolve_storage_credentials did a blind longest-prefix URI match across the full credential list, so when a Hadoop-style entry happened to be the longest URI-prefix match for a given location the Python FileIO ended up with fs.* keys it cannot use, and silently fell through to unauthenticated access.