fix: tables schema command replace manual _delta_log JSON parsing with DeltaTable.schema() from the deltalake library#229
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes fab tables schema failing on Delta tables whose pre-checkpoint JSON logs were vacuumed, by switching schema extraction to the Delta protocol via the deltalake library.
Changes:
- Replaced manual
_delta_log/*.jsonscanning withdeltalake.DeltaTable(...).schema()for checkpoint-aware schema extraction. - Added
deltalake>=0.18.0to project dependencies. - Renamed typo’d error constant
ERROR_INVALID_DETLA_TABLEtoERROR_INVALID_DELTA_TABLEand wired it into the tables schema command.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/fabric_cli/core/fab_constant.py | Fixes typo in error constant name; minor formatting cleanup. |
| src/fabric_cli/commands/tables/fab_tables_schema.py | Refactors schema retrieval to use deltalake over ABFSS + token, removing fragile log parsing. |
| pyproject.toml | Adds deltalake dependency required for robust schema extraction. |
| .changes/unreleased/fixed-20260430-130558.yaml | Adds release note entry for the schema extraction fix. |
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
shirasassoon
left a comment
There was a problem hiding this comment.
@pkontek thanks for submitting this PR to address a legitimate bug in Fabric CLI! I have started reviewing the PR and added some comments. Please take a look at your earliest convenience.
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
|
@shirasassoon is there anything else I can do to close this pull request? |
ayeshurun
left a comment
There was a problem hiding this comment.
Thanks for tackling this, @pkontek 🙏 Appreciate the thorough test scaffolding and the changie entry.
Before we can land it, could you take a look at CI? The "Test on Python 3.12" job is red and the other Python versions got cancelled by fail-fast.
| return schema_fields | ||
| except (DeltaError, json.JSONDecodeError, ValueError) as exc: | ||
| raise FabricCLIError( | ||
| f"Failed to extract the table schema. Please ensure the path points to a valid Delta table: {exc}", |
There was a problem hiding this comment.
I believe exception strings can include the full abfss://… URI (containing workspace & item GUIDs), internal Rust diagnostics, and storage-backend details. That violates the supportability/security guidance in commands.instructions.md ("Never log tokens, cookies, correlation IDs, or PII").
There was a problem hiding this comment.
The previous implementation went through fab_api_onelake (centralized retries, timeouts, telemetry, request-id surfacing, and respect for FAB_API_ENDPOINT_ONELAKE testing hooks via fab_api_client.py). The new code constructs an abfss:// URI and hands the bearer token directly to delta-rs's Rust object_store.
That conflicts with commands.instructions.md - "All API calls must be invoked via centralized request utilities", "Use the common API client and common retry/backoff policy - don't hand-roll session logic".
Concretely we lose:
- Bounded
--timeoutsemantics - Exponential-backoff retries on transient 5xx/429
- Request-ID propagation surfaced via OnelakeAPIError
- Correlated debug logging
Please wrap the DeltaTable(...) construction in a thin helper under fabric_cli/client/ (e.g., fab_delta_client.py) that owns storage_options assembly, and error handling for exceptions → FabricCLIError mapping, and provides a single seam for future retry/timeout/telemetry.
| if args.schema: | ||
| local_path = f"Tables/{args.schema}/{args.table_name}" | ||
| else: | ||
| local_path = f"Tables/{args.table_name}" |
There was a problem hiding this comment.
Please pass the resolved context.local_path from fab_tables.py::schema_command (either directly or via an args.table_local_path populated in add_table_props_to_args). Drop the manual f"Tables/..." reconstruction.
| if args.schema: | ||
| local_path = f"Tables/{args.schema}/{args.table_name}" | ||
| else: | ||
| local_path = f"Tables/{args.table_name}" |
There was a problem hiding this comment.
The old code defensively called utils.remove_dot_suffix(local_path) (default suffix .Shortcut). add_table_props_to_args sets args.table_name = table_path[-1] without stripping .Shortcut. If a user runs fab table schema /ws.Workspace/lh.Lakehouse/Tables/my_table.Shortcut, the new URI becomes …/Tables/my_table.Shortcut, which won't exist in storage.
PLease add a regression test for the .Shortcut path and either (a) call remove_dot_suffix(args.table_name) defensively here, or (b) confirm in add_table_props_to_args that the suffix is normalized upstream.
There was a problem hiding this comment.
command_support.yaml lists table schema as supported for lakehouse, warehouse, kql_database, mirrored_database, semantic_model, sql_database. The new hardcoded Tables/… URI assumes the lakehouse/warehouse OneLake layout. Several of those item types do not expose Delta tables under Tables/ the same way (notably semantic_model).
Please prove (via test) that the new URI works for all listed item types, or narrow add a validation and fail in code for the non supported item in deltatable flow.
| commit_data = json.loads(obj) | ||
| if "metaData" in commit_data: | ||
| metadata = commit_data["metaData"] | ||
| schema = metadata["schemaString"] |
There was a problem hiding this comment.
The old code returned json.loads(metaData.schemaString)["fields"], the Delta protocol's field schema. The new code returns json.loads(DeltaTable.schema().to_json())["fields"], delta-rs's serialization of the Arrow schema. For complex types (nested struct, decimal, timestamp, map), the JSON shapes can differ ("type": "long" vs "type": "integer", nested vs string representation, etc.).
Could you please add a unit test with at least one nested/decimal/timestamp field and assert the exact output structure to lock the contract? Otherwise this is a silent breaking change for users who pipe --output_format json into scripts.
| raise FabricCLIError( | ||
| "Failed to extract the table schema. Please ensure the path points to a valid Delta table", | ||
| fab_constant.ERROR_INVALID_DETLA_TABLE, | ||
| "Failed to obtain access token.", |
There was a problem hiding this comment.
"Failed to obtain access token." already exists as AuthErrors.access_token_failed(). Per commands.instructions.md, reuse messages from src/fabric_cli/errors/.py:
from fabric_cli.errors import ErrorMessages
raise FabricCLIError(
ErrorMessages.Auth.access_token_failed(),
fab_constant.ERROR_AUTHENTICATION_FAILED,
)
There was a problem hiding this comment.
The tests don't actually prove the fix.
Every test mocks DeltaTable, so none of them exercises the compacted-log scenario the PR is fixing. The tests prove the dispatch/mapping logic, not the bug fix.
Please add at least one test that uses a real local Delta table fixture (delta-rs can write a tiny one in a tmp dir) with a checkpoint and no pre-checkpoint JSON logs, that's the actual regression test for #228.
| from tests.test_commands.commands_parser import CLIExecutor | ||
|
|
||
|
|
||
| class TestTablesSchemaUnit: |
There was a problem hiding this comment.
The test.instructions.md classifies pure unit tests (no network, no VCR) under tests/test_core/** or tests/test_utils/**. THis class TestTablesSchemaUnit is a pure unit test class - please move it to tests/test_core/ (or splitting the file) and leaving only TestTablesSchemaIntegration under tests/test_commands/.
There was a problem hiding this comment.
Because DeltaTable and FabAuth are both mocked in TestTablesSchemaIntegration, the 258-line test_table_schema_success.yaml cassette only exercises workspace+lakehouse CRUD that already has coverage elsewhere. Consider dropping the cassette and converting the integration test into a pure dispatch test (parser → command function call) without VCR — much less maintenance burden.
📥 Pull Request
✨ Description of new changes
Summary: The
fab tables schemacommand failed with[InvalidDeltaTable] Failed to extract the table schemafor Delta tables where pre-checkpoint JSON commit log files had been cleaned up. The previous implementation manually scanned_delta_log/*.jsonfiles via the OneLake API looking for ametaDataentry — an approach that breaks once Delta Lake compacts logs into.checkpoint.parquetfiles and removes the preceding JSON entries.Context: Delta Lake creates a checkpoint every 10 transactions by default and retains log files according to the configured retention policy. Once older JSON commit files are removed, the
metaDataentry (which carries the schema) is no longer available in the remaining JSON logs, causing the command to fail even for healthy, accessible tables.Dependencies: Adds
deltalake>=0.18.0as a new dependency. Thedeltalakelibrary correctly resolves the table schema from both JSON logs and Parquet checkpoints using the standard Delta protocol, making the implementation robust and protocol-compliant.Changes:
_delta_logJSON parsing withDeltaTable.schema()from thedeltalakelibrarydeltalake>=0.18.0to project dependencies inpyproject.tomlfab_tables_schema.pyby removing ~50 lines of fragile log-parsing logic