Add fct_tides_vehicle_locations dbt model (closes #4837) #5216
chrisyamas wants to merge 9 commits into main
Conversation
Warehouse report: Failed to add ci-report to a comment. Review the ci-report in the Summary.

Impacted Exposures: No exposures are impacted by the changes in this PR.
Terraform plan in iac/cal-itp-data-infra-staging/airflow/us (Plan: 3 to add, 4 to change, 0 to destroy)

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
!~ update in-place
Terraform will perform the following actions:
# google_storage_bucket_object.calitp-staging-composer-catalog will be updated in-place
!~ resource "google_storage_bucket_object" "calitp-staging-composer-catalog" {
!~ content = (sensitive value)
!~ crc32c = "7vbSEg==" -> (known after apply)
!~ detect_md5hash = "gzQlzyAjYlTGiWPOSPmt/Q==" -> "different hash"
!~ generation = 1777921775322636 -> (known after apply)
id = "calitp-staging-composer-data/warehouse/target/catalog.json"
!~ md5hash = "gzQlzyAjYlTGiWPOSPmt/Q==" -> (known after apply)
name = "data/warehouse/target/catalog.json"
# (16 unchanged attributes hidden)
}
# google_storage_bucket_object.calitp-staging-composer-dags["dbt_project.yml"] will be updated in-place
!~ resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~ crc32c = "cIuoNQ==" -> (known after apply)
!~ detect_md5hash = "bsZgcfmK985tISFYJCt+qg==" -> "different hash"
!~ generation = 1777669801966208 -> (known after apply)
id = "calitp-staging-composer-data/warehouse/dbt_project.yml"
!~ md5hash = "bsZgcfmK985tISFYJCt+qg==" -> (known after apply)
name = "data/warehouse/dbt_project.yml"
# (17 unchanged attributes hidden)
}
# google_storage_bucket_object.calitp-staging-composer-dags["models/mart/tides/_mart_tides.yml"] will be created
+ resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+ bucket = "calitp-staging-composer"
+ content = (sensitive value)
+ content_type = (known after apply)
+ crc32c = (known after apply)
+ detect_md5hash = "different hash"
+ generation = (known after apply)
+ id = (known after apply)
+ kms_key_name = (known after apply)
+ md5hash = (known after apply)
+ md5hexhash = (known after apply)
+ media_link = (known after apply)
+ name = "data/warehouse/models/mart/tides/_mart_tides.yml"
+ output_name = (known after apply)
+ self_link = (known after apply)
+ source = "../../../../warehouse/models/mart/tides/_mart_tides.yml"
+ storage_class = (known after apply)
}
# google_storage_bucket_object.calitp-staging-composer-dags["models/mart/tides/fct_tides_vehicle_locations.sql"] will be created
+ resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+ bucket = "calitp-staging-composer"
+ content = (sensitive value)
+ content_type = (known after apply)
+ crc32c = (known after apply)
+ detect_md5hash = "different hash"
+ generation = (known after apply)
+ id = (known after apply)
+ kms_key_name = (known after apply)
+ md5hash = (known after apply)
+ md5hexhash = (known after apply)
+ media_link = (known after apply)
+ name = "data/warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+ output_name = (known after apply)
+ self_link = (known after apply)
+ source = "../../../../warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+ storage_class = (known after apply)
}
# google_storage_bucket_object.calitp-staging-composer-dags["seeds/_seeds.yml"] will be updated in-place
!~ resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~ crc32c = "7/62ZA==" -> (known after apply)
!~ detect_md5hash = "auu3vnNdExPQiA88ThI9DA==" -> "different hash"
!~ generation = 1776453636837026 -> (known after apply)
id = "calitp-staging-composer-data/warehouse/seeds/_seeds.yml"
!~ md5hash = "auu3vnNdExPQiA88ThI9DA==" -> (known after apply)
name = "data/warehouse/seeds/_seeds.yml"
# (17 unchanged attributes hidden)
}
# google_storage_bucket_object.calitp-staging-composer-dags["seeds/tides_publication_keys.csv"] will be created
+ resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+ bucket = "calitp-staging-composer"
+ content = (sensitive value)
+ content_type = (known after apply)
+ crc32c = (known after apply)
+ detect_md5hash = "different hash"
+ generation = (known after apply)
+ id = (known after apply)
+ kms_key_name = (known after apply)
+ md5hash = (known after apply)
+ md5hexhash = (known after apply)
+ media_link = (known after apply)
+ name = "data/warehouse/seeds/tides_publication_keys.csv"
+ output_name = (known after apply)
+ self_link = (known after apply)
+ source = "../../../../warehouse/seeds/tides_publication_keys.csv"
+ storage_class = (known after apply)
}
# google_storage_bucket_object.calitp-staging-composer-manifest will be updated in-place
!~ resource "google_storage_bucket_object" "calitp-staging-composer-manifest" {
!~ content = (sensitive value)
!~ crc32c = "ruSOBg==" -> (known after apply)
!~ detect_md5hash = "Mw4Cul2QM1zWeUWwGhMlmw==" -> "different hash"
!~ generation = 1777921776550660 -> (known after apply)
id = "calitp-staging-composer-data/warehouse/target/manifest.json"
!~ md5hash = "Mw4Cul2QM1zWeUWwGhMlmw==" -> (known after apply)
name = "data/warehouse/target/manifest.json"
# (16 unchanged attributes hidden)
}
Plan: 3 to add, 4 to change, 0 to destroy.

📝 Plan generated in Deploy dbt #1822
Terraform plan in iac/cal-itp-data-infra/airflow/us (Plan: 3 to add, 2 to change, 0 to destroy)

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
!~ update in-place
Terraform will perform the following actions:
# google_storage_bucket_object.calitp-composer-dags["dbt_project.yml"] will be updated in-place
!~ resource "google_storage_bucket_object" "calitp-composer-dags" {
!~ crc32c = "cIuoNQ==" -> (known after apply)
!~ detect_md5hash = "bsZgcfmK985tISFYJCt+qg==" -> "different hash"
!~ generation = 1777669782489514 -> (known after apply)
id = "calitp-composer-data/warehouse/dbt_project.yml"
!~ md5hash = "bsZgcfmK985tISFYJCt+qg==" -> (known after apply)
name = "data/warehouse/dbt_project.yml"
# (17 unchanged attributes hidden)
}
# google_storage_bucket_object.calitp-composer-dags["models/mart/tides/_mart_tides.yml"] will be created
+ resource "google_storage_bucket_object" "calitp-composer-dags" {
+ bucket = "calitp-composer"
+ content = (sensitive value)
+ content_type = (known after apply)
+ crc32c = (known after apply)
+ detect_md5hash = "different hash"
+ generation = (known after apply)
+ id = (known after apply)
+ kms_key_name = (known after apply)
+ md5hash = (known after apply)
+ md5hexhash = (known after apply)
+ media_link = (known after apply)
+ name = "data/warehouse/models/mart/tides/_mart_tides.yml"
+ output_name = (known after apply)
+ self_link = (known after apply)
+ source = "../../../../warehouse/models/mart/tides/_mart_tides.yml"
+ storage_class = (known after apply)
}
# google_storage_bucket_object.calitp-composer-dags["models/mart/tides/fct_tides_vehicle_locations.sql"] will be created
+ resource "google_storage_bucket_object" "calitp-composer-dags" {
+ bucket = "calitp-composer"
+ content = (sensitive value)
+ content_type = (known after apply)
+ crc32c = (known after apply)
+ detect_md5hash = "different hash"
+ generation = (known after apply)
+ id = (known after apply)
+ kms_key_name = (known after apply)
+ md5hash = (known after apply)
+ md5hexhash = (known after apply)
+ media_link = (known after apply)
+ name = "data/warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+ output_name = (known after apply)
+ self_link = (known after apply)
+ source = "../../../../warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+ storage_class = (known after apply)
}
# google_storage_bucket_object.calitp-composer-dags["seeds/_seeds.yml"] will be updated in-place
!~ resource "google_storage_bucket_object" "calitp-composer-dags" {
!~ crc32c = "7/62ZA==" -> (known after apply)
!~ detect_md5hash = "auu3vnNdExPQiA88ThI9DA==" -> "different hash"
!~ generation = 1776457910260376 -> (known after apply)
id = "calitp-composer-data/warehouse/seeds/_seeds.yml"
!~ md5hash = "auu3vnNdExPQiA88ThI9DA==" -> (known after apply)
name = "data/warehouse/seeds/_seeds.yml"
# (17 unchanged attributes hidden)
}
# google_storage_bucket_object.calitp-composer-dags["seeds/tides_publication_keys.csv"] will be created
+ resource "google_storage_bucket_object" "calitp-composer-dags" {
+ bucket = "calitp-composer"
+ content = (sensitive value)
+ content_type = (known after apply)
+ crc32c = (known after apply)
+ detect_md5hash = "different hash"
+ generation = (known after apply)
+ id = (known after apply)
+ kms_key_name = (known after apply)
+ md5hash = (known after apply)
+ md5hexhash = (known after apply)
+ media_link = (known after apply)
+ name = "data/warehouse/seeds/tides_publication_keys.csv"
+ output_name = (known after apply)
+ self_link = (known after apply)
+ source = "../../../../warehouse/seeds/tides_publication_keys.csv"
+ storage_class = (known after apply)
}
Plan: 3 to add, 2 to change, 0 to destroy.

📝 Plan generated in Deploy dbt #1822
> regional-subfeed fixed-route feeds. Closes #4837.

Remove "Closes #4837" from the model text.
@vevetron thanks, addressing your three comments above and the two follow-ups from the call (HAVING clauses on Slack, the in-code TIDES issue reference).
lauriemerrell left a comment:

A couple comments, think broad strokes look ok.
> DATETIME(vp.location_timestamp, vp.schedule_feed_timezone) AS event_timestamp,
> vp.trip_id AS trip_id_performed,
> -- trip_id_scheduled left NULL for MVP; deriving requires a reliable
not sure if this is true -- per GTFS spec, trip_id in VP should reference schedule unless the schedule_relationship is one of a few specific values.... so maybe this should be based on that?
also, this is in the intent of the trip_instance_key identifier (to allow joins of a specific trip across feed types), so can use that for lookup if desired
good catch, agreed. per the GTFS-RT spec the VP trip_id does reference the schedule whenever trip.schedule_relationship is one of the in-schedule values, so deriving trip_id_scheduled from that conditional is the right shape. trip_instance_key is also viable as the join key.
leaving this PR's trip_id_scheduled as NULL for MVP and filing a follow-up to wire up the conditional + the join. happy to bump that into scope here if you'd prefer it land together.
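For concreteness, one possible shape for that follow-up conditional. This is not code in this PR; the `schedule_relationship` column name and the exact set of in-schedule enum values are assumptions used only to illustrate the idea:

```sql
-- Hypothetical follow-up sketch: the VP trip_id references the static
-- schedule only for in-schedule relationship values, so gate on that.
SELECT
    vp.trip_id AS trip_id_performed,
    CASE
        WHEN vp.schedule_relationship IN ('SCHEDULED', 'CANCELED')
            THEN vp.trip_id  -- references the schedule per GTFS-RT spec
        ELSE NULL            -- ADDED / UNSCHEDULED / etc. do not
    END AS trip_id_scheduled
FROM {{ ref('fct_vehicle_locations') }} AS vp
```

Joining via `trip_instance_key` instead would achieve the same lookup through the cross-feed identifier mentioned above.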
> USING (gtfs_dataset_key)
> ),
> -- TIDES requires location_ping_id strictly unique; the upstream key is
noting that base64_url is part of key so if two results have same value they need to have same URL... so that won't actually be a substantive tie break
right, both ordering columns were degenerate at this grain since location_timestamp and base64_url are components of the upstream key. fixed in a fixup: pulled the dedup up to the source CTE and ordered by _extract_ts DESC (most-recently-extracted wins), which differs across the duplicates. trailing deduped CTE collapsed into the source. verified 0 dups on a sampled service_date in staging.
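A minimal sketch of the revised dedup described here, with the column names (`key`, `_extract_ts`) taken from the surrounding discussion; treat it as illustrative rather than the exact diff:

```sql
-- Dedup moved into the source CTE: keep the most-recently-extracted row
-- per upstream key, since components of the key can't break their own ties.
WITH source AS (
    SELECT *
    FROM {{ ref('fct_vehicle_locations') }}
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY key           -- the "almost unique" upstream key
        ORDER BY _extract_ts DESC  -- most-recently-extracted wins
    ) = 1
)
SELECT * FROM source
```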
> (https://tides-transit.org/main/). Sourced from `fct_vehicle_locations`
> and filtered via `dim_provider_gtfs_data` to public, customer-facing or
> regional-subfeed fixed-route feeds.
> meta:
don't think we use publish.product anywhere else, what is intent for that as meta? can/should we be defining a dbt exposure? that is our standard for published items. see for example https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_schedule_latest/_gtfs_schedule_latest.yml#L1270-L1314 and https://github.com/cal-itp/data-infra/blob/main/airflow/dags/publish_gtfs.py and https://github.com/cal-itp/data-infra/blob/main/airflow/plugins/operators/dbt_manifest_to_metadata_operator.py#L88-L90 for the GTFS --> CKAN publish flow
good call, exposure is the right shape. dropping the publish.product meta in this PR. adding a single california_tides exposure as part of PR 5220 (so both fct_tides_vehicle_locations and fct_tides_trips_performed ref()s resolve in the same checkout), modeled on the GTFS california_open_data block. owner / methodology fields filled in; meta.destinations now filled in via PR 5229 (publishing pipeline).
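For reference, a hedged sketch of what a `california_tides` exposure could look like, modeled loosely on the GTFS `california_open_data` block linked above; the maturity, description, and owner fields are placeholders, not the actual PR 5220 content:

```yaml
exposures:
  - name: california_tides
    type: application
    maturity: medium  # placeholder
    description: TIDES-conformant tables published from the Cal-ITP warehouse.
    depends_on:
      - ref('fct_tides_vehicle_locations')
      - ref('fct_tides_trips_performed')
    owner:
      name: Cal-ITP  # placeholder
```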
> FROM {{ ref('fct_vehicle_locations') }}
> ),
> -- dim_provider_gtfs_data fans out: a single vehicle_positions feed can be
I am a little confused by the logic here -- my instinct would be to either group by VP URL where it meets these criteria (public facing etc.) and array_agg the organization info so that all orgs can be used later in the publish process OR just select distinct on the non-org columns and ignore the organization parts.
Basically, not sure how ending up with a VP feed tagged with one of its organizations meets future needs -- if we need orgs, we should keep all of them and handle unnesting or whatever in the publish process. If we don't need all the orgs then let's just drop and publish under the VP URL and publish org related metadata separately.
yeah agreed the fan-out collapse is doing more than it should. going with your second framing: dropping organization_name / organization_ntd_id from both fact tables since orgs aren't part of the TIDES spec anyway.
the public_subfeed_agencies CTE shrinks to a public_subfeed_keys CTE that's just SELECT DISTINCT on dataset keys matching the public-customer-facing-or-regional-subfeed criterion, no QUALIFY needed. govcbus / SD MTS multi-org reality moves to publish-side metadata in PR 5229, separate from the TIDES tables themselves.
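A sketch of the slimmed-down CTE as described; the boolean flag name is an assumption standing in for the public/customer-facing/regional-subfeed criterion:

```sql
-- Replaces the org-collapsing CTE: distinct dataset keys only, no QUALIFY.
public_subfeed_keys AS (
    SELECT DISTINCT vehicle_positions_gtfs_dataset_key
    FROM {{ ref('dim_provider_gtfs_data') }}
    WHERE is_public_customer_facing_or_regional_subfeed  -- assumed flag name
)
```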
We were working with the City of Hermosa Beach, which is interested in studying on-time performance. I believe the thought was to narrow the initial output to just the agencies that traverse Hermosa Beach for development purposes only, to check on costs and implementation before expanding statewide.

Yeah, it is what I remember too. :)
thanks both, the scope has been narrowed! went ahead and implemented it as MVP this morning rather than wait for our sync, since the seed-based mechanism is small enough that landing it gives us something concrete to react to. scope is now three feeds via a new `tides_publication_keys` seed; the filter is an INNER JOIN on the seed inside both models. the PR description on this one is updated to match. PR 5220 (trips_performed) inherits the same seed; PRs 5229 and 5230 (the publishing pipeline + staging bucket, drafts up now) inherit the narrowing automatically since they consume the views. one open question worth flagging for our sync today is whether it would be good to formalize a
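The seed-based filter described here might look roughly like this; the seed name comes from the Terraform plan above and the join column from the quoted diff, but the surrounding CTE name is illustrative:

```sql
-- Narrow output to the MVP feeds listed in the seed.
SELECT vp.*
FROM filtered_locations AS vp
INNER JOIN {{ ref('tides_publication_keys') }} AS pub
    USING (gtfs_dataset_key)
```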
Warehouse report 📦

Checks/potential follow-ups: Checks indicate the following action items may be necessary.

New models 🌱: calitp_warehouse.mart.tides.fct_tides_vehicle_locations
Adds mart_gtfs.fct_tides_vehicle_locations, the first TIDES-conformant model in the Cal-ITP warehouse. Reshapes fct_vehicle_locations into the TIDES vehicle_locations schema and filters to public, customer-facing or regional-subfeed fixed-route GTFS-RT feeds via dim_provider_gtfs_data. The model produces the BigQuery table only. Per-agency parquet export (#4693), the CDN-fronted public bucket (#4700), and the file validator (#4839) are tracked separately. Validated in christopher_mart_gtfs sandbox over a 24-day window (2026-03-20 to 2026-04-30): 96M rows across 100 agencies, 99.991% unique location_ping_id, zero NULL on TIDES required-not-null fields, zero violations on TIDES bounds and enum constraints.
- Drop the file-level comment header on fct_tides_vehicle_locations.sql (matches the existing fct_vehicle_locations style).
- Trim CTE-level comments to keep WHY (agency fan-out, NULL-trip caveat, defensive dedup rationale) and drop WHAT comments that just restate the SQL below.
- Specify ASC on the second ORDER BY column to satisfy sqlfluff AM03.
- Use GROUP BY 1 instead of repeating the column name (matches existing Cal-ITP usage and avoids the AM06 alias-mismatch risk).
- Yml: tighten column descriptions, reuse anchor refs (*rt_service_date, *rt_vehicle_id, *rt_vp_stop_id, *gtfs_rt_dt, *base64_url, *gtfs_dataset_key_desc) where they apply, and wrap test where clauses in config: blocks to match the existing pattern on fct_vehicle_locations.
…_LOOKBACK_DAYS

The var was renamed in main by #5178 (Laurie's incremental-vs-microbatch docs cleanup) after this branch was started. Match the new name so dbt compile passes.
…ssue refs
Per Vivek's call feedback (and Slack follow-up about HAVING clauses):
- Move the model out of mart_gtfs into its own mart/tides folder. Add a
mart_tides schema in dbt_project.yml so it materializes in
<user>_mart_tides rather than <user>_mart_gtfs. The model is a TIDES
product, not a GTFS mart, and grouping the eventual peers
(trips_performed, stop_visits) under one folder keeps the boundary clear.
- Convert from incremental microbatch to view. The downstream consumer is
the per-agency export Airflow job querying once per cycle, not
interactive analytics, so paying compute on every read is fine and we
avoid carrying a materialized 96M-row copy.
- Drop partition_by, cluster_by, full_refresh, on_schema_change, event_time,
batch_size, begin, lookback config keys that don't apply to a view.
- Refactor public_subfeed_agencies CTE from
ANY_VALUE(... HAVING MIN organization_name) GROUP BY 1 to
QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY organization_name) = 1.
Matches the team pattern used elsewhere in the warehouse (every other
"pick one canonical row per group" site uses QUALIFY ROW_NUMBER).
- Add model-level meta: { publish.product: tides } in the new yml,
parallel to the existing publish.* / ckan.* dotted-namespace meta keys
used on CKAN-published models.
- Drop "Closes #4837" from the model description (belongs in the PR body,
not the warehouse).
- Replace the inline TIDES issue #252 reference with the full GitHub URL.
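As a before/after illustration of the QUALIFY refactor in this commit (column names assumed from the surrounding discussion, not the exact diff):

```sql
-- Before: ANY_VALUE ... HAVING MIN to pick the lex-smallest org per feed
SELECT
    vehicle_positions_gtfs_dataset_key,
    ANY_VALUE(organization_name HAVING MIN organization_name) AS organization_name
FROM {{ ref('dim_provider_gtfs_data') }}
GROUP BY 1

-- After: the warehouse-standard "one canonical row per group" pattern
SELECT
    vehicle_positions_gtfs_dataset_key,
    organization_name
FROM {{ ref('dim_provider_gtfs_data') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY vehicle_positions_gtfs_dataset_key
    ORDER BY organization_name
) = 1
```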
…tive

The previous QUALIFY ordered by event_timestamp DESC, base64_url ASC. Both columns are degenerate at the location_ping_id grain: location_timestamp and base64_url are components of the upstream `key`, so they're constant across rows that share a key. Move the dedup up to the source CTE and order by `_extract_ts DESC` so the most-recently-extracted row wins. The trailing `deduped` CTE collapses into the source CTE.
…lause

Adds a `mart.tides: +enabled: true` line under data_tests in dbt_project.yml, matching the existing `mart.payments` re-enable pattern. The model has six column-level tests (not_null on location_ping_id / event_timestamp / vehicle_id, accepted_values on current_status / trip_type, and unique_proportion on location_ping_id); all six pass against staging.

The two accepted_values where clauses were `__rt_sampled__ AND <col> IS NOT NULL`. The rt_sampled_where_clause macro only substitutes on an exact `__rt_sampled__` match, so the compound form was emitting the literal token to BigQuery and failing. Trimmed to bare `__rt_sampled__`, matching the convention used everywhere else in the warehouse. accepted_values silently ignores NULLs already, so the IS NOT NULL filter was redundant.
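The where-clause trim in this commit could be sketched as the following yml fragment; the enum values themselves are elided placeholders, not the model's real list:

```yaml
columns:
  - name: current_status
    data_tests:
      - accepted_values:
          values: ["<TIDES enum values elided>"]  # placeholder
          config:
            # bare token so rt_sampled_where_clause substitutes it;
            # accepted_values already ignores NULLs
            where: "__rt_sampled__"
```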
…roduct

Orgs aren't part of the TIDES spec (vehicle_locations.schema.json defines no organization fields). The agency-collapse CTE shrinks to a SELECT DISTINCT on `vehicle_positions_gtfs_dataset_key`, no QUALIFY needed. Both `organization_name` and `organization_ntd_id` come out of the model and out of _mart_tides.yml.

The publish.product meta block is gone in favor of a real dbt exposure. The exposure itself lands on PR 5220 alongside fct_tides_trips_performed so both ref()s resolve in the same checkout. Per-agency / per-org metadata for the publish flow lives separately in PR 5229 (next sprint).
Force-pushed from d90e087 to 529d4e2.


Description
Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.
Resolves #4837
Adds `mart_gtfs.fct_tides_vehicle_locations`, the first TIDES-conformant model in the warehouse. Reshapes `fct_vehicle_locations` into the TIDES `vehicle_locations` schema and filters to public, customer-facing or regional-subfeed fixed-route GTFS-RT feeds via `dim_provider_gtfs_data`. The model produces the BigQuery table only; per-agency parquet export (#4693), CDN-fronted public bucket (#4700), and file validator (#4839) are tracked separately.

A few design decisions worth flagging:
- `fct_vehicle_locations` drops NULL `trip_id` rows upstream, so deadhead and layover pings are not in the export. TIDES doesn't require `trip_id_performed`, so a future change could source from `fct_vehicle_positions_messages` to keep them.
- `dim_provider_gtfs_data` records multiple organization rows per VP feed when a feed is shared across agencies (govcbus.com is shared by 7 cities; the SD MTS feed is shared with the airport). The model collapses to one canonical org per feed (lex-smallest org name) to prevent fan-out duplication. Per-agency demuxing can happen at the export step.
- `fct_vehicle_locations.key` is documented as "almost unique" upstream. The model adds a defensive `QUALIFY ROW_NUMBER` per microbatch; residual cross-batch dups are 0.0089% (8,538 of 96M), which fits the upstream `unique_proportion at_least 0.999` threshold but isn't strict TIDES `unique: true`. Open to tightening upstream if you'd rather.
- Hermosa Beach appears in `dim_provider_gtfs_data`, but its `vehicle_positions_gtfs_dataset_key` is NULL and `customer_facing` is FALSE, so Hermosa is not in this export. Worth confirming whether Hermosa is being onboarded or whether the seed-agency framing in "Build a process to convert GTFS-RT Vehicle Positions data > TIDES Vehicle Locations using Hermosa Beach data" #4837 was meant generically.

TIDES = Transit Integrated Data Exchange Specification, https://tides-transit.org/main/.
Type of change
How has this been tested?
Include commands/logs/screenshots as relevant.
If making changes to dbt models, make sure they were created or updated on Staging. Please run the commands `uv run dbt run -s CHANGED_MODEL --target staging` and `uv run dbt test -s CHANGED_MODEL --target staging`, then include the output in this section of the PR.

Ran `uv run dbt run -s +fct_tides_vehicle_locations --target staging` and `uv run dbt test -s fct_tides_vehicle_locations --target staging`.

Materialized in `cal-itp-data-infra-staging.christopher_mart_gtfs.fct_tides_vehicle_locations`. 24-day window from 2026-03-20 to 2026-04-30:

- `location_ping_id` 99.991% unique (8,538 dups; under team `unique_proportion at_least 0.999` threshold)
- zero NULLs on `location_ping_id`, `event_timestamp`, `vehicle_id`, `trip_id_performed`
- `current_status` enum mapping correct (no raw GTFS-RT values leak through)

Top agencies by ping count:
Post-merge follow-ups
Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.
Two follow-up PRs ready locally and waiting on this one:
- `chore/tides-validation-harness`: Frictionless validator under `validation/tides/` (closes part of "Build a validator for those TIDES files" #4839)
- `feat/tides-trips-performed`: second TIDES table sourced from `fct_observed_trips`

Will open follow-up issues for the upstream `fct_vehicle_locations.key` strict-uniqueness option and for the column-level RT test binding behavior (also affects the existing `fct_vehicle_locations`).