Add fct_tides_vehicle_locations dbt model (closes #4837) #5216

Draft
chrisyamas wants to merge 9 commits into main from feat/tides-vehicle-locations

@chrisyamas
Contributor

Description

Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.

Resolves #4837

Adds mart_gtfs.fct_tides_vehicle_locations, the first TIDES-conformant model in the warehouse. Reshapes fct_vehicle_locations into the TIDES vehicle_locations schema and filters to public, customer-facing or regional-subfeed fixed-route GTFS-RT feeds via dim_provider_gtfs_data. The model produces the BigQuery table only; per-agency parquet export (#4693), CDN-fronted public bucket (#4700), and file validator (#4839) are tracked separately.

A few design decisions worth flagging:

  • fct_vehicle_locations drops NULL trip_id rows upstream, so deadhead and layover pings are not in the export. TIDES doesn't require trip_id_performed, so a future change could source from fct_vehicle_positions_messages to keep them.
  • dim_provider_gtfs_data records multiple organization rows per VP feed when a feed is shared across agencies (govcbus.com is shared by 7 cities; the SD MTS feed is shared with the airport). The model collapses to one canonical org per feed (lex-smallest org name) to prevent fan-out duplication. Per-agency demuxing can happen at the export step.
  • fct_vehicle_locations.key is documented as "almost unique" upstream. The model adds a defensive QUALIFY ROW_NUMBER per microbatch; residual cross-batch dups are 0.0089% (8,538 of 96M), which fits the upstream unique_proportion at_least 0.999 threshold but isn't strict TIDES unique: true. Open to tightening upstream if you'd rather.
  • "City of Hermosa Beach" exists in dim_provider_gtfs_data but vehicle_positions_gtfs_dataset_key is NULL and customer_facing is FALSE, so Hermosa is not in this export. Worth confirming whether Hermosa is being onboarded or whether the seed-agency framing in #4837 ("Build a process to convert GTFS-RT Vehicle Positions data > TIDES Vehicle Locations using Hermosa Beach data") was meant generically.
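
For readers less familiar with the fan-out issue in the second bullet, here is a minimal sketch of it. The feed keys, org names, and ping IDs are made up for illustration (they are not values from dim_provider_gtfs_data), and the Python only stands in for the model's SQL join. It shows why joining pings to a feed with several organization rows duplicates pings, and how keeping the lexicographically smallest org name per feed avoids that.

```python
from collections import defaultdict

# Hypothetical provider rows: (vehicle_positions_gtfs_dataset_key, organization_name).
# feed_a is shared by two organizations, like the shared feeds described above.
provider_rows = [
    ("feed_a", "City of Beta"),
    ("feed_a", "City of Alpha"),
    ("feed_b", "Gamma Transit"),
]

# Hypothetical pings: (location_ping_id, gtfs_dataset_key).
pings = [("p1", "feed_a"), ("p2", "feed_a"), ("p3", "feed_b")]


def naive_join(pings, provider_rows):
    """Join every ping to every org row for its feed (this fans out)."""
    return [
        (ping_id, feed, org)
        for ping_id, feed in pings
        for prov_feed, org in provider_rows
        if prov_feed == feed
    ]


def canonical_org_per_feed(provider_rows):
    """Keep one org per feed: the lexicographically smallest name."""
    orgs = defaultdict(list)
    for feed, org in provider_rows:
        orgs[feed].append(org)
    return {feed: min(names) for feed, names in orgs.items()}


fanned_out = naive_join(pings, provider_rows)
canonical = canonical_org_per_feed(provider_rows)
collapsed = [(ping_id, feed, canonical[feed]) for ping_id, feed in pings]

print(len(fanned_out))  # pings on the shared feed are duplicated
print(len(collapsed))   # one row per ping
```

The collapse keeps the ping grain intact, at the cost of attributing a shared feed to a single org.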

TIDES = Transit Integrated Data Exchange Specification, https://tides-transit.org/main/.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

Include commands/logs/screenshots as relevant.

If making changes to dbt models, make sure they were created or updated on Staging. Please run the commands uv run dbt run -s CHANGED_MODEL --target staging and uv run dbt test -s CHANGED_MODEL --target staging, then include the output in this section of the PR.

uv run dbt run -s +fct_tides_vehicle_locations --target staging
uv run dbt test -s fct_tides_vehicle_locations --target staging

Materialized in cal-itp-data-infra-staging.christopher_mart_gtfs.fct_tides_vehicle_locations. 24-day window from 2026-03-20 to 2026-04-30:

  • 96,049,006 rows, 100 distinct agencies
  • 99.991% unique location_ping_id (8,538 dups; under team unique_proportion at_least 0.999)
  • Zero NULL on location_ping_id, event_timestamp, vehicle_id, trip_id_performed
  • Zero violations on TIDES bounds (lat/lon, heading, speed, odometer, trip_stop_sequence ≥ 1)
  • current_status enum mapping correct (no raw GTFS-RT values leak through)
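
The uniqueness figures above can be re-derived from the two reported counts. This is only an arithmetic check, and it assumes the proportion is computed as (rows minus duplicate rows) / rows; the warehouse's unique_proportion test may define it slightly differently (for example distinct values / rows).

```python
TOTAL_ROWS = 96_049_006   # rows reported above
DUPLICATE_ROWS = 8_538    # duplicate location_ping_id rows reported above
THRESHOLD = 0.999         # unique_proportion at_least value cited above


def unique_proportion(total_rows: int, duplicate_rows: int) -> float:
    """Share of rows that are not duplicates (assumed definition)."""
    return (total_rows - duplicate_rows) / total_rows


proportion = unique_proportion(TOTAL_ROWS, DUPLICATE_ROWS)
dup_rate_pct = 100 * DUPLICATE_ROWS / TOTAL_ROWS

print(f"unique proportion: {proportion * 100:.3f}%")
print(f"duplicate rate:    {dup_rate_pct:.4f}%")
print("meets threshold:", proportion >= THRESHOLD)
print("strictly unique:", DUPLICATE_ROWS == 0)
```

The last two lines are the distinction that matters here: the proportion clears the 0.999 threshold, but any non-zero duplicate count fails a strict uniqueness requirement.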

Top agencies by ping count:

| Agency | Pings | Vehicles |
|---|---|---|
| LA Metro | 18,720,458 | 2,384 |
| SFMTA | 11,917,329 | 895 |
| OCTA | 6,977,247 | 458 |
| AC Transit | 6,565,813 | 502 |
| VTA | 5,837,699 | 526 |

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

  • No action required
  • Actions required (specified below)

Two follow-up PRs ready locally and waiting on this one:

Will open follow-up issues for the upstream fct_vehicle_locations.key strict-uniqueness option and for the column-level RT test binding behavior (also affects existing fct_vehicle_locations).

@github-actions

github-actions Bot commented May 1, 2026

Warehouse report: Failed to add ci-report to a comment. Review the ci-report in the Summary.

@github-actions

github-actions Bot commented May 1, 2026

Impacted Exposures

No exposures are impacted by the changes in this PR.

Changed models

  • models/mart/gtfs/fct_tides_vehicle_locations.sql

If any impacted exposures are unexpected, verify that your changes do not unintentionally affect downstream consumers.

@github-actions

github-actions Bot commented May 1, 2026

Terraform plan in iac/cal-itp-data-infra-staging/airflow/us

Plan: 3 to add, 4 to change, 0 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-staging-composer-catalog will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-catalog" {
!~      content             = (sensitive value)
!~      crc32c              = "7vbSEg==" -> (known after apply)
!~      detect_md5hash      = "gzQlzyAjYlTGiWPOSPmt/Q==" -> "different hash"
!~      generation          = 1777921775322636 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/catalog.json"
!~      md5hash             = "gzQlzyAjYlTGiWPOSPmt/Q==" -> (known after apply)
        name                = "data/warehouse/target/catalog.json"
#        (16 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["dbt_project.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "cIuoNQ==" -> (known after apply)
!~      detect_md5hash      = "bsZgcfmK985tISFYJCt+qg==" -> "different hash"
!~      generation          = 1777669801966208 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/dbt_project.yml"
!~      md5hash             = "bsZgcfmK985tISFYJCt+qg==" -> (known after apply)
        name                = "data/warehouse/dbt_project.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/tides/_mart_tides.yml"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/_mart_tides.yml"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/_mart_tides.yml"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/tides/fct_tides_vehicle_locations.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["seeds/_seeds.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "7/62ZA==" -> (known after apply)
!~      detect_md5hash      = "auu3vnNdExPQiA88ThI9DA==" -> "different hash"
!~      generation          = 1776453636837026 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/seeds/_seeds.yml"
!~      md5hash             = "auu3vnNdExPQiA88ThI9DA==" -> (known after apply)
        name                = "data/warehouse/seeds/_seeds.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["seeds/tides_publication_keys.csv"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/seeds/tides_publication_keys.csv"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/seeds/tides_publication_keys.csv"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-manifest will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-manifest" {
!~      content             = (sensitive value)
!~      crc32c              = "ruSOBg==" -> (known after apply)
!~      detect_md5hash      = "Mw4Cul2QM1zWeUWwGhMlmw==" -> "different hash"
!~      generation          = 1777921776550660 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/manifest.json"
!~      md5hash             = "Mw4Cul2QM1zWeUWwGhMlmw==" -> (known after apply)
        name                = "data/warehouse/target/manifest.json"
#        (16 unchanged attributes hidden)
    }

Plan: 3 to add, 4 to change, 0 to destroy.

📝 Plan generated in Deploy dbt #1822

@github-actions

github-actions Bot commented May 1, 2026

Terraform plan in iac/cal-itp-data-infra/airflow/us

Plan: 3 to add, 2 to change, 0 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-composer-dags["dbt_project.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "cIuoNQ==" -> (known after apply)
!~      detect_md5hash      = "bsZgcfmK985tISFYJCt+qg==" -> "different hash"
!~      generation          = 1777669782489514 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/dbt_project.yml"
!~      md5hash             = "bsZgcfmK985tISFYJCt+qg==" -> (known after apply)
        name                = "data/warehouse/dbt_project.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/tides/_mart_tides.yml"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/_mart_tides.yml"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/_mart_tides.yml"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/tides/fct_tides_vehicle_locations.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["seeds/_seeds.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "7/62ZA==" -> (known after apply)
!~      detect_md5hash      = "auu3vnNdExPQiA88ThI9DA==" -> "different hash"
!~      generation          = 1776457910260376 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/seeds/_seeds.yml"
!~      md5hash             = "auu3vnNdExPQiA88ThI9DA==" -> (known after apply)
        name                = "data/warehouse/seeds/_seeds.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["seeds/tides_publication_keys.csv"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/seeds/tides_publication_keys.csv"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/seeds/tides_publication_keys.csv"
+       storage_class  = (known after apply)
    }

Plan: 3 to add, 2 to change, 0 to destroy.

📝 Plan generated in Deploy dbt #1822

@vevetron
Contributor

vevetron commented May 1, 2026

In the model description ("... regional-subfeed fixed-route feeds. Closes #4837."): remove "Closes #4837." from the model text.

@vevetron
Contributor

vevetron commented May 1, 2026

  • should be its own grouping of queries, not in the "mart" space. maybe mart_tides?
  • I think it should be a view and not materialized. Access pattern is likely to be incredibly infrequent and storing the processed data for this is probably unnecessary.

@chrisyamas
Contributor Author

chrisyamas commented May 2, 2026

@vevetron thanks, addressing your three comments above and the two follow-ups from the call (HAVING clauses on Slack, the in-code TIDES #22 / #252 reference on screen-share). I just force-pushed a commit which:

  • moved Closes #4837 out of the model description; it lives in the PR body now
  • moved the model out of mart_gtfs into its own mart/tides/ folder with a mart_tides schema, so it materializes in <user>_mart_tides rather than <user>_mart_gtfs. Same destination for the two stacked follow-up PRs (chore/tides-validation-harness and feat/tides-trips-performed), which carry the same shape and will go up for review once this lands
  • converted from incremental microbatch to a view, agreed on the access-pattern reasoning. Dropped partition_by, cluster_by, event_time, batch_size, begin, lookback, full_refresh, and on_schema_change config keys that don't apply to a view
  • added a model-level meta: { publish.product: tides } in the new yml, going with that key to fit the existing publish.* / ckan.* dotted-namespace convention I see on the CKAN-published models (e.g., dim_gtfs_datasets_latest). Happy to switch to a literal dbt tags: ['tides_product'] instead if you'd prefer — I noticed the warehouse doesn't currently use dbt tags: anywhere, which is why I went with the meta key
  • for the HAVING clauses (re: your Slack follow-up), I refactored the public_subfeed_agencies CTE from ANY_VALUE(... HAVING MIN organization_name) + GROUP BY 1 to QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY organization_name ASC) = 1. So this should match the pattern used elsewhere in the warehouse for picking one canonical row per group?
  • for the in-code TIDES issue reference you flagged on screen-share, dropped the parenthetical and replaced it with the full URL https://github.com/TIDES-transit/TIDES/issues/252 in both the SQL comment and the yml description

@github-actions

github-actions Bot commented May 2, 2026

Impacted Exposures

No exposures are impacted by the changes in this PR.

Changed models

  • models/mart/tides/fct_tides_vehicle_locations.sql

If any impacted exposures are unexpected, verify that your changes do not unintentionally affect downstream consumers.

Contributor

@lauriemerrell lauriemerrell left a comment

A couple comments, think broad strokes look ok.

DATETIME(vp.location_timestamp, vp.schedule_feed_timezone) AS event_timestamp,

vp.trip_id AS trip_id_performed,
-- trip_id_scheduled left NULL for MVP; deriving requires a reliable
Contributor

not sure if this is true -- per GTFS spec, trip_id in VP should reference schedule unless the schedule_relationship is one of a few specific values.... so maybe this should be based on that?

also, this is the intent of the trip_instance_key identifier (to allow joins of a specific trip across feed types), so can use that for lookup if desired

Contributor Author

good catch, agreed. per the GTFS-RT spec the VP trip_id does reference the schedule whenever trip.schedule_relationship is one of the in-schedule values, so deriving trip_id_scheduled from that conditional is the right shape. trip_instance_key is also viable as the join key.

leaving this PR's trip_id_scheduled as NULL for MVP and filing a follow-up to wire up the conditional + the join. happy to bump that into scope here if you'd prefer it land together.
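
for the follow-up, a rough sketch of the conditional discussed in this thread, written in Python rather than the model's SQL. Two things here are assumptions and not something this thread settled: the set of schedule_relationship values treated as "in-schedule" (SCHEDULED and CANCELED below), and treating a missing schedule_relationship as SCHEDULED. The function name is hypothetical.

```python
from typing import Optional

# Assumption: the values whose trip_id is taken to reference the static schedule.
# The final list should be confirmed against the GTFS-RT reference.
IN_SCHEDULE_RELATIONSHIPS = {"SCHEDULED", "CANCELED"}


def derive_trip_id_scheduled(
    trip_id: Optional[str],
    schedule_relationship: Optional[str],
) -> Optional[str]:
    """Return trip_id when it is assumed to reference the schedule, else None."""
    if trip_id is None:
        return None
    # Assumption: a missing schedule_relationship is treated as SCHEDULED.
    relationship = schedule_relationship or "SCHEDULED"
    if relationship in IN_SCHEDULE_RELATIONSHIPS:
        return trip_id
    return None


print(derive_trip_id_scheduled("trip_1", "SCHEDULED"))
print(derive_trip_id_scheduled("trip_1", "ADDED"))
```

trip_id_performed would stay as the raw VP trip_id either way; only trip_id_scheduled becomes conditional.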

USING (gtfs_dataset_key)
),

-- TIDES requires location_ping_id strictly unique; the upstream key is
Contributor

noting that base64_url is part of key so if two results have same value they need to have same URL... so that won't actually be a substantive tie break

Contributor Author

right, both ordering columns were degenerate at this grain since location_timestamp and base64_url are components of the upstream key. fixed in a fixup: pulled the dedup up to the source CTE and ordered by _extract_ts DESC (most-recently-extracted wins), which differs across the duplicates. trailing deduped CTE collapsed into the source. verified 0 dups on a sampled service_date in staging.
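
to make the degenerate-ordering point concrete, a small Python sketch with made-up rows (the timestamps, URL value, and speeds are illustrative, not staging data). It checks whether a set of ordering columns ever varies within a key, then keeps the most recently extracted row per key, which is the behavior described above.

```python
# Hypothetical duplicate rows: location_timestamp and base64_url are part of
# the key, so they are identical across rows that share a key.
rows = [
    {"key": "k1", "location_timestamp": "2026-04-01T12:00:00",
     "base64_url": "aHR0cA", "_extract_ts": "2026-04-01T12:00:20", "speed": 4.0},
    {"key": "k1", "location_timestamp": "2026-04-01T12:00:00",
     "base64_url": "aHR0cA", "_extract_ts": "2026-04-01T12:00:40", "speed": 4.5},
    {"key": "k2", "location_timestamp": "2026-04-01T12:00:05",
     "base64_url": "aHR0cA", "_extract_ts": "2026-04-01T12:00:20", "speed": 0.0},
]


def ordering_is_degenerate(rows, columns):
    """True if, within every key, the given columns take a single value."""
    seen = {}
    for row in rows:
        seen.setdefault(row["key"], set()).add(tuple(row[c] for c in columns))
    return all(len(values) == 1 for values in seen.values())


def dedup_latest_extract(rows):
    """Keep one row per key: the one with the greatest _extract_ts."""
    winners = {}
    for row in rows:
        current = winners.get(row["key"])
        # ISO-8601 strings in one format compare correctly as plain strings.
        if current is None or row["_extract_ts"] > current["_extract_ts"]:
            winners[row["key"]] = row
    return list(winners.values())


print(ordering_is_degenerate(rows, ["location_timestamp", "base64_url"]))
print(ordering_is_degenerate(rows, ["_extract_ts"]))
print(len(dedup_latest_extract(rows)))
```

With the old ordering columns the choice between the two k1 rows is arbitrary; ordering on _extract_ts makes it deterministic.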

(https://tides-transit.org/main/). Sourced from `fct_vehicle_locations`
and filtered via `dim_provider_gtfs_data` to public, customer-facing or
regional-subfeed fixed-route feeds.
meta:
Contributor

don't think we use publish.product anywhere else, what is intent for that as meta? can/should we be defining a dbt exposure? that is our standard for published items. see for example https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_schedule_latest/_gtfs_schedule_latest.yml#L1270-L1314 and https://github.com/cal-itp/data-infra/blob/main/airflow/dags/publish_gtfs.py and https://github.com/cal-itp/data-infra/blob/main/airflow/plugins/operators/dbt_manifest_to_metadata_operator.py#L88-L90 for the GTFS --> CKAN publish flow

Contributor Author

good call, exposure is the right shape. dropping the publish.product meta in this PR. adding a single california_tides exposure as part of PR 5220 (so both fct_tides_vehicle_locations and fct_tides_trips_performed ref()s resolve in the same checkout), modeled on the GTFS california_open_data block. owner / methodology fields filled in; meta.destinations now filled in via PR 5229 (publishing pipeline).

FROM {{ ref('fct_vehicle_locations') }}
),

-- dim_provider_gtfs_data fans out: a single vehicle_positions feed can be
Contributor

I am a little confused by the logic here -- my instinct would be to either group by VP URL where it meets these criteria (public facing etc.) and array_agg the organization info so that all orgs can be used later in the publish process OR just select distinct on the non-org columns and ignore the organization parts.

Basically, not sure how ending up with a VP feed tagged with one of its organizations meets future needs -- if we need orgs, we should keep all of them and handle unnesting or whatever in the publish process. If we don't need all the orgs then let's just drop and publish under the VP URL and publish org related metadata separately.

Contributor Author

yeah agreed the fan-out collapse is doing more than it should. going with your second framing: dropping organization_name / organization_ntd_id from both fact tables since orgs aren't part of the TIDES spec anyway.

the public_subfeed_agencies CTE shrinks to a public_subfeed_keys CTE that's just SELECT DISTINCT on dataset keys matching the public-customer-facing-or-regional-subfeed criterion, no QUALIFY needed. govcbus / SD MTS multi-org reality moves to publish-side metadata in PR 5229, separate from the TIDES tables themselves.

@erikamov
Contributor

erikamov commented May 4, 2026

In previous meetings, Hermosa Beach was selected as the first candidate to share its data. Other agencies would need to confirm that they agree to share their data, so we would need to filter the results to generate output only for specific agencies.
@evansiroky and @vevetron, should we keep filtering only for Hermosa Beach and ignore that customer_facing is FALSE for it?

@evansiroky
Member

We were working with the City of Hermosa Beach which is interested in studying on-time-performance. I believe the thought was to narrow the initial output to just the agencies that traverse Hermosa Beach for development purposes only to check on costs and implementation before expanding statewide.

@erikamov
Contributor

erikamov commented May 4, 2026

> We were working with the City of Hermosa Beach which is interested in studying on-time-performance. I believe the thought was to narrow the initial output to just the agencies that traverse Hermosa Beach for development purposes only to check on costs and implementation before expanding statewide.

Yeah, it is what I remember too. :)

@chrisyamas
Contributor Author

thanks both, the scope has been narrowed! went ahead and implemented it as MVP this morning rather than wait for our sync since the seed-based mechanism is small enough that landing it gives us something concrete to react to.

scope is now three feeds via a new tides_publication_keys seed:

  • Beach Cities Transit (Hermosa local operator, BCT JPA): 9edf45e373638700ca420b1e588efdaf
  • LA Metro Bus (south bay routes 102, 130, 232, 344): 1745f7d9b9fa48cdbc8ea282e60602bd
  • Torrance Transit (Swiftly feed): 46b00e5c738a0ebf93522371d9899627

filter is an INNER JOIN on the seed inside both fct_tides_vehicle_locations and fct_tides_trips_performed. the existing public_customer_facing_or_regional_subfeed_fixed_route filter stays in place; the seed is additive narrowing on top of it. row count drops to 18.3M (from ~96M) on the 9-day vehicle_locations window and 99,960 (from 590,253) on the 8-day trips_performed window.
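
to show what "additive narrowing" means in practice, a small Python sketch of the inner-join semantics. The three seed keys are the ones listed above; the other keys, the ping rows, and the function names are made up for illustration, and the real filter is SQL inside the views.

```python
# Dataset keys listed above (the tides_publication_keys seed).
SEED_KEYS = {
    "9edf45e373638700ca420b1e588efdaf",  # Beach Cities Transit
    "1745f7d9b9fa48cdbc8ea282e60602bd",  # LA Metro Bus
    "46b00e5c738a0ebf93522371d9899627",  # Torrance Transit
}

# Hypothetical result of the existing public / customer-facing filter:
# the three seed feeds plus a made-up public feed that is not in the seed.
MADE_UP_PUBLIC_KEY = "f" * 32
PUBLIC_KEYS = SEED_KEYS | {MADE_UP_PUBLIC_KEY}

# Hypothetical seed entry for a feed that does not pass the public filter.
MADE_UP_NON_PUBLIC_KEY = "0" * 32
seed_with_extra = SEED_KEYS | {MADE_UP_NON_PUBLIC_KEY}


def publishable_keys(public_keys, seed_keys):
    """Inner-join semantics: a key must pass both filters."""
    return public_keys & seed_keys


def filter_pings(pings, allowed_keys):
    """Keep only pings whose feed key is allowed."""
    return [p for p in pings if p["gtfs_dataset_key"] in allowed_keys]


allowed = publishable_keys(PUBLIC_KEYS, seed_with_extra)

pings = [
    {"location_ping_id": "p1", "gtfs_dataset_key": "9edf45e373638700ca420b1e588efdaf"},
    {"location_ping_id": "p2", "gtfs_dataset_key": MADE_UP_PUBLIC_KEY},
    {"location_ping_id": "p3", "gtfs_dataset_key": MADE_UP_NON_PUBLIC_KEY},
]
kept = filter_pings(pings, allowed)

print(sorted(allowed) == sorted(SEED_KEYS))
print([p["location_ping_id"] for p in kept])
```

The seed can only remove feeds from what the public filter already allows; adding a key to the seed does not publish a feed that fails the public filter.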

PR description on this one is updated to match. PR 5220 (trips_performed) inherits the same seed; PRs 5229 and 5230 (the publishing pipeline + staging bucket, drafts up now) inherit the narrowing automatically since they consume the views.

one open question worth flagging for our sync today is whether it would be good to formalize a tides_publication_consent flag on dim_provider_gtfs_data long-term (defaulting to the existing public-customer-facing flag, overridable per agency), or keep the seed as the publication-list mechanism going forward. seed is the lightest mvp; flag would be more durable. happy to take either direction.

@github-actions

github-actions Bot commented May 5, 2026

Warehouse report 📦

Checks/potential follow-ups

Checks indicate the following action items may be necessary.

  • For new models, do they all have a surrogate primary key that is tested to be not-null and unique?

New models 🌱

calitp_warehouse.mart.tides.fct_tides_vehicle_locations

DAG

Legend (in order of precedence)

| Resource type | Indicator | Resolution |
|---|---|---|
| Large table-materialized model | Orange | Make the model incremental |
| Large model without partitioning or clustering | Orange | Add partitioning and/or clustering |
| View with more than one child | Yellow | Materialize as a table or incremental |
| Incremental | Light green | |
| Table | Green | |
| View | White | |

@github-actions

github-actions Bot commented May 5, 2026

Impacted Exposures

No exposures are impacted by the changes in this PR.

Changed models

  • models/mart/tides/fct_tides_vehicle_locations.sql

If any impacted exposures are unexpected, verify that your changes do not unintentionally affect downstream consumers.

Christopher Yamas and others added 9 commits May 5, 2026 15:38
Adds mart_gtfs.fct_tides_vehicle_locations, the first TIDES-conformant
model in the Cal-ITP warehouse. Reshapes fct_vehicle_locations into the
TIDES vehicle_locations schema and filters to public, customer-facing
or regional-subfeed fixed-route GTFS-RT feeds via dim_provider_gtfs_data.

The model produces the BigQuery table only. Per-agency parquet export
(#4693), the CDN-fronted public bucket (#4700), and the file validator
(#4839) are tracked separately.

Validated in christopher_mart_gtfs sandbox over a 24-day window
(2026-03-20 to 2026-04-30): 96M rows across 100 agencies, 99.991%
unique location_ping_id, zero NULL on TIDES required-not-null fields,
zero violations on TIDES bounds and enum constraints.
- Drop the file-level comment header on fct_tides_vehicle_locations.sql
  (matches the existing fct_vehicle_locations style).
- Trim CTE-level comments to keep WHY (agency fan-out, NULL-trip caveat,
  defensive dedup rationale) and drop WHAT comments that just restate the
  SQL below.
- Specify ASC on the second ORDER BY column to satisfy sqlfluff AM03.
- Use GROUP BY 1 instead of repeating the column name (matches existing
  Cal-ITP usage and avoids the AM06 alias-mismatch risk).
- Yml: tighten column descriptions, reuse anchor refs (*rt_service_date,
  *rt_vehicle_id, *rt_vp_stop_id, *gtfs_rt_dt, *base64_url,
  *gtfs_dataset_key_desc) where they apply, and wrap test where clauses
  in config: blocks to match the existing pattern on fct_vehicle_locations.
…_LOOKBACK_DAYS

The var was renamed in main by #5178 (Laurie's incremental-vs-microbatch
docs cleanup) after this branch was started. Match the new name so
dbt compile passes.
…ssue refs

Per Vivek's call feedback (and Slack follow-up about HAVING clauses):

- Move the model out of mart_gtfs into its own mart/tides folder. Add a
  mart_tides schema in dbt_project.yml so it materializes in
  <user>_mart_tides rather than <user>_mart_gtfs. The model is a TIDES
  product, not a GTFS mart, and grouping the eventual peers
  (trips_performed, stop_visits) under one folder keeps the boundary clear.
- Convert from incremental microbatch to view. The downstream consumer is
  the per-agency export Airflow job querying once per cycle, not
  interactive analytics, so paying compute on every read is fine and we
  avoid carrying a materialized 96M-row copy.
- Drop partition_by, cluster_by, full_refresh, on_schema_change, event_time,
  batch_size, begin, lookback config keys that don't apply to a view.
- Refactor public_subfeed_agencies CTE from
  ANY_VALUE(... HAVING MIN organization_name) GROUP BY 1 to
  QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY organization_name) = 1.
  Matches the team pattern used elsewhere in the warehouse (every other
  "pick one canonical row per group" site uses QUALIFY ROW_NUMBER).
- Add model-level meta: { publish.product: tides } in the new yml,
  parallel to the existing publish.* / ckan.* dotted-namespace meta keys
  used on CKAN-published models.
- Drop "Closes #4837" from the model description (belongs in the PR body,
  not the warehouse).
- Replace the inline TIDES issue #252 reference with the full GitHub URL.
…tive

The previous QUALIFY ordered by event_timestamp DESC, base64_url ASC. Both
columns are degenerate at the location_ping_id grain: location_timestamp
and base64_url are components of the upstream `key`, so they're constant
across rows that share a key. Move the dedup up to the source CTE and
order by `_extract_ts DESC` so most-recently-extracted wins. The trailing
`deduped` CTE collapses into the source CTE.
…lause

Adds a `mart.tides: +enabled: true` line under data_tests in dbt_project.yml
matching the existing `mart.payments` re-enable pattern. The model has six
column-level tests (not_null on location_ping_id / event_timestamp /
vehicle_id, accepted_values on current_status / trip_type, and
unique_proportion on location_ping_id); all six pass against staging.

The two accepted_values where clauses were `__rt_sampled__ AND <col> IS
NOT NULL`. The rt_sampled_where_clause macro only substitutes on an exact
`__rt_sampled__` match, so the compound form was emitting the literal token
to BigQuery and failing. Trimmed to bare `__rt_sampled__`, matching the
convention used everywhere else in the warehouse. accepted_values silently
ignores NULLs already, so the IS NOT NULL filter was redundant.
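
A minimal sketch of the failure mode described in this commit message, in Python. It is not the rt_sampled_where_clause macro itself: the function name and the substituted predicate are placeholders, and the only behavior taken from the message is that substitution happens on an exact `__rt_sampled__` match.

```python
RT_SAMPLED_TOKEN = "__rt_sampled__"

# Placeholder predicate; what the real macro substitutes is not shown here.
HYPOTHETICAL_SAMPLED_PREDICATE = "service_date = '2026-04-30'"


def render_where(where: str) -> str:
    """Mimic exact-match substitution: only a bare token is replaced."""
    if where == RT_SAMPLED_TOKEN:
        return HYPOTHETICAL_SAMPLED_PREDICATE
    # Anything else passes through unchanged, token included.
    return where


bare = render_where("__rt_sampled__")
compound = render_where("__rt_sampled__ AND current_status IS NOT NULL")

print(bare)
print(compound)  # still contains the literal token
```

Under that assumption the compound clause reaches the database with the literal token in it, which matches the failure the commit fixes by trimming to the bare token.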
…roduct

Orgs aren't part of the TIDES spec (vehicle_locations.schema.json defines
no organization fields). The agency-collapse CTE shrinks to a SELECT
DISTINCT on `vehicle_positions_gtfs_dataset_key`, no QUALIFY needed. Both
`organization_name` and `organization_ntd_id` come out of the model and
out of _mart_tides.yml.

The publish.product meta block is gone in favor of a real dbt exposure.
The exposure itself lands on PR 5220 alongside fct_tides_trips_performed
so both ref()s resolve in the same checkout.

Per-agency / per-org metadata for the publish flow lives separately
in PR 4 (next sprint).
@chrisyamas chrisyamas force-pushed the feat/tides-vehicle-locations branch from d90e087 to 529d4e2 Compare May 5, 2026 19:38
@github-actions

github-actions Bot commented May 5, 2026

Impacted Exposures

No exposures are impacted by the changes in this PR.

Changed models

  • models/mart/tides/fct_tides_vehicle_locations.sql

If any impacted exposures are unexpected, verify that your changes do not unintentionally affect downstream consumers.

Development

Successfully merging this pull request may close these issues.

Build a process to convert GTFS-RT Vehicle Positions data > TIDES Vehicle Locations using Hermosa Beach data

5 participants