
Add fct_tides_trips_performed dbt model #5220

Draft
chrisyamas wants to merge 4 commits into feat/tides-vehicle-locations from feat/tides-trips-performed

Conversation

@chrisyamas
Contributor

Description

Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.

Adds mart_tides.fct_tides_trips_performed, sourced from fct_observed_trips and joined to fct_scheduled_trips for route metadata and to fct_tides_vehicle_locations for canonical vehicle_id per trip. Filtered to public, customer-facing or regional-subfeed fixed-route GTFS feeds via dim_provider_gtfs_data.

Stacked on feat/tides-vehicle-locations. Lives alongside fct_tides_vehicle_locations in the new mart/tides/ folder and carries the same publish.product: tides model meta. Includes a relationships test asserting that every vehicle_id here resolves to at least one row in fct_tides_vehicle_locations.
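As a rough illustration of what the relationships test asserts, here is a minimal Python sketch. It is not the dbt test itself: the function name `orphan_vehicle_ids` and the sample rows are hypothetical, and the check matches on vehicle_id only.

```python
from typing import Iterable, Mapping, Set


def orphan_vehicle_ids(
    trips: Iterable[Mapping[str, object]],
    vehicle_locations: Iterable[Mapping[str, object]],
) -> Set[object]:
    """Return vehicle_ids that appear in trips but in no vehicle_locations row."""
    known = {row["vehicle_id"] for row in vehicle_locations}
    return {row["vehicle_id"] for row in trips if row["vehicle_id"] not in known}


# Hypothetical sample rows, not real warehouse data.
trips = [
    {"service_date": "2026-04-23", "trip_id_performed": "t1", "vehicle_id": "bus-1"},
    {"service_date": "2026-04-23", "trip_id_performed": "t2", "vehicle_id": "bus-9"},
]
vehicle_locations = [
    {"service_date": "2026-04-23", "vehicle_id": "bus-1"},
]

# A passing relationships test corresponds to an empty set here.
orphans = orphan_vehicle_ids(trips, vehicle_locations)
```

In this sample, `bus-9` has no vehicle-location row, so the equivalent test would fail until that row exists or the trip is filtered out.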

A few design decisions worth flagging:

  • Filtered to appeared_in_vp = TRUE upstream so every row has a derivable vehicle_id. Trips that only appeared in trip_updates (no VP) are excluded.
  • Each trip is assigned its most-frequent vehicle_id during VP coverage (APPROX_TOP_COUNT(vehicle_id, 1)). Trips whose vehicle changes mid-trip get the dominant one.
  • schedule_trip_start / schedule_trip_end are cast to DATETIME using feed_timezone from fct_scheduled_trips, falling back to America/Los_Angeles when the feed timezone is NULL. That is a defensible default for California; flag it if you'd rather a different fallback.
  • Materialized as a view, matching fct_tides_vehicle_locations. Earlier prototype materialized as a table to work around microbatch's auto-filter on event_time refs (the vehicle_per_trip CTE needs cross-dt access). Views don't have that filter, so the constraint goes away. The downstream consumer is the per-agency export Airflow job, so paying compute on each read is fine.
  • Agency-collapse CTE uses QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY organization_name) = 1, matching the pattern used elsewhere in the warehouse.
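Two of the decisions above can be sketched in Python for readers who want to see the logic outside SQL. This is only an illustration under stated assumptions: the helper names (`dominant_vehicle_id`, `resolve_timezone`, `to_local_naive`) are hypothetical, and `Counter` gives an exact count whereas APPROX_TOP_COUNT in the model is approximate.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional
from zoneinfo import ZoneInfo

# Fallback named in the PR description for feeds with a NULL timezone.
FALLBACK_TZ = "America/Los_Angeles"


def dominant_vehicle_id(vehicle_ids: Iterable[str]) -> Optional[str]:
    """Return the most frequently observed vehicle_id, or None if there are none."""
    counts = Counter(vehicle_ids)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def resolve_timezone(feed_timezone: Optional[str]) -> str:
    """Use the feed's timezone when present, otherwise the fallback."""
    return feed_timezone or FALLBACK_TZ


def to_local_naive(ts: datetime, feed_timezone: Optional[str]) -> datetime:
    """Convert a timezone-aware timestamp to a naive local datetime.

    This mirrors casting a timestamp to a civil DATETIME in the feed's zone.
    It needs the system (or tzdata package) timezone database to be available.
    """
    tz = ZoneInfo(resolve_timezone(feed_timezone))
    return ts.astimezone(tz).replace(tzinfo=None)


# Hypothetical observations for one trip: the vehicle changes mid-trip.
observed = ["bus-1", "bus-1", "bus-2", "bus-1"]
chosen = dominant_vehicle_id(observed)
```

Ties are not handled specially in this sketch; how the model resolves them depends on the aggregate it uses.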

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

Include commands/logs/screenshots as relevant.

If making changes to dbt models, make sure they were created or updated on Staging. Please run the commands uv run dbt run -s CHANGED_MODEL --target staging and uv run dbt test -s CHANGED_MODEL --target staging, then include the output in this section of the PR.

uv run dbt run -s +fct_tides_trips_performed --target staging
uv run dbt test -s fct_tides_trips_performed --target staging

Materialized in cal-itp-data-infra-staging.christopher_mart_tides.fct_tides_trips_performed. 8-day window from 2026-04-23 to 2026-04-30 (numbers from the earlier table-materialized run; re-running as a view in the new mart_tides dataset is on the post-merge follow-up list):

  • 553,084 rows, 99 distinct agencies
  • Zero PK duplicates on (service_date, trip_id_performed)
  • Zero NULL on service_date, trip_id_performed, vehicle_id
  • schedule_relationship distribution: 385,845 Scheduled / 165,672 NULL / 1,398 Canceled / 163 Added / 6 Duplicated
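The checks listed above (row count, primary-key duplicates, NULL counts, value distribution) could be reproduced on an exported sample with something like the following. It is a sketch, not the validation actually run for this PR; the `validate` function and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping

# Grain and required-not-null columns as stated in the PR description.
PK = ("service_date", "trip_id_performed")
REQUIRED = ("service_date", "trip_id_performed", "vehicle_id")


def validate(rows: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Summarize row count, PK duplicates, NULLs, and schedule_relationship values."""
    rows = list(rows)
    pk_counts = Counter(tuple(r[c] for c in PK) for r in rows)
    return {
        "row_count": len(rows),
        # Number of surplus rows beyond the first for each repeated key.
        "pk_duplicates": sum(n - 1 for n in pk_counts.values() if n > 1),
        "nulls": {c: sum(1 for r in rows if r.get(c) is None) for c in REQUIRED},
        "schedule_relationship": Counter(r.get("schedule_relationship") for r in rows),
    }


# Hypothetical sample rows, not real warehouse data.
sample = [
    {"service_date": "2026-04-23", "trip_id_performed": "t1",
     "vehicle_id": "bus-1", "schedule_relationship": "Scheduled"},
    {"service_date": "2026-04-23", "trip_id_performed": "t2",
     "vehicle_id": "bus-2", "schedule_relationship": None},
    {"service_date": "2026-04-24", "trip_id_performed": "t1",
     "vehicle_id": "bus-1", "schedule_relationship": "Canceled"},
]
report = validate(sample)
```

As a consistency check on the figures above, the reported distribution (385,845 + 165,672 + 1,398 + 163 + 6) sums to the reported 553,084 rows.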

Top agencies by trip count: LA Metro (105,082), SFMTA (67,062), San Diego International Airport (51,005), AC Transit (33,753), VTA (25,781).

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

  • No action required

  • Actions required (specified below)

  • Update the Frictionless validation harness to include trips_performed schema validation.

  • Open follow-up issue for stop_visits model and for trip_start_stop_id / trip_end_stop_id derivation (deferred past MVP, requires stop_times join).

@github-actions

github-actions Bot commented May 2, 2026

Warehouse report: Failed to add ci-report to a comment. Review the ci-report in the Summary.

@github-actions

github-actions Bot commented May 2, 2026

Impacted Exposures

No exposures are impacted by the changes in this PR.

Changed models

  • models/mart/tides/fct_tides_trips_performed.sql

If any impacted exposures are unexpected, verify that your changes do not unintentionally affect downstream consumers.

@github-actions

github-actions Bot commented May 2, 2026

Terraform plan in iac/cal-itp-data-infra/airflow/us

Plan: 4 to add, 2 to change, 0 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-composer-dags["dbt_project.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "cIuoNQ==" -> (known after apply)
!~      detect_md5hash      = "bsZgcfmK985tISFYJCt+qg==" -> "different hash"
!~      generation          = 1777669782489514 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/dbt_project.yml"
!~      md5hash             = "bsZgcfmK985tISFYJCt+qg==" -> (known after apply)
        name                = "data/warehouse/dbt_project.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/tides/_mart_tides.yml"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/_mart_tides.yml"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/_mart_tides.yml"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/tides/fct_tides_trips_performed.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/fct_tides_trips_performed.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/fct_tides_trips_performed.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/tides/fct_tides_vehicle_locations.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["seeds/_seeds.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "7/62ZA==" -> (known after apply)
!~      detect_md5hash      = "auu3vnNdExPQiA88ThI9DA==" -> "different hash"
!~      generation          = 1776457910260376 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/seeds/_seeds.yml"
!~      md5hash             = "auu3vnNdExPQiA88ThI9DA==" -> (known after apply)
        name                = "data/warehouse/seeds/_seeds.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["seeds/tides_publication_keys.csv"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/seeds/tides_publication_keys.csv"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/seeds/tides_publication_keys.csv"
+       storage_class  = (known after apply)
    }

Plan: 4 to add, 2 to change, 0 to destroy.

📝 Plan generated in Deploy dbt #1823

@github-actions

github-actions Bot commented May 2, 2026

Terraform plan in iac/cal-itp-data-infra-staging/airflow/us

Plan: 4 to add, 4 to change, 0 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-staging-composer-catalog will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-catalog" {
!~      content             = (sensitive value)
!~      crc32c              = "7vbSEg==" -> (known after apply)
!~      detect_md5hash      = "gzQlzyAjYlTGiWPOSPmt/Q==" -> "different hash"
!~      generation          = 1777921775322636 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/catalog.json"
!~      md5hash             = "gzQlzyAjYlTGiWPOSPmt/Q==" -> (known after apply)
        name                = "data/warehouse/target/catalog.json"
#        (16 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["dbt_project.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "cIuoNQ==" -> (known after apply)
!~      detect_md5hash      = "bsZgcfmK985tISFYJCt+qg==" -> "different hash"
!~      generation          = 1777669801966208 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/dbt_project.yml"
!~      md5hash             = "bsZgcfmK985tISFYJCt+qg==" -> (known after apply)
        name                = "data/warehouse/dbt_project.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/tides/_mart_tides.yml"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/_mart_tides.yml"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/_mart_tides.yml"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/tides/fct_tides_trips_performed.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/fct_tides_trips_performed.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/fct_tides_trips_performed.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/tides/fct_tides_vehicle_locations.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/tides/fct_tides_vehicle_locations.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["seeds/_seeds.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "7/62ZA==" -> (known after apply)
!~      detect_md5hash      = "auu3vnNdExPQiA88ThI9DA==" -> "different hash"
!~      generation          = 1776453636837026 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/seeds/_seeds.yml"
!~      md5hash             = "auu3vnNdExPQiA88ThI9DA==" -> (known after apply)
        name                = "data/warehouse/seeds/_seeds.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["seeds/tides_publication_keys.csv"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/seeds/tides_publication_keys.csv"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/seeds/tides_publication_keys.csv"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-manifest will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-manifest" {
!~      content             = (sensitive value)
!~      crc32c              = "ruSOBg==" -> (known after apply)
!~      detect_md5hash      = "Mw4Cul2QM1zWeUWwGhMlmw==" -> "different hash"
!~      generation          = 1777921776550660 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/manifest.json"
!~      md5hash             = "Mw4Cul2QM1zWeUWwGhMlmw==" -> (known after apply)
        name                = "data/warehouse/target/manifest.json"
#        (16 unchanged attributes hidden)
    }

Plan: 4 to add, 4 to change, 0 to destroy.

📝 Plan generated in Deploy dbt #1823

GROUP BY 1, 2
),

-- Same shared-feed agency collapse as fct_tides_vehicle_locations.
Contributor

same comment as there re: org logic

Contributor Author

addressed alongside L4 on PR 5216: dropping organization_name / organization_ntd_id from this table too. the public_subfeed_agencies CTE shrinks to a public_subfeed_keys SELECT DISTINCT, no QUALIFY. orgs aren't part of the TIDES spec; metadata moves to publish-side in PR 5229.

),

-- TIDES requires (service_date, trip_id_performed) unique. fct_observed_trips
-- can have multiple rows per PK when the same trip appears in multiple feeds;
Contributor

This should only dedupe within a given feed (VP URL or VP GTFS dataset); don't dedupe across feeds. There is no reason to expect trip_id_performed to be unique across feeds, and dropping rows like this will make trips go missing from some feeds essentially arbitrarily, just because they use the same ID as another feed. TIDES isn't really designed for this cross-agency use case (all IDs within TIDES should only be assumed to be unique within a given feed, not across feeds/agencies), and honestly maybe this needs to be surfaced at the TIDES spec level: should there be an agency or feed identifier in this table for cross-agency use cases?

Contributor Author

right, the partition was too coarse. fixed: added vehicle_positions_gtfs_dataset_key to the partition so dedup happens within feed, not across feeds. trips that share trip_id_performed across different feeds now both survive. base64_url tie-breaker dropped from the ORDER BY (redundant once partitioned by feed).

re-materialized with the fix on an 8-day window (2026-04-23 to 2026-04-30): 590,253 rows after the fix, versus 562,378 distinct (service_date, trip_id_performed) pairs under the old grain. the 27,875 trips (590,253 - 562,378) that were getting collapsed across feeds now survive as their own rows.

agreed on your TIDES-spec point: cross-feed uniqueness is not something the spec guarantees, so the assumption "trip_id_performed is unique within feed + service_date only" is the right framing. worth a thread in TIDES-community channels separately; happy to take that on.
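The change in dedup grain discussed in this thread can be sketched in Python as follows. It is an illustration only: the `dedupe` helper and the sample rows are hypothetical, and keeping the first row seen stands in for whatever ORDER BY the model's ROW_NUMBER uses.

```python
from typing import Any, Dict, Iterable, List, Mapping, Sequence, Tuple

# Old grain collapsed trips across feeds; the new grain adds the feed key.
OLD_GRAIN = ("service_date", "trip_id_performed")
NEW_GRAIN = ("service_date", "trip_id_performed", "vehicle_positions_gtfs_dataset_key")


def dedupe(
    rows: Iterable[Mapping[str, Any]], partition_cols: Sequence[str]
) -> List[Mapping[str, Any]]:
    """Keep the first row seen for each distinct combination of partition_cols."""
    kept: Dict[Tuple[Any, ...], Mapping[str, Any]] = {}
    for row in rows:
        key = tuple(row[c] for c in partition_cols)
        kept.setdefault(key, row)
    return list(kept.values())


# Hypothetical rows: trip "100" is duplicated within feed-a and also appears in feed-b.
rows = [
    {"service_date": "2026-04-23", "trip_id_performed": "100",
     "vehicle_positions_gtfs_dataset_key": "feed-a"},
    {"service_date": "2026-04-23", "trip_id_performed": "100",
     "vehicle_positions_gtfs_dataset_key": "feed-a"},
    {"service_date": "2026-04-23", "trip_id_performed": "100",
     "vehicle_positions_gtfs_dataset_key": "feed-b"},
]

across_feeds = dedupe(rows, OLD_GRAIN)  # feed-b's trip is dropped
within_feed = dedupe(rows, NEW_GRAIN)   # one row per feed survives
```

With the old grain the feed-b trip disappears because it shares an ID with feed-a; with the feed key in the partition, only the true within-feed duplicate is removed.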

route metadata and to `fct_tides_vehicle_locations` for canonical
vehicle_id per trip. Filtered upstream to `appeared_in_vp = TRUE` so
every row has a derivable vehicle_id.
meta:
Contributor

same comment I think this should be an exposure

Contributor Author

added a single california_tides exposure on this PR (so both fct_tides_vehicle_locations and fct_tides_trips_performed refs resolve in the same checkout), modeled on the GTFS california_open_data exposure. publish.product meta block dropped from both yml entries. PR 5229 fills in meta.destinations for the public-bucket flow.

chrisyamas and others added 4 commits May 5, 2026 15:39
Adds mart_tides.fct_tides_trips_performed, the second TIDES-conformant
model in the Cal-ITP warehouse. Sources from fct_observed_trips joined
to fct_scheduled_trips for route metadata and to
fct_tides_vehicle_locations for canonical vehicle_id per trip.
Filtered to public, customer-facing or regional-subfeed fixed-route
GTFS feeds via dim_provider_gtfs_data.

Stacked on feat/tides-vehicle-locations. Includes a relationships
test asserting that every vehicle_id resolves to at least one row in
fct_tides_vehicle_locations for the same service_date.

Materialized as a view, in the same mart/tides folder as
fct_tides_vehicle_locations and tagged with the same publish.product
meta key. The downstream consumer is a per-agency Airflow export
running once per cycle, so paying compute on each read is cheaper
than carrying a materialized copy.

Validated against christopher_mart_gtfs sandbox earlier as a table
materialization, 8-day window: 553,084 rows, 0 PK duplicates, 0 NULL
on TIDES required-not-null fields (service_date, trip_id_performed,
vehicle_id), 99 agencies, schedule_relationship enum mapping correct
(Scheduled/Canceled/Added/Duplicated). Re-running as a view in the
new mart_tides dataset is a follow-up before merge.
@chrisyamas chrisyamas force-pushed the feat/tides-trips-performed branch from eef0ca0 to fcf8e48 Compare May 5, 2026 19:39
@github-actions

github-actions Bot commented May 5, 2026

Warehouse report 📦

Checks/potential follow-ups

Checks indicate the following action items may be necessary.

  • For new models, do they all have a surrogate primary key that is tested to be not-null and unique?

New models 🌱

calitp_warehouse.mart.tides.fct_tides_trips_performed

calitp_warehouse.mart.tides.fct_tides_vehicle_locations

DAG

Legend (in order of precedence)

Resource type                                    Indicator     Resolution
Large table-materialized model                   Orange        Make the model incremental
Large model without partitioning or clustering   Orange        Add partitioning and/or clustering
View with more than one child                    Yellow        Materialize as a table or incremental
Incremental                                      Light green
Table                                            Green
View                                             White

@github-actions

github-actions Bot commented May 5, 2026

Impacted Exposures

The following exposures are downstream of models changed in this PR:

Changed models

  • models/mart/tides/fct_tides_trips_performed.sql

If any impacted exposures are unexpected, verify that your changes do not unintentionally affect downstream consumers.



2 participants