Deduplicate Enghouse ticket_results #5144

Open
stevenschrayer wants to merge 3 commits into main from
4964-deduplicate-enghouse-ticket_results

Conversation

@stevenschrayer
Contributor

@stevenschrayer stevenschrayer commented Apr 23, 2026

Description

Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.

Resolves #4964

Modifies the deduplication step to order by non-null datetime columns and keep the first row. Observations on the ticket indicated that duplicate records had a mix of non-null and null datetime values, while all other columns were equivalent.

This broadly fixes the duplicate data issue, though I did find two pairs of records that survive, as noted in the testing block.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

Include commands/logs/screenshots as relevant.

If making changes to dbt models, make sure they were created or updated on Staging. Please run the commands uv run dbt run -s CHANGED_MODEL --target staging and uv run dbt test -s CHANGED_MODEL --target staging, then include the output in this section of the PR.

I had to run this test query against prod because I didn't have duplicate records available in staging.
-- Sanity check: current row count vs. post-fix row count, and the delta (rows dropped by dedup).
WITH pre_fix AS (
  SELECT *
  FROM `cal-itp-data-infra-staging.staging.stg_enghouse__ticket_results`
),

ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY
        operator_id, id, ticket_id, station_name, amount, reason, tap_id,
        ticket_type, line, start_station, end_station, ticket_code, additional_infos
      ORDER BY
        CASE WHEN start_dttm   IS NOT NULL THEN 0 ELSE 1 END ASC,
        CASE WHEN end_dttm     IS NOT NULL THEN 0 ELSE 1 END ASC,
        CASE WHEN created_dttm IS NOT NULL THEN 0 ELSE 1 END ASC
    ) AS row_num
  FROM pre_fix
),

post_fix AS (
  SELECT * FROM ranked WHERE row_num = 1
)

SELECT
  (SELECT COUNT(*) FROM pre_fix)  AS pre_fix_rows,
  (SELECT COUNT(*) FROM post_fix) AS post_fix_rows,
  (SELECT COUNT(*) FROM pre_fix) - (SELECT COUNT(*) FROM post_fix) AS rows_dropped;
Row | pre_fix_rows | post_fix_rows | rows_dropped
1   | 9016         | 8979         | 37
> uv run dbt run --select stg_enghouse__ticket_results
17:34:27  Found 628 models, 178 data tests, 16 seeds, 227 sources, 4 exposures, 1089 macros
17:34:27  
17:34:27  Concurrency: 8 threads (target='staging')
17:34:27  
17:34:35  1 of 1 START sql view model staging.stg_enghouse__ticket_results ............... [RUN]
17:34:38  1 of 1 OK created sql view model staging.stg_enghouse__ticket_results .......... [CREATE VIEW (0 processed) in 2.38s]
17:34:38  
17:34:38  Finished running 1 view model in 0 hours 0 minutes and 10.58 seconds (10.58s).
17:34:50  
17:34:50  Completed successfully
17:34:50  
17:34:50  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
Separately, I tested the deduplication itself and found that there are two sets of 2 records each that survive deduplication. Each pair has identical tap_ids, but very different other values.
WITH source AS (
    SELECT * FROM `cal-itp-data-infra-staging`.`external_enghouse`.`ticket_results`
),

clean_columns AS (
    SELECT
        CASE WHEN TRIM(Operator_Id) = "" THEN NULL ELSE TRIM(Operator_Id) END AS operator_id,
        CASE WHEN TRIM(id) = "" THEN NULL ELSE TRIM(id) END AS id,
        CASE WHEN TRIM(ticket_id) = "" THEN NULL ELSE TRIM(ticket_id) END AS ticket_id,
        CASE WHEN TRIM(station_name) = "" THEN NULL ELSE TRIM(station_name) END AS station_name,
        ROUND(SAFE_CAST(amount AS NUMERIC) / 100.0, 2) AS amount,
        CASE WHEN TRIM(clearing_id) = "" THEN NULL ELSE TRIM(clearing_id) END AS clearing_id,
        CASE WHEN TRIM(reason) = "" THEN NULL ELSE TRIM(reason) END AS reason,
        CASE WHEN TRIM(tap_id) = "" THEN NULL ELSE TRIM(tap_id) END AS tap_id,
        CASE WHEN TRIM(ticket_type) = "" THEN NULL ELSE TRIM(ticket_type) END AS ticket_type,
        SAFE_CAST(created_dttm AS TIMESTAMP) AS created_dttm,
        CASE WHEN TRIM(line) = "" THEN NULL ELSE TRIM(line) END AS line,
        CASE WHEN TRIM(start_station) = "" THEN NULL ELSE TRIM(start_station) END AS start_station,
        CASE WHEN TRIM(end_station) = "" THEN NULL ELSE TRIM(end_station) END AS end_station,
        SAFE_CAST(start_dttm AS TIMESTAMP) AS start_dttm,
        SAFE_CAST(end_dttm AS TIMESTAMP) AS end_dttm,
        CASE WHEN TRIM(ticket_code) = "" THEN NULL ELSE TRIM(ticket_code) END AS ticket_code,
        CASE WHEN TRIM(additional_infos) = "" THEN NULL ELSE TRIM(additional_infos) END AS additional_infos,
        to_hex(md5(cast(
            coalesce(cast(operator_id as string), '') || '-' ||
            coalesce(cast(id as string), '') || '-' ||
            coalesce(cast(ticket_id as string), '') || '-' ||
            coalesce(cast(station_name as string), '') || '-' ||
            coalesce(cast(amount as string), '') || '-' ||
            coalesce(cast(reason as string), '') || '-' ||
            coalesce(cast(tap_id as string), '') || '-' ||
            coalesce(cast(ticket_type as string), '') || '-' ||
            coalesce(cast(line as string), '') || '-' ||
            coalesce(cast(start_station as string), '') || '-' ||
            coalesce(cast(end_station as string), '') || '-' ||
            coalesce(cast(ticket_code as string), '') || '-' ||
            coalesce(cast(additional_infos as string), '')
        as string))) AS _content_hash
    FROM source
),

deduplicated AS (
    SELECT * FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY _content_hash
                ORDER BY
                    CASE WHEN start_dttm IS NOT NULL THEN 0 ELSE 1 END ASC,
                    CASE WHEN end_dttm IS NOT NULL THEN 0 ELSE 1 END ASC,
                    CASE WHEN created_dttm IS NOT NULL THEN 0 ELSE 1 END ASC
            ) AS row_num
        FROM clean_columns
    )
    WHERE row_num = 1
),

stg_enghouse__ticket_results AS (
    SELECT
        operator_id,
        id,
        ticket_id,
        station_name,
        amount,
        clearing_id,
        reason,
        tap_id,
        ticket_type,
        created_dttm,
        line,
        start_station,
        end_station,
        start_dttm,
        end_dttm,
        ticket_code,
        additional_infos,
        _content_hash
    FROM deduplicated
)

SELECT COUNT(*) FROM stg_enghouse__ticket_results
HAVING COUNT(*) > 1
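
For reference, a hypothetical follow-up check (not part of the PR) that would surface the two surviving pairs, since each pair shares a tap_id but differs in other columns. It assumes the staged model's column names and that tap_id repeats only for these pairs:

```sql
-- Hypothetical check: list tap_ids that still appear more than once
-- after dedup, i.e. rows that share a tap_id but hash differently.
SELECT
    tap_id,
    COUNT(*) AS n_rows
FROM stg_enghouse__ticket_results
GROUP BY tap_id
HAVING COUNT(*) > 1
```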
> uv run dbt test -s stg_enghouse__ticket_results+
13:51:16  Running with dbt=1.10.8
13:51:19  Registered adapter: bigquery=1.10.2
13:51:23  Found 628 models, 178 data tests, 16 seeds, 227 sources, 4 exposures, 1091 macros
13:51:23  
13:51:23  Concurrency: 8 threads (target='staging')
13:51:23  
13:51:27  1 of 1 START test dbt_utils_expression_is_true_v2_payments_reliability_weekly_unlabeled_routes_enghouse_n_all_rides_total_unlabeled_rides  [RUN]
13:51:28  1 of 1 PASS dbt_utils_expression_is_true_v2_payments_reliability_weekly_unlabeled_routes_enghouse_n_all_rides_total_unlabeled_rides  [PASS in 1.47s]
13:51:28  
13:51:28  Finished running 1 test in 0 hours 0 minutes and 5.35 seconds (5.35s).
13:51:29  
13:51:29  Completed successfully
13:51:29  
13:51:29  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

  • No action required
  • Actions required (specified below)

@stevenschrayer stevenschrayer linked an issue Apr 23, 2026 that may be closed by this pull request
@github-actions

github-actions Bot commented Apr 23, 2026

Warehouse report: Failed to add ci-report to a comment. Review the ci-report in the Summary.

@github-actions

github-actions Bot commented Apr 23, 2026

Terraform plan in iac/cal-itp-data-infra-staging/airflow/us

Plan: 1 to add, 9 to change, 1 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place
-   destroy

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-staging-composer["dags/airtable_issue_management.py"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer" {
!~      crc32c              = "F3GRiw==" -> (known after apply)
!~      detect_md5hash      = "jfUfWHhCEBMLm+jeORHn7w==" -> "different hash"
!~      generation          = 1777081402145446 -> (known after apply)
        id                  = "calitp-staging-composer-dags/airtable_issue_management.py"
!~      md5hash             = "jfUfWHhCEBMLm+jeORHn7w==" -> (known after apply)
        name                = "dags/airtable_issue_management.py"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer["plugins/operators/airtable_issues_email_operator.py"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer" {
!~      crc32c              = "teQqow==" -> (known after apply)
!~      detect_md5hash      = "eopXX15B6gJXv314s81Xgg==" -> "different hash"
!~      generation          = 1777081402147724 -> (known after apply)
        id                  = "calitp-staging-composer-plugins/operators/airtable_issues_email_operator.py"
!~      md5hash             = "eopXX15B6gJXv314s81Xgg==" -> (known after apply)
        name                = "plugins/operators/airtable_issues_email_operator.py"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer["plugins/operators/airtable_issues_update_operator.py"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer" {
!~      crc32c              = "IAu8XA==" -> (known after apply)
!~      detect_md5hash      = "qBYQP0FDh4xB1EW6MMOqaw==" -> "different hash"
!~      generation          = 1777081402143462 -> (known after apply)
        id                  = "calitp-staging-composer-plugins/operators/airtable_issues_update_operator.py"
!~      md5hash             = "qBYQP0FDh4xB1EW6MMOqaw==" -> (known after apply)
        name                = "plugins/operators/airtable_issues_update_operator.py"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-catalog will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-catalog" {
!~      content             = (sensitive value)
!~      crc32c              = "PCZSkg==" -> (known after apply)
!~      detect_md5hash      = "emi1LB4jwlju+91Lb2/ikw==" -> "different hash"
!~      generation          = 1777081402704275 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/catalog.json"
!~      md5hash             = "emi1LB4jwlju+91Lb2/ikw==" -> (known after apply)
        name                = "data/warehouse/target/catalog.json"
#        (16 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/intermediate/payments/_int_payments.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "AkMXcg==" -> (known after apply)
!~      detect_md5hash      = "v/JBoD58WEdM/XZscf2Ufw==" -> "different hash"
!~      generation          = 1776453636833756 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/models/intermediate/payments/_int_payments.yml"
!~      md5hash             = "v/JBoD58WEdM/XZscf2Ufw==" -> (known after apply)
        name                = "data/warehouse/models/intermediate/payments/_int_payments.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/intermediate/payments/int_payments__enghouse_ticket_results_deduped.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
+       bucket         = "calitp-staging-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/intermediate/payments/int_payments__enghouse_ticket_results_deduped.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/intermediate/payments/int_payments__enghouse_ticket_results_deduped.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/payments/fct_payments_rides_enghouse.sql"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "TnAu0g==" -> (known after apply)
!~      detect_md5hash      = "xBUm/8jgAixqjWGlHPa0ow==" -> "different hash"
!~      generation          = 1776796010157437 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/models/mart/payments/fct_payments_rides_enghouse.sql"
!~      md5hash             = "xBUm/8jgAixqjWGlHPa0ow==" -> (known after apply)
        name                = "data/warehouse/models/mart/payments/fct_payments_rides_enghouse.sql"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/transit_database/fct_close_expired_issues.sql"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "jurzxA==" -> (known after apply)
!~      detect_md5hash      = "FNZIWVUFTxMYmmQO3b4Yzw==" -> "different hash"
!~      generation          = 1777081402132841 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/models/mart/transit_database/fct_close_expired_issues.sql"
!~      md5hash             = "FNZIWVUFTxMYmmQO3b4Yzw==" -> (known after apply)
        name                = "data/warehouse/models/mart/transit_database/fct_close_expired_issues.sql"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/mart/transit_database/fct_close_rt_completeness_issues.sql"] will be destroyed
  # (because key ["models/mart/transit_database/fct_close_rt_completeness_issues.sql"] is not in for_each map)
-   resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
-       bucket              = "calitp-staging-composer" -> null
-       content_type        = "text/plain; charset=utf-8" -> null
-       crc32c              = "hlte0g==" -> null
-       detect_md5hash      = "qh2Ssdf2gAuvEPszrY1Ihg==" -> null
-       event_based_hold    = false -> null
-       generation          = 1777081402144466 -> null
-       id                  = "calitp-staging-composer-data/warehouse/models/mart/transit_database/fct_close_rt_completeness_issues.sql" -> null
-       md5hash             = "qh2Ssdf2gAuvEPszrY1Ihg==" -> null
-       md5hexhash          = "aa1d92b1d7f6800baf10fb33ad8d4886" -> null
-       media_link          = "https://storage.googleapis.com/download/storage/v1/b/calitp-staging-composer/o/data%2Fwarehouse%2Fmodels%2Fmart%2Ftransit_database%2Ffct_close_rt_completeness_issues.sql?generation=1777081402144466&alt=media" -> null
-       metadata            = {} -> null
-       name                = "data/warehouse/models/mart/transit_database/fct_close_rt_completeness_issues.sql" -> null
-       output_name         = "data/warehouse/models/mart/transit_database/fct_close_rt_completeness_issues.sql" -> null
-       self_link           = "https://www.googleapis.com/storage/v1/b/calitp-staging-composer/o/data%2Fwarehouse%2Fmodels%2Fmart%2Ftransit_database%2Ffct_close_rt_completeness_issues.sql" -> null
-       source              = "../../../../warehouse/models/mart/transit_database/fct_close_rt_completeness_issues.sql" -> null
-       storage_class       = "STANDARD" -> null
-       temporary_hold      = false -> null
#        (6 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-dags["models/staging/payments/enghouse/stg_enghouse__ticket_results.sql"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-dags" {
!~      crc32c              = "pSICtA==" -> (known after apply)
!~      detect_md5hash      = "DuHPTMzsaFO6KUjZCoxkvw==" -> "different hash"
!~      generation          = 1769734706517996 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/models/staging/payments/enghouse/stg_enghouse__ticket_results.sql"
!~      md5hash             = "DuHPTMzsaFO6KUjZCoxkvw==" -> (known after apply)
        name                = "data/warehouse/models/staging/payments/enghouse/stg_enghouse__ticket_results.sql"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-manifest will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-manifest" {
!~      content             = (sensitive value)
!~      crc32c              = "pXu3nA==" -> (known after apply)
!~      detect_md5hash      = "u82/0ZJPyawniK5Kr112gQ==" -> "different hash"
!~      generation          = 1777081404101058 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/manifest.json"
!~      md5hash             = "u82/0ZJPyawniK5Kr112gQ==" -> (known after apply)
        name                = "data/warehouse/target/manifest.json"
#        (16 unchanged attributes hidden)
    }

Plan: 1 to add, 9 to change, 1 to destroy.

📝 Plan generated in Deploy dbt #1735

@github-actions

github-actions Bot commented Apr 23, 2026

Terraform plan in iac/cal-itp-data-infra/airflow/us

Plan: 1 to add, 3 to change, 0 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-composer-dags["models/intermediate/payments/_int_payments.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "AkMXcg==" -> (known after apply)
!~      detect_md5hash      = "v/JBoD58WEdM/XZscf2Ufw==" -> "different hash"
!~      generation          = 1776457910933366 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/models/intermediate/payments/_int_payments.yml"
!~      md5hash             = "v/JBoD58WEdM/XZscf2Ufw==" -> (known after apply)
        name                = "data/warehouse/models/intermediate/payments/_int_payments.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/intermediate/payments/int_payments__enghouse_ticket_results_deduped.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/intermediate/payments/int_payments__enghouse_ticket_results_deduped.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/intermediate/payments/int_payments__enghouse_ticket_results_deduped.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/payments/fct_payments_rides_enghouse.sql"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "TnAu0g==" -> (known after apply)
!~      detect_md5hash      = "xBUm/8jgAixqjWGlHPa0ow==" -> "different hash"
!~      generation          = 1776796001653822 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/models/mart/payments/fct_payments_rides_enghouse.sql"
!~      md5hash             = "xBUm/8jgAixqjWGlHPa0ow==" -> (known after apply)
        name                = "data/warehouse/models/mart/payments/fct_payments_rides_enghouse.sql"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/staging/payments/enghouse/stg_enghouse__ticket_results.sql"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "pSICtA==" -> (known after apply)
!~      detect_md5hash      = "DuHPTMzsaFO6KUjZCoxkvw==" -> "different hash"
!~      generation          = 1769734710416800 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/models/staging/payments/enghouse/stg_enghouse__ticket_results.sql"
!~      md5hash             = "DuHPTMzsaFO6KUjZCoxkvw==" -> (known after apply)
        name                = "data/warehouse/models/staging/payments/enghouse/stg_enghouse__ticket_results.sql"
#        (17 unchanged attributes hidden)
    }

Plan: 1 to add, 3 to change, 0 to destroy.

📝 Plan generated in Deploy dbt #1735

@stevenschrayer stevenschrayer force-pushed the 4964-deduplicate-enghouse-ticket_results branch from ffdaac2 to 36126e1 on April 23, 2026 17:50
@github-actions

github-actions Bot commented Apr 23, 2026

Warehouse report 📦

Checks/potential follow-ups

Checks indicate the following action items may be necessary.

  • For new models, do they all have a surrogate primary key that is tested to be not-null and unique?

New models 🌱

calitp_warehouse.intermediate.payments.int_payments__enghouse_ticket_results_deduped

DAG

Legend (in order of precedence)

Resource type                                  | Indicator   | Resolution
Large table-materialized model                 | Orange      | Make the model incremental
Large model without partitioning or clustering | Orange      | Add partitioning and/or clustering
View with more than one child                  | Yellow      | Materialize as a table or incremental
Incremental                                    | Light green |
Table                                          | Green       |
View                                           | White       |

@stevenschrayer stevenschrayer changed the title from "chore: update deduplication based on null datetime" to "Deduplicate Enghouse tap data" on Apr 24, 2026
@stevenschrayer stevenschrayer force-pushed the 4964-deduplicate-enghouse-ticket_results branch from 36126e1 to 59ec096 on April 24, 2026 13:25
@stevenschrayer stevenschrayer marked this pull request as ready for review April 24, 2026 13:53
@stevenschrayer stevenschrayer changed the title from "Deduplicate Enghouse tap data" to "Deduplicate Enghouse ticket_results" on Apr 24, 2026
Contributor

@lauriemerrell lauriemerrell left a comment


Sorry..... I am being indecisive.........

SELECT
*,
ROW_NUMBER() OVER (PARTITION BY _content_hash ORDER BY (SELECT NULL)) AS row_num
ROW_NUMBER() OVER (
Contributor

@lauriemerrell lauriemerrell Apr 24, 2026


I am a little conflicted about this logic, thinking out loud.

One definite note: In the data so far it doesn't look like there are any rows where start_dttm isn't populated but end_dttm is (in fact looks like end_dttm might always be null....?), so I don't think we need to include it in the logic here.

My general question though is whether to use this logic (which is ordering preference based on which field is populated) or to use logic based on ordering by the timestamps themselves, probably the most recent row?

And now that I'm saying this I'm wondering if we actually want to combine the rows rather than drop one....

Ok.... I think I have talked myself into: Let's make an intermediate model where we group by _content_hash and combine these rows and just take the earliest start_dttm and created_dttm for the rest of the content? If that makes sense. So don't add this logic here, instead let the dups stay in staging and add another step where we dedupe by combining the rows to keep as much information as possible?

Contributor Author


Split into an int_ model and updated the model that referenced it

SELECT * FROM (
SELECT
*,
ROW_NUMBER() OVER (
Contributor


Ok sorry I think my last comment was confusing but -- rather than doing it this way, can we actually group by content hash and take the min of these columns (check how that handles nulls) so that we populate as many as possible, and update the docs to note that these columns might be a combination of input rows?

Rather than just keeping one dttm value

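As a sketch of the combine-rows idea described above (hypothetical, not the merged implementation; column names assumed from the staged model, and the non-datetime columns are identical within a _content_hash group by construction):

```sql
-- Sketch, assuming BigQuery semantics: MIN() ignores NULLs, so a NULL
-- start_dttm on one duplicate is backfilled from the other row; it stays
-- NULL only when every row in the group is NULL.
SELECT
    _content_hash,
    ANY_VALUE(operator_id) AS operator_id,
    ANY_VALUE(tap_id)      AS tap_id,
    -- ... remaining hashed columns via ANY_VALUE ...
    MIN(start_dttm)        AS start_dttm,
    MIN(end_dttm)          AS end_dttm,
    MIN(created_dttm)      AS created_dttm
FROM stg_enghouse__ticket_results
GROUP BY _content_hash
```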
@github-actions

Impacted Exposures

No exposures are impacted by the changes in this PR.

Changed models

  • models/intermediate/payments/int_payments__enghouse_ticket_results_deduped.sql
  • models/mart/payments/fct_payments_rides_enghouse.sql
  • models/staging/payments/enghouse/stg_enghouse__ticket_results.sql

If any impacted exposures are unexpected, verify that your changes do not unintentionally affect downstream consumers.


Development

Successfully merging this pull request may close these issues.

Deduplicate Enghouse ticket_results
