OpenTargets tests pin exact live values that drift across data releases (diseases/depmap/interactions/pharmacogenetics) #249

Description

@Elarwei001

Summary

OpenTargets is a live database that is re-released regularly. Several gget opentargets unit tests assert the exact current values returned by the live API, so they break every time OpenTargets ships a new data release — even on pull requests that never touch the opentargets module. gget itself is returning correct, current data in these cases; only the pinned expectations are stale.

This is part of the dev CI failures tracked in the CI context issue #243.

Affected tests and what they pin

  • test_opentargets and test_opentargets_diseases — pin the top associated disease id and score (the tests recorded EFO_0000274 / 0.7297; the live API now returns MONDO_0004980 / 0.728). The ranking and scores change every release.
  • test_opentargets_depmap — pins an md5 hash of the full result; the underlying DepMap rows change between releases.
  • test_opentargets_depmap_filter — pins exact filtered rows for one tissue; the per-tissue screens vary across releases (and can be empty for a given run).
  • test_opentargets_interactions and test_opentargets_interactions_no_limit — pin a specific interaction-partner Ensembl id and a result hash; interaction sets are updated each release.
  • test_opentargets_pharmacogenetics — pins a specific genotype (CC); row order and the surfaced genotype drift across releases.

Why this is a problem

Because the hatch-test job runs the whole suite with no network/flaky marker, these exact-match assertions fail on unrelated PRs and mask real signal. They are testing the upstream data contents, not gget's behaviour.

Proposal

Loosen these tests to structural and invariant assertions that still catch genuine shape/regression breaks but do not pin volatile data. Concretely (listed as plain words to avoid auto-linking):

  • first, assert the expected columns are present and non-empty where a result is guaranteed;
  • second, assert value formats/dtypes rather than exact values — for example a disease id matches an ontology-prefix pattern such as ^(MONDO|EFO|HP|Orphanet|...)_\d+, an association score is a float in the closed interval zero to one, an interaction partner gene id matches ^ENSG\d+, and pharmacogenetics genotypes are nucleotide pairs over the alphabet A/C/G/T;
  • third, for filtered queries, assert the filter invariant (every returned row matches the filter value) and allow an empty result rather than pinning specific rows;
  • fourth, drop the md5 result-hash assertions, which are inherently release-pinned.

This keeps the tests meaningful (they break if gget returns the wrong shape, wrong columns, malformed ids, or empty where data is guaranteed) while surviving normal upstream data drift.
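
The four points above could be sketched roughly as follows. This is a minimal illustration, not the actual test code: it assumes the result has been converted to a list of row dicts (for example via `df.to_dict("records")`), and the helper names and the column names `id`, `score`, and `tissue_name` are hypothetical placeholders. The ontology-prefix list is also only the part spelled out in this issue and would need extending.

```python
import re

# Format patterns taken from the proposal. The prefix list is incomplete
# (the issue elides the rest with "..."), so extend it before real use.
DISEASE_ID = re.compile(r"(MONDO|EFO|HP|Orphanet)_\d+")
ENSEMBL_GENE_ID = re.compile(r"ENSG\d+")
GENOTYPE = re.compile(r"[ACGT]{2}")  # nucleotide pair, e.g. "CC"


def check_columns(rows, required, allow_empty=False):
    """Every row carries the required columns; empty only if allowed."""
    if not rows:
        assert allow_empty, "expected at least one row"
        return
    for row in rows:
        missing = set(required) - set(row)
        assert not missing, f"missing columns: {sorted(missing)}"


def check_diseases(rows):
    """Assert id format and score range instead of exact values."""
    check_columns(rows, {"id", "score"})  # hypothetical column names
    for row in rows:
        assert DISEASE_ID.fullmatch(row["id"]), row["id"]
        assert isinstance(row["score"], float), row["score"]
        assert 0.0 <= row["score"] <= 1.0, row["score"]


def check_filter(rows, column, value):
    """Filter invariant: every returned row matches; empty is fine."""
    check_columns(rows, {column}, allow_empty=True)
    assert all(row[column] == value for row in rows)


# Both the stale pinned value and the current live value from this issue
# satisfy the structural check, which is the point of the change.
check_diseases([{"id": "EFO_0000274", "score": 0.7297}])
check_diseases([{"id": "MONDO_0004980", "score": 0.728}])
check_filter([], "tissue_name", "lung")  # an empty filtered result passes
```

The same pattern would extend to interactions (`ENSEMBL_GENE_ID`) and pharmacogenetics (`GENOTYPE`), with the md5 hash assertions simply removed.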

Notes
