Skip to content

SAM tag to assign un-projectable reads to a reference position#4948

Open
eizengaj-roche wants to merge 5 commits into
vgteam:masterfrom
eizengaj-roche:surject-off-ref-pos
Open

SAM tag to assign un-projectable reads to a reference position#4948
eizengaj-roche wants to merge 5 commits into
vgteam:masterfrom
eizengaj-roche:surject-off-ref-pos

Conversation

@eizengaj-roche

Copy link
Copy Markdown
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • vg giraffe and vg surject HTSlib output can annotate mapped but off-reference reads with their approximate reference location with --off-ref-position

Description

Reads inside large insertions (or other complicated motifs) can be mapped in graph space but converted into unmapped reads during surject. This PR introduces the NR (nearest reference) tag to the HTSlib output, gated behind an option --off-ref-position, which encodes the reference position that is closest to the reference-orientation-relative start position. This is accomplished via graph traversals. My benchmarking suggests that this adds a negligible amount to the run time, due to the relative rarity of these reads.

The format of the NR tag is a structured string that looks like this: NR:Z:chr1:1500-,>100, which is interpreted as:

  • The nearest reference position is chr1:1500 (1-based).
  • The minimum distance to the reference position was 100 bp, walking forward (>) from the read start position.
  • The walk that achieved this distance encountered the reference in the reverse orienation (-).

The reads with the NR tag still remain unmapped, as indicated by the mapping position SAM columns and the mapped bitwise flag. However, one debatable design decision I made is that these BAM records now retain their graph-space MAPQ, rather than clamping it to 0 as previously. My read of the SAM format specification is that this is not prohibited, and specifically

Bit 0x4 [of the bitwise flag] is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.

That said, it's possible that downstream tools might not appreciate this somewhat non-standard combination of data. Another possibility would be to encode the MAPQ into the structured string. I'm open to either option, personally.

@adamnovak adamnovak left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We previously had to deal with MAPQ not being zeroed in unmapped reads, which we fixed in #4600. It didn't seem to cause too much trouble except that we weren't expecting it downstream.

But if, per the spec, no assumptions can be made about the values of those fields for unmapped reads, can we assume that their values are always preserved by things like samtools sort? It's fine to let vg fill it in, but if you actually intend to interchange the records between tools I think the MAPQ would be safer in a tag.

It also makes more sense to put the MAPQ in a tag if we really intend these to be "unmapped" reads, and not "reads that we mapped but we can't tell you the position of in BAM".

I think really semantically these are mapped reads, though? Like, you ought to be able to have a primary record for a read that uses an NR tag, and a secondary for the same read that's on the reference.

Comment thread src/hts_alignment_emitter.hpp Outdated
/// When surjecting, produce and report supplementary alignments
ALIGNMENT_EMITTER_FLAG_HTS_SUPPLEMENTARY = 32,
/// When surjecting, annote off-reference reads with their nearest reference position
ALIGNMENT_EMITTER_FLAG_OFF_REF_POSITION = 64,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this maybe should be _HTS_ since it only affects HTS formats.

Comment thread src/surjector.hpp
/// the final two fields in return value indicate 1) the distance traversed to reach the NR position from
/// the end of the alignment that would be further in the direction of low coordinates of the oriented
/// path, and 2) true if the traversal direction was in the direction of low coordinates, else false
tuple<string, int64_t, bool, size_t, bool> nearest_ref_pos(const Alignment& source, const PathPositionHandleGraph* graph,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have an oriented position on a string contig that we could use here instead of the first 3 elements?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the suggestion here a Protobuf Position object?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that doesn't sound right.

We have Region:

vg/src/region.hpp

Lines 15 to 19 in 187cc8d

struct Region {
string seq;
int64_t start = -1;
int64_t end = -1;
};

But that's a region and not an oriented location.

I think there just isn't a better type to use right now, but the more we work with paths the more we probably want one in the future.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, yeah. My read is that this a future development though, unless you feel like the lack of a more semantically explicit type is a merge blocking issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants