SAM tag to assign un-projectable reads to a reference position#4948
SAM tag to assign un-projectable reads to a reference position#4948eizengaj-roche wants to merge 5 commits into
Conversation
adamnovak
left a comment
There was a problem hiding this comment.
We previously had to deal with MAPQ not being zeroed in unmapped reads, which we fixed in #4600. It didn't seem to cause too much trouble except that we weren't expecting it downstream.
But if, per the spec, no assumptions can be made about the values of those fields for unmapped reads, can we assume that their values are always preserved by things like samtools sort? It's fine to let vg fill it in, but if you actually intend to interchange the records between tools I think the MAPQ would be safer in a tag.
It also makes more sense to put the MAPQ in a tag if we really intend these to be "unmapped" reads, and not "reads that we mapped but we can't tell you the position of in BAM".
I think really semantically these are mapped reads, though? Like, you ought to be able to have a primary record for a read that uses an NR tag, and a secondary for the same read that's on the reference.
| /// When surjecting, produce and report supplementary alignments | ||
| ALIGNMENT_EMITTER_FLAG_HTS_SUPPLEMENTARY = 32, | ||
| /// When surjecting, annote off-reference reads with their nearest reference position | ||
| ALIGNMENT_EMITTER_FLAG_OFF_REF_POSITION = 64, |
There was a problem hiding this comment.
I think this maybe should be _HTS_ since it only affects HTS formats.
| /// the final two fields in return value indicate 1) the distance traversed to reach the NR position from | ||
| /// the end of the alignment that would be further in the direction of low coordinates of the oriented | ||
| /// path, and 2) true if the traversal direction was in the direction of low coordinates, else false | ||
| tuple<string, int64_t, bool, size_t, bool> nearest_ref_pos(const Alignment& source, const PathPositionHandleGraph* graph, |
There was a problem hiding this comment.
We don't have an oriented position on a string contig that we could use here instead of the first 3 elements?
There was a problem hiding this comment.
Is the suggestion here a Protobuf Position object?
There was a problem hiding this comment.
No, that doesn't sound right.
We have Region:
Lines 15 to 19 in 187cc8d
But that's a region and not an oriented location.
I think there just isn't a better type to use right now, but the more we work with paths the more we probably want one in the future.
There was a problem hiding this comment.
Probably, yeah. My read is that this a future development though, unless you feel like the lack of a more semantically explicit type is a merge blocking issue.
Changelog Entry
To be copied to the draft changelog by merger:
vg giraffeandvg surjectHTSlib output can annotate mapped but off-reference reads with their approximate reference location with--off-ref-positionDescription
Reads inside large insertions (or other complicated motifs) can be mapped in graph space but converted into unmapped reads during
surject. This PR introduces the NR (nearest reference) tag to the HTSlib output, gated behind an option--off-ref-position, which encodes the reference position that is closest to the reference-orientation-relative start position. This is accomplished via graph traversals. My benchmarking suggests that this adds a negligible amount to the run time, due to the relative rarity of these reads.The format of the NR tag is a structured string that looks like this:
NR:Z:chr1:1500-,>100, which is interpreted as:chr1:1500(1-based).100bp, walking forward (>) from the read start position.-).The reads with the NR tag still remain unmapped, as indicated by the mapping position SAM columns and the mapped bitwise flag. However, one debatable design decision I made is that these BAM records now retain their graph-space MAPQ, rather than clamping it to 0 as previously. My read of the SAM format specification is that this is not prohibited, and specifically
That said, it's possible that downstream tools might not appreciate this somewhat non-standard combination of data. Another possibility would be to encode the MAPQ into the structured string. I'm open to either option, personally.