Dup names #4950

Open
faithokamoto wants to merge 16 commits into master from dup-names

@faithokamoto

Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • Descriptive error when creating a graph from FASTAs with duplicate names

Description

Resolves #547, in a way. Instead of causing non-obvious problems when a sequence name is duplicated, we now give a big, beautiful, obvious error. The old behavior was to use the last of the duplicately named sequences. The new behavior is to tell you that there are duplicate names. Test cases are also added.
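To illustrate the behavior change described above, here is a minimal Python sketch. It is not vg's actual implementation: `load_sequences` and its `strict` flag are hypothetical names made up for this example, and only the error text is borrowed from the message vg prints later in this thread.

```python
from typing import Dict, Iterable


def load_sequences(fasta_lines: Iterable[str], strict: bool = True) -> Dict[str, str]:
    """Collect FASTA records into a name -> sequence mapping.

    strict=False mimics the old behavior (the last duplicate wins).
    strict=True mimics the new behavior (duplicates are an error).
    """
    sequences: Dict[str, str] = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            # The name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if strict and name in sequences:
                raise ValueError(f'Sequence "{name}" appears multiple times')
            # In non-strict mode this silently discards the earlier record.
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


# Hypothetical input with two records that are both named "x".
records = [">x", "ACGT", ">x", "GGCC"]

old_result = load_sequences(records, strict=False)

try:
    load_sequences(records, strict=True)
    error_message = None
except ValueError as err:
    error_message = str(err)
```

With the non-strict path, the first `x` record is silently lost, which is the kind of non-obvious problem the new error is meant to surface.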

@faithokamoto

Contributor Author

Great, it can't find my commit (to be fair, I am also confused about where the commit lives).

@faithokamoto

Contributor Author

This is failing now because it needs some fixes that I put in #4945.

@adamnovak mentioned this pull request on Jun 30, 2026

@adamnovak

Member

OK, here's the problem:

[anovak1@mustard test]$ rm -f xx.fa xx.fa.fai
sed "s/y/x/" small/xy.fa > xx.fa
vg construct -r xx.fa > /dev/null
index file xx.fa.fai not found, generating...
[anovak1@mustard test]$ echo $?
0
[anovak1@mustard test]$ vg construct -r xx.fa > /dev/null
Sequence "x" appears multiple times
[anovak1@mustard test]$ echo $?
1

The check for duplicate entries in the FASTA is in the index load code. If you run the test more than once locally without deleting the index, the FASTA is already indexed and the index load code detects the duplication. If you run the test in a clean repo (like on the Mac CI), the index gets built and we don't actually go through the index load code, so the duplicate check never happens.
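One way to make the check independent of whether an index already exists is to scan the FASTA headers directly. The following is a minimal Python sketch of that idea, not the code in vg or vcflib; `find_duplicate_names` is a hypothetical helper, and the sample input only imitates what the `sed` command above produces.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_names(fasta_lines: Iterable[str]) -> List[str]:
    """Return sequence names that occur more than once, in first-seen order."""
    counts: Counter = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # The name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            counts[name] += 1
    # Counter keeps insertion order, so names come out in first-seen order.
    return [name for name, n in counts.items() if n > 1]


# Hypothetical contents resembling xx.fa, where "y" was renamed to "x".
example = [">x", "ACGT", ">x", "GGCC"]
duplicates = find_duplicate_names(example)
```

Because this reads the headers themselves, it gives the same answer on a first run in a clean checkout and on later runs where the `.fai` file already exists.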

@adamnovak

Member

This now uses vgteam/vcflib#29

@adamnovak

Member

If someone indexes a FASTA with duplicate records with samtools, which warns and drops duplicates from the index, then vg won't be able to detect that there are duplicates in the file when we open it, because we'll just query the index.

That might be good enough.
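For illustration, the blind spot described above can be sketched by comparing the names in the FASTA headers with the names in the first tab-separated column of the `.fai` index. This is a hedged Python sketch under assumptions: the function names are hypothetical, and the sample index line simply assumes the index kept one entry for `x`, as the comment above says samtools does.

```python
from collections import Counter
from typing import Iterable, List


def header_names(fasta_lines: Iterable[str]) -> List[str]:
    """Names taken from '>' header lines (first whitespace-delimited token)."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def index_names(fai_lines: Iterable[str]) -> List[str]:
    """Names taken from the first tab-separated column of a .fai index."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


def duplicates_hidden_by_index(fasta_lines: Iterable[str],
                               fai_lines: Iterable[str]) -> List[str]:
    """Names duplicated in the FASTA but listed at most once in the index."""
    in_fasta = Counter(header_names(fasta_lines))
    in_index = Counter(index_names(fai_lines))
    return [name for name, n in in_fasta.items()
            if n > 1 and in_index[name] <= 1]


# Hypothetical FASTA with two "x" records, and an index that kept only one.
fasta = [">x", "ACGT", ">x", "GGCC"]
fai = ["x\t4\t3\t4\t5"]  # name, length, offset, bases per line, bytes per line
hidden = duplicates_hidden_by_index(fasta, fai)
```

A tool that only queries such an index sees a single `x` and has no way to notice the second record without rereading the FASTA itself.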

Development

Successfully merging this pull request may close these issues.

vg construct from fasta path name selection could be better.
