Skip to content

Correct NM tag calculation #1536

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

michaelgatzen
Copy link
Contributor

Fixes #1535

  • Adding enum BaseComparisonMode to toggle different modes of comparing bases
    • MatchExact: bases match when they are equal
    • MatchAmbiguity: bases match when they are equal or the read base can be expressed using an ambiguity code in the reference
    • NMTagMode: bases match when they are equal AND from [AaCcGgTt]. This is compliant to the SAM spec.
  • Modified different versions of base comparison methods accordingly
  • Modified unit tests to catch off-spec behavior

return readBaseMatchesRefBaseWithAmbiguity(readBase, refBase);
}
case NMTagMode:
// TODO Different treatment for bisulfite?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are bisulfite sequences relevant here?

- NM calculation had also been performed in calculateMdAndNmTags in one pass along with calculating the MD tag. The NM tag calculation was now changed to use the calculateSamNmTag method.
- Added integration tests that catch off-spec behavior.
- Implemented correct handling of the "=" base.
@lbergelson
Copy link
Member

@yfarjoun I have a recollection of you having an opinion about this?

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 9, 2021

me too....mostly it was due to being conservative and not liking change, I think....not really a good reason. Perhaps @tfenne has a more substantial opinion?

@michaelgatzen
Copy link
Contributor Author

Just for reference, the reason why this came up is that DRAGEN apparently produces reads with Ns in them, which is the only time that this becomes relevant. This would lead to ValidateSamFile reporting an error on DRAGEN BAMs.

@tfenne
Copy link
Member

tfenne commented Feb 9, 2021

@yfarjoun My opinion is similar - I worry about change. But if the default behavior in tools remains the same, and there aren't backwards compatibility breaks, I don't object.

FWIW a lot of UMI consensus tools generate reads with Ns in them (e.g. fgbio), so this does come up quite a bit in those situations. And there are variant callers that use NM to filter reads, so there are use cases where a chance in this behavior would have a significant change on downstream pipelines.

@michaelgatzen
Copy link
Contributor Author

Ok I see. The methods here are indeed used for tools like MergeBamAlignment, so this will result in downstream change (for reads with ambiguity codes in them). This change would make it more spec compliant, but I understand the hesitation about change.

One other option is to keep it the way it was and be more liberal in SamFileValidator, so that we don't report errors for files that are spec-compliant. What do people think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SAM NM tag calculation is off-spec
4 participants