Correct NM tag calculation #1536

michaelgatzen · 2021-02-08T21:05:02Z

Adding enum BaseComparisonMode to toggle different modes of comparing bases
- MatchExact: bases match when they are equal
- MatchAmbiguity: bases match when they are equal or the read base can be expressed using an ambiguity code in the reference
- NMTagMode: bases match when they are equal AND from [AaCcGgTt]. This is compliant with the SAM spec.
Modified different versions of base comparison methods accordingly
Modified unit tests to catch off-spec behavior
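
For readers skimming the thread, here is a minimal sketch of how the three modes listed above could behave. It is not the code from this PR: the class name `BaseComparisonSketch`, the helpers `isAcgt` and `expand`, and the `basesMatch` signature are hypothetical, and two details are assumptions made for illustration (comparison is case-insensitive in every mode, and "can be expressed using an ambiguity code in the reference" is read as "every base the read code stands for is covered by the reference code").

```java
public class BaseComparisonSketch {

    /** Mode names taken from the PR description; the logic below is illustrative only. */
    public enum BaseComparisonMode { MatchExact, MatchAmbiguity, NMTagMode }

    /** True for A, C, G, T in either case. */
    public static boolean isAcgt(byte base) {
        switch (Character.toUpperCase((char) base)) {
            case 'A': case 'C': case 'G': case 'T': return true;
            default: return false;
        }
    }

    /** Expands an IUPAC nucleotide code to the plain bases it stands for ("" if unknown). */
    public static String expand(byte code) {
        switch (Character.toUpperCase((char) code)) {
            case 'A': return "A";
            case 'C': return "C";
            case 'G': return "G";
            case 'T': return "T";
            case 'R': return "AG";
            case 'Y': return "CT";
            case 'S': return "CG";
            case 'W': return "AT";
            case 'K': return "GT";
            case 'M': return "AC";
            case 'B': return "CGT";
            case 'D': return "AGT";
            case 'H': return "ACT";
            case 'V': return "ACG";
            case 'N': return "ACGT";
            default:  return "";
        }
    }

    public static boolean basesMatch(byte readBase, byte refBase, BaseComparisonMode mode) {
        // Assumption: equality ignores case in all three modes.
        boolean equal = Character.toUpperCase((char) readBase) == Character.toUpperCase((char) refBase);
        switch (mode) {
            case MatchExact:
                return equal;
            case MatchAmbiguity: {
                if (equal) return true;
                String readSet = expand(readBase);
                String refSet = expand(refBase);
                if (readSet.isEmpty()) return false;
                // Assumption: the read code must be fully covered by the reference code.
                for (int i = 0; i < readSet.length(); i++) {
                    if (refSet.indexOf(readSet.charAt(i)) < 0) return false;
                }
                return true;
            }
            case NMTagMode:
                // Equal AND one of [AaCcGgTt]; N vs N is therefore not a match.
                return equal && isAcgt(readBase);
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

Under these assumptions, an N aligned to an N matches in MatchExact and MatchAmbiguity but not in NMTagMode, which is the case the SAM spec definition of NM cares about.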

michaelgatzen · 2021-02-08T21:06:33Z

src/main/java/htsjdk/samtools/util/SequenceUtil.java

+                    return readBaseMatchesRefBaseWithAmbiguity(readBase, refBase);
+                }
+            case NMTagMode:
+                // TODO Different treatment for bisulfite?


Are bisulfite sequences relevant here?

- The NM calculation was previously also performed in calculateMdAndNmTags, in one pass together with the MD tag calculation. It now uses the calculateSamNmTag method.
- Added integration tests that catch off-spec behavior.
- Implemented correct handling of the "=" base.
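
To make the effect on the NM value concrete, here is a small sketch that counts differences two ways. It is not calculateSamNmTag: the class `NmWithNsSketch` and the method `countDifferences` are hypothetical, the alignment is assumed to be ungapped (so inserted and deleted bases, which also count toward NM, are left out), and the "=" base is not handled.

```java
public class NmWithNsSketch {

    /** True for A, C, G, T (input is expected to be upper case already). */
    private static boolean isAcgt(char base) {
        return base == 'A' || base == 'C' || base == 'G' || base == 'T';
    }

    /**
     * Counts mismatching positions between a read and the reference segment it aligns to.
     * Assumption: ungapped alignment, so both strings must have the same length.
     *
     * @param specCompliant if true, a position only matches when the bases are equal
     *                      AND are A, C, G or T; if false, plain equality is enough.
     */
    public static int countDifferences(String read, String ref, boolean specCompliant) {
        if (read.length() != ref.length()) {
            throw new IllegalArgumentException("This sketch only handles ungapped alignments");
        }
        int differences = 0;
        for (int i = 0; i < read.length(); i++) {
            char readBase = Character.toUpperCase(read.charAt(i));
            char refBase = Character.toUpperCase(ref.charAt(i));
            boolean match = readBase == refBase && (!specCompliant || isAcgt(readBase));
            if (!match) {
                differences++;
            }
        }
        return differences;
    }
}
```

For a read `ACGNNTAC` aligned over a reference segment `ACGNNTAC`, `countDifferences(read, ref, false)` gives 0 while `countDifferences(read, ref, true)` gives 2, because the two N positions are not counted as matches in the spec-compliant mode.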

lbergelson · 2021-02-09T16:36:44Z

@yfarjoun I have a recollection of you having an opinion about this?

yfarjoun · 2021-02-09T17:03:13Z

me too....mostly it was due to being conservative and not liking change, I think....not really a good reason. Perhaps @tfenne has a more substantial opinion?

michaelgatzen · 2021-02-09T17:13:18Z

Just for reference, the reason why this came up is that DRAGEN apparently produces reads with Ns in them, which is the only time that this becomes relevant. This would lead to ValidateSamFile reporting an error on DRAGEN BAMs.

tfenne · 2021-02-09T18:14:20Z

@yfarjoun My opinion is similar - I worry about change. But if the default behavior in tools remains the same, and there aren't backwards compatibility breaks, I don't object.

FWIW a lot of UMI consensus tools generate reads with Ns in them (e.g. fgbio), so this does come up quite a bit in those situations. And there are variant callers that use NM to filter reads, so there are use cases where a change in this behavior would have a significant effect on downstream pipelines.

michaelgatzen · 2021-02-10T11:36:27Z

Ok I see. The methods here are indeed used for tools like MergeBamAlignment, so this will result in downstream change (for reads with ambiguity codes in them). This change would make it more spec compliant, but I understand the hesitation about change.

One other option is to keep it the way it was and be more liberal in SamFileValidator, so that we don't report errors for files that are spec-compliant. What do people think?

michaelgatzen added 4 commits February 8, 2021 19:07

Temporary fix to NM tag calculation

a37ed32

Fix for spec-compliant calculation of NM tag

e999a82

Added tests for SAM NM tag compliance

c5fda33

Using more appropriate exception type

1eb08ee

kockan mentioned this pull request Jun 13, 2024

ValidateSamFile wrong NM tag computation broadinstitute/picard#1963

Open

