view -e EXPR diverges from bcftools when EXPR is pure sample-scope (FMT/...) #341

@jeromekelleher

Summary

vcztools view -e EXPR (and bcftools_filter.BcftoolsFilter(exclude=EXPR)) diverges from bcftools view -e EXPR when EXPR is a pure sample-scope expression (FMT/-prefixed, no variant-scope conjunct). The bug is that vcztools negates the per-sample mask cell-by-cell before the np.any-across-samples collapse, while bcftools negates the row-level decision after the per-sample collapse.

Reproduction

Using the in-tree tests/data/vcf/sample.vcf.gz:

$ bcftools view --no-version -e 'FMT/DP<5' tests/data/vcf/sample.vcf.gz | grep -v '^##' | awk '{print $1, $2}'
#CHROM POS
19 111
19 112
20 1235237
X 10

$ vcztools view --no-version -e 'FMT/DP<5' tests/data/vcf/sample.vcz.zip | grep -v '^##' | awk '{print $1, $2}'
#CHROM POS
19 111
19 112
20 14370
20 17330
20 1110696
20 1230237
20 1235237
X 10

vcztools keeps four extra rows (14370, 17330, 1110696, 1230237). Each has at least one sample with DP<5, so under bcftools' row-level semantics -e 'FMT/DP<5' excludes them. vcztools includes them.

Why

bcftools view -e EXPR excludes a site iff EXPR is true at the site, where for sample-scope EXPR "true at the site" means np.any(per_sample_EXPR, axis=1). So a row is kept iff not np.any(per_sample_EXPR, axis=1).

vcztools BcftoolsFilter.evaluate (vcztools/bcftools_filter.py:721–735) instead computes np.logical_not(per_sample_EXPR) cell-by-cell, then the reader (vcztools/retrieval.py:1139) collapses with np.any(..., axis=1). That's np.any(not per_sample_EXPR, axis=1), which is "at least one sample fails EXPR", not "no sample passes EXPR".

For FMT/DP=[., 4, 2] and EXPR = 'FMT/DP<5':

  • per-sample EXPR: [False, True, True] (missing → False)
  • bcftools: not np.any([F, T, T]) = not True = False → row excluded
  • vcztools: np.any(not [F, T, T]) = np.any([T, F, F]) = True → row kept
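
The worked example above can be reproduced with plain NumPy. This is a minimal sketch, not vcztools code: the array is typed in by hand from the FMT/DP=[., 4, 2] case, and the variable names are made up for illustration.

```python
import numpy as np

# Per-sample result of FMT/DP<5 for one site with DP = [., 4, 2].
# The missing value evaluates to False, as in the bullet list above.
per_sample = np.array([[False, True, True]])

# bcftools-style: collapse across samples first, then negate the row decision.
keep_row_level = np.logical_not(np.any(per_sample, axis=1))

# Current vcztools-style: negate each cell, then collapse across samples.
keep_cell_level = np.any(np.logical_not(per_sample), axis=1)

print(keep_row_level.tolist())   # [False] -> row excluded
print(keep_cell_level.tolist())  # [True]  -> row kept
```

The two orders agree when every sample gives the same answer for a row; they differ only on rows where some samples pass EXPR and others do not.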

For mixed-scope expressions like -e '(FMT/DP>=8) && POS>100000' (the existing test at tests/test_bcftools_validation.py:81) the bug doesn't surface because the POS>100000 operand of the AND is already 1-D, so the expression evaluator collapses the FMT operand with np.any inside && before the per-row negate runs. Pure sample-scope -e is the only path that exposes the divergence.
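
A rough model of the mixed-scope path, assuming the evaluator behaves as described above (the 2-D FMT operand is collapsed with np.any before the AND). The masks below are hypothetical values, not taken from sample.vcf.gz, and this is not the vcztools evaluator itself.

```python
import numpy as np

# Hypothetical per-sample mask for FMT/DP>=8 (2 sites x 3 samples).
fmt_mask = np.array([
    [False, True, True],
    [False, False, False],
])
# Hypothetical per-site mask for POS>100000.
pos_mask = np.array([True, True])

# Assumed evaluator behaviour: collapse the 2-D operand, then AND in 1-D.
combined = np.logical_and(np.any(fmt_mask, axis=1), pos_mask)

# The negate now acts on a row-level (1-D) result, matching bcftools.
keep = np.logical_not(combined)

print(combined.tolist())  # [True, False]
print(keep.tolist())      # [False, True]
```

Because the result is already 1-D when the negate runs, there is no per-sample axis left to negate cell-by-cell, which is why this path agrees with bcftools.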

Fix sketch

In BcftoolsFilter.evaluate, when self.invert is set and the result is 2-D sample-scope, collapse first then negate:

result = self.parse_result[0].eval(chunk_data)
if self.scope == "sample" and self.invert:
    # bcftools row-level negate: row is excluded iff any sample passes EXPR.
    result = np.logical_not(np.any(result, axis=1))
elif self.invert:
    result = np.logical_not(result)
if self.scope == "variant" and result.ndim == 2:
    result = np.any(result, axis=1)
return result
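
To sanity-check the collapse-then-negate logic outside vcztools, the sketch can be mirrored as a free function. Everything here is hypothetical: evaluate_mask and its arguments stand in for the method and its attributes, and the result.ndim == 2 guard is an added assumption taken from the "2-D sample-scope" wording above.

```python
import numpy as np

def evaluate_mask(result, scope, invert):
    """Hypothetical stand-alone mirror of the fix sketch; not vcztools API."""
    if scope == "sample" and invert and result.ndim == 2:
        # Row-level negate: a row is excluded iff any sample passes EXPR.
        return np.logical_not(np.any(result, axis=1))
    if invert:
        result = np.logical_not(result)
    if scope == "variant" and result.ndim == 2:
        result = np.any(result, axis=1)
    return result

per_sample = np.array([
    [False, False, False],  # no sample passes EXPR -> keep
    [False, True, True],    # some samples pass     -> exclude
    [True, True, True],     # all samples pass      -> exclude
])
keep = evaluate_mask(per_sample, scope="sample", invert=True)
print(keep.tolist())  # [True, False, False]
```

Note that the inverted sample-scope result is now 1-D, so any downstream code that expects a 2-D mask for sample-scope filters would need to handle that case.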
