Summary
vcztools view -e EXPR (and bcftools_filter.BcftoolsFilter(exclude=EXPR)) diverges from bcftools view -e EXPR when EXPR is a pure sample-scope expression (FMT/-prefixed, no variant-scope conjunct). The bug is that vcztools negates the per-sample mask cell-by-cell before the np.any-across-samples collapse, while bcftools negates the row-level decision after the per-sample collapse.
Reproduction
Using the in-tree tests/data/vcf/sample.vcf.gz:
$ bcftools view --no-version -e 'FMT/DP<5' tests/data/vcf/sample.vcf.gz | grep -v '^##' | awk '{print $1, $2}'
#CHROM POS
19 111
19 112
20 1235237
X 10
$ vcztools view --no-version -e 'FMT/DP<5' tests/data/vcf/sample.vcz.zip | grep -v '^##' | awk '{print $1, $2}'
#CHROM POS
19 111
19 112
20 14370
20 17330
20 1110696
20 1230237
20 1235237
X 10
vcztools keeps four extra rows (14370, 17330, 1110696, 1230237). Each has at least one sample with DP<5, so under bcftools' row-level semantics -e 'FMT/DP<5' excludes them. vcztools keeps them.
Why
bcftools view -e EXPR excludes a site iff EXPR is true at the site, where for a sample-scope EXPR "true at the site" means np.any(per_sample_EXPR, axis=1). So a row is kept iff not np.any(per_sample_EXPR, axis=1).
vcztools BcftoolsFilter.evaluate (vcztools/bcftools_filter.py:721–735) instead computes np.logical_not(per_sample_EXPR) cell-by-cell, then the reader (vcztools/retrieval.py:1139) collapses with np.any(..., axis=1). That is np.any(not per_sample_EXPR, axis=1), which means "at least one sample fails EXPR" rather than "no sample passes EXPR".
For FMT/DP=[., 4, 2] and EXPR = 'FMT/DP<5':
- per-sample EXPR: [False, True, True] (missing → False)
- bcftools: not np.any([F, T, T]) = not True = False → row excluded
- vcztools: np.any(not [F, T, T]) = np.any([T, F, F]) = True → row kept
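The worked example above can be checked directly in NumPy. This is a minimal sketch that starts from the per-sample boolean mask given in the list; the variable names are illustrative and do not come from the vcztools source.

```python
import numpy as np

# Per-sample result of 'FMT/DP<5' for one site with FMT/DP=[., 4, 2].
# The missing value evaluates to False, as in the worked example.
per_sample = np.array([[False, True, True]])

# bcftools order: collapse across samples, then negate the row decision.
keep_bcftools = np.logical_not(np.any(per_sample, axis=1))

# vcztools order: negate each cell, then collapse across samples.
keep_vcztools = np.any(np.logical_not(per_sample), axis=1)

print(keep_bcftools.tolist())  # [False]  (row excluded)
print(keep_vcztools.tolist())  # [True]   (row kept)
```

The two orders agree on rows where every sample gives the same answer (all True or all False) and disagree on every row with a mix, which is why the extra rows in the reproduction are exactly the sites where only some samples have DP<5.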
For mixed-scope expressions like -e '(FMT/DP>=8) && POS>100000' (the existing test at tests/test_bcftools_validation.py:81) the bug doesn't surface: the POS>100000 operand is already 1-D, so the expression evaluator collapses the FMT operand with np.any inside && before the per-row negate runs. Pure sample-scope -e is the only path that exposes the divergence.
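A rough illustration of the mixed-scope case, assuming (as described above) that the evaluator reduces the 2-D FMT operand with np.any before ANDing it with the 1-D POS operand. The DP and POS values are made up for the example and are not taken from sample.vcf.gz.

```python
import numpy as np

# Three sites, three samples each; values are illustrative only.
dp = np.array([[10, 4, 2], [3, 3, 3], [9, 9, 9]])
pos = np.array([200000, 200000, 50])

fmt_mask = dp >= 8        # 2-D, sample scope
pos_mask = pos > 100000   # 1-D, variant scope

# Assumed evaluator behaviour for &&: collapse the FMT side first.
excluded = np.any(fmt_mask, axis=1) & pos_mask

# The negate for -e now runs on a 1-D row mask, so there is no
# per-cell negation left to get wrong.
keep = np.logical_not(excluded)

print(keep.tolist())  # [False, True, True]
```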
Fix sketch
In BcftoolsFilter.evaluate, when self.invert is set and the result is 2-D sample-scope, collapse first then negate:
result = self.parse_result[0].eval(chunk_data)
if self.scope == "sample" and self.invert:
    # bcftools row-level negate: row is excluded iff any sample passes EXPR.
    result = np.logical_not(np.any(result, axis=1))
elif self.invert:
    result = np.logical_not(result)
if self.scope == "variant" and result.ndim == 2:
    result = np.any(result, axis=1)
return result
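To sanity-check the branch logic outside vcztools, the same control flow can be pulled into a standalone function. This is only a sketch: finalize_mask is a hypothetical helper, and its scope and invert arguments stand in for self.scope and self.invert from the snippet above.

```python
import numpy as np


def finalize_mask(result, scope, invert):
    """Hypothetical stand-in for the tail of BcftoolsFilter.evaluate."""
    if scope == "sample" and invert:
        # Collapse across samples first, then negate the row decision.
        return np.logical_not(np.any(result, axis=1))
    if invert:
        result = np.logical_not(result)
    if scope == "variant" and result.ndim == 2:
        result = np.any(result, axis=1)
    return result


# Per-sample masks for three sites under a sample-scope EXPR.
masks = np.array([
    [False, True, True],    # some samples pass EXPR -> excluded
    [False, False, False],  # no sample passes EXPR  -> kept
    [True, True, True],     # every sample passes    -> excluded
])

keep = finalize_mask(masks, scope="sample", invert=True)
print(keep.tolist())  # [False, True, False]
```

The first row is the case the current code gets wrong; a regression test for pure sample-scope -e would want at least one such mixed row.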