view -e EXPR diverges from bcftools when EXPR is pure sample-scope (FMT/...) #341

## Summary
`vcztools view -e EXPR` (and `bcftools_filter.BcftoolsFilter(exclude=EXPR)`) diverges from `bcftools view -e EXPR` when `EXPR` is a pure sample-scope expression (FMT/-prefixed, no variant-scope conjunct). The bug is that vcztools negates the per-sample mask cell-by-cell *before* the `np.any`-across-samples collapse, while bcftools negates the row-level decision *after* the per-sample collapse.
                                                                                                                                                                                    
## Reproduction                                                                                                                                                                     
                                             
Using the in-tree `tests/data/vcf/sample.vcf.gz`:                                         
                                             
```                                                                                       
$ bcftools view --no-version -e 'FMT/DP<5' tests/data/vcf/sample.vcf.gz | grep -v '^##' | awk '{print $1, $2}'                                                                      
#CHROM POS                                 
19 111                                       
19 112                                                                                                                                                                              
20 1235237                                                                                                                                                                          
X 10                                         
                                                                                          
$ vcztools view --no-version -e 'FMT/DP<5' tests/data/vcf/sample.vcz.zip | grep -v '^##' | awk '{print $1, $2}'
#CHROM POS
19 111
19 112
20 14370
20 17330
20 1110696
20 1230237
20 1235237
X 10
```                                                                                                                                                                                 

vcztools keeps four extra rows (14370, 17330, 1110696, 1230237). Each has at least one sample with `DP<5`, so under bcftools' row-level semantics `-e 'FMT/DP<5'` excludes them. vcztools includes them.

## Why

`bcftools view -e EXPR` excludes a site iff `EXPR` is true at the site, where for sample-scope `EXPR` "true at the site" means, in NumPy terms, `np.any(per_sample_EXPR, axis=1)`. So the keep-row decision is `not np.any(per_sample_EXPR, axis=1)`.

vcztools `BcftoolsFilter.evaluate` (`vcztools/bcftools_filter.py`:721–735) instead computes `np.logical_not(per_sample_EXPR)` cell-by-cell, then the reader (`vcztools/retrieval.py`:1139) collapses with `np.any(..., axis=1)`. That's `np.any(not per_sample_EXPR, axis=1)`, which is "at least one sample fails EXPR", not "no sample passes EXPR".

For `FMT/DP=[., 4, 2]` and `EXPR = 'FMT/DP<5'`:

- per-sample `EXPR`: `[False, True, True]` (missing → False)
- bcftools: `not np.any([F, T, T])` = `not True` = **False** → row excluded
- vcztools: `np.any(not [F, T, T])` = `np.any([T, F, F])` = **True** → row kept
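
The arithmetic above can be checked directly in NumPy. This is a minimal sketch, not vcztools code: the array `dp`, the `-1` missing-value sentinel, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# One site, three samples: FMT/DP = [., 4, 2]. The missing value is modelled
# here as -1 (an assumption for this sketch, not necessarily how VCZ stores it).
MISSING = -1
dp = np.array([[MISSING, 4, 2]])

# Per-sample EXPR 'FMT/DP<5'; a missing value never satisfies the comparison.
per_sample = (dp != MISSING) & (dp < 5)

# bcftools semantics: collapse across samples first, then negate the row.
keep_bcftools = np.logical_not(np.any(per_sample, axis=1))

# vcztools behaviour as described above: negate each cell, then collapse.
keep_vcztools = np.any(np.logical_not(per_sample), axis=1)

print(per_sample.tolist())     # [[False, True, True]]
print(keep_bcftools.tolist())  # [False]
print(keep_vcztools.tolist())  # [True]
```

By De Morgan, `not any(x)` equals `all(not x)`, so the two orders agree only on rows where every sample gives the same per-sample answer.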

For mixed-scope expressions like `-e '(FMT/DP>=8) && POS>100000'` (the existing test at `tests/test_bcftools_validation.py:81`) the bug doesn't surface: `POS>100000` is already 1-D, so the expression evaluator collapses the FMT operand with `np.any` *inside* `&&`, and the negate then runs on a result that is already per-row. Pure sample-scope `-e` is the only path that exposes the divergence.
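
A minimal sketch of why the mixed-scope case hides the bug, assuming (as described above) that the evaluator collapses the 2-D FMT operand with `np.any` before combining it with the 1-D `POS` operand. The data values and variable names are made up for illustration and are not taken from the test file.

```python
import numpy as np

# Two hypothetical sites, three samples each.
dp = np.array([[9, 4, 2],
               [3, 4, 2]])
pos = np.array([1110696, 1230237])

# '(FMT/DP>=8) && POS>100000': the FMT operand is collapsed to 1-D inside '&&'
# (assumed evaluator behaviour, following the description above).
fmt_any = np.any(dp >= 8, axis=1)
expr = fmt_any & (pos > 100000)

# The result is already 1-D, so negating it is a row-level negation.
keep = np.logical_not(expr)
print(keep.tolist())  # [False, True]
```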

## Fix sketch

In `BcftoolsFilter.evaluate`, when `self.invert` is set and the result is 2-D sample-scope, collapse first then negate:

```python
result = self.parse_result[0].eval(chunk_data)
if self.scope == "sample" and self.invert:
    # bcftools row-level negate: row is excluded iff any sample passes EXPR.
    result = np.logical_not(np.any(result, axis=1))
elif self.invert:
    result = np.logical_not(result)
if self.scope == "variant" and result.ndim == 2:
    result = np.any(result, axis=1)
return result
```
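
To exercise the proposed ordering in isolation, a standalone helper mirroring the sketch could look like the following. `row_mask` is a hypothetical function written for illustration, not part of vcztools, and it covers only the 2-D sample-scope case.

```python
import numpy as np

def row_mask(per_sample: np.ndarray, invert: bool) -> np.ndarray:
    """Hypothetical helper: 1-D keep mask for a 2-D (variants x samples) result."""
    any_pass = np.any(per_sample, axis=1)
    # For -e (invert=True), negate the row-level decision, not the cells.
    return np.logical_not(any_pass) if invert else any_pass

per_sample = np.array([
    [False, True, True],    # some samples pass EXPR
    [False, False, False],  # no sample passes EXPR
    [True, True, True],     # every sample passes EXPR
])

include = row_mask(per_sample, invert=False)  # -i EXPR
exclude = row_mask(per_sample, invert=True)   # -e EXPR

# With row-level negation, -i and -e partition the rows.
assert np.array_equal(exclude, np.logical_not(include))
print(exclude.tolist())  # [False, True, False]
```

A regression test along these lines, run against a pure sample-scope `-e` expression, would cover the path that the existing mixed-scope test misses.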

