Scale/perf testing `create` and `normalise` with large variant lists

I used the publicly available [variant lists from Our Future Health](https://research.ourfuturehealth.org.uk/data-and-cohort/#our-future-health-data-files) to run some perf tests on operations that process variant lists.

I wrote a [script](https://github.com/sgkit-dev/vczstore/blob/scale-perf-testing/bin/ofh-public-variants-to-vcf.py) to convert the CSVs, which contain chrom/pos/ref/alt values, into a VCF, then ran that through vcf2zarr and zipped the result.
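
For readers who don't want to open the script, the conversion is roughly the following. This is a minimal sketch rather than the linked script itself: the column names in `COLUMNS` are an assumption about the CSV header, and it writes only a bare sites-only VCF (no contig or INFO header lines).

```python
import csv
import io

# Assumed column names; the real CSV header may differ.
COLUMNS = ("chrom", "pos", "ref", "alt")


def csv_to_vcf(csv_file, vcf_file):
    """Write a sites-only VCF from a CSV with chrom/pos/ref/alt columns.

    Returns the number of variant records written.
    """
    vcf_file.write("##fileformat=VCFv4.3\n")
    vcf_file.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    count = 0
    for row in csv.DictReader(csv_file):
        chrom, pos, ref, alt = (row[c] for c in COLUMNS)
        # ID, QUAL, FILTER and INFO are unknown, so use the VCF missing value "."
        vcf_file.write(f"{chrom}\t{int(pos)}\t.\t{ref}\t{alt}\t.\t.\t.\n")
        count += 1
    return count


# Tiny in-memory demo with made-up rows
demo_csv = io.StringIO("chrom,pos,ref,alt\nchr1,100,A,G\nchr1,250,C,T\n")
demo_vcf = io.StringIO()
n_written = csv_to_vcf(demo_csv, demo_vcf)
print(demo_vcf.getvalue(), end="")
```

In the real pipeline the output would go to stdout so it can be piped into `bcftools view`, as in the commands below.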

There are two variant lists, one for array data (700K variants) and one for imputed data (160M variants).

For the array data:
```shell
bin/ofh-public-variants-to-vcf.py our_future_health_cpra_array_variant_list_grch38.csv \
  | bcftools view -o ofh-array.vcf.gz -W=csi
vcf2zarr convert -p 8 ofh-array.vcf.gz ofh-array.vcz
zipzarr -k ofh-array.vcz ofh-array.vcz.zip
vcztools index -n ofh-array.vcz.zip
# outputs 701345
```
The resulting zip was 2.8MB (compared to the original zipped CSV of 19.4MB).

For the imputed data:
```shell
bin/ofh-public-variants-to-vcf.py our_future_health_cpra_imputed_variant_list_grch38.csv \
  | bcftools view -o ofh-imputed.vcf.gz -W=csi
vcf2zarr convert -p 8 ofh-imputed.vcf.gz ofh-imputed.vcz
zipzarr -k ofh-imputed.vcz ofh-imputed.vcz.zip
vcztools index -n ofh-imputed.vcz.zip
# outputs 159587100
```
The resulting zip was 559MB (compared to the original zipped CSV of 581MB).

### Running vczstore with the array variant list

Create a store with the array variant list. Note that the variant list is specified twice, which forces `create` to merge the two inputs. This approximates the case of having two variant lists that are roughly the same.

```shell
time uv run vczstore create -v store.vcz ofh-array.vcz ofh-array.vcz
# 6.75s user 3.50s system 133% cpu 7.660 total
```
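
To illustrate why passing the same list twice is a reasonable stand-in for merging two similar lists, here is a minimal sketch of a sorted merge with de-duplication. It is not vczstore's implementation; the tuple layout `(contig_index, pos, ref, alt)` and the function name are assumptions made for the example.

```python
import heapq


def merge_variant_lists(*lists):
    """Merge sorted lists of (contig_index, pos, ref, alt), dropping duplicates."""
    merged = []
    for variant in heapq.merge(*lists):
        # Inputs are sorted, so duplicates arrive next to each other
        if not merged or merged[-1] != variant:
            merged.append(variant)
    return merged


a = [(0, 100, "A", "G"), (0, 200, "C", "T")]
b = [(0, 100, "A", "G"), (0, 150, "G", "A")]

print(merge_variant_lists(a, b))
# Merging a list with itself still scans every input record but
# yields the original list, which is the situation in this test.
print(merge_variant_lists(a, a) == a)
```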

Then normalise the variant list against the store (again, the same variants in this simplistic demo):

```shell
time uv run vczstore normalise -v store.vcz ofh-array.vcz ofh-array-norm.vcz
# 5.70s user 2.44s system 134% cpu 6.055 total
```

### Running vczstore with the imputed variant list

For the imputed variant list:

```shell
time uv run vczstore create -v store.vcz ofh-imputed.vcz ofh-imputed.vcz
# 1525.39s user 708.35s system 116% cpu 32:04.15 total
```

```shell
time uv run vczstore normalise -v store.vcz ofh-imputed.vcz ofh-imputed-norm.vcz
# 694.71s user 692.39s system 129% cpu 17:54.42 total
```
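
To compare the array and imputed runs on an equal footing, the wall-clock times can be turned into variants per second. A small sketch using the counts and `time` totals reported in this issue (the helper names are mine; note that `create` reads the list twice, so its per-input-record rate is about double what this prints):

```python
def parse_elapsed(text):
    """Convert a `time` total such as '32:04.15' or '7.660' into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def throughput(n_variants, elapsed):
    """Variants processed per wall-clock second."""
    return n_variants / parse_elapsed(elapsed)


# (label, variant count, wall-clock total) taken from the runs above
runs = [
    ("array create", 701345, "7.660"),
    ("array normalise", 701345, "6.055"),
    ("imputed create", 159587100, "32:04.15"),
    ("imputed normalise", 159587100, "17:54.42"),
]

for label, n_variants, elapsed in runs:
    print(f"{label}: {throughput(n_variants, elapsed):,.0f} variants/s")
```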

Both completed (on an 8-core box), but there are probably still improvements we could make. In particular, I noticed that `create` is very memory intensive (~60GB peak) and was swapping.
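
For tracking memory across future runs, something like the following could wrap the command and report peak resident memory. This is a sketch, not how the figure above was obtained: it relies on the standard library `resource` module (Unix only), and because RSS excludes pages that have been swapped out, it will under-report once the process starts swapping.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd and return (exit code, peak RSS of child processes in MB)."""
    proc = subprocess.run(cmd)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return proc.returncode, peak / divisor


# Stand-in command; in practice this would be the vczstore invocation, e.g.
# ["uv", "run", "vczstore", "create", "-v", "store.vcz", ...]
code, peak_mb = run_and_report_peak_rss(
    [sys.executable, "-c", "data = bytearray(50_000_000)"]
)
print(f"exit code {code}, peak RSS {peak_mb:.0f} MB")
```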

Here's the log with debug enabled (`-vv`):
[ofh-imputed-create-store.log](https://github.com/user-attachments/files/27529469/ofh-imputed-create-store.log)
