Scale/perf testing create and normalise with large variant lists #95

@tomwhite

I used publicly available variants lists from Our Future Health to run some perf tests on operations that process variant lists.

I wrote a script to convert the CSVs, which contain chrom/pos/ref/alt values, into a VCF, then ran that through vcf2zarr and zipped the result.
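The conversion script itself isn't shown here, so the following is only a minimal sketch of what a CSV-to-VCF step of this kind might look like. The column names (`chrom`, `pos`, `ref`, `alt`), the function name `csv_to_vcf`, and the explicit `contigs` argument are assumptions for illustration and may not match the real CSVs or the actual script.

```python
import csv
import io
import sys

# The eight fixed columns of a sites-only VCF (no FORMAT or sample columns).
VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def csv_to_vcf(csv_file, vcf_file, contigs):
    """Write a sites-only VCF from a CSV with chrom/pos/ref/alt columns.

    The column names are assumed, not taken from the real variant lists.
    """
    vcf_file.write("##fileformat=VCFv4.3\n")
    # Declare contigs up front so downstream tools can build an index.
    for contig in contigs:
        vcf_file.write(f"##contig=<ID={contig}>\n")
    vcf_file.write("\t".join(VCF_COLUMNS) + "\n")
    for row in csv.DictReader(csv_file):
        # ID, QUAL, FILTER and INFO are unknown, so use the missing value ".".
        fields = [row["chrom"], row["pos"], ".", row["ref"], row["alt"], ".", ".", "."]
        vcf_file.write("\t".join(fields) + "\n")


# Tiny in-memory demo; a real run would stream from a file to stdout.
demo_csv = io.StringIO("chrom,pos,ref,alt\nchr1,100,A,G\nchr1,250,CT,C\n")
csv_to_vcf(demo_csv, sys.stdout, contigs=["chr1"])
```

Rows are streamed one at a time, so memory use stays flat regardless of list size; the sketch assumes the input is already sorted by chromosome and position, which indexing generally requires.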

There are two variant lists, one for array data (700K variants) and one for imputed data (160M variants).

For the array data:

bin/ofh-public-variants-to-vcf.py our_future_health_cpra_array_variant_list_grch38.csv \
  | bcftools view -o ofh-array.vcf.gz -W=csi
vcf2zarr convert -p 8 ofh-array.vcf.gz ofh-array.vcz
zipzarr -k ofh-array.vcz ofh-array.vcz.zip
vcztools index -n ofh-array.vcz.zip
# outputs 701345

The resulting zip was 2.8MB (compared to the original zipped CSV of 19.4MB).

For the imputed data:

bin/ofh-public-variants-to-vcf.py our_future_health_cpra_imputed_variant_list_grch38.csv \
  | bcftools view -o ofh-imputed.vcf.gz -W=csi
vcf2zarr convert -p 8 ofh-imputed.vcf.gz ofh-imputed.vcz
zipzarr -k ofh-imputed.vcz ofh-imputed.vcz.zip
vcztools index -n ofh-imputed.vcz.zip
# outputs 159587100

The resulting zip was 559MB (compared to the original zipped CSV of 581MB).

Running vczstore with the array variant list

Create a store with the array variant list. Note that the variant list is specified twice, which forces create to merge the two inputs; this approximates having two variant lists that are roughly the same.

time uv run vczstore create -v store.vcz ofh-array.vcz ofh-array.vcz
# 6.75s user 3.50s system 133% cpu 7.660 total

Then normalise the variant list against the store (again, the same variants in this simplistic demo):

time uv run vczstore normalise -v store.vcz ofh-array.vcz ofh-array-norm.vcz
# 5.70s user 2.44s system 134% cpu 6.055 total

Running vczstore with the imputed variant list

For the imputed variant list:

time uv run vczstore create -v store.vcz ofh-imputed.vcz ofh-imputed.vcz
# 1525.39s user 708.35s system 116% cpu 32:04.15 total
time uv run vczstore normalise -v store.vcz ofh-imputed.vcz ofh-imputed-norm.vcz
# 694.71s user 692.39s system 129% cpu 17:54.42 total
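To compare the two list sizes, the wall-clock totals above can be turned into variants per second. This is just arithmetic on the reported numbers; the helper name `parse_elapsed` is made up for this sketch, and the rate is per variant in the list (create was given each list twice).

```python
def parse_elapsed(text):
    """Convert a `time` total such as '32:04.15' or '7.660' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        # Each colon-separated field is worth 60x the one after it.
        seconds = seconds * 60 + float(part)
    return seconds


# (label, variants in the list, wall-clock total as reported above)
RUNS = [
    ("array create", 701_345, "7.660"),
    ("array normalise", 701_345, "6.055"),
    ("imputed create", 159_587_100, "32:04.15"),
    ("imputed normalise", 159_587_100, "17:54.42"),
]

for label, n_variants, elapsed in RUNS:
    secs = parse_elapsed(elapsed)
    print(f"{label}: {secs:.2f}s, {n_variants / secs:,.0f} variants/s")
```

On these figures the rates for the array and imputed lists come out at the same order of magnitude, despite the imputed list being over 200 times larger.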

Both runs completed (on an 8-core box), but there are probably still improvements we could make. In particular, I noticed that create is very memory intensive (~60GB peak) and was swapping.
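For tracking peak memory across future runs, one option is to wrap the command and read the child's peak resident set size afterwards. This is a sketch using Python's standard `resource` module on a Unix-like system, not necessarily how the ~60GB figure above was obtained; the function name `peak_rss_of` is hypothetical.

```python
import resource
import subprocess
import sys


def peak_rss_of(cmd):
    """Run cmd to completion and return the peak RSS of child processes, in bytes."""
    subprocess.run(cmd, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss * scale


# Demo with a small child process that allocates about 50MB.
demo_cmd = [sys.executable, "-c", "x = bytearray(50_000_000)"]
print(f"peak RSS: {peak_rss_of(demo_cmd) / 1e6:.0f} MB")
```

The demo command could be swapped for the `vczstore create` invocation above. Note that resident set size excludes pages that have been swapped out, so on a box that is swapping it can understate the true working set.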

Here's the log with debug enabled (-vv):
ofh-imputed-create-store.log
