I used the publicly available variant lists from Our Future Health to run some perf tests on operations that process variant lists.
I wrote a script to convert the CSVs, which contain chrom/pos/ref/alt values, into a VCF, then ran that through vcf2zarr and zipped the result.
There are two variant lists, one for array data (700K variants) and one for imputed data (160M variants).
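The conversion script itself isn't shown here, so the following is only a minimal sketch of what a CSV to sites-only VCF conversion could look like. The column names (`chrom`, `pos`, `ref`, `alt`) and the function name `csv_to_vcf` are assumptions for illustration, not the contents of `bin/ofh-public-variants-to-vcf.py`. A real script would probably also need to emit `##contig` header lines.

```python
import csv
import io


def csv_to_vcf(csv_file, out, columns=("chrom", "pos", "ref", "alt")):
    """Write a sites-only VCF for the rows of csv_file and return the row count.

    The column names are assumptions; adjust them to match the real CSV header.
    """
    chrom_col, pos_col, ref_col, alt_col = columns
    out.write("##fileformat=VCFv4.3\n")
    out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    count = 0
    for row in csv.DictReader(csv_file):
        # ID, QUAL, FILTER and INFO are unknown for a bare variant list, so use "."
        fields = [row[chrom_col], row[pos_col], ".", row[ref_col], row[alt_col], ".", ".", "."]
        out.write("\t".join(fields) + "\n")
        count += 1
    return count


# Tiny in-memory example with made-up rows (not real OFH data)
sample = io.StringIO("chrom,pos,ref,alt\nchr1,12345,A,G\nchr1,12400,CT,C\n")
vcf = io.StringIO()
n_written = csv_to_vcf(sample, vcf)
print(vcf.getvalue(), end="")
```

Writing to a file-like object keeps the function usable both in a pipe (pass `sys.stdout`) and in tests.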
For the array data:
bin/ofh-public-variants-to-vcf.py our_future_health_cpra_array_variant_list_grch38.csv \
| bcftools view -o ofh-array.vcf.gz -W=csi
vcf2zarr convert -p 8 ofh-array.vcf.gz ofh-array.vcz
zipzarr -k ofh-array.vcz ofh-array.vcz.zip
vcztools index -n ofh-array.vcz.zip
# outputs 701345
The resulting zip was 2.8MB (compared to the original zipped CSV of 19.4MB).
For the imputed data:
bin/ofh-public-variants-to-vcf.py our_future_health_cpra_imputed_variant_list_grch38.csv \
| bcftools view -o ofh-imputed.vcf.gz -W=csi
vcf2zarr convert -p 8 ofh-imputed.vcf.gz ofh-imputed.vcz
zipzarr -k ofh-imputed.vcz ofh-imputed.vcz.zip
vcztools index -n ofh-imputed.vcz.zip
# outputs 159587100
The resulting zip was 559MB (compared to the original zipped CSV of 581MB).
Running vczstore with the array variant list
Create a store with the array variant list. Note that the variant list is specified twice, which forces create to merge the two inputs. This approximates the case of having two variant lists that are roughly the same.
time uv run vczstore create -v store.vcz ofh-array.vcz ofh-array.vcz
# 6.75s user 3.50s system 133% cpu 7.660 total
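As a rough mental model of what merging two variant lists involves, here is a sketch that merges two sorted lists of variant keys and drops duplicates. This is not vczstore's implementation. The key layout `(contig_index, pos, ref, alt)` and the function name `merge_variant_lists` are assumptions made for illustration.

```python
import heapq


def merge_variant_lists(a, b):
    """Merge two sorted iterables of (contig_index, pos, ref, alt) keys, dropping duplicates."""
    merged = []
    previous = None
    # heapq.merge lazily interleaves the two already-sorted inputs
    for key in heapq.merge(a, b):
        if key != previous:
            merged.append(key)
            previous = key
    return merged


# Made-up keys; contig_index stands in for the contig's position in the header
list_a = [(0, 100, "A", "G"), (0, 200, "C", "T"), (1, 50, "G", "A")]
list_b = [(0, 100, "A", "G"), (0, 150, "T", "C"), (1, 50, "G", "A")]

merged = merge_variant_lists(list_a, list_b)
same = merge_variant_lists(list_a, list_a)
print(len(merged), len(same))
```

Merging a list with itself, as in the command above, gives back the same set of variants, so that run exercises the merge path without growing the output.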
Then normalise the variant list against the store (again, the same variants in this simplistic demo):
time uv run vczstore normalise -v store.vcz ofh-array.vcz ofh-array-norm.vcz
# 5.70s user 2.44s system 134% cpu 6.055 total
Running vczstore with the imputed variant list
The same create and normalise steps, run on the imputed variant list:
time uv run vczstore create -v store.vcz ofh-imputed.vcz ofh-imputed.vcz
# 1525.39s user 708.35s system 116% cpu 32:04.15 total
time uv run vczstore normalise -v store.vcz ofh-imputed.vcz ofh-imputed-norm.vcz
# 694.71s user 692.39s system 129% cpu 17:54.42 total
These both completed (on an 8-core box), but there are probably still improvements we could make. In particular, I noticed that create is very memory intensive (~60GB peak) and was swapping.
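One way to get a repeatable peak-memory number for a command is to read the child-process resource usage after it exits. This is only a sketch using Python's standard `resource` module (Unix only). The helper name `peak_rss_of` is hypothetical. Resident set size also understates total demand when the process is swapping, as it was here.

```python
import resource
import subprocess
import sys


def peak_rss_of(command):
    """Run command to completion and return the peak RSS reported for child processes.

    RUSAGE_CHILDREN reports the largest terminated, waited-for child so far.
    ru_maxrss is in kilobytes on Linux and bytes on macOS.
    """
    subprocess.run(command, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Stand-in child process that allocates about 50MB
peak = peak_rss_of([sys.executable, "-c", "data = bytearray(50_000_000)"])
print(f"peak RSS of children: {peak}")
```

To measure the store creation, the command list would be `["uv", "run", "vczstore", "create", "-v", "store.vcz", "ofh-imputed.vcz", "ofh-imputed.vcz"]`.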
Here's the log with debug enabled (-vv):
ofh-imputed-create-store.log
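To compare the runs at the two scales, the timings above can be turned into throughput. The sketch below divides the variant count of each list by the wall-clock time reported by `time`. Treating "variants in the list per second" as the metric is my own choice. Note that create reads the list twice.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes:seconds wall-clock reading to seconds."""
    return minutes * 60 + seconds


# (variant count, wall-clock seconds), taken from the outputs quoted above
runs = {
    ("array", "create"): (701_345, 7.660),
    ("array", "normalise"): (701_345, 6.055),
    ("imputed", "create"): (159_587_100, to_seconds(32, 4.15)),
    ("imputed", "normalise"): (159_587_100, to_seconds(17, 54.42)),
}

throughput = {key: count / seconds for key, (count, seconds) in runs.items()}

for (dataset, command), rate in throughput.items():
    print(f"{dataset:8s} {command:10s} {rate:,.0f} variants/s")
```

This puts all four runs at roughly 83K to 149K variants per second, so wall-clock time appears to grow roughly in proportion to list size. The imputed create run was swapping, so its figure is probably pessimistic.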