Computation information

The aim of our work is to provide a novel, more comprehensive framework for genetic risk assessment; the framework is not yet optimized for computational performance. All computations below were run on a “standard” workstation (64 GB RAM, 6 dual-core CPUs).

Gene-based scoring and analysis

The following table shows the computation time for gene-score computation and for association analysis with linear regression on the largest chromosome (1,727,756 variants, MAF < 1%), which also contains the highest number of genes (1,972), for different numbers of individuals.

| | 1K samples | 10K samples | 100K samples |
|---|---|---|---|
| Gene scoring (min) | 22 | 25 | 48 |
| Find-association, linear regression (s) | 8 | 28 | 134 |

For prediction models, the complete input matrix (i.e., samples and genes plus covariates) must be loaded into RAM, whereas for gene scoring we use the efficient score function implemented in PLINK v2. For gene association analysis, GenRisk's memory usage depends on the size of the input matrix: the larger the matrix, the more memory it uses.

| | 1K samples | 10K samples | 100K samples |
|---|---|---|---|
| Memory (GB) | 3.1 | 3.4 | 9.4 |
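As a rough illustration of how memory grows with matrix size, the raw footprint of a dense matrix can be estimated from its dimensions alone. The sketch below assumes 8-byte (float64) values and dense storage; this is an assumption for illustration, not a statement about how GenRisk stores its data, and the function name `dense_matrix_gb` is hypothetical.

```python
def dense_matrix_gb(n_samples: int, n_features: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GB (10**9 bytes) of a dense numeric matrix.

    Assumes every entry takes `bytes_per_value` bytes (8 for float64).
    """
    return n_samples * n_features * bytes_per_value / 1e9


N_GENES = 1972  # number of genes on the largest chromosome, as reported above

for n_samples in (1_000, 10_000, 100_000):
    size_gb = dense_matrix_gb(n_samples, N_GENES)
    print(f"{n_samples:>7} samples x {N_GENES} genes: {size_gb:.3f} GB")
```

Under these assumptions, 100K samples by 1,972 genes comes to roughly 1.6 GB of raw data. Measured usage also includes overhead such as the interpreter, loaded libraries, and intermediate copies, so the estimate is best read as a lower bound.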

Prediction model generation

GenRisk itself places no limit on the number of features that can be used. However, computational issues may arise depending on the dimensionality of the input, that is, the number of samples and features (genes, covariates, etc.). The tables below present the total run time (in seconds) and maximum memory usage (in GB) for different sample sizes and an increasing number of features. For large inputs (e.g., 100K samples × 1000 features), it may be wise to use an HPC infrastructure. Time and memory usage also depend on the models included in the analysis and on the fine-tuning and finalization of the best model. Some models, such as gradient boosting, may take longer to finalize than simpler models such as linear or lasso regression.

Total run time of prediction model generation (in seconds)

| | 1K samples | 10K samples | 100K samples |
|---|---|---|---|
| 10 features | 14 | 19 | 1690 |
| 100 features | 24 | 678 | 41649 |
| 1000 features | 143 | 1034 | 432000 (≈ 5 days) |

Maximum memory used (in GB)

| | 1K samples | 10K samples | 100K samples |
|---|---|---|---|
| 10 features | 2.81 | 2.93 | 2.97 |
| 100 features | 2.93 | 2.95 | 3.29 |
| 1000 features | 3.51 | 3.82 | 8.29 |
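Run time and memory depend strongly on hardware and on the models selected, so it can be useful to take comparable measurements on one's own system before launching a large analysis. The following is a minimal sketch using only the Python standard library; it is not the procedure used to produce the tables above, and the helper names `measure` and `build_matrix` are hypothetical. Note that `tracemalloc` only tracks allocations made through Python's memory allocator, so it can under-report the memory used by native libraries.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once; return (result, elapsed seconds, peak traced memory in GB)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always collect timings and stop tracing, even if func raises.
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 1e9


def build_matrix(n_samples, n_features):
    """Stand-in workload: allocate a samples x features matrix of zeros."""
    return [[0.0] * n_features for _ in range(n_samples)]


matrix, seconds, peak_gb = measure(build_matrix, 1_000, 100)
print(f"run time: {seconds:.3f} s, peak traced memory: {peak_gb:.4f} GB")
```

Replacing `build_matrix` with a small pilot run (for example, 1K samples and 10 features) gives a first indication of how the full analysis might scale.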

Feature selection

In general, for prediction models on big datasets, we suggest first performing feature selection with the “association” module and then generating the prediction models. This is in line with the expected genetic architecture of the majority of traits, in which only a small proportion of genes plays a pivotal role. If instead the phenotype is highly polygenic, computing a genome-wide polygenic risk score (PRS) is probably the most appropriate approach: because a PRS is a single value per individual, only a vector of scores is generated, and the computational burden is therefore limited.
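The select-then-predict strategy can be sketched on simulated data as follows. This is an illustrative stand-in written with NumPy, not the GenRisk “association” module: features are ranked by the t-statistic of a univariate linear regression, and a plain least-squares model is then fitted on the top-ranked features only. All names, the simulated data, and the choice of k are assumptions made for the example.

```python
import numpy as np


def association_t_stats(X, y):
    """t-statistic of the slope from regressing y on each column of X separately.

    Uses t = r * sqrt((n - 2) / (1 - r^2)), where r is the Pearson correlation.
    Assumes no column of X is constant.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))


def select_top_features(X, y, k):
    """Indices of the k features with the largest absolute t-statistic."""
    t = association_t_stats(X, y)
    return np.argsort(-np.abs(t))[:k]


# Simulated gene scores: only the first 5 of 200 genes affect the phenotype.
rng = np.random.default_rng(0)
n_samples, n_genes, n_causal = 1_000, 200, 5
X = rng.normal(size=(n_samples, n_genes))
beta = np.zeros(n_genes)
beta[:n_causal] = 1.0
y = X @ beta + rng.normal(size=n_samples)

# Step 1: feature selection by univariate association.
selected = select_top_features(X, y, k=10)

# Step 2: fit the prediction model on the reduced matrix (intercept + 10 genes).
X_small = np.column_stack([np.ones(n_samples), X[:, selected]])
coef, *_ = np.linalg.lstsq(X_small, y, rcond=None)

print("selected genes:", sorted(selected.tolist()))
print("reduced matrix shape:", X_small.shape)
```

The model is fitted on a 1,000 × 11 matrix instead of 1,000 × 201, which is the kind of dimensionality reduction that keeps run time and memory manageable for large cohorts.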