Command Line Interface

Note

Detailed information about the functions can be found in the pipeline.

The genrisk command line interface includes multiple commands which can be used as follows:

genrisk score-genes

Calculate the gene-based scores for a given dataset.

Example

$ genrisk score-genes -a /path/to/toy_vcf_info.vcf -o toy_genes_scores.tsv -t toy_vcf_scoring -v ID -f AF -g gene -l ALT -d RawScore

Parameters

annotation_file (str): an annotation file containing variant IDs, alt, AF and deleteriousness scores.
bfiles (str): the binary files for the plink process.
plink (str): the location of plink, if not set in the environment.
beta_param (tuple): the parameters of the beta weight function.
temp_dir (str): a temporary directory to save temporary files before merging.
output_file (str): the location and name of the final output scores matrix.
weight_func (str): the weighting function applied to allele frequency in score calculation. [beta | log10]
variant_col (str): the column containing the variant IDs.
gene_col (str): the column containing gene names. If the genes are in the INFO column, use the identifier of the value (e.g. for gene=IF, the identifier is ‘gene’).
af_col (str): the column containing allele frequency. If in INFO, follow the previous example.
del_col (str): the column containing the deleteriousness score (functional annotation). If in INFO, follow the previous example.
alt_col (str): the column containing the alternate base.
maf_threshold (float): the threshold for minor allele frequency.
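
The INFO-column convention described for gene_col, af_col and del_col can be illustrated with a minimal sketch. It assumes the standard VCF layout of semicolon-separated key=value entries; the function name parse_info and the sample string are hypothetical and are not part of genrisk.

```python
def parse_info(info_field):
    """Split a VCF INFO string such as 'AF=0.01;gene=GENE1' into a dict."""
    values = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue  # an empty INFO field is written as '.'
        key, sep, value = entry.partition("=")
        # Flag entries have no '='; store True for them.
        values[key] = value if sep else True
    return values


# Hypothetical INFO string; the keys mirror the -f, -g and -d values
# used in the example command above.
info = parse_info("AF=0.004;gene=GENE1;RawScore=2.7;DB")
print(info["gene"], float(info["AF"]), float(info["RawScore"]))
```

With such a layout, the identifiers passed on the command line (AF, gene, RawScore) are the keys of the parsed dictionary.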

Returns

DataFrame information: information about the final scores DataFrame. The DataFrame is saved to the output path given in the arguments.
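
To make the weight_func, beta_param and maf_threshold parameters more concrete, here is a minimal sketch of how an allele-frequency weight might enter a gene-based score. It is an illustration under stated assumptions, not genrisk's implementation: the beta weight is taken to be the Beta(a, b) density at the allele frequency, the log10 weight is taken to be -log10(AF), and the score is assumed to be the sum of deleteriousness × weight × alternate-allele count over variants strictly below the threshold. All function and key names are hypothetical.

```python
import math


def beta_weight(af, a=1.0, b=25.0):
    """Beta(a, b) density evaluated at the allele frequency (0 < af < 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(af) + (b - 1) * math.log(1 - af))


def log10_weight(af):
    """Negative log10 of the allele frequency, so rarer variants weigh more."""
    return -math.log10(af)


def gene_score(variants, weight_func=beta_weight, maf_threshold=0.01):
    """Assumed form: sum of deleteriousness * weight * allele count for rare variants."""
    return sum(
        v["del_score"] * weight_func(v["af"]) * v["alt_count"]
        for v in variants
        if v["af"] < maf_threshold  # strict comparison is an assumption
    )


# Hypothetical variants of one gene in one sample.
variants = [
    {"del_score": 2.0, "af": 0.001, "alt_count": 1},
    {"del_score": 5.0, "af": 0.2, "alt_count": 2},  # too common, filtered out
]
print(gene_score(variants, weight_func=log10_weight))
```

Both weights grow as the allele frequency shrinks, so rare variants contribute more to the score than common ones.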

genrisk score-genes [OPTIONS]

Options

-a, --annotation-file <annotation_file>: Required. An annotation file containing variant IDs, alt, AF and deleteriousness scores.

-b, --bfiles <bfiles>: Required. The binary files that contain the sample information.

--plink <plink>: the directory of plink, if not set in the environment.

-t, --temp-dir <temp_dir>: Required. A temporary directory to save temporary files before merging.

-o, --output-file <output_file>: Required. The final output path.

-p, --beta-param <beta_param>

the parameters of the beta weight function.

Default:: 1.0, 25.0

-w, --weight-func <weight_func>

the weighting function used in score calculation.

Default:: beta
Options:: beta | log10

-v, --variant-col <variant_col>

the column containing the variant IDs.

Default:: SNP

-g, --gene-col <gene_col>

the column containing gene names.

Default:: Gene.refGene

-f, --af-col <af_col>

the column containing allele frequency.

Default:: MAF

-d, --del-col <del_col>

the column containing the deleteriousness score.

Default:: CADD_raw

-l, --alt-col <alt_col>

the column containing the alternate base.

Default:: Alt

-m, --maf-threshold <maf_threshold>

the threshold for minor allele frequency.

Default:: 0.01

-k, --keep: if flagged temporary files will not be deleted.

genrisk normalize

Normalize/standardize data.

Example

$ genrisk normalize --data-file toy_example/toy_dataset_scores --method gene_length --samples-col IID \
    --output-file toy_dataset_scores_normalized.tsv

Parameters

genes_info (str): the file containing gene names and lengths. If not provided, the Ensembl database is used to retrieve the data.
method (str): the method of normalizing data. [gene_length | zscore | minmax | maxabs | robust]
data_file (str): the file containing the data to be normalized.
samples_col (str): the column containing sample IDs.
genes_col (str): the column containing gene names. Ignored if the genes_info file is not provided.
lengths_col (str): the column containing gene lengths. Ignored if the genes_info file is not provided.
output_file (str): the name of the file for the final output.

Returns

DataFrame with normalized data.
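
The method names can be read as the usual scaling formulas. The sketch below shows one plausible reading of each in plain Python; it is not genrisk's code. In particular, the use of the population standard deviation for zscore, the inclusive quartile method for robust, and a simple division by length for gene_length are assumptions.

```python
import statistics


def zscore(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


def minmax(values):
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def maxabs(values):
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]


def robust(values):
    # Centre on the median and scale by the interquartile range.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def by_gene_length(values, length):
    # Assumed meaning of gene_length: divide each score by the gene's length.
    return [v / length for v in values]


scores = [1, 2, 3, 4, 5]  # hypothetical scores of one gene across samples
print(minmax(scores), robust(scores))
```

Each function fails with a division by zero when the scores of a gene are constant, so a real pipeline would need to handle that case.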

genrisk normalize [OPTIONS]

Options

--method <method>

Required

Options:: gene_length | zscore | minmax | maxabs | robust

--data-file <data_file>: Required

--genes-info <genes_info>

-m, --samples-col <samples_col>

the name of the column that contains the samples.

Default:: IID

--genes-col <genes_col>

Default:: HGNC symbol

--lengths-col <lengths_col>

Default:: gene_length

-o, --output-file <output_file>: Required. The final output path.

genrisk find-association

Calculate the P-value between two given groups.

Example

$ genrisk find-association --scores-file toy_example/toy_dataset_scores \
    --info-file toy_example/toy.pheno --phenotype trait1,trait2 --samples-col IID --test logit \
    --covariates age,sex --adj-pval bonferroni

Parameters

scores_file (str): the file containing gene-based scores.
info_file (str): the file containing the phenotype.
genes (str): a file that contains a list of genes for which to calculate p-values. If not provided, all genes in the scoring file will be used.
phenotype (str): the name of the column with phenotypes. Phenotypes can be either binary or quantitative.
samples_col (str): the name of the column with sample IDs. All files need to have the same format.
test (str): the statistical test used for calculating p-values.
adj_pval (str, optional): the method used to adjust the p-values.
covariates (str, optional): the covariates used for calculation. Not all tests can include covariates (e.g. Mann-Whitney U does not allow covariates).
processes (int, optional): if more than one process is selected, the function will be parallelized.

Returns

DataFrame information: information about the final DataFrame. The DataFrame is saved to the output path given in the arguments.
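
As an illustration of what the adj_pval methods do, the sketch below implements two of the listed names, bonferroni and fdr_bh (Benjamini-Hochberg), in plain Python. It shows the textbook definitions only; how genrisk computes the adjustment internally is not stated here, and the p-values are made up for the example.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-gene p-values
print(bonferroni(pvals))
print(fdr_bh(pvals))
```

Bonferroni controls the family-wise error rate and is the stricter of the two; fdr_bh controls the false discovery rate and typically keeps more genes.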

genrisk find-association [OPTIONS]

Options

-s, --scores-file <scores_file>: Required. The scoring file of genes across a population.

-i, --info-file <info_file>: Required. The file containing information about the cohort.

-g, --genes <genes>: a file containing the genes to calculate. If not provided, all genes will be used.

-t, --test <test>

Required. The statistical test for calculating the p-value.

Options:: ttest_ind | mannwhitneyu | logit | linear

-c, --phenotype <phenotype>: Required. The name of the column that contains the case/control or quantitative values.

-m, --samples-col <samples_col>

the name of the column that contains the samples.

Default:: IID

-a, --adj-pval <adj_pval>

Options:: bonferroni | sidak | holm-sidak | holm | simes-hochberg | hommel | fdr_bh | fdr_by | fdr_tsbh | fdr_tsbky

-v, --covariates <covariates>: the covariates used for calculation

-p, --processes <processes>

number of processes for parallelization

Default:: 1

--zero-threshold <zero_threshold>: the threshold for the frequency of zeros per gene to be included

genrisk visualize

Visualize a Manhattan plot and a Q-Q plot for the data.

Example

$ genrisk visualize --pvals-file toy_example/toy_dataset_scores \
    --info-file annotated_toy_dataset.vcf

Parameters

pvals_file (str): the file containing the calculated p-values.
info_file (str): the file containing variant/gene info.
genescol_1 (str): the name of the genes column in the pvals file.
genescol_2 (str): the name of the genes column in the info file.
pval_col (str): the name of the p-values column.
chr_col (str): the name of the chromosomes column.
pos_col (str): the name of the position/start column.

Returns

genrisk visualize [OPTIONS]

Options

-p, --pvals-file <pvals_file>: Required. The file containing p-values.

-i, --info-file <info_file>: file containing variant/gene info.

--genescol-1 <genescol_1>

the name of the genes column in pvals file.

Default:: genes

--genescol-2 <genescol_2>

the name of the genes column in info file.

Default:: Gene.refGene

-v, --pval-col <pval_col>

the name of the pvalues column.

Default:: p_value

-c, --chr-col <chr_col>

the name of the chromosomes column

Default:: Chr

-s, --pos-col <pos_col>

the name of the position/start of the gene column

Default:: Start
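
For readers unfamiliar with the two plots produced by visualize, the following sketch computes the coordinates that such plots conventionally use. It is a generic illustration, not genrisk's plotting code: the function names are hypothetical, and the expected quantiles i / (n + 1) are one common convention among several.

```python
import math


def qq_points(pvals):
    """Return (expected, observed) -log10 p-value pairs for a Q-Q plot."""
    n = len(pvals)
    observed = sorted(-math.log10(p) for p in pvals)
    # Expected quantiles under the null hypothesis of uniform p-values.
    expected = sorted(-math.log10(i / (n + 1)) for i in range(1, n + 1))
    return list(zip(expected, observed))


def manhattan_points(rows):
    """rows: (chromosome, position, p_value) tuples, returned in genome order."""
    return sorted((chrom, pos, -math.log10(p)) for chrom, pos, p in rows)


# Hypothetical genes: chromosome number, start position and p-value.
rows = [(2, 100, 0.01), (1, 500, 0.1), (1, 900, 0.001)]
print(manhattan_points(rows))
print(qq_points([p for _, _, p in rows]))
```

In a Manhattan plot the third value is drawn against genomic position, which is why the chromosome and position columns of the info file are needed.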

genrisk create-model

Create a prediction model with given dataset.

Example

$ genrisk create-model --data-file toy_example_regressor_features.tsv --model-type regressor \
    --output-folder toy_regressor --test-size 0.25 --test --model-name toy_regressor \
    --target-col trait1 --imbalanced --normalize

Notes

The types of models available for training can be found in model_types.

Parameters

data_file (str): the file containing features and target.
output_folder (str): a folder path to save all outputs.
test_size (float): the size of the testing set.
test (bool): if True, the dataset will be split into training and testing sets for extra evaluation after finalization.
model_name (str): the name of the model to be saved.
model_type (str): the type of model. [regressor | classifier]
target_col (str): the name of the target column in the data file.
imbalanced (bool): if True, methods will be used to account for the imbalance.
normalize (bool): if True, the data will be normalized before training.
normalize_method (str): the method used to normalize data. [zscore | minmax | maxabs | robust]
folds (int): the number of folds used for cross-validation.
metric (str): the metric used to choose the best model after training.
samples_col (str): the name of the column with sample IDs.
seed (int): the random seed used to run the machine learning models.
include_models (str): a list of specific models to compare. More information is available in the documentation.

Returns

Final prediction model
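
The test_size, seed and folds parameters follow the usual hold-out and cross-validation scheme. The sketch below illustrates that scheme in plain Python under stated assumptions; genrisk delegates this work to its machine learning backend, and the function names and rounding rule here are hypothetical.

```python
import random


def train_test_split(sample_ids, test_size=0.25, seed=None):
    """Shuffle reproducibly, then hold out a test_size fraction of the samples."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_size)
    return ids[n_test:], ids[:n_test]


def kfold_indices(n_samples, folds=10):
    """Yield (train_idx, validation_idx) index lists, one pair per fold."""
    for k in range(folds):
        validation = list(range(k, n_samples, folds))
        train = [i for i in range(n_samples) if i % folds != k]
        yield train, validation


train, test = train_test_split(range(8), test_size=0.25, seed=42)
print(len(train), len(test))
for train_idx, val_idx in kfold_indices(len(train), folds=3):
    print(train_idx, val_idx)
```

Passing the same seed always yields the same split, which is the purpose of the --seed option.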

genrisk create-model [OPTIONS]

Options

-d, --data-file <data_file>: Required. The file with all features and the target for training the model.

-o, --output-folder <output_folder>: Required. The path of the folder that will contain all outputs.

-i, --test-size <test_size>

test size for cross validation and evaluation.

Default:: 0.25

-n, --model-name <model_name>: Required. The name of the model file.

--model-type <model_type>

Required. The type of prediction model.

Options:: regressor | classifier

-l, --target-col <target_col>: Required. The name of the target column in data_file.

-b, --imbalanced: if flagged, methods will be used to account for the imbalance.

--normalize: if flagged, the data will be normalized before training.

--normalize-method <normalize_method>

features normalization method.

Default:: zscore
Options:: zscore | minmax | maxabs | robust

-f, --folds <folds>

number of cross-validation folds in training.

Default:: 10

--metric <metric>: the metric used to choose the best model after training.

-m, --samples-col <samples_col>

the name of the column that contains the samples.

Default:: IID

--seed <seed>: a number used to make the train/test split reproducible.

--include-models <include_models>: specific models to compare, separated by commas, e.g. lr,gbr,dt

--feature-selection: if flagged, feature selection will be applied during training.

genrisk test-model

Evaluate a prediction model with a given dataset.

Example

$ genrisk test-model --model-path regressor_model.pkl --input-file testing_dataset.tsv \
    --model-type regressor --label-col target --samples-col IID

Parameters

model_path (str): the path to the ML model.
input_file (str): the testing (independent) dataset.
model_type (str): the type of model. [classifier | regressor]
label_col (str): the labels/target column.
samples_col (str): the sample IDs column.
output_file (str): the path to the DataFrame with the prediction results.

Returns

DataFrame: dataframe with the prediction results.
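
Once the predictions are available, they can be compared with the label column using standard metrics. The sketch below shows three common ones in plain Python. Which metrics genrisk itself reports is not stated in this section, so the choice of RMSE, R² and accuracy is an assumption made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a regressor."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination for a regressor."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels for a classifier."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions.
print(rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))
```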

genrisk test-model [OPTIONS]

Options

-t, --model-type <model_type>

Required. The type of prediction model.

Options:: regressor | classifier

-i, --input-file <input_file>: Required. The testing dataset.

-l, --label-col <label_col>: Required. The target/phenotype/label column.

-m, --model-path <model_path>: Required. The path to the trained model.

-s, --samples-col <samples_col>

the samples column.

Default:: IID

-o, --output-file <output_file>: Required. The final output path.

genrisk get-prs

Calculate PRS. This command is interactive: it takes a PGS file (provided by the user or downloaded), then calculates the PRS for the dataset.
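
Conceptually, a polygenic risk score combines the effect weights from the PGS file with each sample's genotypes. The sketch below shows that arithmetic as a plain weighted sum. It is an illustration only: the command relies on plink for the actual computation, which may scale or average the sum differently, and the variant IDs and weights here are made up.

```python
def polygenic_score(dosages, weights):
    """Sum effect weight * effect-allele dosage over variants present in both inputs."""
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical effect weights (as a PGS file would list them) and one
# sample's effect-allele dosages (0, 1 or 2 copies).
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}
dosages = {"rs1": 2, "rs2": 1}
print(polygenic_score(dosages, weights))
```

Variants that are missing from the sample's genotypes (rs3 in this example) are skipped rather than imputed in this sketch.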

Example

This function is run from the command-line interface:

$ genrisk get-prs

Parameters

plink (str): the path to plink, if it is not the default in the environment.

Returns

genrisk get-prs [OPTIONS]

Options

-p, --plink <plink>