Command Line Interface

Note

Detailed information about the functions can be found in the pipeline.

The genrisk command line interface includes multiple commands which can be used as follows:

genrisk score-genes

Calculate the gene-based scores for a given dataset.

Example

$ genrisk score-genes -a /path/to/toy_vcf_info.vcf -o toy_genes_scores.tsv -t toy_vcf_scoring -v ID -f AF -g gene -l ALT -d RawScore

Parameters

annotation_file : str

an annotation file containing variant IDs, alt, AF and deleteriousness scores.

bfiles : str

the binary files for the plink process.

plink : str

the location of plink, if not set in the environment.

beta_param : tuple

the parameters of the beta weight function.

temp_dir : str

a temporary directory to save temporary files before merging.

output_file : str

the location and name of the final output scores matrix.

weight_func : str

the weighting function applied to allele frequency in score calculation. [beta|log10]

variant_col : str

the column containing the variant IDs.

gene_col : str

the column containing gene names. If the genes are in the INFO column, use the identifier of the value (e.g. for gene=IF, the identifier is ‘gene’).

af_col : str

the column containing allele frequency. If it is in the INFO column, give the identifier as in the gene_col example.

del_col : str

the column containing the deleteriousness score (functional annotation). If it is in the INFO column, give the identifier as in the gene_col example.

alt_col : str

the column containing the alternate base.

maf_threshold : float

the threshold for minor allele frequency.

Returns

DataFrame information

Information about the final scores DataFrame. The DataFrame is saved to the output path given in the arguments.
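The INFO-column convention described for gene_col, af_col and del_col can be illustrated with a minimal sketch. It assumes a standard semicolon-separated `key=value` INFO string; the helper name `info_value` and the sample values are hypothetical and are not part of genrisk.

```python
def info_value(info, identifier):
    """Return the value stored under `identifier` in an INFO string, or None."""
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if sep and key == identifier:
            return value
    return None


# Hypothetical INFO field for a single variant.
info = "AF=0.004;gene=GENE1;RawScore=2.31"

gene = info_value(info, "gene")             # identifier passed as -g gene
af = float(info_value(info, "AF"))          # identifier passed as -f AF
score = float(info_value(info, "RawScore")) # identifier passed as -d RawScore
```

With such a file, the identifiers `gene`, `AF` and `RawScore` are what would be passed to `-g`, `-f` and `-d`, as in the example command above.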

genrisk score-genes [OPTIONS]

Options

-a, --annotation-file <annotation_file>

Required an annotation file containing variant IDs, alt, AF and deleteriousness scores.

-b, --bfiles <bfiles>

Required provide binary files that contain the samples info

--plink <plink>

the directory of plink, if not set in environment

-t, --temp-dir <temp_dir>

Required a temporary directory to save temporary files before merging.

-o, --output-file <output_file>

Required the final output path

-p, --beta-param <beta_param>

the parameters from beta weight function.

Default:

1.0, 25.0

-w, --weight-func <weight_func>

the weighting function used in score calculation.

Default:

beta

Options:

beta | log10

-v, --variant-col <variant_col>

the column containing the variant IDs.

Default:

SNP

-g, --gene-col <gene_col>

the column containing gene names.

Default:

Gene.refGene

-f, --af-col <af_col>

the column containing allele frequency.

Default:

MAF

-d, --del-col <del_col>

the column containing the deleteriousness score.

Default:

CADD_raw

-l, --alt-col <alt_col>

the column containing the alternate base.

Default:

Alt

-m, --maf-threshold <maf_threshold>

the threshold for minor allele frequency.

Default:

0.01

-k, --keep

if flagged, temporary files will not be deleted.
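To give a feel for the two --weight-func choices, the sketch below evaluates a beta density and a negative log10 on allele frequency. This is an assumption about what "beta" and "log10" weighting mean here (the common convention in rare-variant scoring), not a copy of the genrisk implementation; the function names are hypothetical.

```python
import math


def beta_weight(af, a=1.0, b=25.0):
    """Beta(a, b) probability density evaluated at allele frequency `af`."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * af ** (a - 1) * (1 - af) ** (b - 1)


def log10_weight(af):
    """Negative log10 of the allele frequency."""
    return -math.log10(af)


# Under both weightings, rarer variants receive larger weights.
for af in (0.001, 0.01):
    print(af, beta_weight(af), log10_weight(af))
```

The defaults `a=1.0, b=25.0` mirror the documented default of --beta-param.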

genrisk normalize

Normalize/standardize data.

Example

$ genrisk normalize --data-file toy_example/toy_dataset_scores --method gene_length --samples-col IID \
    --output-file toy_dataset_scores_normalized.tsv

Parameters

genes_info : str

the file containing gene names and lengths. If not provided, the Ensembl database is used to retrieve the data.

method : str

the method of normalizing data. [gene_length|zscore|minmax|maxabs|robust]

data_file : str

the file containing data to be normalized.

samples_col : str

the column containing sample IDs.

genes_col : str

the column containing gene names. Ignored if the genes_info file is not provided.

lengths_col : str

the column containing gene lengths. Ignored if the genes_info file is not provided.

output_file : str

the name of the file for the final output.

Returns

DataFrame with normalized data.

genrisk normalize [OPTIONS]

Options

--method <method>

Required

Options:

gene_length | zscore | minmax | maxabs | robust

--data-file <data_file>

Required

--genes-info <genes_info>

-m, --samples-col <samples_col>

the name of the column that contains the samples.

Default:

IID

--genes-col <genes_col>

Default:

HGNC symbol

--lengths-col <lengths_col>

Default:

gene_length

-o, --output-file <output_file>

Required the final output path
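The --method choices correspond to familiar scaling schemes. The sketch below shows the textbook definitions on one column of scores using only the standard library. It is illustrative only: the function names are hypothetical, and treating gene_length as division of a gene's scores by its length is an assumption rather than something this page states.

```python
import statistics


def zscore(values):
    """Centre on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def minmax(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def maxabs(values):
    """Divide by the largest absolute value."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]


def robust(values):
    """Centre on the median and scale by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def by_gene_length(values, length):
    """Assumed meaning of gene_length: divide a gene's scores by its length."""
    return [v / length for v in values]


scores = [1.0, 2.0, 3.0, 4.0, 5.0]
print(minmax(scores))
print(robust(scores))
```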

genrisk find-association

Calculate the P-value between two given groups.

Example

$ genrisk find-association --scores-file toy_example/toy_dataset_scores --info-file \
    toy_example/toy.pheno --phenotype trait1,trait2 --samples-col IID --test logit \
    --covariates age,sex --adj-pval bonferroni

Parameters

scores_file : str

the file containing gene-based scores.

info_file : str

the file containing the phenotype.

genes : str

a file that contains a list of genes for which to calculate p-values. If not provided, all genes in the scoring file are used.

phenotype : str

the name of the column with phenotypes. Phenotypes can be either binary or quantitative.

samples_col : str

the name of the column with sample IDs. All files need to have the same format.

test : str

the statistical test used for calculating p-values.

adj_pval : str, optional

the method used to adjust the p-values.

covariates : str, optional

the covariates used for calculation. Not all tests can include covariates (e.g. Mann-Whitney U does not allow covariates).

processes : int, optional

if more than one process is selected, the function is parallelized.

Returns

DataFrame information

Information about the final DataFrame. The DataFrame is saved to the output path given in the arguments.

genrisk find-association [OPTIONS]

Options

-s, --scores-file <scores_file>

Required The scoring file of genes across a population.

-i, --info-file <info_file>

Required File containing information about the cohort.

-g, --genes <genes>

a file containing the genes to calculate. If not provided, all genes will be used.

-t, --test <test>

Required statistical test for calculating P value.

Options:

ttest_ind | mannwhitneyu | logit | linear

-c, --phenotype <phenotype>

Required the name of the column that contains the case/control or quantitative values.

-m, --samples-col <samples_col>

the name of the column that contains the samples.

Default:

IID

-a, --adj-pval <adj_pval>

Options:

bonferroni | sidak | holm-sidak | holm | simes-hochberg | hommel | fdr_bh | fdr_by | fdr_tsbh | fdr_tsbky

-v, --covariates <covariates>

the covariates used for calculation

-p, --processes <processes>

number of processes for parallelization

Default:

1

--zero-threshold <zero_threshold>

the threshold for the frequency of zeros per gene to be included
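As a reminder of what the --adj-pval choices do, here is a minimal sketch of two of the listed corrections, Bonferroni and Benjamini-Hochberg (fdr_bh), written with the standard library only. It follows the textbook definitions and is not taken from the genrisk source; the function names are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(pvals))
print(fdr_bh(pvals))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; fdr_bh controls the false discovery rate.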

genrisk visualize

Visualize a Manhattan plot and a Q-Q plot for the data.

Example

$ genrisk visualize --pvals-file toy_example/toy_dataset_scores \
    --info-file annotated_toy_dataset.vcf

Parameters

pvals_file : str

the file containing the calculated p-values.

info_file : str

the file containing variant/gene info.

genescol_1 : str

the name of the genes column in the p-values file.

genescol_2 : str

the name of the genes column in the info file.

pval_col : str

the name of the p-values column.

chr_col : str

the name of the chromosomes column.

pos_col : str

the name of the position/start column.

Returns

genrisk visualize [OPTIONS]

Options

-p, --pvals-file <pvals_file>

Required the file containing p-values.

-i, --info-file <info_file>

file containing variant/gene info.

--genescol-1 <genescol_1>

the name of the genes column in pvals file.

Default:

genes

--genescol-2 <genescol_2>

the name of the genes column in info file.

Default:

Gene.refGene

-v, --pval-col <pval_col>

the name of the pvalues column.

Default:

p_value

-c, --chr-col <chr_col>

the name of the chromosomes column

Default:

Chr

-s, --pos-col <pos_col>

the name of the position/start of the gene column

Default:

Start
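The column options above suggest that the p-values file and the info file are joined on their gene columns before plotting. The sketch below shows one way such a join and the usual plot coordinates could be computed. It is an assumption about the workflow, not the genrisk implementation; the function names and sample rows are hypothetical, and only the default column names come from this page.

```python
import math


def merge_on_gene(pvals_rows, info_rows, genescol_1="genes", genescol_2="Gene.refGene"):
    """Attach chromosome/position info to each p-value row by gene name."""
    info_by_gene = {row[genescol_2]: row for row in info_rows}
    return [
        {**info_by_gene[row[genescol_1]], **row}
        for row in pvals_rows
        if row[genescol_1] in info_by_gene
    ]


def manhattan_points(rows, chr_col="Chr", pos_col="Start", pval_col="p_value"):
    """(chromosome, position, -log10 p) triples for a Manhattan plot."""
    return [
        (row[chr_col], int(row[pos_col]), -math.log10(float(row[pval_col])))
        for row in rows
    ]


def qq_points(pvals):
    """(expected, observed) -log10 p pairs for a Q-Q plot under a uniform null."""
    n = len(pvals)
    observed = sorted(-math.log10(p) for p in pvals)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))


# Hypothetical rows as they might be read with csv.DictReader.
pvals_rows = [{"genes": "GENE1", "p_value": "0.001"}, {"genes": "GENE2", "p_value": "0.1"}]
info_rows = [
    {"Gene.refGene": "GENE1", "Chr": "1", "Start": "100"},
    {"Gene.refGene": "GENE2", "Chr": "2", "Start": "250"},
]
points = manhattan_points(merge_on_gene(pvals_rows, info_rows))
```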

genrisk create-model

Create a prediction model with given dataset.

Example

$ genrisk create-model --data-file toy_example_regressor_features.tsv --model-type regressor \
    --output-folder toy_regressor --test-size 0.25 --test --model-name toy_regressor \
    --target-col trait1 --imbalanced --normalize

Notes

The types of models available for training can be found in model_types.

Parameters

data_file : str

the file containing features and target.

output_folder : str

a folder path to save all outputs.

test_size : float

the size of the testing set.

test : bool

if True, the dataset is split into training and testing sets for extra evaluation after finalization.

model_name : str

the name of the model to be saved.

model_type : str

the type of model. [regressor|classifier]

target_col : str

the name of the target column in the data file.

imbalanced : bool

if True, methods will be used to account for the imbalance.

normalize : bool

if True, the data will be normalized before training.

normalize_method : str

the method used to normalize data. [zscore|minmax|maxabs|robust]

folds : int

the number of folds used for cross-validation.

metric : str

the metric used to choose the best model after training.

samples_col : str

the name of the column with sample IDs.

seed : int

random seed number to run the machine learning models.

include_models : str

list of specific models to compare. More information is available in the documentation.

Returns

Final prediction model

genrisk create-model [OPTIONS]

Options

-d, --data-file <data_file>

Required file with all features and target for training model.

-o, --output-folder <output_folder>

Required path of folder that will contain all outputs.

-i, --test-size <test_size>

test size for cross validation and evaluation.

Default:

0.25

-n, --model-name <model_name>

Required name of model file.

--model-type <model_type>

Required type of prediction model.

Options:

regressor | classifier

-l, --target-col <target_col>

Required name of target column in data_file.

-b, --imbalanced

if flagged, methods will be used to account for the imbalance.

--normalize

if flagged, the data will be normalized before training.

--normalize-method <normalize_method>

features normalization method.

Default:

zscore

Options:

zscore | minmax | maxabs | robust

-f, --folds <folds>

number of cross-validation folds in training.

Default:

10

--metric <metric>

the metric used to choose best model after training.

-m, --samples-col <samples_col>

the name of the column that contains the samples.

Default:

IID

--seed <seed>

add a number to create a reproducible train/test split.

--include-models <include_models>

choose specific models to compare, separated by commas, e.g. lr,gbr,dt

--feature-selection

if selected, feature selection will be implemented in training.
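The roles of --test-size, --seed and --folds can be pictured with a small standard-library sketch. It only illustrates the general idea of a seeded hold-out split followed by cross-validation folds on the training part; the helper names are hypothetical and the real command may split data differently.

```python
import random


def train_test_split(sample_ids, test_size=0.25, seed=None):
    """Shuffle with a fixed seed and hold out a fraction of the samples."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_size)
    return ids[n_test:], ids[:n_test]


def k_folds(sample_ids, folds=10):
    """Partition the samples into `folds` groups of near-equal size."""
    ids = list(sample_ids)
    return [ids[i::folds] for i in range(folds)]


ids = [f"S{i}" for i in range(20)]
train, test = train_test_split(ids, test_size=0.25, seed=42)
cv_folds = k_folds(train, folds=5)
```

Re-running with the same seed reproduces the same split, which is the point of passing --seed.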

genrisk test-model

Evaluate a prediction model with a given dataset.

Example

$ genrisk test-model --model-path regressor_model.pkl --input-file testing_dataset.tsv \
    --model-type regressor --label-col target --samples-col IID

Parameters

model_path : str

the path to the ML model.

input_file : str

the testing (independent) dataset.

model_type : str

the type of model. [classifier|regressor]

label_col : str

the labels/target column.

samples_col : str

the sample IDs column.

output_file : str

the path to the DataFrame with the prediction results.

Returns

DataFrame

dataframe with the prediction results.

genrisk test-model [OPTIONS]

Options

-t, --model-type <model_type>

Required type of prediction model.

Options:

regressor | classifier

-i, --input-file <input_file>

Required testing dataset

-l, --label-col <label_col>

Required the target/phenotype/label column

-m, --model-path <model_path>

Required path to the trained model.

-s, --samples-col <samples_col>

the samples column.

Default:

IID

-o, --output-file <output_file>

Required the final output path
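Evaluation depends on the model type: regressors are usually judged by error-based metrics and classifiers by agreement with the labels. The sketch below computes a few common metrics from true and predicted values. Which metrics genrisk actually reports is not stated on this page, so the choice of RMSE, R² and accuracy, and the function name `evaluate`, are assumptions.

```python
import math


def evaluate(model_type, y_true, y_pred):
    """Return basic evaluation metrics for a regressor or a classifier."""
    n = len(y_true)
    if model_type == "regressor":
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        mean = sum(y_true) / n
        ss_tot = sum((t - mean) ** 2 for t in y_true)
        return {"rmse": math.sqrt(ss_res / n), "r2": 1 - ss_res / ss_tot}
    if model_type == "classifier":
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / n}
    raise ValueError("model_type must be 'regressor' or 'classifier'")


reg = evaluate("regressor", [1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
clf = evaluate("classifier", [1, 0, 1, 1], [1, 0, 0, 1])
```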

genrisk get-prs

Calculate PRS. This command is interactive. This command gets a pgs file (provided by the user or downloaded) then calculates the PRS for dataset.

Example

This function is performed using commandline interface:

$ genrisk get-prs

Parameters

plink : str

provide plink path if not default in environment.

Returns

genrisk get-prs [OPTIONS]

Options

-p, --plink <plink>
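For orientation, a polygenic risk score is conventionally a weighted sum of effect-allele dosages, with weights taken from a scoring (PGS) file. The sketch below shows that sum for one sample. It is a conceptual illustration only: the command itself delegates the computation to plink, and the function name, variant IDs and weights here are hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Sum of effect weight times allele dosage over variants present in both."""
    return sum(weights[variant] * dosage
               for variant, dosage in dosages.items()
               if variant in weights)


# Hypothetical effect weights from a scoring file and dosages (0, 1 or 2)
# for one sample; rs4 has no weight and is ignored.
weights = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.05}
sample = {"rs1": 2, "rs2": 1, "rs4": 1}
prs = polygenic_risk_score(sample, weights)
```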