Command Line Interface
Note
Detailed information about the functions can be found in the pipeline.
The genrisk command line interface includes multiple commands which can be used as follows:
genrisk score-genes
Calculate the gene-based scores for a given dataset.
Example
$ genrisk score-genes -a /path/to/toy_vcf_info.vcf -o toy_genes_scores.tsv -t toy_vcf_scoring -v ID -f AF -g gene -l ALT -d RawScore
Parameters
- annotation_file : str
an annotation file containing variant IDs, alt, AF and deleterious scores.
- bfiles : str
the binary files for the plink process.
- plink : str
the location of plink, if not set in the environment.
- beta_param : tuple
the parameters of the beta weight function.
- temp_dir : str
a temporary directory to save temporary files before merging.
- output_file : str
the location and name of the final output scores matrix.
- weight_func : str
the weighting function applied to the allele frequency in the score calculation. [beta|log10]
- variant_col : str
the column containing the variant IDs.
- gene_col : str
the column containing gene names. If the genes are in the INFO column, use the identifier of the value (e.g. for gene=IF, the identifier is ‘gene’).
- af_col : str
the column containing allele frequency. If it is in the INFO column, use the identifier as described for gene_col.
- del_col : str
the column containing the deleteriousness score (functional annotation). If it is in the INFO column, use the identifier as described for gene_col.
- alt_col : str
the column containing the alternate base.
- maf_threshold : float
the threshold for minor allele frequency.
Returns
- DataFrame information
information about the final scores DataFrame. The DataFrame is saved to the output path given in the arguments.
genrisk score-genes [OPTIONS]
Options
- -a, --annotation-file <annotation_file>
Required. An annotation file containing variant IDs, alt, AF and deleterious scores.
- -b, --bfiles <bfiles>
Required. The binary files that contain the sample information.
- --plink <plink>
the directory of plink, if not set in the environment.
- -t, --temp-dir <temp_dir>
Required. A temporary directory to save temporary files before merging.
- -o, --output-file <output_file>
Required. The final output path.
- -p, --beta-param <beta_param>
the parameters of the beta weight function.
- Default:
1.0, 25.0
- -w, --weight-func <weight_func>
the weighting function used in score calculation.
- Default:
beta
- Options:
beta | log10
- -v, --variant-col <variant_col>
the column containing the variant IDs.
- Default:
SNP
- -g, --gene-col <gene_col>
the column containing gene names.
- Default:
Gene.refGene
- -f, --af-col <af_col>
the column containing allele frequency.
- Default:
MAF
- -d, --del-col <del_col>
the column containing the deleteriousness score.
- Default:
CADD_raw
- -l, --alt-col <alt_col>
the column containing the alternate base.
- Default:
Alt
- -m, --maf-threshold <maf_threshold>
the threshold for minor allele frequency.
- Default:
0.01
- -k, --keep
if flagged, temporary files will not be deleted.
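The two weight_func choices can be illustrated with a short sketch. It assumes that beta evaluates the Beta(a, b) density at the allele frequency (using the default parameters 1.0 and 25.0), that log10 means -log10(AF), and that a gene score sums deleteriousness × weight × allele count over variants at or below the MAF threshold. The function names and the input layout are hypothetical, and the internal genrisk implementation may differ.

```python
import math


def beta_weight(af, a=1.0, b=25.0):
    """Beta(a, b) probability density evaluated at allele frequency af."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return af ** (a - 1) * (1 - af) ** (b - 1) / norm


def log10_weight(af):
    """Negative log10 of the allele frequency (rarer variants weigh more)."""
    return -math.log10(af)


def gene_score(variants, weight_func=beta_weight, maf_threshold=0.01):
    """Sum del_score * weight(af) * allele_count over sufficiently rare variants.

    `variants` is a list of (af, del_score, allele_count) tuples for one
    sample and one gene; this layout is an assumption made for the sketch.
    """
    total = 0.0
    for af, del_score, allele_count in variants:
        if af > maf_threshold:  # skip variants above the MAF threshold
            continue
        total += del_score * weight_func(af) * allele_count
    return total


# Three variants in one gene for one sample: (AF, deleteriousness, allele count)
variants = [(0.005, 2.0, 1), (0.2, 3.0, 2), (0.001, 1.5, 0)]
print(gene_score(variants))                            # beta weighting
print(gene_score(variants, weight_func=log10_weight))  # log10 weighting
```

Both weightings give rarer variants more influence; the second variant is dropped because its frequency exceeds the threshold, and the third contributes nothing because the sample does not carry it.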
genrisk normalize
Normalize/standardize data.
Example
$ genrisk normalize --data-file toy_example/toy_dataset_scores --method gene_length --samples-col IID
--output-file toy_dataset_scores_normalized.tsv
Parameters
- genes_info : str
the file containing gene names and lengths. If not provided, the Ensembl database is used to retrieve the data.
- method : str
the method of normalizing data. [gene_length|zscore|minmax|maxabs|robust]
- data_file : str
the file containing the data to be normalized.
- samples_col : str
the column containing sample IDs.
- genes_col : str
the column containing gene names. Ignored if the genes_info file is not provided.
- lengths_col : str
the column containing gene lengths. Ignored if the genes_info file is not provided.
- output_file : str
the name of the file for the final output.
Returns
DataFrame with normalized data.
genrisk normalize [OPTIONS]
Options
- --method <method>
Required
- Options:
gene_length | zscore | minmax | maxabs | robust
- --data-file <data_file>
Required
- --genes-info <genes_info>
- -m, --samples-col <samples_col>
the name of the column that contains the samples.
- Default:
IID
- --genes-col <genes_col>
- Default:
HGNC symbol
- --lengths-col <lengths_col>
- Default:
gene_length
- -o, --output-file <output_file>
Required. The final output path.
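As a rough illustration of what the method choices mean, the sketch below implements each one on a plain list of scores. It assumes the names follow the usual definitions (the same ones used by the scikit-learn scalers) and that gene_length divides a score by the length of the gene; the function names are hypothetical and the genrisk implementation may differ.

```python
import statistics


def zscore(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def minmax(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def maxabs(values):
    """Divide by the largest absolute value, keeping the sign."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]


def robust(values):
    """Center on the median and scale by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def gene_length(score, length):
    """Divide a gene score by the gene length (assumed meaning of gene_length)."""
    return score / length


scores = [1, 2, 3, 4, 5]
print(minmax(scores))
print(robust(scores))
print(gene_length(12.0, 3000))
```

None of these helpers guards against a constant column (zero spread), which would raise a ZeroDivisionError in this sketch.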
genrisk find-association
Calculate the P-value between two given groups.
Example
$ genrisk find-association --scores-file toy_example/toy_dataset_scores --info-file
toy_example/toy.pheno --phenotype trait1,trait2 --samples-col IID --test logit
--covariates age,sex --adj-pval bonferroni
Parameters
- scores_file : str
the file containing gene-based scores.
- info_file : str
the file containing the phenotype.
- genes : str
a file that contains a list of genes for which to calculate p-values. If not provided, all genes in the scores file are used.
- phenotype : str
the name of the column with phenotypes. Phenotypes can be either binary or quantitative.
- samples_col : str
the name of the column with sample IDs. All files need to have the same format.
- test : str
the statistical test used for calculating p-values.
- adj_pval : str, optional
the method used to adjust the p-values.
- covariates : str, optional
the covariates used for calculation. Not all tests can include covariates (e.g. Mann-Whitney U does not allow covariates).
- processes : int, optional
if more than one processor is selected, the function will be parallelized.
Returns
- DataFrame information
information about the final DataFrame. The DataFrame is saved to the output path given in the arguments.
genrisk find-association [OPTIONS]
Options
- -s, --scores-file <scores_file>
Required. The scoring file of genes across a population.
- -i, --info-file <info_file>
Required. File containing information about the cohort.
- -g, --genes <genes>
a file containing the genes to calculate. If not provided, all genes will be used.
- -t, --test <test>
Required. Statistical test for calculating the p-value.
- Options:
ttest_ind | mannwhitneyu | logit | linear
- -c, --phenotype <phenotype>
Required. The name of the column that contains the case/control or quantitative values.
- -m, --samples-col <samples_col>
the name of the column that contains the samples.
- Default:
IID
- -a, --adj-pval <adj_pval>
- Options:
bonferroni | sidak | holm-sidak | holm | simes-hochberg | hommel | fdr_bh | fdr_by | fdr_tsbh | fdr_tsbky
- -v, --covariates <covariates>
the covariates used for calculation
- -p, --processes <processes>
number of processes for parallelization
- Default:
1
- --zero-threshold <zero_threshold>
the threshold for the frequency of zeros per gene to be included
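To make the effect of --adj-pval concrete, here is a minimal sketch of two of the listed corrections, bonferroni and fdr_bh (Benjamini-Hochberg). The option names coincide with the method names accepted by the statsmodels multipletests function, so it is assumed here that they carry the same meaning; the helper functions below are hypothetical and written from the textbook definitions.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(p * m, 1.0) for p in pvals]


def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


gene_pvals = [0.01, 0.04, 0.03]
print(bonferroni(gene_pvals))
print(fdr_bh(gene_pvals))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; the Benjamini-Hochberg procedure controls the false discovery rate and usually retains more genes.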
genrisk visualize
Create a Manhattan plot and a Q-Q plot for the data.
Example
$ genrisk visualize --pvals-file toy_example/toy_dataset_scores
--info-file annotated_toy_dataset.vcf
Parameters
- pvals_file : str
the file containing the calculated p-values.
- info_file : str
file containing variant/gene info.
- genescol_1 : str
the name of the genes column in pvals file.
- genescol_2 : str
the name of the genes column in info file.
- pval_col : str
the name of the p-values column.
- chr_col : str
the name of the chromosomes column.
- pos_col : str
the name of the position/start column.
Returns
genrisk visualize [OPTIONS]
Options
- -p, --pvals-file <pvals_file>
Required. The file containing p-values.
- -i, --info-file <info_file>
file containing variant/gene info.
- --genescol-1 <genescol_1>
the name of the genes column in pvals file.
- Default:
genes
- --genescol-2 <genescol_2>
the name of the genes column in info file.
- Default:
Gene.refGene
- -v, --pval-col <pval_col>
the name of the pvalues column.
- Default:
p_value
- -c, --chr-col <chr_col>
the name of the chromosomes column
- Default:
Chr
- -s, --pos-col <pos_col>
the name of the position/start of the gene column
- Default:
Start
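For readers unfamiliar with the two plots, the sketch below shows the values that are typically drawn: a Manhattan plot uses -log10(p) on the y-axis, and a Q-Q plot compares the sorted observed -log10(p) values with those expected under a uniform null distribution. The plotting position (i - 0.5) / n and the function name are assumptions made for this sketch, not a description of how genrisk computes its figures.

```python
import math


def qq_points(pvals):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot."""
    n = len(pvals)
    # Observed values, ascending (largest p-value first on the -log10 scale).
    observed = sorted(-math.log10(p) for p in pvals)
    # Expected values under a uniform null, using (i - 0.5) / n quantiles.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


for expected, observed in qq_points([0.2, 0.01, 0.6, 0.0005]):
    print(f"{expected:.3f}\t{observed:.3f}")
```

Points that rise well above the diagonal at the upper end indicate genes with smaller p-values than expected by chance.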
genrisk create-model
Create a prediction model with given dataset.
Example
$ genrisk create-model --data-file toy_example_regressor_features.tsv --model-type regressor --output-folder toy_regressor --test-size 0.25 --test --model-name toy_regressor --target-col trait1 --imbalanced --normalize
Notes
The types of models available for training can be found in model_types.
Parameters
- data_file : str
file containing features and target.
- output_folder : str
a folder path to save all outputs.
- test_size : float
the size of the testing set.
- test : bool
if True, the dataset is split into training and testing sets for extra evaluation after finalization.
- model_name : str
the name of the model to be saved.
- model_type : str
the type of model [regressor|classifier].
- target_col : str
the name of the target column in the data file.
- imbalanced : bool
if True, methods will be used to account for the imbalance.
- normalize : bool
if True, the data will be normalized before training.
- normalize_method : str
the method used to normalize data. [zscore|minmax|maxabs|robust]
- folds : int
the number of folds used for cross-validation.
- metric : str
the metric used to choose the best model after training.
- samples_col : str
the name of the column with sample IDs.
- seed : int
random seed number used to run the machine learning models.
- include_models : str
a list of specific models to compare. More information is available in the documentation.
Returns
Final prediction model
genrisk create-model [OPTIONS]
Options
- -d, --data-file <data_file>
Required. File with all features and target for training the model.
- -o, --output-folder <output_folder>
Required. Path of the folder that will contain all outputs.
- -i, --test-size <test_size>
test size for cross validation and evaluation.
- Default:
0.25
- -n, --model-name <model_name>
Required. Name of the model file.
- --model-type <model_type>
Required. Type of prediction model.
- Options:
regressor | classifier
- -l, --target-col <target_col>
Required. Name of the target column in data_file.
- -b, --imbalanced
if flagged, methods will be used to account for the imbalance.
- --normalize
if flagged, the data will be normalized before training.
- --normalize-method <normalize_method>
features normalization method.
- Default:
zscore
- Options:
zscore | minmax | maxabs | robust
- -f, --folds <folds>
number of cross-validation folds in training.
- Default:
10
- --metric <metric>
the metric used to choose best model after training.
- -m, --samples-col <samples_col>
the name of the column that contains the samples.
- Default:
IID
- --seed <seed>
a number used to create reproducible train/test splitting.
- --include-models <include_models>
choose specific models to compare, separated by commas, e.g. lr,gbr,dt
- --feature-selection
if selected, feature selection will be implemented in training.
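The roles of --test-size, --folds and --seed can be pictured with a small sketch: a seeded shuffle holds out a fraction of the samples for testing, and the remaining samples are divided into cross-validation folds. The helper names are hypothetical and the logic is a generic illustration; genrisk most likely delegates these steps to its underlying machine learning library, whose exact splitting rules may differ.

```python
import random


def train_test_split_indices(n_samples, test_size=0.25, seed=None):
    """Shuffle sample indices with a seed and hold out a test fraction."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)


def kfold_indices(indices, folds=10):
    """Deal the training indices into `folds` nearly equal validation folds."""
    return [indices[i::folds] for i in range(folds)]


train, test = train_test_split_indices(100, test_size=0.25, seed=42)
print(len(train), len(test))

for k, fold in enumerate(kfold_indices(train, folds=10)):
    print(k, len(fold))
```

Running the split twice with the same seed returns the same partition, which is what makes a training run repeatable.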
genrisk test-model
Evaluate a prediction model with a given dataset.
Example
$ genrisk test-model --model-path regressor_model.pkl --input-file testing_dataset.tsv --model-type regressor --label-col target --samples-col IID
Parameters
- model_path : str
the path to the ML model.
- input_file : str
the testing (independent) dataset.
- model_type : str
the type of model [classifier|regressor].
- label_col : str
the labels/target column.
- samples_col : str
the sample IDs column.
- output_file : str
the path to the dataframe with the prediction results.
Returns
- DataFrame
dataframe with the prediction results.
genrisk test-model [OPTIONS]
Options
- -t, --model-type <model_type>
Required. Type of prediction model.
- Options:
regressor | classifier
- -i, --input-file <input_file>
Required. The testing dataset.
- -l, --label-col <label_col>
Required. The target/phenotype/label column.
- -m, --model-path <model_path>
Required. Path to the trained model.
- -s, --samples-col <samples_col>
the samples column.
- Default:
IID
- -o, --output-file <output_file>
Required. The final output path.
genrisk get-prs
Calculate the polygenic risk score (PRS). This command is interactive: it obtains a PGS file (provided by the user or downloaded) and then calculates the PRS for the dataset.
Example
This function is run from the command line interface:
$ genrisk get-prs
Parameters
- plink : str
the path to plink, if it is not set in the environment.
Returns
genrisk get-prs [OPTIONS]
Options
- -p, --plink <plink>
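Conceptually, a PRS is a weighted sum: each variant's effect weight from the PGS file is multiplied by the number of effect alleles the sample carries, and the products are added up. The sketch below shows only that arithmetic; the function name, variant IDs and weights are invented for illustration, and the actual calculation in this command is carried out through plink.

```python
def polygenic_score(dosages, weights):
    """Sum effect weight * effect-allele dosage over variants with a weight.

    `dosages` maps variant ID -> effect-allele count (0, 1 or 2) for one
    sample; `weights` maps variant ID -> effect weight from a scoring file.
    Variants missing from `weights` are ignored.
    """
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


# Hypothetical variant IDs and effect weights, for illustration only.
weights = {"rs1": 0.2, "rs2": -0.1}
sample_dosages = {"rs1": 2, "rs2": 1, "rs3": 1}
print(polygenic_score(sample_dosages, weights))
```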