gmt-music-clinical-correlation: Correlate phenotypic traits against mutated genes, or against individual variants

gmt music clinical-correlation

VERSION

This document describes gmt music clinical-correlation version 0.04 (2013-05-14 at 16:03:04)

SYNOPSIS

gmt music clinical-correlation --bam-list=? --output-file=? [--maf-file=?] [--glm-clinical-data-file=?] [--use-maf-in-glm] [--skip-non-coding] [--skip-silent] [--clinical-correlation-matrix-file=?] [--input-clinical-correlation-matrix-file=?] [--genetic-data-type=?] [--numeric-clinical-data-file=?] [--numerical-data-test-method=?] [--categorical-clinical-data-file=?] [--glm-model-file=?]

 ... music clinical-correlation \
        --bam-list /path/myBamList.tsv \
        --maf-file /path/myMAF.tsv \
        --numeric-clinical-data-file /path/myNumericData.tsv \
        --genetic-data-type 'gene' \
        --output-file /path/output_file

 ... music clinical-correlation \
        --maf-file /path/myMAF.tsv \
        --bam-list /path/myBamList.tsv \
        --numeric-clinical-data-file /path/myNumericData.tsv \
        --categorical-clinical-data-file /path/myClassData.tsv \
        --genetic-data-type 'gene' \
        --output-file /path/output_file

 ... music clinical-correlation \
        --maf-file /path/myMAF.tsv \
        --bam-list /path/myBamList.tsv \
        --output-file /path/output_file \
        --glm-model-file /path/model.tsv \
        --glm-clinical-data-file /path/glm_clinical_data.tsv \
        --use-maf-in-glm

REQUIRED ARGUMENTS

bam-list Text: Tab-delimited list of BAM files [sample_name, normal_bam, tumor_bam] (See Description)
output-file Text: Results of clinical-correlation tool. Will have suffix added for data type

OPTIONAL ARGUMENTS

maf-file Text: List of mutations using TCGA MAF specification v2.3
glm-clinical-data-file Text: Clinical traits, mutational profiles, other mixed clinical data (See DESCRIPTION)
use-maf-in-glm Boolean: Create a variant matrix from the MAF file as variant input to GLM analysis. Default value 'false' (--nouse-maf-in-glm) if not specified
skip-non-coding Boolean: Skip non-coding mutations from the provided MAF file. Default value 'true' if not specified
skip-silent Boolean: Skip silent mutations from the provided MAF file. Default value 'true' if not specified
clinical-correlation-matrix-file Text: Specify a file to store the sample-vs-gene matrix created during calculations
input-clinical-correlation-matrix-file Text: Instead of creating this from the MAF, input the sample-vs-gene matrix for calculations
genetic-data-type Text: Correlate clinical data to "gene" or "variant" level data. Default value 'gene' if not specified
numeric-clinical-data-file Text: Table of samples (y) vs. numeric clinical data category (x)
numerical-data-test-method Text: Either 'cor' for Pearson Correlation or 'wilcox' for the Wilcoxon Rank-Sum Test for numerical clinical data. Default value 'cor' if not specified
categorical-clinical-data-file Text: Table of samples (y) vs. categorical clinical data category (x)
glm-model-file Text: File outlining the type of model, response variable, covariates, etc. for the GLM analysis. (See DESCRIPTION)

DESCRIPTION

This command relates clinical traits to mutational data. It can either perform correlation analysis between the mutations recorded in a MAF and the phenotypic traits recorded in clinical data files for the same samples, or run a generalized linear model (GLM) analysis on the same types of data.

The clinical data files for correlation must be split into numeric and categorical data, and must follow these conventions:

Headers are required
Each file must include at least 1 sample_id column and 1 attribute column, with the format being [sample_id clinical_data_attribute_1 clinical_data_attribute_2 ...]
The sample ID must match the sample ID listed in the MAF under "Tumor_Sample_Barcode" for relating the mutations of this sample.

Note the importance of the headers: the header for each clinical_data_attribute will appear in the output file to denote relationships with the mutation data from the MAF.
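As an illustration of these conventions, the following Python sketch writes a small numeric clinical data file and checks its layout. It is not part of the tool. The sample IDs, the attribute names (age, tumor_size_cm), the literal header text "sample_id", and both helper functions are hypothetical.

```python
import csv
import io

def write_clinical_table(handle, attributes, rows):
    """Write a tab-separated clinical data table with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # First column identifies the sample; the header text is an assumption.
    writer.writerow(["sample_id"] + list(attributes))
    for sample_id, values in rows.items():
        writer.writerow([sample_id] + list(values))

def check_clinical_table(text):
    """Return the header after checking the basic layout rules."""
    lines = [line.split("\t") for line in text.splitlines() if line]
    header = lines[0]
    if len(header) < 2:
        raise ValueError("need a sample ID column and at least one attribute column")
    for fields in lines[1:]:
        if len(fields) != len(header):
            raise ValueError("row width differs from header: %r" % (fields,))
    return header

# Hypothetical samples; IDs must match Tumor_Sample_Barcode in the MAF.
buffer = io.StringIO()
write_clinical_table(
    buffer,
    ["age", "tumor_size_cm"],
    {"SAMPLE_01": [61, 2.4], "SAMPLE_02": [48, 3.1]},
)
table_text = buffer.getvalue()
header = check_clinical_table(table_text)
```

The attribute headers written here (age, tumor_size_cm) are the labels that would then identify each trait in the results.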

Internally, the input data is fed into an R script, which calculates a P-value for each pairing of a gene (or variant) with a phenotypic trait. The P-value represents the probability that the correlation seen between the mutations and the trait is random. Lower P-values indicate that a correlation is less likely to be random and more likely to be real.

The results are saved to the output filename given with a suffix appended; ".numeric.csv" will be appended for results derived from numeric clinical data, and ".categorical.csv" will be appended for results derived from categorical clinical data. Also, ".glm.csv" will be appended to the output filename for GLM results.
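The suffix rule can be sketched in Python as follows. The function name and its keyword arguments are hypothetical, and the sketch assumes that a result file is produced only for the kinds of analysis that were actually requested.

```python
def expected_output_files(output_file, numeric=False, categorical=False, glm=False):
    """List the result paths derived from the --output-file value."""
    suffixes = []
    if numeric:
        suffixes.append(".numeric.csv")
    if categorical:
        suffixes.append(".categorical.csv")
    if glm:
        suffixes.append(".glm.csv")
    # The suffix is appended to the name as given, not substituted for an extension.
    return [output_file + suffix for suffix in suffixes]

paths = expected_output_files("/path/output_file", numeric=True, categorical=True)
```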

The GLM analysis accepts a clinical data file that mixes numeric and categorical data, supplied with the parameter --glm-clinical-data-file. GLM clinical data must adhere to the formats described above for the correlation clinical data files. GLM analysis also requires a --glm-model-file. This file requires specific headers and defines precisely the analysis to be performed. The conventions required for this file are:

Columns must be ordered as such:
[ analysis_type clinical_data_trait_name variant/gene_name covariates memo ]
The 'analysis_type' column must contain either "Q", indicating that a quantitative trait will be examined, or "B", indicating that a binary trait will be examined.
The 'clinical_data_trait_name' is the name of a clinical data trait defined by being a header in the --glm-clinical-data-file.
The 'variant/gene_name' can either be the name of one or more columns from the --glm-clinical-data-file, or the names of one or more mutated genes from the MAF, separated by "|". If this column is left blank, or instead contains "NA", then each column from either the variant mutation matrix (--use-maf-in-glm) or alternatively the --glm-clinical-data-file is used consecutively as the variant column in independent analyses.
'covariates' are the names of one or more columns from the --glm-clinical-data-file, separated by "+".
'memo' is any note deemed useful to the user. It will be printed in the output data file for reference.
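The following Python sketch shows how a model file that follows these conventions could be read and validated. It is an illustration only, not the parser the tool uses. It assumes that the header row uses the five column names exactly as listed above, and the trait names, gene names, and memo text in the example are hypothetical.

```python
import csv
import io

# Assumed header text, taken from the column order given above.
MODEL_COLUMNS = [
    "analysis_type",
    "clinical_data_trait_name",
    "variant/gene_name",
    "covariates",
    "memo",
]

def parse_glm_model(text):
    """Parse tab-separated model rows into dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    if header != MODEL_COLUMNS:
        raise ValueError("unexpected header: %r" % (header,))
    models = []
    for fields in reader:
        if not fields:
            continue
        # Pad short rows so that trailing blank columns are allowed.
        fields = fields + [""] * (len(MODEL_COLUMNS) - len(fields))
        analysis_type, trait, variants, covariates, memo = fields[:5]
        if analysis_type not in ("Q", "B"):
            raise ValueError("analysis_type must be Q or B, got %r" % analysis_type)
        models.append({
            "analysis_type": analysis_type,
            "trait": trait,
            # None stands for "use every available variant column in turn".
            "variants": None if variants in ("", "NA") else variants.split("|"),
            "covariates": [name for name in covariates.split("+") if name],
            "memo": memo,
        })
    return models

example = (
    "analysis_type\tclinical_data_trait_name\tvariant/gene_name\tcovariates\tmemo\n"
    "Q\tage\tTP53|KRAS\tgender+stage\tage against two genes\n"
    "B\tvital_status\tNA\tage\tevery variant in turn\n"
)
models = parse_glm_model(example)
```

In this example the first row names two genes and two covariates, while the second row leaves the variant choice open with "NA".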

GLM analysis may be performed using solely the data input into --glm-clinical-data-file, as described above, or alternatively, mutational data from the MAF may be included as variants in the GLM analysis, as also described above. Use the --use-maf-in-glm flag to include the mutation matrix derived from the MAF as variant data.

Note that all input files for both correlation and GLM analysis must be tab-separated.

ARGUMENTS

--bam-list

Provide a file containing sample names and normal/tumor BAM locations for each. Use the tab-delimited format [sample_name normal_bam tumor_bam] per line. This tool only needs sample_name, so all other columns can be skipped. The sample_name must be the same as the tumor sample names used in the MAF file (16th column, with the header Tumor_Sample_Barcode).
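A quick consistency check between the two files can be sketched in Python as below. The helper names and sample data are hypothetical. The sketch assumes the BAM list has no header row, that MAF comment lines start with "#", and it looks up Tumor_Sample_Barcode by header name rather than by position, so the shortened example MAF has only three columns.

```python
import csv

def bam_list_samples(text):
    """Collect sample names from the first column of a BAM list."""
    return {line.split("\t")[0] for line in text.splitlines() if line.strip()}

def maf_tumor_samples(text):
    """Collect Tumor_Sample_Barcode values from a tab-separated MAF."""
    # Drop blank lines and comment lines such as a leading version line.
    rows = [line for line in text.splitlines() if line and not line.startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}

# Hypothetical inputs; the second BAM list row carries only the sample name.
bam_text = "SAMPLE_01\t/path/normal1.bam\t/path/tumor1.bam\nSAMPLE_02\n"
maf_text = (
    "#version 2.3\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tSAMPLE_01\n"
    "KRAS\tSilent\tSAMPLE_03\n"
)

bam_samples = bam_list_samples(bam_text)
maf_samples = maf_tumor_samples(maf_text)
missing_from_bam_list = maf_samples - bam_samples
without_mutations = bam_samples - maf_samples
```

Samples in missing_from_bam_list appear in the MAF but not in the BAM list, which usually points to a naming mismatch worth fixing before running the tool.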

LICENSE

It is released under the Lesser GNU Public License (LGPL) version 3. See the associated LICENSE file in this distribution.

AUTHORS

Nathan D. Dees, Ph.D. Qunyuan Zhang, Ph.D. William Schierding, M.S.

SEE ALSO

genome-music(1), genome(1)

gmt-music-clinical-correlation (1p)