AMOCATI: Algorithmic Meta-analysis Of Clinical And Transcriptomic Information
AMOCATI is an R package for analyzing transcriptome-based datasets, and more specifically for quantifying how a given gene and/or gene signature impacts the overall survival of patients. For the sake of convenience, AMOCATI can download data directly from the Genomic Data Commons (GDC) repository, and more precisely cancer datasets from the TCGA, TARGET and CGCI projects.
Table of contents
- Prerequisites
- Workspace directory setup
- Download and process a dataset of interest
- Launch the metaResults analysis
- Extract the Classification Signature
- Compute patient-wise the Quantitative Scores and Clinical Scores of the Classification Signature
- Separate the patients of the cohort according to the Quantitative Score or the Clinical Score of the Classification Signature
- Use of custom gene signatures instead of the Classification Signature
- Adding distinguishing genes as a new layer of complexity for custom signatures
- Further analyses
- Miscellaneous
- Citation
1) Prerequisites
1.1) Publication
Before using AMOCATI, we strongly encourage users to carefully read our associated publication. The methodology behind AMOCATI can be rather complex to understand at first sight, but we tried our best to make it as clear as possible. The manuscript and its associated supplemental resources will help users understand how AMOCATI works, what the main steps of its workflow are, and how to apply it to real-life datasets to address a biological question.
1.2) R environment introduction and installation
To make the installation of the R programming language and the RStudio development software easier for new or beginner users, we highly recommend the following resource, entitled « YaRrr! The Pirate’s Guide to R ». New users should at least read the first (« Preface ») and second (« Getting Started ») sections, as they provide clear and straightforward instructions on how to set up R and RStudio on Windows and macOS operating systems. These sections will allow users to correctly install AMOCATI and launch it without trouble.
1.3) AMOCATI R package installation
The AMOCATI package can be installed with the following commands:
# The following line can be skipped if the devtools package is already installed
install.packages("devtools")
# Load the devtools package
library("devtools")
# Install AMOCATI from its GitHub repository
devtools::install_github("PaulRegnier/AMOCATI")
1.4) Load AMOCATI
To load AMOCATI, simply enter the following command in the R console:
library("AMOCATI")
2) Workspace directory setup
Before launching the actual analysis, users need to select and set up their working directory, which AMOCATI will use throughout its workflow:
# Select the right working directory
workingDirectory = file.path("YOUR", "PATH", "HERE")
setwd(workingDirectory)
# Construct the actual workspace
resetWorkspace(
eraseEntireRMemory = FALSE,
verbose = TRUE
)
This function will create a set of folders and subfolders in which different files will be written throughout the AMOCATI workflow.
3) Download and process a dataset of interest
First, users should determine which dataset to analyze. For the sake of convenience, AMOCATI allows users to directly download cancer datasets from the GDC repository (notably from the TCGA, TARGET and CGCI projects).
If you want to use such a dataset, run the following command to list the datasets available for download:
# List projects and associated parameters
listProjectsAttributes()
This will output something similar to:
    ProjectID         ProjectName
1   CGCI-HTMCP-CC     HIV+ Tumor Molecular Characterization Project - Cervical Cancer
2   CGCI-HTMCP-DLBCL  HIV+ Tumor Molecular Characterization Project - Diffuse Large B-Cell Lymphoma
3   CGCI-HTMCP-LC     HIV+ Tumor Molecular Characterization Project - Lung Cancer
4   TARGET-ALL-P1     Acute Lymphoblastic Leukemia - Phase I
5   TARGET-ALL-P2     Acute Lymphoblastic Leukemia - Phase II
6   TARGET-ALL-P3     Acute Lymphoblastic Leukemia - Phase III
7   TARGET-AML        Acute Myeloid Leukemia
8   TARGET-CCSK       Clear Cell Sarcoma of the Kidney
9   TARGET-NBL        Neuroblastoma
10  TARGET-OS         Osteosarcoma
11  TARGET-RT         Rhabdoid Tumor
12  TARGET-WT         High-Risk Wilms Tumor
13  TCGA-ACC          Adrenocortical Carcinoma
14  TCGA-BLCA         Bladder Urothelial Carcinoma
15  TCGA-BRCA         Breast Invasive Carcinoma
16  TCGA-CESC         Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma
17  TCGA-CHOL         Cholangiocarcinoma
18  TCGA-COAD         Colon Adenocarcinoma
19  TCGA-DLBC         Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
20  TCGA-ESCA         Esophageal Carcinoma
21  TCGA-GBM          Glioblastoma Multiforme
22  TCGA-HNSC         Head and Neck Squamous Cell Carcinoma
23  TCGA-KICH         Kidney Chromophobe
24  TCGA-KIRC         Kidney Renal Clear Cell Carcinoma
25  TCGA-KIRP         Kidney Renal Papillary Cell Carcinoma
26  TCGA-LAML         Acute Myeloid Leukemia
27  TCGA-LGG          Brain Lower Grade Glioma
28  TCGA-LIHC         Liver Hepatocellular Carcinoma
29  TCGA-LUAD         Lung Adenocarcinoma
30  TCGA-LUSC         Lung Squamous Cell Carcinoma
31  TCGA-MESO         Mesothelioma
32  TCGA-OV           Ovarian Serous Cystadenocarcinoma
33  TCGA-PAAD         Pancreatic Adenocarcinoma
34  TCGA-PCPG         Pheochromocytoma and Paraganglioma
35  TCGA-PRAD         Prostate Adenocarcinoma
36  TCGA-READ         Rectum Adenocarcinoma
37  TCGA-SARC         Sarcoma
38  TCGA-SKCM         Skin Cutaneous Melanoma
39  TCGA-STAD         Stomach Adenocarcinoma
40  TCGA-TGCT         Testicular Germ Cell Tumors
41  TCGA-THCA         Thyroid Carcinoma
42  TCGA-THYM         Thymoma
43  TCGA-UCEC         Uterine Corpus Endometrial Carcinoma
44  TCGA-UCS          Uterine Carcinosarcoma
45  TCGA-UVM          Uveal Melanoma
Of note, to download a TCGA or a CGCI dataset, users must use the dedicated TCGA_CGCI.download(), TCGA_CGCI.createMetaMapping() and TCGA_CGCI.pool() functions as described below. To use a TARGET dataset instead, they must use the TARGET.download(), TARGET.createMetaMapping() and TARGET.pool() functions.
For the rest of this tutorial, we will use the cholangiocarcinoma dataset from the TCGA project (ProjectID = TCGA-CHOL).
# Download the associated RNA-Seq and clinical data
TCGA_CGCI.download(projectID = "TCGA-CHOL")
# Create a metamapping file which links RNA-Seq and clinical data to the right patients
TCGA_CGCI.createMetaMapping(verbose = TRUE)
# Finally pool, process and export data in an all-in-one file
TCGA_CGCI.pool(verbose = TRUE)
Because they depend on the GDC API, the TCGA_CGCI.download() and TARGET.download() functions may stop during the download and throw errors. In this case, simply run the command again: a built-in mechanism prevents already downloaded files (both RNA-Seq and clinical data) from being downloaded a second time.
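Since rerunning the command skips files that are already present, the retry can be automated. The following is a minimal sketch only: `retryCall()` and its `maxAttempts` and `waitSeconds` arguments are hypothetical names introduced here for illustration and are not part of AMOCATI.

```r
# Hypothetical helper (not part of AMOCATI): call `fun` until it succeeds
# or until `maxAttempts` is reached, pausing `waitSeconds` between attempts.
retryCall <- function(fun, maxAttempts = 5, waitSeconds = 10) {
  for (attempt in seq_len(maxAttempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = fun()),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        list(ok = FALSE, value = NULL)
      }
    )
    if (result$ok) {
      return(invisible(result$value))
    }
    if (attempt < maxAttempts) {
      Sys.sleep(waitSeconds)
    }
  }
  stop("All ", maxAttempts, " attempts failed")
}

# Intended usage (requires AMOCATI and network access, hence commented out):
# retryCall(function() TCGA_CGCI.download(projectID = "TCGA-CHOL"))
```

The number of attempts and the pause are arbitrary choices; adjust them to the stability of your connection.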
Of course, users can always use their own dataset, as long as it follows the correct format (see Figure 1 below): the fullData.data file (located in the output > data folder) should be a tabulation-delimited plain text file, where the 1st column is entitled CaseUUID and lists the unique identifier of each patient, the 2nd column is entitled vitalStatus and lists the vital status of each patient (either Alive or Dead), the 3rd column is entitled survivedDays and lists the number of days survived after the diagnosis, and the subsequent columns list the expression values for each gene (HGNC format) of the transcriptome.
Figure 1 – File format to respect for AMOCATI data.
Importantly, users can also use datasets unrelated to cancer if they wish. The only important criterion is that the data should represent measurable features (one per column) for patients (one per line) with survival/relapse/event information (2nd and 3rd columns). The Alive and Dead values of the vitalStatus column can easily be repurposed to encode any other event type, although neither the column name nor the words Alive and Dead should be changed, for the sake of compatibility with AMOCATI.
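To make the expected layout concrete, here is a small sketch that builds a toy table in the format described above and writes it as a tabulation-delimited file. All patient identifiers and values are made up, the two gene columns are arbitrary examples, and `workspaceRoot` is a placeholder variable (set to a temporary directory here so that nothing in a real workspace is overwritten).

```r
# Toy dataset in the layout described above; all values are made up.
toyData <- data.frame(
  CaseUUID     = c("patient-001", "patient-002", "patient-003"),
  vitalStatus  = c("Alive", "Dead", "Dead"),
  survivedDays = c(1520, 340, 875),
  TP53         = c(12.4, 8.1, 10.3),  # expression values, one gene per column
  KRAS         = c(5.2, 9.7, 6.6),
  stringsAsFactors = FALSE
)

# Basic sanity checks on the three mandatory leading columns
stopifnot(
  identical(names(toyData)[1:3], c("CaseUUID", "vitalStatus", "survivedDays")),
  all(toyData$vitalStatus %in% c("Alive", "Dead")),
  !anyDuplicated(toyData$CaseUUID)
)

# Assumption: replace tempdir() with your AMOCATI working directory
workspaceRoot <- tempdir()
dataDir <- file.path(workspaceRoot, "output", "data")
dir.create(dataDir, recursive = TRUE, showWarnings = FALSE)

# Write a tabulation-delimited plain text file without quotes or row names
write.table(
  toyData,
  file = file.path(dataDir, "fullData.data"),
  sep = "\t", quote = FALSE, row.names = FALSE
)
```

The sanity checks are optional, but they catch the most common formatting mistakes (wrong column order, unexpected vitalStatus values, duplicated patient identifiers) before the file is used.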
4) Launch the metaResults analysis
The next step is to compute the metaResults associated with this dataset. In a few words, this function randomly samples the dataset a given number of times (bootstrapping approach) and then computes, summarizes and outputs different metrics for each gene, which are used to estimate and classify its impact on the overall survival of patients (see the publication for more details about the algorithm as well as for a graphical representation of what it actually does):
createMetaResults(
selectedGenesOnly = FALSE,
verbose = TRUE,
signaturesMode = FALSE,
minNumberOfPatientsPerGroup = 3,
unsollicitedCores = 2,
iterationsPerCluster = 16,
genesCutoff = 20
)
Of note, this step can take a long time to complete and is rather computationally intensive. Please set the number of unsollicitedCores to a reasonable value (RAM requirements can be substantial).
This function outputs a tabulation-delimited text file named metaResults.meta, located in the output > metaResults folder.
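To give an intuition of the bootstrapping idea mentioned above (resample the cohort, compute a per-gene metric, summarize it across iterations), here is a toy sketch on simulated data. It is not the algorithm implemented in createMetaResults(): the metric used (difference in mean survival days between patients above and below the median expression), the simulated cohort and all variable names are assumptions chosen for illustration. The actual metrics are described in the publication.

```r
set.seed(42)

# Simulated cohort: 60 patients, one gene whose expression loosely tracks survival
nPatients <- 60
expression <- rnorm(nPatients, mean = 10, sd = 2)
survivedDays <- pmax(1, round(200 * expression + rnorm(nPatients, sd = 600)))

nIterations <- 200
differences <- numeric(nIterations)

for (i in seq_len(nIterations)) {
  # Resample patients with replacement (one bootstrap iteration)
  idx <- sample.int(nPatients, replace = TRUE)
  expr <- expression[idx]
  days <- survivedDays[idx]

  # Split the resampled patients at the median expression of the gene
  high <- expr > median(expr)
  differences[i] <- mean(days[high]) - mean(days[!high])
}

# Summarize across iterations: average effect and its stability
meanDifference <- mean(differences)
signalToNoise <- abs(meanDifference) / sd(differences)
```

A gene with a large average effect and a small spread across iterations gets a high signal-to-noise value in this toy setting, which is the kind of behavior one looks for when selecting robust genes.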
If desired, this analysis can be restricted to a given selection of genes in order to drastically reduce the computation time. To do this, users can provide a list of genes and set the selectedGenesOnly = TRUE argument. In this case, the output metaResults analysis will be named metaResults_selectedGenes.meta and will be located in the output > metaResults folder, as previously described. Additionally, this mode of analysis will also generate other results, notably a selectedGenes.zip file containing the associated survival tables and curves in the output > metaResults > selectedGenes folder. Please note that the tabulation-delimited text file containing the selected gene(s) must follow a precise format, although its name can be freely chosen (*.txt): this file should contain a single-column table, with the first line named HGNC_GeneSymbol and the subsequent ones listing the actual genes to use. These genes must be in the HGNC format. This file must be located in the output > data > input folder.
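As an illustration of the file format just described, the sketch below writes such a single-column file. The gene symbols and the file name (`mySelectedGenes.txt`) are arbitrary examples, and `workspaceRoot` is a placeholder set to a temporary directory.

```r
# Hypothetical gene selection; symbols must follow the HGNC nomenclature
selectedGenes <- data.frame(
  HGNC_GeneSymbol = c("TP53", "KRAS", "IDH1"),
  stringsAsFactors = FALSE
)

# Assumption: replace tempdir() with your AMOCATI working directory
workspaceRoot <- tempdir()
inputDir <- file.path(workspaceRoot, "output", "data", "input")
dir.create(inputDir, recursive = TRUE, showWarnings = FALSE)

# The file name is free (here "mySelectedGenes.txt", an arbitrary choice)
geneFile <- file.path(inputDir, "mySelectedGenes.txt")
write.table(
  selectedGenes,
  file = geneFile,
  sep = "\t", quote = FALSE, row.names = FALSE
)

# Display the file content: the header line followed by one gene per line
readLines(geneFile)
```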
5) Extract the Classification Signature
Afterwards, users should plot two metrics output by the previous createMetaResults() function in order to select the genes that have a strong impact on survival together with a low variability across the bootstrapping iterations.
First, users should display the two metrics that will help delineate the genes composing the Classification Signature:
getClassificationSignature(
GeneScoreThreshold = NULL,
Gene_SNR_ExpressionThreshold = NULL,
exportSignature = FALSE,
verbose = TRUE
)
After that, users can choose the associated thresholds and see how they affect the resulting signature (see Figure 2 below):
GS_threshold = 0.5
SNR_threshold = 2
getClassificationSignature(
GeneScoreThreshold = GS_threshold,
Gene_SNR_ExpressionThreshold = SNR_threshold,
exportSignature = FALSE,
verbose = TRUE
)
Figure 2 – GeneScoreThreshold and Gene_SNR_ExpressionThreshold are visually set.
When the thresholds are correctly set, users should export the final gene signature:
getClassificationSignature(
GeneScoreThreshold = GS_threshold,
Gene_SNR_ExpressionThreshold = SNR_threshold,
exportSignature = TRUE,
verbose = TRUE
)
The resulting Classification Signature is saved in the output > signatures > classificationSignature.sign file, along with the associated plot in PDF format. *.sign files are simply plain text files (tabulation-delimited), where each column represents a gene signature (each gene being in the HGNC format), with a short description in the 1st line. Therefore, several signatures can be aggregated within a single *.sign file (see later in this tutorial).
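The *.sign layout described above can be produced with a few lines of base R. The sketch below aggregates two made-up signatures of unequal length into one file named customSignatures.sign, the file name used later in this tutorial for custom signatures. Padding the shorter column with empty cells is an assumption made here to obtain a rectangular table; the signature descriptions, gene lists and the `workspaceRoot` placeholder are illustrative only.

```r
# Hypothetical signatures of unequal length; the names become the 1st line
signatures <- list(
  "Toy proliferation signature" = c("MKI67", "TOP2A", "CCNB1"),
  "Toy hypoxia signature"       = c("VEGFA", "CA9")
)

# Assumption: shorter signatures are padded with empty cells so that all
# columns have the same number of rows
maxLength <- max(lengths(signatures))
padded <- lapply(signatures, function(genes) {
  c(genes, rep("", maxLength - length(genes)))
})

# One column per signature; column names are the short descriptions
signTable <- do.call(cbind, padded)

# Assumption: replace tempdir() with your AMOCATI working directory
workspaceRoot <- tempdir()
signDir <- file.path(workspaceRoot, "output", "signatures")
dir.create(signDir, recursive = TRUE, showWarnings = FALSE)

signFile <- file.path(signDir, "customSignatures.sign")
write.table(
  signTable,
  file = signFile,
  sep = "\t", quote = FALSE, row.names = FALSE
)
```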
6) Compute patient-wise the Quantitative Scores and Clinical Scores of the Classification Signature
Then, AMOCATI can compute, for each patient, both the Quantitative Score and the Clinical Score of the Classification Signature, using the previously determined Classification Signature and the metaResults.meta file:
applySignature(
signatureUsed = "classification",
verbose = TRUE
)
In this setting, the applySignature() function outputs a *.apply file in the output > apply folder. This tabulation-delimited text file contains the Quantitative Scores and the Clinical Scores of the Classification Signature for each patient.
7) Separate the patients of the cohort according to the Quantitative Score or the Clinical Score of the Classification Signature
Next, we can separate the patients of the cohort according to their previously computed Clinical Score for the Classification Signature:
separatePatients(
applyFileUsed = "classification",
metricToUse = "CS",
verbose = TRUE
)
Basically, this function takes as input the previously generated *.apply file as well as the metaResults.meta file.
If desired, users can also use the Quantitative Score to attempt to separate the patients of the cohort, although this generally leads to a poor separation of patients:
separatePatients(
applyFileUsed = "classification",
metricToUse = "QS",
verbose = TRUE
)
The aim of this function is to separate patients into 2 groups: the ones with the highest values (called Long-Term Survivors if the Clinical Score is used) and the ones with the lowest values (called Short-Term Survivors if the Clinical Score is used). Please read the publication for the full details.
This function outputs several files in the output > class > classificationSignature folder: PDF files containing the ROC curve as well as the survival curves, a tabulation-delimited *.class file which is basically a *.apply file with a supplemental column indicating the group (LTS or STS) of each patient, and several tabulation-delimited text files with different information and statistics about the two generated groups, the ROC curves, etc.
8) Use of custom gene signatures instead of the Classification Signature
If desired, users have the possibility to use their own custom gene signatures instead of the determined Classification Signature. The process is exactly the same as for steps 6) and 7), except that all the signatures of interest should be written in the output > signatures > customSignatures.sign file, which should follow the same file format as previously described for the classificationSignature.sign file:
# Compute the Clinical Scores and the Quantitative Scores for each custom gene signature and for each patient
applySignature(
signatureUsed = "custom",
verbose = TRUE
)
Technically, in this context, this function performs exactly the same tasks as in step 6), but organizes the results in different output folders: output > customSignatures > fullTables for the *.apply files as previously described, and output > customSignatures > synthesis for a tabulation-delimited text file which contains Pearson’s and Spearman’s correlation coefficients, their associated p-values, and the coefficient determined for a linear regression between the Quantitative Scores and Clinical Scores for each custom signature. This will help users determine which of their gene signatures are positively or negatively associated with survival, and with which magnitude.
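For readers unfamiliar with these statistics, the following sketch shows how such a synthesis line could be computed with base R on simulated scores. The values are made up and do not come from AMOCATI, and the direction of the regression (Clinical Score as a function of Quantitative Score) as well as the column names of the `synthesis` table are assumptions.

```r
set.seed(1)

# Simulated scores for 40 patients (made-up values, not AMOCATI output)
quantitativeScore <- rnorm(40)
clinicalScore <- 0.6 * quantitativeScore + rnorm(40, sd = 0.5)

# Correlation tests and linear regression between the two scores
pearson  <- cor.test(quantitativeScore, clinicalScore, method = "pearson")
spearman <- cor.test(quantitativeScore, clinicalScore, method = "spearman")
fit <- lm(clinicalScore ~ quantitativeScore)

# One summary line per signature (here a single simulated signature)
synthesis <- data.frame(
  pearsonR    = unname(pearson$estimate),
  pearsonP    = pearson$p.value,
  spearmanRho = unname(spearman$estimate),
  spearmanP   = spearman$p.value,
  slope       = unname(coef(fit)["quantitativeScore"])
)
print(synthesis)
```

A positive coefficient with a small p-value suggests that higher Quantitative Scores go along with higher Clinical Scores for that signature, and a negative one suggests the opposite.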
# Use the Clinical Scores obtained for each of the custom signatures to separate the patients of the cohort
separatePatients(
applyFileUsed = "custom",
metricToUse = "CS",
verbose = TRUE
)
This function also behaves exactly like the one described in step 7), but creates one subdirectory per custom signature in the output > class folder.
9) Adding distinguishing genes as a new layer of complexity for custom signatures
If desired, users have the possibility to use one or several distinguishing gene(s) during the computation of the Clinical Score and Quantitative Score as well as during the patient separation. Briefly, a distinguishing gene is a gene by which users want to further stratify the impact of the gene signatures on the overall survival (only for the custom signatures). This simply adds a new column to the resulting tables, allowing users to better see whether a low, intermediate or high expression actually impacts the survival. The process is exactly the same as for steps 6), 7) and 8):
# Compute the Clinical Score and the Quantitative Score on the custom gene signatures using one or more gene(s) as distinguishing gene(s)
applySignature(
signatureUsed = "custom",
distinguishingGenes = TRUE,
verbose = TRUE
)
Please note that the tabulation-delimited text file containing the distinguishing gene(s) must follow a precise format, although its name can be freely chosen (*.txt): this file should contain a single-column table, with the first line named HGNC_GeneSymbol and the subsequent ones listing the actual genes to use. These genes must be in the HGNC format. This file must be located in the output > apply > input folder.
This function also performs computations similar to the ones presented in step 8), but outputs more results, organized in supplementary folders:
- In the output > apply > customSignatures > fullTables folder, there will be zip files (one per distinguishing gene used) containing the actual *.apply files as previously described in step 6)
- In the output > apply > customSignatures > synthesis folder, there will be one subfolder per distinguishing gene which will contain a tabulation-delimited text file describing the Clinical Scores and Quantitative Scores means for the low, intermediate and high expressing groups for the distinguishing gene of interest, as well as the associated p-values for each possible comparison
- In the output > apply > customSignatures > CS folder, there will be zip files (one per distinguishing gene used) containing PDF graphs showing the variation of the Clinical Scores for each signature relative to the low, intermediate or high expression of the current distinguishing gene, as well as tabulation-delimited text files which statistically describe these graphs (p-values of comparisons, mean, median, etc.)
- In the output > apply > customSignatures > QS folder, there will be zip files (one per distinguishing gene used) containing PDF graphs showing the variation of the Quantitative Scores for each signature relative to the low, intermediate or high expression of the current distinguishing gene, as well as tabulation-delimited text files which statistically describe these graphs (p-values of comparisons, mean, median, etc.)
- In the output > apply > customSignatures > CS VS QS folder, there will be zip files (one per distinguishing gene used) containing PDF graphs showing the correlation between the Clinical Scores and the Quantitative Scores for each signature, highlighting the low, intermediate or high expression of the current distinguishing gene, as well as tabulation-delimited text files which statistically describe these graphs (Pearson’s and Spearman’s coefficients of correlation as well as linear regression coefficients)
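The kind of stratification summarized in these outputs can be illustrated on simulated data. The sketch below is only an assumption-laden example: the tutorial does not state how the low, intermediate and high groups are defined, so expression tertiles are used here, and the Wilcoxon test is an arbitrary choice for the pairwise comparisons. All values are simulated.

```r
set.seed(7)

# Simulated data: expression of one distinguishing gene and a Clinical Score
geneExpression <- rnorm(90, mean = 8, sd = 1.5)
clinicalScore <- 0.4 * geneExpression + rnorm(90)

# Assumption: the three groups are defined by expression tertiles
cutoffs <- quantile(geneExpression, probs = c(0, 1/3, 2/3, 1))
expressionGroup <- cut(
  geneExpression,
  breaks = cutoffs,
  labels = c("low", "intermediate", "high"),
  include.lowest = TRUE
)

# Mean Clinical Score in each expression group
groupMeans <- tapply(clinicalScore, expressionGroup, mean)

# p-values for each pairwise comparison (test chosen for illustration only)
comparisons <- pairwise.wilcox.test(
  clinicalScore, expressionGroup,
  p.adjust.method = "none"
)

print(groupMeans)
print(comparisons$p.value)
```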
10) Further analyses
If needed, users can go deeper in the analysis of the datasets, as we did in our associated publication. To do this, they can use existing tools such as the limma R package for differential analyses of gene or pathway expression, the GSVA R package for the enrichment analysis of gene sets (pathways) and the igraph R package for the generation of network-based graphs. The use of these packages is not detailed here, as they are not officially implemented in AMOCATI. We invite users to read the respective vignettes and documentation of these packages.
11) Miscellaneous
AMOCATI also offers the possibility to split a given dataset into two parts, which is useful for conducting analyses in an intra-cohort discovery/validation setting:
splitData(
datasetAProportion = 0.5,
verbose = TRUE
)
This function will simply output 2 new *.data subdatasets (fullData_datasetA.data and fullData_datasetB.data) from the original output > data > fullData.data file and save them in the same location.
12) Citation
If you used AMOCATI in your work, we kindly encourage you to properly cite our bioRxiv manuscript: