AMOCATI: Algorithmic Meta-analysis Of Clinical And Transcriptomic Information


AMOCATI is an R package for analyzing transcriptome-based datasets and, more specifically, for quantifying how a given gene and/or gene signature impacts the overall survival of patients. For convenience, AMOCATI can download data directly from the Genomic Data Commons (GDC) repository, specifically cancer datasets from the TCGA, TARGET and CGCI projects.

Table of contents

  1. Prerequisites
    1. Publication
    2. R environment introduction and installation
    3. AMOCATI R package installation
    4. Load AMOCATI
  2. Workspace directory setup
  3. Download and process a dataset of interest
  4. Launch the metaResults analysis
  5. Extract the Classification Signature
  6. Compute patient-wise the Quantitative Scores and Clinical Scores of the Classification Signature
  7. Separate the patients of the cohort according to the Quantitative Score or the Clinical Score of the Classification Signature
  8. Use of custom gene signatures instead of the Classification Signature
  9. Adding distinguishing genes as a new layer of complexity for custom signatures
  10. Further analyses
  11. Miscellaneous
  12. Citation

1) Prerequisites

1.1) Publication

Before using AMOCATI, we strongly encourage users to carefully read our associated publication. The methodology behind AMOCATI can be rather complex to grasp at first sight, but we tried our best to make it as clear as possible. The manuscript and its associated supplemental resources will help users understand how AMOCATI works, what the main steps of its workflow are, and how to apply it to real-life datasets to address a biological question.

1.2) R environment introduction and installation

To make the installation of the R programming language and the RStudio development environment easier for new or beginner users, we highly recommend the following resource, entitled « YaRrr! The Pirate’s Guide to R ». New users should at least read the first (« Preface ») and second (« Getting Started ») sections, as they provide clear and straightforward instructions on how to set up R and RStudio on Windows and macOS. Following these sections should allow users to install and launch AMOCATI without trouble.

1.3) AMOCATI R package installation

The AMOCATI package can be installed with the following commands:

# The following line can be skipped if the devtools package is already installed

install.packages("devtools")

# Load the devtools package

library("devtools")

# Install AMOCATI from its GitHub repository

devtools::install_github("PaulRegnier/AMOCATI")

1.4) Load AMOCATI

To load AMOCATI, simply enter the following command in the R console:

library("AMOCATI")

2) Workspace directory setup

Before launching the actual analysis, users need to select and set up their working directory, which AMOCATI will use throughout its workflow:

# Select the right working directory

workingDirectory = file.path("YOUR", "PATH", "HERE")
setwd(workingDirectory)

# Construct the actual workspace

resetWorkspace(
    eraseEntireRMemory = FALSE,
    verbose = TRUE
)

This function will create a set of folders and subfolders in which different files will be written throughout the AMOCATI workflow.
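
Note that setwd() fails when the target folder does not exist yet. The following minimal sketch, which uses only base R, creates the folder before moving into it. The path is a placeholder located in a temporary directory and should be replaced with your own location.

```r
# Placeholder path: replace with your own location
workingDirectory <- file.path(tempdir(), "AMOCATI_workspace")

# Create the folder (and any missing parent folders) if it does not exist yet
if (!dir.exists(workingDirectory)) {
    dir.create(workingDirectory, recursive = TRUE)
}

# Move into it; resetWorkspace() can then be called as shown above
setwd(workingDirectory)
```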

3) Download and process a dataset of interest

First, users should determine which dataset to analyze. For convenience, AMOCATI allows users to directly download cancer datasets from the GDC repository (notably from the TCGA, TARGET and CGCI projects).

If you want to use such a dataset, please run the following command to list the datasets available for download:

# List projects and associated parameters

listProjectsAttributes()

This will output something similar to:

          ProjectID                                                                   ProjectName
1     CGCI-HTMCP-CC               HIV+ Tumor Molecular Characterization Project - Cervical Cancer
2  CGCI-HTMCP-DLBCL HIV+ Tumor Molecular Characterization Project - Diffuse Large B-Cell Lymphoma
3     CGCI-HTMCP-LC                   HIV+ Tumor Molecular Characterization Project - Lung Cancer
4     TARGET-ALL-P1                                        Acute Lymphoblastic Leukemia - Phase I
5     TARGET-ALL-P2                                       Acute Lymphoblastic Leukemia - Phase II
6     TARGET-ALL-P3                                      Acute Lymphoblastic Leukemia - Phase III
7        TARGET-AML                                                        Acute Myeloid Leukemia
8       TARGET-CCSK                                              Clear Cell Sarcoma of the Kidney
9        TARGET-NBL                                                                 Neuroblastoma
10        TARGET-OS                                                                  Osteosarcoma
11        TARGET-RT                                                                Rhabdoid Tumor
12        TARGET-WT                                                         High-Risk Wilms Tumor
13         TCGA-ACC                                                      Adrenocortical Carcinoma
14        TCGA-BLCA                                                  Bladder Urothelial Carcinoma
15        TCGA-BRCA                                                     Breast Invasive Carcinoma
16        TCGA-CESC              Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma
17        TCGA-CHOL                                                            Cholangiocarcinoma
18        TCGA-COAD                                                          Colon Adenocarcinoma
19        TCGA-DLBC                               Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
20        TCGA-ESCA                                                          Esophageal Carcinoma
21         TCGA-GBM                                                       Glioblastoma Multiforme
22        TCGA-HNSC                                         Head and Neck Squamous Cell Carcinoma
23        TCGA-KICH                                                            Kidney Chromophobe
24        TCGA-KIRC                                             Kidney Renal Clear Cell Carcinoma
25        TCGA-KIRP                                         Kidney Renal Papillary Cell Carcinoma
26        TCGA-LAML                                                        Acute Myeloid Leukemia
27         TCGA-LGG                                                      Brain Lower Grade Glioma
28        TCGA-LIHC                                                Liver Hepatocellular Carcinoma
29        TCGA-LUAD                                                           Lung Adenocarcinoma
30        TCGA-LUSC                                                  Lung Squamous Cell Carcinoma
31        TCGA-MESO                                                                  Mesothelioma
32          TCGA-OV                                             Ovarian Serous Cystadenocarcinoma
33        TCGA-PAAD                                                     Pancreatic Adenocarcinoma
34        TCGA-PCPG                                            Pheochromocytoma and Paraganglioma
35        TCGA-PRAD                                                       Prostate Adenocarcinoma
36        TCGA-READ                                                         Rectum Adenocarcinoma
37        TCGA-SARC                                                                       Sarcoma
38        TCGA-SKCM                                                       Skin Cutaneous Melanoma
39        TCGA-STAD                                                        Stomach Adenocarcinoma
40        TCGA-TGCT                                                   Testicular Germ Cell Tumors
41        TCGA-THCA                                                             Thyroid Carcinoma
42        TCGA-THYM                                                                       Thymoma
43        TCGA-UCEC                                          Uterine Corpus Endometrial Carcinoma
44         TCGA-UCS                                                        Uterine Carcinosarcoma
45         TCGA-UVM                                                                Uveal Melanoma

Of note, users who want to download a TCGA or a CGCI dataset must use the dedicated TCGA_CGCI.download(), TCGA_CGCI.createMetaMapping() and TCGA_CGCI.pool() functions as described below. Conversely, users who want to use a TARGET dataset must use the TARGET.download(), TARGET.createMetaMapping() and TARGET.pool() functions.

For the rest of this tutorial, we will use the cholangiocarcinoma dataset from the TCGA project (ProjectID = TCGA-CHOL).

# Download the associated RNA-Seq and clinical data

TCGA_CGCI.download(projectID = "TCGA-CHOL")

# Create a metamapping file which links RNA-Seq and clinical data to the right patients

TCGA_CGCI.createMetaMapping(verbose = TRUE)

# Finally pool, process and export data in an all-in-one file

TCGA_CGCI.pool(verbose = TRUE)

Because they depend on the GDC API, the TCGA_CGCI.download() and TARGET.download() functions may stop during the download and throw errors. In this case, simply run the command again: we implemented a mechanism that prevents already downloaded files (both RNA-Seq and clinical data) from being downloaded again.
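
Since rerunning the command is safe, the retry can be automated. The sketch below defines a hypothetical helper, retryCall(), which is not part of AMOCATI: it calls a function until it succeeds or until a maximum number of attempts is reached. The default number of attempts and the waiting time are arbitrary choices.

```r
# Hypothetical helper (not part of AMOCATI): call `fn` until it succeeds
# or until `maxAttempts` is reached, pausing between failed attempts
retryCall <- function(fn, maxAttempts = 5, waitSeconds = 10) {
    for (attempt in seq_len(maxAttempts)) {
        result <- tryCatch(
            list(ok = TRUE, value = fn()),
            error = function(e) list(ok = FALSE, message = conditionMessage(e))
        )
        if (result$ok) {
            return(invisible(result$value))
        }
        message("Attempt ", attempt, " failed: ", result$message)
        if (attempt < maxAttempts) Sys.sleep(waitSeconds)
    }
    stop("All ", maxAttempts, " attempts failed")
}

# Usage with the download step of this tutorial (requires AMOCATI):
# retryCall(function() TCGA_CGCI.download(projectID = "TCGA-CHOL"))
```

Wrapping the call in an anonymous function delays its evaluation, so that each attempt actually reruns the download.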

Alternatively, users can always use their own dataset, as long as it follows the correct format (see Figure 1 below). The fullData.data file (located in the output > data folder) should be a tab-delimited plain text file in which the 1st column is entitled CaseUUID and lists the unique identifier of each patient, the 2nd column is entitled vitalStatus and lists the vital status of each patient (either Alive or Dead), the 3rd column is entitled survivedDays and lists the number of days survived after diagnosis, and the subsequent columns list the expression values for each gene (HGNC format) of the transcriptome.

Figure 1 – File format to respect for AMOCATI data.

Importantly, users can also use datasets unrelated to cancer if they wish. The only important criterion is that the data should represent measurable features (one per column) for patients (one per line) with survival/relapse/event information (2nd and 3rd columns). The Alive and Dead values of the vitalStatus column can easily be repurposed to encode any other event type, although neither the column name nor the words Alive and Dead should be changed, for the sake of compatibility with AMOCATI.
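
As an illustration of the expected layout, the following sketch builds a toy dataset with base R and writes it as a tab-delimited file. All patient identifiers, gene choices and values are made up, the file is written to a temporary folder rather than to output > data, and writing unquoted fields is an assumption about what AMOCATI expects.

```r
# Toy cohort: all identifiers and values are made up
set.seed(1)
nPatients <- 6
customData <- data.frame(
    CaseUUID = sprintf("patient-%02d", seq_len(nPatients)),
    vitalStatus = c("Alive", "Dead", "Dead", "Alive", "Dead", "Alive"),
    survivedDays = c(812, 230, 455, 1020, 97, 640),
    TP53 = round(runif(nPatients, 0, 100), 2),
    FOXP3 = round(runif(nPatients, 0, 100), 2),
    stringsAsFactors = FALSE
)

# Basic sanity checks on the three mandatory columns
stopifnot(
    identical(names(customData)[1:3], c("CaseUUID", "vitalStatus", "survivedDays")),
    !anyDuplicated(customData$CaseUUID),
    all(customData$vitalStatus %in% c("Alive", "Dead")),
    is.numeric(customData$survivedDays)
)

# Write a tab-delimited file; in a real workspace the target would be
# output/data/fullData.data (a temporary folder is used here)
outFile <- file.path(tempdir(), "fullData.data")
write.table(customData, outFile, sep = "\t", quote = FALSE, row.names = FALSE)
```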

4) Launch the metaResults analysis

The next step is to compute the metaResults associated with this dataset. In a few words, this function randomly samples the dataset a given number of times (bootstrapping approach), then computes, summarizes and outputs different metrics for each gene that make it possible to estimate and classify its impact on the overall survival of patients (see the publication for more details about the algorithm and for a graphical representation of what it does):

createMetaResults(
    selectedGenesOnly = FALSE,
    verbose = TRUE,
    signaturesMode = FALSE,
    minNumberOfPatientsPerGroup = 3,
    unsollicitedCores = 2,
    iterationsPerCluster = 16,
    genesCutoff = 20
)

Of note, this step can take a long time to complete and is rather computationally intensive. Please set unsollicitedCores to a reasonable value, as RAM requirements can be substantial.

This function outputs a tab-delimited text file named metaResults.meta, located in the output > metaResults folder.
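
To give an intuition for the bootstrapping idea mentioned above, here is a generic sketch on simulated data. It is not AMOCATI's actual algorithm, and the statistic used (the difference in median survival between high and low expressers of a single gene) and the signal-to-noise summary are illustrative assumptions. The publication describes the real metrics.

```r
set.seed(42)

# Simulated cohort (made-up values): one gene whose high expression
# is associated with longer survival
nPatients <- 60
expression <- rnorm(nPatients, mean = 10, sd = 2)
survivedDays <- round(300 + 40 * expression + rnorm(nPatients, sd = 80))

# One bootstrap iteration: resample patients with replacement, split them
# at the median expression, and compare median survival between the groups
oneIteration <- function() {
    idx <- sample(nPatients, replace = TRUE)
    expr <- expression[idx]
    days <- survivedDays[idx]
    high <- expr > median(expr)
    median(days[high]) - median(days[!high])
}

# Repeat many times and summarize the effect and its variability
nIterations <- 200
effects <- replicate(nIterations, oneIteration())
summaryStats <- c(
    meanEffect = mean(effects),
    sdEffect = sd(effects),
    signalToNoise = mean(effects) / sd(effects)
)
print(round(summaryStats, 2))
```

A gene with a large mean effect and a small standard deviation across iterations has a consistent association with survival, which is the kind of property the thresholds of the next step are meant to capture.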

If desired, this analysis can be restricted to a given selection of genes in order to drastically reduce the computation time. To do so, users can provide a list of genes through the selectedGenesOnly = TRUE argument. In this case, the output file will be named metaResults_selectedGenes.meta and will be located in the output > metaResults folder, as previously described. This mode of analysis also generates other results, notably a selectedGenes.zip file containing the associated survival tables and curves in the output > metaResults > selectedGenes folder. Please note that the tab-delimited text file containing the selected gene(s) must follow a precise format, although its name can be anything the users want (*.txt): it should contain a single-column table, with the first line named HGNC_GeneSymbol and the subsequent lines listing the actual genes to use. These genes must be in the HGNC format. This file must be located in the output > data > input folder.
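
A gene list in this format can be produced with base R, as in the minimal sketch below. The gene symbols and the file name mySelectedGenes.txt are arbitrary examples, and a temporary folder stands in for the working directory.

```r
# Example gene symbols (arbitrary choice for illustration)
selectedGenes <- data.frame(
    HGNC_GeneSymbol = c("TP53", "KRAS", "IDH1"),
    stringsAsFactors = FALSE
)

# In a real workspace the folder would be output/data/input;
# a temporary folder stands in for the working directory here
inputFolder <- file.path(tempdir(), "output", "data", "input")
dir.create(inputFolder, recursive = TRUE, showWarnings = FALSE)

# The file name is free as long as it ends with .txt
geneFile <- file.path(inputFolder, "mySelectedGenes.txt")
write.table(selectedGenes, geneFile, sep = "\t", quote = FALSE, row.names = FALSE)

# Display the resulting file content
cat(readLines(geneFile), sep = "\n")
```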

5) Extract the Classification Signature

Afterwards, users should plot two metrics output by the previous createMetaResults() function in order to select the genes that have a strong impact on survival coupled with a low variability across the bootstrapping iterations.

First, users should display the two metrics that will help delineate the genes composing the Classification Signature:

getClassificationSignature(
    GeneScoreThreshold = NULL,
    Gene_SNR_ExpressionThreshold = NULL,
    exportSignature = FALSE,
    verbose = TRUE
)

After that, users can choose the associated thresholds and see how they affect the resulting signature (see Figure 2 below):

GS_threshold = 0.5
SNR_threshold = 2

getClassificationSignature(
    GeneScoreThreshold = GS_threshold,
    Gene_SNR_ExpressionThreshold = SNR_threshold,
    exportSignature = FALSE,
    verbose = TRUE
)

Figure 2 – GeneScoreThreshold and Gene_SNR_ExpressionThreshold are visually set.

When the thresholds are correctly set, users should export the final gene signature:

getClassificationSignature(
    GeneScoreThreshold = GS_threshold,
    Gene_SNR_ExpressionThreshold = SNR_threshold,
    exportSignature = TRUE,
    verbose = TRUE
)

The resulting Classification Signature is saved in the output > signatures > classificationSignature.sign file, along with the associated plot in PDF format. *.sign files are simply tab-delimited plain text files in which each column represents a gene signature (each gene being in the HGNC format), with a short description in the 1st line. Therefore, several signatures can be aggregated within a single *.sign file (see later in this tutorial).
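
The sketch below shows how a file with this layout could be read back into R for inspection. The example file, its descriptions and its genes are made up, and padding the shorter signature with an empty cell is an assumption about how AMOCATI handles signatures of different lengths.

```r
# Two made-up signatures of different lengths; the first line of each
# column holds its short description
signFile <- file.path(tempdir(), "example.sign")
writeLines(
    c(
        "Signature A\tSignature B",
        "TP53\tKRAS",
        "IDH1\tBAP1",
        "ARID1A\t"
    ),
    signFile
)

# Read the file as text, keeping the descriptions as column names
signTable <- read.delim(
    signFile,
    check.names = FALSE,
    colClasses = "character",
    na.strings = ""
)

# Convert each column to a character vector of genes, dropping empty cells
signatures <- lapply(signTable, function(genes) {
    genes[!is.na(genes) & nzchar(genes)]
})
str(signatures)
```

Reading every column as character avoids gene symbols being silently converted to another type.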

6) Compute patient-wise the Quantitative Scores and Clinical Scores of the Classification Signature

Then, AMOCATI can compute both the Quantitative Score and the Clinical Score of the Classification Signature for each patient, using the previously determined Classification Signature and the metaResults.meta file:

applySignature(
    signatureUsed = "classification",
    verbose = TRUE
)

In this setting, the applySignature() function outputs a *.apply file in the output > apply folder. This tab-delimited text file contains the Quantitative Scores and the Clinical Scores of the Classification Signature for each patient.

7) Separate the patients of the cohort according to the Quantitative Score or the Clinical Score of the Classification Signature

Next, we can separate the patients of the cohort according to their previously computed Clinical Score for the Classification Signature:

separatePatients(
    applyFileUsed = "classification",
    metricToUse = "CS",
    verbose = TRUE
)

Basically, this function takes as input the previously generated *.apply file as well as the metaResults.meta file.

If desired, users can also use the Quantitative Score to attempt to separate the patients of the cohort, although this generally leads to a poor separation of patients:

separatePatients(
    applyFileUsed = "classification",
    metricToUse = "QS",
    verbose = TRUE
)

The aim of this function is to separate patients into 2 groups: the ones with the highest values (called Long-Term Survivors if the Clinical Score is used) and the ones with the lowest values (called Short-Term Survivors if the Clinical Score is used). Please read the publication for the full details.

This function outputs several files in the output > class > classificationSignature folder: PDF files containing the ROC curve and the survival curves, a tab-delimited *.class file which is basically a *.apply file with an additional column indicating which group (LTS or STS) each patient belongs to, and several tab-delimited text files with information and statistics about the two generated groups, the ROC curves, etc.

8) Use of custom gene signatures instead of the Classification Signature

If desired, users can use their own custom gene signatures instead of the determined Classification Signature. The process is exactly the same as in steps 6) and 7), except that all the signatures of interest should be written in the output > signatures > customSignatures.sign file, which should follow the same file format as previously described for the classificationSignature.sign file:

# Compute the Clinical Scores and the Quantitative Scores for each custom gene signature and for each patient

applySignature(
    signatureUsed = "custom",
    verbose = TRUE
)

Technically, in this context, this function performs exactly the same tasks as in step 6), but organizes the results in different output folders: output > customSignatures > fullTables for the *.apply files as previously described, and output > customSignatures > synthesis for a tab-delimited text file which contains the Pearson’s and Spearman’s correlation coefficients, their associated p-values, and the coefficient determined by a linear regression between the Quantitative Scores and Clinical Scores for each custom signature. This will help users determine which of their own gene signatures are positively or negatively associated with survival, and with which magnitude.

# Use the Clinical Scores obtained for each of the custom signatures to separate the patients of the cohort

separatePatients(
    applyFileUsed = "custom",
    metricToUse = "CS",
    verbose = TRUE
)

This function also behaves exactly like the one described in step 7), but instead creates one subdirectory per custom signature in the output > class folder.

9) Adding distinguishing genes as a new layer of complexity for custom signatures

If desired, users can use one or several distinguishing gene(s) during the computation of the Clinical Score and Quantitative Score as well as during the separation of patients. Briefly, a distinguishing gene is a gene by which users want to further stratify the impact of the gene signatures on the overall survival (only for the custom signatures). This simply adds a new column to the resulting tables, which allows users to better see whether a low, intermediate or high expression of that gene actually impacts survival. The process is exactly the same as in steps 6), 7) and 8):

# Compute the Clinical Score and the Quantitative Score on the custom gene signatures using one or more gene(s) as distinguishing gene(s)

applySignature(
    signatureUsed = "custom",
    distinguishingGenes = TRUE,
    verbose = TRUE
)

Please note that the tab-delimited text file containing the distinguishing gene(s) must follow a precise format, although its name can be anything the users want (*.txt): it should contain a single-column table, with the first line named HGNC_GeneSymbol and the subsequent lines listing the actual genes to use. These genes must be in the HGNC format. This file must be located in the output > apply > input folder.
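
Before launching the computation, it can be useful to check that every gene of such a list is actually present in the dataset. The sketch below is one possible way to do so with base R. Both files are toy stand-ins written to a temporary folder, and their names and content are made up.

```r
# Toy stand-ins for the two files (made-up content, temporary folder)
dataFile <- file.path(tempdir(), "toy_fullData.data")
writeLines(
    c(
        "CaseUUID\tvitalStatus\tsurvivedDays\tTP53\tKRAS",
        "patient-01\tAlive\t812\t12.4\t3.1"
    ),
    dataFile
)
geneFile <- file.path(tempdir(), "toy_distinguishingGenes.txt")
writeLines(c("HGNC_GeneSymbol", "TP53", "FOXP3"), geneFile)

# Read only the header line of the dataset to get the available gene symbols
header <- strsplit(readLines(dataFile, n = 1), "\t", fixed = TRUE)[[1]]
availableGenes <- setdiff(header, c("CaseUUID", "vitalStatus", "survivedDays"))

# Read the gene list and check its header
geneList <- read.delim(geneFile, colClasses = "character")
stopifnot(identical(names(geneList), "HGNC_GeneSymbol"))

# Report the requested genes that are missing from the dataset
missingGenes <- setdiff(geneList$HGNC_GeneSymbol, availableGenes)
if (length(missingGenes) > 0) {
    message("Not found in the dataset: ", paste(missingGenes, collapse = ", "))
}
```

Reading only the first line of the dataset keeps the check fast even for a full transcriptome.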

This function performs computations similar to the ones presented in step 8), but outputs more results, which are organized in additional folders:

  • In the output > apply > customSignatures > fullTables folder, there will be zip files (one per distinguishing gene used) containing the actual *.apply files as previously described in step 6)
  • In the output > apply > customSignatures > synthesis folder, there will be one subfolder per distinguishing gene, containing a tab-delimited text file describing the mean Clinical Scores and Quantitative Scores for the low, intermediate and high expressing groups of the distinguishing gene of interest, as well as the associated p-values for each possible comparison
  • In the output > apply > customSignatures > CS folder, there will be zip files (one per distinguishing gene used) containing PDF graphs showing the variation of the Clinical Scores for each signature relative to the low, intermediate or high expression of the current distinguishing gene, as well as tab-delimited text files which statistically describe these graphs (p-values of comparisons, mean, median, etc.)
  • In the output > apply > customSignatures > QS folder, there will be zip files (one per distinguishing gene used) containing PDF graphs showing the variation of the Quantitative Scores for each signature relative to the low, intermediate or high expression of the current distinguishing gene, as well as tab-delimited text files which statistically describe these graphs (p-values of comparisons, mean, median, etc.)
  • In the output > apply > customSignatures > CS VS QS folder, there will be zip files (one per distinguishing gene used) containing PDF graphs showing the correlation between the Clinical Scores and the Quantitative Scores for each signature, highlighting the low, intermediate or high expression of the current distinguishing gene, as well as tab-delimited text files which statistically describe these graphs (Pearson’s and Spearman’s correlation coefficients as well as linear regression coefficients)

10) Further analyses

If needed, users can go deeper in the analysis of the datasets, as we did in our associated publication. To do so, they can use existing tools such as the limma R package for differential analyses of gene or pathway expression, the GSVA R package for the enrichment analysis of gene sets (pathways), and the igraph R package for the generation of network-based graphs. The use of these packages is not detailed here, as they are not officially integrated into AMOCATI. We invite users to read the respective vignettes and documentation of these packages.

11) Miscellaneous

AMOCATI also offers the possibility to split a given dataset into two parts, which is useful for conducting analyses in an intra-cohort discovery/validation setting:

splitData(
    datasetAProportion = 0.5,
    verbose = TRUE
)

This function will simply output 2 new *.data subdatasets (fullData_datasetA.data and fullData_datasetB.data) from the original output > data > fullData.data file and save them in the same location.
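
Conceptually, such a split amounts to assigning each patient to one of two subdatasets. The sketch below illustrates a simple random split with base R on a toy dataset. It is not AMOCATI's implementation: splitData() may sample or balance the groups differently, and all values are made up.

```r
set.seed(123)

# Toy dataset in the fullData.data layout (made-up values)
fullData <- data.frame(
    CaseUUID = sprintf("patient-%02d", 1:10),
    vitalStatus = rep(c("Alive", "Dead"), 5),
    survivedDays = seq(100, 1000, by = 100),
    TP53 = round(runif(10, 0, 100), 2),
    stringsAsFactors = FALSE
)

# Draw the rows of subdataset A at random; the remaining rows form subdataset B
datasetAProportion <- 0.5
nA <- round(nrow(fullData) * datasetAProportion)
rowsA <- sample(nrow(fullData), nA)
datasetA <- fullData[rowsA, ]
datasetB <- fullData[-rowsA, ]
```

Each patient ends up in exactly one subdataset, so the metaResults computed on one part can be evaluated on patients that were never used to build them.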

12) Citation

If you used AMOCATI in your work, we kindly encourage you to properly cite our bioRxiv manuscript:
