To analyze relationships between perturbations, we utilize the framework of connectivity. A connectivity score between two perturbations quantifies the similarity of the cellular responses evoked by these perturbations. A score of 1 means that these two perturbations are more similar to each other than 100% of other perturbation pairs. A score of -1 means that these two perturbations are more dissimilar to each other than 100% of other perturbation pairs.

Introspect means querying your dataset against itself. Make sure to "Include Introspect" if you would like to see connections within your dataset (in addition to connections between your dataset and Touchstone-P).

In computing connectivity, biological or technical replicates can be aggregated together. Please select which metadata fields should be used to recognize replicates. For example, if you wish to distinguish between different doses of the same compound, make sure to select "pert_dose" (or something similar) as one of the metadata fields by which to group replicates. The possible metadata fields by which to group replicates only appear after you have uploaded your GCT and selected "Yes" for "Are there replicates in your data?".
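As a rough illustration of how the choice of grouping fields changes which profiles count as replicates, here is a minimal sketch. The profile layout, the "pert_iname" field, and the per-gene median used to collapse replicates are all assumptions made for the example; they are not a description of how the Query App aggregates replicates internally.

```python
from collections import defaultdict
from statistics import median


def aggregate_replicates(profiles, group_fields):
    """Group profiles sharing the same values for group_fields and collapse
    each group to a per-gene median (an assumed aggregation rule)."""
    groups = defaultdict(list)
    for profile in profiles:
        key = tuple(profile["meta"][field] for field in group_fields)
        groups[key].append(profile["values"])
    # zip(*vectors) walks the replicate vectors gene by gene.
    return {
        key: [median(gene_values) for gene_values in zip(*vectors)]
        for key, vectors in groups.items()
    }


# Hypothetical profiles: two genes measured in three samples.
profiles = [
    {"meta": {"pert_iname": "drugA", "pert_dose": "1 uM"}, "values": [1.0, -2.0]},
    {"meta": {"pert_iname": "drugA", "pert_dose": "1 uM"}, "values": [3.0, -4.0]},
    {"meta": {"pert_iname": "drugA", "pert_dose": "10 uM"}, "values": [5.0, -6.0]},
]

# Grouping by compound only merges all three samples into one signature.
by_compound = aggregate_replicates(profiles, ["pert_iname"])

# Adding "pert_dose" keeps the two doses apart.
by_compound_and_dose = aggregate_replicates(profiles, ["pert_iname", "pert_dose"])
```

With "pert_dose" among the grouping fields, the 1 uM and 10 uM samples form separate groups; without it, they are treated as replicates of one another.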

Access a suite of analysis apps by clicking on the menu (or press Command-K to open it).

The first step in using the Query App to compute connections with your gene expression data is to assign a name to your query. Results will be stored in your Analysis History after your query is submitted.

Enter an up-regulated gene of interest, hit enter, and type in subsequent genes in the set you would like to query. You may also have down-regulated genes of interest. They can be entered in the box to the right.

Hit submit and the query algorithm will find connections between your genes of interest and perturbagens in CMap that have signatures most similar to your query. Data are generated in approximately 5 minutes and will be stored in your Analysis History.

The L1000 assay directly measures or infers the expression levels of 12,328 genes. By evaluating the current statistical model against a large compendium of RNA-Seq profiles from over 100 tissues from the GTEx consortium, we have identified a subset of 10,174 genes that are either measured or well inferred. This subset is known as the Best INferred Gene (BING) space. The Query App uses BING space to compute similarities between users' gene sets and the gene expression signatures in the CMap database. Each user entry is therefore mapped into one of the three following categories.
Invalid gene: Not a valid HUGO symbol or Entrez ID, and therefore not used in the query.
Valid gene: A valid HUGO symbol or Entrez ID that is also part of BING space, and therefore is used in the query.
Valid but not used in query: A valid HUGO symbol or Entrez ID that is not part of BING space, and therefore is not used in the query.
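The three categories above amount to two membership checks. The sketch below uses small placeholder sets in place of the real list of valid gene identifiers and the real BING gene list; the set contents and the function name are hypothetical.

```python
def classify_gene(entry, valid_genes, bing_genes):
    """Map a user entry to one of the three categories described above."""
    if entry not in valid_genes:
        return "Invalid gene"
    if entry in bing_genes:
        return "Valid gene"
    return "Valid but not used in query"


# Placeholder identifiers, not real gene symbols or real BING membership.
valid_genes = {"GENE1", "GENE2", "GENE3"}
bing_genes = {"GENE1", "GENE2"}

entries = ["GENE1", "GENE3", "NOT_A_GENE"]
categories = {entry: classify_gene(entry, valid_genes, bing_genes) for entry in entries}

# Only entries in the "Valid gene" category contribute to the query.
genes_used = [entry for entry, cat in categories.items() if cat == "Valid gene"]
```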

Click on a perturbagen in this table to see a CLUE Card that contains all of the information available for this perturbagen. You can also select any compound in the table to query connections with all other compounds in Touchstone. Click on Detailed List to view connections in a table, or click Heatmap to see connections in a matrix powered by the Morpheus App.

Filter the Touchstone data table by selecting perturbagen type or perturbational classes of interest.

Average transcriptional impact

Impact is assessed as a transcriptional activity score (TAS), calculated as the mean of a perturbagen’s median replicate correlation and median signature strength across multiple cell lines and doses. The score describes a perturbagen’s transcriptional activity, relative to all other perturbagens, as derived from its replicate reproducibility and magnitude of differential gene expression.

PCTCC_i = rank( median( CC_i ) ) / N

PCTSS_i = rank( median( SS_i ) ) / N

TAS_i is the transcriptional impact score for the i-th perturbagen

PCTCC_i is the percentile, relative to all other perturbagens, of the i-th perturbagen’s median replicate correlation coefficient (CC) across all of its signatures

PCTSS_i is the percentile, relative to all other perturbagens, of the i-th perturbagen’s signature strength (SS) across all of its signatures

N is the total number of perturbagens
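A minimal sketch of the calculation follows. It assumes that the "mean value" described above is the arithmetic mean of the two percentiles, that rank 1 goes to the smallest median, and that ties are broken by input order; none of these details is confirmed here, and the function names and input values are hypothetical.

```python
from statistics import median


def percentile_ranks(value_by_pert):
    """Return rank(value) / N for each perturbagen (rank 1 = smallest value)."""
    ordered = sorted(value_by_pert, key=value_by_pert.get)
    n = len(ordered)
    return {pert: (rank + 1) / n for rank, pert in enumerate(ordered)}


def transcriptional_activity_scores(cc_by_pert, ss_by_pert):
    """Assumed form: mean of the CC percentile and the SS percentile."""
    pct_cc = percentile_ranks({p: median(v) for p, v in cc_by_pert.items()})
    pct_ss = percentile_ranks({p: median(v) for p, v in ss_by_pert.items()})
    return {p: (pct_cc[p] + pct_ss[p]) / 2 for p in cc_by_pert}


# Hypothetical per-signature values for three perturbagens.
cc = {"A": [0.1, 0.3], "B": [0.5, 0.7], "C": [0.8, 0.9]}
ss = {"A": [2.0, 4.0], "B": [8.0, 10.0], "C": [5.0, 7.0]}

tas = transcriptional_activity_scores(cc, ss)
```

In this toy example, perturbagen A ranks lowest on both measures and so receives the lowest score, while B and C each rank highest on one measure and second on the other, giving them equal scores.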

Signature diversity

Thick black bars signify Transcriptional Activity Scores greater than or equal to 0.5; thinner black bars denote scores less than 0.5. Absence of a bar means no data available. Colored lines (chords) signify similar connectivity scores between cell lines; red for positive connectivity scores of 80-100 (pale to intense color according to the score); blue for negative connectivity. Chords are only shown when TAS scores are > 0.5; thus absence of a chord either means that the perturbagen TAS score is very low, or that no data is available. Chords for individual cell lines can be isolated from the rest of the figure by hovering over the cell line name.

Baseline expression of this gene in each cell line is represented as a z-score (top numbers). Scores were calculated using the robust z-score formula:

z-score_i = ( x_i - median( X ) ) / ( MAD( X ) * 1.4826 )

x_i is the expression value of a given gene in the i-th cell line

X = [ x_1, x_2, ..., x_n ] is the vector of expression values for a given gene across n cell lines

MAD( X ) is the median absolute deviation of X

1.4826 is a constant that rescales the MAD so that, for normally distributed data, it estimates the standard deviation of X
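The robust z-score can be computed directly from these definitions. The sketch below uses made-up expression values, and the function name is hypothetical; how a MAD of zero should be handled is not specified above, so raising an error is an assumption.

```python
from statistics import median


def robust_z_scores(values):
    """Robust z-score of each value: (x - median) / (MAD * 1.4826)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # Assumption: treat a zero MAD as an error rather than dividing by zero.
        raise ValueError("MAD is zero; robust z-score is undefined")
    return [(v - center) / (mad * 1.4826) for v in values]


# Made-up expression values for one gene across five cell lines.
expression = [1.0, 2.0, 3.0, 4.0, 100.0]
z = robust_z_scores(expression)
```

Because the median and MAD are barely affected by the single extreme value, the fifth cell line stands out with a very large z-score while the other four stay close to zero.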

Median and MAD expression values were calculated using RNA-Seq profiles from a total of 1022 cell lines, comprising data from the Cancer Cell Line Encyclopedia (CCLE; Barretina, et al.) and cell lines nominated by the CMap team. Plots show z-score values only for the core LINCS lines used by CMap in L1000 experiments. Light red or light blue regions indicate positive or negative outlier expression, respectively, of the gene relative to the other lines shown; z-score of a positive outlier in the corresponding cell line is in dark red and a negative outlier is in dark blue.

Summary class connectivity shows a boxplot that summarizes the connectivity of a class. Each data point, shown as a light gray dot, represents the median value of connectivity of one member to the other class members. (This corresponds to the median for each row, excluding the main diagonal, in the heatmap shown below.) The box is the distribution of those data points, where the box boundary represents the interquartile range, the vertical line within the box is the median, and the whiskers reflect the minimum and maximum values of the data (exclusive of extreme outliers, which may appear beyond the whiskers).
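The data points behind the boxplot can be sketched as follows, assuming the class connectivity scores are available as a square matrix. The matrix values are invented, the function names are hypothetical, and quartile conventions vary between tools, so the quartiles returned by Python's `statistics.quantiles` may differ slightly from those drawn in the plot.

```python
from statistics import median, quantiles


def member_medians(matrix):
    """Median connectivity of each member to the other class members,
    i.e. the median of each row excluding the main diagonal."""
    return [
        median(score for j, score in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    ]


def box_summary(points):
    """Quartiles of the per-member medians (box edges and centre line)."""
    q1, q2, q3 = quantiles(points, n=4)
    return {"q1": q1, "median": q2, "q3": q3}


# Invented connectivity scores for a class with three members.
scores = [
    [100, 90, 80],
    [90, 100, 70],
    [80, 70, 100],
]

points = member_medians(scores)
summary = box_summary(points)
```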

Connectivity between members of class is a standard heat map of the connectivity scores, summarized across cell lines, between members of the class, where dark red represents the highest positive scores and deep blue the highest negative scores. Individual scores are revealed to the left below the map by hovering over each cell of the map.

Class inter-cell line connectivity is a plot of the median (black line) and Q25-Q75 connectivity scores (blue area around black line) for each cell line as well as the summary scores across cell lines. In some cases perturbations have not been tested in every cell line; the absence of data is indicated by a “0” for that cell line. The example shown reveals that these estrogen agonists show the strongest connectivity to each other in MCF7, a human breast cancer cell line that expresses the estrogen receptor.

Profile status

Colored portion of top bar indicates the Broad assays in which this compound has been profiled.

L1000 cell/dose coverage

For compounds profiled by L1000, cell lines and dose range for which signatures are available are indicated by dark gray bars (a lighter gray bar indicates that no data is available for that cell line/dose combination). A bar displayed one row above the 10 uM row indicates that doses higher than 10 uM were tested. The 6 rows correspond to 6 canonical doses: 20 nM, 100 nM, 500 nM, 1 uM, 2.5 uM, and 10 uM. (In some cases non-canonical doses were tested; these are rounded to the nearest canonical dose for the purpose of this display. For example, if the dose tested was 3.33 uM, the 2.5 uM bar is shown in dark gray here.)
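The rounding rule can be sketched as below. Whether "nearest" is measured on a linear or a logarithmic scale is not stated; this sketch assumes a logarithmic scale, which gives the same answer for the 3.33 uM example. The function name and the label used for the extra row are hypothetical.

```python
import math

# The six canonical doses, expressed in uM.
CANONICAL_DOSES_UM = [0.02, 0.1, 0.5, 1.0, 2.5, 10.0]
ABOVE_RANGE = "> 10 uM"


def display_row(dose_um):
    """Return the canonical dose row for a tested dose, or ABOVE_RANGE
    when the dose exceeds the highest canonical dose."""
    if dose_um <= 0:
        raise ValueError("dose must be positive")
    if dose_um > CANONICAL_DOSES_UM[-1]:
        return ABOVE_RANGE
    # Assumption: distance is measured between log-transformed doses.
    return min(
        CANONICAL_DOSES_UM,
        key=lambda canonical: abs(math.log(dose_um) - math.log(canonical)),
    )


row_for_3_33 = display_row(3.33)
row_for_20 = display_row(20.0)
```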

About CMap

We are creating a genome-scale library of cellular signatures that catalogs transcriptional responses to chemical, genetic, and disease perturbation. To date, the library contains more than 1.3 million profiles resulting from perturbations of multiple cell types.

Connectivity Map

The Connectivity Map, or CMap, is a resource that uses transcriptional expression data to probe relationships between diseases, cell physiology, and therapeutics. The changes in gene expression, or “signatures,” that arise from a disease, genetic perturbation (knockdown or overexpression of a gene) or treatment with a small molecule are compared for similarity to all perturbational signatures in the database. Perturbations that elicit highly similar, or highly dissimilar, expression signatures are termed “connected”; their related transcriptional effects suggest they confer related physiological effects on the cell. Our goal is to use these connections to uncover novel treatments for a variety of diseases, including cancers, neurological diseases, and infectious diseases.

CMap is a dynamic database and we will release new versions as new data becomes available; version numbers are identified on the home page. We invite you to use CMap and our tools to analyze your gene expression profiles for connectivity to known perturbagens.

What are examples of CMap applications?

For the biologist: use CMap to reveal connections between steps of biological pathways.

For the chemist: use CMap to uncover structure-function relationships between novel and well-studied compounds.

For the pharmacologist: use CMap as a first step in the drug discovery process.


Recent technological improvements have resulted in a dramatic increase in high-dimensional perturbational datasets available to the biomedical community. However, the enormity of the data and the complexity of integrating across multiple assays, multiple cell types, and experimental conditions of dose and treatment time require users to have considerable computational expertise to ask questions of the data.

Biologists need intuitive, performant user interfaces to explore and query the dataset and evaluate hypotheses. Even for computational researchers, the sheer scale of the dataset means time lost to downloading and formatting the data, and it can sometimes keep them from engaging with the data at all.

To address these needs, we have developed a computational environment built from the ground up to execute on state-of-the-art cloud-based systems. This environment, which we call CLUE, is built to meet the following goals:

  • Lower the barrier to access by making data and tools available on the cloud, thereby 1) eliminating the need to download massive files and 2) allowing users to stay in sync with the latest data releases
  • Facilitate interoperability between perturbational data types by harmonizing datasets
  • Implement web applications with user friendly graphical user interfaces that access underlying sophisticated algorithms

Availability and use

While CLUE is still early in its growth phase, it is functional now, and has already been loaded with over 1M gene expression profiles, related perturbational datasets, analytical tools, and web-based applications. These data and tools are freely available to academic users. Recognizing that drug-discovery companies will want to leverage this work for their proprietary research programs while still maintaining confidentiality of their proprietary data, we also offer CLUE as a subscription. See details at subscribe.

Web Apps

  • Touchstone App

    “Touchstone” is our term for compound and genetic perturbagens (~5000) that are well-studied and generate robust gene expression signatures in cells. Thus the Touchstone data set serves as a benchmark for assessing connectivity among perturbagens. Use the Touchstone app to learn more about these perturbagens and explore their connectivities.

  • ICV App

    The Integrated Connectivity Viewer presents connectivity data as a matrix-based interactive heatmap that provides a comprehensive view of connections and allows one to easily explore relationships within the data.

  • Query App

    Use the query app to find positive and negative connections between your gene expression signature of interest and all the signatures in CMap.

  • Morpheus App

    Morpheus is an interactive version of the ICV that lets you manipulate and annotate an existing dataset or one of your choice.

  • Repurposing App

    Explore our repurposing collection of ~5000 tool compounds and drugs for drug discovery opportunities.

Analysis Tools

Big data sets can be an enigmatic monolith without the proper interface to access and interpret the information they hold. We offer command line interfaces (CLIs) for computational biologists, APIs for software engineers, and web-based software applications for all. Check out our collection of Web Apps and Developer's Tools.

Our API provides metadata about compounds, genes, cell lines, and signatures. We have also developed command line interfaces with tools for computationalists and developers.


  • Berger AH, Brooks AN, Wu X, Shrestha Y, Chouinard C, Piccioni F, Bagul M, Kamburov A, Imielinski M, Hogstrom L, et al. High-throughput Phenotyping of Lung Cancer Somatic Mutations. Cancer Cell. 2016/08/08. 30(2):214-28, (2016). 
  • Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016/06/15. 32(12):1832-9, (2016). 
  • Duan Q, Flynn C, Niepel M, Hafner M, Muhlich JL, Fernandez NF, Rouilard AD, Tan CM, Chen EY, Golub TR, Sorger PK, Subramanian A, Ma'ayan A. LINCS Canvas Browser: interactive web app to query, browse, and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Research. 2014/6/06. 42(Web Server Issue):W449-60, (2014). 
  • Johannessen CM, Johnson LA, Piccioni F, Townes A, Frederick DT, Donahue MK, Narayan R, Flaherty KT, Wargo JA, Root DE, Garraway LA. A melanocyte lineage program confers resistance to MAP kinase pathway inhibition. Nature. 2013/12/5. 504(7478):138-42, (2013). 
  • Kim E, Ilic N, Shrestha Y, Zou L, Kamburov A, Zhu C, Lubonja R, Tran N, Nguyen C, Lawrence MS, et al. Systematic Functional Interrogation of Rare Cancer Variants Identifies Oncogenic Alleles. Cancer Discovery. 2016/05/04. 6(6), (2016).
  • Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006/9/29. 313(5795):1929-35, (2006). 
  • Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011/06/15. 27(12):1739-40, (2011).
  • Peck D, Crawford ED, Ross KN, Stegmaier K, Golub TR, Lamb J. A method for high-throughput gene expression signature analysis. Genome Biology. 2006/07/19. 7(7):R61, (2006). 
  • Santagata S, Mendillo ML, Tang YC, Subramanian A, Perley CC, Roche SP, Wong B, Narayan R, Kwon H, Koeva M, Amon A, Golub TR, Porco JA Jr., Whitesell L, Lindquist S. Tight coordination of protein translation and HSF1 activation supports the anabolic malignant state. Science. 2013/7/19. 341(6143), (2013).
  • Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005/10/25. 102(43):15545-50, (2005). 
  • Wilson FH, Johannessen CM, Piccioni F, Tamayo P, Kim JW, Van Allen EM, Corsello SM, Capelletti M, Calles A, Butaney M, et al. A functional landscape of resistance to ALK inhibition in lung cancer. Cancer Cell. 2015/05/27. 27(3):397-408, (2015). 


We are grateful for the important contributions from the Broad community, the CMap Team, our research collaborators, and third party code developers.  

Contact CMap