A dimensionality reduction technique particularly well suited for visualizing data. (For references, see https://lvdmaaten.github.io/tsne)

The parameters that were used for running t-SNE here are: 50 initial dimensions, perplexity of 30, and theta of 0.5. For datasets with <= 5000 samples, the standard t-SNE algorithm is used. For larger datasets, the Barnes-Hut algorithm is employed.

A dimensionality reduction technique in which the two principal components are chosen to have the largest possible variance.

To analyze relationships between perturbations, we utilize the framework of connectivity. A connectivity score between two perturbations quantifies the similarity of the cellular responses evoked by these perturbations. A score of 1 means that these two perturbations are more similar to each other than 100% of other perturbation pairs. A score of -1 means that these two perturbations are more dissimilar to each other than 100% of other perturbation pairs.

See a heatmap of connections between individual perturbagens in cell lines and all other perturbagens used for the P100 assay or the GCP assay. The tutorial describes the features of the heatmap.

Bring data, in GCT format, from your own P100 or GCP studies to query against our datasets.

Introspect means querying your dataset against itself. Make sure to "Include Introspect" if you would like to see connections within your dataset (in addition to connections between your dataset and Touchstone-P).

In computing connectivity, biological or technical replicates can be aggregated together. Please select which metadata fields should be used to recognize replicates. For example, if you wish to distinguish between different doses of the same compound, make sure to select "pert_dose" (or something similar) as one of the metadata fields by which to group replicates. The possible metadata fields by which to group replicates only appear after you have uploaded your GCT and selected "Yes" for "Are there replicates in your data?".


Matched mode: When running GUTC, incorporates cell-line information to match query data against matching cell types in Touchstone. Currently this includes the following 9 cell types: [A375, A549, HEPG2, HCC515, HA1E, HT29, MCF7, PC3, VCAP].
Unmatched mode (recommended): When running GUTC, does not incorporate cell-line information when querying the data against Touchstone signatures.


L-Build ("Light" Build):  All levels of L1000 data up to aggregated signatures.
Full Build:  All levels of L1000 data up to aggregated signatures, as well as all relevant additional analyses of the data (Introspect, t-SNE, PCA, etc.).

When querying Touchstone, Feature Space determines what set of genes to query against. When perturbagens are profiled on the L1000 platform, Landmark is recommended. When the queries you wish to use are not landmarks, use BING instead.

Root location within a brew folder that contains the instance matrices and the brew_group folder. Default is brew/pc

List of expected treatment doses in micromolar as a listmaker list. If provided, dose discretization is applied to the pert_dose metadata field to generate a canonicalized pert_idose field. Note this assumes that the pert_dose annotations are in micromolar.

Generates TAS plots and a connectivity heatmap of preliminary calibration plates to identify the most suitable experimental conditions for the specified parameters. The tool should be run on small pilot experiments that vary experimental parameters such as seeding density and time point. Plots can also be decoupled by parameters such as cell id.

Column filter to sig_build_tool as a listmaker collection

The name of the build used when generating all associated files and folders (e.g. <BUILD_CODE>_metadata). For this reason, the code must be filename compatible.

When merging replicates for L1000, several versions of the merged data are made. This parameter determines which version to use when creating your build. by_rna_well is the default and is recommended.

All data is from the Cancer Cell Line Encyclopedia resource. Expression data was released 15-Aug-2017, copy number data is dated 27-May-2014, and mutational data is dated 15-Aug-2017.


Feature Mapping: Ensembl Ids from the source data were mapped to Entrez Gene Ids using gene annotations from NCBI (downloaded on 02-Mar-2016).
Normalization:  RNAseq RPKM values were log2 transformed using log2(max(RPKM, eps)). The data were then normalized such that the expression values were comparable across cell lines, by minimizing technical variation and equalizing their distributions (for details of the normalization, see LISS and QNORM entries in the Connectopedia glossary). Post-normalization, the expression values range between 4 and 15 log2 units, with 4 indicating that a gene is minimally or not expressed and 15 indicating the maximum readout.
Z-scores: The number of standard deviations that a gene is above or below the population mean is called its z-score. The "robust" z-score is resistant to outliers by using median instead of mean and median absolute deviation (MAD) instead of standard deviation. The reference population used to compute the median and MAD for a particular gene is all CCLE lines with data for that gene.
Z-scores Within Primary Site: Similar to z-scores, but the reference population used to compute the median and MAD is all CCLE lines from the same lineage with data for that gene.
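As a rough illustration of the log2 transform described under Normalization, the sketch below applies log2(max(RPKM, eps)). The value of eps is not stated here, so EPS is a hypothetical placeholder, and the function name is illustrative only.

```python
import math

# Hypothetical floor: the text names "eps" but does not give its value.
EPS = 1e-3


def log2_rpkm(rpkm, eps=EPS):
    """Return log2(max(rpkm, eps)) so that zero RPKM values stay finite."""
    return math.log2(max(rpkm, eps))
```

For example, log2_rpkm(8.0) gives 3.0, and an RPKM of 0 is floored to eps before the logarithm is taken. The subsequent LISS and QNORM normalization steps are not shown.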

All scores indicated are in log 2 ratios to reference, binned using the heuristics described in CNVkit.

Deletion:  score < -1.1
Loss:  -1.1 ≤ score ≤ -0.25
No change:  -0.25 < score < +0.2
Gain: +0.2 ≤ score < +0.7
Amplification: +0.7 ≤ score
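The bins above can be expressed as a small lookup. This is a minimal sketch that only encodes the thresholds listed here; the function name cnv_call is an assumption made for illustration.

```python
def cnv_call(score):
    """Map a log2 copy-number ratio to one of the five bins listed above."""
    if score < -1.1:
        return "Deletion"
    if score <= -0.25:
        return "Loss"
    if score < 0.2:
        return "No change"
    if score < 0.7:
        return "Gain"
    return "Amplification"
```

Boundary values follow the inequalities as written: a score of exactly -1.1 or -0.25 is a Loss, exactly +0.2 is a Gain, and exactly +0.7 is an Amplification.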

Access a suite of analysis apps by clicking on the menu (or type command-K to open)

Switch between running a single query and running a batch query.

Give each query a descriptive name that will help you identify your results.

Tip: Each list can have a different number of genes; in fact, you can run a query with only one list (up OR down).

Your query will take about 5 minutes to process; check the History section in the Menu for your results!

Valid genes used in the query have HUGO symbols or Entrez IDs and are well-inferred or directly measured by L1000 (member of the BING gene set). Valid genes not used in a query are those that have a valid HUGO or Entrez identifier but are not part of the BING set. Invalid genes do not have HUGO or Entrez IDs.

The sig_fastgutc_tool is a reimplementation of our query algorithm that enables faster query results, especially at larger batch sizes. It is the result of a crowd-sourced contest. It is currently in beta mode.

Filter datasets by category to see only those of interest.

Data Icons identify published and proprietary datasets.

Click on a row to see a summary of that dataset, including cell lines and treatment conditions, assay type, and dates.

Arrange the table to display the information most important for your work, and add key datasets to favorites.

View details about the collection as a whole and about individual compounds.

View subsets of compounds based on mechanism, drug target, or known disease application.

Purity is assessed by ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) of compounds after receipt from the vendor.

Status as of publication of this resource (March 2017). We will be updating this but let us know if you notice a discrepancy.

Click on a compound to see details about its structure, mechanism, targets, approval status, and vendor.

Mouse over this graphic to see the classes of proteins targeted by drugs in the hub.

This is the current count of perturbagens in the reference (touchstone) dataset.

Select data from perturbagens grouped by their MoA or role in the cell.

Choose a perturbagen type, or view them all.

Touchstone is our reference dataset, made from well-annotated perturbagens profiled in a core set of 9 cell lines.

Detailed List is unavailable for Touchstone v1.1.1.1. A new data visualization approach is in development, but to get results in a table format (similar to Detailed View), please click on Heat Map and download the dataset as a GCT file that can be viewed in Excel or similar apps. Please see here for a detailed explanation.

Articles are tagged with topics. Click on a topic tag to see all related articles.

Look it up! A quick reference guide of CMap terms and their meanings.

Email us with your questions.

Click on the heading to read all the articles in this section on a single page, or open each article separately.

Click on a heading to open a menu of articles.

Each article is tagged with key words that describe its content.

Underlined words link to their definition in the CMap glossary.

Your feedback helps us make Connectopedia more useful.

Average transcriptional impact

TAS is a metric that incorporates the signature strength (the number of significantly differentially expressed transcripts) and signature concordance (the reproducibility of those changes across biological replicates) to capture activity of a compound. The score is computed as the geometric mean of the signature strength and the 75th quantile of pairwise replicate correlations for a given signature. Prior to computing the geometric mean, the signature strength is multiplied by the square root of the number of replicates. This serves to mitigate score shrinkage with increasing replicate number and allows TAS values derived from signatures of different numbers of replicates to be compared with each other.

Signature diversity

Thick black bars signify Transcriptional Activity Scores greater than or equal to 0.5; thinner black bars denote scores less than 0.5. Absence of a bar means no data available. Colored lines (chords) signify similar connectivity scores between cell lines; red for positive connectivity scores of 80-100 (pale to intense color according to the score); blue for negative connectivity. Chords are only shown when TAS scores are > 0.5; thus absence of a chord either means that the perturbagen TAS score is very low, or that no data is available. Chords for individual cell lines can be isolated from the rest of the figure by hovering over the cell line name.

Baseline expression of this gene in each cell line is represented as a z-score (top numbers). Scores were calculated using the robust z-score formula:

z-score_i = ( x_i - median( X ) ) / ( MAD( X ) * 1.4826 ),

where:

x_i is the expression value of a given gene in the i-th cell line

X = [ x_1, x_2, ..., x_n ] is a vector of expression values for a given gene across n cell lines

MAD( X ) is the median absolute deviation of X

1.4826 is a constant that rescales the MAD so that the score is comparable to one computed with the standard deviation of X
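The formula can be sketched in a few lines of Python. The function name and the error raised when MAD is zero are assumptions made for this illustration; the text does not say how that case is handled.

```python
from statistics import median

MAD_SCALE = 1.4826  # rescaling constant from the formula above


def robust_zscores(values):
    """Return the robust z-score of each value relative to the whole vector."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: treat a zero MAD as an error rather than guessing a fallback.
        raise ValueError("MAD is zero; robust z-score is undefined")
    return [(v - med) / (mad * MAD_SCALE) for v in values]
```

With the values [1, 2, 3, 4, 100], the median is 3 and the MAD is 1, so the outlier 100 scores 97 / 1.4826 while the median value scores 0.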

Median and MAD expression values were calculated using RNA-Seq profiles from a total of 1022 cell lines, comprising data from the Cancer Cell Line Encyclopedia (CCLE; Barretina, et al.) and cell lines nominated by the CMap team. Plots show z-score values only for the core LINCS lines used by CMap in L1000 experiments. Light red or light blue regions indicate positive or negative outlier expression, respectively, of the gene relative to the other lines shown; z-score of a positive outlier in the corresponding cell line is in dark red and a negative outlier is in dark blue.

Summary class connectivity shows a boxplot that summarizes the connectivity of a class. Each data point, shown as a light gray dot, represents the median value of connectivity of one member to the other class members. (This corresponds to the median for each row, excluding the main diagonal, in the heatmap shown below.) The box is the distribution of those data points, where the box boundary represents the interquartile range, the vertical line within the box is the median, and the whiskers reflect the minimum and maximum values of the data (exclusive of extreme outliers, which may appear beyond the whiskers).
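The data points in the boxplot can be sketched as row medians that skip the main diagonal. This is an illustrative sketch only; the function name and the list-of-lists input are assumptions, not the format used by the app.

```python
from statistics import median


def member_medians(matrix):
    """For each class member (row), return the median connectivity to the
    other members, ignoring the member's score against itself."""
    return [
        median(v for j, v in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    ]
```

For a three-member class with rows [100, 90, 80], [90, 100, 70], and [80, 70, 100], the medians are 85, 80, and 75.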

Connectivity between members of class is a standard heat map of the connectivity scores, summarized across cell lines, between members of the class, where dark red represents the highest positive scores and deep blue the highest negative scores. Individual scores are revealed to the left below the map by hovering over each cell of the map.

Class inter-cell line connectivity is a plot of the median (black line) and Q25-Q75 connectivity scores (blue area around black line) for each cell line as well as the summary scores across cell lines. In some cases perturbations have not been tested in every cell line; the absence of data is indicated by a “0” for that cell line. The example shown reveals that these estrogen agonists show the strongest connectivity to each other in MCF7, a human breast cancer cell line that expresses the estrogen receptor.

Profile status

Colored portion of top bar indicates the Broad assays in which this compound has been profiled.

L1000 cell/dose coverage

For compounds profiled by L1000, cell lines and dose range for which signatures are available are indicated by dark gray bars (lighter gray bar indicates no data is available for that cell line/dose combination). A bar displayed one row above the 10 uM row indicates that doses higher than 10 uM were tested. The 6 rows correspond to 6 canonical doses: 20 nM, 100 nM, 500 nM, 1 uM, 2.5 uM, and 10 uM. (In some cases non-canonical doses were tested; these are rounded to the nearest canonical dose for the purpose of this display. For example, if the dose tested was 3.33 uM, the 2.5 uM bar is shown in dark gray here.)
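Rounding to the nearest canonical dose might look like the sketch below. Measuring "nearest" on a log10 scale is an assumption (the text does not say which distance is used), and the extra row for doses above 10 uM is not modeled.

```python
import math

# The six canonical doses listed above, expressed in micromolar.
CANONICAL_DOSES_UM = [0.02, 0.1, 0.5, 1.0, 2.5, 10.0]


def nearest_canonical_dose(dose_um, canonical=CANONICAL_DOSES_UM):
    """Return the canonical dose closest to dose_um on a log10 scale."""
    if dose_um <= 0:
        raise ValueError("dose must be positive")
    return min(canonical, key=lambda c: abs(math.log10(dose_um) - math.log10(c)))
```

This reproduces the example in the text: nearest_canonical_dose(3.33) returns 2.5.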


Glossary

A

AIG   All Inferred Genes; includes the 978 landmark genes, plus the entire set of genes for which we infer expression in any given experiment, including the best inferred 9196 genes (BING) and 2154 additional genes whose expression is less well inferred by our algorithm.

B

BING   Best Inferred Genes; includes the 978 landmark genes as well as 9196 non-landmark genes for which we reliably infer gene expression using a computational approach. The L1000 assay directly measures or infers the expression levels of 12,328 genes. By evaluating the current statistical model against a large compendium of RNA-Seq profiles from over 100 tissues from the GTEx consortium, we have identified a subset of 10,174 genes that are either measured or well inferred. This subset is known as the Best INferred Gene (BING) space.

BRD   First 3 letters of each identifier for each perturbagen. Identifiers with a “K” following BRD have an unambiguous structure. Identifiers with an “A” following BRD are ambiguous in structure.

Brew   CMap-specific name assigned to the last step of the espresso computational pipeline, in which biological replicate signatures (typically three) from each experiment are collapsed into a single signature using a moderated z-score (MODZ) procedure. This procedure mitigates the effects of uncorrelated or outlier data in replicates and thus generates a signature that more accurately reflects the transcriptional effects of a given perturbagen.

bead_batch   One instantiation of a complete set of beads which have been coupled to probes at one time under the same conditions.

bead_revision   The set of beads that applies to a particular collection of gene pairs for each bead color.

bead_set   A pair of barcodes for each gene pair used for a bead color; gene pairs are used for the tag duo procedure, where a bead color is coupled to two different genes.

brew_prefix   Also known as replicate set; the part of the plate name (det_plate or rna_plate) that is independent of replicate, bead_batch, and other plate-specific fields. For example, LJP005_A375_24H is the brew_prefix for the plates LJP005_A375_24H_X1_B19, LJP005_A375_24H_X2_B19, and LJP005_A375_24H_X3_B19.
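Based only on the example plate names above, a brew_prefix could be recovered as in this sketch. It assumes the prefix is always the first three underscore-separated fields, which holds for the example but is not stated as a general rule; the function is hypothetical and not part of any CMap tool.

```python
def brew_prefix(plate_name, n_fields=3):
    """Return the leading fields of a plate name, joined by underscores.

    Assumption: the replicate and bead_batch fields come after the first
    n_fields fields, as in LJP005_A375_24H_X1_B19.
    """
    parts = plate_name.split("_")
    if len(parts) < n_fields:
        raise ValueError(f"unexpected plate name: {plate_name!r}")
    return "_".join(parts[:n_fields])
```

All three example plates map to the same prefix, LJP005_A375_24H.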

build   A version of the cmap dataset and webtools. Each build is uniquely identified by the build number. New builds are released intermittently as new data and tools become available. Data and tools may also be retired between builds.

C

CGS   Consensus Gene Signature; a signature, generated by an algorithm, that reflects the consistent gene expression effects of shRNAs that target the same gene. CGS was developed to distinguish the strong off-target effects of shRNAs, which are not shared, from on-target effects.

CSS   Consensus seed signature; a signature, generated by an algorithm, that reflects the consistent gene expression effects of multiple shRNAs that target the same seed sequence but target different gene targets.

Cancer Cell Line Encyclopedia (CCLE)   The Cancer Cell Line Encyclopedia (CCLE) project is a collaboration between the Broad Institute, the Novartis Institutes for Biomedical Research, and the Genomics Institute of the Novartis Research Foundation to conduct genetic and pharmacologic characterization of a large number of human cancer cell lines. The CCLE contains public access sequencing data and visualization of DNA copy number, mRNA expression, mutation data, and more for 1000 cell lines.

Cell Painting   An image-based assay that uses fluorescent dyes to detect morphological changes in a variety of cell features in response to perturbagen treatment.

Cell line / Cell type / Cellular model   The in vitro model in which an L1000 experiment is performed. Typically, these cell lines model a certain disease and may harbor mutations or other genetic alterations of interest. A cancer cell line is a cell line derived from a cancerous tissue and established for continuous growth in vitro; these cells are used to study the biology of cancer and cancer treatment. An immortalized Cell Type refers to cells that originate from a particular tissue type and have acquired the ability to divide indefinitely and evade normal cellular senescence mechanisms. Cells may be immortal because they derive from a naturally occurring cancer; alternatively they may have been induced to become immortal through in vitro genetic manipulation.

Connectivity   A measure of similarity between two signatures. Two signatures that have a very high positive connectivity score are said to be positively connected; two with a very high negative score are said to be negatively connected.

Connectivity Map   Connectivity Map, or CMap, is a resource of over one million gene expression profiles from many cell types treated with chemical or genetic perturbagens, designed to probe relationships between diseases, cell physiology, and therapeutics.

Connectivity score   A connectivity score is a value, between +100 and -100, that quantifies the relationship between a query signature (set of differentially expressed genes representing a biological state of interest) and a perturbagen. The connectivity score incorporates three components: 1) a nominal p-value (NP) that indicates the significance of the similarity (enrichment based on the Kolmogorov-Smirnov statistic) between query and reference signature compared to a null distribution of random queries, 2) a false discovery rate (FDR) value that adjusts the p-value to account for multiple hypothesis testing given the large numbers of comparisons in the dataset, and 3) 𝛕, which represents the effect size of a given enrichment score on a standardized scale. 𝛕 is a scale-free measure ranging from -100 to 100 and is the score that we report in query results; a 𝛕 of 90 indicates that only 10% of perturbations showed stronger connectivity to the query. Note that we don’t routinely calculate NP and FDR because empirically tau of >90 passes those tests.

Controls; Types of   Controls are used to differentiate authentic perturbagen-induced changes in gene expression from artifacts and technical errors. PosCons refer to positive controls, which are perturbagen treatments known to induce strong changes in expression. Vehicle controls refer to negative controls that should not cause significant changes in expression.

Controls—PosCons   Perturbagens that generate well-understood gene expression signatures and thus serve as positive controls for each experimental plate.

Controls—Vehicles (VC)   Various forms of “inert” experiments performed to determine, by contrast, effects that are specifically due to a perturbation as opposed to responses associated with handling cells or delivering a perturbagen; a type of negative control.

Core Cell Line Panel   A set of nine cell lines that are typically used in L1000 assays to test the transcriptional effects of perturbagens. The core cell lines are: A375, A549, HA1E, HCC515, HT29, HEPG2, MCF7, PC3, VCAP. These lines were used to generate all the data in the Touchstone (TS) dataset. Cells that have been transfected by a particular Cas9-containing vector, such as 311 or 101, are named with the vector name appended to their cell line name. For example, HA1E cells that are transfected with 311 vector are named HA1E.311.

cell_id   A shorthand CMap identifier number assigned to each cell line used in the L1000 assay.

cl_center_specific_id   synonym for cell_id.

clue.io   A secure, cloud-based computing environment, that allows both experienced and novice CMap users to interact with and analyze Connectivity Map data via a collection of robust and easy to use web-based applications.

count_cv   The coefficient of variation of bead counts.

count_mean   The mean of per well-analyte bead counts.

D

DEx database   A database of differential expression signatures derived from comparisons using diseased tissue; for example, comparison of diseased to healthy tissue, or diseased tissue subject to pharmacological or other type of treatment over a time period. In general, DEx signatures have not been made by the CMap team using L1000; rather, they've been submitted by other labs to public data repositories such as GEO or MSigDB.

Dactyloscopy   Dactyloscopy is an algorithm that is used to verify the identity of cell lines and confirm metadata in the descriptions of L1000 experiments. The algorithm is based on comparison of gene expression profiles measured in L1000 to baseline expression profiles, generated from RNAseq data, from the library of ~1000 cancer cell lines (Cancer Cell Line Encyclopedia). The Spearman correlation coefficient of an experimental expression profile should be highest against the reference profile of the actual cell line used in the L1000 experiment, when compared to the correlation coefficients against reference profiles of other cell lines in the library. The algorithm relies on the fact that baseline gene expression is cell line-specific, which together with the fact that most perturbagens significantly modulate expression of only a few genes, allows for unambiguous distinguishing of cell lines used in L1000.

Data Level   Data level refers to the degree of mathematical processing that has been performed on L1000 data as it goes through the signature generation pipeline, which consists of five levels.

Deconvolution (d-peak)   The computational method that assigns a data peak (from Level 1 data) to a specific gene is referred to as deconvolution, or d-peak.

Differential expression   The degree to which a gene’s expression is increased or decreased in response to a perturbagen, relative to the absence of perturbagen treatment. In the L1000 pipeline, differential expression is computed using a robust z-score.

Discover dataset   Data produced for discovery purposes; the assay conditions do not necessarily involve a majority of the CMap core cell lines and are not necessarily the standard experimental conditions used for the Touchstone dataset; moreover, discovery perturbagens may have tentative or no annotations.

Diversity score   A metric that describes the diversity of the gene expression consequence of perturbagen treatment across different cell lines. This is also sometimes called inter-cell connectivity (ICC).

det_mode   The detection mode used for acquiring L1000 data. Can be either DUO (two genes per analyte color) or UNI (one gene per analyte color).

det_plate   Detection plate, the plate of L1000 experiments that, at the end of the assay pipeline, is put through the Luminex scanners to detect the levels of landmark gene amplicons.

det_well   Detection well, which refers to each well of the detection plate in which an L1000 experiment is conducted.

distil_cc_q75   75th quantile of pairwise spearman correlations in landmark space of replicate level 4 profiles.
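A minimal, dependency-free sketch of this quantity follows. The quantile interpolation rule (the "inclusive" method of statistics.quantiles) and the handling of a single replicate pair are assumptions; the pipeline's exact conventions are not given here.

```python
from itertools import combinations
from statistics import mean, quantiles


def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def cc_q75(replicates):
    """75th quantile of all pairwise Spearman correlations between replicates."""
    corrs = [spearman(a, b) for a, b in combinations(replicates, 2)]
    if len(corrs) == 1:
        # Assumption: with only two replicates, report the single correlation.
        return corrs[0]
    return quantiles(corrs, n=4, method="inclusive")[2]
```

Each replicate is passed as a list of level 4 values over the same genes in the same order. At least two replicates are required, and a replicate whose values are all equal would cause a division by zero in this sketch.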

distil_id   ID of an individual replicate profile, referred to as level 4 / z-score data, that is used in creating the signature from replicates assayed together on an L1000 plate. The signature is referred to as level 5 / aggregated z-score data.

distil_nsample   Number of individual replicate profiles (level 4 / z-score) that were used to create the signature (level 5 / aggregate z-score).

distil_ss   The number of significantly differentially expressed transcripts that arise from a particular perturbagen treatment.

down score   A value between +1 and -1 representing the absolute enrichment of a down tag list in a given instance. The down score is the "down" value reported on the detailed result page and the result detail window. A high positive down score indicates that the corresponding perturbagen induced the expression of the probe sets in the down tag list. A high negative down score indicates that the corresponding perturbagen repressed the expression of the probe sets in the down tag list. The connectivity score is a combination of the up score and down score.

E

Espresso   CMap-specific name assigned to the computational pipeline that processes raw L1000 data (LXB files of scan data for each experiment) through several steps to generate differential gene expression signatures.

Experiment (aka treatment, aka profile)   Perturbagen treatment of cells under one experimental condition: a single combination of cell line, dose, and time point.

G

GCP   GCP (global chromatin profiling) uses targeted mass-spectrometry technology to measure ~60 combinations of post-translational modifications on histones. GCP measures global histone modifications (e.g. di-methylation of lysine 27 on histone 3), rather than modifications at specific sites along the genome.

GCT   Gene Cluster Text; a matrix-based file format containing numerical data as well as row and column annotations. For L1000 data, the columns of the matrix correspond to perturbagen-specific profiles, and the rows correspond to genes.

GCTx   Gene Cluster Text x, a file format that enables structuring of content in an HDF5 file that is compatible with L1000 data. GCTx format allows for storage of every profile we generate in a single unified file to which new signatures can be appended as they are generated. Each column in the GCTx matrix is a signature and each row is a gene in that signature.

GRP file   Gene set format file; contains a list of elements (typically gene symbols or feature identifiers) in a simple newline-delimited text format with one entry per line.
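Because a GRP file is just one entry per line, reading it takes very little code. The sketch below parses text that has already been read from a file; skipping blank lines and trimming whitespace are assumptions made for robustness, and the function name is illustrative.

```python
def parse_grp(text):
    """Return the non-empty lines of GRP-formatted text as a list of entries."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, parse_grp("TP53\nEGFR\n\nMYC\n") returns ["TP53", "EGFR", "MYC"].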

GTEx   A database that catalogs gene expression profiles from many tissue types, to further our understanding of how changes in our genes and gene expression contribute to human diseases.

GUTC   The Grand Unified Theory of Connectivity (GUTC) is a computational method that reduces the large, complex dataset of signatures arising from a query to a smaller, focused set of relevant connectivity results. Signature data is stored as a matrix, where each row is a landmark gene (~1000 total) and each column is a treatment. Currently there are at least 400,000 signatures in this matrix, and this number is growing. Thus, a user querying the database with an expression signature faces the prospect of sorting through at least 400,000 similarity scores. GUTC reduces this output to a more manageable and meaningful set of scores.

Gene Set   The pair of lists of up- and down- regulated genes in a biological condition of interest. For example, one could make a gene set of the genes induced and repressed following the exposure of a cell line to a small molecule relative to a vehicle control treatment. Any marker-selection algorithm or heuristic can be used to produce a gene set. It is possible to define a gene set from a signature by simply selecting the N most up- and/or down-regulated genes, or by selecting all genes with differential expression greater than some threshold value.
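The "N most up- and down-regulated genes" approach can be sketched as follows. The signature is assumed to be a dict from gene identifier to differential expression value; requiring up genes to be positive and down genes to be negative is an added assumption so the two lists never overlap.

```python
def gene_set_from_signature(signature, n=50):
    """Return (up, down) lists holding the n most up- and down-regulated genes.

    Assumption: only positive values qualify as up and only negative values
    as down, so a short signature may yield fewer than n genes per list.
    """
    ranked = sorted(signature, key=signature.get, reverse=True)
    up = [g for g in ranked if signature[g] > 0][:n]
    down = [g for g in reversed(ranked) if signature[g] < 0][:n]
    return up, down
```

With the signature {"A": 3.0, "B": -2.5, "C": 0.4, "D": -0.1, "E": 1.2} and n=2, the up list is ["A", "E"] and the down list is ["B", "D"]. A threshold-based selection would replace the slicing with a comparison against a cutoff.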

Genetic gain of function (GoF)   A mutation in a gene that causes the gene product to acquire a new molecular function or a new pattern of gene expression. In the L1000 context, we often refer to genetic over-expression constructs as gain of function experiments, because they may result in aberrantly high expression of a particular gene of interest.

Genetic loss of function (LoF)   A mutation in a gene that either prevents synthesis of the protein product or results in synthesis of a nonfunctional protein. In the L1000 context, we often refer to genetic knock down or knock out, via RNAi or CRISPR, as loss of function experiments, because they result in aberrantly low, or no, expression of a particular gene of interest.

gtex_id   Identifier for the sample as used within the GTEx project.

H

HG-U133A   Affymetrix GeneChip Human Genome U133A Array (part number 510681). The probe sets on this array define the feature set.

I

Inference   An algorithm that uses expression of the 978 Landmark genes in a given experiment to estimate expression of 11,350 additional genes.

Inferred gene   A non-landmark gene whose expression in response to perturbagen treatment is estimated using an inference algorithm, rather than measured directly by L1000.

Instance   A treatment and control pair and the list of probe sets ordered by their extent of differential expression between this treatment and control pair. The instance is the basic unit of data and metadata in cmap. Every instance has a number of attributes including a unique identifier (instance_id), the batch in which it was produced, the cmap name of the perturbagen, the source of that perturbagen, the concentration of that perturbagen, the cmap cell line used, and the scan numbers for the treatment and its control(s). All instances in the current build and their attributes are accessible from the instance page.

Introspect   The act of querying the signatures within a dataset against each other to determine connectivities among them, as opposed to determining connectivities between each signature of the set and the Touchstone reference dataset.

Invariant gene   A gene whose expression has been determined to be consistent across a wide variety of Affymetrix samples extracted from Gene Expression Omnibus (GEO).

iPSC Cells   Induced pluripotent stem cells are cells, derived from skin or blood, that have been genetically altered to an embryonic stem cell state; these cells can then be induced to develop into any cell type.

icc   Inter-cell connectivity (ICC). The similarity (aggregated WTCS) between signatures of a given perturbagen across cell lines. This number ranges between -1 and 1, and the higher the number, the more similar the signatures across cell lines. Only exemplar signatures are used in computing ICC. See is_exemplar for more details.

inf_model   Inference model designation.

inst_id   The identifier for the sample exactly as it appears in the matrix files attached to the series.

is_exemplar   A boolean indicating whether the given signature is an exemplar. Due to the redundancy of the CMap database, meaning that some perturbagens have many signatures even within the same cell line, it is convenient to identify a single ‘exemplar’ signature for each perturbagen in each cell line. These signatures are specifically designated for further analysis, such as ICC and aggregate TAS. Exemplar signatures are generally picked based on TAS, such that the signature with the highest TAS is chosen as exemplar.

is_gold   A heuristic for assessing whether a signature is reproducible and distinct. Requirements include: distil_cc_q75 >= 0.2 and pct_self_rank_q25 <= 0.05.
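
The two thresholds quoted for is_gold can be expressed as a simple check. This is a minimal sketch, assuming the two metrics are already available as plain numbers in the same units as the thresholds above; the function name and argument names are illustrative and are not part of any CMap library.

```python
def is_gold(distil_cc_q75: float, pct_self_rank_q25: float) -> bool:
    """Apply the two thresholds quoted in the is_gold definition.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    return distil_cc_q75 >= 0.2 and pct_self_rank_q25 <= 0.05


# A reproducible, distinct signature passes; a weakly correlated one does not.
print(is_gold(0.35, 0.01))
print(is_gold(0.15, 0.01))
```

The definition says the requirements "include" these two conditions, so a production check may apply further criteria that are not listed here.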

K

Kolmogorov-Smirnov statistic   The non-parametric rank statistic upon which the cmap analytic is based. A detailed description of the Kolmogorov-Smirnov statistic can be found in Nonparametric Statistical Methods (second edition) by Myles Hollander and Douglas Wolfe (1999).

L

L1000   A high-throughput gene expression assay that measures the expression of approximately 1000 landmark genes in cells treated with perturbagens; this data is then used to infer expression of 11,350 additional genes. The L1000 assay contains 1058 probes for 978 landmark transcripts and 80 control transcripts chosen for their invariant expression. Briefly, the assay is performed as follows: Cells growing in 384-well plates are lysed and the mRNA transcripts captured on oligo-dT-coated plates. cDNAs are synthesized from the captured transcripts and subjected to LMA (ligation-mediated amplification) using locus-specific oligonucleotides harboring a unique 24-mer barcode sequence and a 5’ biotin label. The biotinylated LMA products are detected by hybridization to polystyrene beads of distinct fluorescent color, each coupled to an oligonucleotide complementary to a barcode, and staining with streptavidin-phycoerythrin. Thus each bead is analyzed for its bead color (denoting the landmark identity) and the fluorescence intensity of the phycoerythrin signal (denoting the landmark transcript abundance). Because only 500 bead colors are commercially available, we devised a strategy that allows two transcripts to be identified by a single bead color.

LINCS DCIC   Library of Network-Based Cellular Signatures Data Coordination and Integration Center, which is working to develop methods and tools to integrate and interact with the large genomics and proteomics datasets generated by the members of the LINCS consortium.

LISS Normalization   The first of two normalization steps by which Level 2 data is converted to Level 3 data. The LISS (Luminex Invariant Set Normalization) step fits a power law curve to the median log2-intensities for the 10 invariant gene sets. This curve serves as a reference against which the experimental data are re-scaled, ultimately enabling conversion between measured Luminex intensity and Affymetrix log2-expression values.

Landmark gene   One of the 978 genes that are measured directly using the L1000 assay. Each landmark represents a tight cluster of co-regulated genes, as determined from analysis of a large collection of publicly available gene expression profiles.

Library of Network-based Cellular Signatures (LINCS)   An NIH program working to catalogue perturbagen-induced changes in gene expression and other cellular processes, in order to expand understanding of networks and pathways in biology.

M

Marker Selection   Marker selection refers to the identification of genes that are differentially expressed under a defined condition (e.g., perturbation). The Morpheus app contains a Marker Selection tool that generates lists of the up- and down-regulated genes from input data.

Molecular Signature Database (MSigDB)   MSigDB is a database of annotated gene sets pertaining to biological pathways and processes in mammalian, primarily human, cells.

mfc_plate_dim   Manufacturer’s stated dimensions of the pert plate.

mfc_plate_id   Manufacturer’s designated plate id.

mfc_plate_name   Name of the plate as designated by the manufacturer.

mfc_plate_quad   Quadrant of the plate as designated by the manufacturer.

mfc_plate_well   Well of the pert plate as designated by the manufacturer.

modZ   Moderated Z score; produced by the weighted average of replicates, where weighting is proportional to the Spearman correlation between replicates.
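
The weighting described for modZ can be sketched as follows. This is an illustration under stated assumptions, not the CMap implementation: each replicate's raw weight is taken to be its mean Spearman correlation with the other replicates, raw weights are floored at a small positive value (min_weight, an assumed parameter) so that poorly correlated replicates cannot receive negative weight, and the weights are then normalized to sum to 1. All function names are hypothetical.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def modz(replicates, min_weight=0.01):
    """Collapse replicate z-score vectors into one weighted-average vector.

    Requires at least two replicates. Returns (signature, weights).
    """
    n = len(replicates)
    raw = []
    for i in range(n):
        others = [spearman(replicates[i], replicates[j]) for j in range(n) if j != i]
        raw.append(max(mean(others), min_weight))  # assumed floor on the weight
    total = sum(raw)
    weights = [w / total for w in raw]
    n_genes = len(replicates[0])
    signature = [
        sum(w * rep[g] for w, rep in zip(weights, replicates))
        for g in range(n_genes)
    ]
    return signature, weights


# Three toy replicates over four genes; the third agrees less with the others.
reps = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2], [2.0, 1.0, 4.0, 3.0]]
signature, weights = modz(reps)
print(weights)
print(signature)
```

In this toy input the third replicate receives the smallest weight, so it contributes least to the collapsed vector.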

N

Normalized enrichment score (NES)   Enrichment score that has been rescaled to account for different query sizes, cell lines, and perturbagen types.

ngenes_modulated_dn_lm   The number of landmark genes that show decreased expression in cells treated with perturbagen.

P

P-value   An estimate of the likelihood that the enrichment of a set of instances in the list of all instances in a given result would be observed by chance. This value is determined empirically by computing the enrichment of one hundred thousand sets of instances selected at random from the set of all instances in the result.
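
The empirical procedure described for the P-value can be sketched generically. This is a minimal illustration, assuming a one-sided comparison (random score greater than or equal to the observed score) and a caller-supplied scoring function; score_fn, the argument names, and the fixed seed are assumptions made for the example, and the toy mean score below stands in for the actual enrichment statistic.

```python
import random


def empirical_p_value(observed, score_fn, all_instances, set_size,
                      n_random=100_000, seed=0):
    """Fraction of random instance sets scoring at least as high as observed.

    all_instances must be a sequence (e.g. a list). The default of 100,000
    random sets follows the definition; everything else is illustrative.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(all_instances, set_size)
        if score_fn(sample) >= observed:
            hits += 1
    return hits / n_random


def toy_score(values):
    """Stand-in for an enrichment score: the mean of the sampled values."""
    return sum(values) / len(values)


scores = list(range(100))
# Use fewer random sets here so the example runs quickly.
print(empirical_p_value(90.0, toy_score, scores, set_size=5, n_random=2000))
```

With a finite number of random sets the smallest non-zero value this estimate can take is 1 / n_random, which is why a large number of random sets is needed to resolve small P-values.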

P100   P100 uses targeted mass-spectrometry technology to measure 96 phosphopeptides that are commonly observed and modulated in diverse cell types. These phosphopeptides were chosen similarly to the way the 978 landmarks were chosen for L1000, as a way of representing the cellular state with incomplete information.

PBIOA   Pooled L1000 bioactivity assay, where L1000 serves as a bioactivity sensor using pooled cell lines.

PCL   Perturbagen CLass, referring to CMap-designated groupings of compound and genetic perturbagens based on their strong connectivities to each other and their shared mechanisms of action or biological functions. Each PCL generally has 3 or more members.

PRISM   A high-throughput screen for assessing cell viability in which cell lines that have each been labelled with a unique 24-nucleotide barcode are pooled and treated with the experimental condition, and surviving cells are “counted” through identification of the cognate barcode. PRISM is an acronym for Profiling Relative Inhibition Simultaneously in Mixture.

Peak deconvolution   Also referred to as "dpeak", the computational method for assigning the correct expression level to each of the two genes whose transcripts bind to beads of the same analyte color.

Permuted signature   A signature generated from the computation, done during the brew process, of collapsing random replicates into signatures to generate a null for comparison to actual signatures, to be used for replicate correlation and signature strength determinations.

Perturbagen   Reagent used in the laboratory for cell treatment and determination of the resulting transcriptional response. Perturbagen types used primarily include small-molecule compounds (trt_cp), gene knockdowns using shRNAs (trt_shrna) and/or CRISPR (trt_xpr), and reagents that cause increased amounts of a target protein using cDNAs delivered with a vector (trt_oe). Perturbagens (or groups of closely related perturbagens) are identified by their cmap name. A single perturbagen may be represented by multiple instances.

Plate Map   A plate map is the collection of metadata describing the perturbagen treatments and experimental conditions for each well in an L1000 assay plate.

Plate types   Usually L1000 assays are carried out in 384 well plates. There are several types of plates used for the different parts of the assay. See Connectopedia for a list defining these plates, including the common term as well as the term that appears in data files.

Population Control (PC)   A control against which gene expression changes are measured; the population control consists of the expression of each landmark gene in every well of the plate except the well(s) for the treatment of interest. In other words, a test with a given perturbagen will involve only one or a few wells in a 384-well plate; the other wells will serve as a population control for each landmark gene.

Primary Cells   Cells that are taken from living tissue and established in vitro for study; because they go through very few doublings in vitro prior to experimental use, they closely mimic the tissue from which they originated.

Probe set   The collection of match and mismatch oligonucleotides on an Affymetrix GeneChip microarray designed against a given transcript which together allow the relative level of that transcript to be estimated. Probe sets are uniquely identified with a code number that, by convention, ends with "_at" (e.g., 200800_s_at). Tag lists and instances are populated with probe sets from the feature set. Detailed descriptions of individual probe sets can be found at the Affymetrix NetAffx Analysis Center.

Profile   A profile (also termed an experiment or an instance) corresponds to data generated from a single perturbagen, cell type, dose, and time point. The numbers in a profile represent either the raw fluorescent intensity values (level 1 or raw data) or these numbers post deconvolution (level 2) or post normalization (level 3). Profiles are compared to appropriate controls to generate a list of differentially expressed genes (level 4). Replicate level 4 profiles (typically 3) are collapsed via weighted averaging into one differentially expressed vector, which we term a signature (level 5).

pc   The percent of total perturbagens, querying the column sample against the Touchstone dataset, that exceed the given thresholds.

pc_selection   The percent of total perturbagens, querying the column sample against selected rows, that exceed the given thresholds.

pct_self_rank_q25   Self connectivity of replicates expressed as a percentage of total instances in a replicate set.

pert_desc   A brief summary of the biological function (for genetic perturbagens) or mechanism of action (for compound perturbagens).

pert_dose   Precise amount of compound used to treat cells.

pert_dose_unit   Unit (generally micromolar) applied to the dose of compound used to treat cells.

pert_id   A unique identifier for a perturbagen that refers to the perturbagen in general, not to any particular batch or sample.

pert_idose   The concatenation of pert_dose and pert_dose_unit to create a string containing the dose information. We use a standardized dose for a perturbagen treatment. For example, the less common dose of 10.04 is rounded to 10. This enables grouping of signatures by a common dose.

pert_iname   The internal (CMap-designated) name of a perturbagen. By convention, for genetic perturbations CMap uses the HUGO gene symbol.

pert_itime   The concatenation of pert_time and pert_time_unit to create a string containing the length of time that a perturbagen was applied to the cells. We use a standardized time for a perturbagen treatment. For example, if data is made by treating cells with a perturbagen for 5.5 hours, we round that time to the more common treatment time of 6 hours.
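
The standardization described for pert_idose and pert_itime can be sketched as snapping a measured value to the nearest common value and then joining it with its unit. This is only an illustration: the list of canonical values, the 10% relative tolerance, the space separator, and the function name are all assumptions, since the definitions do not state the exact rounding rule or string format.

```python
def standardize(value, unit, canonical, tolerance=0.1):
    """Snap value to the nearest canonical value, then append the unit.

    Hypothetical helper. A value is snapped only when it lies within
    `tolerance` (relative) of the nearest canonical value.
    """
    nearest = min(canonical, key=lambda c: abs(c - value))
    if abs(nearest - value) <= tolerance * nearest:
        value = nearest
    return f"{value:g} {unit}"


# The two examples from the definitions: 10.04 becomes 10, 5.5 hours becomes 6.
print(standardize(10.04, "uM", canonical=[0.04, 0.12, 0.37, 1.11, 3.33, 10]))
print(standardize(5.5, "h", canonical=[6, 24]))
```

Snapping to a shared set of values is what makes it possible to group signatures by a common dose or time point, as the definitions note.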

pert_mfc_id   A manufacturer's id for the perturbagen; by convention, for compounds registered with Broad Compound Management this is the full BRD containing both the compound ID and the batch ID.

pert_time   The length of time, expressed as a number, that a perturbagen was applied to the cells; does not include the unit.

pert_time_unit   The unit that applies to the pert_time numerical value.

pert_type   Abbreviated designation for perturbagen type, referring to compound or genetic perturbagens that are used in cell treatments to assess gene expression effects.

pert_vehicle   The solvent or other vehicle used to deliver the perturbagen.

pool_id   Landmark probe pool used; generally the pool is epsilon, and older pools such as delta and deltaprime are not relevant for most users.

provenance_code   A shorthand code that tracks the different steps in data processing.

Q

Quantile Normalization (QNORM)   The second of two normalization steps by which Level 2 data is converted to Level 3 data. The QNORM step standardizes the shape of the expression profile distributions on each plate such that all of the data for a plate is on the same scale. Quantile normalization is then performed on all plates within a cohort to standardize the data across plates (Q2NORM).
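
The idea of placing all samples on a common distribution can be illustrated with textbook quantile normalization. This is a generic sketch, not the CMap QNORM code: it assumes every sample has the same number of features, resolves ties by input order, and uses the mean of the sorted values at each rank as the reference distribution. The function name is hypothetical.

```python
def quantile_normalize(samples):
    """Give every sample the same distribution of values.

    samples: list of equal-length lists, one list per sample (e.g. per well).
    Returns new lists in which the value at each rank is replaced by the
    mean, across samples, of the values at that rank.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices by ascending value
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


wells = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(wells))
```

After normalization each sample contains the same set of values, and only the assignment of those values to features differs between samples.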

Query   Strictly speaking, a query is a request for information; thus, a query of the Connectivity Map aims to reveal signatures that are very similar or very different (strongly positively or negatively connected) to an input signature. The input is a list of genes. The query asks CMap to compute connectivity scores for all instances with respect to a specified signature or a number of selected instances.

qc_f_logp   The -log10 of p-value (for f statistic), representing the goodness of fit of the power model used to convert raw fluorescence intensity values to log2 expression values during the LISS process.

qc_iqr   The interquartile range of normalized expression within a level 3 profile.

qc_slope   The line slope in degrees (arctan of slope) of the line of best fit through the observed invariant set expression levels and their expected expression ranks.
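
The conversion from a fitted slope to an angle in degrees can be sketched as below. This is a minimal example assuming an ordinary least-squares fit with the expected ranks on the x-axis and the observed levels on the y-axis; the definition does not state the axis assignment or the fitting method, so both are assumptions, as is the function name.

```python
import math


def qc_slope_degrees(expected_ranks, observed_levels):
    """Least-squares slope of observed vs. expected, as an angle in degrees."""
    n = len(expected_ranks)
    mean_x = sum(expected_ranks) / n
    mean_y = sum(observed_levels) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(expected_ranks, observed_levels))
    sxx = sum((x - mean_x) ** 2 for x in expected_ranks)
    slope = sxy / sxx
    return math.degrees(math.atan(slope))


# Observed levels that rise one unit per rank give a slope of 1.
print(qc_slope_degrees([1, 2, 3, 4], [1.0, 2.0, 3.0, 4.0]))
```

Expressing the slope as an angle bounds the value between -90 and 90 degrees, whereas the raw slope is unbounded.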

R

Reduced representation of transcriptome   A concept that, since genes with similar expression levels tend to have similar cellular functions or roles, the cellular state can be captured by measuring only a portion of the ~22k human genes.

Replicate recall   A measurement of the degree to which a signature generated by RNA-seq and used to query L1000 data finds its L1000 counterpart amidst all other RNA-seq and L1000 profiles.

Roast   CMap-specific name assigned to the first four steps of the espresso computational pipeline. The Roast steps 1) deconvolute the two peaks from each bead scan and assign each peak to a particular landmark gene, 2) normalize the data by first scaling it to the invariant gene set and then standardizing it so that all of the data from a plate is on the same scale, 3) infer the expression of approximately 10,000 non-landmark genes based on the landmark gene expression for that experiment, and 4) determine the differential expression of each landmark and inferred gene relative to vehicle controls and to that gene’s expression in each of the other wells on that assay plate.

rna_plate   Name of the plate as it was used throughout the assay prior to detection; the name includes all information except the bead_batch_id suffix, for example: LJP005_A375_24H_X1.

rna_well   Name of a well within an rna_plate.

S

SSF   Specialized Service Facility; this name applies to a Broad program such as CMap that provides a service for outside groups to use for their research.

Seedolog   shRNAs that target different genes but share the same seed sequence, a 2-8 nucleotide sequence that contributes to off-target effects.

Signature   We refer to a signature as the entire vector of differential expression values, one per gene, obtained after aggregating the data across replicates; therefore “signature” is used to refer to level 5 data (each replicate prior to aggregation is referred to as a “profile”). Signatures provide a representation of the biological response of the genome to the perturbation. For the L1000 assay, each signature is designated by its sig_id identification tag.

Signature concordance (SC)   The reproducibility of gene expression changes in a signature across biological replicates.

Signature strength (SS)   The number of significantly differentially expressed transcripts in a signature.

Summly   Given a set of connectivity scores for a particular perturbagen in several cell lines, summly is a computational method that summarizes those scores across the cell lines, resulting in a single summary score.

sig_id   A CMap unique identification number assigned to each signature generated from L1000 data.

sm_center_compound_id   synonym for pert_mfc_id.

sm_dose   synonym for pert_dose.

sm_dose_unit   synonym for pert_dose_unit.

sm_lincs_id   synonym for pert_id.

sm_name   synonym for pert_iname.

sm_pert_type   synonym for pert_type.

sm_time   synonym for pert_time.

sm_time_unit   synonym for pert_time_unit.

T

Tag-duo   CMap lab methodology for detecting two genes using the same analyte color.

Touchstone dataset   Our reference dataset produced from a set of ~8000 perturbagens for which we have annotated biological function information, and which we have assayed in the majority of the nine core cell lines under our standard conditions.

Transcriptional activity score (TAS)   TAS is a metric that incorporates the signature strength (the number of significantly differentially expressed transcripts) and signature concordance (the reproducibility of those changes across biological replicates) to capture activity of a compound. The score is computed as the geometric mean of the signature strength and the 75th quantile of pairwise replicate correlations for a given signature. Prior to computing the geometric mean, the signature strength is multiplied by the square root of the number of replicates. This serves to mitigate score shrinkage with increasing replicate number and allows TAS values derived from signatures of different numbers of replicates to be compared with each other.
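
A literal reading of the TAS definition can be sketched as follows. This is an illustration only: it takes the geometric mean of two numbers to be the square root of their product, scales the signature strength by the square root of the replicate count first, and clips a negative correlation to zero so the square root is defined. The clipping and the function name are assumptions, and any further normalization constant used in practice is not given in the definition and is therefore omitted.

```python
import math


def tas(signature_strength, cc_q75, n_replicates):
    """Geometric mean of replicate-scaled signature strength and cc_q75.

    signature_strength: number of significantly differential transcripts.
    cc_q75: 75th quantile of pairwise replicate correlations.
    """
    scaled_strength = signature_strength * math.sqrt(n_replicates)
    concordance = max(cc_q75, 0.0)  # assumed clipping of negative correlations
    return math.sqrt(scaled_strength * concordance)


# A signature with 100 modulated transcripts, cc_q75 of 0.25, and 4 replicates.
print(tas(100, 0.25, 4))
```

Because the result is a geometric mean, a signature needs both many modulated transcripts and reproducible replicates to score highly; a large value of one cannot compensate for a near-zero value of the other.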

Treatment dose   The amount of perturbagen used to treat cells in an experiment. While dose may vary depending on the experiment, typical treatment doses range from 0.04 µM to 10 µM.

Treatment duration   The amount of time cells are exposed to a particular perturbagen in an experiment, prior to cell lysis. Typically cells are treated with compound perturbagens for 6 hrs and 24 hrs. RNAi perturbation treatment lasts for 96 hrs.

t-SNE   A dimensionality reduction technique particularly well suited for visualizing data. (For references, see https://lvdmaaten.github.io/tsne/)

target_seq   The sequence within a gene that is targeted by a hairpin shRNA for knockdown, to abolish expression of that gene. This term applies to shRNA experiments only.

tas_q75   Aggregated transcriptional activity score. For a given perturbagen, tas_q75 is computed as the 75th quantile of its TAS across cell lines. The higher the number, the more generally active the perturbagen. Only exemplar signatures are used for computing tas_q75. See is_exemplar for more details.

ts_pc   The percent of total Touchstone perturbagens that connect to the given perturbagen above the indicated thresholds.

Z

Z-score   For a data point, the number of standard deviations that point is above or below the population mean is called its Z-score. In the L1000 data processing pipeline, we compute a robust z-score for each gene in each sample. The reference population used to compute the median and MAD is the expression of the given gene in every other well on the plate. These z-score values correspond to level 4 data.
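
A robust z-score of the kind described can be sketched with the median and the median absolute deviation (MAD). This is a minimal example, assuming the common scaling constant 1.4826, which makes the MAD comparable to a standard deviation for normally distributed data; the constant, the function name, and the absence of a lower bound on the MAD are assumptions, and the pipeline's exact formula may differ.

```python
from statistics import median


def robust_z(value, reference, scale=1.4826):
    """Robust z-score of value against a reference population.

    reference: expression of the same gene in the other wells on the plate.
    Raises ZeroDivisionError if the reference has a MAD of zero.
    """
    med = median(reference)
    mad = median([abs(v - med) for v in reference])
    return (value - med) / (scale * mad)


# One gene measured in five wells; score the last well against the other four.
plate = [8.0, 8.2, 7.9, 8.1, 10.0]
well = 4
others = plate[:well] + plate[well + 1:]
print(robust_z(plate[well], others))
```

Using the median and MAD instead of the mean and standard deviation keeps a few strongly responding wells from distorting the reference population.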

zmad_ref   The reference population used for Z scoring; generally population or vehicle, but it could be custom (see provenance code for what was used).