A dimensionality reduction technique particularly well suited for visualizing data. (For references, see https://lvdmaaten.github.io/tsne)

The parameters that were used for running t-SNE here are: 50 initial dimensions, perplexity of 30, and theta of 0.5. For datasets with <= 5000 samples, the standard t-SNE algorithm is used. For larger datasets, the Barnes-Hut algorithm is employed.

A dimensionality reduction technique in which data are projected onto the two orthogonal directions (principal components) that capture the largest possible variance.

To analyze relationships between perturbations, we utilize the framework of connectivity. A connectivity score between two perturbations quantifies the similarity of the cellular responses evoked by these perturbations. A score of 1 means that these two perturbations are more similar to each other than 100% of other perturbation pairs. A score of -1 means that these two perturbations are more dissimilar to each other than 100% of other perturbation pairs.
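Read this way, a connectivity score is a percentile rank rescaled to the range [-1, 1]. A minimal sketch of that mapping in Python (an illustration of the percentile rescaling only, not CMap's actual query algorithm; the function and its inputs are hypothetical):

```python
import numpy as np

def connectivity_score(similarity, background_similarities):
    """Rescale a percentile rank to [-1, +1]: +1 when this pair is more
    similar than all background pairs, -1 when more dissimilar than all."""
    bg = np.asarray(background_similarities, dtype=float)
    frac_below = np.mean(bg < similarity)  # fraction of background pairs beaten
    return 2.0 * frac_below - 1.0
```

A pair more similar than the entire background scores +1, one more dissimilar than the entire background scores -1, and a pair sitting at the background median scores 0.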

See a heatmap of connections between individual perturbagens in cell lines and all other perturbagens used for the P100 assay or the GCP assay. The tutorial describes the features of the heatmap.

Bring data, in GCT format, from your own P100 or GCP studies to query against our datasets.

Introspect means querying your dataset against itself. Make sure to "Include Introspect" if you would like to see connections within your dataset (in addition to connections between your dataset and Touchstone-P).

In computing connectivity, biological or technical replicates can be aggregated together. Please select which metadata fields should be used to recognize replicates. For example, if you wish to distinguish between different doses of the same compound, make sure to select "pert_dose" (or something similar) as one of the metadata fields by which to group replicates. The possible metadata fields by which to group replicates appear only after you have uploaded your GCT and selected "Yes" for "Are there replicates in your data?".

Matched mode: When running GUTC, incorporates cell-line information to match query data against matching cell types in Touchstone. Currently this includes the following 9 cell types: A375, A549, HEPG2, HCC515, HA1E, HT29, MCF7, PC3, and VCAP.
Unmatched mode (recommended): When running GUTC, does not incorporate cell-line information when querying the data against Touchstone signatures.

L-Build ("Light" Build):  All levels of L1000 data up to aggregated signatures.
Full Build:  All levels of L1000 data up to aggregated signatures, as well as all relevant additional analyses of the data (Introspect, t-SNE, PCA, etc.).

When querying Touchstone, Feature Space determines what set of genes to query against. When perturbagens are profiled on the L1000 platform, Landmark is recommended. When the queries you wish to use are not landmarks, use BING instead.

Root location within a brew folder that contains the instance matrices and the brew_group folder. Default is brew/pc

List of expected treatment doses in micromolar as a listmaker list. If provided, dose discretization is applied to the pert_dose metadata field to generate a canonicalized pert_idose field. Note this assumes that the pert_dose annotations are in micromolar.

Generates TAS plots and a connectivity heatmap of preliminary calibration plates to identify the most suitable experimental conditions among the specified parameters. The tool should be run on small pilot experiments with a variety of experimental parameters, such as seeding density and time point. Plots can also be split out by parameters such as cell id.

Column filter to sig_build_tool as a listmaker collection

The name of the build used when generating all associated files and folders (e.g. <BUILD_CODE>_metadata). For this reason, the code must be filename compatible.

When merging replicates for L1000, several versions of the merged data are made. This parameter determines which version to use when creating your build; by_rna_well is the default and is recommended.

All data are from the Cancer Cell Line Encyclopedia resource. Expression data were released 15-Aug-2017; copy number data are dated 27-May-2014; and mutational data are dated 15-Aug-2017.

Feature Mapping: Ensembl Ids from the source data were mapped to Entrez Gene Ids using gene annotations from NCBI (downloaded on 02-Mar-2016).
Normalization:  RNAseq RPKM values were log2 transformed using log2(max(RPKM, eps)). The data were then normalized such that the expression values were comparable across cell lines, by minimizing technical variation and equalizing their distributions (for details of the normalization, see LISS and QNORM entries in the Connectopedia glossary). Post-normalization, the expression values range between 4 and 15 log2 units, with 4 indicating that a gene is minimally or not expressed and 15 indicating the maximum readout.
Z-scores: The number of standard deviations that a gene is above or below the population mean is called its z-score. The "robust" z-score is resistant to outliers by using median instead of mean and median absolute deviation (MAD) instead of standard deviation. The reference population used to compute the median and MAD for a particular gene is all CCLE lines with data for that gene.
Z-scores Within Primary Site: Similar to z-scores, but the reference population used to compute the median and MAD is all CCLE lines from the same lineage with data for that gene.

All scores indicated are log2 ratios to reference, binned using the heuristics described in CNVkit.

Deletion:  score < -1.1
Loss:  -1.1 ≤ score ≤ -0.25
No change:  -0.25 < score < +0.2
Gain: +0.2 ≤ score < +0.7
Amplification: +0.7 ≤ score
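Applied in code, the bins above look like the following sketch (thresholds copied verbatim from the table above; the function name is ours):

```python
def classify_cnv(score):
    """Bin a log2 copy-number ratio to reference using the thresholds above."""
    if score < -1.1:
        return "Deletion"
    elif score <= -0.25:   # -1.1 <= score <= -0.25
        return "Loss"
    elif score < 0.2:      # -0.25 < score < +0.2
        return "No change"
    elif score < 0.7:      # +0.2 <= score < +0.7
        return "Gain"
    else:                  # +0.7 <= score
        return "Amplification"
```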

Access a suite of analysis apps by clicking on the menu (or type command-K to open)

Switch between running a single query and running a batch query.

Give each query a descriptive name that will help you identify your results.

Tip: Each list can have a different number of genes; in fact, you can run a query with only one list (up OR down).

Your query will take about 5 minutes to process; check the History section in the Menu for your results!

Valid genes used in the query have HUGO symbols or Entrez IDs and are well-inferred or directly measured by L1000 (member of the BING gene set). Valid genes not used in a query are those that have a valid HUGO or Entrez identifier but are not part of the BING set. Invalid genes do not have HUGO or Entrez IDs.

Give each query a descriptive name that will help you identify your results.

Your query will take about 5 minutes to process; check the History section in the Menu for your results!

The sig_fastgutc_tool is a reimplementation of our query algorithm that returns query results faster, especially at larger batch sizes. It is the result of a crowd-sourced contest and is currently in beta.

Filter datasets by category to see only those of interest.

Data Icons identify published and proprietary datasets.

Click on a row to see a summary of that dataset, including cell lines and treatment conditions, assay type, and dates.

Arrange the table to display the information most important for your work, and add key datasets to favorites.

View details about the collection as a whole and about individual compounds.

View subsets of compounds based on mechanism, drug target, or known disease application.

Purity is assessed by ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) of compounds after receipt from the vendor.

Status as of publication of this resource (March 2017). We will be updating this but let us know if you notice a discrepancy.

Click on a compound to see details about its structure, mechanism, targets, approval status, and vendor.

Mouse over this graphic to see the classes of proteins targeted by drugs in the hub.

This is the current count of perturbagens in the reference (touchstone) dataset.

Select data from perturbagens grouped by their MoA or role in the cell.

Choose a perturbagen type, or view them all.

Touchstone is our reference dataset, made from well-annotated perturbagens profiled in a core set of 9 cell lines.

Detailed List is unavailable for Touchstone v1.1.1.1. A new data visualization approach is in development, but to get results in a table format (similar to Detailed View), please click on Heat Map and download the dataset as a GCT file that can be viewed in Excel or similar apps. Please see here for a detailed explanation.

Articles are tagged with topics. Click on a topic tag to see all related articles.

Look it up! A quick reference guide of CMap terms and their meanings.

Email us with your questions.

Click on the heading to read all the articles in this section on a single page, or open each article separately.

Click on a heading to open a menu of articles.

Each article is tagged with key words that describe its content.

Underlined words link to their definition in the CMap glossary.

Your feedback helps us make Connectopedia more useful.

Average transcriptional impact

TAS is a metric that incorporates the signature strength (the number of significantly differentially expressed transcripts) and signature concordance (the reproducibility of those changes across biological replicates) to capture activity of a compound. The score is computed as the geometric mean of the signature strength and the 75th quantile of pairwise replicate correlations for a given signature. Prior to computing the geometric mean, the signature strength is multiplied by the square root of the number of replicates. This serves to mitigate score shrinkage with increasing replicate number and allows TAS values derived from signatures of different numbers of replicates to be compared with each other.
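The calculation described above can be sketched as follows. Clipping negative correlations to zero is our assumption (the text above does not say how negative values are handled, and a negative product has no real square root):

```python
import math

def tas(signature_strength, cc_q75, n_reps):
    """Transcriptional Activity Score: geometric mean of the replicate-
    scaled signature strength and the 75th-quantile replicate correlation."""
    scaled_ss = signature_strength * math.sqrt(n_reps)
    cc = max(cc_q75, 0.0)  # assumed clipping of negative correlations
    return math.sqrt(scaled_ss * cc)
```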

Signature diversity

Thick black bars signify Transcriptional Activity Scores greater than or equal to 0.5; thinner black bars denote scores less than 0.5. Absence of a bar means no data available. Colored lines (chords) signify similar connectivity scores between cell lines; red for positive connectivity scores of 80-100 (pale to intense color according to the score); blue for negative connectivity. Chords are only shown when TAS scores are > 0.5; thus absence of a chord either means that the perturbagen TAS score is very low, or that no data is available. Chords for individual cell lines can be isolated from the rest of the figure by hovering over the cell line name.

Baseline expression of this gene in each cell line is represented as a z-score (top numbers). Scores were calculated using the robust z-score formula:

z_i = ( x_i - median( X ) ) / ( MAD( X ) * 1.4826 )

where:

x_i is the expression value of a given gene in the i-th cell line

X = [ x_1, x_2 ... x_n ] is the vector of expression values for a given gene across n cell lines

MAD( X ) is the median absolute deviation of X

1.4826 is a constant that rescales the score as if the standard deviation of X, rather than the MAD, had been used

Median and MAD expression values were calculated using RNA-Seq profiles from a total of 1022 cell lines, comprising data from the Cancer Cell Line Encyclopedia (CCLE; Barretina, et al.) and cell lines nominated by the CMap team. Plots show z-score values only for the core LINCS lines used by CMap in L1000 experiments. Light red or light blue regions indicate positive or negative outlier expression, respectively, of the gene relative to the other lines shown; z-score of a positive outlier in the corresponding cell line is in dark red and a negative outlier is in dark blue.
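The robust z-score formula above translates directly to NumPy; here is a minimal sketch, vectorized across cell lines:

```python
import numpy as np

def robust_zscore(x):
    """z_i = (x_i - median(X)) / (MAD(X) * 1.4826), per the formula above."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # median absolute deviation
    return (x - med) / (mad * 1.4826)
```

Note how an extreme outlier (e.g. one cell line with very high expression) barely shifts the median and MAD, so the other scores stay stable.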

Summary class connectivity shows a boxplot that summarizes the connectivity of a class. Each data point, shown as a light gray dot, represents the median value of connectivity of one member to the other class members. (This corresponds to the median for each row, excluding the main diagonal, in the heatmap shown below.) The box is the distribution of those data points, where the box boundary represents the interquartile range, the vertical line within the box is the median, and the whiskers reflect the minimum and maximum values of the data (exclusive of extreme outliers, which may appear beyond the whiskers).

Connectivity between members of class is a standard heat map of the connectivity scores, summarized across cell lines, between members of the class, where dark red represents the highest positive scores and deep blue the highest negative scores. Individual scores are revealed to the left below the map by hovering over each cell of the map.

Class inter-cell line connectivity is a plot of the median (black line) and Q25-Q75 connectivity scores (blue area around black line) for each cell line as well as the summary scores across cell lines. In some cases perturbations have not been tested in every cell line; the absence of data is indicated by a “0” for that cell line. The example shown reveals that these estrogen agonists show the strongest connectivity to each other in MCF7, a human breast cancer cell line that expresses the estrogen receptor.

Profile status

Colored portion of top bar indicates the Broad assays in which this compound has been profiled.

L1000 cell/dose coverage

For compounds profiled by L1000, cell lines and dose range for which signatures are available are indicated by dark gray bars (a lighter gray bar indicates no data is available for that cell line/dose combination). A bar displayed one row above the 10 uM row indicates that doses higher than 10 uM were tested. The 6 rows correspond to 6 canonical doses: 20 nM, 100 nM, 500 nM, 1 uM, 2.5 uM, and 10 uM. (In some cases non-canonical doses were tested; these are rounded to the nearest canonical dose for the purpose of this display. For example, if the dose tested was 3.33 uM, the 2.5 uM bar is shown in dark gray here.)
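Rounding to the nearest canonical dose can be sketched as below; tie-breaking and the treatment of doses above 10 uM are assumptions on our part:

```python
def nearest_canonical_dose(dose_um):
    """Map a tested dose (in uM) to the nearest of the 6 canonical doses."""
    canonical = [0.02, 0.1, 0.5, 1.0, 2.5, 10.0]  # 20 nM ... 10 uM
    return min(canonical, key=lambda c: abs(c - dose_um))
```

For example, a tested dose of 3.33 uM maps to the 2.5 uM row, matching the display described above.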

Is the L1000 data "fully QC'd"?

CMap is dedicated to ensuring that every plate run through the L1000 assay produces data of high technical quality. We have implemented a large number of quality control and assurance procedures and routinely exclude samples that appear sub-optimal. The guiding principle of LINCS / CMap is to make data that has passed all technical QC measures available, even if parts of the data aren't perfect. Hence, we encourage users to access data that we have vetted and organized for analysis in our webapps at clue.io (as opposed to downloading data). If you choose to download data from GEO, please pay attention to metrics included in metadata headers (rather than assuming all columns of data pass QC), and be aware that, as in any high-throughput system, there could be unobserved issues.

We have developed QC metrics to assess quality and to address issues that could arise at various stages throughout the assay. These include measures to ensure that 1) the correct cell line was plated and treated, 2) PCR amplification was good, 3) bead was properly added throughout the plate, and 4) any equipment issues were identified and addressed promptly. Our quality control procedures for L1000 are highly automated, allowing for high-level quality control metrics to be analyzed in a high throughput and timely manner. We have developed a plate tracking system called Lumitracker that generates real time QC plots, allowing us to assess the quality of a plate from the time it begins the detection process on our scanners. More extensive analysis is done when a plate finishes scanning and runs through our data pipeline, which assesses the various QC results and flags plates using a pass/fail system.

Here we describe several of the most important QC metrics as well as some of the QC tools we use. All of the plots and tools outlined are essential for ensuring data quality and flagging potential problems that could arise in any aspect of our production process.

How do we verify that the correct cell line was plated and treated?

To ensure that the proper cell line was plated and treated, our team has developed a tool known as Dactyloscopy. Formally, dactyloscopy is the science of fingerprint identification; in L1000 it refers to identifying the fingerprints of cell lines. This QC tool verifies the identity of cell lines on CMap plates by calculating the Spearman correlation coefficient of the level 3 (QNORM) gene expression data in each well with respect to a library of reference baseline expression profiles for over 1000 cell lines. If the cell line on a plate is labeled correctly, its correlation coefficient to the respective profile in the reference library should be higher than to the profiles of every other cell line in the library. If a plate fails dactyloscopy, it fails the overall L1000 QC process.

How can we verify that proper amplification occurred?

There are several ways we can ensure that all the steps from mRNA capture to PCR amplification occurred properly. These include several spiked-in control wells as well as QC plots such as our calibration curves plot and invariant set 10 level heatmap.

Control wells

For L1000 experiments, the four wells in the top left corner of each plate (A1, A2, B1, and B2; figure below) are reserved for controls. Prior to lysate capture, wells A2 and B2 have reference MCF7 RNA added to them. If these wells look technically good but the rest of the plate failed, then the problem is most likely due to poor lysate quality or an error during the capture step. Later in the L1000 process, before hybridization, we add a positive amplicon control to well B1. If amplification in this well looks technically good and the rest of the plate failed then this could indicate an amplification problem. If this well failed along with the rest of the plate, then the issue most likely occurred during the hybridization or staining steps. Well A1 is a negative control well that does not contain anything except bead.

Calibration Curves Plot

The calibration curves plot shows a graphical representation of the median expression of each set of invariant genes in each well. As part of the initial L1000 assay design we identified a number of genes that exhibit relatively constant expression across different tissue types and experimental conditions. We call these "invariant genes" and we use 80 of them, grouped into ten sets such that the genes within a set have similar expression levels and the median expression levels of the sets are monotonically increasing. The invariant genes in each well are represented by a single line, and since the median expression of the invariant genes should not vary much from well to well, we expect to see a similar trend for each line. In other words, the lines should be close together; the more spread out they are, the more variation in fluorescence intensity we are seeing between wells on that plate, which can indicate a problem with some wells or the entire plate. Below are examples of two calibration curve plots. The top plot shows a good plate with little variation in expression between wells, while the plot below it shows a plate where only the positive amplicon control showed a normal calibration curve and every other well was flat, indicating that either amplification did not occur or there was an issue with amplicon transfer during the hybridization step.


Invariant Set 10 Levels Heatmap

The invariant set 10 levels heatmap displays the median expression levels of only invariant set 10 (the set with the highest level of expression). Once again, we are looking to see that the fluorescence intensity is consistent across the plate. By looking at just the invariant set 10 in the heatmap we can see if there are any effects that depend upon the position of the wells within the plate, which cannot be determined from the calibration curve plots. For example, the heatmaps below show (top) a plate with fairly uniform invariant set 10 median fluorescence intensity (and hence probably of high technical quality) and (bottom) a plate where the intensity is not uniform and has a spatially-localized effect of higher invariant set 10 median fluorescent intensity around the edges, suggesting that there was most likely an issue with staining.


How do we verify that Luminex beads were added consistently throughout the plate?

Median Bead Count per well Heatmap

The median bead count per well heatmap displays the median count of each bead in each well of the plate. Ideally we aim to see a count of ~70 for each bead color in each well. Sufficient bead counts are vital to ensure that the peak deconvolution algorithm can correctly identify genes during the deconvolution step of the signature generation process. Below is an example of a good plot of median count per well (top), where the wells look uniform and most are close to or greater than ~70, and a bad median count per well plot (bottom), where bead counts are non-uniform, with many wells far below 70 and some approaching 0.


Quality Code Plot

Quality code plots display the percentage of analytes that are showing different numbers of peaks for each well. Since the majority of beads are used to measure two separate genes, ideally we expect to see two peaks for almost all analytes. The plot below (top) is considered good because 90.8% of the analyte-wells are detected as having two peaks. The plot below that is considered bad because only 15.5% of analyte-wells have two peaks.
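The percentage reported in a quality code plot is a simple tally; a minimal sketch (hypothetical function, assuming one detected-peak count per analyte-well):

```python
def pct_two_peaks(peak_counts):
    """Percentage of analyte-wells detected with exactly two peaks."""
    two = sum(1 for k in peak_counts if k == 2)
    return 100.0 * two / len(peak_counts)
```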


How can we identify problems with equipment used in the lab?

Weekly QC report

We produce a weekly QC report that is automatically generated and populated with information about all of the plates that were scanned in a given week. This report is extremely useful for determining technical issues, particularly with our scanning and liquid handling equipment. It shows the percentage of plate failures from each scanner for that week. The report also includes a well failure heat map as well as the number of failures per well, both of which indicate if there was a problem with a liquid handler, such as a clogged tip.

What QC metrics are looked at after replicates have been collapsed?

L1000 experiments are typically run in triplicate and results from the three replicates are combined using a process we refer to as "brew". During this step in our computational pipeline, biological replicate experiments are collapsed into a single signature using the "moderated z-score" procedure. This procedure mitigates the effects of uncorrelated or outlier data from individual replicates and thus generates a signature that more accurately reflects the transcriptional effects of a given perturbagen. There are several QC plots that are generated after three replicates have been brewed together. These include the strength correlation plot (aka "s-c plot") and the instance correlations plot.

Strength Correlation Plot

A strength-correlation (s-c) plot (shown below) displays a graphical representation of the signal strength and correlation associated with each signature. Signal strength (y axis) is computed as the number of landmark genes with absolute z-score greater than or equal to 2; correlation (x axis) refers to the correlation of the replicates for each signature. (Note: To measure correlation we use the 75th percentile of the set of Spearman correlations between the replicates. Typically we use 3 replicates, so there are 3 pairwise correlation values (A with B, B with C, A with C; 3-choose-2). For the 75th percentile of 3 numbers, we use this definition: 0.25b + 0.75c, where the three numbers sorted in increasing order are (a, b, c), with a ≤ b ≤ c.) In addition to the signatures of replicates, we also calculate and plot a null distribution, which is generated by calculating correlations among signatures that are not replicates of each other.
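Both axes of the s-c plot can be computed directly from the definitions above; a minimal sketch (illustrative only, not the production pipeline):

```python
import numpy as np

def signal_strength(zscores):
    """y axis: number of landmark genes with |z-score| >= 2."""
    return int(np.sum(np.abs(np.asarray(zscores, dtype=float)) >= 2))

def q75_of_three(pairwise_corrs):
    """x axis: 75th percentile of the 3 replicate correlations,
    defined above as 0.25*b + 0.75*c for sorted values (a, b, c)."""
    a, b, c = sorted(pairwise_corrs)
    return 0.25 * b + 0.75 * c
```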

Seen below, the blue circles represent each sample and the blue x's represent entries in the null distribution. The plot is divided into four distinct quadrants by the dashed lines, whose positions are determined from the null distribution. The vertical dashed line is placed at a value such that ~98% of the points in the null distribution are to its left. The horizontal dashed line is placed at a value such that ~98% of the points in the null distribution are below it. The quadrants then serve as a rough heuristic about the behavior of the signatures.

  • The upper right quadrant indicates samples that are highly correlated and have a strong signal strength. Samples in this quadrant most likely indicate highly active perturbagens. You can see that in the example below there are several samples in this quadrant as well as our positive controls (orange circles).

  • The lower right quadrant indicates samples that have a strong correlation but lower signal strength.

  • The lower left quadrant indicates samples that have low correlation and low signal strength. By definition, the majority of null distribution points are located in this quadrant. Signatures that occur in this quadrant are probably indicative of weak compounds. As expected the majority of our negative controls (green squares) fall within this lower left quadrant.

  • The upper left quadrant contains samples that show low correlation but high signal strength, i.e., while the replicates are not that similar to each other, one shows a very strong effect. Signatures in this quadrant are suspect, and if a plate has many signatures here it may indicate a systematic problem with the experiment.

Instance Correlations Plot

Instance correlation plots display similar information to SC plots but with a different graphical representation. In the plot below you can see there are two distributions, one in blue and one in red. The blue represents the correlation of replicates of the null distribution of signatures created by combining random experiments, while the red represents the correlation distribution of replicate signatures. There is a clear distinction between these two distributions, as expected. Overlapping distributions could indicate a problem with the experiment.

Last modified: Fri Jan 05 2018 12:46:27 GMT-0500 (EST)
