CMap is dedicated to ensuring that every plate run through the L1000 assay produces data of high technical quality. We have implemented a large number of quality control and assurance procedures and routinely exclude samples that appear sub-optimal. The guiding principle of LINCS / CMap is to make data that has passed all technical QC measures available, even if parts of the data aren't perfect. Hence, we encourage users to access data that we have vetted and organized for analysis in our webapps at clue.io (as opposed to downloading data). If you choose to download data from GEO, please pay attention to metrics included in metadata headers (rather than assuming all columns of data pass QC), and be aware that, as in any high-throughput system, there could be unobserved issues.
We have developed QC metrics to assess quality and to address issues that could arise at various stages of the assay. These include measures to ensure that 1) the correct cell line was plated and treated, 2) PCR amplification was successful, 3) beads were properly added throughout the plate, and 4) any equipment issues were identified and addressed promptly. Our quality control procedures for L1000 are highly automated, allowing high-level QC metrics to be analyzed in a high-throughput and timely manner. We have developed a plate tracking system called Lumitracker that generates real-time QC plots, allowing us to assess the quality of a plate from the moment it begins the detection process on our scanners. More extensive analysis is done when a plate finishes scanning and runs through our data pipeline, which assesses the various QC results and flags plates using a pass/fail system.
Here we describe several of the most important QC metrics as well as some of the QC tools we use. All of the plots and tools outlined are essential for ensuring data quality and flagging potential problems that could arise in any aspect of our production process.
How do we verify that the correct cell line was plated and treated?
To ensure that the proper cell line was plated and treated, our team has developed a tool known as Dactyloscopy. Formally, dactyloscopy is the science of fingerprint identification; in L1000 it refers to identifying the fingerprints of cell lines. This QC tool verifies the identity of cell lines on CMap plates by calculating the Spearman correlation coefficient of the level 3 (QNORM) gene expression data in each well with respect to a library of reference baseline expression profiles for over 1,000 cell lines. If the cell line on a plate is labeled correctly, its correlation coefficient to the respective profile in the reference library should be higher than to the profiles of every other cell line in the library. If a plate fails dactyloscopy, it fails the overall L1000 QC process.
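The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production Dactyloscopy tool: the function names, the dictionary-based reference library, and the per-well pass rule are all hypothetical, and the Spearman coefficient is computed as the Pearson correlation of average ranks.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def check_cell_line(well_profile, expected_line, reference_library):
    """Return (passed, scores) for one well.

    reference_library is a hypothetical dict mapping cell line name to a
    baseline expression profile with genes in the same order as well_profile.
    The well passes if the expected cell line has the highest correlation.
    """
    scores = {name: spearman(well_profile, ref)
              for name, ref in reference_library.items()}
    best_match = max(scores, key=scores.get)
    return best_match == expected_line, scores
```

With a toy two-line library, a well whose expression ranks match the reference for its labeled cell line passes, while the same well labeled as the other cell line fails. How per-well results are rolled up into a plate-level pass/fail is not specified here and is left out of the sketch.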
How can we verify that proper amplification occurred?
There are several ways we can ensure that all the steps from mRNA capture to PCR amplification occurred properly. These include several spiked-in control wells as well as QC plots such as our calibration curves plot and invariant set 10 levels heatmap.
For L1000 experiments, the four wells in the top left corner of each plate (A1, A2, B1, and B2; figure below) are reserved for controls. Prior to lysate capture, reference MCF7 RNA is added to wells A2 and B2. If these wells look technically good but the rest of the plate failed, then the problem is most likely due to poor lysate quality or an error during the capture step. Later in the L1000 process, before hybridization, we add a positive amplicon control to well B1. If amplification in this well looks technically good and the rest of the plate failed, then this could indicate an amplification problem. If this well failed along with the rest of the plate, then the issue most likely occurred during the hybridization or staining steps. Well A1 is a negative control well that contains nothing except beads.
Calibration Curves Plot
The calibration curves plot shows a graphical representation of the median expression of each set of invariant genes in each well. As part of the initial L1000 assay design we identified a number of genes that exhibit relatively constant expression across different tissue types and experimental conditions. We call these "invariant genes" and we use 80 of these grouped into ten sets such that the genes within a set have similar expression levels and the median expression levels of the sets are monotonically increasing. Each well is represented by a single line through the median expression of its invariant sets, and since the median expression of the invariant genes should not vary much from well to well, we expect to see a similar trend for each line. In other words, the lines should be close together; the more spread out they are, the more variation in fluorescence intensity we are seeing between wells on that plate, which can indicate a problem with some wells or the entire plate. Below are examples of two calibration curve plots. The top plot shows a good plate with little variation in expression between wells, while the plot below it shows a plate where only the positive amplicon control showed a normal calibration curve and every other well was flat, indicating that either amplification did not occur or there was an issue with amplicon transfer during the hybridization step.
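As a rough illustration of what each line in a calibration curves plot represents, the sketch below computes one curve per well and a simple measure of how far the curves spread apart. The data layout (a dict of gene to expression per well, and a list of gene lists for the invariant sets) and the function names are assumptions made for the example; the spread measure is a hypothetical stand-in for the visual judgment described above.

```python
from statistics import median


def calibration_curve(well_expression, invariant_sets):
    """Median expression of each invariant set in one well, in set order.

    well_expression: dict mapping gene name to expression in this well.
    invariant_sets: list of gene-name lists, ordered from the lowest- to
    the highest-expressed set (ten sets in the real assay).
    """
    return [median(well_expression[gene] for gene in genes)
            for genes in invariant_sets]


def spread_per_set(curves):
    """Range (max - min) across wells of each invariant set's median level.

    curves: list of calibration curves, one per well. Small ranges mean
    the lines sit close together; large ranges mean they fan out.
    """
    return [max(levels) - min(levels) for levels in zip(*curves)]
```

A flat curve for a well, as in the failed plate described above, would show up as a curve whose values barely increase from the first set to the last.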
Invariant Set 10 Levels Heatmap
The invariant set 10 levels heatmap displays the median expression levels of only invariant set 10 (the set with the highest level of expression). Once again, we are looking to see that the fluorescence intensity is consistent across the plate. By looking at just invariant set 10 in the heatmap we can see whether there are any effects that depend upon the position of the wells within the plate, which cannot be determined from the calibration curve plots. For example, the heatmaps below show (top) a plate with fairly uniform invariant set 10 median fluorescence intensity (and hence probably of high technical quality) and (bottom) a plate where the intensity is not uniform and has a spatially localized effect of higher invariant set 10 median fluorescence intensity around the edges, suggesting that there was most likely an issue with staining.
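One simple way to put a number on the edge effect described above is to compare the mean invariant set 10 level of wells on the plate border with that of interior wells. The sketch below is hypothetical: the function names are invented, well names are assumed to be a row letter followed by a column number (as in A1, B2), and the default 16 x 24 layout assumes a 384-well plate.

```python
from statistics import mean


def well_position(well):
    """Convert a well name such as 'B2' to zero-based (row, column) indices."""
    row = ord(well[0].upper()) - ord("A")
    col = int(well[1:]) - 1
    return row, col


def edge_and_interior_means(values_by_well, n_rows=16, n_cols=24):
    """Mean value of border wells and of interior wells.

    values_by_well: dict mapping well name to the invariant set 10 median
    level for that well. Both groups must contain at least one well.
    """
    edge, interior = [], []
    for well, value in values_by_well.items():
        row, col = well_position(well)
        on_edge = row in (0, n_rows - 1) or col in (0, n_cols - 1)
        (edge if on_edge else interior).append(value)
    return mean(edge), mean(interior)
```

A large gap between the two means would correspond to the non-uniform heatmap in the bottom example; what counts as "large" is a judgment call that this sketch does not attempt to encode.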
How do we verify that the same number of Luminex beads was added consistently throughout the plate?
Median Bead Count per well Heatmap
The median bead count per well heatmap displays the median bead count in each well of the plate. Ideally we aim to see a count of ~70 for each bead color in each well. Sufficient bead counts are vital to ensure that the peak deconvolution algorithm can correctly identify genes during the deconvolution step of the signature generation process. Below is an example of a good plot of median count per well (top), where the wells look uniform and most are close to or greater than ~70, and a bad median count per well plot (bottom), where bead counts are non-uniform, with many wells far below 70 and some approaching 0.
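The values behind such a heatmap can be sketched as follows, assuming the median is taken across bead colors within a well. The input layout, the function names, and the `cutoff` parameter are hypothetical; the text gives ~70 as the target count but does not state a pass/fail cutoff, so none is hard-coded.

```python
from statistics import median


def median_bead_counts(counts_by_well):
    """Median bead count per well.

    counts_by_well: dict mapping well name to a list of bead counts,
    one entry per bead color detected in that well.
    """
    return {well: median(counts) for well, counts in counts_by_well.items()}


def low_count_wells(counts_by_well, cutoff):
    """Sorted names of wells whose median bead count is below `cutoff`.

    `cutoff` is an illustrative parameter chosen by the caller.
    """
    medians = median_bead_counts(counts_by_well)
    return sorted(well for well, m in medians.items() if m < cutoff)
```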
Quality Code Plot
Quality code plots display, for each well, the percentage of analytes that show each number of peaks. Since the majority of beads are used to measure two separate genes, ideally we expect to see two peaks for almost all analytes. The plot below (top) is considered good because 90.8% of the analyte-wells are detected as having two peaks. The plot below that is considered bad because only 15.5% of analyte-wells have two peaks.
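The percentages quoted above are simple tallies. A minimal sketch, assuming the number of detected peaks is already available for each analyte-well as a flat list (the function name and input layout are illustrative, not part of the pipeline):

```python
from collections import Counter


def peak_count_percentages(peak_counts):
    """Percentage of analyte-wells showing each number of peaks.

    peak_counts: non-empty list with one integer per analyte-well,
    giving the number of peaks detected for that analyte in that well.
    Returns a dict mapping number of peaks to percentage.
    """
    tally = Counter(peak_counts)
    total = len(peak_counts)
    return {n_peaks: 100.0 * count / total
            for n_peaks, count in sorted(tally.items())}
```

The value reported for two peaks is the figure compared between the good and bad plots above.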
How can we identify problems with equipment used in the lab?
Weekly QC report
We produce a weekly QC report that is automatically generated and populated with information about all of the plates that were scanned in a given week. This report is extremely useful for determining technical issues, particularly with our scanning and liquid handling equipment. It shows the percentage of plate failures from each scanner for that week. The report also includes a well failure heat map as well as the number of failures per well, both of which indicate if there was a problem with a liquid handler, such as a clogged tip.
What QC metrics are looked at after replicates have been collapsed?
L1000 experiments are typically run in triplicate, and results from the three replicates are combined using a process we refer to as "brew". During this step in our computational pipeline, biological replicate experiments are collapsed into a single signature using the "moderated z-score" procedure. This procedure mitigates the effects of uncorrelated or outlier data from individual replicates and thus generates a signature that more accurately reflects the transcriptional effects of a given perturbagen. Several QC plots are generated after the three replicates have been brewed together. These include the strength-correlation plot (also known as the "s-c plot") and the instance correlations plot.
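The text does not spell out how the moderated z-score weights replicates, so the following is only a sketch of one correlation-weighted averaging scheme that has the stated property of down-weighting uncorrelated or outlier replicates. It should not be read as the pipeline's exact procedure: the function names, the use of a precomputed pairwise correlation matrix, and the `min_weight` floor are all assumptions.

```python
def correlation_weights(corr, min_weight=0.01):
    """Weight each replicate by its mean correlation with the others.

    corr: square matrix (list of lists) of pairwise correlations between
    replicates; the diagonal is ignored. Weights are floored at the
    hypothetical `min_weight` and normalized to sum to 1.
    """
    n = len(corr)
    raw = []
    for i in range(n):
        others = [corr[i][j] for j in range(n) if j != i]
        raw.append(max(sum(others) / len(others), min_weight))
    total = sum(raw)
    return [w / total for w in raw]


def collapse_replicates(replicates, corr, min_weight=0.01):
    """Weighted average of replicate z-score vectors, gene by gene.

    replicates: list of equal-length z-score lists, one per replicate,
    in the same order as the rows of `corr`.
    """
    weights = correlation_weights(corr, min_weight)
    n_genes = len(replicates[0])
    return [sum(w * rep[g] for w, rep in zip(weights, replicates))
            for g in range(n_genes)]
```

In this scheme a replicate that correlates poorly with the other two contributes little to the collapsed signature, whereas a plain mean would let it pull the result as strongly as the replicates that agree with each other.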
Strength Correlation Plot
A strength-correlation (s-c) plot (shown below) displays a graphical representation of the signal strength and correlation associated with each signature. Signal strength (y axis) is computed as the number of landmark genes with absolute z-score greater than or equal to 2; correlation (x axis) refers to the correlation of the replicates for each signature. (Note: To measure correlation we use the 75th percentile of the set of Spearman correlations between the replicates. Typically we use 3 replicates, so there are 3 pairwise correlation values (A with B, B with C, A with C; 3-choose-2). For the 75th percentile of 3 numbers, we use the definition 0.25b + 0.75c, where the three numbers (a, b, c) satisfy a ≤ b ≤ c.) In addition to the signatures of replicates, we also calculate and plot a null distribution, which is generated by calculating correlations among signatures that are not replicates of each other.
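The two coordinates of a point on the s-c plot follow directly from the definitions above. A minimal sketch, with illustrative function names and the three-replicate case written out explicitly:

```python
def signal_strength(landmark_z_scores, threshold=2.0):
    """Number of landmark genes with |z-score| >= threshold (y axis)."""
    return sum(1 for z in landmark_z_scores if abs(z) >= threshold)


def replicate_correlation_q75(pairwise_correlations):
    """75th percentile of three pairwise Spearman correlations (x axis).

    Uses the definition from the text: 0.25 * b + 0.75 * c, where
    a <= b <= c are the three values in sorted order.
    """
    a, b, c = sorted(pairwise_correlations)
    return 0.25 * b + 0.75 * c
```

For pairwise correlations of 0.2, 0.4, and 0.6, the formula gives 0.25 × 0.4 + 0.75 × 0.6 = 0.55. The formula is written out rather than delegated to a library percentile routine because interpolation conventions differ; NumPy's default linear method, for instance, would return the midpoint of b and c for three values.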
In the plot below, the blue circles represent samples and the blue x's represent entries in the null distribution. The plot is divided into four quadrants by the dashed lines, whose positions are determined from the null distribution. The vertical dashed line is placed at a value such that ~98% of the points in the null distribution are to its left. The horizontal dashed line is placed at a value such that ~98% of the points in the null distribution are below it. The quadrants then serve as a rough heuristic about the behavior of the signatures.
The upper right quadrant indicates samples that are highly correlated and have a strong signal strength. Samples in this quadrant most likely indicate highly active perturbagens. You can see that in the example below there are several samples in this quadrant as well as our positive controls (orange circles).
The lower right quadrant indicates samples that have a strong correlation but lower signal strength.
The lower left quadrant indicates samples that have low correlation and low signal strength. By definition, the majority of null distribution points are located in this quadrant. Signatures that occur in this quadrant are probably indicative of weak compounds. As expected the majority of our negative controls (green squares) fall within this lower left quadrant.
The upper left quadrant contains samples that show low correlation but high signal strength, i.e., while the replicates are not that similar to each other, one shows a very strong effect. Signatures in this quadrant are suspect, and if a plate has many signatures here it may indicate a systematic problem with the experiment.
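The placement of the dashed lines and the quadrant assignment described above can be sketched as follows. The function names are hypothetical, the cutoff uses a simple empirical quantile (the exact quantile rule used in production is not stated here), and points lying exactly on a line are assigned to the lower or left side by assumption.

```python
import math


def null_cutoff(null_values, fraction=0.98):
    """Smallest null value with at least `fraction` of null points at or below it."""
    ordered = sorted(null_values)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


def quadrant(correlation, strength, corr_cutoff, strength_cutoff):
    """Name the s-c plot quadrant for one signature.

    corr_cutoff is the position of the vertical dashed line and
    strength_cutoff the position of the horizontal dashed line.
    """
    high_corr = correlation > corr_cutoff
    high_strength = strength > strength_cutoff
    if high_corr and high_strength:
        return "upper right"   # correlated and strong: likely active
    if high_corr:
        return "lower right"   # correlated but lower signal strength
    if high_strength:
        return "upper left"    # strong but poorly correlated: suspect
    return "lower left"        # where most of the null distribution falls
```

In use, `null_cutoff` would be called once on the null correlations and once on the null signal strengths, and the two results passed to `quadrant` for every signature on the plate.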
Instance Correlations Plot
Instance correlation plots display similar information to s-c plots but with a different graphical representation. The plot below shows two distributions, one in blue and one in red. The blue distribution is the null: correlations among signatures created by combining random experiments. The red distribution shows the correlations among replicate signatures. As expected, there is a clear distinction between the two distributions. Overlapping distributions could indicate a problem with the experiment.