CONNECTOPEDIA

What levels/types of data are available?

L1000 data proceed through a data processing pipeline outlined in the figure below. Briefly, the pipeline captures raw data from Luminex FlexMap 3D scanners as they are generated, deconvolutes 978 transcripts from only 500 Luminex bead colors, normalizes the data based on 80 invariant control genes, infers the expression of the non-measured transcripts, determines which genes are differentially expressed following a perturbation compared to controls, and generates composite signatures across biological replicates. Along the way, the data are subjected to rigorous quality control filters at both the sample and plate level.

We anticipate that most users will ignore the lower levels and focus their analysis on Level 5 data, since that is what is used for connectivity analyses. We also make data from each of the other levels available for those who want to dive into the details.

The code for the data processing pipeline is available in the cmapM GitHub repository. The procedure to replicate each step of the pipeline, along with sample data, is detailed here.

Level 1 - LXB - raw fluorescent intensity (FI) values measured for every bead detected by Luminex scanners. The FI is proportional to the amount of amplicon bound to the bead, and hence also proportional to the transcript abundance of the genes that particular bead is interrogating. Each 384-well plate generates 384 LXB files, where each file contains a fluorescent intensity value for each observed bead in the well. Here, the data from each perturbagen treatment is referred to as a profile, experiment, or instance.

Level 2 - GEX - Gene expression levels for the 978 landmark genes, deconvoluted from the measured fluorescent intensity values. (See supplementary information in Subramanian, et al., 2017 for details on peak deconvolution.) Here, the data from each perturbagen treatment is referred to as a profile, experiment, or instance.

Level 3a - NORM - Gene expression values (GEX, Level 2) are normalized to invariant gene set curves and then quantile normalized across each plate. Here, the data from each perturbagen treatment is referred to as a profile, experiment, or instance.
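
As an illustration of the second half of this step, the sketch below shows generic quantile normalization across the wells of one plate. It is not the cmapM implementation: the function name, the dictionary layout, and the tie handling are assumptions made for this example, and the invariant gene set scaling that precedes quantile normalization is omitted.

```python
from statistics import mean

def quantile_normalize(plate):
    """Generic quantile normalization (illustrative, not cmapM code).

    plate: dict mapping a well id to a list of expression values,
    with the same genes in the same order for every well.
    Returns a dict in which every well has the same value distribution.
    """
    wells = list(plate)
    n_genes = len(plate[wells[0]])

    # Reference distribution: the mean across wells of the value at each rank.
    sorted_wells = [sorted(plate[w]) for w in wells]
    reference = [mean(col[r] for col in sorted_wells) for r in range(n_genes)]

    normalized = {}
    for w in wells:
        values = plate[w]
        # Gene indices ordered from lowest to highest expression in this well
        # (ties are broken by gene order, a simplification).
        order = sorted(range(n_genes), key=lambda g: values[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized[w] = out
    return normalized

# Toy plate with two wells and three genes.
toy_plate = {"A01": [5.0, 2.0, 3.0], "A02": [4.0, 1.0, 6.0]}
toy_norm = quantile_normalize(toy_plate)
```

After normalization both toy wells contain the same set of values, and only the assignment of those values to genes differs between wells.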

Level 3b - INF - Values for 11,350 additional genes not directly measured in the L1000 assay are inferred from the normalized values for the 978 landmark genes.
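
To make the idea of inference concrete, the sketch below assumes that each inferred gene is modeled as a linear combination of landmark gene values. The text above does not specify the model, so the linear form, the function name, the gene labels, and the weights are all hypothetical and chosen only for illustration.

```python
def infer_gene(landmarks, weights, intercept=0.0):
    """Predict one non-measured gene from landmark values.

    landmarks: dict mapping landmark gene id -> normalized expression.
    weights:   dict mapping landmark gene id -> coefficient (hypothetical;
               a real model would carry one coefficient per landmark gene).
    Assumes a linear model, which is an assumption of this sketch.
    """
    return intercept + sum(weights[g] * landmarks[g] for g in weights)

# Toy example with two made-up landmark genes.
toy_landmarks = {"LM1": 8.0, "LM2": 10.0}
toy_weights = {"LM1": 0.5, "LM2": 0.25}
toy_inferred = infer_gene(toy_landmarks, toy_weights, intercept=1.0)
```

In a full-size setting the same call would be repeated for each of the 11,350 inferred genes, each with its own set of coefficients over the 978 landmarks.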

Level 4 - ZS - Z-scores for each gene based on Level 3 with respect to the entire plate population. This comparison of profiles to their appropriate population control generates a list of differentially expressed genes.
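
A minimal sketch of scoring one gene against the plate population follows. It assumes a robust z-score, centered on the plate median and scaled by the median absolute deviation (MAD). The text above says only "z-scores", so the robust variant, the function name, and the `min_scale` guard are assumptions of this example.

```python
from statistics import median

def robust_zscores(values, min_scale=0.1):
    """Z-score one gene across all wells of a plate (illustrative).

    values: that gene's normalized expression in every well on the plate.
    min_scale: hypothetical floor that avoids division by zero when the
               gene barely varies across the plate.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    # 1.4826 makes the MAD comparable to a standard deviation
    # for normally distributed data.
    scale = max(mad * 1.4826, min_scale)
    return [(v - center) / scale for v in values]

# Toy plate: four unremarkable wells and one strongly perturbed well.
toy_z = robust_zscores([1.0, 2.0, 3.0, 4.0, 100.0])
```

Because the median and MAD are insensitive to a few extreme wells, the perturbed well in the toy example receives a very large score while the remaining wells stay close to zero.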

Level 5 - MODZ - replicate-collapsed z-score vectors based on Level 4. Replicate collapse generates one differential expression vector, which we term a signature. Connectivity analyses are performed on signatures.
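
Replicate collapse can be sketched as a weighted average of the replicate z-score vectors. The example below assumes that each replicate is weighted by its mean Spearman correlation with the other replicates, so that a replicate that disagrees with the rest contributes less. The weighting scheme, the function names, and the `min_weight` floor are assumptions of this sketch and should not be read as the cmapM code.

```python
from math import sqrt

def _ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return _pearson(_ranks(x), _ranks(y))

def collapse_replicates(replicates, min_weight=0.01):
    """Collapse replicate z-score vectors into one signature (illustrative).

    replicates: list of equal-length lists of z-scores, one per replicate.
    min_weight: hypothetical floor so that no replicate gets a zero or
                negative weight.
    """
    n = len(replicates)
    if n == 1:
        return list(replicates[0])
    raw = []
    for i in range(n):
        others = [spearman(replicates[i], replicates[j])
                  for j in range(n) if j != i]
        raw.append(max(sum(others) / len(others), min_weight))
    total = sum(raw)
    weights = [w / total for w in raw]
    n_genes = len(replicates[0])
    return [sum(weights[r] * replicates[r][g] for r in range(n))
            for g in range(n_genes)]

# Two concordant replicates and one that partly disagrees.
toy_signature = collapse_replicates([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
])
```

Compared with a plain mean, this weighting pulls the signature toward the replicates that agree with each other.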

For levels 1 and 2, values are present for only the 978 landmark features. For levels 3-5, values are present for each of the 12,328 genes (978 landmark plus 11,350 inferred).

Summary:

In our experience, biological discovery is best achieved by using Level 5 data, and most users will likely want to focus their efforts there. The lower-level datasets are useful for investigating the data before and after deconvolution and normalization. Our methods for those preprocessing steps are certainly not perfect, and improvements are welcome. That being said, our overarching goal is to facilitate biological discovery, and our tools and apps in clue.io primarily support that use case.

Last modified: Tue Aug 27 2019 12:34:53 GMT-0400 (EDT)