LINCS-funded CMap L1000 data has been generated by the Connectivity Map group at the Broad Institute. All the LINCS data is deposited into GEO and available for download. An overview of the data levels and signature generation process is described in a separate article of Connectopedia. The code used for data processing is available at https://github.com/cmap/cmapM. While we provide the data from each level of processing, unless you are interested in pre-processing details, we suggest that you use Level 5 data for your analyses.
*A note about the difference between GEO terms: Series, Super Series, and subset Series*
A Series record links together a group of related samples and provides a focal point and description of the entire study. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Due to GEO technical limitations, for the CMap L1000 submission to GEO the samples have been broken up into multiple series (subset Series), which are linked together via a master or Super Series (GSE70138). The Super Series GSE70138 contains supplementary data files, one for each level of data; these contain the data and metadata for every sample in the series. The files are in GCTx format. Most importantly, if you simply want the matrices, ignore the sub-series and sub-projects.
General File and Data Conventions
The files available for download are either of the GCTx or TXT format and contain data matrices and metadata annotation.
|Matrix of numbers||GCTx||Binary format based on HDF5 that enables faster i/o than text. Code is available through the CMap Code page at clue.io/code.|
|Experimental Metadata||TXT||Information on perturbagens and cell lines that were profiled|
|Metrics||TXT||Statistics computed on signatures that reflect their characteristics, including reproducibility of profiles and the magnitude of gene expression changes|
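Because the matrices are distributed as gzipped GCTx files, and GCTx is built on HDF5, a quick sanity check after download is to decompress the file and confirm that it begins with the HDF5 signature. The sketch below uses only the Python standard library; the helper names `gunzip` and `looks_like_hdf5` are illustrative and not part of any CMap tool. It does not parse the matrix itself, which requires an HDF5-aware reader such as the code referenced at clue.io/code.

```python
import gzip
import shutil

# The 8-byte signature that begins an HDF5 file (when no user block is present).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def gunzip(src_path, dest_path):
    """Stream-decompress a .gz file to dest_path without loading it into memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

A file that fails this check after decompression was probably truncated or is not a GCTx file (for example, a text-based `.gct` file).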
The following table shows the filenames for the metadata and metrics data. The metadata is accurate to the best of our knowledge at the time of deposition. However, given the large size of the data and the many people, organizations, and processes involved, there will inevitably be errors or gaps. While we will do our best to fix any errors, accessing the metadata via the CLUE API (clue.io/api) is likely a more convenient and up-to-date option for subscribers.
Note that any missing metadata value is represented by -666, which indicates that the information is not available or not applicable.
|GSE***_Broad_LINCS_cell_info.txt.gz||Metadata describing cell lines used in perturbagen treatments|
|GSE***_Broad_LINCS_gene_info.txt.gz||Metadata describing measured and inferred genes|
|GSE***_Broad_LINCS_inst_info.txt.gz||Metadata pertaining to individual profiles (or instances, experiments)|
|GSE***_Broad_LINCS_pert_info.txt.gz||Metadata describing each perturbagen used in experiments|
|GSE***_Broad_LINCS_sig_info.txt.gz||Metadata for each signature in the Level 5 matrix|
|GSE***_Broad_LINCS_pert_metrics.txt.gz||Calculated / derived / inferred metrics and annotations associated with each perturbagen that was used in the experiments.|
|GSE***_Broad_LINCS_sig_metrics.txt.gz||Calculated / derived / inferred metrics and annotations associated with each signature in the Level 5 matrix|
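Since missing metadata values are encoded as -666, it is usually worth converting them to a proper null value when loading any of the TXT files above. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and treating the string `-666.0` as missing as well is an assumption, in case a numeric column was written as a float.

```python
import csv
import gzip

# Sentinel strings treated as "not available / not applicable".
# "-666" comes from the text above; "-666.0" is an assumed float rendering.
MISSING_VALUES = {"-666", "-666.0"}


def parse_metadata(handle):
    """Parse tab-delimited text into a list of dicts, mapping -666 to None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {key: (None if value in MISSING_VALUES else value) for key, value in row.items()}
        for row in reader
    ]


def read_metadata(path):
    """Read a gzipped tab-delimited metadata file (e.g. a *_pert_info.txt.gz file)."""
    with gzip.open(path, "rt", newline="") as handle:
        return parse_metadata(handle)
```

All values are returned as strings (or `None`); converting numeric columns is left to the caller, since the set of columns differs between files.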
Each dataset is described below, accompanied by a table or tables with information about the files.
GSE92742 (aka LINCS Phase I L1000 dataset)
LINCS Phase 1 data is in GEO series GSE92742. As this represents an earlier phase of LINCS, it will not be updated except for bug fixes, if any. Please note that an earlier version of this release incorrectly contained ~5k profiles that should not have been included, and they have been removed.
|GSE92742_Broad_LINCS_Level1_LXB_n1403502.tar.gz||Level 1 data (raw fluorescence measurements)||Gzipped tar|
|GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx.gz||Level 2 data for delta probes/features (raw gene expression / GEX)||Gzipped gctx|
|GSE92742_Broad_LINCS_Level2_GEX_epsilon_n1269922x978.gctx.gz||Level 2 data for epsilon probes/features (raw gene expression / GEX)||Gzipped gctx|
|GSE92742_Broad_LINCS_Level3_Q2NORM_n1319138x12328.gctx.gz||Level 3 data (normalized & inferred / INF)||Gzipped gctx|
|GSE92742_Broad_LINCS_Level4_ZSPC_n1319138x12328.gctx.gz||Level 4 data, plate-control normalized (robust z-scores / ZSPC)||Gzipped gctx|
|GSE92742_Broad_LINCS_Level5_MODZS_n473647x12328.gctx.gz||Level 5 data (moderated z-scores / MODZS)||Gzipped gctx|
|GSE92742_Broad_LINCS_cell_info.txt.gz||Metadata for each cell line that was used in the experiments||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_gene_info.txt.gz||Metadata for each measured feature / gene (metadata for rows of the data matrices)||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_gene_info_including_delta.txt.gz||Metadata for each measured feature / gene (metadata for rows of the data matrices) including the delta probes that were used in a small number of initial experiments||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_inst_info.txt.gz||Metadata for each experiment in the Levels 3-4 matrices (metadata for the columns in the Levels 3-4 data matrices)||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_pert_info.txt.gz||Metadata for each perturbagen that was used in the experiments||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_pert_metrics.txt.gz||Calculated / derived / inferred metrics and annotations associated with each perturbagen that was used in the experiments.||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_sig_info.txt.gz||Metadata for each signature in the Level 5 matrix (metadata for the columns in the Level 5 data matrix)||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_sig_metrics.txt.gz||Calculated / derived / inferred metrics and annotations associated with each signature in the Level 5 matrix||Gzipped tab-delimited text|
|GSE92742_Broad_LINCS_auxiliary_datasets.tar.gz||Auxiliary datasets used for supplementary calculations. Please see the table below for a description of the individual files included in the archive.||Gzipped tar of set of gctx|
|Inventory / hash codes|
|GSE92742_SHA512SUMS.txt.gz||Text file containing checksums calculated for each of the above files, for use in verifying integrity of downloaded files||Gzipped text|
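The checksum file can be used to confirm that large downloads arrived intact. The sketch below assumes the file follows the common `sha512sum` layout of one `HEXDIGEST  FILENAME` pair per line; that layout is an assumption, so inspect the file before relying on it. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse 'HEXDIGEST  FILENAME' lines (assumed layout) into {filename: digest}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # sha512sum marks binary mode with a leading '*' on the filename.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(directory, sums_text):
    """Return {filename: True, False, or None}; None means the file is not present."""
    results = {}
    for name, digest in parse_sums(sums_text).items():
        path = Path(directory) / name
        results[name] = (sha512_of(path) == digest) if path.exists() else None
    return results
```

Any file reported as `False` should be downloaded again before use.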
*Overview of the GSE92742 dataset.* This flow chart is a graphical explanation of the numbers of entries in each file at the different levels of data processing. An overview of the data levels and signature generation process is described in a separate article of Connectopedia.
For GSE92742, there is more data at level 1 than at the other levels because level 1 data includes samples that failed to pass QC. The level 2 data (GEX, which represents direct measurements) has been split into 2 files ("delta" and "epsilon"), based on which set of genes was measured directly; the total number of columns for these level 2 files is 49,216 + 1,269,922 = 1,319,138. That total (1,319,138) matches the number of columns for the level 3 (INF) and level 4 (ZSPC) data. As described below for GSE70138, the level 5 data (MODZS) is calculated by aggregating across the replicates within the level 4 data, and thus there are fewer columns: the level 5 data has about ⅓ as many columns (473,647) as the level 4 data.
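The column counts can be read directly from the filenames, which encode the matrix dimensions as `n{columns}x{rows}`. As a small sketch, the following parses the dimensions from the GSE92742 filenames listed in the table above and checks that the two level 2 files together account for the level 3 columns. The `dims` helper is illustrative only.

```python
import re

FILES = {
    "level2_delta": "GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx.gz",
    "level2_epsilon": "GSE92742_Broad_LINCS_Level2_GEX_epsilon_n1269922x978.gctx.gz",
    "level3": "GSE92742_Broad_LINCS_Level3_Q2NORM_n1319138x12328.gctx.gz",
    "level5": "GSE92742_Broad_LINCS_Level5_MODZS_n473647x12328.gctx.gz",
}


def dims(filename):
    """Return (columns, rows) parsed from the '_n{columns}x{rows}' part of a filename."""
    match = re.search(r"_n(\d+)x(\d+)", filename)
    if match is None:
        raise ValueError(f"no dimensions found in {filename!r}")
    return int(match.group(1)), int(match.group(2))


columns = {name: dims(filename)[0] for name, filename in FILES.items()}
level2_total = columns["level2_delta"] + columns["level2_epsilon"]

print(level2_total)                                     # 1319138
print(level2_total == columns["level3"])                # True
print(round(columns["level5"] / columns["level3"], 2))  # 0.36
```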
|Supplemental Datasets Contents of GSE92742_Broad_LINCS_auxiliary_datasets.tar.gz|
|Name||Brief description / use case|
|DS_GEO_n12031x22268.gctx||Collection of publicly available gene expression profiles on Affymetrix HGU133A arrays that was used to select landmark genes and train the inference model.
rows: 22,268 features columns: 12,031 samples
|DS_GEO_OLS_WEIGHTS_n979x21290.gctx||The matrix of weights learned by training the L1000 inference algorithm, ordinary least squares (OLS) linear regression, on DSGEO.
rows: 21,290 inferred features columns: 978 landmark genes + intercept = 979
|DS_CMAP_AFFXB01_n566x22268.gctx||Pilot Connectivity Map dataset of 455 gene expression profiles of human cell lines treated with 164 small molecule compounds that were used in simulations to determine the optimal number of landmarks.
rows: 22,268 features columns: 455 samples
|DS_REPRODUCIBILITY_n216x978.gctx||Samples of purified total RNA from six human cancer cell lines, purchased from Life Technologies, were subjected to L1000 profiling. The resulting L1000 expression profiles consist of 12 technical replicates for each of the six cell lines, run in three consecutive, independent LMA batches, yielding 36 replicate profiles per cell line and a total of 216 profiles (6 cell lines x 12 replicates x 3 batches). These profiles were used to assess the technical reproducibility of the L1000 assay.
rows: 978 landmark genes columns: 216 samples
|DS_KDLM_n955x978.gctx||We evaluated whether probes designed against individual target landmark genes worked in the multiplexed gene assay format. To assess the specificity of L1000 landmark probe measurements, we procured shRNAs from The RNAi Consortium (TRC) that targeted landmark genes and treated MCF7 and PC3 cell lines with these shRNAs. In the resulting dataset, each column is an individual shRNA targeting a landmark gene, applied in either the MCF7 or the PC3 cancer cell line, and the rows are z-scores of all measured landmark genes. For each gene in each sample, we computed differential expression values (z-scores) by comparing the gene's expression value in the given sample to that same gene's expression values in all other samples.
rows: 978 landmark genes columns: 955 samples
|DS_GTEX_RNASEQ_n8555x12320.gctx||Compendium of 8,555 RNA-seq samples obtained from the GTEx consortium (version 6). A subset of 3,176 of these samples were also profiled on L1000 and were used to compare the two platforms.
rows: 12,320 genes columns: 8,555 samples
|DS_GTEX_L1000_n3176x12320.gctx||From the GTEx version 6 collection, a subset of 3,176 samples were generously donated by the GTEx consortium for profiling in the L1000 assay.
rows: 970 landmark + 11,350 inferred = 12,320 genes columns: 3,176 samples
|DS_GTEX_RNASEQ_LMONLY_n8555x970.gctx||For convenience of use during assay validation and inference testing, we separated out from the overall GTEx RNA-seq dataset (DSGTEx-RNA-seq), the subset of landmark genes and refer to it as DSGTEx-RNA-seq-lmonly in the methods. Note that the GTEx pre-processing provided mappings for 970 of the landmark genes.
rows: 970 landmark genes columns: 8,555 samples
|DS_GTEX_RNASEQ_INF_n8555x12320.gctx||Inferred version of DSGTEx-RNA-seq dataset generated by applying DSGEO-OLS to DSGTEx-RNA-seq-lmonly.
rows: 970 landmark + 11,350 inferred = 12,320 genes columns: 8,555 samples
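To illustrate how a weights matrix such as DS_GEO_OLS_WEIGHTS is used, each inferred gene is predicted as a linear combination of the landmark gene values plus an intercept, which is why the matrix has 978 + 1 = 979 entries per inferred feature. The sketch below applies one such row of weights in plain Python. Placing the intercept first in the row is an assumption made for illustration (check the matrix annotations for the actual layout), and the numbers are toy values, not real weights.

```python
def infer_gene(weights, landmark_values):
    """Predict one inferred gene from landmark values using OLS weights.

    weights: the intercept followed by one coefficient per landmark gene
             (intercept-first ordering is an assumption of this sketch).
    landmark_values: expression values of the landmark genes, in the same order.
    """
    intercept, coefficients = weights[0], weights[1:]
    if len(coefficients) != len(landmark_values):
        raise ValueError("expected one coefficient per landmark gene")
    return intercept + sum(w * x for w, x in zip(coefficients, landmark_values))


# Toy example with 3 landmarks instead of 978.
toy_weights = [0.5, 1.0, -2.0, 0.25]   # intercept, then 3 coefficients
toy_landmarks = [2.0, 1.0, 4.0]
print(infer_gene(toy_weights, toy_landmarks))  # 1.5
```

Applying every row of the weights matrix to a sample's landmark values yields the full set of inferred genes for that sample; in practice this is done as a single matrix multiplication.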
GSE70138 (aka LINCS Phase II L1000 dataset)
LINCS Phase 2 data is in GEO series GSE70138. This series will be updated every 6 months as more L1000 data is produced and QC'ed over the duration of the LINCS program (2016 through 2020). The same data will also become available through the LINCS DCIC portal.
|GSE70138_Broad_LINCS_Level1_LXB_n345976.tar.gz||Level 1 data (raw fluorescence intensity measurements / LXB)||Gzipped tar of directory containing binary lxb files|
|GSE70138_Broad_LINCS_Level2_GEX_n345976x978.gctx.gz||Level 2 data (raw gene expression / GEX)||GCTX|
|GSE70138_Broad_LINCS_Level3_INF_mlr12k_n78980x22268_2015-06-30.gct.gz||Level 3 data (normalized & inferred / INF)||GCTX|
|GSE70138_Broad_LINCS_Level4_ZSPCINF_mlr12k_n345976x12328.gctx.gz||Latest Level 4 data (robust z-scores / ZSPC)||GCTX|
|GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328.gctx.gz||Latest Level 5 data (signatures from aggregating replicates)||GCTX|
|GSE70138_Broad_LINCS_inst_info.txt.gz||Metadata for individual experiments (levels 1-4)||Gzipped tab-delimited text|
|GSE70138_Broad_LINCS_sig_info.txt.gz||Metadata for signatures of aggregated replicates (level 5)||Gzipped tab-delimited text|
|GSE70138_Broad_LINCS_gene_info.txt.gz||Metadata for rows / genes of matrices||Gzipped tab-delimited text|
|GSE70138_Broad_LINCS_gene_info_delaprime.txt.gz||Metadata for rows / genes of matrices, applies to all profiles generated using the delta prime probe pool. See FAQ for details||Gzipped tab-delimited text|
|GSE70138_SHA512SUMS||Text file containing checksums calculated for each of the most recent above files, for use in verifying integrity of downloaded files||Gzipped text|
For GSE70138, all data levels have the same number of columns (345,976) except for level 5 (MODZS), which has 118,050 columns. The difference arises because the level 5 data is calculated by aggregating across individual replicates to generate a single signature for each group of replicates (generally 3 per experiment).
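As a quick arithmetic check of the "generally 3 per experiment" figure, dividing the number of replicate-level columns by the number of level 5 signatures gives the average number of replicates per signature:

```python
level4_columns = 345_976   # individual profiles (replicate level)
level5_columns = 118_050   # signatures after aggregating replicates

replicates_per_signature = level4_columns / level5_columns
print(round(replicates_per_signature, 2))  # 2.93
```

The average is slightly below 3 because this is a mean over all signatures; the exact replicate count for any given signature has to be taken from the metadata.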
Note: We release data before full analysis and before publication, with the expectation that the data will be useful to others even in its early form. We therefore suggest the following best practices when using released data:
Check back here or contact us before you publish to see if any details have been updated.
If you notice something odd, inconsistent, or unexpected in the data, please email us with details.
Over the course of the LINCS initiative, L1000 data generated as part of LINCS will be released to GEO twice a year. L1000 data generated from other sources will be released into the public domain as funding sources permit.
Data releases are appended to the GEO series object, allowing you to view earlier releases in addition to the current one. Each release is a full package (i.e., earlier releases are not overwritten), enabling researchers using the data for the first time to get the latest release, while at the same time providing a persistently addressable source for earlier datasets. Thus, please download the data labelled with the most recent date. The figure below shows the most recent files, circled in green, for the GSE70138 dataset.
GSE106127 (aka RNAi and CRISPR datasets)
RNAi and CRISPR datasets are in GEO series GSE106127. Note that this data is not new; it is subsetted from the data in GSE92742 and GSE70138 and is the portion of that data corresponding to genetic perturbational signatures from shRNA and CRISPR reagents.
|GSE106127_level_4_zspc_n341336x978.gctx.gz||Level 4 data, plate-control normalized (robust z-scores / ZSPC)||Gzipped gctx|
|GSE106127_level_5_modz_n119013x978.gctx.gz||Level 5 data (moderated z-scores / MODZS)||Gzipped gctx|
|GSE106127_CGS_n33839x978.gctx.gz||Consensus gene signatures generated by combining level 5 signatures of individual shRNAs||Gzipped gctx|
|GSE106127_level_4_PRIME_zspc_n341336x978.gctx.gz||Level 4 data with the global first principal component (PC1) removed||Gzipped gctx|
|GSE106127_level_5_PRIME_modz_n119013x978.gctx.gz||Level 5 data with the global first principal component (PC1) removed||Gzipped gctx|
|GSE106127_CGS_PRIME_n33839x978.gctx.gz||CGS data with the global first principal component (PC1) removed||Gzipped gctx|
|GSE106127_pc_coeff_global_n978x978.gctx.gz||The principal component loadings derived by running PCA on the entire CMap level 5 data matrix (~470k signatures)||Gzipped gctx|
|GSE106127_inst_info.txt.gz||Metadata for each experiment in the Level 4 matrices (metadata for the columns in the Level 4 data matrices)||Gzipped tab-delimited text|
|GSE106127_sig_info.txt.gz||Metadata for each signature in the Level 5 matrices (metadata for the columns in the Level 5 data matrices)||Gzipped tab-delimited text|
|GSE106127_sig_metrics.txt.gz||Calculated / derived / inferred metrics and annotations associated with each signature in the Level 5 matrices||Gzipped tab-delimited text|
|GSE106127_CGS_meta.txt.gz||Metadata for each CGS signature in the CGS matrix (metadata for the columns in the CGS data matrix)||Gzipped tab-delimited text|
|GSE106127_CGS_PRIME_meta.txt.gz||Metadata for each CGS signature in the CGS PRIME matrix (metadata for the columns in the CGS PRIME data matrix)||Gzipped tab-delimited text|
|GSE106127_gene_info.txt.gz||Metadata for each measured feature / gene (metadata for rows of the data matrices)||Gzipped tab-delimited text|
|GSE106127_SHA512SUMS.txt.gz||Text file containing checksums calculated for each of the above files, for use in verifying integrity of downloaded files||Gzipped tab-delimited text|
GSE92743 (aka CMap-HBS Contest)
Additionally, CMap / LINCS organizes datasets into discrete bundles that address particular questions in the form of contests, which are archived at GSE92743. The goal is both to engage the wider computational community in improving analytics and to provide LINCS users with well-organized datasets and benchmarks that are easier to use.
|GSE92743_Broad_Affymetrix_training_Level3_Q2NORM_n100000x12320.gctx.gz||Affymetrix data for 100,000 samples used by contestants for building their models||Gzipped gctx|
|GSE92743_Broad_GTEx_L1000_Level3_Q2NORM_n3176x12320.gctx.gz||All of the level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples||Gzipped gctx|
|GSE92743_Broad_GTEx_L1000_Holdout_Level3_Q2NORM_n1000x12320.gctx.gz||Just the holdout level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples||Gzipped gctx|
|GSE92743_Broad_GTEx_L1000_NotUsed_Level3_Q2NORM_n1526x12320.gctx.gz||Just the unused level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples||Gzipped gctx|
|GSE92743_Broad_GTEx_L1000_Test_Level3_Q2NORM_n650x12320.gctx.gz||Just the test level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples||Gzipped gctx|
|GSE92743_Broad_GTEx_RNAseq_Log2RPKM_q2norm_n3176x12320.gctx.gz||Level 3 data (normalized) of RNA-seq measurements on GTEx samples||Gzipped gctx|
|GSE92743_Broad_GTEx_gene_info.txt||Metadata for each measured feature / gene (metadata for rows of the data matrices)||tab-delimited text|
|GSE92743_Broad_GTEx_inst_info.txt||Metadata for each experiment in the Levels 3-4 matrices (metadata for the columns in the Levels 3-4 data matrices)||tab-delimited text|
|GSE92743_Broad_OLS_WEIGHTS_n979x11350.gctx.gz||Matrix of weights used in current CMap L1000 inference model||Gzipped gctx|
|GSE92743_SHA512SUMS||Text file containing checksums calculated for each of the above files, for use in verifying integrity of downloaded files||text|
Explanations of the metadata and metrics data column headers can be found in the Glossary.