A dimensionality reduction technique particularly well suited for visualizing data. (For references, see https://lvdmaaten.github.io/tsne)

The parameters that were used for running t-SNE here are: 50 initial dimensions, perplexity of 30, and theta of 0.5. For datasets with <= 5000 samples, the standard t-SNE algorithm is used. For larger datasets, the Barnes-Hut algorithm is employed.

A dimensionality reduction technique in which the two principal components are chosen to have the largest possible variance.

To analyze relationships between perturbations, we utilize the framework of connectivity. A connectivity score between two perturbations quantifies the similarity of the cellular responses evoked by these perturbations. A score of 1 means that these two perturbations are more similar to each other than 100% of other perturbation pairs. A score of -1 means that these two perturbations are more dissimilar to each other than 100% of other perturbation pairs.

See a heatmap of connections between individual perturbagens in cell lines and all other perturbagens used for the P100 assay or the GCP assay. The tutorial describes the features of the heatmap.

Bring data, in GCT format, from your own P100 or GCP studies to query against our datasets.

Introspect means querying your dataset against itself. Make sure to "Include Introspect" if you would like to see connections within your dataset (in addition to connections between your dataset and Touchstone-P).

In computing connectivity, biological or technical replicates can be aggregated together. Please select which metadata fields should be used to recognize replicates. For example, if you wish to distinguish between different doses of the same compound, make sure to select "pert_dose" (or something similar) as one of the metadata fields by which to group replicates. The possible metadata fields by which to group replicates only appear after you have uploaded your GCT and selected "Yes" for "Are there replicates in your data?".


Matched mode: When running GUTC, incorporates cell-line information to match query data against matching cell types in Touchstone. Currently this includes the following 9 cell types: A375, A549, HEPG2, HCC515, HA1E, HT29, MCF7, PC3, VCAP.
Unmatched mode (recommended): When running GUTC, does not incorporate cell-line information when querying the data against Touchstone signatures.


L-Build ("Light" Build):  All levels of L1000 data up to aggregated signatures.
Full Build:  All levels of L1000 data up to aggregated signatures, as well as all relevant additional analyses of the data (Introspect, t-SNE, PCA, etc.).

When querying Touchstone, Feature Space determines what set of genes to query against. When perturbagens are profiled on the L1000 platform, Landmark is recommended. When the queries you wish to use are not landmarks, use BING instead.

Root location within a brew folder that contains the instance matrices and the brew_group folder. Default is brew/pc

List of expected treatment doses in micromolar as a listmaker list. If provided, dose discretization is applied to the pert_dose metadata field to generate a canonicalized pert_idose field. Note this assumes that the pert_dose annotations are in micromolar.

Generates TAS plots and a connectivity heatmap of preliminary calibration plates to identify the most suitable experimental conditions for the specified parameters. The tool should be run on small pilot experiments that vary experimental parameters such as seeding density and time point. Plots can also be decoupled by parameters such as cell id.

Column filter to sig_build_tool as a listmaker collection

The name of the build used when generating all associated files and folders (e.g. <BUILD_CODE>_metadata). For this reason, the code must be filename compatible.

When merging replicates for L1000, several versions of the merged data are made. This parameter determines which version to use when creating your build. by_rna_well is the default and is recommended.

All data is from the Cancer Cell Line Encyclopedia resource. Expression data was released 15-Aug-2017, copy number data is dated 27-May-2014, and mutational data is dated 15-Aug-2017.


Feature Mapping: Ensembl Ids from the source data were mapped to Entrez Gene Ids using gene annotations from NCBI (downloaded on 02-Mar-2016).
Normalization:  RNAseq RPKM values were log2 transformed using log2(max(RPKM, eps)). The data were then normalized such that the expression values were comparable across cell lines, by minimizing technical variation and equalizing their distributions (for details of the normalization, see LISS and QNORM entries in the Connectopedia glossary). Post-normalization, the expression values range between 4 and 15 log2 units, with 4 indicating that a gene is minimally or not expressed and 15 indicating the maximum readout.
Z-scores: The number of standard deviations that a gene is above or below the population mean is called its z-score. The "robust" z-score is resistant to outliers by using median instead of mean and median absolute deviation (MAD) instead of standard deviation. The reference population used to compute the median and MAD for a particular gene is all CCLE lines with data for that gene.
Z-scores Within Primary Site: Similar to z-scores, but the reference population used to compute the median and MAD is all CCLE lines from the same lineage with data for that gene.

All scores indicated are in log 2 ratios to reference, binned using the heuristics described in CNVkit.

Deletion:  score < -1.1
Loss:  -1.1 ≤ score ≤ -0.25
No change:  -0.25 < score < +0.2
Gain: +0.2 ≤ score < +0.7
Amplification: +0.7 ≤ score
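
The binning above can be sketched as a small function. This is a minimal illustration that assumes only the thresholds listed here; the function name is hypothetical and is not part of CNVkit or any CMap tool.

```python
def classify_copy_number(score: float) -> str:
    """Bin a log2 copy-number ratio using the thresholds listed above."""
    if score < -1.1:
        return "Deletion"
    if score <= -0.25:
        return "Loss"
    if score < 0.2:
        return "No change"
    if score < 0.7:
        return "Gain"
    return "Amplification"


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 0.3, 1.2):
        print(value, classify_copy_number(value))
```

Note that the boundary values -1.1 and -0.25 both fall in the Loss bin, while +0.2 and +0.7 fall in Gain and Amplification respectively, as in the list above.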

Access a suite of analysis apps by clicking on the menu (or type command-K to open)

Switch between running a single query and running a batch query.

Give each query a descriptive name that will help you identify your results.

Tip: Each list can have a different number of genes; in fact, you can run a query with only one list (up OR down).

Your query will take about 5 minutes to process; check the History section in the Menu for your results!

Valid genes used in the query have HUGO symbols or Entrez IDs and are well-inferred or directly measured by L1000 (member of the BING gene set). Valid genes not used in a query are those that have a valid HUGO or Entrez identifier but are not part of the BING set. Invalid genes do not have HUGO or Entrez IDs.

The sig_fastgutc_tool is a reimplementation of our query algorithm that enables faster query results, especially at larger batch sizes. It is the result of a crowd-sourced contest and is currently in beta mode.

Filter datasets by category to see only those of interest.

Data Icons identify published and proprietary datasets.

Click on a row to see a summary of that dataset, including cell lines and treatment conditions, assay type, and dates.

Arrange the table to display the information most important for your work, and add key datasets to favorites.

View details about the collection as a whole and about individual compounds.

View subsets of compounds based on mechanism, drug target, or known disease application.

Purity is assessed by ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) of compounds after receipt from the vendor.

Status as of publication of this resource (March 2017). We will be updating this but let us know if you notice a discrepancy.

Click on a compound to see details about its structure, mechanism, targets, approval status, and vendor.

Mouse over this graphic to see the classes of proteins targeted by drugs in the hub.

This is the current count of perturbagens in the reference (touchstone) dataset.

Select data from perturbagens grouped by their MoA or role in the cell.

Choose a perturbagen type, or view them all.

Touchstone is our reference dataset, made from well-annotated perturbagens profiled in a core set of 9 cell lines.

Detailed List is unavailable for Touchstone v1.1.1.1. A new data visualization approach is in development, but to get results in a table format (similar to Detailed View), please click on Heat Map and download the dataset as a GCT file that can be viewed in Excel or similar apps. Please see here for a detailed explanation.

Articles are tagged with topics. Click on a topic tag to see all related articles.

Look it up! A quick reference guide of CMap terms and their meanings.

Email us with your questions.

Click on the heading to read all the articles in this section on a single page, or open each article separately.

Click on a heading to open a menu of articles.

Each article is tagged with key words that describe its content.

Underlined words link to their definition in the CMap glossary.

Your feedback helps us make Connectopedia more useful.

Average transcriptional impact

TAS is a metric that incorporates the signature strength (the number of significantly differentially expressed transcripts) and signature concordance (the reproducibility of those changes across biological replicates) to capture activity of a compound. The score is computed as the geometric mean of the signature strength and the 75th quantile of pairwise replicate correlations for a given signature. Prior to computing the geometric mean, the signature strength is multiplied by the square root of the number of replicates. This serves to mitigate score shrinkage with increasing replicate number and allows TAS values derived from signatures of different numbers of replicates to be compared with each other.
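
As a rough sketch of the calculation described above, the following follows the prose literally. It is not the production implementation: the function name, the quantile interpolation method, and the clipping of negative correlations to zero are all assumptions, and any further normalization that places the final score on a 0 to 1 scale is not described here and is therefore not included.

```python
import math
import statistics


def tas_sketch(signature_strength, replicate_correlations, n_replicates):
    """Illustrative TAS-style score following the prose description above."""
    if len(replicate_correlations) == 1:
        # Two replicates give a single pairwise correlation.
        q75 = replicate_correlations[0]
    else:
        # 75th quantile of pairwise correlations (interpolation is an assumption).
        q75 = statistics.quantiles(
            replicate_correlations, n=4, method="inclusive"
        )[2]
    # Scale signature strength by sqrt(number of replicates) first.
    scaled_strength = signature_strength * math.sqrt(n_replicates)
    # Assumption: clip negative correlations to zero so the root is defined.
    return math.sqrt(scaled_strength * max(q75, 0.0))
```

The geometric mean of two quantities is the square root of their product, which is why the last line takes a single square root.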

Signature diversity

Thick black bars signify Transcriptional Activity Scores greater than or equal to 0.5; thinner black bars denote scores less than 0.5. Absence of a bar means no data available. Colored lines (chords) signify similar connectivity scores between cell lines; red for positive connectivity scores of 80-100 (pale to intense color according to the score); blue for negative connectivity. Chords are only shown when TAS scores are > 0.5; thus absence of a chord either means that the perturbagen TAS score is very low, or that no data is available. Chords for individual cell lines can be isolated from the rest of the figure by hovering over the cell line name.

Baseline expression of this gene in each cell line is represented as a z-score (top numbers). Scores were calculated using the robust z-score formula:

z-score_i = (x_i - median(X)) / (MAD(X) * 1.4826),

where:

x_i is the expression value of a given gene in the i-th cell line

X = [x_1, x_2, ..., x_n] is the vector of expression values for a given gene across n cell lines

MAD(X) is the median absolute deviation of X

1.4826 is a constant that rescales the score as if the standard deviation of X had been used instead of the MAD
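
The formula can be written out in a few lines of Python. This is a minimal sketch using only the standard library; the function name is hypothetical, and raising an error when the MAD is zero is an assumption about how to handle that edge case.

```python
import statistics

MAD_SCALE = 1.4826  # rescales the MAD to be comparable to a standard deviation


def robust_zscores(values):
    """Return the robust z-score of each value for one gene across cell lines."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption: treat a zero MAD as undefined rather than dividing by zero.
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [(v - med) / (mad * MAD_SCALE) for v in values]
```

Because the median and MAD are barely affected by a single extreme value, an outlier cell line receives a large score without distorting the scores of the other lines.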

Median and MAD expression values were calculated using RNA-Seq profiles from a total of 1022 cell lines, comprising data from the Cancer Cell Line Encyclopedia (CCLE; Barretina, et al.) and cell lines nominated by the CMap team. Plots show z-score values only for the core LINCS lines used by CMap in L1000 experiments. Light red or light blue regions indicate positive or negative outlier expression, respectively, of the gene relative to the other lines shown; z-score of a positive outlier in the corresponding cell line is in dark red and a negative outlier is in dark blue.

Summary class connectivity shows a boxplot that summarizes the connectivity of a class. Each data point, shown as a light gray dot, represents the median value of connectivity of one member to the other class members. (This corresponds to the median for each row, excluding the main diagonal, in the heatmap shown below.) The box is the distribution of those data points, where the box boundary represents the interquartile range, the vertical line within the box is the median, and the whiskers reflect the minimum and maximum values of the data (exclusive of extreme outliers, which may appear beyond the whiskers).

Connectivity between members of class is a standard heat map of the connectivity scores, summarized across cell lines, between members of the class, where dark red represents the highest positive scores and deep blue the highest negative scores. Individual scores are revealed to the left below the map by hovering over each cell of the map.

Class inter-cell line connectivity is a plot of the median (black line) and Q25-Q75 connectivity scores (blue area around black line) for each cell line as well as the summary scores across cell lines. In some cases perturbations have not been tested in every cell line; the absence of data is indicated by a “0” for that cell line. The example shown reveals that these estrogen agonists show the strongest connectivity to each other in MCF7, a human breast cancer cell line that expresses the estrogen receptor.

Profile status

Colored portion of top bar indicates the Broad assays in which this compound has been profiled.

L1000 cell/dose coverage

For compounds profiled by L1000, the cell lines and dose ranges for which signatures are available are indicated by dark gray bars (a lighter gray bar indicates that no data is available for that cell line/dose combination). A bar displayed one row above the 10 uM row indicates that doses higher than 10 uM were tested. The 6 rows correspond to 6 canonical doses: 20 nM, 100 nM, 500 nM, 1 uM, 2.5 uM, and 10 uM. (In some cases non-canonical doses were tested; these are rounded to the nearest canonical dose for the purpose of this display. For example, if the dose tested was 3.33 uM, the 2.5 uM bar is shown in dark gray here.)
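
The rounding to a canonical dose can be sketched as follows. This is only an illustration: the function name is hypothetical, and measuring "nearest" on a log scale is an assumption (the text does not say whether distance is taken on a linear or log scale). Handling of doses above 10 uM, which are displayed in a separate row, is not covered.

```python
import math

# Canonical doses from the text, expressed in micromolar.
CANONICAL_DOSES_UM = [0.02, 0.1, 0.5, 1.0, 2.5, 10.0]


def nearest_canonical_dose(dose_um):
    """Return the canonical dose closest to dose_um on a log scale (assumed)."""
    if dose_um <= 0:
        raise ValueError("dose must be positive")
    return min(CANONICAL_DOSES_UM, key=lambda c: abs(math.log(dose_um / c)))
```

For the example in the text, a tested dose of 3.33 uM maps to the 2.5 uM row under either a linear or a log-scale definition of distance.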


Guide to LINCS data release into NCBI GEO - L1000


LINCS-funded CMap L1000 data has been generated by the Connectivity Map group at the Broad Institute. All the LINCS data is deposited into GEO and available for download. An overview of the data levels and signature generation process is described in a separate article of Connectopedia. The code used for data processing is available at https://github.com/cmap/cmapM. While we provide the data from each level of processing, unless you are interested in pre-processing details, we suggest that you use Level 5 data for your analyses.

A note about the difference between GEO terms: Series, Super Series, and subset Series:

A Series record links together a group of related samples and provides a focal point and description of the entire study. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Due to GEO technical limitations, for the CMap L1000 submission to GEO the samples have been broken up into multiple series (subset Series), which are linked together via a master or Super Series (GSE70138). The Super Series GSE70138 contains supplementary data files, one for each level of data; these contain the data and metadata for every sample in the series. The files are in GCTx format. Most importantly, if you simply want the matrices, ignore the sub-series and sub-projects.

General File and Data Conventions

The files available for download are either of the GCTx or TXT format and contain data matrices and metadata annotation.

Type | Format | Notes
Matrix of numbers | GCTx | Binary format based on HDF5 that enables faster I/O than text. Code is available through the CMap Code page at clue.io/code.
Experimental Metadata | TXT | Information on perturbagens and cell lines that were profiled
Metrics | TXT | Statistics computed on signatures that reflect their characteristics, including reproducibility of profiles and the magnitude of gene expression changes

The following table shows the filenames for the metadata and metrics data. The metadata is accurate to the best of our knowledge at the time of deposition. However, given the large size of the data and the many people, organizations, and processes involved, there will inevitably be errors or holes. While we will do our best to fix any errors, accessing via the CLUE API (clue.io/api) is likely a more convenient and up-to-date mode for subscribers. If you notice an error, please email us at clue@broadinstitute.org.

Note that any missing metadata value is represented by -666, which indicates that the information is not available or not applicable.

File | Description
GSE*_Broad_LINCS_cell_info.txt.gz | Metadata describing cell lines used in perturbagen treatments
GSE*_Broad_LINCS_gene_info.txt.gz | Metadata describing measured and inferred genes
GSE*_Broad_LINCS_inst_info.txt.gz | Metadata pertaining to individual profiles (or instances, experiments)
GSE*_Broad_LINCS_pert_info.txt.gz | Metadata describing each perturbagen used in experiments
GSE*_Broad_LINCS_sig_info.txt.gz | Metadata for each signature in the Level 5 matrix
GSE*_Broad_LINCS_pert_metrics.txt.gz | Calculated / derived / inferred metrics and annotations associated with each perturbagen that was used in the experiments
GSE*_Broad_LINCS_sig_metrics.txt.gz | Calculated / derived / inferred metrics and annotations associated with each signature in the Level 5 matrix
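
When loading these metadata files, it can help to convert the -666 placeholder into a true missing value up front. The sketch below uses only the Python standard library and assumes the files are gzipped, tab-delimited text with a header row; the function name is hypothetical, and it matches only the exact string "-666" (other renderings such as "-666.0" would need to be added if they occur).

```python
import csv
import gzip

MISSING_VALUE = "-666"  # placeholder for unavailable or not-applicable metadata


def read_metadata(path):
    """Read a gzipped tab-delimited metadata file into a list of dicts,
    replacing the -666 placeholder with None."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            {key: (None if value == MISSING_VALUE else value)
             for key, value in row.items()}
            for row in reader
        ]
```

All values are returned as strings; converting columns such as doses to numbers is left to the caller, since the appropriate type differs by field.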

Datasets

Each dataset is described below, accompanied by a table or tables with information about the files.

GSE92742 (aka LINCS Phase I L1000 dataset)

LINCS Phase 1 data is in GEO series GSE92742. As this represents an earlier phase of LINCS, it will not be updated except for bug fixes, if any. Please note that an earlier version of this release incorrectly contained ~5k profiles that should not have been included, and they have been removed.

Filename description File type
GSE92742_Broad_LINCS_Level1_LXB_n1403502.tar.gz Level 1 data (raw fluorescence measurements) Gzipped tar
GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx.gz Level 2 data for delta probes/features (raw gene expression / GEX) Gzipped gctx
GSE92742_Broad_LINCS_Level2_GEX_epsilon_n1269922x978.gctx.gz Level 2 data for epsilon probes/features (raw gene expression / GEX) Gzipped gctx
GSE92742_Broad_LINCS_Level3_Q2NORM_n1319138x12328.gctx.gz Level 3 data (normalized & inferred / INF) Gzipped gctx
GSE92742_Broad_LINCS_Level4_ZSPC_n1319138x12328.gctx.gz Level 4 data, plate-control normalized (robust z-scores / ZSPC) Gzipped gctx
GSE92742_Broad_LINCS_Level5_MODZS_n473647x12328.gctx.gz Level 5 data (moderated z-scores / MODZS) Gzipped gctx
Experimental metadata
GSE92742_Broad_LINCS_cell_info.txt.gz Metadata for each cell line that was used in the experiments Gzipped tab-delimited text
GSE92742_Broad_LINCS_gene_info.txt.gz Metadata for each measured feature / gene (metadata for rows of the data matrices) Gzipped tab-delimited text
GSE92742_Broad_LINCS_gene_info_including_delta.txt.gz Metadata for each measured feature / gene (metadata for rows of the data matrices) including the delta probes that were used in a small number of initial experiments Gzipped tab-delimited text
GSE92742_Broad_LINCS_inst_info.txt.gz Metadata for each experiment in the Levels 3-4 matrices (metadata for the columns in the Levels 3-4 data matrices) Gzipped tab-delimited text
GSE92742_Broad_LINCS_pert_info.txt.gz Metadata for each perturbagen that was used in the experiments Gzipped tab-delimited text
GSE92742_Broad_LINCS_pert_metrics.txt.gz Calculated / derived / inferred metrics and annotations associated with each perturbagen that was used in the experiments. Gzipped tab-delimited text
GSE92742_Broad_LINCS_sig_info.txt.gz Metadata for each signature in the Level 5 matrix (metadata for the columns in the Level 5 data matrix) Gzipped tab-delimited text
GSE92742_Broad_LINCS_sig_metrics.txt.gz Calculated / derived / inferred metrics and annotations associated with each signature in the Level 5 matrix Gzipped tab-delimited text
Supplemental data
GSE92742_Broad_LINCS_auxiliary_datasets.tar.gz Auxiliary datasets used for supplementary calculations. Please see the table below for a description of the individual files included in the archive. Gzipped tar of set of gctx
Inventory / hash codes
GSE92742_SHA512SUMS.txt.gz Text file containing checksums calculated for each of the above files, for use in verifying integrity of downloaded files Gzipped text
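
A downloaded file can be checked against the checksum file along these lines. This is a sketch under stated assumptions: it assumes the checksum file, once decompressed, uses the common sha512sum layout of one "<hex digest>  <filename>" pair per line, which is not confirmed here; both function names are hypothetical.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse sha512sum-style lines (assumed layout) into {filename: digest}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # A leading '*' marks binary mode in sha512sum output.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected
```

A file passes when sha512_of_file(path) equals the digest that parse_checksums returns for its filename. Reading in chunks keeps memory use low for the multi-gigabyte matrices.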

Overview of the GSE92742 dataset. This flow chart is a graphical explanation of the numbers of entries in each file at the different levels of data processing. An overview of the data levels and signature generation process is described in a separate article of Connectopedia.

For GSE92742, there is more data at level 1 than the other levels because level 1 data includes samples that failed to pass QC. The level 2 data (GEX, which represents direct measurements) has been split into 2 files ("delta" and "epsilon"), based on which set of genes was measured directly; the total number of columns for these level 2 files, as given in the filenames above, is 49,216 + 1,269,922 = 1,319,138. That total (1,319,138) matches the number of columns for the level 3 (INF) and level 4 (ZSPC) data. As described below for GSE70138, the level 5 data (MODZS) is calculated by aggregating across the replicates within the level 4 data, and thus there are fewer columns: about ⅓ as many in the level 5 data as in the level 4 data.
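
The column counts can be read directly from the filenames in the table above, which appear to follow an n<columns>x<rows> convention (inferred from the tables in this article, not from a formal specification). A minimal sketch with a hypothetical helper name:

```python
import re

# Matches the "_n<columns>x<rows>" fragment embedded in matrix filenames.
_DIMS = re.compile(r"_n(\d+)x(\d+)")


def matrix_dims(filename):
    """Return (columns, rows) parsed from a filename, or None if absent."""
    match = _DIMS.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


delta = matrix_dims("GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx.gz")
epsilon = matrix_dims("GSE92742_Broad_LINCS_Level2_GEX_epsilon_n1269922x978.gctx.gz")
level3 = matrix_dims("GSE92742_Broad_LINCS_Level3_Q2NORM_n1319138x12328.gctx.gz")
level5 = matrix_dims("GSE92742_Broad_LINCS_Level5_MODZS_n473647x12328.gctx.gz")

# The two level 2 files together should cover the level 3 columns.
level2_matches_level3 = delta[0] + epsilon[0] == level3[0]
# Fraction of level 4 (same column count as level 3) columns remaining at level 5.
level5_fraction = level5[0] / level3[0]
```

Filenames without both dimensions, such as the Level 1 archive, return None rather than a partial result.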

Supplemental Datasets Contents of GSE92742_Broad_LINCS_auxiliary_datasets.tar.gz
Name Brief description / use case
DS_GEO_n12031x22268.gctx Collection of publicly available gene expression profiles on Affymetrix HGU133A arrays that was used to select landmark genes and train the inference model. rows: 22,268 features columns: 12,031 samples
DS_GEO_OLS_WEIGHTS_n979x21290.gctx The matrix of weights learned by training the L1000 inference algorithm, ordinary least squares (OLS) linear regression, on DSGEO. rows: 21,290 inferred features columns: 978 landmark genes + intercept = 979
DS_CMAP_AFFXB01_n566x22268.gctx Pilot Connectivity Map dataset of 455 gene expression profiles of human cell lines treated with 164 small molecule compounds that were used in simulations to determine the optimal number of landmarks. rows: 22,268 features columns: 455 samples
DS_REPRODUCIBILITY_n216x978.gctx Samples of purified total RNA from six human cancer cell lines, purchased from Life Technologies, were subjected to L1000 profiling. L1000 expression profiles were generated consisting of 12 technical replicates for each of the six cell lines, all done in three consecutive, independent LMA batches, yielding 36 replicate profiles per cell line and a total of 216 total profiles (6 cell line x 12 replicates x 3 batches). These profiles were used to assess the technical reproducibility of the L1000 assay. rows: 978 landmark genes columns: 216 samples
DS_KDLM_n955x978.gctx We evaluated whether probes designed against individual target landmark genes worked in the multiplexed gene assay format. To assess the specificity of L1000 landmark probe measurements, we procured shRNAs from The RNAi Consortium (TRC) that targeted landmark genes and treated MCF7 and PC3 cell lines with these shRNAs. Each column of the resulting dataset is an individual shRNA targeting a landmark gene, profiled in either the MCF7 or the PC3 cancer cell line. Rows are z-scores of all measured landmark genes. For each gene in each sample, we computed differential expression values (z-scores) by comparing the gene's expression value in the given sample to that same gene's expression values in all other samples. rows: 978 landmark genes columns: 955 samples
DS_GTEX_RNASEQ_n8555x12320.gctx Compendium of 8,555 RNA-seq samples obtained from the GTEx consortium (version 6). A subset of 3,176 of these samples were also profiled on L1000 and were used to compare the two platforms. rows: 12,320 genes columns: 8,555 samples
DS_GTEX_L1000_n3176x12320.gctx From the GTEx version 6 collection, a subset of 3,176 samples were generously donated by the GTEx consortium for profiling in the L1000 assay. rows: 970 landmark + 11,350 inferred = 12,320 genes columns: 3,176 samples
DS_GTEX_RNASEQ_LMONLY_n8555x970.gctx For convenience of use during assay validation and inference testing, we separated out from the overall GTEx RNA-seq dataset (DSGTEx-RNA-seq), the subset of landmark genes and refer to it as DSGTEx-RNA-seq-lmonly in the methods. Note that the GTEx pre-processing provided mappings for 970 of the landmark genes. rows: 970 landmark genes columns: 8,555 samples
DS_GTEX_RNASEQ_INF_n8555x12320.gctx Inferred version of DSGTEx-RNA-seq dataset generated by applying DSGEO-OLS to DSGTEx-RNA-seq-lmonly. rows: 970 landmark + 11,350 inferred = 12,320 genes columns: 8,555 samples

GSE70138 (aka LINCS Phase II L1000 dataset)

LINCS Phase 2 data is in GEO series GSE70138. This series will be updated every 6 months as more L1000 data is produced and QC'ed over the duration of the LINCS program (from 2016 through 2020). The same data will also become available through the LINCS DCIC portal.

Filename description File type
GSE70138_Broad_LINCS_Level1_LXB_n345976.tar.gz Level 1 data (raw fluorescence intensity measurements / LXB) Gzipped tar of directory containing binary lxb files
GSE70138_Broad_LINCS_Level2_GEX_n345976x978.gctx.gz Level 2 data (raw gene expression / GEX) GCTX
GSE70138_Broad_LINCS_Level3_INF_mlr12k_n78980x22268_2015-06-30.gct.gz Level 3 data (normalized & inferred / INF) GCTX
GSE70138_Broad_LINCS_Level4_ZSPCINF_mlr12k_n345976x12328.gctx.gz Latest Level 4 data (robust z-scores / ZSPC) GCTX
GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328.gctx.gz Latest Level 5 data (signatures from aggregating replicates) GCTX
Experimental metadata
GSE70138_Broad_LINCS_inst_info.txt.gz Metadata for individual experiments (levels 1-4) Gzipped tab-delimited text
GSE70138_Broad_LINCS_sig_info.txt.gz Metadata for signatures of aggregated replicates (level 5) Gzipped tab-delimited text
GSE70138_Broad_LINCS_gene_info.txt.gz Metadata for rows / genes of matrices Gzipped tab-delimited text
GSE70138_Broad_LINCS_gene_info_delaprime.txt.gz Metadata for rows / genes of matrices, applies to all profiles generated using the delta prime probe pool. See FAQ for details Gzipped tab-delimited text
GSE70138_SHA512SUMS Text file containing checksums calculated for each of the most recent above files, for use in verifying integrity of downloaded files Gzipped text

For GSE70138, all data levels have the same number of columns (345,976) except for level 5 (MODZS), which has 118,050 columns. The difference arises because the level 5 data is calculated by aggregating across individual replicates to generate a single signature for each group of replicates (generally 3 per experiment).

Note: We release data before full analysis and before publication, with the expectation that the data will be useful to others even in its early form. Therefore, some best practices to keep in mind for using released data:

  • check back here or contact us before you publish to see if any details have been updated

  • if you notice something odd / inconsistent / unexpected in the data, please email us with details

Over the course of the LINCS initiative, L1000 data generated as part of LINCS will be released to GEO twice a year. L1000 data generated from other sources will be released into the public domain as funding sources permit.

Data releases are appended to the GEO series object, allowing you to view earlier releases in addition to the current one. Each release is a full package (i.e., earlier releases are not overwritten), enabling researchers using the data for the first time to get the latest release, while at the same time providing a persistently addressable source for earlier datasets. Thus, please download the data labelled with the most recent date. The figure below shows the most recent files, circled in green, for the GSE70138 dataset.

GSE106127 (aka RNAi and CRISPR datasets)

RNAi and CRISPR datasets are in GEO series GSE106127. Note that this data is not new; it is subsetted from the data in GSE92742 and GSE70138 and is the portion of that data corresponding to genetic perturbational signatures from shRNA and CRISPR reagents.

Filename description File type
GSE106127_level_4_zspc_n341336x978.gctx.gz Level 4 data, plate-control normalized (robust z-scores / ZSPC) Gzipped gctx
GSE106127_level_5_modz_n119013x978.gctx.gz Level 5 data (moderated z-scores / MODZS) Gzipped gctx
GSE106127_CGS_n33839x978.gctx.gz Consensus gene signatures generated by combining level 5 signatures of individual shRNAs Gzipped gctx
GSE106127_level_4_PRIME_zspc_n341336x978.gctx.gz Level 4 data with the global first principal component (PC1) removed Gzipped gctx
GSE106127_level_5_PRIME_modz_n119013x978.gctx.gz Level 5 data with the global first principal component (PC1) removed Gzipped gctx
GSE106127_CGS_PRIME_n33839x978.gctx.gz CGS data with the global first principal component (PC1) removed Gzipped gctx
GSE106127_pc_coeff_global_n978x978.gctx.gz The principal component loadings derived by running PCA on the entire CMap level 5 data matrix (~470k signatures) Gzipped gctx
GSE106127_inst_info.txt.gz Metadata for each experiment in the Level 4 matrices (metadata for the columns in the Level 4 data matrices) Gzipped tab-delimited text
GSE106127_sig_info.txt.gz Metadata for each signature in the Level 5 matrices (metadata for the columns in the Level 5 data matrices) Gzipped tab-delimited text
GSE106127_sig_metrics.txt.gz Calculated / derived / inferred metrics and annotations associated with each signature in the Level 5 matrices Gzipped tab-delimited text
GSE106127_CGS_meta.txt.gz Metadata for each CGS signature in the CGS matrix (metadata for the columns in the CGS data matrix) Gzipped tab-delimited text
GSE106127_CGS_PRIME_meta.txt.gz Metadata for each CGS signature in the CGS PRIME matrix (metadata for the columns in the CGS PRIME data matrix) Gzipped tab-delimited text
GSE106127_gene_info.txt.gz Metadata for each measured feature / gene (metadata for rows of the data matrices) Gzipped tab-delimited text
GSE106127_SHA512SUMS.txt.gz Text file containing checksums calculated for each of the above files, for use in verifying integrity of downloaded files Gzipped tab-delimited text

GSE92743 (aka CMap-HBS Contest)

Additionally, CMap / LINCS organizes datasets into discrete bundles that address particular questions in the form of contests, which are archived at GSE92743. The goal is both to engage the wider computational community in improving analytics and to provide LINCS users with well-organized datasets and benchmarks that are easier to use.

Filename description File type
GSE92743_Broad_Affymetrix_training_Level3_Q2NORM_n100000x12320.gctx.gz Affymetrix data for 100,000 samples used by contestants for building their models Gzipped gctx
GSE92743_Broad_GTEx_L1000_Level3_Q2NORM_n3176x12320.gctx.gz All of the level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples Gzipped gctx
GSE92743_Broad_GTEx_L1000_Holdout_Level3_Q2NORM_n1000x12320.gctx.gz Just the holdout level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples Gzipped gctx
GSE92743_Broad_GTEx_L1000_NotUsed_Level3_Q2NORM_n1526x12320.gctx.gz Just the unused level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples Gzipped gctx
GSE92743_Broad_GTEx_L1000_Test_Level3_Q2NORM_n650x12320.gctx.gz Just the test level 3 data (normalized & inferred / INF) of L1000 measurements on GTEx samples Gzipped gctx
GSE92743_Broad_GTEx_RNAseq_Log2RPKM_q2norm_n3176x12320.gctx.gz Level 3 data (normalized) of RNA-seq measurements on GTEx samples Gzipped gctx
GSE92743_Broad_GTEx_gene_info.txt Metadata for each measured feature / gene (metadata for rows of the data matrices) tab-delimited text
GSE92743_Broad_GTEx_inst_info.txt Metadata for each experiment in the Levels 3-4 matrices (metadata for the columns in the Levels 3-4 data matrices) Gzipped tab-delimited text
GSE92743_Broad_OLS_WEIGHTS_n979x11350.gctx.gz Matrix of weights used in current CMap L1000 inference model Gzipped gctx
GSE92743_SHA512SUMS Text file containing checksums calculated for each of the above files, for use in verifying integrity of downloaded files text

Explanations of the metadata and metrics data column headers can be found in the Glossary. For additional help email us at clue@broadinstitute.org or attend Office Hours.

Last modified: Tue Jan 30 2018 14:54:19 GMT-0500 (EST)
