Access a suite of analysis apps by clicking on the menu (or press command-K to open it).

The first step in using the Query App to compute connections with your gene expression data is to assign a name to your query. Results will be stored in your Analysis History after your query is submitted.

Enter an up-regulated gene of interest, hit enter, and type in subsequent genes in the set you would like to query. You may also have down-regulated genes of interest. They can be entered in the box to the right.

Hit submit and the query algorithm will find connections between your genes of interest and perturbagens in CMap that have signatures most similar to your query. Data are generated in approximately 5 minutes and will be stored in your Analysis History.

The L1000 assay directly measures or infers the expression levels of 12,328 genes. By evaluating the current statistical model against a large compendium of RNA-Seq profiles from over 100 tissues from the GTEx consortium, we have identified a subset of 10,174 genes that are either measured or well inferred. This subset is known as the Best INferred Gene (BING) space. The Query App uses BING space to compute similarities between users' gene sets and the gene expression signatures in the CMap database. Each user entry is therefore mapped into one of the three following categories.
Invalid gene: Not a valid HUGO symbol or Entrez ID, and therefore not used in the query.
Valid gene: A valid HUGO symbol or Entrez ID that is also part of BING space, and therefore is used in the query.
Valid but not used in query: A valid HUGO symbol or Entrez ID that is not part of BING space, and therefore is not used in the query.

Click on a perturbagen in this table to see a CLUE Card that contains all of the information available for this perturbagen. You can also select any compound in the table to query connections with all other compounds in Touchstone. Click on Detailed List to view connections in a table, or click Heatmap to see connections in a matrix powered by the Morpheus App.

Filter the Touchstone data table by selecting perturbagen type or perturbational classes of interest.

Average transcriptional impact

Impact is assessed as a transcriptional activity score, which is calculated as a mean value of median replicate correlation and median signature strength of a perturbagen across multiple cell lines and doses. The score describes a perturbagen’s transcriptional activity, relative to all other perturbagens, as derived from its replicate reproducibility and magnitude of differential gene expression.

$$\mathrm{PCTCC}_i = \frac{\operatorname{rank}(\operatorname{median}(CC_i))}{N}$$

$$\mathrm{PCTSS}_i = \frac{\operatorname{rank}(\operatorname{median}(SS_i))}{N}$$

$$\mathrm{TAS}_i = \frac{\mathrm{PCTCC}_i + \mathrm{PCTSS}_i}{2}$$

where:

$\mathrm{TAS}_i$ is the transcriptional impact score for the i-th perturbagen

$\mathrm{PCTCC}_i$ is the percentile, relative to all other perturbagens, of the i-th perturbagen’s median replicate correlation coefficient (CC) across all of its signatures

$\mathrm{PCTSS}_i$ is the percentile, relative to all other perturbagens, of the i-th perturbagen’s median signature strength (SS) across all of its signatures

$N$ is the total number of perturbagens
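The score defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ranks are taken in ascending order (so the largest median gets rank N), ties are broken by input order, and the function names and toy values are hypothetical rather than part of CLUE.

```python
from statistics import median


def percentile_ranks(values):
    """Return rank / N for each value, where rank 1 is the smallest value.

    Assumption: ties are broken by input order, because the text does not
    say how ties are handled.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / n for r in ranks]


def transcriptional_activity_scores(cc_by_pert, ss_by_pert):
    """Compute TAS_i = (PCTCC_i + PCTSS_i) / 2 for every perturbagen.

    Both arguments map a perturbagen name to the list of replicate
    correlations (CC) or signature strengths (SS) across its signatures.
    """
    names = sorted(cc_by_pert)
    pct_cc = percentile_ranks([median(cc_by_pert[name]) for name in names])
    pct_ss = percentile_ranks([median(ss_by_pert[name]) for name in names])
    return {name: (c + s) / 2 for name, c, s in zip(names, pct_cc, pct_ss)}


# Hypothetical toy data for three perturbagens (not real CMap values).
cc = {"pert_a": [0.1, 0.2, 0.3], "pert_b": [0.5, 0.6], "pert_c": [0.8, 0.9, 0.7]}
ss = {"pert_a": [2.0, 3.0], "pert_b": [9.0, 8.0, 7.0], "pert_c": [4.0, 5.0]}
tas = transcriptional_activity_scores(cc, ss)
```

Because both percentiles lie between 1/N and 1, the resulting score also lies in that range, which is consistent with the 0.5 threshold used when drawing bars and chords.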

Signature diversity

Thick black bars signify Transcriptional Activity Scores greater than or equal to 0.5; thinner black bars denote scores less than 0.5. Absence of a bar means no data available. Colored lines (chords) signify similar connectivity scores between cell lines; red for positive connectivity scores of 80-100 (pale to intense color according to the score); blue for negative connectivity. Chords are only shown when TAS scores are > 0.5; thus absence of a chord either means that the perturbagen TAS score is very low, or that no data is available. Chords for individual cell lines can be isolated from the rest of the figure by hovering over the cell line name.

Baseline expression of this gene in each cell line is represented as a z-score (top numbers). Scores were calculated using the robust z-score formula:

$$z_i = \frac{x_i - \operatorname{median}(X)}{\operatorname{MAD}(X) \times 1.4826}$$

where:

$x_i$ is the expression value of a given gene in the i-th cell line

$X = [x_1, x_2, \ldots, x_n]$ is a vector of expression values for a given gene across n cell lines

$\operatorname{MAD}(X)$ is the median absolute deviation of X

1.4826 is a constant that rescales the score as if the standard deviation of X had been used instead of the MAD
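As a minimal sketch of the formula above, the following Python function computes robust z-scores for one gene. The function name, the zero-MAD check, and the example values are illustrative assumptions, not part of the CLUE code base.

```python
from statistics import median


def robust_z_scores(values):
    """Return (x_i - median(X)) / (MAD(X) * 1.4826) for each x_i in X."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # Assumption: treat a zero MAD as an error rather than divide by zero.
        raise ValueError("MAD is zero; the robust z-score is undefined")
    return [(x - med) / (mad * 1.4826) for x in values]


# Hypothetical expression values for one gene across five cell lines.
expression = [4.0, 5.0, 6.0, 7.0, 20.0]
z = robust_z_scores(expression)
```

In this toy vector the median is 6 and the MAD is 1, so the last cell line stands out as a strong positive outlier while the others stay close to zero. Using the median and MAD keeps that single extreme value from inflating the scale, which is the usual motivation for a robust z-score.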

Median and MAD expression values were calculated using RNA-Seq profiles from a total of 1022 cell lines, comprising data from the Cancer Cell Line Encyclopedia (CCLE; Barretina, et al.) and cell lines nominated by the CMap team. Plots show z-score values only for the core LINCS lines used by CMap in L1000 experiments. Light red or light blue regions indicate positive or negative outlier expression, respectively, of the gene relative to the other lines shown; z-score of a positive outlier in the corresponding cell line is in dark red and a negative outlier is in dark blue.

Summary class connectivity shows a boxplot that summarizes the connectivity of a class. Each data point, shown as a light gray dot, represents the median value of connectivity of one member to the other class members. (This corresponds to the median for each row, excluding the main diagonal, in the heatmap shown below.) The box is the distribution of those data points, where the box boundary represents the interquartile range, the vertical line within the box is the median, and the whiskers reflect the minimum and maximum values of the data (exclusive of extreme outliers, which may appear beyond the whiskers).
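The per-member data points described above (the median of each row, excluding the main diagonal) can be sketched as follows. The matrix values and the function name are hypothetical and chosen only to illustrate the calculation.

```python
from statistics import median


def member_connectivity_medians(matrix):
    """For each row i, return the median of matrix[i][j] over all j != i."""
    return [
        median(value for j, value in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    ]


# Hypothetical 3 x 3 connectivity matrix for a class with three members.
scores = [
    [100.0, 90.0, 80.0],
    [92.0, 100.0, 70.0],
    [85.0, 75.0, 100.0],
]
points = member_connectivity_medians(scores)  # one boxplot dot per member
```

Each entry of `points` corresponds to one light gray dot; the box and whiskers then summarize the distribution of those dots.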

Connectivity between members of class is a standard heat map of the connectivity scores, summarized across cell lines, between members of the class, where dark red represents the highest positive scores and deep blue the highest negative scores. Individual scores are revealed to the left below the map by hovering over each cell of the map.

Class inter-cell line connectivity is a plot of the median (black line) and Q25-Q75 connectivity scores (blue area around black line) for each cell line as well as the summary scores across cell lines. In some cases perturbations have not been tested in every cell line; the absence of data is indicated by a “0” for that cell line. The example shown reveals that these estrogen agonists show the strongest connectivity to each other in MCF7, a human breast cancer cell line that expresses the estrogen receptor.

Profile status

Colored portion of top bar indicates the Broad assays in which this compound has been profiled.

L1000 cell/dose coverage

For compounds profiled by L1000, the cell lines and dose range for which signatures are available are indicated by dark gray bars (a lighter gray bar indicates that no data are available for that cell line/dose combination). A bar displayed one row above the 10 µM row indicates that doses higher than 10 µM were tested. The 6 rows correspond to 6 canonical doses: 20 nM, 100 nM, 500 nM, 1 µM, 2.5 µM, and 10 µM. (In some cases non-canonical doses were tested; these are rounded to the nearest canonical dose for the purpose of this display. For example, if the dose tested was 3.33 µM, the 2.5 µM bar is shown in dark gray here.)
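The rounding of non-canonical doses can be illustrated with a short Python sketch. It assumes that "nearest" is measured on a log scale, which the text does not specify, and it does not model the extra row for doses above the top canonical dose; the function and constant names are hypothetical.

```python
import math

# The six canonical doses, expressed in micromolar (20 nM up to 10 uM).
CANONICAL_DOSES_UM = [0.02, 0.1, 0.5, 1.0, 2.5, 10.0]


def nearest_canonical_dose(dose_um):
    """Return the canonical dose closest to dose_um.

    Assumption: distance is measured on a log10 scale. The display may use
    a different rule.
    """
    return min(
        CANONICAL_DOSES_UM,
        key=lambda canonical: abs(math.log10(dose_um) - math.log10(canonical)),
    )


rounded = nearest_canonical_dose(3.33)  # the example dose from the text
```

For the example dose in the text, this sketch selects the 2.5 µM row; a linear distance would give the same answer for that particular value.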

Knowledge Base

Support

Contact Us

Email

If you need help, email clue@broadinstitute.org.
Please include the following information so that we can best help you:

  • Analysis you are trying to run
  • Tools and data you are using
  • Error message you are seeing (if applicable)

Office Hours

We are available every Thursday from 1:00 to 2:00 PM EST to help answer questions you might have about accessing our data, running analyses, or using our APIs, Apps, and Command Line Interfaces.

Attend in Person

Broad Institute
75 Ames St.
7th Floor Studio (7055)
Cambridge, MA 02142
Map

Or join by phone

+1 (415) 762-9988
+1 (646) 568-7788 (US Toll)
See global numbers
Meeting ID: 576 595 665

Attend via the webcast

Join us online via Zoom web conferencing.

Getting Started

For New Users

Creating and activating your account: To create an account, please visit clue.io/login, or click on the Log in button in the upper right-hand corner of the screen. At the bottom of the popup, click on Create an account and fill in all of the information on the Sign-up page. At the moment, CLUE registration is free for emails affiliated with academic institutions and non-profit organizations. Once you’ve successfully created a CLUE account, you will receive an email that contains an activation link.

Troubleshooting: If you are having trouble creating an account, make sure that you have filled in all of the information on the Sign-up page. Double-check that the email you’ve used is affiliated with a non-profit organization. If your email is affiliated with a non-profit organization, and you are still unable to create an account, please email our team at clue@broadinstitute.org and we will contact you with further instructions.

Using CLUE to query the Connectivity Map

Introduction

The CLUE environment allows users to interact with and analyze Connectivity Map (CMap) data via a collection of robust and easy-to-use web-based applications. These applications are intended to facilitate biological inquiry and hypothesis generation for both experienced and novice CMap users.

The Touchstone Dataset

CMap has generated expression profiles from thousands of perturbagens in a variety of cell contexts. While the entirety of CMap data and associated annotations are available for download from GEO and via the CLUE API, it is useful for analysis to focus on a set of well-annotated perturbagens assayed systematically in standardized conditions. We refer to these perturbagens as our Touchstone (TS) set, which consists of 8,870 perturbagens with annotated biological functions that have been profiled in the majority of the nine core CMap cell lines (see Tables 1 and 2 below). Users can access the Touchstone dataset via the Touchstone App, available in the CLUE menu or at clue.io/touchstone.

What follows are descriptions of our Apps to help you get started using the Connectivity Map.

Starting a Search from the Homepage

To find CMap data and metadata associated with a perturbagen of interest, enter the compound or gene name in the search box of the CLUE homepage (please note that searches in CLUE use the standard HUGO human gene names). The search brings up a page with two tables showing the top ten perturbagens from the Touchstone reference set whose signatures show the highest positive (upper table) and negative (lower table) connectivities in CMap (Fig. 1).

Fig 1. Top connectivities for MDM2.

The columns of the tables include:

The Connectivity Score: a value between +100 and -100 representing the relative strength of a given signature in an instance from the total set of instances calculated upon execution of a query. A high positive connectivity score indicates that the corresponding perturbagen induced the expression of the query signature. A high negative connectivity score indicates that the corresponding perturbagen reversed the expression of the query signature. A zero or "null" connectivity score indicates that the corresponding perturbagen had no self-consistent effect upon expression of the query signature. Each score shown is a summary of the connectivity scores across the core cell lines tested in CMap. Instances are rank ordered by descending connectivity score.

The Perturbagen Type:

  • minus sign enclosed in a purple circle represents gene knockdown (shRNA)
  • plus sign enclosed in a blue circle represents gene overexpression
  • gold half circle represents a compound treatment

ID: a CMap ID assigned to the perturbagen

Name: gene name (HUGO standard gene name), or compound name

Description: gene description (HUGO Approved Name description) or compound mechanism of action (compiled from public sources by CMap curators)

Starting a Search from the Touchstone App

“Touchstone” refers to the set of perturbagens that 1) are annotated for their biological roles or functions, and 2) have been profiled in the majority of the nine core cell lines used by CMap (referred to as “core” cell lines).

| Cell Line | Description |
| --- | --- |
| A375 | Human malignant melanoma cell line |
| A549 | Human non-small cell lung carcinoma cell line |
| HA1E | Human kidney epithelial immortalized cell line |
| HCC515 | Human non-small cell lung adenocarcinoma cell line |
| HEPG2 | Human hepatocellular carcinoma cell line |
| MCF7 | Human breast adenocarcinoma cell line |
| PC3 | Human prostate adenocarcinoma cell line |
| VCAP | Human metastatic prostate cancer cell line |
| HT29 | Human colorectal adenocarcinoma cell line |

Table 1. Core cell lines in which Touchstone perturbagens have been tested.

Currently 8870 perturbagens comprise the Touchstone set:

| Perturbagen Type | Number of Profiles |
| --- | --- |
| Compound | 2911 |
| Gene Knock-Down | 3799 |
| Gene Over-Expression | 2160 |
| Total | 8870 |

Table 2. Composition of the Touchstone set.

Begin a search in the Touchstone App by typing a perturbagen into the search box at the top of the page, and then checking the box next to that perturbagen’s name. Selecting the perturbagen opens its CLUE card on the right side of the window. Information about the CLUE card is found here.

Above the table showing the perturbagen names are three buttons:

The left button offers 3 options for rapid selection/de-selection of perturbagens. Below those are “shopping cart” options that save selections for later viewing and manipulation. Finally, selections can be exported to a file using the “Export” option.

Once you have searched for the perturbagen and selected your choice(s) from the search results, you can view the connections between that perturbagen and the rest of our dataset by clicking the Detailed List or Heatmap button. These are explained in detail below.

Detailed List View

The Detailed List button opens a table of the entire list of CMap perturbagens ranked according to their connectivity to the gene or compound originally searched. The histograms at the top of the page represent connectivity scores for each of the perturbagens in each of 8 different cell lines; “Summary” refers to a summarized score across cell lines for each perturbagen. Clicking on any perturbagen in the table 1) highlights its position in each histogram, which reflects its connectivity score in that cell line, and 2) exposes the “CLUE Card” (right side of page) which contains graphical presentations of CMap data, and other metadata, for that particular perturbagen (Fig 2). Detailed explanations of the information in CLUE Cards can be found here.

Fig 2. Detailed List view. Connectivity of SR-95639A to MDM2 knockdown in the core cell lines and summarized across cell lines is shown at the top. CLUE Card of SR-95639A is shown on the right.

On the left side of the page are options for filtering the table to reveal categories or groups of data. Data can be filtered primarily by perturbagen type (compound, gene knockdown, gene overexpression) or CMap Class, which includes 1) compounds of similar curated mechanism of action that show strong connectivity to each other in CMap, and 2) genes that function in the same pathway and show strong connectivity to each other. The connectivity score between the perturbagen searched and the CMap Class is indicated next to each CMap Class name.

Heat Map View

Connectivity relationships between the perturbagen searched and the Touchstone reference set can also be viewed as a heat map in our Integrated Connectivity Viewer (ICV). The heat map is a matrix where the rows are Touchstone perturbagens and the columns are the perturbagen searched, for each of the cell lines in which it was assayed, as well as the summary data for that perturbagen across cell lines. The viewer is interactive, allowing the user, for example, to sort and subset data, or to view more information about the perturbagens or the data. These functionalities are accessed from the toolbar of icons at the top of the window and by opening the Quick Tools menu via the button on the far right top of the page. A tutorial is also accessible from the blue and red icon next to the “Show Quick Tools” button (Fig 3). Here we review basic sorting, searching and subsetting functions in the ICV tool. Additional ICV features are also present in the Morpheus tool and are described in that section below.

Fig 3. Integrated Connectivity Viewer. Shown is the top section of the viewer; functionalities are accessible from the tool icons (blue circle) and the Quick Tools (arrows).

Using the Integrated Connectivity Viewer (ICV)

1. Selecting rows/columns: click on the row or column header; to deselect, right click on that header to pull up the contextual menu, and click “Clear Selection”.

2. Sorting the data in each column or row is done by double clicking:

  • double click on the column or row header once to sort the data from highest to lowest value
  • double click twice on the column or row header to sort the data from the lowest to the highest value
  • double click three times on the column or row header to order the data from the highest to lowest values in groups of 10

Note: sorting can also be done using the Sort tool in the toolbar:

3. Searching rows or columns

  • To get a quick view of the data in a column or row, type the data to search for in the row/column search box. A blue tick mark will appear in the side or bottom scroll bar. Hover the cursor over the tick mark to view the heat map from that data point.
  • To move that data to the top or side of the matrix, click the Up Arrow button next to the search box of the data you searched for. If you searched in the rows, the data will move to the top of the matrix; if you searched a column, it will move to the far left side of the matrix.

4. Subsetting the matrix - Slices of the data can be isolated from the rest of the matrix to be viewed and manipulated separately. This can be done in one of two ways:

  • Position the rows or columns to be sliced next to each other in the matrix. Drag the cursor over the set of rows or columns, and type “command x”. The sliced data will appear in a new tab of the Morpheus window.
  • “command select” each column or row to be moved. Hit “command x” to slice them into a single view.

Note: the “command x” function is also accessible from the tools menu, as “New Heat Map”:

Submitting a Query Using the Query App

Use the Query app to submit a list of up- and down-regulated genes to determine which perturbagens in CMap generate the most similar (or opposite) gene lists to your query, indicating their degree of connectivity to your query signature.

To add context around how the Query app works, the L1000 assay (from which the Touchstone dataset profiles are generated) directly measures or infers the expression levels of 12,328 genes. By evaluating the current statistical model against a large compendium of RNA-Seq profiles from over 100 tissues from the GTEx consortium, the CMap team has identified a subset of 10,174 genes that are either measured or well-inferred. This subset is known as the Best INferred Gene (BING) space. The Query App uses BING space to compute similarities between users' gene sets and the gene expression signatures in the CMap database. Each user entry is therefore mapped into one of the three following categories:

Invalid gene: Not a valid HUGO symbol or Entrez ID, and therefore not used in the query.

Valid gene: A valid HUGO symbol or Entrez ID that is also part of BING space, and therefore is used in the query.

Valid but not used in query: A valid HUGO symbol or Entrez ID that is not part of BING space, and therefore is not used in the query.
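The three categories above amount to two set-membership checks, which can be sketched as follows. The gene identifiers and lookup sets here are placeholders; real membership in the HUGO/Entrez namespace and in BING space must come from CLUE itself, and the function name is hypothetical.

```python
def classify_genes(entries, valid_genes, bing_genes):
    """Sort user entries into the three Query App categories."""
    result = {"valid": [], "valid_not_used": [], "invalid": []}
    for gene in entries:
        if gene not in valid_genes:
            # Not a recognized symbol or ID: dropped from the query.
            result["invalid"].append(gene)
        elif gene in bing_genes:
            # Recognized and part of BING space: used in the query.
            result["valid"].append(gene)
        else:
            # Recognized but outside BING space: not used in the query.
            result["valid_not_used"].append(gene)
    return result


# Placeholder lookup sets (hypothetical identifiers, not real gene symbols).
VALID = {"GENE_A", "GENE_B", "GENE_C"}
BING = {"GENE_A", "GENE_B"}
report = classify_genes(["GENE_A", "GENE_C", "not_a_gene"], VALID, BING)
```

Only the entries under `"valid"` would contribute to the similarity calculation; the other two groups are reported back to the user but ignored.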

Make sure the list of genes includes only those that are valid, and submit the query. The results will post to your analysis history after a few minutes. Your history can be accessed from the Utility section of the Menu display. Select your query results when the status says “completed”, and view them either as a detailed list or a heat map (you can also preview the heat map). The Detailed List appears in the Connections app, and the heat map opens in the ICV, both of which are described above.

Users can rapidly access examples by clicking on the “example” link and making a selection on the popup menu. Selecting “MEK inhibitor,” for example, will populate the up-regulated and down-regulated gene boxes with a set of genes associated with that compound class.

It takes a few minutes for this query to run, and users can access results in their Analysis History, accessed either by the link next to the “Submit” button following the execution of a query, or in the main menu.

Working With Our Other Apps

Repurposing

The Repurposing App allows for easy exploration of a comprehensive library of clinical drugs. With over 4,000 unique compounds in the library, this resource is a critical first step in identifying new indications for existing drugs. The app consists of compiled information from public and proprietary data sources as well as vendor catalogs. Each compound has been assayed for purity, registered in the Broad compound management system, and annotated for structure, clinical development status, vendor information, mechanism of action, protein targets, and approved indications. Opening the app displays all currently annotated compounds in the library, which can be refined by either searching or filtering.

Compounds can be searched or filtered by:

  • Mechanism of action (MOA)
  • Target
  • Disease Area
  • Clinical Phase
  • Vendor
  • Purity

The compounds listed in the table will generate a target protein class tree at the top of the page as well as graphs displaying the purity and clinical phase distributions. Once a compound in the table is selected, the window on the right of the screen will display a card featuring all annotated information for that compound. Data can also be exported from the compound list table by clicking the export button.


Morpheus

The Morpheus tool is the matrix viewer that powers the ICV; the main difference between these tools is that users can load any type of data using Morpheus (not just gene expression data) as long as it has a matrix structure. The ICV, on the other hand, is pre-loaded with the gene expression data available in CLUE.

The following is a detailed description of how to use the features in both the ICV and Morpheus.

1. Selecting rows/columns: click on the row or column header; to deselect, right click on that header to pull up the contextual menu, and click “Clear Selection”.

2. Sorting the data in each column or row is done by:

  • double clicking on the column or row header once to sort the data from highest to lowest value
  • double clicking twice on the column or row header to sort the data from the lowest to the highest value
  • double clicking three times on the column or row header to order the data from the highest to lowest values in groups of 10

Note: sorting can also be done using the Sort tool in the toolbar:

3. Searching rows or columns

  • To get a quick view of the data in a column or row, type the data to search for in the row/column search box. A blue tick mark will appear in the side or bottom scroll bar. Hover the cursor over the tick mark to view the heat map from that data point.
  • To move that data to the top or side of the matrix, click the Up Arrow button next to the search box of the data you searched for. If you searched in the rows, the data will move to the top of the matrix; if you searched a column, it will move to the far left side of the matrix.

4. Subsetting the matrix - Slices of the data can be isolated from the rest of the matrix to be viewed and manipulated separately. This can be done in one of two ways:

  • Position the rows or columns to be sliced next to each other in the matrix. Drag the cursor over the set of rows or columns, and type “command x”. The sliced data will appear in a new tab of the Morpheus window.
  • “command select” each column or row to be moved. Hit “command x” to slice them into a single view.

Note: the “command x” function is also accessible from the tools menu, as “New Heat Map”:

5. Creating dynamically cut dendrograms

  • To create a dendrogram, go to Tools > Hierarchical Clustering, select clustering parameters, and click OK.
  • Once the dendrogram is generated, a dotted line will appear that can be clicked and dragged to adjust the cut of the clusters.

6. Manipulating Metadata

  • Metadata can be filtered to display a subset of the data by clicking the Filter icon and selecting various parameters.
  • Metadata annotations can also be added using an existing CSV file. To import this data, go to Open File and, from the Open file action drop-down menu, select Annotate columns or Annotate rows.
  • Annotations can also be added from the metadata collection via the Options tool. Select the Annotations tab and use the pull-down to add row or column annotations.
  • To remove annotations from the view, right-click on the annotation header and click on “delete”.
  • Calculated annotations can be added using the “Create Calculated Annotation” function in the Tools menu. For example: Enter "MAD" for the "Annotation name". Enter "MAD()" for "Formula" and click "OK". A new row annotation named "MAD" will appear to the right of the heat map.

7. Importing data to Morpheus

  • To import annotation data, use a tab-delimited text file, an Excel file, or a GMT file that indicates set membership.
  • To import matrices to Morpheus, use GCT, MAF, tab-delimited text, or Excel files.

8. Exporting data from Morpheus

  • The dataset can be saved as a GCT file by clicking the Save Dataset icon.
  • An image can be saved as a png or svg file by clicking the Save Image icon.
  • Column and row names can also be copied and pasted from Morpheus.

9. Viewing Preferences

  • An additional matrix can be overlaid on top of the existing matrix by opening the file of a new matrix and selecting Overlay onto current dataset. Data from the new matrix can be viewed with additional parameters such as size or shape by selecting conditional rendering.
  • Color schemes can be customized to easily view multiple data types, using the “Color Scheme” tab in the “Options” menu.
  • The matrix can be viewed with various degrees of detail by utilizing the Zoom features as well as the Fit To Window button which will allow the entire matrix to be viewed within the designated window size.
  • Data can be filtered to show only selected rows or columns. From the filter tool, choose Rows or Columns. Click the “Add” button to display the Fields available for filtering. Choose the field from the pull down. Click “Remove” to turn off the filtering.

10. Integrating

  • By clicking the Morpheus icon in the upper right corner and clicking “Linking”, a page with Morpheus Configuration Options will pop up.

11. Tools

  • Clicking the Tools icon leads to a list of analytical tools that can be applied to the data.

Hierarchical Clustering recursively merges objects based on their pairwise distance. Objects closest together are merged first, objects furthest apart are merged last. The result is a tree structure, referred to as a dendrogram, where the leaf nodes represent the original items and internal (higher) nodes represent the merges that occurred. Click here for a more detailed description of the hierarchical clustering algorithm and here for a comparison of the Pearson and Spearman correlation methods.

Marker Selection allows you to specify the rows and columns to view.

Nearest Neighbors generates a new row annotation named “Pearson correlation”, and indicates the genes that are most highly correlated with the selected gene.

New Heat Map allows for subsetting the matrix as discussed above.

Adjust allows for mathematical conversions of the data.

Collapse lets you alter the view to contain only the specified data.

Similarity Matrix allows you to change the matrix according to the metrics offered in the pull-down.

Transpose reverses the rows and columns. Hit “Transpose” a second time to undo.
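The merge procedure described under Hierarchical Clustering above can be illustrated with a small pure-Python sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members) and a plain absolute-difference metric; Morpheus offers its own choice of metrics and linkage methods, and the function name here is hypothetical.

```python
from itertools import combinations


def hierarchical_clustering(points, distance):
    """Agglomerative clustering with single linkage.

    Returns the merges in the order they happened, as
    (members_of_cluster_a, members_of_cluster_b, distance) tuples, where
    members are indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(distance(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical 1-D data: two tight pairs that are far from each other.
data = [0.0, 1.0, 10.0, 12.0]
tree = hierarchical_clustering(data, lambda x, y: abs(x - y))
```

The list of merges is the information a dendrogram draws: the two tight pairs are joined first, at small distances, and the final merge that joins them sits much higher in the tree. Cutting the dendrogram between those heights yields two clusters.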

12. Creating Charts

  • Select Columns and Rows to display in the chart and hit the chart icon. Various chart types and parameters can be selected. The rows and columns selected can be changed with the chart window still open.

Datasets

Data Formats

GCT (Gene Cluster Text Format)

GCT is a matrix-based file format annotated with metadata. In addition to a matrix of numerical data, a GCT file contains row and column ids and annotations. In the case of L1000 gene expression data, the columns of the matrix correspond to profiles and the rows correspond to genes. The values of the cells correspond to some type of expression measurement (raw, normalized, differential, etc.) made on each gene in each profile. The column annotations link to the experimental conditions for each profile. For example, they might contain the compound name, dose, and treatment time for a small molecule treatment. Similarly, the row annotations describe the genes and might contain the gene name, Affymetrix probe set ID, and a short description.

GCT files are available in three main formats, GCT v1.3, GCT v1.2 (text), and GCTX (binary).

An example GCT v1.3 file can be downloaded here.

GCT v1.3

A GCT file (version #1.3) is a tab-delimited file with the following required fields:

The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:

#1.3

The second line is tab-delimited and contains numbers indicating the following:

  • n1: number of rows in data matrix
  • n2: number of columns in data matrix
  • n3: number of row metadata fields
  • n4: number of column metadata fields.

The rest of the file holds the column metadata, row metadata, and matrix of expression values, concatenated as illustrated in the figure below; note that the area to the left of the column metadata is a tab-delimited matrix of “#” for proper spacing and alignment. To be a valid GCT object, there must be a dimensional match between the unique row ids used for row metadata annotations and the rows of the data matrix, as well as a dimensional match between the unique column ids used for column metadata annotations and the columns of the data matrix. Furthermore, the headers for row and column metadata fields must be unique (to disambiguate access).

For example, for a matrix of 20 samples by 1,000 genes with 5 row annotations and 10 column annotations, the first two lines would look like this:

#1.3

1000 20 5 10
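As a rough illustration of the two header lines described above, the following Python sketch reads them into named counts. The function name and dictionary keys are hypothetical; for real work, the client libraries mentioned later in this document are the appropriate tools.

```python
def parse_gct_header(lines):
    """Parse the first two lines of a GCT v1.3 file into a dict of counts."""
    version = lines[0].strip()
    if version != "#1.3":
        raise ValueError(f"unexpected version line: {version!r}")
    fields = lines[1].split()
    if len(fields) != 4:
        raise ValueError("second line must hold four integers")
    n_rows, n_cols, n_row_meta, n_col_meta = (int(field) for field in fields)
    return {
        "rows": n_rows,  # n1: rows in the data matrix
        "cols": n_cols,  # n2: columns in the data matrix
        "row_meta": n_row_meta,  # n3: row metadata fields
        "col_meta": n_col_meta,  # n4: column metadata fields
    }


# The example from the text: 1,000 genes by 20 samples,
# with 5 row annotations and 10 column annotations.
header = parse_gct_header(["#1.3\n", "1000\t20\t5\t10\n"])
```

Reading these four numbers first lets a parser check the dimensional match between the metadata blocks and the data matrix before loading the rest of the file.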

The first few rows and columns of an example GCT v1.3 file are seen below.

GCTX

To enable efficient random access to these data, we use a binary format based on HDF5 technology called GCTX. While conceptually similar to GCT v1.3, GCTX is a way to structure content in an HDF5 file compatible with the data we generate. Importantly, this allows for the storage of every profile we generate in a single unified file that new signatures can simply be appended to as they are generated; each column in the GCTX matrix is a signature and each row is a gene in that signature. The current size of the GCTX file containing all LINCS profiles is 110 GB. The GCTX file format can easily handle files of this size by allowing on-disk slicing of data using hyperslab selection. Hyperslabs are constructed and read into memory as subsets only when needed. Therefore, if we do not want to read in the entire file, we can read it one subset at a time, dramatically improving the performance of analytics run on the dataset (i.e., tools do not need 110 GB of memory to read data). The image below shows a hyperslab schematic and the read time for various sizes of random column sets.

Below is an example of how a GCTX file is structured:

Tools for parsing and interacting with GCT and GCTX files:

1. Graphical User Interfaces

The HDF Group provides a Java-based HDF5 file format viewer called HDFView. HDFView is compatible with all HDF5-formatted files, including .gctx files. The Connectivity Map Group at the Broad Institute has developed a software package focused on gene expression data analysis called GENE-E that is capable of reading and writing the GCTX file format.

2. Command Line Interfaces

The HDF Group provides a suite of command-line tools for working with HDF5 files; these can also be used to explore GCTX files. The tools can be found on the HDF5 website.

3. Client Libraries

The Connectivity Map Group at the Broad Institute has developed client libraries for reading and writing GCTX files in R, Python, MATLAB, and Java. These tools can be found in the l1ktools GitHub repository.

Below is a GCTX specific example of how the client libraries fit with the HDF5 software infrastructure.

GMT (Gene Matrix Transposed Format)

The GMT file format is a tab delimited file format that describes gene sets or other collections of elements. In the GMT format, each row represents a collection; in the GMX format, each column represents a collection.

An example GMT file can be downloaded here.
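A minimal sketch of reading GMT text into a dictionary follows. It assumes the common GMT layout in which the first field of each row is the collection name, the second is a free-text description, and the remaining fields are the members; that layout, the function name, and the example sets are assumptions for illustration, not details stated above.

```python
def parse_gmt(text):
    """Return {collection_name: [members]} from GMT-formatted text.

    Assumes each row is: name <TAB> description <TAB> member1 <TAB> member2 ...
    """
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"row has no description field: {line!r}")
        name, members = fields[0], fields[2:]
        # Drop empty cells left by trailing tabs.
        gene_sets[name] = [m for m in members if m]
    return gene_sets


# Two made-up collections, one per row.
example_gmt = (
    "SET_A\tfirst example set\tTP53\tMYC\tEGFR\n"
    "SET_B\tsecond example set\tBRCA1\tBRCA2\n"
)
gene_sets = parse_gmt(example_gmt)
```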

MAF (Mutation Annotation Format)

The MAF file format is a tab delimited text file that lists mutations. The format originates from The Cancer Genome Atlas (TCGA) project and pertains to human genomes. The full MAF specification can be found here.

GRP (Gene Set Format)

A GRP file contains a list of elements (typically gene symbols or feature identifiers) in a simple newline-delimited text format with one entry per line.

An example GRP file can be downloaded here.

GMX (Gene Matrix Format)

The GMX file format is a tab delimited file format that describes gene sets or other collections of elements. In the GMX format, each column represents a collection; in the GMT format, each row represents a collection.

An example GMX file can be downloaded here.
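Because GMX and GMT differ only in orientation, converting between them is a transposition. The sketch below turns GMX text into GMT-style rows; it assumes the first two rows of the GMX file hold collection names and descriptions, and that shorter collections are padded with empty cells. Those layout details and the function name are assumptions for illustration.

```python
from itertools import zip_longest


def gmx_to_gmt_rows(gmx_text):
    """Transpose GMX text (one collection per column) into GMT-style rows."""
    rows = [line.split("\t") for line in gmx_text.splitlines() if line.strip()]
    gmt_rows = []
    # zip_longest walks the columns and pads ragged rows with empty strings.
    for column in zip_longest(*rows, fillvalue=""):
        name, description, *members = column
        gmt_rows.append([name, description] + [m for m in members if m])
    return gmt_rows


# Two made-up collections stored column-wise; SET_B has one fewer member.
example_gmx = (
    "SET_A\tSET_B\n"
    "first set\tsecond set\n"
    "TP53\tBRCA1\n"
    "MYC\tBRCA2\n"
    "EGFR\t\n"
)
gmt_rows = gmx_to_gmt_rows(example_gmx)
```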

JSON (JavaScript Object Notation)

APIs return data as standard JSON objects, where data is represented as name/value pairs. The full JSON specification can be found here.

Algorithms

Signature Generation and Data Levels

L1000 data proceeds through 5 levels in the signature generation pipeline. All levels other than level 1 are stored in GCT or GCTX format. Please see Datasets for a description of these formats.

LEVEL 1: Raw data from the Landmark Set - LXB file format

Level 1 data are the raw mean fluorescent intensity (MFI) values that come directly from the Luminex scanner. The MFI is proportional to the amount of amplicon bound to the bead, and hence also proportional to the expression levels of the genes that particular bead is interrogating. Each 384-well plate generates 384 LXB files, where each LXB file contains a fluorescent intensity value for each observed bead in the well.

LEVEL 2: Deconvoluted gene expression (GEX)

MFI values are analyzed in a peak deconvolution step to associate the expression levels with the appropriate genes. This step is necessary because each bead color is associated with two genes rather than one. To facilitate the analysis, separate bead batches that identify each gene are mixed in a 2:1 ratio for use in the assay. To deconvolute the single resulting value into two values and associate them with the appropriate genes, we construct a histogram of mean fluorescent intensity versus bead count.  This yields a distribution that generally consists of two peaks, a larger one that designates expression of the gene for which double the amount of beads are present, and a smaller peak representing the other gene. Using the K-means clustering algorithm, the distribution is partitioned into two distinct components and the median expression value for each component is then assigned as the expression value of the appropriate gene.
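The deconvolution idea can be sketched in a few lines of Python. This is a simplified illustration, not the production pipeline: it runs a one-dimensional k-means (k = 2) directly on per-bead intensities rather than on a histogram, and the function name, initialization, and bead values are assumptions made up for the example.

```python
from statistics import median


def deconvolute_two_peaks(intensities, n_iter=50):
    """Split bead intensities for one bead color into two peaks.

    Returns (value for the 2x-bead gene, value for the 1x-bead gene).
    """
    # Start the two centers at the extremes of the observed range.
    centers = [min(intensities), max(intensities)]
    for _ in range(n_iter):
        clusters = ([], [])
        for value in intensities:
            # Assign each bead to the nearest center.
            idx = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            clusters[idx].append(value)
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    # The larger cluster corresponds to the gene whose beads were mixed in at 2x.
    major, minor = sorted(clusters, key=len, reverse=True)
    return median(major), median(minor)


# Made-up well: 8 beads near 1000 (2x gene) and 4 beads near 4000 (1x gene).
beads = [980, 1010, 995, 1020, 990, 1005, 1000, 1015, 3950, 4020, 4000, 3990]
gene_2x, gene_1x = deconvolute_two_peaks(beads)
```

Using the median of each component, as the text describes, keeps the assigned value robust to a few stray beads that fall into the wrong cluster.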

LEVEL 3: Quantile normalized data (QNORM or Q2NORM)

Following deconvolution the L1000 data is taken through two normalization steps: Luminex Invariant Set Normalization (LISS) and Quantile Normalization (QNORM). To begin the LISS step, intensity values for the 10 invariant sets are log2-transformed to help stabilize the variance. We fit a power law curve to the median log2-intensities for the 10 invariant gene sets. This curve serves as a reference against which the experimental data are re-scaled, according to the equation:

y = a x^{b} + c

where x is the unscaled data and a, b, and c are constants.  LISS therefore serves as a method to convert between measured Luminex intensity and Affymetrix log2-expression values.

The QNORM step is then performed to standardize the shape of the expression profile distributions on each plate such that all of the data for a plate is on the same scale. First, the data is sorted by expression level. It is then normalized by setting the highest-ranking value in each profile to the mean of all the highest ranking values, the next highest value to the mean of the next highest values, and so on down to the data for the lowest expression level.  Following this quantile normalization step for a single plate, quantile-normalization is then performed on all plates within a cohort to standardize the data across plates (Q2NORM).  The following equation describes the normalized version of expression for each gene x:

x_{norm} = F_2( F_1( x ) )

where F_1 and F_2 are the first and second quantile-normalization steps, respectively.
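The two-step procedure can be sketched as follows. This is a minimal illustration under stated assumptions: each profile is a plain list of expression values with the same genes in the same order, ties are broken by position rather than averaged, and the second step simply pools every profile in the cohort. The function names and toy values are hypothetical.

```python
def quantile_normalize(profiles):
    """Give every profile the same value distribution (the mean at each rank)."""
    n_genes = len(profiles[0])
    sorted_profiles = [sorted(p) for p in profiles]
    # Mean of the lowest values, mean of the next lowest values, and so on.
    rank_means = [
        sum(sp[r] for sp in sorted_profiles) / len(profiles) for r in range(n_genes)
    ]
    normalized = []
    for p in profiles:
        order = sorted(range(n_genes), key=lambda g: p[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


def q2norm(plates):
    """First step within each plate, second step across all plates in the cohort."""
    within = [quantile_normalize(plate) for plate in plates]
    pooled = [profile for plate in within for profile in plate]
    return quantile_normalize(pooled)


# Toy cohort: two plates, two profiles each, three genes per profile.
plate_1 = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
plate_2 = [[1.0, 9.0, 4.0], [2.0, 8.0, 7.0]]
plate_1_norm = quantile_normalize(plate_1)
cohort_norm = q2norm([plate_1, plate_2])
```

After either step, every profile contains the same set of values and differs only in which gene receives which value.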

To obtain expression values for all genes in the transcriptome, we assume that an unmeasured gene g_k can be predicted from the measured "landmark" genes L_i via linear regression:

g_k = w_0 + \sum_{i=1}^{978} w_i L_i

where the w_i constitute the model weights as estimated from a diverse, ~12k-sample gene expression dataset obtained from GEO. Repeating this procedure for all unmeasured genes gives predicted measurements of all genes in the transcriptome.
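Applying the regression to one profile is a weighted sum, sketched below. The example uses three landmark values instead of 978 to stay readable, and the weights, intercept, and function name are invented for illustration; the real weights come from the trained model.

```python
def infer_gene(landmarks, weights, intercept):
    """Predict one unmeasured gene as intercept + sum of weight * landmark value."""
    if len(landmarks) != len(weights):
        raise ValueError("need exactly one weight per landmark gene")
    return intercept + sum(w * value for w, value in zip(weights, landmarks))


# Made-up numbers: 3 landmarks stand in for the 978 measured genes.
landmark_values = [8.0, 6.5, 10.0]
model_weights = [0.5, -0.25, 0.125]
predicted = infer_gene(landmark_values, model_weights, intercept=1.0)
```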

LEVEL 4: Differential expression data (ZSPC; Z-scored relative to population control)

To obtain a measure of relative gene expression, we use a Z-scoring procedure to generate signatures from normalized profiles. In this step, consider a particular gene x. We compute the differential expression of gene x in the i-th sample on the plate as:

z_i = \frac{x_i - \mathrm{median}(X)}{\mathrm{MAD}(X) \times 1.4826}

where X is the vector of normalized gene expression of gene x across all samples on the plate. The "median" and "MAD" represent the median and median absolute deviation, respectively, and the factor of 1.4826 is a scaling constant that rescales the data as if the standard deviation were used instead of the median absolute deviation.
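A minimal sketch of this robust z-score for a single gene across the samples on a plate is shown below. The function name and example values are made up, and the sketch simply refuses to run when the MAD is zero; how the pipeline treats that case is not described here.

```python
from statistics import median


def robust_zscores(values):
    """Z-score one gene across a plate using the median and scaled MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; z-scores are undefined in this sketch")
    return [(v - med) / (mad * 1.4826) for v in values]


# Made-up expression of one gene in five wells; the last well is strongly induced.
expression = [6.0, 7.0, 8.0, 9.0, 20.0]
z = robust_zscores(expression)
```

Because the median and MAD are barely affected by the one extreme well, that well receives a large z-score instead of inflating the scale for every other sample.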

LEVEL 5: Replicate-collapsed signature data (MODZ)

L1000 experiments are typically done in 3 or more biological replicates. For each replicate of each experiment, we compute its average Spearman correlation with all other replicates using level 4 (ZSPC) data. We then convert these average Spearman correlations to weights by normalizing their sum to 1 across all replicates. We next use these weights to compute a weighted average signature where each replicate contributes proportionally to its average correlation with the other replicates. This moderated z-score (MODZ) procedure serves to mitigate the effects of uncorrelated or outlier replicates, and can be thought of as a "de-noised" representation of the given experiment's transcriptional consequences.
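The weighting scheme can be sketched in plain Python as follows. This is an illustration only: the helper names and replicate values are invented, and the sketch assumes the average correlations sum to a positive number, since the handling of negative or near-zero correlations is not described above.

```python
def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def modz(replicates):
    """Return (weighted-average signature, per-replicate weights)."""
    n = len(replicates)
    avg_corr = []
    for i in range(n):
        others = [spearman(replicates[i], replicates[j]) for j in range(n) if j != i]
        avg_corr.append(sum(others) / len(others))
    total = sum(avg_corr)
    if total <= 0:
        raise ValueError("this sketch assumes a positive sum of correlations")
    weights = [c / total for c in avg_corr]  # weights sum to 1
    n_genes = len(replicates[0])
    signature = [
        sum(w * rep[g] for w, rep in zip(weights, replicates)) for g in range(n_genes)
    ]
    return signature, weights


# Made-up level 4 z-scores for 5 genes; the third replicate agrees poorly.
rep_1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
rep_2 = [-1.8, -1.1, 0.2, 0.9, 2.2]
rep_3 = [0.1, -1.5, 2.0, -0.4, 0.8]
signature, weights = modz([rep_1, rep_2, rep_3])
```

In this toy case the poorly correlated third replicate receives a smaller weight than the two concordant replicates, so it contributes less to the collapsed signature.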