The information in this article is from the supplemental information in our published manuscript.
The fundamental unit of CMap analysis is the query. A query (q) consists of a set of genes corresponding to any biological state of interest. Each gene in the query carries a sign indicating whether it is up-regulated or down-regulated. Thus each query yields a pair of mutually exclusive gene lists (qup, qdown). The query is compared to each signature in the CMap reference database (Touchstone) using the similarity metric described below to assess connectivity, that is, the degree to which the up-regulated query genes (qup) appear toward the top of the rank-ordered signature and the down-regulated query genes (qdown) appear toward the bottom (positive connectivity), or vice versa (negative connectivity). The result of a query is a list of CMap signatures rank-ordered by their connectivity scores.
Computing similarities - Weighted Connectivity Score (WTCS)
The weighted connectivity score (WTCS) is a non-parametric similarity measure based on the weighted Kolmogorov-Smirnov enrichment statistic (ES) described previously (Subramanian et al., 2005). WTCS is a composite, bi-directional version of ES. For a given query gene set pair (qup, qdown) and a reference signature r, WTCS is computed as follows:

$$
\mathrm{WTCS}_{q,r} =
\begin{cases}
(ES_{up} - ES_{down})/2, & \text{if } \operatorname{sgn}(ES_{up}) \neq \operatorname{sgn}(ES_{down}) \\
0, & \text{otherwise}
\end{cases}
$$
where ESup is the enrichment of qup in r and ESdown is the enrichment of qdown in r. WTCS ranges between -1 and 1. It is positive for signatures that are positively related to the query, negative for those that are inversely related, and near zero for signatures that are unrelated. A null (0) score is assigned when ESup and ESdown have the same sign.
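The computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the CMap implementation: the function names `enrichment_score` and `wtcs`, the use of the absolute signature value as the hit weight, and the `weight` exponent are all assumptions made for the example, following the general form of the weighted enrichment statistic in Subramanian et al. (2005).

```python
import numpy as np

def enrichment_score(scores, gene_ids, gene_set, weight=1.0):
    """Weighted KS-style running-sum enrichment of gene_set in one signature.

    Hypothetical helper: hits are weighted by |score| ** weight (an assumption).
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # rank genes from most up- to most down-regulated
    ranked_ids = [gene_ids[i] for i in order]
    ranked_scores = scores[order]
    members = set(gene_set)
    in_set = np.array([g in members for g in ranked_ids])
    n_hit = int(in_set.sum())
    n_miss = len(ranked_ids) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    hit_weights = np.where(in_set, np.abs(ranked_scores) ** weight, 0.0)
    total = hit_weights.sum()
    if total == 0:
        return 0.0
    # Step up (by weight) at each hit, step down (uniformly) at each miss.
    steps = np.where(in_set, hit_weights / total, -1.0 / n_miss)
    running = np.cumsum(steps)
    # ES is the running-sum value with the largest deviation from zero.
    return float(running[np.argmax(np.abs(running))])

def wtcs(es_up, es_down):
    """Combine the two enrichment scores; 0 when they share a sign."""
    if np.sign(es_up) == np.sign(es_down):
        return 0.0
    return (es_up - es_down) / 2.0

# Toy signature (made-up values) and a query that agrees with it.
genes = ["a", "b", "c", "d", "e", "f"]
signature = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_up = enrichment_score(signature, genes, {"a", "b"})
es_down = enrichment_score(signature, genes, {"e", "f"})
print(wtcs(es_up, es_down))
```

In this toy case the up-regulated query genes sit at the top of the signature and the down-regulated ones at the bottom, so the score is strongly positive; swapping the two gene sets would flip its sign.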
Normalization of Connectivity Scores
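To allow for comparison of connectivity scores across cell types and perturbation types, the scores are normalized to account for global differences in connectivity that might occur across these covariates. Given a vector of WTCS values w resulting from a query, we normalize the values within each cell line and perturbagen type to obtain normalized connectivity scores (NCS). Positive scores are divided by the mean of the positive scores and negative scores by the magnitude of the mean of the negative scores, so that each score keeps its sign:

$$
NCS_{c,t} =
\begin{cases}
w_{c,t} / U^{+}_{c,t}, & \text{if } w_{c,t} > 0 \\
w_{c,t} / |U^{-}_{c,t}|, & \text{otherwise}
\end{cases}
$$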
where NCS_{c,t}, w_{c,t}, U^+_{c,t} and U^-_{c,t} are, respectively, the normalized connectivity scores, the raw weighted connectivity scores, and the signed means of the positive and of the negative raw weighted connectivity scores (evaluated separately) within the subset of Touchstone signatures corresponding to cell line c and perturbagen type t.
Overall, this procedure is similar to that used in Gene Set Enrichment Analysis, with the addition of bidirectional gene sets (i.e., up and down) as queries.
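As an illustration of the normalization step, the sketch below groups raw scores by cell line and perturbagen type and rescales positive and negative values separately. The function name `normalize_wtcs` and its arguments are hypothetical, and dividing negative scores by the magnitude of their mean (so that the sign is preserved) is an assumption of this example.

```python
import numpy as np

def normalize_wtcs(w, cell, pert_type):
    """Rescale raw WTCS values within each (cell line, perturbagen type) group."""
    w = np.asarray(w, dtype=float)
    cell = np.asarray(cell)
    pert_type = np.asarray(pert_type)
    ncs = np.zeros_like(w)  # scores that are exactly 0 stay 0
    for c, t in set(zip(cell.tolist(), pert_type.tolist())):
        group = (cell == c) & (pert_type == t)
        pos = group & (w > 0)
        neg = group & (w < 0)
        if pos.any():
            # Divide positive scores by the mean of the positive scores.
            ncs[pos] = w[pos] / w[pos].mean()
        if neg.any():
            # Divide negative scores by the magnitude of the negative mean,
            # which keeps them negative (assumption of this sketch).
            ncs[neg] = w[neg] / abs(w[neg].mean())
    return ncs

# Made-up scores: four signatures from cell line "A", one from "B".
raw = [0.2, 0.4, -0.1, -0.3, 0.5]
cells = ["A", "A", "A", "A", "B"]
types = ["trt_cp"] * 5
print(normalize_wtcs(raw, cells, types))
```

After this step a value of 1 means "as strong as the average positive score in that cell line and perturbagen type", which is what makes scores from different groups comparable.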
Connectivity Map Score
Tau (τ) compares an observed enrichment score to all others in a reference database. In principle, τ can be computed by comparison to scores from any database of reference signatures, and the most common approach is to generate a null distribution by random permutation. However, a more stringent test that avoids having to make assumptions regarding the complex correlation structure of gene expression data is to use a compendium of diverse, biologically relevant perturbational signatures, such as those in CMap-L1000v1, as it is these reference signatures against which any novel connection must compete. Thus, query results are scored with τ as a standardized measure ranging from -100 to 100; a τ of 90 indicates that only 10% of reference perturbations showed stronger connectivity to the query. Because the reference is fixed, τ can be used to compare results across queries - a connection with a significant p-value and FDR but low τ would suggest a highly promiscuous relationship whose connections are not unique.
While meaningful comparisons can be made between the NCS values of reference signatures with respect to query q, it is also useful to assess whether the connectivity between q and a particular signature r is significantly different from that observed between r and other queries. This is done by comparing each observed NCS value ncs_{q,r} between the query q and a reference signature r to a distribution of NCS values representing the similarities between a reference compendium of queries (Qref) and r. This procedure results in a standardized measure we refer to as Tau (τ) that ranges from -100 to +100 and represents the percentage of queries in Qref with a lower |NCS| than |ncs_{q,r}|, adjusted to retain the sign of ncs_{q,r} (the square bracket equals 1 when its condition holds and 0 otherwise):

$$
\tau = \operatorname{sgn}(ncs_{q,r}) \, \frac{100}{N} \sum_{i=1}^{N} \left[\, |ncs_{i,r}| < |ncs_{q,r}| \,\right]
$$
where ncs_{q,r} is the normalized connectivity score for signature r with respect to query q, ncs_{i,r} is the normalized connectivity score for signature r relative to the i-th query in Qref, and N is the number of queries in Qref. Our standard practice is for Qref to consist of queries obtained from exemplar signatures of Touchstone perturbagens that match the cell line and perturbation type of signature r. In principle, any arbitrary compendium of gene sets could be used, as long as the gene sets are large enough.
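The definition of τ translates directly into code. The following is a minimal sketch; the function name `tau` and the example reference values are made up for illustration and do not come from Touchstone.

```python
import numpy as np

def tau(ncs_qr, ncs_ref):
    """Signed percentage of reference queries whose |NCS| is below |ncs_qr|.

    ncs_qr  : NCS between the query q and reference signature r
    ncs_ref : NCS values between each query in Qref and the same signature r
    """
    ncs_ref = np.asarray(ncs_ref, dtype=float)
    frac_lower = np.mean(np.abs(ncs_ref) < abs(ncs_qr))
    return float(np.sign(ncs_qr) * 100.0 * frac_lower)

# Made-up reference distribution for one signature r (N = 10 queries).
reference = [0.1, -0.5, 1.0, -2.0, 0.3, 0.8, -1.2, 0.05, 1.5, -0.7]
print(tau(-1.3, reference))
```

Here 8 of the 10 reference values are smaller in magnitude than 1.3, and the observed score is negative, so the result is a τ of about -80.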
Summarization Across Cell Lines
When examining query results, it is often convenient to obtain a perturbagen-centric measure of connectivity that summarizes the results observed in individual cell types. This can be particularly helpful when searching for connections that persist across cell lines or when one is unsure which cell line to examine. Given a vector of normalized connectivity scores for perturbagen p, relative to query q, across all cell lines in which p was profiled, a cell-summarized connectivity score is obtained using a maximum quantile statistic:

$$
NCS_{p} =
\begin{cases}
Q_{hi}(ncs_{p,c}), & \text{if } |Q_{hi}(ncs_{p,c})| \geq |Q_{low}(ncs_{p,c})| \\
Q_{low}(ncs_{p,c}), & \text{otherwise}
\end{cases}
$$
where ncs_{p,c} is a vector of normalized connectivity scores for perturbagen p, relative to query q, across all cell lines in which p was profiled, and Qhi and Qlow are upper and lower quantiles, respectively. This procedure compares the Qhi and Qlow quantiles of ncs_{p,c} and retains whichever has the larger absolute magnitude. Thus, the maximum quantile is more sensitive to signal in a subset of the cell lines than measures of central tendency such as the mean or median. In the analyses presented here, we used Qhi = 67 and Qlow = 33 (i.e., the 67th and 33rd percentiles).
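A minimal sketch of the maximum quantile statistic is shown below. The function name `max_quantile` is hypothetical, and the use of NumPy's default (linearly interpolated) percentile is an assumption; the source does not specify an interpolation method.

```python
import numpy as np

def max_quantile(ncs_pc, q_hi=67, q_low=33):
    """Summarize per-cell-line NCS values for one perturbagen.

    Returns whichever of the q_hi and q_low percentiles has the larger
    absolute value (ties go to the upper percentile).
    """
    ncs_pc = np.asarray(ncs_pc, dtype=float)
    hi = np.percentile(ncs_pc, q_hi)
    lo = np.percentile(ncs_pc, q_low)
    return float(hi if abs(hi) >= abs(lo) else lo)

# Made-up scores: strong negative connectivity in three of five cell lines.
scores = [-3.0, -3.0, -3.0, 0.1, 0.2]
print(max_quantile(scores), np.mean(scores))
```

In this example the lower percentile falls within the three strongly negative cell lines, so the summary reflects that signal, whereas the mean is diluted by the two cell lines with little connectivity.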