The simplest form of a query is a list of genes. The query gene set should reflect the genes that are up- and/or down-regulated under a particular condition of interest, such as cells treated with a perturbagen or cells from a disease state. The list may come from your own work, in which case many tools, such as limma, DESeq2, and edgeR, can help you create a list of differentially expressed genes. You can also query with a published gene set developed by another lab or obtained from a resource like MSigDB, or even use the Touchstone app to query CMap with our perturbagens.
Queries should be designed keeping in mind the directionality of expression for which you want to find related signatures. For example, let's say you want to query with a particular disease signature to retrieve potential therapeutic hypotheses. If your up- and down-regulated lists correspond to the disease state, then the most positive connections will be to perturbagens that elicit transcriptional effects similar to the disease state, and the most negative connections will be to perturbagens with the opposite effect (i.e., potential treatments). If you reverse the list so that it mimics the normal state, then the most positive connections will be to perturbagens that might reverse the disease effect.
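To make the effect of direction concrete, here is a minimal Python sketch of reversing a query so that it mimics the normal state instead of the disease state. The function name `reverse_query` and the example gene symbols are illustrative assumptions only; they are not part of the CLUE query app.

```python
def reverse_query(up_genes, down_genes):
    """Swap the up and down lists so the query mimics the opposite state."""
    # What was up in disease becomes "down", and vice versa.
    return list(down_genes), list(up_genes)


# Hypothetical disease signature (symbols chosen purely for illustration).
disease_up = ["MYC", "CCND1", "EGFR"]
disease_down = ["TP53", "CDKN1A"]

normal_up, normal_down = reverse_query(disease_up, disease_down)
print("up:", normal_up)
print("down:", normal_down)
```

With the reversed lists, the most positive connections would point to perturbagens that might reverse the disease effect, as described above.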
We offer the following guidelines for constructing a gene set to use in the CLUE query app, based on our experience:
We recommend using sets of between 10 and 200 genes as query input.
Up and down gene sets can be of different sizes.
It is not necessary to have BOTH up and down gene sets. If you want to run a single-sided query, simply supply your gene set as either an 'up' or 'down' set.
Your up and down lists do not need to be ordered from most to least up- or down-regulated. As long as the up and down lists are entered separately, the order of the genes within each list does not matter.
Note that we are developing methods to more robustly estimate the effect of query size on results; this is our rule-of-thumb guidance in the interim.
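The guidelines above can be turned into a quick pre-submission check. The following is a minimal sketch, not the validation performed by the CLUE query app. It assumes that the 10 to 200 gene recommendation applies to each of the up and down sets separately, and it adds one sanity check of our own (a gene listed in both sets) that is not part of the guidelines. The name `check_query` is hypothetical.

```python
MIN_GENES = 10   # lower bound of the recommended range
MAX_GENES = 200  # upper bound of the recommended range


def check_query(up_genes=(), down_genes=()):
    """Return a list of warnings for a candidate query (empty if none)."""
    # Sets are used because gene order within each list does not matter.
    up, down = set(up_genes), set(down_genes)
    warnings = []
    # A single-sided query is fine, but at least one set is needed.
    if not up and not down:
        warnings.append("supply at least one of the up or down gene sets")
    # Assumption: the size recommendation is applied per set.
    for name, genes in (("up", up), ("down", down)):
        if genes and not (MIN_GENES <= len(genes) <= MAX_GENES):
            warnings.append(
                f"{name} set has {len(genes)} genes; "
                f"recommended range is {MIN_GENES}-{MAX_GENES}"
            )
    # Extra check (our assumption): a gene should not be both up and down.
    both = up & down
    if both:
        warnings.append(f"genes listed as both up and down: {sorted(both)}")
    return warnings


# Placeholder gene names; the two sets deliberately differ in size.
up = [f"GENE{i}" for i in range(25)]
down = [f"GENE{i}" for i in range(100, 105)]
for warning in check_query(up, down):
    print(warning)
```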
What if many of the genes in my query are 'valid but not used'?
For external queries (i.e., defined in other datasets or provided by users), the CLUE app uses as input genes that are in the best inferred gene (BING) space, which includes landmark and well-inferred genes. In the rare event that only a small number of input genes overlap with the BING space, as indicated by their being flagged as 'valid but not used', there are a couple of things you can try to increase the overlap.
Try using Entrez IDs as input instead of gene symbols. Because gene symbols change frequently, it's possible that a symbol may not be recognized by the CLUE app. Entrez IDs are more stable and are therefore more reliable identifiers.
Redefine the gene set in the space of the BING genes. If you've generated the gene set yourself using a differential expression analysis method, you could try rerunning the analysis but first restricting the gene space to those that overlap with BING. The list of the BING genes can be obtained from the CLUE API.
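Before submitting, you can estimate how much of your query falls in the BING space. The sketch below assumes you have already obtained the BING gene list (for example from the CLUE API, which is not called here) and loaded it as a list of Entrez IDs stored as strings. The function name `split_by_bing` and the toy identifiers are hypothetical and do not represent the real BING list.

```python
def split_by_bing(query_genes, bing_genes):
    """Split a query into genes inside and outside the BING space."""
    bing = set(bing_genes)
    used = [g for g in query_genes if g in bing]
    not_used = [g for g in query_genes if g not in bing]
    return used, not_used


# Toy stand-in for the BING list (hypothetical; not the real list).
bing_ids = ["7157", "1956", "4609", "595"]
# Query given as Entrez IDs, which are more stable than gene symbols.
query_ids = ["7157", "4609", "999999"]

used, not_used = split_by_bing(query_ids, bing_ids)
overlap = len(used) / len(query_ids)
print("in BING:", used)
print("not in BING:", not_used)
print(f"overlap: {overlap:.0%}")
```

The same BING list can be used to restrict the gene space before rerunning a differential expression analysis, so that the resulting gene set is defined entirely within BING.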
What can you tell me about using gene spaces based on experience and current practices in the CMap group?
CLUE apps use the genes from the landmark or best-inferred sets in queries, as follows:
For internal queries (i.e., with L1000 data), the default is to use only the 50 most up- and down-regulated landmark genes as the query. Inferred genes are not used because they provide no additional information relative to the landmarks.
For external queries (i.e., defined in other datasets or provided by users), the CLUE app uses as input genes that are in the best inferred (BING) gene set, which includes landmark and well-inferred genes.
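As an illustration of the internal-query default, the sketch below selects the most up- and down-regulated landmark genes from a signature. It is not the CLUE implementation. It assumes the signature is a dictionary mapping each gene to a differential expression score (for example a z-score), that the landmark gene list is already available, and that the default means 50 genes in each direction. The name `top_landmark_genes` is hypothetical.

```python
def top_landmark_genes(signature, landmark_genes, n=50):
    """Pick the n most up- and n most down-regulated landmark genes.

    signature: dict mapping gene -> differential expression score.
    """
    landmarks = set(landmark_genes)
    # Keep landmark genes only; inferred genes are ignored.
    scored = [(g, s) for g, s in signature.items() if g in landmarks]
    # Rank from most up-regulated to most down-regulated.
    ranked = sorted(scored, key=lambda gs: gs[1], reverse=True)
    # The sign filters keep the two sets disjoint for small signatures.
    up = [g for g, s in ranked[:n] if s > 0]
    down = [g for g, s in ranked[::-1][:n] if s < 0]
    return up, down


# Toy signature with placeholder gene names and scores.
signature = {"A": 2.5, "B": -1.2, "C": 0.4, "D": -3.0, "E": 1.1}
landmarks = ["A", "B", "D", "E"]  # "C" stands in for an inferred gene

up, down = top_landmark_genes(signature, landmarks, n=2)
print("up:", up)
print("down:", down)
```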
For query analyses using other tools:
For an external dataset that covers the entire transcriptome (e.g., data generated on an Affymetrix chip or via RNA-Seq), use the BING space, as it maximizes the gene overlap between the query and the perturbational signatures. Even if individual gene inferences aren't perfect, gene-set-based connections work well, because many analyses focus on higher-level results (e.g., summarization into pathways via GSEA).
For queries using signatures generated from L1000: when a compound or gene has been profiled as part of a CMap-L1000 project, the apps on clue.io by default compute relationships based on the L1000 (978) genes only. When a collaborator has generated L1000 data for their own project (e.g., patient samples profiled on L1000), use the L1000 (978) space for connections to perturbagens profiled in L1000, and use BING for pathway analysis. Because the L1000 genes were not picked to emphasize preselected pathways, their coverage of any given pathway might be small, so inference is likely to help.