L1000 Query Tutorial
This is a tutorial on how to perform L1000 queries from the CMap Query app using some example files (downloadable from the "Example gene lists" section below).
CLUE lets users query CMap in two ways: through individual queries, or in batches of up to 25 queries run at once. For an individual query, the input is a list of upregulated genes and an optional list of downregulated genes. These can be entered as plain text or imported from a file, such as a txt or gcp file. For a batch query, the input is a collection of gene lists, each list holding the upregulated genes for one query, and an optional collection of "down" lists. CMap usually stores these collections of gene lists as GMT files, where each row in the file represents a different list of upregulated or downregulated genes.
Example gene lists
For this tutorial, we will be using various example sets. Individual query has these examples loaded into the app itself, but batch query requires a file download.
For batch query, we will be using a set of four signatures. The up and downregulated gene lists are stored in these two GMT files: example_uptag_CRCGN009.gmt (representing upregulated genes) and example_dntag_CRCGN009.gmt (for downregulated genes). If you're curious, these signatures are from the CRCGN dataset.
Download both files by clicking on their links. Each row in a file is a distinct query: the first column is the name of the query, the second is a description, and the remaining columns are Entrez Gene IDs. For more information on the required format for these lists, see the "Other Tips" section at the bottom of this tutorial.
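To make the row layout concrete, here is a minimal Python sketch that reads GMT text into named gene lists. It assumes tab-separated columns (the usual GMT convention); the function name `parse_gmt`, the `GeneSet` type, and the two example rows are illustrative and not part of CLUE.

```python
from typing import Dict, List, NamedTuple


class GeneSet(NamedTuple):
    name: str
    description: str
    genes: List[str]


def parse_gmt(text: str) -> Dict[str, GeneSet]:
    """Parse GMT text: one gene list per row, columns separated by tabs."""
    gene_sets: Dict[str, GeneSet] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(
                f"line {line_no}: expected a name, a description, and at least one gene"
            )
        name, description = fields[0], fields[1]
        genes = [g.strip() for g in fields[2:] if g.strip()]  # drop empty trailing columns
        if name in gene_sets:
            raise ValueError(f"line {line_no}: duplicate list name {name!r}")
        gene_sets[name] = GeneSet(name, description, genes)
    return gene_sets


# Illustrative two-row GMT; the names and descriptions are made up.
example = "sig_A\tfirst example\t7157\t1956\nsig_B\tsecond example\t4609\t672\t\n"
sets = parse_gmt(example)
```

Reading a downloaded file would then be `parse_gmt(open("example_uptag_CRCGN009.gmt").read())`, assuming the file sits in the working directory.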
In the following steps, we will use these two files together to submit a batch of 4 queries with an "up" (uptag) and "down" (dntag) collection.
Optional: Creating gene set collections using Listmaker
The Query app uses files or the Listmaker app to load gene lists for queries. If you'd prefer to upload a file directly to Query without saving your gene lists in Listmaker, you can skip ahead to the next section.
To create a collection, go to Listmaker. From this page, you can upload a GMT file to create your collection of "up" gene lists and your collection of "down" gene lists. For this tutorial, we will upload multiple lists at a time using one GMT file, but you can also add lists individually. We will upload the GMT file with our up lists (uptag) and create a new collection. To do so, click the "+ Add" button and drag and drop the GMT file into the box. Specify the type of your collection as "Gene" (so that we can use it in the Query app), and create a new collection by typing "CRCGN up" into the Collection Name field and pressing Enter. You may leave the tags field empty. Click "Create Lists", then click "Finish" to refresh the page and view your new collection.
Screenshots showing the steps from the section above:
We then repeat this process and create a separate collection "CRCGN_dn" with our down sets (dntag). At the end of the process, we should have two new collections with four lists each.
We can also use Listmaker to upload a list for our individual query example using the same process. However, this will upload a single list, not a collection. You must specify the collection you would like the list to belong to.
Submitting an individual query
You can launch individual L1000 queries from the Query app by selecting the dropdown at the top of the page and switching to "Individual query." This option is available only once the dropdowns for Gene expression (L1000) and your chosen dataset have been selected.
You must also name the query in order to submit. Once this is done, you may start to load in your genes to the Up and Down boxes below the dropdowns. In individual query, you may load genes via drag and drop of a plain text file, loading a list from Listmaker, or typing in genes one by one in the boxes. You may also choose to load one of the examples mentioned in the instruction paragraph on the page, which will auto-populate the name and gene boxes.
Only genes marked with the valid symbol will be used in the query, and you may choose to remove genes with the invalid or unused symbol. Once your gene lists are ready in the boxes, you may click the Submit button to submit your query.
Submitting a batch query
Submitting a batch query is very similar to submitting an individual query, except that you submit a collection of gene lists instead of a single gene list. In the dropdown, select "Batch query" instead of "Individual query" to get started. In this mode, there is no option to type in genes; you must either upload files or select a Listmaker collection. If you are using file upload, drag and drop the files into the box. If you are using Listmaker, click the Load Collection button under "UP-regulated genes" and "DOWN-regulated genes." If your collection is not listed as available, you may not have set the type of the lists to "Gene" when creating the collection. You can also check the "Compute with sig_fastgutc_tool" option, which will reduce the runtime of your query substantially. After you have loaded both UP and DOWN, click the Submit button to submit your queries.
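Because a batch holds at most 25 queries, a larger collection has to be split before upload. The sketch below shows one way to do that in Python; the helper name `chunk_queries` is hypothetical, and the same split would need to be applied to both the up and down collections so that the lists stay paired.

```python
from typing import Iterator, List, Sequence

MAX_QUERIES_PER_BATCH = 25  # batch limit stated at the top of this tutorial


def chunk_queries(
    names: Sequence[str], size: int = MAX_QUERIES_PER_BATCH
) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` query names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield list(names[start:start + size])


# 60 made-up query names split into batches of 25, 25, and 10.
names = [f"query_{i}" for i in range(60)]
batches = list(chunk_queries(names))
batch_sizes = [len(batch) for batch in batches]
```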
Your results in History
After you've submitted your query, you can check its progress from the History page. Once the status is marked as "complete", you can check the box next to it and click the "Heat Map" button to view its connectivity results. If your query's status is "error" and you are using your own gene sets, check out the next section for debugging tips.
If you see the download icon next to your query, you may click the icon to download your results. The downloaded files comprise outputs of the CMap query tool at various levels of granularity. They include the query inputs, raw connectivity scores, normalized scores, and the background-adjusted tau values. For computational details, see Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171, 1437–1452.e17 (2017).
The following provides a brief description of the files provided:
uptag.gmt - Up component of the query (user input)
dntag.gmt - Down component of the query (user input)
cs_.gct - Text matrix of the weighted connectivity scores with row and column annotations. Scores match matrices/query/cs_n.gct
gutc_config.yaml - configuration file
/matrices/query/ Connectivity scores
up.gmt - Up component of the query (used by the query algorithm, uptag.gmt filtered to the requested feature space e.g. bing)
dn.gmt - Down component of the query (used by the query algorithm)
cs_n*.gctx - Combined weighted enrichment score (two-tailed weighted enrichment statistic of the query genesets applied to each signature in the database). Range [-1, +1]
cs_up_n*.gctx - Enrichment scores for the UP component of the query
cs_dn_n*.gctx - Enrichment scores for the Down component of the query
leadf_up_n*.gctx - Fraction of the query set in the leading edge of the enrichment (Up)
leadf_dn_n*.gctx - Fraction of the query set in the leading edge of the enrichment (Down)
/matrices/gutc/ Normalized and Summarized scores
cs_sig.gctx - Connectivity scores in cs_n*.gct filtered to Touchstone signatures
Normalized Connectivity scores
ns_sig.gctx - Signature level scores
ns_pert_cell.gctx - Per-perturbagen scores for each cell line
ns_pcl_cell.gctx - Per-PCL (Perturbation class) scores for each cell line
ns_pert_summary.gctx - Per-perturbagen scores summarized across cell lines via the Summly algorithm
ns_pcl_summary.gctx - Per-PCL scores summarized across cell lines
Background-adjusted tau scores, ranging from -100 to +100. These correspond to the heatmap displayed in the connectivity viewer on clue.io
ps_pert_cell.gctx - Tau values for ns_pert_cell
ps_pcl_cell.gctx - Tau values for ns_pcl_cell
ps_pert_summary.gctx - Tau values for ns_pert_summary
ps_pcl_summary.gctx - Tau values for ns_pcl_summary
query_info.txt - Query metadata
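The naming pattern in the /matrices/gutc/ listing above (a score-type prefix followed by an aggregation level) can be captured in a small lookup. This is only an illustrative sketch: the function name `describe_gutc_matrix` and the wording of the labels are assumptions, and only the file names listed above are recognized.

```python
import os
from typing import Optional, Tuple

SCORE_TYPES = {
    "cs": "connectivity score",
    "ns": "normalized connectivity score",
    "ps": "background-adjusted tau score",
}
LEVELS = {
    "sig": "signature level",
    "pert_cell": "per perturbagen, per cell line",
    "pcl_cell": "per PCL, per cell line",
    "pert_summary": "per perturbagen, summarized across cell lines",
    "pcl_summary": "per PCL, summarized across cell lines",
}
# Only the matrices named in the listing above.
KNOWN_MATRICES = {
    "cs_sig", "ns_sig",
    "ns_pert_cell", "ns_pcl_cell", "ns_pert_summary", "ns_pcl_summary",
    "ps_pert_cell", "ps_pcl_cell", "ps_pert_summary", "ps_pcl_summary",
}


def describe_gutc_matrix(filename: str) -> Optional[Tuple[str, str]]:
    """Return (score type, aggregation level) for a listed matrix, else None."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    if stem not in KNOWN_MATRICES:
        return None
    prefix, _, level = stem.partition("_")
    return SCORE_TYPES[prefix], LEVELS[level]


tau_per_cell = describe_gutc_matrix("matrices/gutc/ps_pert_cell.gctx")
not_a_matrix = describe_gutc_matrix("query_info.txt")
```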
Extracting cell-line-specific connectivity results:
We recommend using the tau scores, which are available in the ps_pert_cell and ps_pcl_cell matrices and match the results displayed in CLUE.
The cs_n*.gct file at the top level contains the signature-level connectivity score matrix, with annotations that include the cell line for each signature.
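As a rough illustration of working with per-cell-line tau values, the sketch below filters an in-memory table for one cell line. It assumes the scores have already been loaded into a nested dictionary (perturbagen, then cell line, then tau); the perturbagen names and values are made up, and the cutoff of 90 is an arbitrary illustrative choice, not a recommendation from this tutorial.

```python
from typing import Dict, List, Tuple


def connections_for_cell(
    tau: Dict[str, Dict[str, float]], cell: str, min_abs_tau: float = 90.0
) -> List[Tuple[str, float]]:
    """Return (perturbagen, tau) pairs for one cell line with |tau| >= cutoff,
    sorted from most positive to most negative."""
    hits = [
        (pert, by_cell[cell])
        for pert, by_cell in tau.items()
        if cell in by_cell and abs(by_cell[cell]) >= min_abs_tau
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Made-up values standing in for rows of a ps_pert_cell matrix.
tau_scores = {
    "pert_A": {"A549": 97.2, "MCF7": 12.0},
    "pert_B": {"A549": -95.1, "MCF7": -99.0},
    "pert_C": {"A549": 40.3},
}
a549_hits = connections_for_cell(tau_scores, "A549")
mcf7_hits = connections_for_cell(tau_scores, "MCF7")
```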
Other Tips
GMT format: For more information on the GMT format, please see the section on the GMT file format on the GSEA wiki.
Entrez Gene IDs: L1000 queries use Entrez Gene IDs as input. Before submitting a batch query, all genes must be converted to Entrez Gene IDs. For individual genes, we recommend using the NCBI Gene database and looking at the "Gene ID" field. To convert many genes at once, several online tools and packages may be useful, including DAVID, MyGene.info, and the clue.io API gene service.
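Once a symbol-to-ID table has been obtained from one of these services, applying it is straightforward. The following Python sketch assumes the table is already available as a dictionary; the function name `to_entrez` is hypothetical, and the three-entry mapping is a tiny hand-written stand-in for a real lookup.

```python
from typing import Dict, Iterable, List, Tuple


def to_entrez(
    symbols: Iterable[str], symbol_to_entrez: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Split symbols into (converted Entrez Gene IDs, symbols with no mapping)."""
    converted: List[str] = []
    unmapped: List[str] = []
    for symbol in symbols:
        key = symbol.strip()  # tolerate stray whitespace from pasted lists
        if key in symbol_to_entrez:
            converted.append(symbol_to_entrez[key])
        else:
            unmapped.append(key)
    return converted, unmapped


# Hand-written stand-in for a table exported from a conversion service.
mapping = {"TP53": "7157", "EGFR": "1956", "MYC": "4609"}
ids, missing = to_entrez(["TP53", " EGFR ", "NOT_A_GENE"], mapping)
```

Reporting the unmapped symbols separately makes it easy to spot genes that would otherwise silently drop out of a query.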
BING genes: L1000 queries are run against BING space signatures (roughly 10,000 genes), so when you submit your batch query, only the BING space genes are used. Genes that are not in BING space will not affect the result. To determine which genes are in BING space and are used in the query, you may use the /gene-space command in Command or the clue.io API gene service and look for genes marked "landmark" or "best inferred".
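To preview which genes would actually contribute, a collection can be filtered against a gene-space set before upload. This is a sketch under stated assumptions: the function name `filter_collection` is hypothetical, and the `gene_space` set below uses placeholder IDs, not the real BING list, which would need to be obtained from one of the sources above.

```python
from typing import Dict, List, Set, Tuple


def filter_collection(
    collection: Dict[str, List[str]], gene_space: Set[str]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Keep only genes in `gene_space`; also report lists left with no genes."""
    filtered = {
        name: [gene for gene in genes if gene in gene_space]
        for name, genes in collection.items()
    }
    emptied = [name for name, genes in filtered.items() if not genes]
    return filtered, emptied


# Placeholder IDs only; this is NOT the real BING gene space.
gene_space = {"1001", "1002", "1003"}
collection = {
    "sig_A": ["1001", "9999", "1003"],
    "sig_B": ["9999", "8888"],
}
filtered, emptied = filter_collection(collection, gene_space)
```

A list that ends up empty after filtering is worth checking before submission, since none of its genes would be used.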
Gene list names in collections: In order to match up and down collection contents, the names of your up and down gene lists must match. You can either match with identical names (e.g. "IMATINIB_LOW_DOSE" in the up collection matches to "IMATINIB_LOW_DOSE" in the down collection) or by including _UP and _DN suffixes (e.g. "IMATINIB_LOW_DOSE_UP" in the up collection matches to "IMATINIB_LOW_DOSE_DN" in the down collection).
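The naming rule can be checked before upload with a short script. The Python sketch below pairs up and down list names using the two conventions described above; the function names are hypothetical, and the sketch is slightly more permissive than the stated rule (it would also pair "X_UP" with a down list named "X"). Whether CLUE's own matching is case-sensitive is not stated here, so this sketch compares names exactly.

```python
from typing import Dict, List, Sequence, Tuple


def base_name(name: str, suffix: str) -> str:
    """Strip `suffix` from the end of `name` if present."""
    return name[: -len(suffix)] if name.endswith(suffix) else name


def pair_up_down(
    up_names: Sequence[str], dn_names: Sequence[str]
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (up -> down pairs, unmatched up names, unmatched down names)."""
    dn_by_base = {base_name(name, "_DN"): name for name in dn_names}
    pairs: Dict[str, str] = {}
    unmatched_up: List[str] = []
    for up in up_names:
        key = base_name(up, "_UP")
        if key in dn_by_base:
            pairs[up] = dn_by_base[key]
        else:
            unmatched_up.append(up)
    matched_dn = set(pairs.values())
    unmatched_dn = [name for name in dn_names if name not in matched_dn]
    return pairs, unmatched_up, unmatched_dn


pairs, only_up, only_dn = pair_up_down(
    ["IMATINIB_LOW_DOSE_UP", "SAME_NAME", "ORPHAN_UP"],
    ["IMATINIB_LOW_DOSE_DN", "SAME_NAME", "EXTRA_DN"],
)
```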