CONNECTOPEDIA

What are the perturbagen types and controls in the CMap dataset?

Genetic and Compound Perturbagen Types

A perturbagen is a (chemical or genetic) reagent used in the laboratory to treat cells and measure the resulting biological response. In the case of CMap, changes in gene expression are measured following treatment with a perturbagen.

Chemical perturbagens are typically small-molecule compounds (e.g., drugs and tool compounds). Genetic perturbagens include libraries of CRISPR/Cas9 constructs, short hairpin RNAs (shRNAs), and open reading frames (ORFs) used to edit, knock down, or overexpress genes, respectively.

We capture the cell's response to perturbagens by measuring the expression changes for ~1000 genes (and inferring ~11k more for a total of ~12k genes). We refer to this vector of gene expression changes as the 'signature' of perturbing a particular gene or treating with a particular compound. We then compare these signatures, which contain many genes, to each other and to external gene sets to look for relationships between perturbagens.
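
To make the idea of comparing signatures concrete, the sketch below treats each signature as a mapping from gene to expression change and scores two signatures with a plain Pearson correlation over their shared genes. This is only an illustration: Pearson correlation is a stand-in chosen for simplicity, not the connectivity measure CMap uses, and the function name, gene names, and values are all hypothetical.

```python
import math


def signature_similarity(sig_a, sig_b):
    """Pearson correlation of two signatures over the genes they share.

    Each signature is a dict mapping gene name -> expression change.
    Illustrative stand-in only; not CMap's connectivity score.
    """
    genes = sorted(set(sig_a) & set(sig_b))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    a = [sig_a[g] for g in genes]
    b = [sig_b[g] for g in genes]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a signature with no variation cannot be correlated")
    return cov / math.sqrt(var_a * var_b)


# Toy signatures with made-up gene names and values.
knockdown = {"GENE_A": -2.0, "GENE_B": 1.0, "GENE_C": 0.5}
overexpression = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": -0.5}

same = signature_similarity(knockdown, knockdown)
opposite = signature_similarity(knockdown, overexpression)
```

A score near 1 means the two perturbagens move the shared genes in the same direction, and a score near -1 means they move them in opposite directions.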

Initially we generated gene knockdowns using shRNAs to silence gene transcripts. We are now complementing that collection by creating loss-of-function (LoF) mutations in an even larger assortment of genes using the CRISPR-Cas9 system. Note that these two methods may differ in how they affect the targeted gene's ability to produce functional mRNA: shRNA infection may still give rise to some functional transcript, while CRISPR-Cas9 is designed to render a gene completely nonfunctional. Thus connectivity results from an shRNA perturbation may not be identical to results from perturbation of the same gene using CRISPR.

For gene overexpression, it's possible that a gain of function will result, such that increased production of the gene product increases activity of the protein or the pathway in which it functions, or reveals a new function for the protein. Thus we sometimes see opposite connectivities between LoF and overexpression perturbations of the same gene, as might be expected. However, we don't always see opposing connectivities, which may indicate that overexpression does not enhance transcription or function of the targeted gene; in fact, overexpression may cause the cell to overcompensate such that its transcriptional state actually resembles the state of the knockdown. Alternatively, it's possible that the cell responds in a similar fashion to perturbation of a given gene in either direction. That is, regardless of whether the gene is overexpressed or knocked down, the cell arrives at the same transcriptional state.

These possibilities should be considered when exploring connections using genetic perturbations.

Controls

The goal of CMap data analysis is to extract a signature that represents the genes turned on or off upon treatment with the test perturbagen. To determine this signature, it's necessary to establish a control population to which the test experiment is compared.

In the current CMap data processing workflow that computes differential expression for each perturbation, we use a population control, which represents all other perturbagens on the same physical plate. We have found that use of the population control results in robust differential expression signatures, in that the value for each gene indicates how much it was affected by a specific perturbagen relative to a diverse collection of other perturbagens on the same plate.
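
As a rough sketch of what a population control can look like in code, the example below scores one gene in each well against the other wells on the same plate. The text above does not specify the statistic, so the robust z-score used here (median and scaled median absolute deviation) is an assumption for illustration, as are the function name, the `min_scale` floor, and the well IDs and values.

```python
import statistics

# Scaling factor that makes the MAD comparable to a standard deviation
# for normally distributed data.
MAD_TO_SD = 1.4826


def population_zscores(plate_values, min_scale=1e-6):
    """Score each well against all other wells on the same plate.

    plate_values: dict mapping well ID -> expression value of one gene.
    Returns a dict mapping well ID -> robust z-score.
    Assumed statistic (median/MAD); not taken from the CMap pipeline.
    """
    if len(plate_values) < 2:
        raise ValueError("need at least two wells on the plate")
    scores = {}
    for well, value in plate_values.items():
        # The population control for a well is every *other* well.
        others = [v for w, v in plate_values.items() if w != well]
        center = statistics.median(others)
        mad = statistics.median([abs(v - center) for v in others])
        # Floor the scale so identical control values do not divide by zero.
        scale = max(mad * MAD_TO_SD, min_scale)
        scores[well] = (value - center) / scale
    return scores


# Hypothetical plate: four unremarkable wells and one strong responder.
plate = {"A01": 8.0, "A02": 8.1, "A03": 7.9, "A04": 8.0, "A05": 10.5}
z = population_zscores(plate)
```

In this toy plate, well `A05` stands far above the rest of the plate while the other wells score close to zero. Because the population is diverse, a gene only receives a large score when a specific perturbagen moves it more than the typical perturbagen does.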

We also include control perturbations in our experimental designs for compound and genetic perturbations, and these controls are added to most plates. Vehicle control (designated ctl_vehicle/ctl_untrt) refers to the solvent used to administer compound treatments, which is usually DMSO. Vector control, or ctl_vector, is a negative control that refers to genetic perturbagens that either do not contain a gene-specific sequence, or whose gene-specific sequence targets a gene not expressed in the human genome (such as GFP or RFP). In all cases, negative controls are expected to be largely inert and therefore should not cause notable gene expression changes, and use of these controls helps us to monitor the technical fidelity of the experiment.

The table below describes the different types of perturbations found in the L1000 dataset, including controls and consensus signatures derived in silico by combining signatures of related reagents. (Note that we recommend using data from cgs consensus signatures over data from individual hairpins when possible.) The values in the 'pert_type' column are used in the data APIs and flat files to identify these types of perturbations.

Perturbagen Type	pert_type designation in metadata files
Compound	trt_cp
Peptides and other biological agents (e.g., cytokines)	trt_lig
shRNA for loss of function (LoF) of gene	trt_sh
Consensus signature from shRNAs targeting the same gene	trt_sh.cgs
cDNA for overexpression of wild-type gene	trt_oe
cDNA for overexpression of mutated gene	trt_oe.mut
CRISPR for LoF	trt_xpr
Controls - vehicle for compound treatment (e.g., DMSO)	ctl_vehicle
Controls - vector for genetic perturbation (e.g., empty vector, GFP)	ctl_vector
Controls - consensus signature from shRNAs that share a common seed sequence	trt_sh.css
Controls - consensus signature of vehicles	ctl_vehicle.cns
Controls - consensus signature of vectors	ctl_vector.cns
Controls - consensus signature of many untreated wells	ctl_untrt.cns
Controls - Untreated cells	ctl_untrt
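
When working with the metadata, it can be handy to hold the table above as a lookup and separate treatments from controls. The sketch below is a minimal illustration using only the `pert_type` values listed in the table; the record layout, the `sample_id` field, and the function name are hypothetical and do not reflect the actual file or API schema.

```python
# pert_type designations and descriptions, copied from the table above.
PERT_TYPES = {
    "trt_cp": "Compound",
    "trt_lig": "Peptides and other biological agents",
    "trt_sh": "shRNA for loss of function of gene",
    "trt_sh.cgs": "Consensus signature from shRNAs targeting the same gene",
    "trt_oe": "cDNA for overexpression of wild-type gene",
    "trt_oe.mut": "cDNA for overexpression of mutated gene",
    "trt_xpr": "CRISPR for loss of function",
    "ctl_vehicle": "Vehicle for compound treatment",
    "ctl_vector": "Vector for genetic perturbation",
    "trt_sh.css": "Consensus signature from shRNAs sharing a seed sequence",
    "ctl_vehicle.cns": "Consensus signature of vehicles",
    "ctl_vector.cns": "Consensus signature of vectors",
    "ctl_untrt.cns": "Consensus signature of many untreated wells",
    "ctl_untrt": "Untreated cells",
}

# The table lists trt_sh.css among the controls despite its trt_ prefix.
CONTROL_TYPES = {t for t in PERT_TYPES if t.startswith("ctl_")} | {"trt_sh.css"}


def split_by_role(records):
    """Split metadata records into (treatments, controls) by pert_type."""
    treatments, controls = [], []
    for record in records:
        pert_type = record["pert_type"]
        if pert_type not in PERT_TYPES:
            raise ValueError(f"unknown pert_type: {pert_type!r}")
        if pert_type in CONTROL_TYPES:
            controls.append(record)
        else:
            treatments.append(record)
    return treatments, controls


# Hypothetical records; only the pert_type values come from the table.
records = [
    {"sample_id": "s1", "pert_type": "trt_cp"},
    {"sample_id": "s2", "pert_type": "ctl_vehicle"},
    {"sample_id": "s3", "pert_type": "trt_sh.cgs"},
    {"sample_id": "s4", "pert_type": "trt_sh.css"},
]
treatments, controls = split_by_role(records)
```

Checking membership in an explicit set, instead of relying on the `ctl_` prefix alone, keeps `trt_sh.css` grouped with the controls as the table describes.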

Last modified: Fri Dec 08 2017 10:23:22 GMT-0500 (EST)