The goal of this contest was to develop algorithms capable of predicting a drug’s mechanism of action from its pattern of cell viability and gene expression across 100 cancer cell lines.
The training dataset consisted of readouts for over 5,000 different drugs from the Drug Repurposing Library that were profiled at different time points and doses. Submissions were scored for accuracy on a pre-specified metric (the logarithmic loss function, also known as cross entropy, averaged over all mechanisms).
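The scoring metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the competition's official scorer: the function name, the clipping constant `eps`, and the toy data are assumptions made for this example. It computes the binary log loss for each mechanism (column) and then averages across mechanisms.

```python
import math

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over all mechanisms (columns).

    y_true: rows of 0/1 labels, one column per mechanism.
    y_pred: rows of predicted probabilities, same shape.
    eps:    clipping constant (an assumption) that keeps log() finite.
    """
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    total = 0.0
    for j in range(n_cols):
        col_loss = 0.0
        for i in range(n_rows):
            # Clip probabilities away from exactly 0 and 1.
            p = min(max(y_pred[i][j], eps), 1 - eps)
            y = y_true[i][j]
            col_loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
        total += col_loss / n_rows  # mean loss for this mechanism
    return total / n_cols           # mean over mechanisms

# Toy example: two samples, two hypothetical mechanisms.
y_true = [[1, 0], [0, 1]]
y_pred = [[0.9, 0.1], [0.2, 0.8]]
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))
```

Lower values are better; a confident wrong prediction is penalized far more heavily than a hesitant one.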
More than 4,000 teams made code submissions for a total of 88,000 solutions evaluated with our metric. The top four competitors won cash prizes for a total of $30,000.
The top algorithms used ensembles of several neural network architectures, such as convolutional neural networks. They predicted between 60 and 100 targets with higher accuracy than the benchmark (a random forest algorithm), thus outperforming our previous effort on the same data.
The challenge was hosted by Kaggle. See the competition’s website for more details.
This challenge has concluded. We are grateful to all of the participants who developed a variety of intriguing solutions to address the deconvolution problem.
The results of the challenge are summarized in this preprint manuscript and the contest data are available in the clue data library.
The Docker container used for converting the deconvoluted data to differential expression values can be downloaded from Docker Hub.
Aid biomedical discovery by predicting the expression of two genes from a composite measurement and compete for $23,000 in prizes
A key factor in enabling the scale-up of the CMap dataset is the practice of measuring the expression of two genes using the same physical material in the L1000 assay, thus dramatically reducing the costs and increasing the throughput of data generation. In the third challenge in the CMap series, we seek to improve the speed and accuracy of ‘d-peak’, the algorithm that deconvolutes the composite expression signal into two values and associates them with the appropriate genes.
For the purpose of this challenge, the core CMap technology can be described as follows. In a single experiment, CMap makes 488 measurements. Each measurement produces an intensity histogram (a vector of integers), which characterizes the expression of two distinct genes in the sample (for a total of 488 x 2 = 976 genes). In the ideal case, each histogram consists of two peaks (see Figure above), each corresponding to a single gene. The genes are mixed in a 2:1 ratio, so the areas under the peaks also have a 2:1 ratio, which allows us to associate each peak with a specific gene. The median position of each peak corresponds to the gene’s expression level, and that is what you need to determine in this challenge.
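To make the deconvolution task concrete, here is a minimal Python sketch for the ideal two-peak case. It is not the d-peak algorithm, whose internals are not described here. It assumes the measurement is given as a list of individual intensity values, splits them with a simple one-dimensional 2-means clustering, and uses the 2:1 mixing ratio to decide which peak belongs to which gene. The function names and the toy data are hypothetical.

```python
from statistics import median

def split_two_peaks(intensities, max_iter=100):
    """Split intensities into a low and a high cluster (1-D 2-means)."""
    values = list(intensities)
    lo_center, hi_center = min(values), max(values)
    low, high = values, []
    for _ in range(max_iter):
        # Assign each value to the nearer center (ties go to the low peak).
        low = [v for v in values if abs(v - lo_center) <= abs(v - hi_center)]
        high = [v for v in values if abs(v - lo_center) > abs(v - hi_center)]
        if not high:  # all values identical: only one peak present
            break
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if new_lo == lo_center and new_hi == hi_center:
            break     # converged
        lo_center, hi_center = new_lo, new_hi
    return low, high

def deconvolute(intensities):
    """Return (level of the 2x gene, level of the 1x gene).

    The larger cluster is attributed to the gene mixed at the higher
    proportion; each level is the median of its cluster.
    """
    low, high = split_two_peaks(intensities)
    if not high:
        level = median(low)
        return level, level
    major, minor = (low, high) if len(low) >= len(high) else (high, low)
    return median(major), median(minor)

# Toy measurement: six values near 100 (2x gene), three near 200 (1x gene).
beads = [200, 98, 100, 199, 101, 99, 201, 100, 102]
print(deconvolute(beads))
```

Real histograms are noisier: peaks can overlap or merge, and the 2:1 ratio holds only approximately, which is why a more robust method is needed in practice.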
The goal of this contest was to improve the speed of the CMap query algorithm.
CMap enables the discovery of functional connections between drugs, genes and diseases through the generation and analysis of gene expression signatures, where each signature represents the transcriptional response of human cells to chemical or genetic perturbation. To identify connections, a researcher poses a biological question in the form of a “query” consisting of a list of genes of interest. The query algorithm then searches the CMap database to identify signatures that are most similar to the user's input. By using this algorithm and the Connectivity Map dataset, researchers have uncovered novel biological relationships and generated hypotheses for the development of new therapeutics.
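The idea of querying a signature database can be illustrated with a deliberately simplified Python sketch. It is not the CMap query algorithm: the similarity measure (mean score of the query's up-regulated genes minus mean score of its down-regulated genes), the split of the query into up and down sets, and all names and values below are assumptions chosen for illustration.

```python
def score_signature(signature, up_genes, down_genes):
    """Toy similarity: mean of the up genes minus mean of the down genes.

    signature: dict mapping gene name -> differential expression score.
    Genes missing from the signature are ignored.
    """
    up = [signature[g] for g in up_genes if g in signature]
    down = [signature[g] for g in down_genes if g in signature]
    up_mean = sum(up) / len(up) if up else 0.0
    down_mean = sum(down) / len(down) if down else 0.0
    return up_mean - down_mean

def run_query(signatures, up_genes, down_genes, top_n=5):
    """Score every signature and return the top_n most similar ones."""
    scored = [
        (name, score_signature(sig, up_genes, down_genes))
        for name, sig in signatures.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical mini-database of three signatures over three genes.
signatures = {
    "drug_A": {"G1": 2.0, "G2": 1.5, "G3": -1.0},
    "drug_B": {"G1": -1.0, "G2": -2.0, "G3": 1.5},
    "drug_C": {"G1": 0.1, "G2": 0.0, "G3": 0.2},
}
hits = run_query(signatures, up_genes={"G1", "G2"}, down_genes={"G3"})
for name, score in hits:
    print(name, round(score, 2))
```

Even in this toy form, the cost grows with the number of signatures times the query size, which suggests why query speed matters once the database reaches hundreds of thousands of signatures.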
The CMap L1000 assay quantifies the responses of 10,174 genes to an experimental perturbation. Due to recent technological improvements, it has become possible to massively scale up data generation. As a result, the CMap matrix has grown to 476,251 signatures and this number is expected to continue to increase rapidly. The goal of the contest is to improve the speed of the query algorithm, as this has great practical importance for researchers using CMap.
In addition to helping advance biomedical research, the winners were awarded cash prizes, with a total purse of $20,000. The first place submission showed as much as a 100x improvement over the current query algorithm. An implementation of the winning submission is now available at clue.io/query as a replacement for the current CMap Query algorithm. Improvements to overall Query result time on CLUE, which includes some additional analysis such as GUTC, range from 2- to 4-fold, with greater improvement seen in large batch queries. Overall, the contest removed query as a bottleneck in Query app compute time and enabled batch queries in the Query app on CLUE, a feature highly requested by users.
The goal of this contest was to maximize the accuracy of the inferred gene expression values used by the Connectivity Map, while minimizing the number of genes measured. Results of this contest expanded research horizons for computational biologists and scientists who seek to find drugs that cure diseases.
CMap utilizes a novel, high-throughput gene expression profiling technology to generate gene expression profiles at scale. The crux of this approach is that instead of measuring all ~20,000 genes in the human genome, CMap measures a select subset of approximately 1,000 genes and uses these “landmark” gene measurements to computationally infer a large portion of the remainder. The current algorithm is effective but imperfect, and improving the imputation methods will have an immediate impact on the quality of data and the biologically meaningful connections that can be discovered. With this in mind, we have designed our first contest to stimulate the exploration of new and improved inference methods.
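As a rough illustration of inferring a gene from landmark measurements, the Python sketch below fits an ordinary least-squares linear model that predicts one target gene from two landmark genes. The linear form, the function names, and the toy data are assumptions for this example; the text above does not specify how the current inference algorithm works.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_linear(landmarks, target):
    """Least-squares weights (intercept first) from X^T X w = X^T y."""
    X = [[1.0] + list(row) for row in landmarks]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, target)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, landmark_row):
    """Infer the target gene's expression from one sample's landmarks."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], landmark_row))

# Toy training data: target = 1 + 2 * L1 - 1 * L2 (hypothetical values).
landmarks = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
target = [1, 3, 0, 2, 4]
weights = fit_linear(landmarks, target)
print(round(predict(weights, [3, 2]), 6))
```

In a realistic setting there would be roughly 1,000 landmark inputs and one such model per inferred gene, so a numerical library would replace the hand-written solver.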
Several of the top contestants achieved a notable improvement over the current inference model. Though contestants achieved improved accuracy relative to the ground truth RNAseq dataset, this improved accuracy did not reliably translate to improvements in downstream analyses such as connectivity. One potential reason for the lack of improved connectivity is that the comparison to RNAseq, using normalized profiles, may be too far removed from connectivity, which is performed using replicate-collapsed differential expression signatures. Further exploration of potential applications of the improved inference is ongoing. An implementation of the contestant solutions in R is available here in the cmapR library. Click here for more details on the results, or click here to see the final leaderboard on the TopCoder site.
The Connectivity Map (CMap) combines a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules with pattern-matching algorithms. When these elements are brought together, the results enable the discovery of functional connections between drugs, genes and diseases through the transitory feature of common gene-expression changes. For more on CMap, click here.