Citation

These data are useful for benchmarking peak detection algorithms, as was done in our Bioinformatics (2017) paper, "Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning".

Description of data set

We manually annotated several ChIP-seq data sets from the McGill Epigenomes Portal by visually inspecting them using UCSC genome browser software. When we saw peaks, we created some annotated regions:

When we saw regions without peaks, we created noPeaks annotated regions (a noPeaks region overlapped by one or more predicted peaks counts as a false positive).
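The noPeaks rule above can be sketched in a few lines of Python. This is only an illustration, not the code used in the paper: the function names, the (start, end) tuple representation, the half-open overlap test, and the coordinates are all assumptions made for the example.

```python
def overlaps(region, peak):
    """Return True if two (start, end) intervals share at least one position.

    Assumes half-open intervals on the same chromosome (an assumption of
    this sketch, not something stated by the data set).
    """
    region_start, region_end = region
    peak_start, peak_end = peak
    return peak_start < region_end and region_start < peak_end


def count_no_peaks_false_positives(no_peaks_regions, predicted_peaks):
    """Count noPeaks regions overlapped by one or more predicted peaks."""
    return sum(
        any(overlaps(region, peak) for peak in predicted_peaks)
        for region in no_peaks_regions
    )


# Made-up coordinates, for illustration only.
no_peaks_regions = [(100, 200), (500, 600)]
predicted_peaks = [(150, 180), (190, 250), (700, 800)]
false_positives = count_no_peaks_false_positives(no_peaks_regions, predicted_peaks)
```

Note that a noPeaks region counts once however many peaks overlap it, which matches the "1 or more" wording.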

Links to download data

We saved the 7 data sets of annotated regions to a database that can be viewed and downloaded. The original annotation files can be found under the annotations/ subdirectory.

To download the signal, annotated regions, and peak calls, use this R script or this list of data files (all genome positions are relative to hg19).

Each annotation data set is named like H3K4me3_PGP_immune:

Test error benchmarks to compute

We used 4-fold cross-validation to train and evaluate each algorithm. We assigned each chunk to a fold ID number between 1 and 4, listed in this CSV file. For example, the first line of this file is 1,H3K36me3_AM_immune/11, which indicates that chunk H3K36me3_AM_immune/11 was assigned to fold ID 1.
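As a minimal sketch, a fold file in this format could be read into a chunk-to-fold mapping as follows. It assumes the file has no header row and exactly two columns (fold ID, then chunk name), as the example line suggests. The function name is hypothetical, and the second sample line is invented for illustration.

```python
import csv
import io


def read_fold_assignments(lines):
    """Map each chunk name to its integer fold ID.

    `lines` is any iterable of text lines, such as an open file.
    """
    chunk_to_fold = {}
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        fold_id, chunk = row
        chunk_to_fold[chunk] = int(fold_id)
    return chunk_to_fold


# The first line is quoted from the text; the second is a made-up example.
sample = io.StringIO("1,H3K36me3_AM_immune/11\n3,H3K4me3_PGP_immune/2\n")
chunk_to_fold = read_fold_assignments(sample)
```

With a real file, the same function could be called on the handle returned by `open(path, newline="")`.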

To make predictions for each chunk, we train a model on the chunks in all other folds. For example, to make a prediction for chunk H3K36me3_AM_immune/11, we train peak calling parameters using labels from folds 2-4 (because that chunk was assigned to fold ID 1).
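The train/test protocol can be sketched as below: given a chunk to predict, the training set is every chunk whose fold ID differs from that chunk's fold ID. The function name and all chunk names other than H3K36me3_AM_immune/11 are hypothetical. The model training step is left out because it depends on the peak detector being benchmarked.

```python
def training_chunks(chunk_to_fold, test_chunk):
    """Return the sorted chunks whose fold differs from the test chunk's fold."""
    held_out_fold = chunk_to_fold[test_chunk]
    return sorted(
        chunk
        for chunk, fold in chunk_to_fold.items()
        if fold != held_out_fold
    )


# Only the first entry comes from the text; the rest are placeholders.
chunk_to_fold = {
    "H3K36me3_AM_immune/11": 1,
    "placeholder_set/1": 1,
    "placeholder_set/2": 2,
    "placeholder_set/3": 3,
    "placeholder_set/4": 4,
}
train = training_chunks(chunk_to_fold, "H3K36me3_AM_immune/11")
```

Here `train` contains only chunks from folds 2-4. The other fold 1 chunk is excluded too, so no label from the held-out fold influences the parameters used to predict on it.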

Some examples of benchmarks that can be computed using these data: