Citation
These data are useful for benchmarking peak detection algorithms,
as was done in our Bioinformatics (2017) paper,
"Optimizing ChIP-seq peak detectors using visual labels and
supervised machine learning."
Description of data set
We manually annotated several ChIP-seq data sets from the
McGill Epigenomes Portal
by visually inspecting them in the UCSC genome browser.
When we saw peaks, we created annotated regions of the following types:
- peakStart means there should be exactly 1 peak start in the region
(no predicted peak starts is a false negative,
two or more is a false positive).
- peakEnd means there should be exactly 1 peak end in the region
(no predicted peak ends is a false negative,
two or more is a false positive).
- peaks means there should be at least one peak overlapping the region
(no predicted peaks is a false negative).
When we saw regions without peaks, we created noPeaks
annotated regions (one or more overlapping predicted peaks is a false positive).
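For illustration only (this is not the code used in the paper), the
error rules above could be computed in R roughly as follows, assuming
a data frame of annotated regions with columns chrom, start, end,
annotation and a data frame of predicted peaks with columns chrom,
peakStart, peakEnd (both column layouts are assumptions):

    ## Sketch of the label error rules above; column names are assumptions.
    labelError <- function(regions, peaks){
      status <- character(nrow(regions))
      for(i in seq_len(nrow(regions))){
        r <- regions[i, ]
        ## Predicted peaks overlapping this annotated region.
        overlapping <- peaks[peaks$chrom == r$chrom &
                             peaks$peakEnd > r$start &
                             peaks$peakStart < r$end, ]
        n.starts <- sum(overlapping$peakStart >= r$start &
                        overlapping$peakStart <  r$end)
        n.ends <- sum(overlapping$peakEnd >  r$start &
                      overlapping$peakEnd <= r$end)
        n.overlap <- nrow(overlapping)
        status[i] <- switch(
          as.character(r$annotation),
          peakStart = if(n.starts == 0) "false negative"
            else if(n.starts > 1) "false positive" else "correct",
          peakEnd = if(n.ends == 0) "false negative"
            else if(n.ends > 1) "false positive" else "correct",
          peaks = if(n.overlap == 0) "false negative" else "correct",
          noPeaks = if(n.overlap > 0) "false positive" else "correct")
      }
      data.frame(regions, status)
    }

Total false positives and false negatives for a peak caller can then
be obtained by summing the status column over all annotated regions
in the test chunks.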
Links to download data
We saved the 7 data sets of annotated regions to
a database that can be viewed and downloaded.
The original annotation files can be found under
the annotations/ subdirectory.
To download the signal, annotated regions, and peak calls, use
this R script or this list of data files
(all genome positions are relative to hg19).
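As a minimal sketch of how the files could be fetched in R (the
local file name data-file-list.txt below is an assumption, standing
in for a saved copy of the list of data files linked above):

    ## Hypothetical download loop over the list of data file URLs.
    url.vec <- readLines("data-file-list.txt")  # assumed local copy of the list
    dir.create("data", showWarnings = FALSE)
    for(u in url.vec){
      dest <- file.path("data", basename(u))
      if(!file.exists(dest)){
        download.file(u, dest)  # skip files that were already downloaded
      }
    }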
Each annotation data set is named like H3K4me3_PGP_immune:
- H3K4me3 is the histone mark type,
- PGP are the initials of the person who created the annotated regions,
- immune means cell types bcell, tcell and monocyte
(other means all other cell types).
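For example, the three components of a data set name can be
recovered in R (a trivial sketch, assuming every name has the
mark_annotator_cellTypes form described above):

    ## Split a data set name into mark, annotator initials, cell type group.
    set.name <- "H3K4me3_PGP_immune"
    fields <- strsplit(set.name, "_")[[1]]
    info <- list(mark = fields[1], annotator = fields[2], cell.types = fields[3])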
Test error benchmarks to compute
We used 4-fold cross-validation to train and evaluate each
algorithm. We assigned each chunk to a fold ID number between 1 and
4, listed in this csv file. For example, the first line of this file
is 1,H3K36me3_AM_immune/11, which indicates that chunk
H3K36me3_AM_immune/11 was assigned to fold ID 1.
To make predictions for each chunk, we train a model on all other
fold IDs. For example, to make a prediction for chunk
H3K36me3_AM_immune/11, we train peak calling parameters using labels
from folds 2-4 (because that chunk was assigned to fold ID 1).
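A minimal sketch of this split in R, assuming the fold assignment
csv has two un-named columns (fold ID, chunk name) as in the example
line above; the local file name folds.csv is an assumption:

    ## Read fold assignments and pick training chunks for one test chunk.
    folds <- read.csv("folds.csv", header = FALSE,
                      col.names = c("fold", "chunk"))
    test.chunk <- "H3K36me3_AM_immune/11"
    test.fold <- folds$fold[folds$chunk == test.chunk]   # fold ID 1
    train.chunks <- folds$chunk[folds$fold != test.fold] # chunks in folds 2-4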
Some examples of benchmarks that can be computed using these data:
- Test error for the same data set: which peak detector
gives the lowest test error if you train on some chunks of one
data set, and test on some other chunks of the same data set? This
is discussed in Section 3.2, "Trained models are more consistent
with test labels than default models." The peaks that each model
predicted in this 4-fold cross-validation experiment are listed
in this csv file.
- Test error for different cell types:
which peak detector gives the lowest test error if you train
on one data set (e.g. H3K4me3_TDH_immune, folds 1-3),
and test on another data set with the same mark and annotator,
but different cell types? (e.g. H3K4me3_TDH_other, fold 4;
a sketch of this split is shown after this list)
This is discussed in Section 3.6,
"Trained models predict accurate peaks in samples of the same experiment."
- Test error for different annotators:
which peak detector gives the lowest test error if you train
on one data set (e.g. H3K4me3_TDH_immune, folds 1-3),
and test on another data set of the same cell types and mark,
but a different annotator? (e.g. H3K4me3_PGP_immune, fold 4)
This is discussed in Section 3.5,
"Labels from different people are highly consistent."