Citation
These data are useful for benchmarking peak detection algorithms,
as was done in our Bioinformatics (2017) paper,
"Optimizing ChIP-seq peak detectors using visual labels and
supervised machine learning."
Description of data set
We manually annotated several ChIP-seq data sets from the
McGill Epigenomes Portal
by visually inspecting them in the UCSC genome browser.
When we saw peaks, we created annotated regions of the following types:
- peakStart means there should be exactly 1 peak start in the region
(no predicted peak starts is a false negative,
two or more is a false positive).
- peakEnd means there should be exactly 1 peak end in the region
(no predicted peak ends is a false negative,
two or more is a false positive).
- peaks means there should be at least one peak overlapping the region
(no predicted peaks is a false negative).
When we saw regions without peaks, we created noPeaks
annotated regions (one or more overlapping predicted peaks is a false positive).
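For illustration only (this is not the code used in the paper), the
error rules above could be computed in R roughly as follows, assuming
a data frame of annotated regions with columns chrom, start, end,
annotation and a data frame of predicted peaks with columns chrom,
peakStart, peakEnd (both column layouts are assumptions):

    ## Sketch of the label error rules above; column names are assumptions.
    labelError <- function(regions, peaks){
      status <- character(nrow(regions))
      for(i in seq_len(nrow(regions))){
        r <- regions[i, ]
        ## Predicted peaks overlapping this annotated region.
        overlapping <- peaks[peaks$chrom == r$chrom &
                             peaks$peakEnd > r$start &
                             peaks$peakStart < r$end, ]
        n.starts <- sum(overlapping$peakStart >= r$start &
                        overlapping$peakStart <  r$end)
        n.ends <- sum(overlapping$peakEnd >  r$start &
                      overlapping$peakEnd <= r$end)
        n.overlap <- nrow(overlapping)
        status[i] <- switch(
          as.character(r$annotation),
          peakStart = if(n.starts == 0) "false negative"
            else if(n.starts > 1) "false positive" else "correct",
          peakEnd = if(n.ends == 0) "false negative"
            else if(n.ends > 1) "false positive" else "correct",
          peaks = if(n.overlap == 0) "false negative" else "correct",
          noPeaks = if(n.overlap > 0) "false positive" else "correct")
      }
      data.frame(regions, status)
    }

Total false positives and false negatives for a peak caller can then
be obtained by summing the status column over all annotated regions
in the test chunks.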
Links to download data
We saved the 7 data sets of annotated regions to
a database that can be viewed and downloaded.
The original annotation files can be found under
the annotations/ subdirectory.
To download the signal, annotated regions, and peak calls, use
this R script or this list of data files
(all genome positions are relative to hg19).
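As a minimal sketch of how the files could be fetched in R (the
local file name data-file-list.txt below is an assumption, standing
in for a saved copy of the list of data files linked above):

    ## Hypothetical download loop over the list of data file URLs.
    url.vec <- readLines("data-file-list.txt")  # assumed local copy of the list
    dir.create("data", showWarnings = FALSE)
    for(u in url.vec){
      dest <- file.path("data", basename(u))
      if(!file.exists(dest)){
        download.file(u, dest)  # skip files that were already downloaded
      }
    }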
Each annotation data set is named like H3K4me3_PGP_immune:
- H3K4me3 is the histone mark type,
- PGP are the initials of the person who created the annotated regions,
- immune means cell types bcell, tcell and monocyte
(other means all other cell types).
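For example, the three components of a data set name can be
recovered in R (a trivial sketch, assuming every name has the
mark_annotator_cellTypes form described above):

    ## Split a data set name into mark, annotator initials, cell type group.
    set.name <- "H3K4me3_PGP_immune"
    fields <- strsplit(set.name, "_")[[1]]
    info <- list(mark = fields[1], annotator = fields[2], cell.types = fields[3])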
Test error benchmarks to compute
We used 4-fold cross-validation to train and evaluate each
algorithm. We assigned each chunk to a fold ID number between 1 and
4, listed in this csv file. For example, the first line of this file
is 1,H3K36me3_AM_immune/11, which indicates that chunk
H3K36me3_AM_immune/11 was assigned to fold ID 1.
To make predictions for each chunk, we train a model on all other
fold IDs. For example, to make a prediction for chunk
H3K36me3_AM_immune/11, we train peak calling parameters using labels
from folds 2-4 (because that chunk was assigned to fold ID 1).
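A minimal sketch of this split in R, assuming the fold assignment
csv has two un-named columns (fold ID, chunk name) as in the example
line above; the local file name folds.csv is an assumption:

    ## Read fold assignments and pick training chunks for one test chunk.
    folds <- read.csv("folds.csv", header = FALSE,
                      col.names = c("fold", "chunk"))
    test.chunk <- "H3K36me3_AM_immune/11"
    test.fold <- folds$fold[folds$chunk == test.chunk]   # fold ID 1
    train.chunks <- folds$chunk[folds$fold != test.fold] # chunks in folds 2-4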
Some examples of benchmarks that can be computed using these data:
- Test error for the same data set: which peak detector
gives the lowest test error if you train on some chunks of one
data set, and test on some other chunks of the same data set? This
is discussed in Section 3.2, "Trained models are more consistent
with test labels than default models." The peaks that each model
predicted in this 4-fold cross-validation experiment are listed
in this csv file.
- Test error for different cell types:
which peak detector gives the lowest test error if you train
on one data set (e.g. H3K4me3_TDH_immune, folds 1-3),
and test on another data set with the same mark and annotator,
but different cell types? (e.g. H3K4me3_TDH_other, fold 4;
a sketch of this split is shown after this list)
This is discussed in Section 3.6,
"Trained models predict accurate peaks in samples of the same experiment."
- Test error for different annotators:
which peak detector gives the lowest test error if you train
on one data set (e.g. H3K4me3_TDH_immune, folds 1-3),
and test on another data set of the same cell types and mark,
but a different annotator? (e.g. H3K4me3_PGP_immune, fold 4)
This is discussed in Section 3.5,
"Labels from different people are highly consistent."