Gene annotations for the Horvath37K_DNAMethylation array for 10 bat genomes


This submission contains gene annotations for an Illumina microarray (HorvathMammalMethylChip40) for 10 species of bats. The array design is available from the Gene Expression Omnibus (GEO) at NCBI as platform GPL28271. This array was used to generate DNA methylation data for nearly 700 known-age individuals representing 26 species of bats. The resulting data were then used to predict age and species lifespan, and to identify genomic regions that influence both of those traits.


We used sequences and annotations for ten bat genomes (see Table 1 below), which include six recently published reference assemblies, to locate each 50 bp probe on the array. The alignment was done using the QuasR package (Gaidatzis et al., 2015), assuming bisulfite conversion of the genomic DNA. For each species' genome sequence, QuasR creates an in silico bisulfite-treated version of the genome. The set of designed probe sequences, which includes degenerate base positions due to the bisulfite conversion, was expanded into a larger set of nucleotide sequences representing every possible combination of degenerate bases. We then ran QuasR (a wrapper for Bowtie2) with parameters -k 2 --strata --best -v 3 and bisulfite = "undir" to align the enlarged set of probe sequences to each prepared genome. From the resulting alignment files, we collected only alignments where the entire length of the probe matched the genome sequence perfectly (i.e., CIGAR string 50M and tag XM=0).
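The probe expansion and the perfect-match filter described above can be sketched in Python. This is only an illustration, not the R/QuasR workflow that was actually run: it assumes the degenerate positions are written with IUPAC nucleotide codes (for example Y for C/T), and the function names `expand_probe` and `is_perfect_match` are hypothetical. The filter simply checks a SAM-format record for the two conditions stated above, a CIGAR string of 50M and an XM tag equal to 0.

```python
from itertools import product

# IUPAC nucleotide codes. Assumption: the probe manifest writes its
# degenerate positions in this notation (e.g. Y = C or T, R = A or G).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_probe(seq):
    """Return every concrete sequence encoded by a degenerate probe."""
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


def is_perfect_match(sam_line, probe_len=50):
    """True if a SAM record has CIGAR '<probe_len>M' and the tag XM:i:0."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]      # column 6 of a SAM record is the CIGAR string
    tags = fields[11:]     # optional TAG:TYPE:VALUE fields start at column 12
    return cigar == f"{probe_len}M" and "XM:i:0" in tags


# A toy 4-base probe with two degenerate positions expands to 2 x 2 = 4 sequences.
variants = expand_probe("ACYR")
```

A probe with n two-fold degenerate positions expands to 2^n sequences, so the enlarged probe set can be considerably larger than the designed set.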

Following the alignment, the CpGs were annotated based on the distance to the closest transcriptional start site (TSS) using the ChIPseeker package (Yu et al., 2015). A GFF file of these annotations was created, sorted by scaffold and position, and compared with the location of each probe in the BAM files. We report only probes whose sequence variants all mapped to a single locus in a given genome. The genomic location of each CpG is categorized as intergenic, 3’ UTR, 5’ UTR, promoter region (minus 10 kb to plus 1000 bp from the nearest TSS), exon, or intron.
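As a rough illustration of the two rules in this paragraph, the sketch below keeps only probes whose alignments all fall on one locus and tests whether a CpG lies in the promoter window of minus 10 kb to plus 1000 bp around a TSS. It is written in Python rather than with ChIPseeker, and the function names, the tuple layout of the alignment records, and the treatment of a locus as a (scaffold, position) pair are all assumptions made for the example. The signed distance is taken relative to the gene's strand, so negative values are upstream of the TSS.

```python
from collections import defaultdict


def unique_locus_probes(alignments):
    """Keep probes whose alignments all fall on exactly one locus.

    alignments: iterable of (probe_id, scaffold, position) tuples, one per
    aligned probe variant (hypothetical record layout).
    Returns a dict mapping probe_id -> (scaffold, position).
    """
    loci = defaultdict(set)
    for probe_id, scaffold, position in alignments:
        loci[probe_id].add((scaffold, position))
    return {p: next(iter(s)) for p, s in loci.items() if len(s) == 1}


def signed_distance_to_tss(cpg_pos, tss_pos, gene_strand):
    """Distance from TSS to CpG; negative means upstream of the gene."""
    offset = cpg_pos - tss_pos
    return offset if gene_strand == "+" else -offset


def in_promoter(cpg_pos, tss_pos, gene_strand, upstream=10_000, downstream=1_000):
    """True if the CpG lies within -upstream..+downstream bp of the TSS."""
    d = signed_distance_to_tss(cpg_pos, tss_pos, gene_strand)
    return -upstream <= d <= downstream


# Toy data: cg1 has two variants on the same locus, cg2 hits two loci.
hits = [("cg1", "s1", 100), ("cg1", "s1", 100), ("cg2", "s1", 5), ("cg2", "s2", 9)]
kept = unique_locus_probes(hits)
```

In the toy data only cg1 is retained, because the variants of cg2 align to two different scaffolds.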

Gaidatzis, D., Lerch, A., Hahne, F., and Stadler, M.B. (2015). QuasR: quantification and annotation of short reads in R. Bioinformatics 31, 1130-1132.

Yu, G., Wang, L.G., and He, Q.Y. (2015). ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382-2383.

Table 1. Bat genome assemblies and sources used to identify the location of CpG sites, and the number of sites mapped per genome.

Species, Assembly and annotation, Source, CpGs mapped
Molossus molossus, HLmolMol2, MPI*, 33557
Myotis myotis, HLmyoMyo6, MPI*, 32687
Phyllostomus discolor, HLphyDis3, MPI*, 33615
Rhinolophus ferrumequinum, HLrhiFer5, MPI*, 34411
Pipistrellus kuhlii, HLpipKuh2, MPI*, 31074
Rousettus aegyptiacus, HLrouAeg4, MPI*, 34308
Desmodus rotundus, GCF_002940915.1 ASM294091v2, NCBI, 32930
Eptesicus fuscus, GCF_000308155.1 EptFus1.0, NCBI, 32218
Myotis lucifugus, GCF_000147115.1 Myoluc2.0, NCBI, 29810
Pteropus vampyrus, pteVam1.100, ENSEMBL, 24681

MPI* (downloaded from