Epigenetic Pacemaker Algorithm

EPM Description

Given a set of methylation sites indexed by $i$ and individuals indexed by $j$, a single methylation value can be described as $\hat{m}_{ij} = m^0_i + r_is_j + \epsilon_{ij}$, where $\hat{m}_{ij}$ is the observed methylation value, $m^0_i$ is the initial methylation value, $r_i$ is the rate of change, $s_j$ is the epigenetic state, and $\epsilon_{ij}$ is a normally distributed error term. Given an input matrix $\hat{M} = [\hat{m}_{ij}]$, the goal of the EPM is to find the values of $m^0_i$, $r_i$, and $s_j$ that minimize the error between the predicted and observed methylation values across a system of methylation sites. Under the EPM, $m^0_i$ and $r_i$ are characteristic of a site and are the same for all individuals, while each individual's $s_j$ is shared by all sites within the system.
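To make the indexing concrete, the model can be written out as a small simulation. This is only an illustrative sketch in plain Python: the function name `simulate_methylation`, the `noise_sd` and `seed` parameters, and all numeric values are hypothetical and are not part of any EPM package.

```python
import random

def simulate_methylation(m0, rates, states, noise_sd=0.0, seed=0):
    """Build a sites x individuals matrix with m_ij = m0_i + r_i * s_j + noise.

    m0 and rates hold one entry per site; states holds one entry per individual.
    The noise term is drawn from a normal distribution with standard
    deviation noise_sd (hypothetical parameter, 0.0 means no noise).
    """
    rng = random.Random(seed)
    return [
        [m0_i + r_i * s_j + rng.gauss(0.0, noise_sd) for s_j in states]
        for m0_i, r_i in zip(m0, rates)
    ]

# Hypothetical example: 2 sites (rows) and 3 individuals (columns), no noise.
M = simulate_methylation(
    m0=[0.2, 0.8],
    rates=[0.01, -0.005],
    states=[10.0, 20.0, 30.0],
)
```

Each row of `M` corresponds to one site and each column to one individual, matching the $\hat{m}_{ij}$ notation above.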

The EPM optimization is accomplished through an implementation of a fast conditional expectation maximization algorithm that maximizes the model likelihood by minimizing the residual sum of squares error. When fitting the EPM, each methylation site is assigned an independent rate of change and starting methylation value, while each individual is assigned an epigenetic state. The initial epigenetic state is provided by the user and should represent a best guess. The epigenetic state is then updated at each iteration of the EPM to minimize the error across the observed epigenetic landscape. Because $s_j$ is updated while fitting the EPM, the condition of linearity between the methylation values and the trait of interest is relaxed.
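As a short worked step, the state update follows directly from the least-squares objective, under the model as stated above. With the site parameters $m^0_i$ and $r_i$ held fixed, the residual sum of squares for individual $j$ is

$$\mathrm{RSS}_j(s_j) = \sum_i \left(\hat{m}_{ij} - m^0_i - r_i s_j\right)^2$$

Setting the derivative with respect to $s_j$ to zero gives

$$\frac{\partial\, \mathrm{RSS}_j}{\partial s_j} = -2 \sum_i r_i\left(\hat{m}_{ij} - m^0_i - r_i s_j\right) = 0 \quad\Longrightarrow\quad s_j = \frac{\sum_i r_i\left(\hat{m}_{ij} - m^0_i\right)}{\sum_i r_i^2}$$

so each individual's state has a closed-form update that depends only on the current site parameters.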

EPM Implementation

The EPM algorithm is implemented as follows:

1. Fit a model for each site $i$ using the user-provided state predictions to obtain $r_i$ and $m^0_i$.
2. Update $s_j$ to minimize the squared error $\sum_{i \leq n} \epsilon_{ij}^2$:
   - $s_j = \frac{\sum_{i \leq n} r_i(\hat{m}_{ij} - m^0_i)}{\sum_{i \leq n} r^2_i}$
3. Refit the site models using the updated $s_j$.
4. Repeat steps 2 and 3 until the model improvement is $\leq$ the specified threshold or the maximum number of iterations is reached.
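The steps above can be sketched in plain Python. This is a minimal illustration, not the reference implementation. It assumes each site model is an ordinary least-squares line and that "model improvement" means the decrease in the residual sum of squares. The names `fit_sites`, `update_states`, `rss`, `fit_epm`, `tol`, and `max_iter` are hypothetical, as are the example values.

```python
def fit_sites(M, s):
    """Fit one least-squares line per site; return (m0, r) lists.

    Assumes the states in s are not all identical (non-zero variance).
    """
    n = len(s)
    s_mean = sum(s) / n
    s_var = sum((x - s_mean) ** 2 for x in s)
    m0, r = [], []
    for row in M:
        row_mean = sum(row) / n
        slope = sum((x - s_mean) * (y - row_mean) for x, y in zip(s, row)) / s_var
        r.append(slope)
        m0.append(row_mean - slope * s_mean)
    return m0, r

def update_states(M, m0, r):
    """Closed-form update: s_j = sum_i r_i (m_ij - m0_i) / sum_i r_i^2."""
    denom = sum(ri ** 2 for ri in r)  # assumes at least one non-zero rate
    return [
        sum(ri * (row[j] - m0i) for row, m0i, ri in zip(M, m0, r)) / denom
        for j in range(len(M[0]))
    ]

def rss(M, m0, r, s):
    """Residual sum of squares over all sites and individuals."""
    return sum(
        (row[j] - m0i - ri * s[j]) ** 2
        for row, m0i, ri in zip(M, m0, r)
        for j in range(len(s))
    )

def fit_epm(M, s_init, tol=1e-8, max_iter=100):
    """Alternate state updates and site refits until the RSS stops improving."""
    s = list(s_init)
    m0, r = fit_sites(M, s)                # step 1: fit with the initial states
    error = rss(M, m0, r, s)
    for _ in range(max_iter):
        s = update_states(M, m0, r)        # step 2: update the states
        m0, r = fit_sites(M, s)            # step 3: refit the site models
        new_error = rss(M, m0, r, s)
        improvement = error - new_error
        error = new_error
        if improvement <= tol:             # step 4: stop on small improvement
            break
    return m0, r, s, error

# Hypothetical noise-free data: 3 sites (rows), 4 individuals (columns).
true_m0 = [0.2, 0.8, 0.5]
true_r = [0.01, -0.005, 0.002]
true_s = [10.0, 20.0, 30.0, 45.0]
M = [[a + b * s for s in true_s] for a, b in zip(true_m0, true_r)]

s_guess = [12.0, 18.0, 33.0, 40.0]         # imperfect user-provided states
initial_error = rss(M, *fit_sites(M, s_guess), s_guess)
m0_fit, r_fit, s_fit, final_error = fit_epm(M, s_guess)
```

Neither the state update nor the site refit can increase the residual sum of squares, so the error is non-increasing across iterations. The fitted states are only identified up to a linear rescaling: shifting or scaling every $s_j$ can be absorbed by $m^0_i$ and $r_i$. For that reason `s_fit` should be compared with the trait of interest by correlation or fit quality rather than by absolute value.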