In the original Elastic Weight Consolidation paper, the authors show that given a sequence of two tasks $$t_1,t_2$$, one can estimate the posterior probability of the parameters $$\theta$$ over the combined datasets even with sequential access to the data. In other words, even though the learner never sees data from $$t_1$$ and $$t_2$$ simultaneously (only sequentially), one can still estimate $$\log p(\theta \mid \mathcal{D}_1, \mathcal{D}_2)$$.

Specifically, they show the following result :

$$${ \log p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) = \log p(\theta \mid \mathcal{D}_1) + \log p(\mathcal{D}_2 \mid \theta) - \log p(\mathcal{D}_2). }$$$

Looking at the three terms on the R.H.S, the last two terms only depend on the second dataset, and the dependance on the first task is only through $$\log p(\theta \mid \mathcal{D}_1)$$, the posterior probability of the parameters over the first dataset. In other words this posterior distribution is sufficient to encapsulate all the information of the first task. So this is a cool result, i.e. continual learning reduces to learning this posterior.

But how do we obtain this result ? Let’s go through the derivation.

\begin{align} \log p(\theta | \mathcal{D}_1, \mathcal{D}_2) &= \log p(\mathcal{D}_2, \mathcal{D}_1, \theta) - \log p(\mathcal{D}_1, \mathcal{D}_2) \\ &= \log p(\mathcal{D}_2 | \theta, \mathcal{D}_1) + \log p(\theta | \mathcal{D}_1) + \log p(\mathcal{D}_1) - \log p(\mathcal{D}_1, \mathcal{D}_2) \\ &= \log p(\mathcal{D}_2 | \theta ) + \log p(\theta | \mathcal{D}_1) - \log p(\mathcal{D}_2|\mathcal{D}_1)), \end{align}

In the first row, we expand the definition of a conditional probability. In the second row, we do a similar step and rewrite the first term $$p(\mathcal{D}_2, \mathcal{D}_1, \theta)$$ as a product of conditional probabilities. But the magic really happens in the last row, where we leverage the conditional independence of $$D_1$$ and $$D_2$$ given $$\theta$$ to remove the tricky dependance over information from the first task.

That’s it! Now we have an objective which can be optimized without simultaneous access to the data.

#### Comment 1

As you can see, equations (1) and (4) don’t perfectly line up; the EWC paper makes the additional assumption that $$\mathcal{D}_1$$ and $$\mathcal{D}_2$$ are independent (rather than conditionally on $$\theta$$), i.e. $$p(\mathcal{D}_2 | \mathcal{D}_1) = p(\mathcal{D}_2)$$. My understanding is that you do not need to make this assumption, as this term does not depend on $$\theta$$ anyways. If you have any thoughts on why they do this, please let me know :)

#### Comment 2

While prior based methods like EWC are theoretically sound, in practice they are quite bad. I strongly believe this is because the assumptions we make so that computing $$p(\theta | \mathcal{D}_1)$$ is tractable are wayy to restrictive. Maybe more flexible posterior inference is a good research direction. More on that in the next blog post!