Talk at Ivado's AI Safety and Alignment Research Cluster

Presenting "Hidden in plain sight: injecting LLM memories for evaluation and safeguarding" at Ivado’s AI Safety and Alignment Research Cluster.

In this talk, I focus on a memory injection paradigm and present two use cases. First, I introduce our new unlearning testbed, LACUNA, which injects synthetic personally identifiable information (PII) into known parameters so that we can evaluate whether unlearning methods actually target the parameters that store the memories they aim to unlearn. Using LACUNA, we show that state-of-the-art unlearning methods do not target the correct parameters, which raises the question of whether they truly unlearn PII or merely obfuscate it. Second, I present Tiered Language Models, in which memory injection is used to safeguard certain information and capabilities. This protected content can only be unlocked with the right key, enabling controlled access in open-weight models.

Posted on:
May 27, 2026
Categories:
talks