Jen
Hiding secrets behind neural nets.
We went on a side-quest during the hackathon to test an idea.
Anthropic released a paper during the hackathon titled Best-of-N Jailbreaking. The paper found that about 10,000 augmented samples (prompts) were needed to achieve roughly an 80% jailbreak success rate across pretty much every single-model system. In other words, if you attempted a prompt 10,000 times, augmenting it each time, e.g. randomly capitalizing letters or swapping "e"s for "3"s, you could jailbreak the model and, among other things, get it to reveal its system prompt. Interestingly, the attack follows a power law: every additional attempt raises the cumulative success rate. (A minimal sketch of this loop appears at the end of this post.)

We wanted to see how this difficulty scales when we move from one model to an ensemble of k models. Specifically, we wanted to determine whether the number of attempts grows multiplicatively, like N^k, where N is the attempts required for one model and each added model multiplies the cost by another factor of N, or exponentially with a fixed base, like 2^k.

Hiding a Solana account key in the kth model's system prompt felt like a good way to incentivize people to attempt the jailbreak. And it did: more than 120 people tried to jailbreak the ensemble, all unsuccessfully. We found that the difficulty of getting the ensemble to reveal the kth model's system prompt increased multiplicatively (N^k), so a system can in fact be devised to hide secrets behind neural nets (see the back-of-envelope sketch below).

One practical use case: if you can convince an ensemble of k models that you are who you claim to be, the ensemble becomes a new form of identity-based auth, an alternative to traditional biometrics or passwords. It could also be used for social recovery.

The challenge is still running, and you can submit your attempt by following this link.
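To make the attack loop concrete, here is a minimal sketch of a Best-of-N-style attempt in Python. The augmentations (random capitalization, "e" to "3" swaps) follow the kinds described above, but the probabilities, the LEET_MAP table, and the attack callable are illustrative stand-ins of ours, not the paper's implementation:

```python
import random

# Illustrative character swaps of the kind the paper describes
# ("e" -> "3"); the exact table and probabilities are stand-ins.
LEET_MAP = {"e": "3", "a": "4", "o": "0"}

def augment(prompt, swap_prob=0.3, caps_prob=0.3):
    """Return one randomly augmented variant of the prompt."""
    chars = []
    for ch in prompt:
        if ch.lower() in LEET_MAP and random.random() < swap_prob:
            chars.append(LEET_MAP[ch.lower()])
        elif ch.isalpha() and random.random() < caps_prob:
            chars.append(ch.swapcase())
        else:
            chars.append(ch)
    return "".join(chars)

def best_of_n(prompt, attack, n=10_000):
    """Try up to n augmented variants. `attack` is a stand-in that
    sends a candidate to the target model and returns True on a
    successful jailbreak."""
    for attempt in range(1, n + 1):
        candidate = augment(prompt)
        if attack(candidate):
            return attempt, candidate  # attempts used, winning prompt
    return None
```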
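Why N^k is plausible: under the simplifying assumption that each attempt must fool all k models and that the models fail independently, a per-attempt success probability of roughly 1/N per model drops to (1/N)^k for the whole ensemble, so the expected number of attempts grows to about N^k. A back-of-envelope sketch (the N = 10,000 figure is the paper's single-model ballpark; the independence assumption is ours):

```python
def expected_attempts(n_single, k):
    """Expected attempts against a k-model ensemble, assuming each
    attempt must fool all k models independently and a single model
    falls with per-attempt probability 1 / n_single."""
    return n_single ** k  # (1 / p)^k with p = 1 / n_single

# With N = 10,000 per single model:
for k in (1, 2, 3):
    print(f"k={k}: ~{expected_attempts(10_000, k):.0e} attempts")
# k=1: ~1e+04   k=2: ~1e+08   k=3: ~1e+12
```

Under 2^k growth, by contrast, each extra model would merely double the work; the gap between 2^k and N^k is what makes the ensemble a plausible vault.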
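As a toy illustration of the auth idea (entirely our own sketch; neither the paper nor the challenge specifies a protocol): give each of the k models a share of the secret in its system prompt, and have it release the share only when its own identity check passes. Recovering the secret then requires convincing every model:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Verifier:
    """One model in the ensemble: an identity check plus the secret
    share hidden in its system prompt. `convinced` stands in for the
    model judging a conversation with the claimant."""
    convinced: Callable[[str], bool]
    secret_share: str

def recover_secret(claim: str, verifiers: List[Verifier]) -> Optional[str]:
    """Identity-based recovery: the key reassembles only if every
    model in the ensemble is convinced by the claimant."""
    shares = []
    for v in verifiers:
        if not v.convinced(claim):
            return None  # one skeptical model is enough to refuse
        shares.append(v.secret_share)
    return "".join(shares)
```

A real deployment would want proper secret sharing rather than concatenated shares, but the control flow captures the point: the secret sits behind k independent model judgments.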