AI Control: Improving Safety Despite Intentional Subversion

No se pudo agregar al carrito

Solo puedes tener X títulos en el carrito para realizar el pago.

Add to Cart failed.

Por favor prueba de nuevo más tarde

Error al Agregar a Lista de Deseos.

Por favor prueba de nuevo más tarde

Error al eliminar de la lista de deseos.

Por favor prueba de nuevo más tarde

Error al añadir a tu biblioteca

Por favor intenta de nuevo

Error al seguir el podcast

Intenta nuevamente

Error al dejar de seguir el podcast

Intenta nuevamente

AI Control: Improving Safety Despite Intentional Subversion

Escúchala gratis

Ver detalles del espectáculo

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:

We summarize the paper;
We compare our methodology to what the one used in other safety papers.

The next post in this sequence (which we’ll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.

Here's the abstract of the paper:

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models [...]

---

First published:
December 13th, 2023

Source:
https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion

---

Narrated by TYPE III AUDIO.

Todavía no hay opiniones