🤖 Inference-Time Scaling for Generalist Reward Modeling
This paper explores enhancing reward modeling (RM) for large language models (LLMs) by improving inference-time scalability. The authors introduce Self-Principled Critique Tuning (SPCT), a learning method that trains RMs, via online reinforcement learning, to generate their own guiding principles and accurate critiques. Their approach, embodied in the DeepSeek-GRM models, uses pointwise generative reward modeling (GRM) for greater flexibility across input types. By combining parallel sampling at inference time with a meta RM that refines the reward voting process, they demonstrate significant gains in the quality and scalability of their GRMs across a range of benchmarks. Notably, inference-time scaling with their method performs competitively with, or better than, simply increasing model size.
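To make the inference-time scaling mechanism concrete, here is a minimal Python sketch of the sample-then-vote idea described above: draw several independent principle-plus-critique samples from a generative RM, use a meta RM to keep only the most reliable ones, and sum the pointwise scores as votes. The functions `generate_critique` and `meta_rm_score` are hypothetical stand-ins (stubbed with random values here), not the paper's actual API, and the specific filtering scheme is an assumption for illustration.

```python
from collections import defaultdict
import random

def generate_critique(query: str, responses: list[str], seed: int) -> dict[int, int]:
    """Stand-in for one GRM sample: a pointwise score (e.g., 1-10) per candidate
    response, in the paper derived from self-generated principles and critiques."""
    rng = random.Random(seed)
    return {i: rng.randint(1, 10) for i in range(len(responses))}

def meta_rm_score(query: str, critique_scores: dict[int, int]) -> float:
    """Stand-in for the meta RM: estimates how trustworthy one sampled critique is."""
    return random.random()

def scaled_reward(query: str, responses: list[str], k: int = 8, top: int = 4) -> dict[int, float]:
    # Parallel sampling: draw k independent critiques of the same responses.
    samples = [generate_critique(query, responses, seed=s) for s in range(k)]
    # Meta RM filtering: keep only the top-rated critiques before voting.
    ranked = sorted(samples, key=lambda s: meta_rm_score(query, s), reverse=True)[:top]
    # Voting: accumulate pointwise scores across the retained samples.
    totals: defaultdict[int, float] = defaultdict(float)
    for sample in ranked:
        for idx, score in sample.items():
            totals[idx] += score
    return dict(totals)

rewards = scaled_reward("Which response is more helpful?", ["response A", "response B"])
print(max(rewards, key=rewards.get))  # index of the highest-voted response
```

Increasing `k` is what "inference-time scaling" means here: more sampled critiques yield a finer-grained aggregate reward, and the meta RM keeps low-quality samples from diluting the vote.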