Strawberry's True Face Finally Revealed: A Comprehensive Analysis of OpenAI o1





GPQA Diamond: a difficult intelligence benchmark that tests expert-level knowledge in chemistry, physics, and biology.






CoT (Chain of Thought)



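After the user submits a question, o1 differs from earlier GPT-series models in that it additionally uses "reasoning tokens". You can think of this as the model having learned, like a person, to choose when to think and to write down its current thoughts; the "thinking" inside these reasoning tokens is not shown to the user. This is why some early users report long waits: the model spends that time thinking, but the process is never explicitly displayed. In the next turn of the conversation (the user's second input), all of the previous turn's "thinking" is discarded and a fresh round of "thinking" begins, and so on for later turns. When the conversation reaches the 128k-token limit, the model returns a truncated answer, so the user at least does not wait for nothing only to hit the context limit.
Life can only be understood backward, but it must be lived forward - Søren Kierkegaard (Quiet-STaR quotes this line in the Abstract of its paper; I found it quite evocative at the time)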
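The token bookkeeping described above can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not OpenAI's implementation: the `Turn` class, the `simulate` function, and the token counts are hypothetical, and the sketch assumes that only visible inputs and outputs carry over between turns while the output is cut short once the 128k limit is reached.

```python
from dataclasses import dataclass

CONTEXT_LIMIT = 128_000  # token limit quoted in the text


@dataclass
class Turn:
    """Hypothetical token counts for one round of conversation."""
    input_tokens: int      # the user's message
    reasoning_tokens: int  # hidden "thinking", never shown to the user
    output_tokens: int     # visible answer the model would like to produce


def simulate(turns, limit=CONTEXT_LIMIT):
    """Return the number of visible output tokens produced in each turn.

    Simplifying assumption: reasoning is never cut short; only the visible
    output is truncated when the context limit is reached.
    """
    carried = 0  # visible tokens (inputs + outputs) kept from earlier turns
    produced = []
    for t in turns:
        used = carried + t.input_tokens + t.reasoning_tokens
        room = max(limit - used, 0)
        out = min(t.output_tokens, room)  # truncated answer if room runs out
        produced.append(out)
        # Reasoning tokens are discarded here; only input and output carry over.
        carried += t.input_tokens + out
    return produced


turns = [
    Turn(input_tokens=1_000, reasoning_tokens=50_000, output_tokens=2_000),
    Turn(input_tokens=1_000, reasoning_tokens=50_000, output_tokens=2_000),
    Turn(input_tokens=1_000, reasoning_tokens=120_000, output_tokens=5_000),
]
print(simulate(turns))
```

In this made-up example the first two turns fit comfortably because each turn's 50,000 reasoning tokens are dropped before the next turn starts; the third turn spends so much on reasoning that only part of the desired answer fits before the limit.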



Summary (o1 is strong, but don't overhype it)

According to these evaluations, o1-preview hallucinates less frequently than GPT-4o, and o1-mini hallucinates less frequently than GPT-4o-mini. However, we have received anecdotal feedback that o1-preview and o1-mini tend to hallucinate more than GPT-4o and GPT-4o-mini. More work is needed to understand hallucinations holistically, particularly in domains not covered by our evaluations (e.g., chemistry). Additionally, red teamers have noted that o1-preview is more convincing in certain domains than GPT-4o given that it generates more detailed answers. This potentially increases the risk of people trusting and relying more on hallucinated generation.




