As AI Systems Become More Capable, We Would Like to Enlist Their Help to Supervise Other AIs

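This paper introduces Constitutional AI, a method for training a helpful and harmless assistant without any human labels identifying harmful outputs. In a supervised phase, the model critiques and revises its own responses against a short list of written principles and is fine-tuned on the revisions. In a second phase, it is trained with reinforcement learning from AI feedback (RLAIF), where a model rather than a human judges which of two responses better follows the principles. The result is a harmless but non-evasive assistant that engages with harmful queries by explaining its objections, with more transparent reasoning and far less human supervision.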
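
The critique-and-revise step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate` stands in for any text-generation call, the prompt templates and the two entries in `PRINCIPLES` are invented for the example, and the principles are applied in order rather than sampled. The RLAIF stage is not shown.

```python
from typing import Callable, List

# Hypothetical principles, written for illustration only;
# they are not quoted from the paper's constitution.
PRINCIPLES = [
    "Identify ways in which the response is harmful or unethical.",
    "Identify ways in which the response is evasive or unhelpful.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> List[str]:
    """Return the initial draft followed by one revision per principle.

    `generate` is an assumed stand-in for a language-model call that
    maps an input string to a completion string.
    """
    response = generate(prompt)
    history = [response]
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
        history.append(response)
    return history
```

In a pipeline of this shape, the final entry of the returned list would serve as the fine-tuning target for the original prompt, so the human input is limited to writing the principles.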