When we test a process or system, we put in inputs and check that the outputs are correct. If they aren't, that's an error. But how do you test something that gives a different answer every time—and when varied responses are the whole point? Welcome to the world of AI evals (short for "evaluations"): ways to test whether AIs are functioning effectively.
That's right, the future of metrics and ROI looks a lot like a vibe check.
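To make the contrast concrete, here is a minimal, purely illustrative Python sketch (the `toy_model` function and the keyword-matching criterion are hypothetical stand-ins, not part of any framework discussed in the session): a deterministic function can be checked with an exact assertion, while a model that phrases its answer differently each time has to be sampled and scored against a criterion instead.

```python
import random

# --- Deterministic system: the classic test pattern described above ---
def add(a, b):
    return a + b

assert add(2, 3) == 5  # one input, one correct output; any mismatch is an error


# --- Non-deterministic system: a stand-in for an AI model ---
def toy_model(prompt: str) -> str:
    """Hypothetical model: returns a different phrasing on each call."""
    return random.choice([
        "Paris is the capital of France.",
        "The capital of France is Paris.",
        "France's capital city is Paris.",
    ])

# An exact-match assert would fail most of the time, even though every
# answer is acceptable. A simple eval instead samples many responses,
# scores each against a criterion, and reports an aggregate metric.
def simple_eval(model, prompt: str, criterion, n: int = 20) -> float:
    """Return the fraction of sampled responses that satisfy the criterion."""
    return sum(criterion(model(prompt)) for _ in range(n)) / n

score = simple_eval(
    toy_model,
    "What is the capital of France?",
    criterion=lambda answer: "paris" in answer.lower(),
)
print(f"pass rate: {score:.0%}")
```

The hard part, of course, is the criterion itself: a keyword check only works for toy questions, which is exactly why human-led evaluation matters for open-ended, real-world tasks.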
To know whether an AI model is trustworthy, conversational, and capable of handling real-world tasks, you need to evaluate it. In this session, Elena Yunusov, Executive Director at the Human Feedback Foundation, shares a rigorous framework for understanding AI through human experience. The framework, known as HUMAINE, is a joint effort by her organization, Prolific, Hugging Face, MLCommons, the Collective Intelligence Project, and others.
She'll discuss how evals work today, and why human-led evaluations are becoming increasingly important as AI models insinuate themselves into our creative and professional lives. Going beyond theory, Elena will present AI eval case studies from leading nonprofits that support and participate in the RAISE (Responsible AI for Social Impact) pilot, backed by DIGITAL and delivered in collaboration with the Creative Destruction Lab and TMU's The Dais. Attendees will also learn how agentic AI changes evals, along with some of the latest research, gaps, and open questions in this crucial, emerging field.