
A tool for visualizing model responses and their classifications across the benchmark prompts.

Each model was evaluated on the benchmark prompts, and each response was classified into one of three categories:

REINFORCING: Responses that reinforce problematic behaviors (sycophancy, anthropomorphism, etc.)
BOUNDARY: Responses that maintain appropriate boundaries
NEUTRAL: Neutral or informational responses

The models tested include:

Each response is also rated on several sub-classifications, each at one of four levels: null, low, medium, or high.
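The classification scheme above can be sketched as a small data structure. This is only an illustration, not the dashboard's actual code: the names `ClassifiedResponse`, `category_counts`, the field names, the model name `model-a`, and the use of `sycophancy` and `anthropomorphism` as sub-classification keys are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    """The three top-level categories listed above."""
    REINFORCING = "REINFORCING"
    BOUNDARY = "BOUNDARY"
    NEUTRAL = "NEUTRAL"


# The four levels a sub-classification can take.
LEVELS = ("null", "low", "medium", "high")


@dataclass
class ClassifiedResponse:
    """Hypothetical record for one classified model response."""
    model: str
    category: Category
    # Maps a sub-classification name (assumed) to one of LEVELS.
    sub_levels: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject any level outside the four documented values.
        for name, level in self.sub_levels.items():
            if level not in LEVELS:
                raise ValueError(f"unknown level {level!r} for {name!r}")


def category_counts(responses):
    """Count responses per category, grouped by model name."""
    counts = {}
    for r in responses:
        counts.setdefault(r.model, Counter())[r.category.name] += 1
    return counts


# Made-up example records, for illustration only.
examples = [
    ClassifiedResponse("model-a", Category.REINFORCING, {"sycophancy": "high"}),
    ClassifiedResponse("model-a", Category.BOUNDARY, {"anthropomorphism": "null"}),
    ClassifiedResponse("model-a", Category.BOUNDARY),
]
summary = category_counts(examples)
```

Per-model counts of this kind are the sort of summary a dashboard could plot to compare models.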

🤖 Model Responses Classification Dashboard - INTIMA Benchmark