Abstract: Human activity is moderated by norms; however, supervision for normative reasoning is sparse, particularly where norms are physically or socially grounded. We thus present EGONORMIA $\|\epsilon\|$, comprising 1,853 (200 for EGONORMIA-verified) multiple-choice questions (MCQs) grounded in egocentric videos of human interactions, enabling the evaluation and improvement of normative reasoning in vision-language models (VLMs). EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline that generates grounded MCQs from raw egocentric video. Our work demonstrates that current state-of-the-art VLMs lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified, with performance across norm categories indicating significant safety and privacy risks when VLMs are used in real-world agents. We additionally explore methods for improving normative understanding, demonstrating that a naive retrieval-based generation (RAG) method using EGONORMIA can enhance normative reasoning in VLMs.
Submission history
From: Phil Cuvin
[v1] Thu, 27 Feb 2025 19:54:16 UTC (31,255 KB)
[v2] Thu, 6 Mar 2025 00:59:40 UTC (31,767 KB)
[v3] Sun, 4 May 2025 23:41:06 UTC (45,662 KB)
[v4] Sun, 8 Jun 2025 21:03:14 UTC (34,842 KB)
[v5] Wed, 11 Jun 2025 22:13:59 UTC (34,842 KB)