Document Type

Thesis - Open Access

Award Date

2025

Degree Name

Master of Science (MS)

Department / School

Electrical Engineering and Computer Science

First Advisor

Chulwoo Pack

Abstract

Existing video description evaluation metrics fail to capture the long-range chronology and semantic alignment essential for long-form descriptions. An effective evaluation metric for long-form descriptions must (i) assess global thematic alignment, (ii) measure local semantic alignment, and (iii) evaluate chronological alignment while detecting corrupted content. We introduce the Video Comprehension Score (VCS), a reference-based metric that directly addresses these requirements through three components: a Global Alignment Score for thematic alignment, a Local Alignment Score for local semantic alignment, and a Narrative Alignment Score for chronological alignment with adjustable tolerance. We evaluate VCS on two large-scale synthetic datasets designed to test corruption detection and cross-author consistency. On corruption detection tasks, VCS consistently outperforms traditional metrics and is the only metric capable of distinguishing valid variations from invalid corruptions. On cross-author consistency tasks, VCS is the only metric that consistently produces scores above 80% regardless of which authorial reference is used for evaluation. VCSshort, our implementation for short-form descriptions, attains state-of-the-art human correlation on VATEX-EVAL in the 9-ref setting (Kendall’s τ = 41.5, Spearman’s ρ = 52.8) and competitive results in the 1-ref setting (Kendall’s τ = 30.0, Spearman’s ρ = 38.1). These results demonstrate the effectiveness of VCS for evaluating both long-form and short-form video descriptions.

Library of Congress Subject Headings

Video description -- Evaluation.
Machine learning.

Publisher

South Dakota State University

Rights Statement

In Copyright