Document Type

Thesis - Open Access

Award Date

2026

Degree Name

Master of Science (MS)

Department / School

Electrical Engineering and Computer Science

First Advisor

Kwanghee Won

Abstract

Vision-Language-Action (VLA) models have recently achieved strong performance across manipulation benchmarks, but these benchmarks rely on highly templated instructions on which the models are typically fine-tuned, leaving their behavior under realistic language-side perturbation unclear. It remains an open question whether the linguistic flexibility inherited from vision-language pretraining survives this fine-tuning, or whether the resulting policies become narrowly tuned to benchmark phrasing and brittle to the intent-preserving language variation that real users naturally produce. We address this gap with a systematic study of VLA robustness under non-canonical instructions, comprising three components: a structured taxonomy of intent-preserving variations spanning linguistic, orthographic, and sociolinguistic dimensions; an LLM-based instruction recovery framework that rewrites non-canonical inputs into canonical robot commands as a model-agnostic preprocessing layer; and a proxy-metric framework that selects the best recovery configuration without expensive policy rollouts. The recovery model is fine-tuned on a benchmark-independent corpus generated through taxonomy-guided LLM augmentation, never seeing benchmark-specific instructions during training. Across four VLA models (OpenVLA, $\pi_{0.5}$, SmolVLA, FLOWER-VLA) evaluated on LIBERO and CALVIN, we observe sharply different robustness profiles even among models with comparable baseline performance, and find that lightweight recovery restores a substantial portion of the lost performance. Our results indicate that canonical benchmark success does not imply robustness to natural instruction variation, and that recovery is most effective when policies are already partially robust.

Publisher

South Dakota State University

Recommended Citation

Samson, Viraj, "A Taxonomy-Driven Modular Defense Against Non-Canonical Language in Vision-Language-Action Models" (2026). Electronic Theses and Dissertations. 2022.
https://openprairie.sdstate.edu/etd2/2022

Included in

Electrical and Computer Engineering Commons

Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange

Electronic Theses and Dissertations

A Taxonomy-Driven Modular Defense Against Non-Canonical Language in Vision-Language-Action Models
