Document Type
Thesis - Open Access
Award Date
2026
Degree Name
Master of Science (MS)
Department / School
Electrical Engineering and Computer Science
First Advisor
Kwanghee Won
Abstract
Vision-Language-Action (VLA) models have recently achieved strong performance across manipulation benchmarks, but these benchmarks rely on highly templated instructions on which the models are typically fine-tuned, so the models' behavior under realistic language perturbations remains unclear. It is an open question whether the linguistic flexibility inherited from vision-language pretraining survives this fine-tuning, or whether the resulting policies become narrowly tuned to benchmark phrasing and brittle to the intent-preserving language variation that real users naturally produce. We address this gap with a systematic study of VLA robustness under non-canonical instructions, comprising three components: a structured taxonomy of intent-preserving variations spanning linguistic, orthographic, and sociolinguistic dimensions; an LLM-based instruction recovery framework that rewrites non-canonical inputs into canonical robot commands as a model-agnostic preprocessing layer; and a proxy-metric framework that selects the best recovery configuration without expensive policy rollouts. The recovery model is fine-tuned on a benchmark-independent corpus generated through taxonomy-guided LLM augmentation and never sees benchmark-specific instructions during training. Across four VLA models (OpenVLA, $\pi_{0.5}$, SmolVLA, FLOWER-VLA) evaluated on LIBERO and CALVIN, we observe sharply different robustness profiles even among models with comparable baseline performance, and find that lightweight recovery restores a substantial portion of the lost performance. Our results indicate that canonical benchmark success does not imply robustness to natural instruction variation, and that recovery is most effective when policies are already partially robust.
Publisher
South Dakota State University
Recommended Citation
Samson, Viraj, "A Taxonomy-Driven Modular Defense Against Non-Canonical Language in Vision-Language-Action Models" (2026). Electronic Theses and Dissertations. 2022.
https://openprairie.sdstate.edu/etd2/2022