Document Type

Dissertation - Open Access

Award Date

2025

Degree Name

Doctor of Philosophy (PhD)

Department / School

Electrical Engineering and Computer Science

First Advisor

Kwanghee Won

Second Advisor

Mostafa Tazarv

Abstract

With the rapid evolution of deep neural networks over the past decade, demand has grown for efficient, generalizable, and task-adaptable models, especially in computer vision. To address the computational and deployment challenges posed by overparameterized models, the research community has extensively explored model compression techniques such as pruning, quantization, and distillation. These approaches aim to improve efficiency without compromising performance, particularly when adapting models to domain-specific tasks under limited resources. This dissertation investigates several underexplored yet critical aspects of task-aware deep learning model compression, spanning both convolutional and vision-language architectures.

In the early part of this work, we demonstrate that shallow convolutional neural networks, when carefully initialized with orthogonal weights and constrained by orthogonal regularization, can outperform deeper counterparts on specialized classification tasks. In particular, we show that models tailored to scale-sensitive feature learning achieve competitive performance with significantly fewer parameters.

Building on this, we introduce a novel framework for adaptive structural pruning using deep reinforcement learning. By modeling the pruning decision process as a state-action optimization problem, our agent dynamically adjusts pruning ratios according to the intrinsic dimensionality of the training data, a proxy for task complexity. This approach eliminates the need for retraining or extensive post-pruning fine-tuning and reduces compute overhead, offering a more efficient and automated path to model compression. We further show that careful reward design and action-space construction are pivotal to the agent's success, particularly when targeting structured modules such as convolutional filters.
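To make the orthogonality idea concrete, the following is a minimal NumPy sketch of the two ingredients described above: an orthogonal weight initializer (via QR decomposition) and a soft orthogonality regularizer of the form ||W Wᵀ − I||²_F added to the training loss. The function names and the exact penalty form are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def orthogonal_init(rows, cols, seed=None):
    """Illustrative orthogonal initializer: QR-factorize a random
    Gaussian matrix so the smaller dimension is orthonormal."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)  # q has orthonormal columns
    return q.T[:rows, :cols] if rows < cols else q[:rows, :cols]

def orthogonal_penalty(w):
    """Soft orthogonality regularizer: squared Frobenius distance of the
    (smaller) Gram matrix from the identity. A weighted copy of this term
    would be added to the task loss during training."""
    gram = w @ w.T if w.shape[0] <= w.shape[1] else w.T @ w
    n = gram.shape[0]
    return float(np.sum((gram - np.eye(n)) ** 2))
```

A freshly initialized matrix incurs (near-)zero penalty, while an unconstrained random matrix does not; minimizing the penalty during training keeps filters close to orthogonal.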
In earlier chapters, we also show that applying pruning after a gradient-informed delay during training, combined with a hybrid action space, significantly improves compression outcomes in convolutional models with minimal loss in generalization.

In the final chapter, we extend the pruning framework to transformer-based vision-language models, which are increasingly deployed in zero-shot or few-shot scenarios across tasks such as image-text retrieval, captioning, and classification. We propose TA3DP (Task-Agnostic, Data-Dependent Distillation and Pruning), a framework for structured pruning during continual pretraining that integrates online knowledge distillation with module-aware compression to preserve alignment with pretrained representations while adapting to new domains. Our findings demonstrate that distilling from pretrained teachers, rather than fine-tuned ones, yields superior performance, particularly for generative tasks such as captioning, thereby addressing the overlooked problem of catastrophic forgetting during domain-specific fine-tuning under compression.

Overall, this dissertation contributes a unified view of task-sensitive model optimization through structured pruning, providing scalable solutions for both CNN-based models and large multimodal architectures. These insights and frameworks lay a foundation for efficient, generalization-preserving model deployment in practical AI systems across diverse domains.
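The online-distillation component described above can be sketched with the standard temperature-softened KL objective, where a frozen pretrained teacher supervises the pruned student during continual pretraining. This is a minimal NumPy illustration of that loss form; the function names, the temperature value, and the use of Hinton-style temperature scaling are assumptions for exposition, not TA3DP's exact objective.

```python
import numpy as np

def softmax(z, temp=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temp=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by temp**2 so gradients stay comparable across temperatures.
    The teacher here stands in for the frozen *pretrained* model."""
    p = softmax(teacher_logits, temp)  # soft targets from the teacher
    q = softmax(student_logits, temp)  # pruned student's predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl) * temp ** 2)
```

The loss vanishes when student and teacher logits agree and grows as the pruned student drifts from the pretrained distribution, which is the mechanism by which distillation counteracts catastrophic forgetting under compression.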

Library of Congress Subject Headings

Deep learning (Machine learning)
Computer vision
Neural networks (Computer science)

Publisher

South Dakota State University

Rights Statement

In Copyright