Agentic VQA System
This project implements a Production-Grade Visual Question Answering (VQA) Pipeline. The pipeline is designed to take an image (e.g., a dashcam frame) and a text prompt, process them using an open-source Vision-Language Model (VLM), and return a descriptive answer.
Python · PyTorch · Hugging Face Transformers · BLIP
Key Engineering Features
- Framework: Built with PyTorch, ensuring efficient tensor operations and GPU acceleration where available.
- Model Library: Integrated with the Hugging Face `transformers` library to utilize the `BlipForQuestionAnswering` architecture.
- Pre-trained Model: Defaults to `Salesforce/blip-vqa-base`, which is optimized for visual question answering tasks.
- Processing: Uses `BlipProcessor` for multimodal input handling (image + text).
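
The features listed above can be combined roughly as in the following minimal sketch. It is not the project's actual code: the helper names `pick_device`, `load_pipeline`, and `answer_question`, and the image path in the usage comment, are hypothetical and chosen only for illustration. It assumes `torch` and `transformers` are installed (plus `Pillow` for loading images).

```python
import torch
from transformers import BlipForQuestionAnswering, BlipProcessor


def pick_device():
    """Return "cuda" when a GPU is visible to PyTorch, otherwise "cpu"."""
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_pipeline(model_name="Salesforce/blip-vqa-base"):
    """Load the processor and model (hypothetical helper, not the project's API).

    The first call downloads the weights from the Hugging Face Hub.
    """
    device = pick_device()
    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForQuestionAnswering.from_pretrained(model_name).to(device)
    model.eval()  # inference only, so disable dropout and similar layers
    return processor, model, device


def answer_question(image, question, processor, model, device):
    """Answer a text question about a single image."""
    # The processor resizes and normalizes the image and tokenizes the question.
    inputs = processor(images=image, text=question, return_tensors="pt").to(device)
    with torch.no_grad():  # no gradients are needed at inference time
        output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Example usage (the file name is an assumption):
#   from PIL import Image
#   processor, model, device = load_pipeline()
#   frame = Image.open("dashcam_frame.jpg").convert("RGB")
#   print(answer_question(frame, "What color is the traffic light?",
#                         processor, model, device))
```

Passing the processor, model, and device in as arguments keeps model loading (slow, done once) separate from answering (done per request), which is one common way to structure a pipeline like this.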