An Integrated YOLO and VLM System
for Fire Detection in Enclosed Environments

International Conference on Learning Representations Workshop (ICLRW), 2025

Jongeun Kim*, Yejin Lee*, Dongsik Yoon*, Chansung Jung, Gunhee Lee

HDC LABS

Qualitative results on real-world fire footage, comparing VLM only, YOLO only, and our proposed integration of VLM and YOLO.

Abstract

While YOLO models show promise in car fire detection, they remain insufficient for real-world deployment in underground and indoor car parks due to dataset limitations, evaluation gaps, and deployment constraints. We first fine-tune YOLO on a fire/smoke-augmented dataset, but analysis reveals that it struggles with ambiguous fire-smoke boundaries, leading to false predictions.

To address this, we propose a real-time end-to-end framework integrating YOLOv8s with the Florence2 VLM, combining object detection with contextual reasoning. While pairing YOLOv8s with the VLM improves detection reliability, challenges remain. Our findings highlight YOLO’s limitations in fire detection and the need for a more adaptive, environment-aware approach.
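One way to picture the combination of detection and contextual reasoning is a two-step check, where the detector proposes fire/smoke regions and the VLM is asked to confirm each one. The sketch below is illustrative only: `Detection`, `verify_detections`, the `detect` and `vlm_confirms` callables, and the default threshold of 0.25 are hypothetical names and values, and the page does not state that the VLM is queried only after the detector fires. In practice the two callables would wrap YOLOv8s and Florence2.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    label: str         # e.g. "fire" or "smoke"
    confidence: float  # detector score in [0, 1]
    box: Box


def verify_detections(
    frame: object,
    detect: Callable[[object], List[Detection]],
    vlm_confirms: Callable[[object, Detection], bool],
    conf_threshold: float = 0.25,  # placeholder value, not taken from the paper
) -> List[Detection]:
    """Keep detections that pass the threshold and that the VLM confirms."""
    # Step 1: the detector proposes candidates; low-confidence ones are dropped.
    candidates = [d for d in detect(frame) if d.confidence >= conf_threshold]
    # Step 2 (assumed design): the VLM is consulted only for surviving candidates,
    # so frames without any candidate never reach the slower model.
    return [d for d in candidates if vlm_confirms(frame, d)]
```

With this arrangement the VLM acts as a filter on the detector's output, so it can remove false positives but cannot recover fires the detector missed. Whether the proposed framework shares that property is not specified here.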

Proposed Methods

An end-to-end fire detection framework designed to detect car fire/smoke events in real-time CCTV footage of confined environments. The framework consists of four main stages: data augmentation, training, real-time inference, and alerting security.

Quantitative Comparison

Automatic evaluation of YOLO models with and without the VLM integration on the test set.

Ablation Study

Ablation study on the impact of confidence threshold adjustments and the exclusion of synthetic data.
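To make the role of the confidence threshold concrete, the following sketch shows how raising it trades recall for precision. The function name `precision_recall` and the toy scores in `TOY_DETECTIONS` are made up for illustration; they are not results from the paper, and the matching of detections to ground truth is assumed to have been done already.

```python
from typing import List, Tuple

# (confidence, is_true_positive) for each detection; invented toy values.
TOY_DETECTIONS: List[Tuple[float, bool]] = [
    (0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False),
]
TOY_TOTAL_POSITIVES = 4  # ground-truth objects, including one never detected


def precision_recall(
    scored: List[Tuple[float, bool]],
    threshold: float,
    total_positives: int,
) -> Tuple[float, float]:
    """Precision and recall over detections with confidence >= threshold."""
    kept = [is_tp for conf, is_tp in scored if conf >= threshold]
    tp = sum(kept)
    # With nothing kept, precision is undefined; 1.0 is used here by convention.
    precision = tp / len(kept) if kept else 1.0
    recall = tp / total_positives if total_positives else 0.0
    return precision, recall


if __name__ == "__main__":
    for t in (0.25, 0.5, 0.85):
        p, r = precision_recall(TOY_DETECTIONS, t, TOY_TOTAL_POSITIVES)
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On the toy data, precision rises as the threshold increases while recall falls, which is the general trade-off a threshold ablation examines.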

Fire / Smoke Dataset Augmentation

Fire/smoke dataset generation workflow for underground parking scenarios. The process breaks down into three stages: location and mask preparation, fire/smoke object generation, and image refinement.
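The three stages can be read as a simple composition, sketched below under stated assumptions. Only the mask step is made concrete, using a rectangular region as a stand-in for whatever location logic the workflow actually uses. The generation and refinement models are not named here, so they appear as caller-supplied functions; `rect_mask`, `synthesize`, `generate`, and `refine` are hypothetical names.

```python
from typing import Callable, List, Tuple

Mask = List[List[int]]           # 1 inside the region to fill, 0 elsewhere
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); x2 and y2 are exclusive


def rect_mask(height: int, width: int, box: Box) -> Mask:
    """Stage 1: build a binary mask marking where fire/smoke should appear."""
    x1, y1, x2, y2 = box
    return [
        [1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(width)]
        for y in range(height)
    ]


def synthesize(
    image: object,
    height: int,
    width: int,
    box: Box,
    generate: Callable[[object, Mask], object],
    refine: Callable[[object, Mask], object],
) -> object:
    """Run the three stages in order on one background image."""
    mask = rect_mask(height, width, box)  # location and mask preparation
    with_object = generate(image, mask)   # fire/smoke object generation
    return refine(with_object, mask)      # image refinement
```

Passing the same mask to both later stages lets refinement focus on the edited region. This is a design choice of the sketch rather than a documented detail of the workflow.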

Examples of synthetically generated fire and smoke images.