Title: Multimodal Image Fusion for Enhanced Vehicle Identification in Intelligent Transport

Authors: Naif Al Mudawi, Muhammad Waqas Ahmed, Haifa F. Alhasson, Naif S. Alshammari, Abdulwahab Alazeb, Mohammed Alshehri, and Bayan Alabdullah

Abstract: Target detection in remote sensing is essential for applications such as law enforcement, military surveillance, and search-and-rescue. With advances in computational power, deep learning methods have excelled at using unimodal aerial imagery. The availability of diverse imaging modalities, including RGB, infrared, hyperspectral, multispectral, synthetic aperture radar, and LiDAR, allows researchers to leverage complementary data, and integrating these multimodal datasets has significantly enhanced detection performance, making these technologies more effective in real-world scenarios. In this work, we propose a novel approach that uses a deep learning-based attention mechanism to generate depth maps from aerial images. These depth maps are fused with RGB images for enhanced feature representation. Our methodology applies Markov Random Fields (MRF) for segmentation and YOLOv4 for vehicle detection. Additionally, we propose a novel feature extraction technique that combines HOG and BRISK descriptors within the Vision Transformer (ViT) framework. Finally, ResNet-18 is used for classification. The performance of our model is evaluated on three datasets, Roundabout Aerial, AU-AIR, and VAID, achieving object detection precision scores of 98.4%, 96.2%, and 97.4%, respectively. Our approach outperforms existing methods for vehicle detection and classification in aerial images.

Key Contributions: The paper proposes a novel framework for vehicle detection and classification in aerial images that generates depth maps from RGB images using a deep learning-based encoder-decoder architecture enhanced with self-attention mechanisms. These depth maps are fused with the RGB images using guided image filtering, enhancing the scene representation while preserving structural details. A hybrid feature extraction approach integrates HOG and BRISK features within the Vision Transformer's patch embedding layer to capture both global and local features effectively. The extracted features are then classified using a modified ResNet-18 model with a custom fully connected layer for improved accuracy. This multimodal fusion approach achieves superior precision and classification performance across the Roundabout, AU-AIR, and VAID datasets, outperforming existing methods.

Usage:

Data Availability: The Roundabout Aerial Images, AU-AIR, and VAID datasets used in this study are available at:
Roundabout Aerial Images: https://www.kaggle.com/datasets/javiersanchezsoriano/roundabout-aerial-images-for-vehicle-detection
AU-AIR Dataset: https://bozcani.github.io/auairdataset
VAID Dataset: https://universe.roboflow.com/chandler-sun/vaid-mnnde

Code Information:
Fusion Module (Fusion.py): Loads RGB and depth images, applies guided filtering for image fusion, and visualizes the results (sketched below).
Depth Estimation (Depth images.py): Implements a convolution-based model with attention mechanisms to generate depth maps from RGB images (sketched below).
Feature Extraction (Feature.py): Extracts HOG and BRISK features from image patches, embedding them within a Vision Transformer framework (sketched below).
Classification (Classification.py): Uses a modified ResNet-18 model to classify vehicles based on the extracted features (sketched below).
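A minimal sketch of the guided-filtering fusion step in Fusion.py is shown below. It uses OpenCV's cv2.ximgproc.guidedFilter, which ships in the opencv-contrib-python build rather than plain opencv-python; the file names, radius, and eps values are illustrative placeholders rather than the authors' exact settings.

    # Minimal sketch: fuse an estimated depth map with its RGB image via
    # guided image filtering. Assumes both images share the same resolution.
    import cv2
    import numpy as np

    rgb = cv2.imread("rgb.png")                            # guidance image
    depth = cv2.imread("depth.png", cv2.IMREAD_GRAYSCALE)  # estimated depth map

    # The guided filter smooths the depth map while transferring the RGB
    # image's structural edges into it, preserving object boundaries.
    filtered = cv2.ximgproc.guidedFilter(guide=rgb, src=depth,
                                         radius=8, eps=0.01 * 255 ** 2)

    # Stack the edge-aware depth channel onto the RGB channels to obtain a
    # four-channel fused representation for downstream detection.
    fused = np.dstack([rgb, filtered])
    cv2.imwrite("fused_depth.png", filtered)

Unlike naive blending, guided filtering lets the RGB image act as the structural guide, so the fused depth channel inherits sharp object boundaries.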
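The depth estimator in Depth images.py is described as a convolutional encoder-decoder with self-attention; a minimal PyTorch sketch of that idea follows. The DepthNet name, layer widths, and single attention block are assumptions for illustration, not the paper's exact architecture.

    # Minimal sketch: convolutional encoder-decoder with a self-attention
    # bottleneck that maps an RGB image to a single-channel depth map.
    import torch
    import torch.nn as nn

    class DepthNet(nn.Module):
        def __init__(self, channels=64, heads=4):
            super().__init__()
            # Encoder: two stride-2 convolutions downsample the input 4x.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Self-attention over spatial positions of the bottleneck lets
            # every location attend to the whole scene.
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
            # Decoder: two stride-2 transposed convolutions restore the
            # input resolution; sigmoid bounds depth to [0, 1].
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1),
                nn.ReLU(),
                nn.ConvTranspose2d(channels, 1, 4, stride=2, padding=1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            f = self.encoder(x)                    # (B, C, H/4, W/4)
            b, c, h, w = f.shape
            tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/16, C) sequence
            attended, _ = self.attn(tokens, tokens, tokens)
            f = attended.transpose(1, 2).reshape(b, c, h, w)
            return self.decoder(f)                 # (B, 1, H, W) depth map

    depth = DepthNet()(torch.randn(1, 3, 128, 128))  # smoke test: (1, 1, 128, 128)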
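A minimal sketch of the hybrid HOG + BRISK patch embedding in Feature.py follows, assuming scikit-image's hog and OpenCV's BRISK. The 16-pixel patches, 128-dimensional embedding, and per-patch mean-pooling of BRISK descriptors are illustrative choices; in the full model these tokens would also be combined with the standard ViT pixel embeddings and position encodings.

    # Minimal sketch: build ViT-style patch tokens from hand-crafted
    # descriptors, HOG for global gradient structure and BRISK for local
    # keypoint texture, then project them to the transformer width.
    import cv2
    import numpy as np
    import torch
    import torch.nn as nn
    from skimage.feature import hog

    PATCH, EMBED = 16, 128   # patch size and ViT embedding width (assumed)
    brisk = cv2.BRISK_create()

    class HybridPatchEmbed(nn.Module):
        """Projects [HOG || pooled BRISK] of each patch to the ViT width."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(36 + 64, EMBED)  # 36 HOG dims + 64 BRISK bytes

        def forward(self, image_bgr):
            gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
            # Detect BRISK keypoints once over the whole image; each patch
            # then mean-pools the descriptors of keypoints falling inside it.
            kps, des = brisk.detectAndCompute(gray, None)
            tokens = []
            for y in range(0, gray.shape[0] - PATCH + 1, PATCH):
                for x in range(0, gray.shape[1] - PATCH + 1, PATCH):
                    h = hog(gray[y:y+PATCH, x:x+PATCH], orientations=9,
                            pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                    idx = [i for i, k in enumerate(kps)
                           if x <= k.pt[0] < x + PATCH and y <= k.pt[1] < y + PATCH]
                    b = des[idx].mean(axis=0) if idx else np.zeros(64, np.float32)
                    tokens.append(np.concatenate([h, b]).astype(np.float32))
            return self.proj(torch.from_numpy(np.stack(tokens)))

    img = np.random.randint(0, 256, (64, 64, 3), np.uint8)  # stand-in aerial crop
    tokens = HybridPatchEmbed()(img)                        # (16, 128) patch tokens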
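Finally, a minimal sketch of the modified ResNet-18 classifier in Classification.py, assuming torchvision >= 0.13 for the pretrained-weights API. The hidden width, dropout rate, and four-class output are placeholders; the authors' custom fully connected head may differ.

    # Minimal sketch: ResNet-18 with its stock fully connected layer
    # replaced by a custom classification head.
    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_CLASSES = 4  # placeholder vehicle-class count

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Sequential(                    # custom fully connected head
        nn.Linear(model.fc.in_features, 256),    # 512 -> 256 hidden layer
        nn.ReLU(),
        nn.Dropout(0.3),                         # regularization (assumed rate)
        nn.Linear(256, NUM_CLASSES),
    )

    logits = model(torch.randn(8, 3, 224, 224))  # smoke test: (8, NUM_CLASSES)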
Usage Instructions:

Running the Code:
1. Prepare Dataset: download and extract the dataset images into the specified directories.
2. Generate Depth Images: run Depth images.py.
3. Perform Image Fusion: run Fusion.py.
4. Extract Features: run Feature.py.
5. Train and Test Classifier: run Classification.py.
(A typical command sequence is sketched at the end of this section.)

Requirements: Python 3.8+ with the following libraries: PyTorch, torchvision, OpenCV, NumPy, SciPy, TensorFlow, Matplotlib, scikit-image, and Pillow, installable via:
    pip install numpy scipy torch torchvision opencv-python tensorflow matplotlib scikit-image pillow
GPU support is recommended for the deep learning-based tasks. This work was developed on a Windows 10 (x64) machine with an Intel Core i3-4010U CPU @ 1.70 GHz and 4 GB RAM.

Code Repository: The implementation code is available in the supplemental files provided with the submission.
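For reference, assuming the scripts sit in the repository root and the dataset directories from step 1 are in place, an end-to-end run of steps 2 through 5 might look like the following (the space in the Depth images.py filename requires quoting):

    python "Depth images.py"     # step 2: generate depth maps from RGB images
    python Fusion.py             # step 3: fuse depth maps with RGB via guided filtering
    python Feature.py            # step 4: extract HOG + BRISK features for the ViT
    python Classification.py     # step 5: train and evaluate the ResNet-18 classifier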