Abstract
The integration of multimodal data is critical to advancing artificial intelligence models capable of interpreting diverse and complex inputs. While standalone models excel at processing individual data types such as text, image, or audio, they often fail to achieve comparable performance when these modalities are combined. Generative Adversarial Networks (GANs) have emerged as a transformative approach in this domain due to their ability to synthesize and learn effectively across disparate data types. This study addresses the challenge of bridging multimodal datasets to improve the generalization and performance of AI models. The proposed framework employs a novel GAN architecture that integrates textual, visual, and auditory data streams. Using a shared latent space, the system generates coherent representations for cross-modal understanding, ensuring seamless data fusion. The GAN model is trained on a benchmark dataset comprising 50,000 multimodal instances, with 25% allocated for testing. Results indicate significant improvements in multimodal synthesis and classification accuracy. The model achieves a text-to-image synthesis FID score of 14.7, an audio-to-text BLEU score of 35.2, and a cross-modal classification accuracy of 92.3%. These outcomes surpass existing models by 8-15% across comparable metrics, highlighting the GAN's effectiveness in handling data heterogeneity. The findings suggest potential applications in areas such as virtual assistants, multimedia analytics, and cross-modal content generation.
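The abstract does not give implementation details of the shared latent space, so the following is a minimal sketch only: it assumes PyTorch-style modality encoders that project text, image, and audio features into one latent space, with a discriminator trained adversarially so that latent codes become modality-indistinguishable. All class names, dimensions, and hyperparameters are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch of shared-latent-space alignment for three modalities.
# Not the paper's code; dimensions and names are assumed for illustration.
import torch
import torch.nn as nn

LATENT_DIM = 256

def encoder(in_dim: int) -> nn.Module:
    """Small MLP projecting one modality's features into the shared latent space."""
    return nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, LATENT_DIM))

class SharedLatentGAN(nn.Module):
    def __init__(self, text_dim=300, image_dim=2048, audio_dim=128):
        super().__init__()
        self.text_enc = encoder(text_dim)    # e.g. pooled word embeddings
        self.image_enc = encoder(image_dim)  # e.g. CNN features
        self.audio_enc = encoder(audio_dim)  # e.g. mel-spectrogram statistics
        # Discriminator guesses which modality a latent code came from; the
        # encoders are trained to fool it, aligning the three modalities.
        self.discriminator = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, 3))

    def encode(self, text, image, audio):
        return self.text_enc(text), self.image_enc(image), self.audio_enc(audio)

# One adversarial alignment step on random stand-in features (batch of 8).
model = SharedLatentGAN()
ce = nn.CrossEntropyLoss()
d_opt = torch.optim.Adam(model.discriminator.parameters(), lr=1e-4)
g_opt = torch.optim.Adam(
    list(model.text_enc.parameters())
    + list(model.image_enc.parameters())
    + list(model.audio_enc.parameters()), lr=1e-4)

text, image, audio = torch.randn(8, 300), torch.randn(8, 2048), torch.randn(8, 128)

# Discriminator step: classify the true modality of each (detached) latent code.
z_t, z_i, z_a = model.encode(text, image, audio)
z = torch.cat([z_t, z_i, z_a]).detach()
labels = torch.arange(3).repeat_interleave(8)
d_loss = ce(model.discriminator(z), labels)
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Encoder step: push latents toward modality-indistinguishability by matching
# the discriminator's output to a uniform distribution over the 3 classes.
z_t, z_i, z_a = model.encode(text, image, audio)
logits = model.discriminator(torch.cat([z_t, z_i, z_a]))
uniform = torch.full_like(logits.softmax(-1), 1.0 / 3)
g_loss = nn.functional.kl_div(logits.log_softmax(-1), uniform, reduction="batchmean")
g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

In practice the aligned latent codes would feed modality-specific decoders for the text-to-image and audio-to-text synthesis tasks evaluated in the paper; that generation stage is omitted here.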
| Original language | English |
|---|---|
| Article number | 0497 |
| Pages (from-to) | 3567-3577 |
| Number of pages | 11 |
| Journal | ICTACT Journal of Soft Computing |
| Volume | 15 |
| Issue number | 3 |
| DOIs | |
| Publication status | Published - 1 Jan 2025 |
Keywords
- Multimodal AI
- Generative Adversarial Networks
- Cross-Modal Synthesis
- Text-Image-Audio Fusion
- Model Performance Enhancement