
JCTAM OPEN ACCESS
Journal of Computer Technology and Applied Mathematics
ISSN:3007-4126 (print) | ISSN:3007-4134 (online) | Publication Frequency: Bimonthly
Overview of Multimodal Generative Models in Natural Language Processing and Computer Vision
* Corresponding Author: Liang Li, E-mail: liliang150851@163.com
Publication
Accepted Unknown; Published 2024 November 2
Journal of Computer Technology and Applied Mathematics, 2024, 1(4), 69-78.
Abstract
Multimodal generative models have become central to recent progress in deep learning because they offer flexibility across a wide range of applications in Natural Language Processing (NLP) and Computer Vision (CV). This paper systematically reviews the basic concepts and technical advances in multimodal generative models and discusses their applications across modalities such as text, images, audio, and video. By combining data from multiple modalities, these models strengthen the ability of AI systems to understand and carry out complex tasks. We examine how these principles apply to several mainstream models, including CLIP, DALL·E, and Flamingo, and consider their applications in visual question answering (VQA), text-to-image synthesis, medical image analysis, edutainment content creation, and user research. The paper also examines the current difficulties facing these technologies, including limited data availability, the effectiveness of modality fusion, and constraints on computational resources, and suggests directions for future research. Finally, it addresses the privacy concerns raised by multimodal generative models and calls for weighing safety and responsibility alongside technological innovation.
Keywords
Multimodal Generative Models, Natural Language Processing, Computer Vision, Data Fusion, Deep Learning, CLIP, DALL·E.
Metadata
Pages: 69-78
References: 19
Disciplines: Artificial Intelligence
Subjects: Multimodal Generative Models
Cite This Article
APA Style
Li, L. (2024). Overview of multimodal generative models in natural language processing and computer vision. Journal of Computer Technology and Applied Mathematics, 1(4), 69-78. https://doi.org/10.5281/zenodo.13988327
Acknowledgments
The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.
FUNDING
Not applicable.
INSTITUTIONAL REVIEW BOARD STATEMENT
Not applicable.
DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.
INFORMED CONSENT STATEMENT
Not applicable.
CONFLICT OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
AUTHOR CONTRIBUTIONS
Not applicable.
PUBLISHER'S NOTE
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Copyright © 2025 The Author(s). Published by Southern United Academy of Sciences. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.