JCTAM OPEN ACCESS

Journal of Computer Technology and Applied Mathematics

ISSN:3007-4126 (print) | ISSN:3007-4134 (online) | Publication Frequency: Bimonthly

OPEN ACCESS | Research Article | 2 November 2024

Applications of Large Language Models in Multimodal Learning

* Corresponding Author: Peiyang Yu, E-mail: peiyangy@alumni.cmu.edu

Publication

Accepted: Unknown; Published: 2 November 2024

Journal of Computer Technology and Applied Mathematics, 2024, 1(4), 108-116.

Abstract

In this paper, we provide a systematic review of the emerging field of applying Large Language Models (LLMs) to multimodal learning, focusing on how these methods improve task performance by integrating modalities such as images, text, and audio. Multimodal learning combines heterogeneous data types so that models can learn complementary attributes and generate meaningful outputs; it is widely applied in image captioning, cross-modal retrieval, sentiment analysis, and speech recognition. The paper reviews the main multimodal learning approaches, including feature extraction, modality alignment, and fusion strategies (early fusion, late fusion, and hybrid fusion), as well as the performance of LLMs on cross-modal tasks. It then highlights current technological challenges, emphasizing concerns about computational resource consumption, model complexity, and the limitations of existing multimodal fusion methods. Finally, the article offers suggestions for future work on tighter modality integration and on few-shot learning in cross-modal generation models, and discusses ways to make multimodal machine translation systems run faster with less distributed computational power.
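The distinction between the fusion strategies surveyed in the abstract can be illustrated with a minimal sketch, not taken from the paper itself: early fusion concatenates per-modality feature vectors before any model sees them, while late fusion combines per-modality prediction scores. The feature dimensions and the weighting scheme below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors (e.g., image and text embeddings).
image_feat = rng.normal(size=(4, 8))  # 4 samples, 8-dim image features
text_feat = rng.normal(size=(4, 5))   # 4 samples, 5-dim text features

def early_fusion(img, txt):
    """Early fusion: concatenate raw modality features into one vector."""
    return np.concatenate([img, txt], axis=1)

def late_fusion(img_scores, txt_scores, w=0.5):
    """Late fusion: combine per-modality prediction scores, not features."""
    return w * img_scores + (1 - w) * txt_scores

fused = early_fusion(image_feat, text_feat)
print(fused.shape)  # (4, 13): one joint feature vector per sample

# Stand-in class probabilities from two separate per-modality models.
img_scores = rng.dirichlet(np.ones(3), size=4)
txt_scores = rng.dirichlet(np.ones(3), size=4)
print(late_fusion(img_scores, txt_scores).shape)  # (4, 3)
```

Hybrid fusion, the third strategy the abstract names, would mix both: fuse some features early and still combine model outputs at the decision level.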

Keywords

Large Language Models (LLMs), Multimodal Learning, Cross-modal Tasks, Few-shot Learning, Cross-modal Generation.

Metadata

Pages: 108-116

References: 25

Disciplines: Computer Sciences

Subjects: Large Language Models

Cite This Article

APA Style

Yu, P., Xu, X., & Wang, J. (2024). Applications of large language models in multimodal learning. Journal of Computer Technology and Applied Mathematics, 1(4), 108-116. https://doi.org/10.5281/zenodo.14001455

Acknowledgments

The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

FUNDING

Not applicable.

INSTITUTIONAL REVIEW BOARD STATEMENT

Not applicable.

DATA AVAILABILITY STATEMENT

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

INFORMED CONSENT STATEMENT

Not applicable.

CONFLICT OF INTEREST

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

Not applicable.

References

1.
Liu, W., Zhou, L., Zeng, D., Xiao, Y., Cheng, S., Zhang, C., ... & Chen, W. (2024). Beyond Single-Event Extraction: Towards Efficient Document-Level Multi-Event Argument Extraction. arXiv preprint arXiv:2405.01884.

2.
Zhang, Y., Wang, F., Huang, X., Li, X., Liu, S., & Zhang, H. (2024). Optimization and Application of Cloud-based Deep Learning Architecture for Multi-Source Data Prediction. arXiv preprint arXiv:2410.12642.

3.
Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., & Hussain, A. (2023). Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 91, 424-444.

4.
Liang, P. P., Zadeh, A., & Morency, L. P. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10), 1-42.

5.
Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., & Hussain, A. (2023). Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 91, 424-444.

6.
Bačić, B., Feng, C., & Li, W. (2024). JY61 IMU sensor external validity: A framework for advanced pedometer algorithm personalisation. ISBS Proceedings Archive, 42(1), 60.

7.
Liu, W., Cheng, S., Zeng, D., & Hong, Q. (2023, July). Enhancing Document-level Event Argument Extraction with Contextual Clues and Role Relevance. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 12908-12922).

8.
Leong, H. Y., Gao, Y. F., Shuai, J., Zhang, Y., & Pamuksuz, U. (2024). Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation. arXiv preprint arXiv:2409.09324.

9.
Rashkin, H., Nikolaev, V., Lamm, M., Aroyo, L., Collins, M., Das, D., ... & Reitter, D. (2023). Measuring attribution in natural language generation models. Computational Linguistics, 49(4), 777-840.

10.
Miah, M. S. U., Kabir, M. M., Sarwar, T. B., Safran, M., Alfarhood, S., & Mridha, M. F. (2024). A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and LLM. Scientific Reports, 14(1), 9603.

11.
Scherrer, N., Shi, C., Feder, A., & Blei, D. (2024). Evaluating the moral beliefs encoded in llms. Advances in Neural Information Processing Systems, 36.

12.
Asaithambi, S. P. R., Venkatraman, R., & Venkatraman, S. (2023). A thematic travel recommendation system using an augmented big data analytical model. Technologies, 11(1), 28.

13.
Ataallah, K., Shen, X., Abdelrahman, E., Sleiman, E., Zhu, D., Ding, J., & Elhoseiny, M. (2024). Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413.

14.
Marchisio, K., Ko, W. Y., Bérard, A., Dehaze, T., & Ruder, S. (2024). Understanding and mitigating language confusion in llms. arXiv preprint arXiv:2406.20052.

15.
Atrey, K., Singh, B. K., & Bodhey, N. K. (2024). Multimodal classification of breast cancer using feature level fusion of mammogram and ultrasound images in machine learning paradigm. Multimedia Tools and Applications, 83(7), 21347-21368.

16.
Quiles Pérez, M., Martínez Beltrán, E. T., López Bernal, S., Horna Prat, E., Montesano Del Campo, L., Fernández Maimó, L., & Huertas Celdran, A. (2024). Data fusion in neuromarketing: Multimodal analysis of biosignals, lifecycle stages, current advances, datasets, trends, and challenges. Information Fusion, 105, 102231.

17.
Atz, K., Cotos, L., Isert, C., Håkansson, M., Focht, D., Hilleke, M., ... & Schneider, G. (2024). Prospective de novo drug design with deep interactome learning. Nature Communications, 15(1).

18.
Wang, D. (Ed.). (2016). Information Science and Electronic Engineering: Proceedings of the 3rd International Conference of Electronic Engineering and Information Science (ICEEIS 2016), January 4-5, 2016, Harbin, China. CRC Press.

19.
Khemani, B., Patil, S., Kotecha, K., & Tanwar, S. (2024). A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions. Journal of Big Data, 11(1), 18.

20.
Akkem, Y., Biswas, S. K., & Varanasi, A. (2024). A comprehensive review of synthetic data generation in smart farming by using variational autoencoder and generative adversarial network. Engineering Applications of Artificial Intelligence, 131, 107881.

21.
Ghassemiazghandi, M. (2024). An Evaluation of ChatGPT's Translation Accuracy Using BLEU Score. Theory and Practice in Language Studies, 14(4), 985-994.

22.
Li, X., & Liu, S. (2024). Predicting 30-Day Hospital Readmission in Medicare Patients: Insights from an LSTM Deep Learning Model. medRxiv. doi:10.1101/2024.09.08.24313212

23.
Lu, J. (2024). Optimizing E-Commerce with Multi-Objective Recommendations Using Ensemble Learning.

24.
Liu, T., Wu, Y., Ye, A., Cao, L., & Cao, Y. (2024). Two-stage sparse multi-objective evolutionary algorithm for channel selection optimization in BCIs. Frontiers in Human Neuroscience, 18, 1400077.

25.
Yu, P., Cui, V. Y., & Guan, J. (2021, March). Text classification by using natural language processing. In Journal of Physics: Conference Series (Vol. 1802, No. 4, p. 042010). IOP Publishing.

PUBLISHER'S NOTE

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2025 The Author(s). Published by Southern United Academy of Sciences.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.