[#321] 2025-06-26 [ICCV 2025] Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations (by Jeong Hun Yeo) is accepted to ICCV 2025.
Title: Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
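As a rough illustration of the cascaded pipeline described above (the AV-Romanizer predicting Roman text, followed by LLM de-romanization into language-specific graphemes), here is a minimal, hypothetical Python sketch. The class and function names (AVRomanizer, build_deromanization_prompt) and the prompt wording are assumptions for illustration only, not the authors' released code.

```python
# Hypothetical sketch of Cascaded Zero-AVSR; names and prompt wording are illustrative only.

class AVRomanizer:
    """Stand-in for the audio-visual encoder that predicts language-agnostic Roman text."""

    def transcribe_to_roman(self, audio, video) -> str:
        # A real model would fuse audio and lip-video features and decode Roman characters.
        return "annyeonghaseyo"  # placeholder output for illustration

def build_deromanization_prompt(roman_text: str, target_language: str) -> str:
    # The multilingual LLM is asked to map Roman text to target-language graphemes.
    return (
        f"Convert the following romanized speech transcript into {target_language} script.\n"
        f"Romanized: {roman_text}\n"
        f"{target_language} transcript:"
    )

def cascaded_zero_avsr(audio, video, target_language: str, llm_generate) -> str:
    roman_text = AVRomanizer().transcribe_to_roman(audio, video)
    return llm_generate(build_deromanization_prompt(roman_text, target_language))

if __name__ == "__main__":
    dummy_llm = lambda prompt: "안녕하세요"  # stand-in for a multilingual LLM call
    print(cascaded_zero_avsr(audio=None, video=None,
                             target_language="Korean", llm_generate=dummy_llm))
```

The unified Zero-AVSR variant described in the abstract would instead feed the AV-Romanizer's speech representations into the LLM through a finetuned adapter, rather than passing Roman text as a prompt.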
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), ICCV 2025
[#320] 2025-06-23 [Recent Ph.D. Graduate: Postdoc] Junho Kim joins UIUC as a postdoctoral researcher in AI
Junho Kim has successfully defended his Ph.D. and is set to join the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign (UIUC) as a postdoctoral researcher. The move marks an important step forward in his career in artificial intelligence research.
During his doctoral studies, Junho's research focused on enhancing the interpretability and robustness of AI models. He developed novel perturbation-based methods for explaining the behavior of black-box AI models. A key contribution of his work was a comprehensive analysis of AI models' vulnerabilities to adversarial perturbations, leading to a successful distinction between robust and non-robust features. These findings were published in top-tier AI venues such as IEEE Transactions on Image Processing (TIP) and the Conference on Neural Information Processing Systems (NeurIPS).
More recently, Junho's research addressed the pressing issue of hallucinations in Multimodal Large Language Models (MLLMs). He proposed innovative mitigation strategies to alleviate these issues in large-scale models, introducing techniques based on counterfactual approaches and decoding-time interventions. This work was presented at the Conference on Computer Vision and Pattern Recognition (CVPR).
Junho's transition to UIUC is anticipated to further deepen his contributions to AI research, reinforcing his position as an emerging international scholar in the field.
[#319] 2025-06-22 [Meta Internship] Se Jin Park will join Meta as a research scientist intern.
Se Jin Park, a PhD student in Human Multimodal LLMs from the IVL lab, has secured a research internship at Meta in the USA. Meta is recognized as a leading institution in the field of LLM research.
This internship will provide Se Jin with the opportunity to enhance her ongoing PhD research in human multimodal LLMs. She has previously published notable papers on human multimodal AI, focusing on processing and understanding human-relevant modalities such as spoken language and human speech.
As Se Jin approaches the completion of her PhD, this internship is intended to build upon her prior research experiences. The aim is to expand and deepen her doctoral work, which is expected to strengthen her global competitiveness for future roles in leading AI institutions post-PhD.
[#318] 2025-06-03 [IEEE TMM] TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages (by Minsu Kim) is accepted to IEEE Transactions on Multimedia.
Title: TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Minsu Kim*, Jee-weon Jung*, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro (*equal contribution)
The capability to jointly process multi-modal information is becoming an essential task. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT outperforms single-model counterparts consistently, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
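The "different modalities as different languages" viewpoint can be illustrated with a small, hypothetical sketch: each modality is mapped to discrete tokens, and source/target modality tags tell a shared encoder-decoder which of the six translation directions to perform. The tag strings and the toy tokenizer below are assumptions for illustration, not the paper's actual codebooks.

```python
# Illustrative sketch of the unified token interface; names and token values are placeholders.
from typing import List

# Special tokens marking the source/target "languages" (modalities).
MODALITY_TAGS = {"speech": "<speech>", "image": "<image>", "text": "<text>"}

def tokenize(modality: str, raw_input) -> List[int]:
    """Stand-in for modality-specific tokenizers (e.g. speech/image quantizers, a text BPE)."""
    # A real system would return discrete token ids from a learned codebook.
    return [hash((modality, x)) % 1000 for x in raw_input]

def build_translation_input(src_modality: str, tgt_modality: str,
                            src_tokens: List[int]) -> List[str]:
    # The shared encoder-decoder only sees discrete tokens plus modality tags,
    # so every translation direction uses the same interface.
    return [MODALITY_TAGS[src_modality], MODALITY_TAGS[tgt_modality]] + [str(t) for t in src_tokens]

if __name__ == "__main__":
    speech_tokens = tokenize("speech", ["frame0", "frame1", "frame2"])
    print(build_translation_input("speech", "text", speech_tokens))
```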
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), IEEE Transactions on Multimedia
[#317] 2025-05-28 [ACL 2025] MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens (by Jeong Hun Yeo, Hyeongseop Rha) is accepted to the Findings of ACL 2025.
Title: MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo*, Hyeongseop Rha*, Se Jin Park, Yong Man Ro (* equal contribution)
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor to adjust token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.74% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%.
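A hedged sketch of the dynamic token-allocation idea described above: the number of Q-Former query tokens grows with input duration and is adjusted by a predicted speaking speed, keeping the average near the reported 3.5 tokens per second. The function name, the normalization of the speech-rate predictor's output, and the rounding rule are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of duration- and rate-aware query allocation; constants are illustrative.

def allocate_speech_tokens(duration_sec: float,
                           predicted_speech_rate: float,
                           base_tokens_per_sec: float = 3.5,
                           min_tokens: int = 1) -> int:
    """Return how many Q-Former query tokens to use for one utterance.

    duration_sec: length of the audio-visual input.
    predicted_speech_rate: output of a speech-rate predictor, normalized so that
        1.0 means average speaking speed (faster speech -> more tokens).
    """
    n_tokens = round(duration_sec * base_tokens_per_sec * predicted_speech_rate)
    return max(min_tokens, n_tokens)

if __name__ == "__main__":
    # A 6-second utterance spoken slightly faster than average.
    print(allocate_speech_tokens(duration_sec=6.0, predicted_speech_rate=1.2))  # -> 25
```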
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), ACL 2025
[#316] 2025-05-14 [ICML 2025] Long-Form Speech Generation with Spoken Language Models (by Se Jin Park) is accepted as an Oral (~1%) at ICML 2025.
Title: Long-Form Speech Generation with Spoken Language Models
Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from the high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations, we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are available on the project page.
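The appeal of a linear-time sequence backbone for minutes-long generation is that decoding carries a fixed-size state instead of a growing attention cache. The toy recurrence below is only meant to illustrate that constant-memory property; it is not SpeechSSM's actual state-space layer.

```python
# Toy illustration of constant-memory decoding with a fixed-size recurrent state.
import numpy as np

def step(state: np.ndarray, token_embedding: np.ndarray) -> np.ndarray:
    """Toy constant-size state update (a real SSM layer has learned dynamics)."""
    return 0.9 * state + 0.1 * token_embedding

def generate(num_steps: int, dim: int = 16) -> int:
    rng = np.random.default_rng(0)
    state = np.zeros(dim)
    for _ in range(num_steps):        # e.g. minutes of speech tokens
        token_embedding = rng.standard_normal(dim)
        state = step(state, token_embedding)
    return state.size                 # memory footprint is O(dim), independent of num_steps

if __name__ == "__main__":
    print(generate(num_steps=10_000))  # 16: state size does not grow with sequence length
```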
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), ICML 2025
[#315] 2025-04-18 [Fall 2025 Lab Recruitment] We invite students to research MLLM (Multimodal Large Language Model) + (Vision, Audio, Language).
We are recruiting students including one government-funded M.S. student, KAIST scholarship students (M.S. and Ph.D.), and students in KAIST programs (KEPSI, EPSS, LGenius, EPSD).
The invited research areas are vision and multimodal research based on MLLMs.
MLLM (Multimodal Large Language Model) + (Vision, Audio, Language)
Integration of Vision, Language, and Audio
We thank the many students for their interest and inquiries.
Students who wish to research MLLM+ at the IVL lab should request a meeting with the professor (ymro@kaist.ac.kr) after admission.
[#314] 2025-03-12 [Recruited by DeepMind] Dr. Minsu Kim and Dr. Joanna Hong have been recruited by DeepMind.
IVL Lab proudly announces that two of its Ph.D. graduates, Dr. Minsu Kim and Dr. Joanna Hong, have been recruited by DeepMind, a world leader in artificial intelligence research. Both Dr. Kim and Dr. Hong were instrumental in pioneering Human Multimodal research during their time at IVL Lab, making significant strides in the field.
"We are incredibly proud of Dr. Kim and Dr. Hong's achievements," said Prof. YM Ro, director of IVL Lab. "Their dedication and innovative research have left a lasting impact on IVL Lab, and we are excited to see the contributions they will make at DeepMind."
Dr. Kim and Dr. Hong's work at IVL Lab focused on human multimodal transformation, e.g., developing advanced multimodal learning models and exploring human-computer interaction through AI. Their research has been published in top-tier conferences and journals, showcasing the cutting-edge work produced at IVL Lab. You can explore their research further:
Dr. Kim's Research: https://sites.google.com/view/ms-dot-k
Dr. Hong's Research: https://joannahong.github.io/
This achievement highlights IVL Lab's commitment to fostering top talent and conducting impactful research in the field of artificial intelligence. We wish Dr. Kim and Dr. Hong all the best in their new endeavors at DeepMind and look forward to their future contributions to the AI community.
[#313] 2025-02-27 [CVPR 2025] SALOVA: Segment-Augmented Long Video Assistance for Targeted Retrieval and Routing in Long-Form Video Analysis (by Junho Kim, Hyunjun Kim) is accepted in CVPR 2025.
Title: SALOVA: Segment-Augmented Long Video Assistance for Targeted Retrieval and Routing in Long-Form Video Analysis
Junho Kim*, Hyunjun Kim*, Hosu Lee, Yong Man Ro (* equal contribution)
Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve this: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating a dynamic routing mechanism and a spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates an enhanced capability in processing complex long-form videos, showing a significant capability to maintain contextual integrity across extended sequences.
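A minimal, hypothetical sketch of the query-driven retrieval step described above: each densely captioned segment is embedded, scored against the user query, and the top-scoring segments are routed to the video-LLM. The toy embedding function and the top-k routing rule are assumptions for illustration, not SALOVA's actual retrieval module.

```python
# Illustrative sketch of query-to-segment retrieval; the embedding and routing rule are placeholders.
from typing import List
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy text embedding (deterministic within a run; a real system uses a learned encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve_segments(query: str, segment_captions: List[str], top_k: int = 2) -> List[int]:
    """Route the user query to the most relevant video segments by cosine similarity."""
    q = embed(query)
    scores = [float(q @ embed(c)) for c in segment_captions]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

if __name__ == "__main__":
    captions = ["a chef chops onions", "a soccer match highlight", "a player scores a goal"]
    print(retrieve_segments("who scored the goal?", captions))  # indices of retrieved segments
```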
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), CVPR 2025
[#312] 2025-02-27 [CVPR 2025] VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models (by Byung-Kwan Lee) is accepted in CVPR 2025.
Title: VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs’ layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
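A hedged PyTorch sketch of the layer-wise "verbalizer" idea described above: intermediate hidden states of both the small and large VLM are projected into vocabulary (natural-language) space, and the small model is trained to match the large model's verbalized distributions layer by layer. The KL objective, layer pairing, and dimensions are assumptions for illustration, not the paper's exact training recipe.

```python
# Hypothetical sketch of layer-wise verbalizer distillation; dimensions and loss are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Verbalizer(nn.Module):
    """Projects an intermediate layer's hidden states into vocabulary space."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)  # (batch, seq, vocab) logits

def layerwise_distill_loss(small_hidden, large_hidden, small_verb, large_verb):
    """KL divergence between verbalized distributions of matched small/large VLM layers."""
    loss = 0.0
    for h_s, h_l, v_s, v_l in zip(small_hidden, large_hidden, small_verb, large_verb):
        p_teacher = F.softmax(v_l(h_l), dim=-1).detach()
        log_p_student = F.log_softmax(v_s(h_s), dim=-1)
        loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return loss / len(small_hidden)

if __name__ == "__main__":
    vocab, d_small, d_large = 100, 64, 128
    small_hidden = [torch.randn(2, 8, d_small) for _ in range(3)]  # 3 matched layers
    large_hidden = [torch.randn(2, 8, d_large) for _ in range(3)]
    small_verb = [Verbalizer(d_small, vocab) for _ in range(3)]
    large_verb = [Verbalizer(d_large, vocab) for _ in range(3)]
    print(layerwise_distill_loss(small_hidden, large_hidden, small_verb, large_verb).item())
```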
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), CVPR 2025
[#311] 2024-12-10 [AAAI 2025] Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language (by Jeong Hun Yeo) is accepted in AAAI 2025.
Title: Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro
Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker-adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. However, the effectiveness of adapting to the language information, such as vocabulary choice, of the target speaker has not been explored in previous works. Moreover, existing datasets for speaker adaptation have limited vocabulary size and pose variations, limiting the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both the vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt it to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in wild, sentence-level lip reading for the first time. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, we show that the proposed adaptation method achieves larger improvements when applied to the target speaker compared to previous works.
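A hedged PyTorch sketch of the two adaptation mechanisms named above, prompt tuning and LoRA, applied to a frozen pre-trained layer and to the visual feature sequence. Module names, ranks, and prompt counts are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of speaker adaptation via learnable prompts + LoRA; sizes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a low-rank trainable update (W + B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pre-trained weights frozen
        self.lora_a = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.t() @ self.lora_b.t()

class SpeakerPrompts(nn.Module):
    """Learnable prompt tokens prepended to the visual feature sequence for one speaker."""
    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, features):             # features: (batch, seq, dim)
        batch = features.size(0)
        return torch.cat([self.prompts.expand(batch, -1, -1), features], dim=1)

if __name__ == "__main__":
    feats = torch.randn(2, 10, 32)
    adapted = SpeakerPrompts(num_prompts=4, dim=32)(feats)
    layer = LoRALinear(nn.Linear(32, 32), rank=4)
    print(layer(adapted).shape)               # torch.Size([2, 14, 32])
```

Only the prompt tokens and low-rank matrices are trained per speaker, which keeps the per-speaker adaptation lightweight while the pre-trained lip reading model stays frozen.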
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), AAAI 2025