[#315] 2025-04-18 [Fall 2025 Lab Student Recruitment] We invite talented students to research MLLMs (Multimodal Large Language Models) + (Vision, Audio, Language).
We are recruiting students for one government-funded master's position, KAIST scholarship positions (master's and PhD), and KAIST programs (KEPSI, EPSS, LGenius, EPSD).
The invited research areas are vision and multimodal research based on MLLMs.
MLLM (Multimodal large language model) + (Vision, Audio, Language)
Integration of Vision, Language, and Audio
Thank you for the interest and inquiries from many students.
Students who wish to study MLLM+ at the IVL Lab should request a meeting with the professor (ymro@kaist.ac.kr) after admission.
[#314] 2025-03-12 [Recruited by DeepMind] Dr. Minsu Kim and Dr. Joanna Hong have been recruited by DeepMind.
IVL Lab proudly announces that two of its PhD graduates, Dr. Minsu Kim and Dr. Joanna Hong, have been recruited by DeepMind, a world leader in artificial intelligence research. Both Dr. Kim and Dr. Hong were instrumental in pioneering Human Multimodal research during their time at IVL Lab, making significant strides in the field.
"We are incredibly proud of Dr. Kim and Dr. Hong's achievements," said Prof. YM Ro, director of IVL Lab. "Their dedication and innovative research have left a lasting impact on IVL Lab, and we are excited to see the contributions they will make at DeepMind."
Dr. Kim and Dr. Hong's work at IVL Lab focused on human multimodality transformation, e.g., developing advanced multimodal learning models and exploring human-computer interaction through AI. Their research has been published in top-tier conferences and journals, showcasing the cutting-edge work produced at IVL Lab. You can explore their research further:
Dr. Kim's Research: https://sites.google.com/view/ms-dot-k
Dr. Hong's Research: https://joannahong.github.io/
This achievement highlights IVL Lab's commitment to fostering top talent and conducting impactful research in the field of artificial intelligence. We wish Dr. Kim and Dr. Hong all the best in their new endeavors at DeepMind and look forward to their future contributions to the AI community.
[#313] 2025-02-27 [CVPR 2025] SALOVA: Segment-Augmented Long Video Assistance for Targeted Retrieval and Routing in Long-Form Video Analysis (by Junho Kim, Hyunjun Kim) has been accepted to CVPR 2025.
Title: SALOVA: Segment-Augmented Long Video Assistance for Targeted Retrieval and Routing in Long-Form Video Analysis
Junho Kim*, Hyunjun Kim*, Hosu Lee, Yong Man Ro (* equal contribution)
Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve it: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating a dynamic routing mechanism and a spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos and a strong ability to maintain contextual integrity across extended sequences.
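To make the retrieval-and-routing idea concrete, here is a minimal PyTorch sketch that scores captioned video segments against a user query and forwards only the top-k relevant ones toward the language model. Everything here (the SegmentRetriever class, embed_dim, top_k, and the random placeholder features) is an illustrative assumption, not the SALOVA implementation.

```python
# Hypothetical sketch of query-conditioned segment retrieval and routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentRetriever(nn.Module):
    """Scores captioned video segments against a user query and routes
    only the top-k most relevant segments toward the language model."""
    def __init__(self, embed_dim: int = 256, top_k: int = 3):
        super().__init__()
        self.top_k = top_k
        # Placeholder projections; a real system would use pretrained
        # vision/text encoders and a spatio-temporal projector.
        self.segment_proj = nn.Linear(embed_dim, embed_dim)
        self.query_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, segment_feats: torch.Tensor, query_feat: torch.Tensor):
        # segment_feats: (num_segments, embed_dim); query_feat: (embed_dim,)
        seg = F.normalize(self.segment_proj(segment_feats), dim=-1)
        qry = F.normalize(self.query_proj(query_feat), dim=-1)
        scores = seg @ qry                      # cosine relevance per segment
        top = torch.topk(scores, k=min(self.top_k, seg.size(0)))
        return top.indices, top.values          # segments routed to the LLM

# Toy usage: 12 segments of a long video, one query embedding.
retriever = SegmentRetriever()
indices, relevance = retriever(torch.randn(12, 256), torch.randn(256))
print("routed segments:", indices.tolist())
```

In SALOVA itself, as described above, the retrieved segments are handled by the dynamic routing mechanism and spatio-temporal projector rather than simple linear projections; the sketch only illustrates the query-conditioned selection step.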
[#312] 2025-02-27 [CVPR 2025] VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models (by Byung-Kwan Lee) has been accepted to CVPR 2025.
Title: VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs’ layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
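As a rough illustration of the layer-wise "verbalizer" idea, the sketch below maps intermediate hidden states of a small (student) and a large (teacher) model into a shared vocabulary space and matches them layer by layer with a KL divergence. The names (Verbalizer, layerwise_distill_loss) and toy dimensions are assumptions for illustration, not the VLsI code.

```python
# Hedged sketch of layer-wise distillation with intermediate verbalizers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Verbalizer(nn.Module):
    """Maps an intermediate hidden state into the vocabulary (natural
    language) space so layers of different models become comparable."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(self.to_vocab(hidden), dim=-1)

def layerwise_distill_loss(small_hiddens, large_hiddens, small_verbs, large_verbs):
    """KL divergence between verbalized layer outputs of the small and
    large VLMs, averaged over the matched layers."""
    losses = []
    for h_s, h_l, v_s, v_l in zip(small_hiddens, large_hiddens,
                                  small_verbs, large_verbs):
        log_p_small = v_s(h_s)                   # student layer, log-probs
        with torch.no_grad():
            log_p_large = v_l(h_l)               # teacher layer, no gradient
        losses.append(F.kl_div(log_p_small, log_p_large,
                               log_target=True, reduction="batchmean"))
    return torch.stack(losses).mean()

# Toy usage: 4 matched layers, a batch of 2 token hidden states each.
vocab = 1000
small_verbs = [Verbalizer(128, vocab) for _ in range(4)]
large_verbs = [Verbalizer(512, vocab) for _ in range(4)]
small_hiddens = [torch.randn(2, 128) for _ in range(4)]
large_hiddens = [torch.randn(2, 512) for _ in range(4)]
print(layerwise_distill_loss(small_hiddens, large_hiddens, small_verbs, large_verbs))
```

The point of verbalizing each layer, as the abstract notes, is that the student is supervised on intermediate reasoning rather than only on final-layer outputs, which mitigates the instability of pure output imitation.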
[#311] 2024-12-10 [AAAI 2025] Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language (by Jeong Hun Yeo) has been accepted to AAAI 2025.
Title: Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro
Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker-adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. However, the effectiveness of adapting to the target speaker's language information, such as vocabulary choice, has not been explored in previous works. Moreover, existing datasets for speaker adaptation have limited vocabulary size and pose variations, which limits the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both the vision and language levels. Specifically, we integrate prompt tuning and LoRA, applying them to a pre-trained lip reading model to effectively adapt it to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in wild, sentence-level lip reading for the first time. Through various experiments, we demonstrate that existing speaker-adaptive methods also improve performance in the wild at the sentence level, and that the proposed adaptation method achieves larger improvements on the target speaker compared to previous works.
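To make the adaptation recipe concrete, the sketch below combines learnable speaker prompts (prompt tuning) with a low-rank LoRA update applied to a frozen layer of a pretrained model. All module names and sizes (LoRALinear, SpeakerAdapter, hidden_dim, num_prompts) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of speaker adaptation via prompt tuning plus LoRA.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class SpeakerAdapter(nn.Module):
    """Prepends learnable speaker prompts and applies LoRA to a frozen layer."""
    def __init__(self, hidden_dim: int = 256, num_prompts: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)
        self.layer = LoRALinear(nn.Linear(hidden_dim, hidden_dim))

    def forward(self, lip_feats: torch.Tensor) -> torch.Tensor:
        # lip_feats: (seq_len, hidden_dim) visual features of lip movements
        x = torch.cat([self.prompts, lip_feats], dim=0)  # prompt tuning
        return self.layer(x)                             # LoRA-adapted layer

# Toy usage: adapt a short sequence of visual features for one speaker.
adapter = SpeakerAdapter()
out = adapter(torch.randn(20, 256))
print(out.shape)  # torch.Size([28, 256]): prompts + original frames
```

Only the prompts and the low-rank matrices are trained per speaker, which keeps the per-speaker adaptation lightweight while the pretrained lip reading backbone stays fixed.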