Notice

[#293]   2024-05-19  [Recent Ph.D. graduate: postdoc]  Dr. Minsu Kim, a 2024 Ph.D. graduate, has joined META as a postdoctoral researcher in AI research.

Dr. Minsu Kim, who received his Ph.D. in February 2024, has joined the AI research group at META in London as a postdoctoral researcher. We extend our congratulations to him and hope that he will achieve outstanding results in AI research. By combining the research skills he developed during his Ph.D. at the IVY and LVL labs, particularly in human multimodal AI, with the cutting-edge research he will undertake at META, we believe Dr. Kim will make significant contributions to the field of AI.


[#292]   2024-05-19  [Amazon, Google Internships]   Sungjune and Se Jin will join Amazon and Google for research internships, respectively.

Two PhD students from the IVY lab have secured research internships at Amazon and Google in the USA, both leading companies in the field of AI. Sungjune Park will join Amazon, and Se Jin Park will join Google, to enhance their ongoing research during their PhD studies. Sungjune Park has published several top-tier papers on multimodal AI, focusing on integrating vision and language, while Se Jin Park has published several top-tier papers on human multimodal AI, specifically on the ability to process and understand human-relevant modalities such as spoken language and facial-audio expressions. They expect to complete a paper as an outcome of their internships. This research internship experience will enable them to expand and deepen their PhD research, thereby building global competitiveness.


[#291]   2024-05-16  [ACL 2024]   CoLLaVO: Crayon Large Language and Vision mOdel (Byung-Kwan Lee) is accepted in Findings of the Association for Computational Linguistics, ACL 2024

Title: CoLLaVO: Crayon Large Language and Vision mOdel

Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.


IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), ACL 2024

[#290]   2024-05-16  [ACL 2024]   Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation (Se Jin Park, Chae Won Kim) accepted in Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL 2024

Title: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Authors: Se Jin Park*, Chae Won Kim*, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Man Ro (* equally contributed)

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 387 hours of approximately 10,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. The MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. All the data will be open-sourced.



[#289]   2024-04-26  [Pattern Recognition]  Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank (by Sungjune Park, Hyunjun Kim) is accepted in Pattern Recognition

Title: Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank 

Authors: {Sungjune Park, Hyunjun Kim: equal first authors}, and Yong Man Ro

Pedestrian detection is a crucial field of computer vision research which can be adopted in various real-world applications (e.g., self-driving systems). However, despite the noticeable evolution of pedestrian detection, the pedestrian representations learned within a detection framework are usually limited to the particular scene data in which they were trained. Therefore, in this paper, we propose a novel approach to construct a versatile pedestrian knowledge bank containing representative pedestrian knowledge which can be applicable to various detection frameworks and adopted in diverse scenes. We extract generalized pedestrian knowledge from a large-scale pretrained model, and we curate it by quantizing the most representative features and guiding them to be more distinguishable from various background scenes. After they are stored in the versatile pedestrian knowledge bank, we leverage them to complement and enhance pedestrian features within a detection framework. Through comprehensive experiments, we validate the effectiveness of our method, demonstrating its versatility and surpassing state-of-the-art detection performance.




[#288]   2024-03-26  [IEEE TCSVT]  Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection (by Sungjune Park, Hyunjun Kim) is accepted in IEEE Trans. on CSVT

Title: Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection 

Authors: {Sungjune Park, Hyunjun Kim: equal first authors}, and Yong Man Ro

Large language models (LLMs) have shown their capability in understanding contextual and semantic information regarding appearance knowledge of instances. In this paper, we introduce a novel approach to utilize the strength of an LLM in understanding contextual appearance variations and to leverage its knowledge into a vision model (here, pedestrian detection). While pedestrian detection is considered one of the crucial tasks directly related to our safety (e.g., intelligent driving systems), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus which includes numerous narratives describing various appearances of pedestrians and others. By feeding them through an LLM, we extract appearance knowledge sets that contain the representations of appearance variations. After that, we perform a task-prompting process to obtain appearance elements which are representative appearance knowledge guided to be relevant to a downstream pedestrian detection task. The obtained knowledge elements are adaptable to various detection frameworks, so that we can provide plentiful appearance information by integrating the language-derived appearance elements with visual cues within a detector. Through comprehensive experiments with various pedestrian detectors, we verify the adaptability and effectiveness of our method, showing noticeable performance gains and achieving state-of-the-art detection performance on two public pedestrian detection benchmarks (i.e., CrowdHuman and WiderPedestrian).



[#287]   2024-03-12  [Fall 2024 Graduate Student Recruitment]  We are recruiting two government-funded M.S. students, one KAIST-funded Ph.D. student, and industry-sponsored students. Interested students should email ymro@kaist.ac.kr.

We are recruiting two government-funded M.S. students, one KAIST-funded Ph.D. student, and industry-sponsored students.

Research areas

Interested students should email ymro@kaist.ac.kr.




[#286]   2024-02-27  [CVPR 2024]  Causal Mode Multiplexer: A Novel Framework for Unbiased Data (by Taeheon Kim) is accepted in CVPR 2024

Title: Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

Authors: {Taeheon Kim, Sebin Shin: equal first authors}, Youngjoon Yu, Hak Gu Kim, and Yong Man Ro

RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However, the modality bias problem remains unsolved as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection mainly distribute between ROTO (day) and RXTO (night) data; the majority of the pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not presented in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release our new dataset to the public for future research.



[#285]   2024-02-27  [CVPR 2024]  AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation (by Se Jin Park, Minsu Kim) is accepted in CVPR 2024

Title: AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Authors:  {Jeongsoo Choi, Se Jin Park, Minsu Kim: equal first authors}, and Yong Man Ro

This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting.



[#284]   2024-02-27  [IEEE TMM]  AKVSR: Compressing Audio Knowledge of a Pretrained Model (by Jeong Hun Yeo) is accepted in IEEE Trans. on Multimedia

Title: AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio  Knowledge of a Pretrained Model

Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, and Yong Man Ro

Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of the visual modality by using the audio modality. Different from previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in a compact audio memory by discarding the non-linguistic information from the audio through quantization, and 3) includes an Audio Bridging Module which can find the best-matched audio features from the compact audio memory, making training possible without audio inputs once the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performance on the widely-used LRS3 dataset.
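The memory lookup at the heart of the Audio Bridging Module can be illustrated with a small sketch: given a visual feature, retrieve the most similar entry from a fixed bank of audio embeddings, so no audio is needed at test time. The vectors and similarity measure below are toy stand-ins, not the paper's implementation.

```python
import math

# Hypothetical compact audio memory: a handful of quantized audio embeddings
# (stand-in vectors; in AKVSR these would come from a pretrained audio model).
audio_memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def bridge(visual_feat, memory):
    """Return the best-matched audio feature for a visual query --
    the core of a memory 'bridging' lookup, needing no audio input."""
    return max(memory, key=lambda m: cosine(visual_feat, m))

visual_feat = [0.9, 0.1, 0.05]   # stand-in visual feature from lip frames
retrieved = bridge(visual_feat, audio_memory)
print(retrieved)   # → [1.0, 0.0, 0.0]
```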



[#283]   2024-02-22  Recruitment for PhD and MS Students

Title: Recruitment for PhD and MS Students

The IVY Laboratory is promoting international exchanges. For students applying to join the lab after September 2024, we prefer Ph.D. candidates who aim to pursue international opportunities after completing their doctoral program. We also invite master's degree candidates, including those who aspire to pursue a Ph.D. abroad or seek international career paths. We particularly welcome students who already have strong interest and experience in deep learning-based research. Interested students are encouraged to contact us via email at ymro@kaist.ac.kr.

We look forward to hearing from you.

[#282]   2024-02-21  Prof. Yong Man Ro Named ICT Endowed Chair Professor at KAIST

Title: Prof. Yong Man Ro Named ICT Endowed Chair Professor at KAIST

Prof. Yong Man Ro has been appointed as the ICT Endowed Chair Professor at KAIST. Since establishing the IVY Lab in 1997, Prof. Ro has been instrumental in advancing research in image processing, computer vision, artificial intelligence (AI), and multimedia.

Under his guidance, the IVY Lab has achieved remarkable milestones, including the graduation of 25 PhD and 70 Master's students, who have gone on to make significant contributions to the IT field worldwide. The laboratory's research output is highly competitive, including more than 520 peer-reviewed journal articles and top conference papers.



[#281]   [ICASSP 2024]   Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens (by Minsu Kim) is accepted in ICASSP 2024

Title: Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

Authors: Minsu Kim,  Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo,  Shinji Watanabe, and Yong Man Ro

In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, we can drastically reduce the required data storage for saving image data to just 0.8% when compared to the original image data in terms of bits. Demo page: bit.ly/3Z9T6LJ
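The image-unit idea can be illustrated with a toy vector-quantization sketch: each feature is replaced by the index of its nearest codeword, so only small integers need to be stored instead of raw pixel data. The codebook, features, and sizes below are invented for illustration; the paper's 0.8% figure depends on its actual tokenizer.

```python
# Toy vector quantization: map each feature vector to the index of its
# nearest codeword, then compare index storage against raw image bits.

def quantize(feature, codebook):
    """Index of the nearest codeword (squared Euclidean distance)."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]
    return dists.index(min(dists))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4 units -> 2 bits each
patches = [[0.1, 0.2], [0.9, 0.1], [0.8, 0.9]]               # toy patch features

units = [quantize(p, codebook) for p in patches]
print(units)   # → [0, 1, 3]

# Storage comparison: discrete indices vs. a raw 224x224 RGB image.
raw_bits = 224 * 224 * 3 * 8
unit_bits = len(units) * 2
print(f"storage ratio: {unit_bits / raw_bits:.6%}")
```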



[#280]   2023-12-20   [ICASSP 2024]   Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper (by Jeong Hun Yeo and Minsu Kim) is accepted in ICASSP 2024

Title: Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, and Yong Man Ro

This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention. To this end, we employ a Whisper model which can conduct both language identification and audio-based speech recognition. It serves to filter data of the desired languages and transcribe labels from the unannotated, multilingual audio-visual data pool. By comparing the performances of VSR models trained on automatic labels and the human-annotated labels, we show that we can achieve similar VSR performance to that of human-annotated labels even without utilizing human annotations. Through the automated labeling process, we label large-scale unlabeled multilingual databases, VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low-resource VSR languages: French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in four languages, significantly surpassing the previous methods. The automatic labels are available online: bit.ly/3Lajr6w
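The automatic-labeling pipeline can be sketched as: run a model that both identifies the language and transcribes the audio, keep only clips in the target languages, and pair the transcripts with the video as VSR labels. The `transcribe` function below is a stand-in for Whisper, and the data records are hypothetical.

```python
# Schematic auto-labeling: language-ID filtering + ASR transcripts as labels.
TARGET_LANGS = {"fr", "it", "es", "pt"}   # the four target languages

def transcribe(clip):
    # Stand-in for a Whisper call: returns (detected language, transcript).
    return clip["lang"], clip["audio_text"]

def auto_label(pool):
    labeled = []
    for clip in pool:
        lang, text = transcribe(clip)
        if lang in TARGET_LANGS:          # keep only desired languages
            labeled.append({"video": clip["video"], "text": text, "lang": lang})
    return labeled

pool = [
    {"video": "a.mp4", "lang": "fr", "audio_text": "bonjour"},
    {"video": "b.mp4", "lang": "en", "audio_text": "hello"},
]
labels = auto_label(pool)
print(labels)   # only the French clip survives the filter
```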



[#279]   2023-12-20   [ICASSP 2024]   Persona Extraction through Semantic Similarity for Emotional Support Conversation Generation (by Seunghee Han) is accepted in ICASSP 2024

Title: Persona Extraction through Semantic Similarity for Emotional Support Conversation Generation

Authors: Seunghee Han,  Se Jin Park,  Chae Won Kim, and Yong Man Ro

Providing emotional support through dialogue systems is becoming increasingly important in today’s world, as it can support both mental health and social interactions in many conversation scenarios. Previous works have shown that using persona is effective for generating empathetic and supportive responses. They have often relied on pre-provided persona rather than inferring them during conversations. However, it is not always possible to obtain a user persona before the conversation begins. To address this challenge, we propose PESS (Persona Extraction through Semantic Similarity), a novel framework that can automatically infer informative and consistent persona from dialogues. We devise completeness loss and consistency loss based on semantic similarity scores. The completeness loss encourages the model to generate missing persona information, and the consistency loss guides the model to distinguish between consistent and inconsistent persona. Our experimental results demonstrate that high-quality persona information inferred by PESS is effective in generating emotionally supportive responses.
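The two losses can be sketched with cosine similarity over sentence embeddings: a completeness-style term pulls the generated persona toward the reference, and a consistency-style margin term keeps it closer to consistent than to inconsistent persona. The embeddings and margin below are illustrative stand-ins, not the paper's exact formulation.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-in sentence embeddings (a real system would use a sentence encoder).
generated = [0.9, 0.1, 0.0]        # persona sentence inferred from dialogue
reference = [1.0, 0.0, 0.0]        # ground-truth persona sentence
contradiction = [-1.0, 0.2, 0.0]   # inconsistent persona sentence

# Completeness-style term: pull the generated persona toward the reference.
completeness = 1.0 - cos_sim(generated, reference)

# Consistency-style margin term: keep the generated persona closer to the
# consistent reference than to the inconsistent one, by a margin.
margin = 0.5
consistency = max(0.0, margin - (cos_sim(generated, reference) - cos_sim(generated, contradiction)))

print(round(completeness, 3), consistency)   # → 0.006 0.0
```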



[#278]   2023-12-20   [ICASSP 2024]   Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models (by Jeongsoo Choi) is accepted in ICASSP 2024

Title: Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models 

Authors: Jeongsoo Choi,  Minsu Kim,  Se Jin Park,  and Yong Man Ro

In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the necessity of recording speech for each inference, as required in the audio-driven model. To this end, we propose to embed the input text into the learned audio latent space of the pre-trained audio-driven model, while preserving the face synthesis capability of the original pretrained model. Specifically, we devise a Text-to-Audio Embedding Module (TAEM) which maps a given text input into the audio latent space by modeling pronunciation and duration characteristics. Furthermore, to consider the speaker characteristics in audio while using text inputs, TAEM is designed to accept a visual speaker embedding. The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio. The main advantages of the proposed framework are that 1) it can be applied to diverse audio-driven talking face synthesis models and 2) we can generate talking face videos with either text inputs or audio inputs with high flexibility.



[#277]   2023-12-20   [ICASSP 2024]   Exploring Phonetic Context-aware Lip-Sync for Talking Face Generation (by Se Jin Park) is accepted in ICASSP 2024

Title: Exploring Phonetic Context-aware Lip-Sync for Talking Face Generation

Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, and Yong Man Ro

Talking face generation is the challenging task of synthesizing a natural and realistic face that requires accurate synchronization with a given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies upon the phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement of the target face. CALS is comprised of an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained based on masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. From extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which the phonetic context assists in lip synchronization and find the effective window size for lip generation to be approximately 1.2 seconds.



[#276]   2023-12-10   [AAAI 2024]   OSR via Visual Prompts from Common-Sense Knowledge (by Seongyeop Kim) is accepted in AAAI 2024

Title: Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge

Authors: Seongyeop Kim, Hyung-Il Kim, and Yong Man Ro

Open Set Recognition (OSR) poses significant challenges in distinguishing known from unknown classes. In OSR, the overconfidence problem has become a persistent obstacle, where visual recognition models often misclassify unknown objects as known objects with high confidence. This issue stems from the fact that visual recognition models often lack the integration of common-sense knowledge, a feature that is naturally present in language-based models but lacking in visual recognition systems. In this paper, we propose a novel approach to enhance OSR performance by distilling common-sense knowledge into visual prompts. Utilizing text prompts that embody common-sense knowledge about known classes, the proposed visual prompt is learned by extracting semantic common-sense features and aligning them with image features from visual recognition models. The unique aspect of this work is the training of individual visual prompts for each class to encapsulate this common-sense knowledge. Our methodology is model-agnostic, capable of enhancing OSR across various visual recognition models, and computationally light as it focuses solely on training the visual prompts. This research introduces a method for addressing OSR, aiming at a more systematic integration of visual recognition systems with common-sense knowledge. The obtained results indicate an enhancement in recognition accuracy, suggesting the applicability of this approach in practical settings.



[#275]   2023-12-04   [IEEE TDSC]   Defending Video Recognition Model against Adversarial Perturbations via Defense Patterns (by Hong Joo Lee) is accepted in IEEE TDSC

Title: Defending Video Recognition Model against Adversarial Perturbations via Defense Patterns

Authors: Hong Joo Lee and Yong Man Ro

Deep Neural Networks (DNNs) have been widely successful in various domains, but they are vulnerable to adversarial attacks. Recent studies have also demonstrated that video recognition models are also susceptible to adversarial perturbations, but the existing defense strategies in the image domain do not transfer well to the video domain due to the lack of considering temporal development and require a high computational cost for training video recognition models. This paper, first, investigates the temporal vulnerability of video recognition models by quantifying the effect of temporal perturbations on the model’s performance. Based on these investigations, we propose Defense Patterns (DPs) that can effectively protect video recognition models by adding them to the input video frames. The DPs are generated on top of a pre-trained model, eliminating the need for retraining or fine-tuning, which significantly reduces the computational cost. Experimental results on two benchmark datasets and various action recognition models demonstrate the effectiveness of the proposed method in enhancing the robustness of video recognition models.

Note: This work was done when Dr. Lee was a PhD student at KAIST. He is now a Postdoctoral Researcher at the Technical University of Munich (TUM).

[#274]   2023-10-08   [EMNLP 2023]   Intuitive Multilingual AVSR with a Single-Trained Model (Joanna Hong) is accepted in Findings of the Association for Computational Linguistics, EMNLP 2023

Title:  Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

Authors: Joanna Hong, Se Jin Park, and Yong Man Ro

We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model on a multilingual dataset. Motivated by the human cognitive system, where humans can intuitively distinguish different languages without any conscious effort or guidance, the proposed model can capture which language is given as an input speech by distinguishing the inherent similarities and differences between languages. To do so, we design prompt fine-tuning into the largely pretrained audio-visual representation model in order to provide language information, both label and nuance. Thus, the network can predict the correct speech with the correct language. To verify the effectiveness of the proposed model, we conduct experiments on a multilingual audio-visual corpus, namely MuAViC, containing 9 languages. Our work contributes to developing more robust and efficient multilingual audio-visual speech recognition systems, reducing the need for language-specific models.

[#273]   2023-10-08   [IEEE TNNLS]   Enabling Visual Object Detection with Object Sounds via Visual Modality Recalling Memory  (by Jung Uk Kim) is accepted in IEEE TNNLS

Title: Enabling Visual Object Detection with Object Sounds via Visual Modality Recalling Memory 

Authors: Jung Uk Kim and Yong Man Ro

When humans hear the sound of an object, they recall associated visual information and integrate the sound with the recalled visual information to detect the object. In this paper, we present a novel sound-based object detector that mimics these processes of humans. We design a Visual Modality Recalling (VMR) memory that recalls information of the visual modality, given an audio modal input (i.e., sound). To achieve this goal, we propose a visual modality recalling loss and an audio-visual association loss to guide the VMR memory to memorize the visual modal information by establishing associations between the audio and visual modalities. With the recalled visual modal information through the VMR memory and the original audio modal input, audio-visual integration is conducted. In this step, we introduce an integrated feature contrastive loss which allows the integrated feature to be embedded as if it were encoded using both the audio and visual modalities. This guidance enables our sound-based object detector to perform robust object detection even when only sound is provided. We believe that our work is a cornerstone study that can provide a new perspective to conventional object detection studies that rely only on the visual modality. Comprehensive experimental results demonstrate the effectiveness of the proposed method with VMR memory.

Note: This work was done when Dr. Kim was a PhD student at KAIST. He is now a professor at KyungHee University.

[#272]   2023-10-04   Spring 2024 Graduate Student Recruitment

Congratulations to the students admitted for Spring 2024. For the Spring 2024 semester, the lab is recruiting two government-funded Ph.D. students (filled), two KAIST-funded Ph.D. students (filled), one KAIST-funded M.S. student, and industry-sponsored students (KEPSI, EPSS, LGenius).


Research areas

- Multi-modal (vision-sound-language) learning

- Deep learning AI (XAI, Competency, Robustness)

- Vision (object detection, classification etc.) + Large Scale Model


Recent international conference publications on deep learning by our M.S. and Ph.D. students - link

Recent international journal publications by our M.S. and Ph.D. students - link


For admission inquiries, please email Prof. Yong Man Ro (ymro@kaist.ac.kr).

[#271]   2023-07-17   [ICCV 2023]   Lip Reading for Low-resource Languages by General Speech Knowledge (by Minsu Kim and Jeong Hun Yeo) is accepted in ICCV 2023

Title: Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

Authors: Minsu Kim*, Jeong Hun Yeo*, Jeongsoo Choi, and Yong Man Ro (* equally contributed)

This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes, thus general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated.

[#270]   2023-07-17   [ICCV 2023]   Mitigating Adversarial Vulnerability through Causal Parameter Estimation (by Byung-Kwan Lee and Junho Kim) is accepted in ICCV 2023

Title: Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning

Authors: Byung-Kwan Lee*, Junho Kim*, and Yong Man Ro (* equally contributed)

Adversarial examples derived from deliberately crafted perturbations on visual inputs can easily harm the decision process of deep neural networks. To prevent potential threats, various adversarial training-based defense methods have grown rapidly and become a de facto standard approach for robustness. Despite recent competitive achievements, we observe that adversarial vulnerability varies across targets and that certain vulnerabilities remain prevalent. Intriguingly, this peculiar phenomenon cannot be relieved even with deeper architectures and advanced defense methods. To address this issue, we introduce a causal approach called Adversarial Double Machine Learning (ADML), which allows us to quantify the degree of adversarial vulnerability of network predictions and capture the effect of treatments on outcomes of interest. ADML can directly estimate the causal parameter of adversarial perturbations per se and mitigate negative effects that can potentially damage robustness, bringing a causal perspective to adversarial vulnerability. Through extensive experiments on various CNN and Transformer architectures, we corroborate that ADML improves adversarial robustness by large margins and relieves the empirically observed vulnerability.

[#269]   2023-07-17   [ICCV 2023]   DiffV2S: Diffusion-based Video-to-Speech Synthesis (by Jeongsoo Choi and Joanna Hong) is accepted in ICCV 2023

Title: DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

Authors:  Jeongsoo Choi*, Joanna Hong*, and Yong Man Ro (* equally contributed)

Recent research has demonstrated impressive results in video-to-speech synthesis, which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve this issue, they have adopted an extra speaker embedding, extracted from reference auditory information, as a speaking style guidance. Nevertheless, it is not always possible to obtain the audio information corresponding to the video input, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and a P-tuning technique. In doing so, rich speaker embedding information can be produced solely from the input visual information, and no extra audio information is necessary at inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only maintains the phoneme details contained in the input video frames, but also creates a highly intelligible mel-spectrogram in which the identities of multiple speakers are all preserved. Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.

[#268]   2023-06-23   [ICIP 2023]   2 papers have been accepted (Sungjune Park and Yeon Ju Kim) in IEEE ICIP 2023

1. Title: Robust multispectral pedestrian detection via spectral position-free feature mapping

Authors: Sungjune Park, Jung Uk Kim, Jin Mo Song, and Yong Man Ro

Abstract: Although multispectral pedestrian detection has recently achieved remarkable performance, one problem remains to be handled: the position shift problem. Due to this problem, a pedestrian appears at different positions in each modal image. A single bounding box then usually fails to capture the entire pedestrian properly in both modal images at the same time, meaning that it misses some parts of the pedestrian and includes noisy background instead. In this paper, we propose a novel approach: a pedestrian feature mapping from mis-captured pedestrian features to well-captured pedestrian features which encode an entire pedestrian properly in both modal images. To this end, we utilize a memory architecture which stores well-captured pedestrian features; the well-captured features can then enhance the quality of pedestrian representations by providing distinctive pedestrian information. We validate the effectiveness of our approach with comprehensive experiments on two multispectral pedestrian detection datasets, achieving state-of-the-art performance.

2. Title: Mitigating Dataset Bias in Image Captioning through CLIP Confounder-free Captioning Network

Authors: YeonJu Kim, Junho Kim, Byung-Kwan Lee, Sebin Shin, and Yong Man Ro

Abstract: Dataset bias has been identified as a major challenge in image captioning. When an image captioning model predicts a word, it should consider the visual evidence associated with the word, but the model tends to rely on contextual evidence from the dataset bias, resulting in biased captions, especially when the dataset is biased toward specific situations. To solve this problem, we approach the task from a causal inference perspective and design a causal graph. Based on the causal graph, we propose a novel method named C2Cap, a CLIP confounder-free captioning network. We use the global visual confounder to control the confounding factors in the image and train the model to produce debiased captions. We validate our proposed method on the MSCOCO benchmark and demonstrate its effectiveness.

[#267]   2023-06-22   [Image and Vision Computing]   Adversarial anchor-guided feature refinement for adversarial defense (Hakmin Lee) is accepted in ELSEVIER Image and Vision Computing

Title: Adversarial anchor-guided feature refinement for adversarial defense

Authors: Hakmin Lee and Yong Man Ro 

Abstract: Adversarial training (AT), known as a robust training method for defending against adversarial examples, usually degrades model performance on clean examples due to the feature distribution discrepancy between clean and adversarial examples. In this paper, we propose a novel Adversarial Anchor-guided Feature Refinement (AAFR) defense method aimed at reducing this discrepancy and delivering reliable performance on both clean and adversarial examples. We devise an adversarial anchor that detects whether a feature comes from a clean or adversarial example. We then use the adversarial anchor to refine the feature and reduce the discrepancy. As a result, the proposed method substantially improves adversarial robustness while preserving performance on clean examples. The effectiveness of the proposed method is verified with comprehensive experiments on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets.

[#266]   2023-06-16   Fall 2023 Student Recruitment

Congratulations to the fall 2023 students. For the fall semester of 2023, the lab is recruiting one government-funded M.S. student, industry-sponsored students (KEPSI, EPSS, LGenius), government-funded Ph.D. students (completed), and others.

Research areas

- Multi-modal (vision-sound-language) learning

- Deep learning AI (XAI, Competency, Robustness)

- Large Scale Model + Alpha


Recent international conference publications on deep learning by the lab's M.S. and Ph.D. students - link

Recent international journal publications by the lab's M.S. and Ph.D. students - link


For inquiries about lab admission, please email Prof. Yong Man Ro (ymro@kaist.ac.kr).

[#265]   2023-06-09   [IEEE TIFS]   Robust Proxy (Hong Joo Lee) is accepted in IEEE Transactions on Information Forensics and Security

Title: Robust Proxy: Improving Adversarial Robustness by Robust Proxy Learning

Authors: Hong Joo Lee and Yong Man Ro

Recently, it has become widely known that deep neural networks are highly vulnerable and easily broken by adversarial attacks. To mitigate this adversarial vulnerability, many defense algorithms have been proposed. To improve adversarial robustness, many recent works try to enhance feature representations by imposing more direct supervision on discriminative features. However, existing approaches lack an understanding of learning adversarially robust feature representations. In this paper, we propose a novel training framework called Robust Proxy Learning. In the proposed method, the model explicitly learns robust feature representations with robust proxies. To this end, we first demonstrate that we can generate class-representative robust features by adding class-wise robust perturbations. We then use the class-representative features as robust proxies. With the class-wise robust features, the model explicitly learns adversarially robust features through the proposed robust proxy learning framework. Through extensive experiments, we verify that we can manually generate robust features and that our proposed learning framework increases the robustness of DNNs.

[#264]   2023-05-30   [Recent Ph.D. graduates: postdocs]   Ph.D graduates of 2023 have joined postdocs in AI research at UIUC and TUM.

Dr. Sangmin Lee and Dr. Hong Joo Lee, who received their Ph.D.s in 2023, have joined the AI research groups at the University of Illinois Urbana-Champaign (UIUC) and the Technical University of Munich (TUM) as postdocs, respectively. We congratulate them and hope that they will build global competitiveness in AI research by combining the research skills developed during their Ph.D.s at the IVY lab with the research they will undertake at their new institutions.

In addition, Dr. Jung Uk Kim, who received his Ph.D. in 2022, was appointed as a professor in the School of Computing at Kyung Hee University last year.

Recent Ph.D. graduates from the IVY lab have achieved excellent AI research results and have been selected as postdocs at top AI institutes and as professors in AI research. Prof. Hak Gu Kim and Prof. Sung Tae Kim, who were postdocs at EPFL and TUM, respectively, returned to Korea a few years ago and are continuing their AI research as professors at Chung-Ang University and Kyung Hee University, respectively.

[#263]   2023-05-29   [META, CMU Internships]   Joanna and Minsu will join META and CMU for research internships, respectively.

Two Ph.D. students from the human multimodal AI research group in the IVY lab have secured research internships at META and CMU, leading institutes in the AI field. Joanna Hong will join META (https://about.meta.com/realitylabs/ ) and Minsu Kim will join CMU (https://lti.cs.cmu.edu/work ) for a few months, respectively. They have both published several top-tier papers on human multimodal AI, which deals with the ability to process and understand human-related modalities such as facial expression, speech, and language. They expect to collaborate with their mentors and colleagues at the two institutes to publish a top-tier paper during their internships. This experience will enable them to expand and deepen their Ph.D. research and build their global competitiveness.

[#262]   2023-05-19   [Interspeech 2023]   Intelligible Lip-to-speech Synthesis with Speech Units (by Jeongsoo Choi and Minsu Kim) is accepted in Interspeech 2023

Title: Intelligible Lip-to-speech Synthesis with Speech Units

Authors: Jeongsoo Choi, Minsu Kim, and Yong Man Ro

In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing intelligible speech from a silent lip movement video. Specifically, to complement the insufficient supervisory signal of previous L2S models, we propose to use quantized self-supervised speech representations, named speech units, as an additional prediction target of the proposed L2S model. The proposed L2S model is therefore trained to generate multiple targets: a mel-spectrogram and speech units. As the speech units are discrete representations while the mel-spectrogram is continuous, the proposed multi-target L2S model can be trained with strong content supervision, even without using text-labeled data. Moreover, to accurately convert the synthesized mel-spectrogram into a waveform, we introduce a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units. Evaluation results confirm the effectiveness of the proposed method.
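The multi-target idea above, one continuous mel-spectrogram target plus one discrete speech-unit target, can be sketched as a combined training loss. The shapes, names, and the simple L1 + cross-entropy pairing below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def multi_target_l2s_loss(pred_mel, true_mel, unit_logits, true_units, unit_weight=1.0):
    """Combined L2S loss: L1 on the continuous mel-spectrogram target plus
    cross-entropy on the discrete speech-unit target.
    Hypothetical shapes: mel (T, n_mels), unit_logits (T, n_units),
    true_units (T,) integer unit indices."""
    # Continuous target: L1 reconstruction of the mel-spectrogram
    mel_loss = np.mean(np.abs(pred_mel - true_mel))
    # Discrete target: cross-entropy over predicted speech units
    logits = unit_logits - unit_logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    unit_loss = -np.mean(log_probs[np.arange(len(true_units)), true_units])
    return mel_loss + unit_weight * unit_loss
```

Because the unit term is a classification loss over discrete tokens, it supplies the strong content supervision the abstract describes even when no text labels are available.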


[#261]   2023-05-19   [Frontiers in Medicine]   Deep learning-based classification for Ophthalmology (Hyebin Lee) is accepted in Frontiers in Medicine

Title: Deep learning-based classification system of bacterial keratitis and fungal keratitis using anterior segment images

Authors:  Yeo Kyoung Won*, Hyebin Lee*, Youngjun Kim, Gyule Han, Tae-Young Chung, Yong Man Ro and Dong Hui Lim

(* equal contributor)

Introduction: Infectious keratitis is a vision-threatening disease. Bacterial and fungal keratitis are often confused in the early stages, so a correct diagnosis and treatment optimized for the causative organism are crucial. Antibacterial and antifungal medications are completely different, and the prognosis for fungal keratitis is much worse. Since the identification of microorganisms takes a long time, empirical treatment must be started according to the appearance of the lesion before an accurate diagnosis is made. Thus, we developed an automated deep learning (DL) based diagnostic system for bacterial and fungal keratitis based on anterior segment photographs, using two proposed modules: the Lesion Guiding Module (LGM) and the Mask Adjusting Module (MAM).

[#260]   2023-04-27   [IEEE TIP]   Stereoscopic Vision Recalling Memory for Monocular 3D Object Detection (Jung Uk Kim) is accepted in IEEE Transactions on Image Processing

Title: Stereoscopic Vision Recalling Memory for Monocular 3D Object Detection

Authors: Jung Uk Kim, Hyung-Il Kim, and Yong Man Ro

Monocular 3D object detection has drawn increasing attention in various human-related applications, such as autonomous vehicles, due to its cost-effective property. On the other hand, a monocular image alone inherently contains insufficient information to infer 3D information. In this paper, we propose a new monocular 3D object detector that can recall stereoscopic visual information about an object from a monocular image by being aware of each object's location. Given the object appearance in the monocular image, we devise a Monocular-to-Stereoscopic (M2S) memory that can recall the object appearance of the counterpart view and the corresponding depth information. For this purpose, we introduce a stereoscopic vision memorizing loss that guides the M2S memory to store stereoscopic visual information. Further, we propose a binocular vision association loss that guides the M2S memory to associate information from the left and right views of an object when estimating depth. As a result, our monocular 3D object detector with M2S memory can effectively exploit the recalled stereoscopic visual information in the inference phase. Comprehensive experimental results on two public datasets, the KITTI 3D Object Detection Benchmark and the Waymo Open Dataset, demonstrate the effectiveness of the proposed method. We regard our method as a step toward the human behavior of recalling stereoscopic visual information even when one eye is closed.

Note: Jung Uk Kim is a professor at Kyung Hee University after completing his Ph.D.

[#259]   2023-03-27 [IEEE TNNLS]   Advancing Adversarial Training by Injecting Booster Signal (by Hong Joo Lee and Youngjoon Yu) is accepted in IEEE Transactions on Neural Networks and Learning Systems

Title: Advancing Adversarial Training by Injecting Booster Signal

Authors: Hong Joo Lee and Youngjoon Yu, and Yong Man Ro

Recent works have demonstrated that deep neural networks (DNNs) are highly vulnerable to adversarial attacks. To defend against adversarial attacks, many defense strategies have been proposed, among which adversarial training has been demonstrated to be the most effective strategy. However, it has been known that adversarial training sometimes hurts natural accuracy. Then, many works focus on optimizing model parameters to handle the problem. Different from the previous approaches, in this paper, we propose a new approach to improve the adversarial robustness by using an external signal rather than model parameters. In the proposed method, a well-optimized universal external signal called a booster signal is injected to the outside of the image, which does not overlap with the original content. Then, it boosts both adversarial robustness and natural accuracy. The booster signal is optimized in parallel to model parameters step by step collaboratively. Experimental results show that the booster signal can improve both the natural and robust accuracies over the recent state-of-the-art adversarial training methods. Also, optimizing the booster signal is general and flexible enough to be adopted on any existing adversarial training methods.
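As described, the booster signal lives in a frame around the image rather than on top of it. A minimal sketch of that injection step follows; the padded-border layout, names, and shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def inject_booster(image, booster, pad):
    """Place a booster signal in a frame around the image so that it
    never overlaps the original content (illustrative 2D layout)."""
    h, w = image.shape
    if booster.shape != (h + 2 * pad, w + 2 * pad):
        raise ValueError("booster must cover the padded canvas")
    canvas = booster.copy()                   # booster fills the whole canvas
    canvas[pad:pad + h, pad:pad + w] = image  # original pixels stay untouched
    return canvas
```

During training, the booster (e.g., initialized to zeros) would be updated by gradient steps that lower the classification loss on these padded inputs, alternating with the usual adversarial-training updates of the model parameters.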

[#258]   2023-02-28 [CVPR 2023]    Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring (by Joanna Hong and Minsu Kim) is accepted in CVPR2023

Title: Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring 

Authors: Joanna Hong*, Minsu Kim*, Jeongsoo Choi, and Yong Man Ro (* equally contributed)

This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input corruption situation where the audio inputs and visual inputs are both corrupted, which has not been well addressed in previous research. Previous studies have focused on how to complement corrupted audio inputs with clean visual inputs, under the assumption that clean visual inputs are available. However, in real life, clean visual inputs are not always accessible and can even be corrupted by an occluded lip region or by noise. Thus, we first show that previous AVSR models are in fact not robust to the corruption of both multimodal input streams, the audio and the visual inputs, compared to uni-modal models. Then, we design multimodal input corruption modeling to develop robust AVSR models. Lastly, we propose a novel AVSR framework, namely the Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to corrupted multimodal inputs. AV-RelScore can determine which input modal stream is reliable for prediction and can exploit the more reliable streams in prediction. The effectiveness of the proposed method is evaluated with comprehensive experiments on the popular benchmark databases LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore reflect the degree of corruption well and make the proposed model focus on reliable multimodal representations.

[#257]   2023-02-28 [CVPR 2023]    Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression (by Junho Kim and Byung-Kwan Lee) is accepted in CVPR 2023

Title: Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression 

Authors: Junho Kim*, Byung-Kwan Lee*, and Yong Man Ro (* equally contributed)

The origin of adversarial examples is still inexplicable in the research field, and it arouses arguments from various viewpoints despite comprehensive investigations. In this paper, we propose a way of delving into the unexpected vulnerability of adversarially trained networks from a causal perspective, namely adversarial instrumental variable (IV) regression. By deploying it, we estimate the causal relation of adversarial prediction in an unbiased environment dissociated from unknown confounders. Our approach aims to demystify inherent causal features of adversarial examples by leveraging a zero-sum optimization game between a causal feature estimator (i.e., hypothesis model) and worst-case counterfactuals (i.e., test function) that disturb the search for causal features. Through extensive analyses, we demonstrate that the estimated causal features are highly related to correct predictions for adversarial robustness, and that the counterfactuals exhibit extreme features significantly deviating from correct predictions. In addition, we present how to effectively inoculate CAusal FEatures (CAFE) into defense networks to improve adversarial robustness.

[#256]   2023-02-16 [ICASSP 2023]    Lip-to-speech Synthesis in the Wild with Multi-task Learning (by Minsu Kim and Joanna Hong) is accepted in ICASSP 2023

Title: Lip-to-speech Synthesis in the Wild with Multi-task Learning

Authors: Minsu Kim*, Joanna Hong*, and Yong Man Ro (* equally contributed)

Recent studies have shown impressive performance in lip-to-speech synthesis, which aims to reconstruct speech from visual information alone. However, they have struggled to synthesize accurate speech in the wild, due to insufficient supervision for guiding the model to infer the correct content. Distinct from the previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with the correct content from input lip movements, even in a wild environment. To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss. Thus, the proposed framework brings the advantage of synthesizing speech containing the right content for multiple speakers with unconstrained sentences. We verify the effectiveness of the proposed method using the LRS2, LRS3, and LRW datasets.

[#255]   2023-02-16 [ICASSP 2023]    Similarity Relation Preserving Cross-Modal Learning For Multispectral Pedestrian Detection Against Adversarial Attacks (by Jung Uk Kim) is accepted in ICASSP 2023

Title: Similarity Relation Preserving Cross-Modal Learning For Multispectral Pedestrian Detection Against Adversarial Attacks

Authors: Jung Uk Kim and Yong Man Ro

Although multispectral pedestrian detection studies have shown remarkable detection performance, they are still vulnerable to adversarial attacks. We observe that the similarity relations between object candidates are not maintained under adversarial attacks, resulting in performance degradation. In this paper, we introduce a new method that can preserve the similarity relations between candidates against adversarial attacks using multispectral knowledge. First, we propose a Similarity Relation Generation (SRG) module to generate the optimal similarity relation between clean candidates by referring to the two modalities (color and thermal). Second, we propose an Adversarial Similarity Relation Preserving (ASRP) module to guide the similarity relation between adversarial candidates to be similar to that of the clean candidates. By maintaining the relationships between candidates, our multispectral detector can distinguish between the pedestrian and background classes even under adversarial attacks. Comprehensive experimental results show that our method conspicuously improves adversarial robustness.
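The ASRP objective described above, keeping the adversarial candidates' pairwise similarities close to the clean ones, can be sketched as follows. The cosine-similarity choice and mean-squared penalty are illustrative assumptions rather than the paper's exact loss:

```python
import numpy as np

def similarity_matrix(feats):
    """Pairwise cosine similarity between candidate features (one per row)."""
    norm = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return norm @ norm.T

def relation_preserving_loss(clean_feats, adv_feats):
    """Penalize deviation of the adversarial candidates' similarity
    relations from the clean candidates' relations (simplified sketch)."""
    diff = similarity_matrix(clean_feats) - similarity_matrix(adv_feats)
    return np.mean(diff ** 2)
```

Minimizing this term pushes the detector's adversarial candidate features toward the relational structure of the clean ones, which is the preservation behavior the abstract describes.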

[#254]   2023-02-16 [ICASSP 2023]    Multi-Temporal Lip-Audio Memory for Visual Speech Recognition (by Jeong Hun Yeo and Minsu Kim) is accepted in ICASSP 2023

Title: Multi-Temporal Lip-Audio Memory for Visual Speech Recognition

Authors: Jeong Hun Yeo, Minsu Kim, and Yong Man Ro

Visual Speech Recognition (VSR) is the task of predicting a sentence or word from lip movements. Some recent works use audio signals to supplement visual information. However, existing methods utilize only limited information, such as phoneme-level features and soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement the insufficient information of lip movements. The proposed method is mainly composed of two parts: 1) MTLAM saves multi-temporal audio features produced from short- and long-term audio signals, and memorizes a visual-to-audio mapping to load the stored multi-temporal audio features from visual features at the inference phase. 2) We design an audio temporal model to produce multi-temporal audio features capturing the context of neighboring words. In addition, to construct an effective visual-to-audio mapping, the audio temporal model can generate audio features time-aligned with visual features. Through extensive experiments, we validate the effectiveness of MTLAM, achieving state-of-the-art performance on two public VSR datasets.
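The visual-to-audio recall step above can be sketched as softmax addressing over memory slots: a visual query selects stored audio values at inference, so no audio input is needed at that point. The slot layout, names, and temperature below are illustrative assumptions:

```python
import numpy as np

def recall_audio(visual_query, keys, values, temperature=0.1):
    """Recall stored audio features from a visual query by softmax
    addressing over memory slots (simplified sketch of the idea).
    keys: (n_slots, d_vis), values: (n_slots, d_aud), query: (d_vis,)."""
    scores = keys @ visual_query / temperature        # similarity to each slot
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # attention over slots
    return weights @ values                           # weighted audio features
```

Training would fill `keys`/`values` with paired visual and multi-temporal audio features; a low temperature makes the recall nearly one-hot, returning the audio feature of the best-matching slot.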

[#253]   2023-01-12 [Report]    The research results (IEEE Trans. IP) of Youngjoon have been featured on YouTube as an example of KAIST research excellence.