The Image and Video Systems (IVY) Lab at KAIST was founded in 1997 and has been led by Prof. Ro since its establishment. Over the years, IVY Lab has conducted research across a wide spectrum of multimedia, including image and video processing and multimodal deep learning. Recent research topics include multimodal deep learning; integrating vision, speech, and language for AI; vision with large-scale models; inclusive human multimodal conversation; interpretability and robustness of deep learning models; and computer vision and multimedia.

IVY Lab has produced about 130 journal papers and 350 conference papers to date. The collaborative lab environment and the enthusiasm of its members have kept it closely engaged with the latest developments in standards and AI. For example, the lab developed the homogeneous texture descriptor for the MPEG standard, the ROI descriptor in SVC, and various description schemes for user characteristics as part of the MPEG standards.

In recent years, the lab has accomplished several notable research achievements in AI: deep learning based visual recognition, distinguishing homophenes using multi-head visual-audio memory, distilling robust and non-robust features in adversarial examples, SyncTalkFace for talking face generation, lip-to-speech synthesis with visual context attentional GAN, CroMM-VSR (cross-modal memory augmented visual speech recognition), multi-modality associative bridging through memory, video prediction recalling long-term motion context, structure boundary preserving segmentation, BMAN (bidirectional multi-scale aggregation networks), mode variational LSTM robust to unseen modes of variation, and multi-objective based spatio-temporal feature representation learning. The lab continuously works hand in hand with industry to innovate and advance the state of the art in multiple aspects of multimodal AI. Currently, the lab is interested in the following research topics:
Deep learning and machine learning for computer vision and multimedia
Multimodal deep learning
Integrating vision, speech, and language for AI
Multimodal object and motion detection/recognition
Inclusive human machine teaming
Analysis of the competency, interpretability, memorability, and robustness of deep learning models
Multimodal prompting with large-scale models
Integrating vision, speech, and language for AI is an exciting research area. It aims to build AI systems that overcome the limitations of single-modality learning by co-learning from multimodal data; a minimal co-learning sketch is given below. Some applications of this topic are multimodal chatbots, visual speech recognition, multimodal machine translation, multimodal information retrieval, and multimodal sentiment analysis.
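The following is a minimal, illustrative sketch of multimodal co-learning in PyTorch: features from three modalities are projected into a shared space and fused for a joint prediction. The dimensions, module names, and the late-fusion design are assumptions for illustration, not the lab's actual models.

```python
# Minimal sketch of multimodal co-learning: a late-fusion classifier that
# combines (hypothetical) vision, speech, and language embeddings.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, vision_dim=512, speech_dim=256, text_dim=768, num_classes=10):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.vision_proj = nn.Linear(vision_dim, 128)
        self.speech_proj = nn.Linear(speech_dim, 128)
        self.text_proj = nn.Linear(text_dim, 128)
        # Fuse by concatenation, then classify jointly.
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(3 * 128, num_classes))

    def forward(self, vision_feat, speech_feat, text_feat):
        fused = torch.cat([
            self.vision_proj(vision_feat),
            self.speech_proj(speech_feat),
            self.text_proj(text_feat),
        ], dim=-1)
        return self.classifier(fused)

# Toy usage with random features standing in for real encoder outputs.
model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```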
How to use pretrained large-scale models effectively is a central research question. One direction is multimodal prompting with large-scale models, which explores how to design multimodal prompts that can guide large-scale models to solve AI tasks; a minimal prompting sketch is given below. The large-scale models need to handle multiple types of data, such as images, videos, audio, and text. Some applications of this topic are multimodal affective computing, multimodal machine translation, multimodal cognitive AI, multimodal task-oriented dialogue, and multimodal instruction tuning.
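As a minimal sketch of multimodal prompting, the snippet below pairs an image with a task-specific text prompt and feeds both to a pretrained vision-language model through the Hugging Face transformers BLIP-2 interface. The model choice, image file name, and prompt are illustrative assumptions, and a GPU is assumed to be available.

```python
# Minimal sketch of multimodal prompting: an image plus a text prompt steer a
# pretrained vision-language model toward a specific task.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # hypothetical input image
prompt = "Question: what emotion does the speaker show? Answer:"  # task-specific text prompt

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```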
This research field combines different modalities of data, such as vision, language, and speech, to perform AI tasks. Currently, multimodal deep learning is being studied for human multimodality (speech, language, talking face) translation. Human multimodal translation includes visual speech recognition, speech synthesis, talking face generation, and audio-visual speech recognition; a minimal cross-modal fusion sketch is given below. Translation among image, text, and sound is also studied, e.g., image to text, text to image, and image to sound. Multimodal deep learning can help improve the performance and robustness of AI models by using complementary information from different modalities. Some applications of this topic are multimodal translation, multimodal video captioning, multimodal image retrieval, multimodal speech enhancement, and multimodal scene understanding.
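The following is a minimal sketch of cross-modal fusion for audio-visual speech recognition: lip-motion features query the audio stream through cross-attention before per-frame decoding. The sequence lengths, dimensions, and single-layer design are assumptions for illustration.

```python
# Minimal sketch of cross-modal fusion: visual (lip) features attend to audio
# features via cross-attention, then a linear layer produces per-frame logits.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, vocab_size=40):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.decoder = nn.Linear(dim, vocab_size)  # per-frame character/phoneme logits

    def forward(self, visual_seq, audio_seq):
        # Queries come from the visual stream, keys/values from the audio stream,
        # so each lip frame gathers complementary acoustic evidence.
        fused, _ = self.cross_attn(visual_seq, audio_seq, audio_seq)
        return self.decoder(fused)

model = AudioVisualFusion()
visual = torch.randn(2, 75, 256)   # e.g., 75 lip frames per clip
audio = torch.randn(2, 300, 256)   # e.g., 300 audio frames per clip
logits = model(visual, audio)
print(logits.shape)  # torch.Size([2, 75, 40])
```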
Inclusive human multimodal conversation is a research topic that explores how humans can communicate with each other through machines in any circumstance, using AI-empowered modes of conversation. This topic is important for understanding how humans can interact more effectively and empathetically in diverse contexts and situations, such as across different cultures and languages. Some applications of this topic are inclusive education, inclusive health care, inclusive entertainment, and inclusive social media.
These research topics aim to understand the behavior of deep learning models across various tasks and domains. Competency refers to the ability of a model to achieve high accuracy and efficiency on a specific task or domain, such as image classification or natural language processing. Interpretability refers to the ability of a model to provide understandable and explainable outputs for humans, such as revealing the relevant features or generating multimodal descriptions. Memorability refers to the ability of a model to store and retrieve knowledge from previous inputs or outputs, such as using attention mechanisms or memory networks. Robustness refers to the ability of a model to maintain its performance under various adversarial attacks; a minimal robustness check is sketched below. Some applications of this topic are competency analysis, interpretability analysis, memorability analysis, and robustness analysis for verifying that a model is reliable.
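As a minimal sketch of a robustness check, the snippet below compares a classifier's prediction on a clean input with its prediction on an FGSM adversarial example. The model, input, and epsilon are illustrative assumptions; in practice the clean label would come from the dataset rather than the model's own prediction.

```python
# Minimal sketch of a robustness check with the fast gradient sign method (FGSM).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()  # illustrative classifier

def fgsm(image, label, epsilon=0.01):
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Perturb the input in the direction that increases the loss.
    return (image + epsilon * image.grad.sign()).detach()

x = torch.randn(1, 3, 224, 224)    # stand-in for a preprocessed image
y = model(x).argmax(dim=1)         # treat the clean prediction as the label
x_adv = fgsm(x, y)
print("clean:", y.item(), "adversarial:", model(x_adv).argmax(dim=1).item())
```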
Computer vision in AI deals with computational methods that enable machines to understand and interpret the content of visual data. Computer vision and multimedia aim to make machines see and understand multimodal data from cameras or sensors and to interact based on that information; a minimal retrieval sketch is given below. Some applications of this topic are augmented reality (AR), virtual reality (VR), and mixed reality (MR), computer vision in interactive media, video editing, content analysis, content recommendation, content protection, content enhancement, image processing, image retrieval, image synthesis, image captioning, and image understanding.
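As one small, illustrative example of multimodal image retrieval, the sketch below ranks candidate images by their CLIP similarity to a text query using Hugging Face transformers. The model checkpoint and image file names are assumptions; any comparable joint vision-language embedding could be used instead.

```python
# Minimal sketch of text-to-image retrieval: score images against a text query
# in a shared CLIP embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["frame1.jpg", "frame2.jpg"]]  # hypothetical files
query = "a person speaking in front of a camera"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
scores = outputs.logits_per_text[0]  # similarity of the query to each image
print(scores.softmax(dim=-1))        # higher means a better match
```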