Integrating Vision, Speech, and Language for AI
Integrating vision, speech, and language for AI is a challenging and exciting research area. It aims to build AI systems that can perceive, understand, and generate multimodal data. This topic is closely related to the multimodal deep learning topics below; a minimal fusion sketch is shown after the list of works. Some of the ongoing works in IVL&IVL Lab are as follows:
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge (ICCV 2023)
Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation (arXiv 2023)
Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens (arXiv 2023)
Incorporating Language-Driven Appearance Knowledge Units with Visual Cues in Pedestrian Detection
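As a rough illustration of what integrating vision, speech, and language means at the model level, the following minimal PyTorch sketch projects features from three modality-specific encoders into a shared space and fuses them for a downstream prediction. The dimensions, module names, and fusion-by-concatenation design are illustrative assumptions, not the architecture of any paper above.

```python
# A minimal sketch (not the lab's actual architecture) of projecting vision,
# speech, and language features into a shared space and fusing them.
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    def __init__(self, vision_dim=512, speech_dim=256, text_dim=768,
                 shared_dim=256, num_classes=10):
        super().__init__()
        # Project each modality into a common embedding space.
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.speech_proj = nn.Linear(speech_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Fuse by concatenation followed by a small MLP head.
        self.head = nn.Sequential(
            nn.Linear(3 * shared_dim, shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, num_classes),
        )

    def forward(self, vision_feat, speech_feat, text_feat):
        fused = torch.cat([
            self.vision_proj(vision_feat),
            self.speech_proj(speech_feat),
            self.text_proj(text_feat),
        ], dim=-1)
        return self.head(fused)

# Example with random features standing in for encoder outputs.
model = SimpleMultimodalFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```

Real systems replace the random placeholder features with outputs of task-specific encoders (e.g., a lip-video encoder or a speech-unit encoder) and use richer fusion than simple concatenation.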
Utilizing Large-scale Models and Multimodal Prompts
Multimodal prompting with large-scale models is a research topic that explores how to design multimodal prompts that guide large-scale models to solve multimodal tasks. Such large-scale models can handle multiple types of data, such as images, videos, audio, and text. A sketch of prompt tuning with a frozen backbone is given after the list of works below. Some of the ongoing works in IVL&IVL Lab are as follows:
Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge
Advancing Causal Intervention in Image Captioning with Causal Prompt
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Meta Input: How to Leverage Off-the-Shelf Deep Neural Networks
Speaker-adaptive Lip Reading with User-dependent Padding (ECCV 2022)
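The prompt-related works above build on the idea of steering a large, frozen model with a small number of learnable inputs. The sketch below shows generic prompt tuning with a frozen transformer encoder: only the prompt vectors are optimized. The encoder, dimensions, and prompt length are illustrative assumptions and do not correspond to any specific model used in the papers.

```python
# A minimal sketch of prompt tuning, assuming a frozen transformer encoder:
# a small set of learnable prompt vectors is prepended to the input
# embeddings and optimized, while the backbone weights stay frozen.
# The encoder here is a generic nn.TransformerEncoder, not a specific
# large-scale model from the works above.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, embed_dim=256, num_prompts=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Freeze the backbone; only the prompts are trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, x):  # x: (batch, seq, embed_dim)
        batch = x.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompts, x], dim=1))

model = PromptTunedEncoder()
out = model(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 24, 256]) -> 8 prompt tokens + 16 inputs
```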
Multimodal Deep Learning
This research field combines different modalities of data, such as vision, language, and speech, to perform AI tasks. Currently, multimodal deep learning is being studied for translating among human multimodal signals (speech, language, and talking faces). This human multimodal translation includes visual speech recognition, speech synthesis, talking face generation, and audio-visual speech recognition. Images, text, and sound are also transformed into one another, e.g., image to text, text to image, and image to sound. Multimodal deep learning can improve the performance and robustness of AI models by exploiting complementary information from different modalities; a simple audio-visual fusion sketch is shown after the paper list below. Related papers published by IVY&IVL Lab are as follows:
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding (ICCV 2023)
Intelligible Lip-to-speech Synthesis with Speech Units (Interspeech 2023)
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring (CVPR 2023)
Lip-to-speech Synthesis in the Wild with Multi-task Learning (ICASSP 2023)
Multi-Temporal Lip-Audio Memory for Visual Speech Recognition (ICASSP 2023)
Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video (AAAI 2023)
Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment (ECCV 2022)
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition (Interspeech 2022)
Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory (CVPR 2022)
Distinguishing Homophenes using Multi-head Visual-audio Memory for Lip Reading (AAAI 2022)
Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory (AAAI 2022)
Towards Versatile Pedestrian Detector with Multisensory-Matching and Multispectral Recalling Memory (AAAI 2022)
Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS 2021)
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video (ICCV 2021)
Cross-Modal Memory Augmented Visual Speech Recognition (IEEE Trans. on Multimedia 2021)
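Many of the papers above operate on paired audio and visual speech streams. The following minimal sketch shows one generic way to fuse the two: visual (lip) features attend to audio features via cross-attention before per-frame classification. It is only an illustration of audio-visual fusion, not the memory-based or reliability-scoring designs proposed in the papers.

```python
# A minimal sketch of audio-visual fusion for speech recognition, assuming
# per-frame audio and visual (lip) features from separate front-ends.
# The dimensions and cross-attention design are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, vocab_size=40):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, vocab_size)  # e.g., per-frame CTC logits

    def forward(self, visual_feat, audio_feat):
        # visual_feat, audio_feat: (batch, time, dim)
        attended, _ = self.cross_attn(query=visual_feat, key=audio_feat,
                                      value=audio_feat)
        fused = self.norm(visual_feat + attended)  # residual fusion
        return self.classifier(fused)

model = AudioVisualFusion()
logits = model(torch.randn(2, 75, 256), torch.randn(2, 75, 256))
print(logits.shape)  # torch.Size([2, 75, 40])
```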
Inclusive Human Multimodal Conversation
Inclusive human multimodal conversation is a research topic that explores how humans can communicate with each other via machines in any circumstance, using different modes of conversation such as speech, language, and talking faces. This topic is important for understanding how humans can interact more effectively and empathetically in diverse contexts and situations, such as across different cultures and languages. A small sketch of persona selection by semantic similarity follows the list below. Some of the ongoing works in IVL&IVL Lab are as follows:
Persona extraction through semantic similarity for emotional support conversation generation
Enhanced empathetic dialogue response generation with emotion knowledge from large language models
Visual Speech Recognition for Low-resource Languages with Automatic Labels from Whisper Model
Reprogramming Audio-driven Talking Face Synthesis into Text-driven (arXiv 2023)
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection (ECCV 2022)
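One of the ongoing works above extracts persona information through semantic similarity. The snippet below sketches the basic retrieval step, assuming some sentence encoder has already produced fixed-size embeddings (random tensors stand in for them here); the embedding dimension and top-k value are arbitrary.

```python
# A minimal sketch of selecting persona sentences by semantic similarity.
# Random tensors are placeholders for sentence-encoder embeddings; the
# dimensionality and top-k value are illustrative assumptions.
import torch
import torch.nn.functional as F

def select_persona(dialogue_emb, persona_embs, top_k=2):
    """Return indices of the persona sentences most similar to the dialogue."""
    # dialogue_emb: (dim,), persona_embs: (num_sentences, dim)
    sims = F.cosine_similarity(dialogue_emb.unsqueeze(0), persona_embs, dim=-1)
    return sims.topk(top_k).indices.tolist()

dialogue_emb = torch.randn(384)     # placeholder for an encoded dialogue turn
persona_embs = torch.randn(5, 384)  # placeholders for encoded persona sentences
print(select_persona(dialogue_emb, persona_embs))  # e.g., [3, 0]
```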
Competency, Interpretability, Memorability, and Robustness of Deep Learning Models
These research topics aim to understand the behavior of deep learning models across various tasks and domains. Competency refers to the ability of a model to achieve high accuracy and efficiency on a specific task or domain, such as image classification or natural language processing. Interpretability refers to the ability of a model to provide understandable and explainable outputs for humans, such as revealing the relevant features or generating multimodal descriptions. Memorability refers to the ability of a model to store and retrieve knowledge from previous inputs or outputs, such as by using attention mechanisms or memory networks. Robustness refers to the ability of a model to maintain its performance under various adversarial attacks; a minimal attack sketch is given after the paper list below. Related papers published by IVL&IVL Lab are as follows:
Mitigating Adversarial Vulnerability through Causal Parameter Estimation (AAAI 2023)
Multispectral Invisible Coating: Laminated Visible-Thermal Physical Attack against Multispectral Object Detectors using Transparent Low-e films (AAAI 2023)
Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression (CVPR 2023)
Adversarial anchor-guided feature refinement for adversarial defense (Image and Vision Computing, 2023)
Robust Proxy: Improving Adversarial Robustness by Robust Proxy Learning (IEEE Trans. on IFS, 2023)
Advancing Adversarial Training by Injecting Booster Signal (IEEE Trans. on NNLS, 2023)
MAP: Multispectral Adversarial Patch to Attack Person Detection (ICASSP 2022)
Robust Thermal Infrared Pedestrian Detection by Associating Visible Pedestrian Knowledges (ICASSP 2022)
Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck (NeurIPS 2021)
Robust Small-scale Pedestrian Detection with Cued Recall via Memory Learning (ICCV 2021)
Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning (CVPR 2021)
Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network (CVPR 2022)
Robust Perturbation for Visual Explanation: Cross-checking Mask Optimization to Avoid Class Distortion (IEEE Trans. on IP 2021)
Visual Explanation of Challenging Conditioned Dataset with Bias-reducing Memory (BMVC 2021)
Generation of Multimodal Justification Using Visual Word Constraint Model for Explainable Computer-Aided Diagnosis (MICCAIW 2019)
Visual evidence for interpreting diagnostic decision of deep neural network in computer-aided diagnosis (Medical Imaging 2019)
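The robustness papers above defend against adversarial perturbations. To make the threat concrete, the sketch below crafts an adversarial example with the standard FGSM attack on a toy classifier; it illustrates the kind of perturbation being defended against and is not a method from any of the listed papers.

```python
# A minimal sketch of crafting an adversarial example with FGSM.
# The model and epsilon are toy, illustrative choices.
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, epsilon=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    # Perturb the input in the direction that increases the loss.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()

# Toy classifier on 32x32 RGB images, with random data standing in for a dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())  # perturbation bounded by epsilon
```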
Computer Vision and Multimedia
Computer vision in AI deals with computational methods for machines to understand and interpret the content of visual data. Computer vision and multimedia research aims to make machines see and understand multimodal data from cameras or sensors, and to interact based on that information; a simple two-stream fusion sketch is given after the paper list below. Related papers published by IVL&IVL Lab are as follows:
Robust multispectral pedestrian detection via spectral position-free feature mapping (ICIP 2023)
Stereoscopic Vision Recalling Memory for Monocular 3D Object Detection (IEEE Trans. on IP 2022)
Similarity Relation Preserving Cross-Modal Learning For Multispectral Pedestrian Detection Against Adversarial Attacks (ICASSP 2023)
Defending Person Detection Against Adversarial Patch Attack by using Universal Defensive Frame (IEEE Trans. on IP 2021)
Face Shape-Guided Deep Feature Alignment for Face Recognition Robust to Face Misalignment (IEEE Trans. on BIOM 2022)
Defending Physical Adversarial Attack on Object Detection via Adversarial Patch-Feature Energy (ACM MM 2022)
Towards a Better Understanding of VR Sickness: Physical Symptom Prediction for VR Contents (AAAI 2021)
Visual Comfort Aware-Reinforcement Learning for Depth Adjustment of Stereoscopic 3D Images (AAAI 2021)
Structure Boundary Preserving Segmentation for Medical Image with Ambiguous Boundary (CVPR 2020)
Mode Variational LSTM Robust to Unseen Modes of Variation: Application to Facial Expression Recognition (AAAI 2019)
Bidirectional Multi-scale Aggregation Networks for Abnormal Event Detection (IEEE Trans. on IP 2019)
Uncertainty-Guided Cross-Modal Learning for Robust Multispectral Pedestrian Detection (IEEE Trans. on CSVT 2021)
Class Uncertainty-Aware Gradient Modulation for Robust Object Detection (IEEE Trans. on CSVT 2020)
Robust Video Frame Interpolation with Exceptional Motion Map (IEEE Trans. on CSVT 2021)
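Several of the detection papers above combine visible (RGB) and thermal imagery. The sketch below shows a generic two-stream, feature-level fusion; the backbone layers and concatenation-based fusion are illustrative assumptions rather than the papers' actual architectures.

```python
# A minimal sketch of feature-level fusion of visible (RGB) and thermal
# streams for detection-style backbones. Layer choices are illustrative.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, out_channels=64):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.rgb_stream = stream(3)      # visible camera
        self.thermal_stream = stream(1)  # thermal camera
        self.fuse = nn.Conv2d(128, out_channels, 1)  # merge the two streams

    def forward(self, rgb, thermal):
        feats = torch.cat([self.rgb_stream(rgb), self.thermal_stream(thermal)], dim=1)
        return self.fuse(feats)  # fused feature map for a downstream detection head

model = TwoStreamFusion()
fmap = model(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))
print(fmap.shape)  # torch.Size([1, 64, 64, 64])
```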