My research interests lie at the intersection of audio-visual learning, multi-modal perception, and generative AI. While recent LLMs have mastered language, I view text as a reduction of reality. Instead, I aim to build multi-modal systems that perceive the world through the synergy of sight and sound, with spatial awareness and physical grounding.
My work is driven by two core philosophies: human-centered perception, where I model speech characteristics, affective dynamics, and joint cognitive attention to capture how humans naturally experience the world; and creative media, where I develop tools that offer precise, object-level control for content creation.
Future Directions
As an audio-visual researcher, I have witnessed the power of multi-modal synergy, yet I have come to realize that correlation alone is insufficient. True perception requires understanding the physical principles, such as geometry, dynamics, and material interactions, that govern the spaces where sight and sound coexist. Consequently, my future research will focus on spatial and physical learning, contributing to the development of a comprehensive world model. I aim to move beyond surface-level alignment and construct digital twins that not only mimic the appearance of an environment but also simulate its underlying physical reality. By grounding audio-visual generation in these physical truths, we can enable agents to reason about the world through a unified sensory experience.
One concrete direction is a realization of human audio-visual selective attention: a system that, given a selected object, jointly emphasizes it both visually and acoustically, built on a flow-based Schrödinger bridge.
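As a rough sketch of how this could be posed (the endpoint distributions $\pi_{\mathrm{scene}}$ and $\pi_{\mathrm{emph}}$ below are illustrative assumptions, not a committed design), the Schrödinger bridge seeks the stochastic process closest to a simple reference diffusion $\mathbb{Q}$ while matching prescribed endpoint marginals:
$$
\mathbb{P}^{\ast} \;=\; \arg\min_{\mathbb{P}} \; D_{\mathrm{KL}}\!\left(\mathbb{P} \,\|\, \mathbb{Q}\right)
\quad \text{s.t.} \quad \mathbb{P}_{0} = \pi_{\mathrm{scene}}, \;\; \mathbb{P}_{1} = \pi_{\mathrm{emph}},
$$
where $\pi_{\mathrm{scene}}$ would be the distribution of unedited audio-visual scenes and $\pi_{\mathrm{emph}}$ the distribution in which the attended object is enhanced in both modalities; the learned flow then transports a scene toward its selectively emphasized counterpart.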