Kan Jen Cheng 鄭堪任
I am an incoming PhD student in Computer Science at the University of Maryland, College Park (Fall 2026), advised by Professor Ming C. Lin. Previously, I received my bachelor's degree from UC Berkeley, where I worked on audio-visual research in the Berkeley Speech Group at Berkeley AI Research (BAIR), advised by Professor Gopala Anumanchipalli.
My research interests lie at the intersection of audio-visual learning, multimodal perception, and generative modeling. While recent advances have enabled LLMs to master language, I view text as a reduction of reality. Instead, I aim to build multimodal systems that perceive the world through the synergy of sight and sound, grounded in spatial awareness and physical understanding.
If you would like to discuss my research or potential collaborations, feel free to contact me. I'm always open to connecting and collaborating.
My work is driven by two core philosophies: human-centered perception, where I model speech characteristics, affective dynamics, and joint cognitive attention to capture how humans naturally experience the world; and creative media, where I develop tools that offer precise, object-level control for content creation.
As an audio-visual researcher, I have witnessed the power of multimodal synergy, yet I realize that correlation alone is insufficient. True perception requires understanding the physics that governs the spaces where sight and sound coexist: geometry, dynamics, and material interactions. Consequently, my future research will focus on spatial and physical learning as a step toward a comprehensive world model. I aim to move beyond surface-level alignment and construct digital twins that not only mimic the appearance of an environment but also simulate its underlying physical reality. By grounding audio-visual generation in these physical truths, we can enable agents to reason about the world through a unified sensory experience.
A realization of human audio-visual selective attention that jointly emphasizes a selected object both visually and acoustically, based on a flow-based Schrödinger bridge.
A holistic benchmark for assessing emotional coherence in spoken dialogue systems through continuous, categorical, and perceptual metrics.
A latent-diffusion, exemplar-based analogy (in-context learning) model for audio texture manipulation, i.e., editing the overall perceptual quality of a sound and how it interacts with various sound sources.
Berkeley AI Research (BAIR), CA, U.S.
Research Assistant • Spring 2024 - Spring 2026
With: Gopala Anumanchipalli
Fall 2026 - Present
University of Maryland, College Park, MD, U.S.
Ph.D. in Computer Science
Fall 2020, Spring 2022 - Fall 2023
University of California, Berkeley, CA, U.S.
B.A. in Computer Science