Qinian Wang

Hi, I'm Qinian — a first-year Master’s student at Shanghai Jiao Tong University (SJTU), where I’m fortunate to be advised by Weidi Xie. My research focuses on video understanding and multimodal perception, driven by a passion for exploring the intersection between computer vision and brain intelligence.

Before joining SJTU, I completed my undergraduate studies at the University of Electronic Science and Technology of China (UESTC).

I truly enjoy sharing diverse perspectives and collaborating with people from different backgrounds. Please feel free to reach out if you're interested in my research or just want to have a chat!

Email  /  CV  /  GitHub

Research

Projects

NeSeg: An Agentic System for Video Segmentation with Positive and Negative Hints
GitHub

In prior work, a VLM prompted SAM with only positive sample points and bounding boxes to perform video object segmentation. In practice, however, on complex fine-grained segmentation, the absence of negative sample hints leaves the model with no mechanism for correcting its output. NeSeg builds on that framework and uses reinforcement learning (RL) to introduce negative sample hints, investigating their role in segmentation tasks.
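As a rough illustration of the prompting interface involved (a sketch, not the NeSeg implementation): SAM-style predictors accept point prompts as a coordinate array plus a label array, where label 1 marks a positive (foreground) point and 0 a negative (background) point. The function name below is hypothetical.

```python
import numpy as np

def build_point_prompt(positive_pts, negative_pts):
    """Stack positive and negative (x, y) hints into SAM-style prompt arrays.

    Returns coords of shape (N, 2) and labels of shape (N,),
    with 1 = foreground hint, 0 = background hint.
    """
    coords = np.array(list(positive_pts) + list(negative_pts), dtype=np.float32)
    labels = np.array([1] * len(positive_pts) + [0] * len(negative_pts),
                      dtype=np.int32)
    return coords, labels

# One click on the target, two clicks on distractor regions (made-up values)
coords, labels = build_point_prompt(
    positive_pts=[(120, 80)],
    negative_pts=[(40, 40), (200, 30)],
)
print(labels.tolist())  # [1, 0, 0]
```

Negative points give the model an explicit correction signal: regions the mask must exclude, which positive-only prompting cannot express.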

StreamTeller: Perception-First Memory with Event-Driven Visual Evidence
GitHub

StreamTeller is a training-free plug-in mechanism that addresses the memory-perception conflict in streaming video understanding. It organizes video streams into event nodes, each containing a structured caption and optional visual evidence. An event-driven gating strategy generates captions only upon semantic drift, and a perception-first retrieval strategy loads historical memory on demand. This design explicitly separates perception from memory, improving long-term recall while minimizing interference with current perception.
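The event-driven gating idea can be sketched in a few lines (a toy illustration under assumed details, not the released code): a new event node is opened only when the current frame embedding drifts, in cosine distance, far enough from the running centroid of the active event; the threshold value is illustrative.

```python
import numpy as np

def segment_into_events(frame_embs, drift_thresh=0.5):
    """Group consecutive frame embeddings into event nodes by cosine drift."""
    events = []
    for emb in frame_embs:
        emb = emb / np.linalg.norm(emb)
        if events:
            centroid = np.mean(events[-1], axis=0)
            centroid /= np.linalg.norm(centroid)
            drift = 1.0 - float(emb @ centroid)  # cosine distance to active event
            if drift < drift_thresh:
                events[-1].append(emb)  # same event: no new caption needed
                continue
        events.append([emb])  # semantic drift detected: open a new event node
    return events

# Two clusters of toy embeddings -> two event nodes
scene_a = [np.array([1.0, 0.0]), np.array([0.99, 0.05])]
scene_b = [np.array([0.0, 1.0]), np.array([0.05, 0.99])]
events = segment_into_events(scene_a + scene_b)
print(len(events))  # 2
```

Gating captioning on drift rather than on a fixed clock is what keeps the memory sparse: stable stretches of the stream collapse into one node, so retrieval later touches far fewer entries.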