Ironieser

👋 Hello! I'm Sixun Dong (Ironieser)

🌐 Homepage • 💻 GitHub • 📧 Email • 🎓 Google Scholar

🎓 About Me

I work on multimodal learning, computer vision, and LLM agents. My research connects vision, language, and temporal understanding, with a particular emphasis on weakly supervised learning and efficient model design.

🔬 Research Interests:

Multimodal Learning: Vision-Language Models, Cross-modal Understanding
Video Understanding: Temporal Analysis, Action Recognition, Weakly Supervised Learning
Time Series Analysis: Forecasting with Multimodal Perspectives
LLM Agents: Tool Learning, Feature Transformation, Embodied AI
Efficient AI: Token Pruning, Model Compression, Fast Inference

🎯 Current Focus: Developing embodied multimodal agents that can see, understand, reason, plan, and execute in open-world scenarios.

🏆 Research Highlights

📚 Selected Publications

🔥 ICLR 2026 - MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
First Author | Training-free VLM inference speedup of 1.87x
[Paper] [Code]

🔥 ICASSP 2026 - Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER
LLM agents for robust dysarthric speech recognition
[Paper (Coming Soon)]

🔥 NeurIPS 2025 - Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation
Auto-regressive and diffusion model for feature engineering
[Paper]

🔥 WACV 2024 - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Equipping LLMs with multimodal tool-use capabilities
[Paper] [Code]

🔥 3DV 2024 - RoomDesigner: Encoding Anchor-latents for Style-consistent and Shape-compatible Indoor Scene Generation
Indoor scene generation with style and shape consistency
[Paper] [Code]

🔥 CVPR 2023 - Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
First Author | Video-text alignment without frame-level supervision
[Paper] [Code]

🔥 CVPR 2022 Oral - TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
First Author | Repetitive action counting with transformer architecture
[Paper] [Code] [Dataset] [YouTube] [Bilibili]

🚀 Current Projects

Efficient Vision-Language Models: Token pruning strategies for VLM acceleration
Vision-Language Model Evaluation: Better evaluation strategies for VLMs

💼 Professional Experience

🔬 Current Position

Independent Researcher | Tempe, US | Present
Focus: Multimodal Learning, Computer Vision, LLM Agents

🏢 Industry Experience

GenAI Research Intern | Zoom Inc. | May 2025 - Aug 2025
Efficient Vision-Language Modeling

Research Intern (Team Leader) | DGene | Nov 2023 - Jan 2024
Co-Speech Gesture & Head Motion Generation

Research Intern (Team Leader) | Transsion Holdings | Apr 2023 - Aug 2023
Audio-Driven Talking Head Video Generation

🎓 Education

🎓 M.S. Computer Science | ShanghaiTech University | 2024
SVIP-Lab, Advisor: Prof. Shenghua Gao

🎓 B.E. Computer Science (Dual Degree) | Dalian University of Technology | 2020
🎓 B.E. Process Equipment & Control Engineering | Dalian University of Technology | 2020

🤝 Academic Service

Reviewer for: CVPR 2023+, ICCV 2023+, ECCV 2024+, ACM MM (2023-2025), ACCV (2024), KDD (2024), NeurIPS 2025, ICML 2025, ICLR 2026, TMM, Neural Networks, TKDD

💬 Let's Connect!

"Building the future of multimodal AI, one model at a time."
