  • Independent Researcher
  • US

Ironieser/README.md

👋 Hello! I'm Sixun Dong (Ironieser)

🌐 Homepage • 💻 GitHub • 📧 Email • 🎓 Google Scholar


🎓 About Me

I focus on cutting-edge research in multimodal learning, computer vision, and LLM agents. My work bridges the gap between vision, language, and temporal understanding, with a particular emphasis on weakly supervised learning and efficient model design.

🔬 Research Interests:

  • Multimodal Learning: Vision-Language Models, Cross-modal Understanding
  • Video Understanding: Temporal Analysis, Action Recognition, Weakly Supervised Learning
  • Time Series Analysis: Forecasting with Multimodal Perspectives
  • LLM Agents: Tool Learning, Feature Transformation, Embodied AI
  • Efficient AI: Token Pruning, Model Compression, Fast Inference

🎯 Current Focus: Developing embodied multimodal agents that can see, understand, reason, plan, and execute in open-world scenarios.


🏆 Research Highlights

📚 Selected Publications

🔥 ICLR 2026 - MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
First Author | Training-free VLM inference speedup of 1.87x
[Paper] [Code]

🔥 ICASSP 2026 - Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER
LLM agents for robust dysarthric speech recognition
[Paper (Coming Soon)]

🔥 NeurIPS 2025 - Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation
Auto-regressive and diffusion model for feature engineering
[Paper]

🔥 WACV 2024 - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Equipping LLMs with multimodal tool-use capabilities
[Paper] [Code]

🔥 3DV 2024 - RoomDesigner: Encoding Anchor-latents for Style-consistent and Shape-compatible Indoor Scene Generation
Indoor scene generation with style and shape consistency
[Paper] [Code]

🔥 CVPR 2023 - Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
First Author | Video-text alignment without frame-level supervision
[Paper] [Code]

🔥 CVPR 2022 Oral - TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
First Author | Repetitive action counting with transformer architecture
[Paper] [Code] [Dataset] [YouTube] [Bilibili]

🚀 Current Projects

  • Efficient Vision-Language Models: Token pruning strategies for VLM acceleration
  • Vision-Language Model Evaluation: Better evaluation strategies for VLMs

🛠 Technical Stack

Languages & Frameworks:
Python PyTorch C++ Linux

Research Areas:
Computer Vision Multimodal Learning Time Series LLM Agents


💼 Professional Experience

🔬 Current Position

Independent Researcher | US, Tempe | Present
Focus: Multimodal Learning, Computer Vision, LLM Agents

🏒 Industry Experience

GenAI Research Intern | Zoom Inc. | May 2025 - Aug 2025
Efficient Vision-Language Modeling

Research Intern (Team Leader) | DGene | Nov 2023 - Jan 2024
Co-Speech Gesture & Head Motion Generation

Research Intern (Team Leader) | Transsion Holdings | Apr 2023 - Aug 2023
Audio-Driven Talking Head Video Generation


🎓 Education

🎓 M.S. Computer Science | ShanghaiTech University | 2024
SVIP-Lab, Advisor: Prof. Shenghua Gao

🎓 B.E. Computer Science (Dual Degree) | Dalian University of Technology | 2020
🎓 B.E. Process Equipment & Control Engineering | Dalian University of Technology | 2020


📊 GitHub Stats

Ironieser

🀝 Academic Service

Reviewer for: CVPR 2023+, ICCV 2023+, ECCV 2024+, ACM MM (2023-2025), ACCV (2024), KDD (2024), NeurIPS 2025, ICML 2025, ICLR 2026, TMM, Neural Networks, TKDD


🎡 Currently Listening

Spotify


🐍 Contribution Snake

github-snake

💬 Let's Connect!

"Building the future of multimodal AI, one model at a time."

Email Homepage Google Scholar

Profile Views

Pinned

  1. SvipRepetitionCounting/TransRAC Public

    (CVPR 2022 Oral) Official implementation: TransRAC

    Python 120 20

  2. svip-lab/WeakSVR Public

    (CVPR 2023) Official implementation of the paper "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos"

    Python 31 4

  3. Storyteller_Of_Auto-Barrage Public

    A plugin that automatically sends bullet comments (danmaku), with multiple editable parameters and detailed prompt messages. Please credit the source when using this plugin.

    JavaScript 8 2

  4. Python-Remote-Development Public

    An introduction to configuring remote development with PyCharm or VS Code.

    5

  5. TimesCLIP Public

    The official repo of "Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives"

    51 2

  6. MMTok Public

    [ICLR 2026] The official repo of "MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs"

    Python 37 4