Hi, this is the website of the RHOS team at MVIG. We study Human Activity Understanding, Visual Reasoning, and Embodied AI. We are building a knowledge-driven system that enables intelligent agents to perceive human activities, reason about the logic behind human behavior, learn skills from human activities, and interact with the environment.
Research Interests:
(S) Embodied AI: how to make agents learn skills from humans and interact with humans, scenes, and objects.
(S-1) Human Activity Understanding: how to learn and ground complex or ambiguous human activity concepts (body motion, human-object/human-human/human-scene interaction) and object concepts from multi-modal information (2D, 3D, 4D).
(S-2) Visual Reasoning: how to mine, capture, and embed the logic and causal relations underlying human activities.
(S-3) General Multi-Modal Foundation Models: especially for human-centric perception tasks.
(S-4) Activity Understanding from a Cognitive Perspective: work with multidisciplinary researchers to study how the brain perceives activities.
(E) Human-Robot Interaction (e.g., for Smart Hospital): work with the healthcare team (doctors and engineers) at SJTU to develop intelligent robots to help people.
We are actively looking for self-motivated students (Master/PhD, 2025 spring & fall) and interns / engineers / visitors (CV/ML/ROB/NLP/Math/Phys background, always welcome) to join us in the Machine Vision and Intelligence Group (MVIG). If you share the same or similar interests, feel free to drop me an email with your resume.
Human Activity Knowledge Engine (HAKE) is a knowledge-driven system that aims to enable intelligent agents to perceive human activities, reason about the logic behind human behavior, learn skills from human activities, and interact with objects and environments.
We propose the challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out object affordances and simultaneously give the reason: which attributes make an object possess these affordances.
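To make the attribute-to-affordance reasoning concrete, here is a minimal sketch of one plausible design: predict an object's attributes first, then infer its affordances from the visual feature together with those attributes, so the attribute predictions serve as the "reason". The module names, dimensions, and category counts below are illustrative assumptions, not the released OCL code.

```python
# Minimal sketch of attribute-conditioned affordance reasoning (assumed design,
# not the official OCL implementation). Dimensions and counts are placeholders.
import torch
import torch.nn as nn

class AttributeToAffordance(nn.Module):
    def __init__(self, feat_dim=512, n_attributes=64, n_affordances=40):
        super().__init__()
        self.attr_head = nn.Linear(feat_dim, n_attributes)                 # what the object is like
        self.aff_head = nn.Linear(feat_dim + n_attributes, n_affordances)  # what it can be used for

    def forward(self, obj_feat):
        attr_logits = self.attr_head(obj_feat)
        attrs = torch.sigmoid(attr_logits)                                  # multi-label attribute scores
        aff_logits = self.aff_head(torch.cat([obj_feat, attrs], dim=-1))
        # Returning both heads exposes the "reason": affordances are predicted
        # explicitly conditioned on the estimated attributes.
        return attr_logits, aff_logits

attr_logits, aff_logits = AttributeToAffordance()(torch.randn(4, 512))      # dummy object features
```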
We design an action semantic space built on a verb taxonomy hierarchy and covering a massive set of actions. This lets us gather multi-modal datasets into a unified database under a unified label system, i.e., bridging “isolated islands” into a “Pangea”. We then propose a bidirectional mapping model between the physical and semantic spaces to make full use of Pangea.
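As a rough illustration of the bidirectional mapping idea (an assumed sketch, not the released Pangea model), one can pair a physical-to-semantic encoder that grounds observed action features into the verb-taxonomy space with a semantic-to-physical decoder that maps verb-node embeddings back into feature space; the names and dimensions below are hypothetical.

```python
# Hypothetical sketch of a bidirectional physical <-> semantic mapping.
import torch
import torch.nn as nn

class BidirectionalMapper(nn.Module):
    def __init__(self, phys_dim=1024, sem_dim=300, hidden=512):
        super().__init__()
        # physical -> semantic: ground an observed action into the semantic space
        self.p2s = nn.Sequential(nn.Linear(phys_dim, hidden), nn.ReLU(), nn.Linear(hidden, sem_dim))
        # semantic -> physical: project a verb-node embedding back to feature space
        self.s2p = nn.Sequential(nn.Linear(sem_dim, hidden), nn.ReLU(), nn.Linear(hidden, phys_dim))

    def forward(self, phys_feat, sem_emb):
        sem_pred = self.p2s(phys_feat)    # recognition direction
        phys_pred = self.s2p(sem_emb)     # retrieval / generation direction
        return sem_pred, phys_pred

sem_pred, phys_pred = BidirectionalMapper()(torch.randn(8, 1024), torch.randn(8, 300))
```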
We rethink Ego-HOI recognition and propose a new framework as infrastructure to advance it through Probing, Curation and Adaption (EgoPCA). We contribute comprehensive pre-training sets, balanced test sets, and a new baseline, together with a training-finetuning strategy and several new, effective mechanisms and settings to support further research.
A human-agent joint learning teleoperation system for faster data collection, less human effort, and efficient robot manipulation skill acquisition.
We provide the first systematic study of video distillation and introduce a taxonomy of temporal compression, which motivates our unified framework for disentangling the dynamic and static information in videos. It first distills the videos into still images as static memory and then compensates for the dynamic and motion information with a learnable dynamic memory block.
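A toy sketch of the static/dynamic disentanglement (hypothetical names and shapes, not the paper's implementation): the distilled still image acts as static memory, and a small learnable dynamic memory of per-frame residuals adds motion back to form a synthetic clip.

```python
# Toy sketch of static + dynamic memory for video distillation (assumed design).
import torch
import torch.nn as nn

class DistilledVideo(nn.Module):
    def __init__(self, n_frames=8, channels=3, height=112, width=112):
        super().__init__()
        # static memory: one learnable still image per distilled sample
        self.static_memory = nn.Parameter(torch.randn(1, channels, height, width))
        # dynamic memory: learnable per-frame residuals carrying motion information
        self.dynamic_memory = nn.Parameter(torch.zeros(n_frames, channels, height, width))

    def forward(self):
        # broadcast the still image over time, then compensate with motion residuals
        return self.static_memory + self.dynamic_memory   # (n_frames, C, H, W)

clip = DistilledVideo()()   # synthetic clip used to train a student video model
```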
Main repo: HAKE. Sub-repos: Torch, TF, HAKE-AVA, Halpe, HOI List.
Oral Talk: Compositionality in Computer Vision, CVPR 2020