Hi, this is the website of RHOS team at MVIG. We study Human Activity Understanding, Visual Reasoning, and Embodied AI. We are building a knowledge-driven system that enables intelligent agents to perceive human activities, reason human behavior logics, learn skills from human activities, and interact with environment.
(S) Embodied AI: how to make agents learn skills from humans and interact with human & scene & object.
(S-1) Human Activity Understanding: how to learn and ground complex/ambiguous human activity concepts (body motion, human-object/human/scene interaction) and object concepts from multi-modal information (2D-3D-4D).
(S-2) Visual Reasoning: how to mine, capture, and embed the logics and causal relations from human activities.
(S-3) General Multi-Modal Foundation Models: especially for human-centric perception tasks.
(S-4) Activity Understanding from A Cognitive Perspective: work with multidisciplinary researchers to study how the brain perceives activities.
(E) Human-Robot Interaction (e.g. for Smart Hospital): work with the healthcare team (doctors and engineers) in SJTU to develop intelligent robots to help people.
We are actively looking for self-motivated students (Master/PhD, 2024 spring & fall), interns / engineers / visitors (CV/ML/ROB/NLP/Math/Phys background, always welcome) to join us in Machine Vision and Intelligence Group (MVIG). If you share same/similar interests, feel free to drop me an email with your resume.
Click here for more details.
Human Activity Knowledge Engine (HAKE) is a knowledge-driven system that aims at enabling intelligent agents to perceive human activities, reason human behavior logics, learn skills from human activities, and interact with objects and environments.
We propose a challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out object affordances and simultaneously give the reason: what attributes make an object possesses these affordances.
We design an action semantic space in view of verb taxonomy hierarchy and covering massive actions. Thus, we can gather multi-modal datasets into a unified database in a unified label system, i.e., bridging “isolated islands” into a “Pangea”. Accordingly, we propose a bidirectional mapping model between physical and semantic space to fully use Pangea.
|Main Repo:||HAKE Star|
|Sub-repos:||Torch Star||TF Star||HAKE-AVA Star|
|Halpe Star||HOI List Star|
Oral Talk: Compositionality in Computer Vision in CVPR 2020