Physical Simulation for Probabilistic Motion Tracking
Physics plays an important role in characterizing, describing and predicting motion. Most prior approaches to human motion tracking concentrated on efficient inference algorithms and prior motion models; however, few can explicitly account for the physical plausibility of the recovered motion. The primary purpose of this work is to enforce physical plausibility in the tracking of a single articulated human subject. Towards this end, we propose a full-body 3D physical simulation-based prior that explicitly incorporates motion control and dynamics into the Bayesian filtering framework. We consider the human's motion to be generated by a “control loop”. In this control loop, Newtonian physics approximates the rigid-body motion dynamics of the human and the environment through the application and integration of forces. Collisions generate interaction forces to prevent physically impossible hypotheses. This allows us to properly model human motion dynamics, ground contact and environment interactions. For efficient inference in the resulting high-dimensional state space, we introduce an exemplar-based control strategy to reduce the effective search space. As a result, we are able to recover the physically plausible kinematic and dynamic state of the body from monocular and multi-view imagery. We show, both quantitatively and qualitatively, that our approach performs favorably with respect to standard Bayesian filtering methods.
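To make the idea of a simulation-based prior concrete, the following is a minimal sketch, not the method from the paper. It assumes a particle filter as the Bayesian filter and replaces the full-body simulation with a hypothetical one-dimensional point mass falling onto a ground plane. The names `simulate_step` and `particle_filter`, the frame interval, and all noise levels are illustrative assumptions.

```python
import numpy as np

GRAVITY = -9.81   # m/s^2
DT = 1.0 / 30.0   # assumed frame interval (hypothetical)

def simulate_step(state, force_noise, dt=DT):
    """Advance (height, velocity) hypotheses by one step of toy Newtonian physics."""
    y, v = state[:, 0], state[:, 1]
    v = v + (GRAVITY + force_noise) * dt   # integrate forces (unit mass assumed)
    y = y + v * dt                         # integrate velocity
    below = y < 0.0
    y = np.where(below, 0.0, y)            # ground contact: penetration is impossible
    v = np.where(below, 0.0, v)            # assumed fully inelastic collision
    return np.stack([y, v], axis=1)

def particle_filter(observations, n_particles=500, obs_sigma=0.05, seed=0):
    """Track height from noisy height observations using the physics step as the prior."""
    rng = np.random.default_rng(seed)
    particles = np.stack(
        [rng.uniform(0.5, 1.5, n_particles), np.zeros(n_particles)], axis=1
    )
    estimates = []
    for z in observations:
        # Predict: every hypothesis is pushed through the simulator.
        particles = simulate_step(particles, rng.normal(0.0, 2.0, n_particles))
        # Update: Gaussian likelihood of the observed height.
        log_w = -0.5 * ((z - particles[:, 0]) / obs_sigma) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(w @ particles)
        # Resample in proportion to the weights.
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return np.array(estimates)
```

Because every hypothesis passes through `simulate_step`, no particle can end up below the ground, which is the toy analogue of collisions ruling out physically impossible hypotheses.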
Relevant papers:
- Physical Simulation for Probabilistic Motion Tracking, M. Vondrak, L. Sigal and O. C. Jenkins, IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008. (to appear)
Videos: [Supplementary Video - DIVX (30mb)], [Supplementary Video - WMV (30mb)].
Articulated Shape Estimation
Much of the work on articulated human pose and motion estimation has been limited by the use of crude generative models of humans represented as articulated collections of simple parts such as cylinders. More detailed triangulated mesh models obtained from laser range scans have been viewed as too high-dimensional for vision applications. Moreover, mesh models of individuals lack a convenient, low-dimensional parameterization to allow fitting to new subjects. In this research we emphasize the use of the SCAPE model (Shape Completion and Animation of PEople), which provides a low-dimensional parameterized mesh that is learned from a database of 3D range scans of different people. The SCAPE model captures correlated body shape deformations due to the identity of the person, as well as non-rigid muscle deformation due to articulation. We first showed how this model can be tractably estimated from silhouette images obtained from multiple views, given a reasonable initial pose [CVPR 2007].
More recently [NIPS 2007], we developed a discriminative method for directly recovering the model parameters from monocular images using a mixture of regressors. The predicted pose and shape are used to initialize a generative model for more detailed pose and shape estimation. The resulting approach allows fully automatic pose and shape recovery from monocular and multi-camera imagery. Experimental results show that our method is capable of robustly recovering articulated pose, shape and biometric measurements (e.g., height and weight) in both calibrated and uncalibrated camera environments.
Relevant papers:
- Combined discriminative and generative articulated pose and non-rigid shape estimation, L. Sigal, A. Balan and M. J. Black, Neural Information Processing Systems Conference, NIPS 2007.
- Shining a Light on Human Pose: On Shadows, Shading and the Estimation of Pose and Shape, A. Balan, M. J. Black, H. Haussecker and L. Sigal, IEEE International Conference on Computer Vision, ICCV 2007.
- Detailed Human Shape and Pose from Images, A. Balan, L. Sigal, M. J. Black, J. Davis and H. Haussecker, IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007.
Hierarchical Articulated 3D Pose-Estimation and Tracking
Recent work on 2D body pose estimation and tracking treats the body as a “cardboard person” in which the limbs are represented by 2D planar patches connected by joints. Such models are lower-dimensional than the full 3D model and recent work has shown that they can be estimated from 2D images. The results are typically noisy and imprecise but they provide exactly the kind of information necessary to generate proposals for the probabilistic inference of 3D human pose. Thus we simplify the 3D problem by introducing an intermediate 2D estimation stage.
To infer 2D body pose we adopt a generative bottom-up process. Simple body part detectors provide noisy probabilistic proposals for the location and 2D pose (orientation and foreshortening) of visible limbs (b). To estimate the pose of the limbs we exploit the idea of a 2D loose-limbed body model.
This process provides reasonable guesses for 2D body pose from which to estimate 3D pose. Sminchisescu et al. learned a probabilistic mapping from 2D silhouettes to 3D pose using a Mixture of Experts (MoE) discriminative model. We generalize their approach to learn a mapping from 2D poses (including joint angles and foreshortening information) to 3D poses. The approach uses a mixture of regularized linear regression models that are trained from a set of 2D-3D pose pairs obtained from motion capture data. Sampling from this model provides predicted 3D poses (d) that are appropriate as proposals for a Bayesian temporal inference process (e). Our multi-stage approach overcomes many of the problems inherent in inferring 3D pose directly from image features.
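As an illustration of sampling proposals from a mixture of regularized linear regressors, the sketch below makes several assumptions that are not taken from the paper: the gating regions come from plain k-means on the inputs, each expert is a closed-form ridge regression, and the gate is a softmax over squared distances to the cluster centers. The function names (`fit_ridge`, `fit_mixture`, `sample_3d_poses`) and the noise level are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, lam=1e-2):
    """Closed-form ridge regression with a bias column appended to X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)           # shape (d_in + 1, d_out)

def assign(X, centers):
    """Index of the nearest center for every row of X."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)

def fit_mixture(X, Y, n_experts=2, n_iters=10, lam=1e-2, seed=0):
    """Assumed training scheme: k-means gating regions, one ridge expert per region."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_experts, replace=False)].astype(float)
    for _ in range(n_iters):
        labels = assign(X, centers)
        for k in range(n_experts):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    labels = assign(X, centers)
    experts = [fit_ridge(X[labels == k], Y[labels == k], lam) for k in range(n_experts)]
    return centers, experts

def sample_3d_poses(model, x2d, n_samples, noise_sigma=0.01, seed=0):
    """Draw 3D pose proposals for one 2D pose vector x2d."""
    rng = np.random.default_rng(seed)
    centers, experts = model
    d = ((centers - x2d) ** 2).sum(-1)
    gate = np.exp(-(d - d.min()))                  # softmax over negative distances
    gate /= gate.sum()
    xb = np.append(x2d, 1.0)
    means = np.stack([xb @ W for W in experts])    # one prediction per expert
    picks = rng.choice(len(experts), size=n_samples, p=gate)
    noise = rng.normal(0.0, noise_sigma, (n_samples, means.shape[1]))
    return means[picks] + noise
```

Here `X` would hold 2D pose vectors and `Y` the paired 3D poses from motion capture; the returned samples play the role of proposals for the temporal inference stage.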
Relevant papers:
- Predicting 3D People from 2D Pictures, L. Sigal and M. J. Black, IV Conference on Articulated Motion and Deformable Objects, AMDO 2006 (Best Paper Award).
Videos: [2D to 3D Pose Inference (16mb)], [3D Tracking (75mb)].
Loose-limbed Body Model
In recent years we have presented a number of methods for fully automatic pose estimation and tracking of human bodies in 2D [CVPR 2006] and 3D [NIPS 2003], [CVPR 2004]. Initialization and failure recovery in these methods are facilitated by the use of a loose-limbed body model in which limbs are connected via learned probabilistic constraints. Pose estimation and tracking can then be formulated as inference in a loopy graphical model, and approximate belief propagation can be used to estimate the pose of the body at each time-step. Each node in the graphical model represents the position and orientation of a limb, and the directed edges between nodes represent statistical dependencies between limbs.
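The message-passing scheme can be illustrated with a deliberately simplified sketch. It assumes discrete limb states and undirected pairwise potentials, whereas the methods cited above work with continuous limb poses and non-parametric messages. The function `loopy_bp`, its argument layout, and the node names in the usage note are hypothetical.

```python
import numpy as np

def loopy_bp(unary, edges, pairwise, n_iters=20):
    """Sum-product belief propagation on a (possibly loopy) pairwise graph.

    unary:    dict node -> (S,) float array of local evidence
    edges:    list of (i, j) node pairs
    pairwise: dict (i, j) -> (S, S) array psi[x_i, x_j]
    """
    neighbors = {n: [] for n in unary}
    msgs = {}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
        msgs[(i, j)] = np.full(len(unary[j]), 1.0 / len(unary[j]))
        msgs[(j, i)] = np.full(len(unary[i]), 1.0 / len(unary[i]))

    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            # Combine local evidence at i with messages from all neighbors except j.
            prod = unary[i].copy()
            for k in neighbors[i]:
                if k != j:
                    prod = prod * msgs[(k, i)]
            m = psi.T @ prod                       # marginalize over the state of i
            new[(i, j)] = m / m.sum()
        msgs = new

    beliefs = {}
    for n in unary:
        b = unary[n].copy()
        for k in neighbors[n]:
            b = b * msgs[(k, n)]
        beliefs[n] = b / b.sum()
    return beliefs
```

For example, with three nodes named `torso`, `upper_arm` and `lower_arm` joined in a loop by attractive potentials, strong evidence at `torso` pulls the beliefs of the other two nodes toward the same state, even though they have no local evidence of their own.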
This paradigm has a number of significant advantages over more traditional methods for tracking human motion. Most traditional models of the body resort to kinematic tree-based representations in 2D, 2.5D, or 3D, leading to a high-dimensional search space. Searching for a body pose in this high-dimensional space is impractical, and so most tracking methods rely on manual initialization or a canonical starting pose. Additionally, they often exploit strong priors characterizing the motions present to speed up the search. The lack of automatic initialization from an arbitrary pose also makes it hard to recover from transient failures that often occur during tracking.
While the full body pose may be hard to recover directly, the location and pose of a subset of individual (visible) limbs is often much easier to compute. Many good head detectors exist, and limb detectors based on skin color, shading, and focus have been developed. This observation is what motivates the loose-limbed body model paradigm.
Relevant papers:
- Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation, L. Sigal and M. J. Black, IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2006.
- Tracking Loose-limbed People, L. Sigal, S. Bhatia, S. Roth, M. J. Black and M. Isard, IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2004.
- Attractive people: Assembling loose-limbed models using non-parametric belief propagation, L. Sigal, M. Isard, B. H. Sigelman and M. J. Black, Advances in Neural Information Processing Systems 16, NIPS 2003.
Generic Object Localization and Tracking
The detection and tracking of complex objects in natural scenes requires rich models of object appearance that can cope with variability among instances of the object and across changing viewing and lighting conditions. To that end we develop a probabilistic framework for automatic component-based detection and tracking. By combining object detection with tracking in a unified framework we can achieve a more robust solution for both problems. Tracking can make use of object detection for initialization and re-initialization during transient failures or occlusions, while object detection can be made more reliable by considering the consistency of the detection over time. Modeling objects as an arrangement of image-based (possibly overlapping) components facilitates detection of complex articulated objects and helps in handling partial object occlusions and local illumination changes.
Object detection and tracking is formulated as inference in a two-layer graphical model in which the coarse layer node represents the whole object and the fine layer nodes represent multiple component “parts” of the object. Directed edges between nodes represent learned spatial and temporal probabilistic constraints. Each node in the graphical model corresponds to a position and scale of the component or the object as a whole in an image at a given time instant. Each node also has an associated AdaBoost detector that is used to define the local image likelihood and a proposal process. In general the likelihoods and dependencies are not Gaussian. To infer the 2D position and scale at each node we exploit a form of Non-parametric Belief Propagation (BP) that uses a variation of particle filtering and can be applied over a loopy graph.
Relevant papers:
- Tracking Complex Objects using Graphical Object Models, L. Sigal, Y. Zhu, D. Comaniciu and M. J. Black, 1st International Workshop on Complex Motion, Springer-Verlag LNCS 3417, pp. 227-238, 2004.
Videos: [Vehicle Tracking (Components)], [Vehicle Tracking (Object)].
Skin-color Segmentation
Localizing and tracking patches of skin-colored pixels through an image sequence is a tool used in many face recognition and gesture tracking systems. Skin-color segmentation, particularly useful for its orientation and size invariance, is usually used for localization in early stages of these higher-level systems. An important challenge for any skin-color segmentation or tracking system is to accommodate varying illumination conditions that may occur within an image sequence.
We have developed a novel adaptive model that explicitly accounts for the changes resulting from varying illumination. We use an explicit second-order Markov model to predict the evolution of the skin-color (HSV) histogram over time. Histograms are dynamically updated based on feedback from the current segmentation and predictions of the Markov model. The evolution of the skin-color distribution at each frame is parameterized by translation, scaling, and rotation of the distribution in the color space. Consequent changes in the geometric parameterization of the distribution are propagated by warping and re-sampling the histogram. The parameters of the discrete-time dynamic Markov model are estimated using Maximum Likelihood Estimation and also evolve over time. The accuracy of the new dynamic skin-color segmentation algorithm is compared to that obtained via a static color model. Segmentation accuracy is evaluated using labeled ground-truth video sequences taken from staged experiments and popular movies. An overall increase in segmentation accuracy of up to 24 percent is observed in 17 out of 21 test sequences.
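The prediction step of a second-order Markov model can be sketched as follows. This is an illustrative assumption rather than the published formulation: the distribution parameters are treated as a generic vector following a linear second-order autoregressive model, and the coefficients are fit by least squares, which coincides with maximum likelihood if the noise is assumed Gaussian. The names `fit_ar2` and `predict_next` are hypothetical, and the histogram warping itself is not shown.

```python
import numpy as np

def fit_ar2(params):
    """Fit p_t = A1 p_{t-1} + A2 p_{t-2} to a (T, D) history of parameter vectors."""
    past = np.hstack([params[1:-1], params[:-2]])   # rows: [p_{t-1}, p_{t-2}]
    target = params[2:]                             # rows: p_t
    coef, *_ = np.linalg.lstsq(past, target, rcond=None)
    d = params.shape[1]
    return coef[:d].T, coef[d:].T                   # A1, A2, each (D, D)

def predict_next(params, A1, A2):
    """Predict the parameter vector for the next frame from the last two frames."""
    return A1 @ params[-1] + A2 @ params[-2]
```

Re-fitting `fit_ar2` over a sliding window of recent frames is one simple way to let the model parameters themselves evolve over time.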
Relevant papers:
- Skin Color-Based Video Segmentation under Time-Varying Illumination, L. Sigal, S. Sclaroff and V. Athitsos, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(7), pp. 862-877, July 2004.
- Estimation and Prediction of Evolving Color Distributions for Skin Segmentation Under Varying Illumination, L. Sigal, S. Sclaroff and V. Athitsos, IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000.