
Journal Title: Int J Comput Vis

Abbreviation: International Journal of Computer Vision

Publisher: Springer US

DOI: 10.1007/s00208-003-0416-y

ISSN: 1573-1405

Guest Editorial: Human Activity Understanding from 2D and 3D Data

Authors: Junsong Yuan, Wanqing Li, Zhengyou Zhang, David Fleet, Jamie Shotton
Publish Date: 2016/05/30
Volume: 118, Issue: 2, Pages: 113-114

Abstract

Automatic analysis of human motion has become one of the most active research topics in computer vision, due both to the scientific challenges of the problem and to its wide range of applications. Such applications include intelligent video surveillance, human–computer interfaces, intelligent humanoid robots, gaming, diagnosis, assessment, and treatment of musculoskeletal disorders, sports analysis, realistic synthesis and animation of human motion, and monitoring of elderly and disabled people at home. Extensive studies have been conducted in the past decade using 2D visual information captured by single or multiple cameras. However, the problem is far from being robustly solved, especially for viewpoint-independent recognition of diverse human actions and activities in a real environment.

Recent advances in 3D depth cameras using structured light or time-of-flight sensors, 3D information recovery from 2D images/videos, and the availability of portable human motion capture devices have made 3D data readily available. This 3D data is nurturing a potential breakthrough in human activity recognition. The release of Microsoft's Kinect sensors, ASUS's Xtion Pro Live sensors, and Intel's RealSense cameras has provided a commercially viable hardware platform to capture 3D data in real time. This special issue seeks high-quality and original research on human activity understanding using such 3D input, as well as that using more traditional 2D input. The goal of this special issue is twofold: (1) to advocate and promote research in human activity recognition using 2D and 3D data, and (2) to present novel human activity understanding techniques applicable to diverse applications.

Kernelized Multiview Projection for Robust Action Recognition by Shao et al. (doi: 10.1007/s11263-015-0861-6) presents a practical algorithm to fuse multiple different features for action recognition. The algorithm is based on a linear multiview embedding method using kernel matrices from different views. Using an alternate optimization via relaxation, a near-optimal projection and weights for each view are learned. Experiments demonstrate significant improvements over single-view methods and other feature fusion methods.

Exploiting Privileged Information from Web Data for Action and Event Recognition by Li et al. (doi: 10.1007/s11263-015-0862-5) proposes a new learning method to train robust classifiers for action and event recognition by using web videos as freely available training data. The proposed multi-instance learning methods not only take advantage of the additional textual descriptions of training web videos as privileged information, but also explicitly cope with noisy labels. New domain adaptation methods are also proposed to deal with the situation in which the training and test videos come from different data distributions.

Fusing R Features and Local Features with Context-Aware Kernels for Action Recognition by Yuan et al. (doi: 10.1007/s11263-015-0867-0) presents a new feature that captures the global spatiotemporal distribution of interest points. The feature is extracted through a 3D R transform and fused with the widely used bag-of-visual-words feature extracted locally at the spatiotemporal interest points. To improve robustness, a context-aware kernel approach is developed for similarity measurement. The paper demonstrates the value of the R feature and the context-aware kernel in action recognition through experiments on the UCF Sports, UCF Films, and Hollywood2 datasets.

Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation by Tzionas et al. (doi: 10.1007/s11263-016-0895-4) works on hand motion capture in interaction scenarios where hands interact with other hands or objects. The proposed method combines a generative model with discriminatively trained salient points to achieve a low tracking error. It also incorporates collision detection and physics simulation to achieve physically plausible estimates even in cases of occlusions and missing visual data. The method works well for monocular RGB-D sequences as well as for setups with multiple synchronized RGB cameras.

Gaze Estimation in the 3D Space Using RGB-D Sensors: Towards Head-Pose and User Invariance by Mora and Odobez (doi: 10.1007/s11263-015-0863-4) studies the problem of 3D gaze estimation within a 3D environment from remote sensors, which is valuable for human–human and human–robot interactions. It leverages the depth data of RGB-D cameras to perform accurate head pose tracking, acquires head pose invariance through a 3D rectification process that renders head-pose-dependent eye images into a canonical viewpoint, and computes the line of sight in 3D space. To address the low resolution of the eye image, an appearance-based gaze estimation paradigm is applied. The authors demonstrate good performance through extensive gaze estimation experiments on a public dataset, as well as through a gaze coding task applied to job interviews in a natural setting.

Multi-modal RGB–Depth–Thermal Human Body Segmentation by Palmero et al. (doi: 10.1007/s11263-016-0901-x) addresses the problem of human body segmentation from multi-modal visual cues. Human body detection and segmentation play an important role in automatic human behavior analysis. A new RGB–Depth–Thermal dataset is provided in this work, along with a multi-modal segmentation baseline. The several modalities are registered using a calibration device and a registration algorithm. The authors also report solid results on the new dataset.

A Hierarchical Video Description for Complex Activity Understanding by Liu et al. (doi: 10.1007/s11263-016-0897-2) describes a latent discriminative structural model that automatically detects a complex activity, its atomic actions, and the temporal structure of those atomic actions. The associated model learning method is semi-supervised and requires only that the training video samples be partially annotated with atomic actions. The paper demonstrates how the model is used to extract a rich, hierarchical description of activities from videos.

A Deep Structured Model with Radius–Margin Bound for 3D Human Activity Recognition by Lin et al. (doi: 10.1007/s11263-015-0876-z) extends the convolutional neural network (CNN) for action recognition from RGB-D data. The idea consists of integrating latent temporal structure with CNNs to decompose an activity video into sub-activity segments, training one CNN per segment to learn spatiotemporal features. To improve generalization on small training datasets, the paper adopts radius–margin bound regularization in the classification. The model is evaluated on several publicly available RGB-D datasets.

These eight papers cover a diverse range of human activity understanding from 2D and 3D data. They share new state-of-the-art ideas and technology and will benefit researchers and engineers who work in this area. With 2D and 3D visual sensors becoming cheaper and more capable, we believe human activity understanding will continue to be a fertile area for growth, given its many exciting applications.
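The multiview fusion idea behind the first paper above can be illustrated with a minimal sketch: each feature "view" of the same samples yields its own kernel matrix, and the views are combined as a weighted sum of kernels. This is only a toy illustration of kernel-level fusion, not the paper's method: the view weights here are fixed by hand rather than learned by the alternate optimization Shao et al. describe, and all function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fuse_view_kernels(kernels, weights):
    """Weighted sum of per-view kernel matrices; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

# Toy example: two feature views (of different dimensionality) of the same 4 samples.
rng = np.random.default_rng(0)
view1 = rng.normal(size=(4, 3))   # e.g. shape features
view2 = rng.normal(size=(4, 5))   # e.g. motion features
K = fuse_view_kernels(
    [rbf_kernel(view1, view1), rbf_kernel(view2, view2)],
    weights=[0.6, 0.4],
)
print(K.shape)  # (4, 4)
```

A convex combination of valid kernels is itself a valid kernel, so the fused matrix K can be fed directly to any kernel-based classifier; learning the weights (as in the paper) rather than fixing them is what makes the fusion adaptive.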

