We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, for the purpose of obtaining large-scale 2D/3D image data with perfect per-pixel ground truth. An attributed spatial And-Or graph (S-AOG) is proposed to represent indoor scenes. The S-AOG is a probabilistic grammar model, in which the terminal nodes are object entities including room, furniture, and supported objects. Human contexts as contextual relations are encoded by Markov Random Fields (MRF) on the terminal nodes. We learn the distributions from an indoor scene dataset and sample new layouts using Markov chain Monte Carlo (MCMC). Experiments demonstrate that the proposed method can robustly sample a large variety of realistic room layouts based on three criteria: (i) visual realism compared with a state-of-the-art room arrangement method, (ii) accuracy of the affordance maps with respect to ground truth, and (iii) the functionality and naturalness of synthesized rooms evaluated by human subjects.
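The sampling step can be illustrated with a minimal Metropolis-Hastings sketch. It is not the paper's sampler: the one-dimensional layout, the `energy` function, and all parameter names (`min_gap`, `step_size`, `room`, `temperature`) are hypothetical stand-ins for the learned S-AOG and MRF potentials, chosen only to show how MCMC draws layouts from a distribution proportional to `exp(-energy / temperature)`.

```python
import math
import random


def energy(layout, min_gap=1.0):
    """Toy energy (assumption): penalize object pairs closer than min_gap."""
    cost = 0.0
    for i in range(len(layout)):
        for j in range(i + 1, len(layout)):
            gap = abs(layout[i] - layout[j])
            if gap < min_gap:
                cost += (min_gap - gap) ** 2
    return cost


def metropolis_hastings(layout, steps=5000, step_size=0.5, room=10.0,
                        temperature=0.1, seed=0):
    """Sample 1D object positions in [0, room] with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = list(layout)
    e_cur = energy(current)
    for _ in range(steps):
        proposal = list(current)
        k = rng.randrange(len(proposal))
        proposal[k] += rng.gauss(0.0, step_size)
        if not 0.0 <= proposal[k] <= room:
            continue  # positions outside the room have zero probability: reject
        e_new = energy(proposal)
        # Accept downhill moves always, uphill moves with the Metropolis ratio.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / temperature):
            current, e_cur = proposal, e_new
    return current, e_cur


if __name__ == "__main__":
    final_layout, final_energy = metropolis_hastings([5.0, 5.0, 5.0])
    print(final_layout, final_energy)
```

Starting from three coincident objects, the chain tends to spread them apart because overlapping configurations carry high energy. In a real system the energy would combine many learned terms rather than a single spacing penalty.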
We present a novel Augmented Reality (AR) approach, through Microsoft HoloLens, to address the challenging problems of diagnosing, teaching, and patching interpretable knowledge of a robot. A Temporal And-Or graph (T-AOG) of opening bottles is learned from human demonstration and programmed to the robot. This representation yields a hierarchical structure that captures the compositional nature of the given task, which is highly interpretable for the users. By visualizing the knowledge structure represented by the T-AOG and the decision-making process of parsing a T-AOG, the user can intuitively understand what the robot knows, supervise the robot's action planner, and monitor visually latent robot states (e.g., the force exerted during interactions). Given a new task, through such comprehensive visualizations of the robot's inner functioning, users can quickly identify the reasons for failures, interactively teach the robot a new action, and patch it to the knowledge structure represented by the T-AOG. In this way, the robot is capable of solving similar but new tasks through only minor modifications provided interactively by the users. This process demonstrates the interpretability of our knowledge representation and the effectiveness of the AR interface.
Contact forces of the hand are visually unobservable, but play a crucial role in understanding hand-object interactions. In this paper, we propose an unsupervised learning approach for manipulation event segmentation and manipulation event parsing. The proposed framework incorporates hand pose kinematics and contact forces using a low-cost easy-to-replicate tactile glove. We use a temporal grammar model to capture the hierarchical structure of events, integrating extracted force vectors from the raw sensory input of poses and forces. The temporal grammar is represented as a temporal And-Or graph (T-AOG), which can be induced in an unsupervised manner. We obtain the event labeling sequences by measuring the similarity between segments using the Dynamic Time Alignment Kernel (DTAK). Experimental results show that our method achieves high accuracy in manipulation event segmentation, recognition and parsing by utilizing both pose and force data.
When a moving object collides with an object at rest, people immediately perceive a causal event: i.e., the first object has launched the second object forwards. However, when the second object's motion is delayed, or is accompanied by a collision sound, causal impressions attenuate and strengthen, respectively. Despite a rich literature on causal perception, researchers have exclusively utilized 2D visual displays to examine the launching effect. It remains unclear whether people are equally sensitive to the spatiotemporal properties of observed collisions in the real world. The present study first examined whether previous findings in causal perception with audiovisual inputs can be extended to immersive 3D virtual environments. We then investigated whether perceived causality is influenced by variations in the spatial position of an auditory collision indicator. We found that people are able to localize sound positions based on auditory inputs in VR environments, and that spatial discrepancy between the estimated position of the collision sound and the visually observed impact location attenuates perceived causality.
This paper studies the challenging problem of tracking severely occluded objects in long video sequences. The proposed method reasons about containment relations and human actions, and thereby infers and recovers the identities of occluded objects while they are contained or blocked by others. There are two conditions that lead to incomplete trajectories: i) Contained. The occlusion is caused by a containment relation formed between two objects, e.g., an unobserved laptop inside a backpack forms a containment relation between the laptop and the backpack. ii) Blocked. The occlusion is caused by other objects blocking the view from certain locations, during which the containment relation does not change. By explicitly distinguishing these two causes of occlusion, the proposed algorithm formulates the tracking problem as a network flow representation encoding containment relations and their changes. By assuming that occlusions do not happen spontaneously but are only triggered by human actions, an MAP inference is applied to jointly interpret the trajectory of an object by detection in space and human actions in time. To quantitatively evaluate our algorithm, we collect a new occluded object dataset captured by a Kinect sensor, including a set of RGB-D videos and human skeletons with multiple actors, various objects, and different changes of containment relations. In the experiments, we show that the proposed method demonstrates better performance on tracking occluded objects compared with baseline methods.
Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces imposed to unlock the child-safety mechanisms of medicine bottles. From these observations, we learn an action planner through both a top-down stochastic grammar model (And-Or graph) to represent the compositional nature of the task sequence and a bottom-up discriminative model from the observed poses and forces. These two terms are combined during planning to select the next optimal action. We present a method for transferring this human-specific knowledge onto a robot platform and demonstrate that the robot can perform successful manipulations of unseen objects with similar task structure.
We present a design of an easy-to-replicate glove-based system that can reliably perform simultaneous hand pose and force sensing in real time, for the purpose of collecting human hand data during fine manipulative actions. The design consists of a sensory glove that is capable of jointly collecting data of finger poses, hand poses, as well as forces on palm and each phalanx. Specifically, the sensory glove employs a network of 15 IMUs to measure the rotations between individual phalanxes. Hand pose is then reconstructed using forward kinematics. Contact forces on the palm and each phalanx are measured by 6 customized force sensors made from Velostat, a piezoresistive material whose force-voltage relation is investigated. We further develop an open-source software pipeline consisting of drivers and processing code and a system for visualizing hand actions that is compatible with the popular Raspberry Pi architecture. In our experiment, we conduct a series of evaluations that quantitatively characterize both individual sensors and the overall system, proving the effectiveness of the proposed design.
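The reconstruction of hand pose by forward kinematics can be sketched for a single finger. This is a simplified planar illustration, not the system's actual pipeline: it assumes each phalanx rotates in one plane and that the relative joint angles are already available (in the glove they would be derived from IMU orientations). The function name and the segment lengths are hypothetical.

```python
import math


def finger_forward_kinematics(lengths, joint_angles):
    """Return joint positions of a planar chain, from the base to the fingertip.

    lengths: phalanx lengths, proximal to distal.
    joint_angles: rotation of each phalanx relative to the previous one (radians).
    """
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for length, angle in zip(lengths, joint_angles):
        heading += angle  # relative rotations accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points


if __name__ == "__main__":
    # Hypothetical phalanx lengths in centimeters, with each joint flexed by 30 degrees.
    flexed = finger_forward_kinematics([4.0, 2.5, 2.0], [math.radians(30)] * 3)
    print(flexed[-1])
```

A full hand model would use 3D rotations (for example, quaternions from each IMU) and one chain per finger, but the accumulation of relative rotations along the chain is the same idea.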
A growing body of evidence supports the hypothesis that humans infer future states of perceived physical situations by propagating noisy representations forward in time using rational (approximate) physics. In the present study, we examine whether humans are able to predict (1) the resting geometry of sand pouring from a funnel and (2) the dynamics of three substances---liquid, sand, and rigid balls---flowing past obstacles into two basins. Participants' judgments in each experiment are consistent with simulation results from the intuitive substance engine (ISE) model, which employs a Material Point Method (MPM) simulator with noisy inputs. The ISE outperforms ground-truth physical models in each situation, as well as two data-driven models. The results reported herein expand on previous work proposing human use of mental simulation in physical reasoning and demonstrate human proficiency in predicting the dynamics of sand, a substance that is less common in daily life than liquid or rigid objects.
Visuomotor adaptation plays an important role in motor planning and execution. However, it remains unclear how sensorimotor transformations are recalibrated when visual and proprioceptive feedback are decoupled. To address this question, the present study asked participants to reach toward targets in a virtual reality (VR) environment. They were given visual feedback of their arm movements in VR that was either consistent (normal motion) with the virtual world or reflected (reversed motion) with respect to the left-right and vertical axes. Participants completed two normal motion experimental sessions, with a reversed motion session in between. While reaction time in the reversed motion session was longer than in the normal motion session, participants showed learning improvement by completing trials in the second normal motion session faster than in the first. The reduction in reaction time was found to correlate with greater use of linear reaching trajectory strategies (measured using dynamic time warping) in the reversed and second normal motion sessions. This result appears consistent with linear motor movement planning guided by increased attention to visual feedback. This strategic bias persisted into the second normal motion session. Participants in the reversed session were grouped into two clusters depending on their preference for proximal/distal and awkward/smooth motor movements. We found that participants who preferred distal-smooth movements produced more linear trajectories than those who preferred proximal-awkward movements.
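Dynamic time warping (DTW), the measure mentioned above, can be sketched in a few lines. The abstract does not specify how DTW was applied, so the usage below (comparing a reach trajectory against a straight line between its endpoints) is an assumption for illustration, as are the function names.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of points (tuples of floats)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance (Python 3.8+)
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def straight_line(start, end, samples):
    """Evenly spaced points on the segment from start to end (samples >= 2)."""
    return [tuple(s + (e - s) * k / (samples - 1) for s, e in zip(start, end))
            for k in range(samples)]


if __name__ == "__main__":
    # Hypothetical curved reach in 2D; a lower DTW cost means a more linear path.
    reach = [(0.0, 0.0), (0.3, 0.5), (0.6, 0.8), (1.0, 1.0)]
    reference = straight_line(reach[0], reach[-1], len(reach))
    print(dtw_distance(reach, reference))
```

Because DTW aligns sequences nonlinearly in time, two trajectories that follow the same path at different speeds receive a low cost, which makes it a reasonable shape comparison for reaches of varying duration.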
We propose the configurable rendering of massive quantities of photorealistic images with ground truth for the purposes of training, benchmarking, and diagnosing computer vision models. In contrast to the conventional (crowd-sourced) manual labeling of ground truth for a relatively modest number of RGB-D images captured by Kinect-like sensors, we devise a non-trivial configurable pipeline of algorithms capable of generating a potentially infinite variety of indoor scenes using a stochastic grammar, specifically, one represented by an attributed spatial And-Or graph. We employ physics-based rendering to synthesize photorealistic RGB images while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity and material information, as well as illumination. Our pipeline is configurable inasmuch as it enables the precise customization and control of important attributes of the generated scenes. We demonstrate that our generated scenes achieve a performance similar to the NYU v2 Dataset on pre-trained deep learning models. By modifying pipeline components in a controllable manner, we furthermore provide diagnostics on common scene understanding tasks, e.g., depth and surface normal prediction and semantic segmentation.
This paper examines how humans adapt to novel physical situations with unknown gravitational acceleration in immersive virtual environments. We designed four virtual reality experiments with different tasks for participants to complete: strike a ball to hit a target, trigger a ball to hit a target, predict the landing location of a projectile, and estimate the flight duration of a projectile. The first two experiments compared human behavior in the virtual environment with real-world performance reported in the literature. The last two experiments aimed to test the human ability to adapt to novel gravity fields by measuring their performance in trajectory prediction and time estimation tasks. The experiment results show that: 1) based on brief observation of a projectile's initial trajectory, humans are accurate at predicting the landing location even under novel gravity fields, and 2) humans' time estimation in a familiar earth environment fluctuates around the ground truth flight duration, although the time estimation in unknown gravity fields indicates a bias toward earth's gravity.
Both synthetic static and simulated dynamic 3D scene data are highly useful in the fields of computer vision and robot task planning. Yet their virtual nature makes it difficult for real agents to interact with such data in an intuitive way. Thus currently available datasets are either static or greatly simplified in terms of interactions and dynamics. In this paper, we propose a system in which Virtual Reality and human/finger pose tracking are integrated to allow agents to interact with virtual environments in real time. Segmented object and scene data are used to construct a scene within Unreal Engine 4, a physics-based game engine. We then use the Oculus Rift headset with a Kinect sensor, a Leap Motion controller, and a dance pad to navigate and manipulate objects inside synthetic scenes in real time. We demonstrate how our system can be used to construct a multi-jointed agent representation as well as fine-grained finger pose. Finally, we propose how our system can be used for robot task planning and image semantic segmentation.
In this paper, we present a probabilistic approach to explicitly infer containment relations between objects in 3D scenes. Given an input RGB-D video, our algorithm quantizes the perceptual space of a 3D scene by reasoning about containment relations over time. At each frame, we represent the containment relations in space by a containment graph, where each vertex represents an object and each edge represents a containment relation. We assume that human actions are the only cause of containment relation changes over time, and classify human actions into four types of events: move-in, move-out, no-change, and paranormal-change. Here, paranormal-change refers to events that are physically infeasible and are thus ruled out through reasoning. A dynamic programming algorithm is adopted to find both the optimal sequence of containment relations across the video and the containment relation changes between adjacent frames. We evaluate the proposed method on our dataset with 1326 video clips taken in 9 indoor scenes, including some challenging cases, such as heavy occlusions and diverse changes of containment relations. The experimental results demonstrate good performance on the dataset.
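The dynamic programming step can be sketched with a Viterbi-style recursion over per-frame states. This is a simplified illustration under stated assumptions, not the paper's algorithm: the states stand for hypothetical containment configurations (for example, 0 = "outside", 1 = "inside"), `frame_scores` holds made-up per-frame log-scores, and a constant `change_penalty` stands in for the cost of a move-in or move-out event.

```python
def best_state_sequence(frame_scores, change_penalty=1.0):
    """Return the highest-scoring state sequence.

    frame_scores[t][s]: log-score of state s at frame t (assumed given).
    change_penalty: cost subtracted whenever the state changes between frames.
    """
    n_states = len(frame_scores[0])
    best = list(frame_scores[0])
    backpointers = []
    for scores in frame_scores[1:]:
        new_best, pointers = [], []
        for s in range(n_states):
            # Score of arriving in state s from each possible previous state.
            candidates = [best[p] - (change_penalty if p != s else 0.0)
                          for p in range(n_states)]
            p_star = max(range(n_states), key=candidates.__getitem__)
            new_best.append(candidates[p_star] + scores[s])
            pointers.append(p_star)
        best = new_best
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=best.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    scores = [[0.0, -2.0], [0.0, -2.0], [-0.5, 0.0], [-2.0, 0.0], [-2.0, 0.0]]
    path = best_state_sequence(scores)
    # Frames where the state differs from the previous frame mark relation changes.
    changes = [t for t in range(1, len(path)) if path[t] != path[t - 1]]
    print(path, changes)
```

The recovered path gives the sequence of relations, and the frames where consecutive states differ give the relation changes, so both outputs come from one pass. The change penalty also suppresses isolated noisy frames, since a brief flip costs two changes.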
We propose a notion of affordance that takes into account physical quantities generated when the human body interacts with real-world objects, and introduce a learning framework that incorporates the concept of human utilities, which in our opinion provides a deeper and finer-grained account not only of object affordance but also of people's interaction with objects. Rather than defining affordance in terms of the geometric compatibility between body poses and 3D objects, we devise algorithms that employ physics-based simulation to infer the relevant forces/pressures acting on body parts. By observing the choices people make in videos (particularly in selecting a chair in which to sit) our system learns the comfort intervals of the forces exerted on body parts (while sitting). We account for people's preferences in terms of human utilities, which transcend comfort intervals to account also for meaningful tasks within scenes and spatiotemporal constraints in motion planning, such as for the purposes of robot task planning.
The physical behavior of moving fluids is highly complex, yet people are able to interact with them in their everyday lives with relative ease. To investigate how humans achieve this remarkable ability, the present study extended the classical water-pouring problem (Schwartz & Black, 1999) to examine how humans take into consideration physical properties of fluids (e.g., viscosity) and perceptual variables (e.g., volume) in a reasoning task. We found that humans do not rely on simple qualitative heuristics to reason about fluid dynamics. Instead, they rely on the perceived viscosity and fluid volume to make quantitative judgments. Computational results from a probabilistic simulation model can account for human sensitivity to hidden attributes, such as viscosity, and their performance on the water-pouring task. In contrast, non-simulation models based on statistical learning fail to fit human performance. The results in the present paper provide converging evidence supporting mental simulation in physical reasoning, in addition to developing a set of experimental conditions that rectify the dissociation between explicit prediction and tacit judgment through the use of mental simulation strategies.
In this paper, we present a new framework for task-oriented object modeling, learning and recognition. The framework includes: i) spatial decomposition of the object and 3D relations with the imagined human pose; ii) temporal pose sequence of human actions; iii) causal effects (physical quantities on the target object) produced by the object and action.
In this inferred representation, only the object is visible, and all other components are imagined "dark" matters. This framework subsumes other traditional problems, such as: (a) object recognition based on appearance and geometry; (b) action recognition based on poses; (c) object manipulation and affordance in robotics. We argue that objects, especially man-made objects, are designed for various tasks in a broad sense, and therefore it is natural to study them in a task-oriented framework.
Containers are ubiquitous in daily life. By container, we mean any physical object that can contain other objects, such as bowls, bottles, baskets, trash cans, refrigerators, etc. In this paper, we are interested in the following questions: What is a container? Will an object contain another object? How many objects will a container hold? We study these problems by evaluating human cognition of containers and containing relations with physical simulation. In the experiments, we analyze human judgments with respect to results of physical simulation under different scenarios. We conclude that physical simulation is a good approximation to the human cognition of containers and containing relations.
Google’s Android platform includes a permission model that protects access to sensitive capabilities, such as Internet access, GPS use, and telephony. We have found that Android’s current permissions are often overly broad, providing apps with more access than they truly require. This deviation from least privilege increases the threat from vulnerabilities and malware. To address this issue, we present a novel system that can replace existing platform permissions with finer-grained ones. A key property of our approach is that it runs today, on stock Android devices, requiring no platform modifications. Our solution is composed of two parts: Mr. Hide, which runs in a separate process on a device and provides access to sensitive data as a service; and Dr. Android (Dalvik Rewriter for Android), a tool that transforms existing Android apps to access sensitive resources via Mr. Hide rather than directly through the system. Together, Dr. Android and Mr. Hide can completely remove several of an app’s existing permissions and replace them with finer-grained ones, leveraging the platform to provide complete mediation for protected resources. We evaluated our ideas on several popular, free Android apps. We found that we can replace many commonly used “dangerous” permissions with finer-grained permissions. Moreover, apps transformed to use these finer-grained permissions run largely as expected, with reasonable performance overhead.