Wearable collaborative robots stand to assist human wearers who need fall prevention assistance or wear exoskeletons. Such a robot needs to predict the ego motion of the wearer from egocentric vision of the surrounding scene. In this work, we leverage body-mounted cameras and sensors to anticipate the trajectory of human wearers through complex surroundings. To facilitate research in ego-motion prediction, we collected a comprehensive walking-scene navigation dataset centered on the user's perspective. We present a method to predict human motion conditioned on the surrounding static scene. Our method leverages a diffusion model to produce a distribution of potential future trajectories, taking into account the user's observation of the environment. We introduce a compact representation that encodes the user's visual memory of the surroundings, as well as an efficient sample-generation technique that speeds up real-time inference of the diffusion model. We ablate our model and compare it to baselines; the results show that our model outperforms existing methods on key metrics of collision avoidance and trajectory mode coverage.
Our Egocentric Navigation Dataset was collected on and around the Stanford University campus. The dataset consists of 34 collections, each approximately 7 minutes long and spanning over 600 meters, designed to capture a wide range of interactions with the environment. It encompasses various weather conditions (rain, sunny, overcast), surface textures (glass, solid, glossy, reflective, water), and environmental features (stairs, ramps, flat ground, hills, off-road paths), alongside dynamic obstacles, including humans.
Recorded at a 20 Hz sampling rate, the dataset includes comprehensive state and visual information to capture the nuances of human behavior. States: 6-degree-of-freedom (DoF) torso pose in a global frame, leg joint angles (hips and knees of both legs), torso linear and angular velocity, and gait frequency. Visual: aligned color and depth images, semantic segmentation masks, and generated visual memory frames. The full dataset totals 198 minutes of data (over 400 GB); in practice, we find it is possible to train a very high-quality model even with a smaller dataset. We therefore curated a high-quality pilot dataset containing roughly 15% of the full data, which allows us to iterate quickly with faster training at an acceptable cost in performance. This trade-off is qualitatively compared in the ablation.
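For concreteness, the sketch below shows one plausible way to organize a single 20 Hz frame of the dataset; the field names and array shapes are our own illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EgoNavFrame:
    """One 20 Hz sample; field names and shapes are illustrative, not the released schema."""
    timestamp: float            # seconds since the start of the collection
    torso_pose: np.ndarray      # (4, 4) homogeneous 6-DoF torso pose in the global frame
    joint_angles: np.ndarray    # (4,) hip and knee angles for both legs, radians
    torso_lin_vel: np.ndarray   # (3,) torso linear velocity, m/s
    torso_ang_vel: np.ndarray   # (3,) torso angular velocity, rad/s
    gait_frequency: float       # steps per second
    rgb: np.ndarray             # (H, W, 3) color image, aligned with depth
    depth: np.ndarray           # (H, W) metric depth, meters
    semantics: np.ndarray       # (H, W, 8) per-pixel semantic channels
    visual_memory: np.ndarray   # (Hp, Wp, C) egocentric panorama (see visual memory section)
```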
We hope to address critical gaps by providing dense, high-frequency logs with rich visual and state information. We are committed to open-sourcing our dataset after de-identifying all faces within the data. We will also provide the software tools used to collect and process the data, in case anyone wishes to extend the dataset or collect similar data in different environments.
The goal of this work is to predict the possible paths of a person in a cluttered environment. A trajectory is defined as a sequence of 6D poses (translation and orientation) of a person navigating the 3D world. At each time step t, our model uses the past trajectory to predict likely future trajectories. In addition, the prediction must be conditioned on the observation of the surroundings. The visual observation S encodes the appearance, geometry, and semantics of the environment captured by wearable visual and depth sensors.
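In notation introduced here for exposition only (the horizons H for the past and F for the future are assumptions, not values from the paper), the prediction problem can be written as learning a conditional distribution over future poses:

```latex
% Notation is ours for exposition; H (past horizon) and F (future horizon) are assumptions.
\begin{aligned}
\tau_{\text{past}}   &= \{x_{t-H+1}, \dots, x_{t}\}, \qquad x_i \in SE(3), \\
\tau_{\text{future}} &= \{x_{t+1}, \dots, x_{t+F}\}, \\
\text{learn} \quad   & p_\theta\!\left(\tau_{\text{future}} \mid \tau_{\text{past}},\, S\right),
\end{aligned}
```

where drawing samples from this conditional distribution yields the set of plausible future trajectories given the past motion and the scene observation S.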
Therefore, our method takes as input the past trajectory of the person and a short history of RGBD images. The color images are semantically labeled by DINOv2 into 8 semantic channels, while the depth images go through a preprocessing pipeline that filters out erroneously filled edges. We transform the past trajectory from a global coordinate frame to an egocentric frame defined by the gravity vector as -Z and the forward-facing direction as +X. The collected images are then projected and globally aligned to create a single panorama in the egocentric coordinate frame, referred to as "visual memory". Conditioned on the visual memory and the past ego trajectory, a diffusion model is trained to predict the future trajectory, with encoded visual observations as auxiliary outputs. Finally, we use the VAE decoder to recover the expected future panorama. Combined with the hybrid generation method described in a later section, this provides a fast and effective way to predict the distribution of future states conditioned on the observed environment.
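To make the coordinate convention concrete, the sketch below constructs the egocentric frame (gravity along -Z, horizontal forward direction along +X) and re-expresses past global poses in it. Function and variable names are our own; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def build_ego_frame(position, gravity_world, forward_world):
    """Egocentric frame: -Z along gravity, +X along the horizontal forward direction.
    Inputs are 3-vectors in the global frame; names are illustrative."""
    z = -gravity_world / np.linalg.norm(gravity_world)      # up axis = opposite of gravity
    fwd = forward_world - np.dot(forward_world, z) * z      # remove the vertical component
    x = fwd / np.linalg.norm(fwd)                           # forward axis
    y = np.cross(z, x)                                      # completes a right-handed frame
    T = np.eye(4)
    T[:3, :3] = np.stack([x, y, z], axis=1)                 # columns are ego axes in world coords
    T[:3, 3] = position
    return T                                                # ego frame expressed in the world frame

def global_to_ego(poses_world, T_world_ego):
    """Re-express (N, 4, 4) homogeneous global poses relative to the current ego frame."""
    return np.linalg.inv(T_world_ego)[None] @ poses_world
```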
Visual memory is an ego-perspective, panoramic representation of the surroundings. Given the camera's intrinsic and extrinsic parameters, images from different frames and channels can be projected into a single point cloud in the global frame. A distance-based filter is applied to remove points too far from the current pose. The points are then projected back into the current ego frame to form a coherent representation of all the scene information gathered. It is important to note that the visual memory representation holds far more relevant information than a single image. As shown below, a single image from a stereo camera has only a narrow FOV pointing directly ahead. It fails to capture objects and paths in the scene that are highly relevant to the prediction, so many individual frames would have to be sent to the prediction module, relying on the model's ability to extract the useful information. The visual memory, by contrast, stitches past frames together and integrates the multi-modal inputs into a single image, greatly improving model and storage efficiency.
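A minimal sketch of this accumulate-and-reproject step is given below, assuming pinhole intrinsics, metric depth, and an equirectangular panorama layout; the actual channel packing, projection model, and thresholds may differ from ours.

```python
import numpy as np

def unproject(depth, K, T_world_cam):
    """Lift a metric depth map to world-frame 3D points (pinhole model assumed)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)        # (4, H*W)
    return (T_world_cam @ pts_cam)[:3].T                          # (H*W, 3)

def build_visual_memory(frames, T_world_ego, max_dist=15.0, pano_hw=(128, 512)):
    """Accumulate past frames into one egocentric panorama ("visual memory").

    frames: list of (depth, K, T_world_cam, features) tuples, where features is
            an (H*W, C) array of per-pixel channels (color + semantics).
    max_dist and pano_hw are placeholder values, not the paper's settings.
    """
    pts, feats = [], []
    for depth, K, T_world_cam, f in frames:
        pts.append(unproject(depth, K, T_world_cam))
        feats.append(f)
    pts, feats = np.concatenate(pts), np.concatenate(feats)

    # Transform into the current ego frame and drop points that are too far away.
    pts_ego = (np.linalg.inv(T_world_ego) @ np.c_[pts, np.ones(len(pts))].T)[:3].T
    keep = np.linalg.norm(pts_ego, axis=1) < max_dist
    pts_ego, feats = pts_ego[keep], feats[keep]

    # Equirectangular projection: azimuth -> column, elevation -> row.
    r = np.linalg.norm(pts_ego, axis=1).clip(1e-6)
    az = np.arctan2(pts_ego[:, 1], pts_ego[:, 0])                 # forward (+X) maps to center
    el = np.arcsin(pts_ego[:, 2] / r)
    rows = ((0.5 - el / np.pi) * pano_hw[0]).astype(int).clip(0, pano_hw[0] - 1)
    cols = ((az / (2 * np.pi) + 0.5) * pano_hw[1]).astype(int).clip(0, pano_hw[1] - 1)

    pano = np.zeros((*pano_hw, feats.shape[1]), dtype=feats.dtype)
    pano[rows, cols] = feats                                       # last write wins per pixel
    return pano
```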
A conventional DDPM sampling method, while capable of producing high-quality predictions, operates at a pace that is impractical for applications requiring immediate responses, such as navigational aids or interactive systems. This limitation becomes even more pronounced when attempting to generate a distribution of future trajectories, as multiple denoising sequences are necessary to produce a substantial number of samples, exacerbating the time constraints. Conversely, while DDIM offers a considerable acceleration in generating predictions, it does so at the expense of sample quality, a compromise that is untenable for applications where the fidelity of predicted trajectories directly impacts functionality and safety.
To address these challenges, we introduce a hybrid generation scheme that combines the strengths of both methods. Hybrid generation initiates the reverse process with a DDIM-like approach to quickly approximate the trajectory distribution, followed by a refinement phase using DDPM steps. This essentially retains the multimodal gradient landscape at the end of the diffusion process, ensuring that the final output keeps the intricate details and nuanced variations captured by a traditional DDPM without the accompanying latency. In practice, we achieve a 50x acceleration in sample generation with minimal performance drop.
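A minimal sketch of such a hybrid sampler is shown below, assuming a standard epsilon-prediction diffusion model and a precomputed beta schedule; the step counts, the schedule, and the exact split between the DDIM phase and the DDPM tail are illustrative placeholders, not the paper's settings.

```python
import torch

@torch.no_grad()
def hybrid_sample(model, cond, shape, betas, ddim_steps=10, ddpm_tail=50):
    """Coarse deterministic DDIM jumps for most of the reverse process,
    followed by stochastic DDPM refinement over the last `ddpm_tail` timesteps.
    `model(x, t, cond)` is assumed to predict the noise epsilon."""
    T = len(betas)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)                      # cumulative \bar{alpha}_t

    x = torch.randn(shape)

    # Phase 1: DDIM-like updates on a coarse timestep grid, down to t = ddpm_tail.
    grid = torch.linspace(T - 1, ddpm_tail, ddim_steps).long()
    for t, t_prev in zip(grid[:-1], grid[1:]):
        eps = model(x, t, cond)
        x0 = (x - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()        # predicted clean sample
        x = abar[t_prev].sqrt() * x0 + (1 - abar[t_prev]).sqrt() * eps

    # Phase 2: full DDPM steps for the final, detail-recovering portion.
    for t in range(ddpm_tail, -1, -1):
        eps = model(x, torch.tensor(t), cond)
        mean = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x
```

The split between the two phases controls the speed/fidelity trade-off: a shorter DDPM tail yields fewer network evaluations, while a longer tail preserves more of the fine structure of the learned distribution.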
A visual illustration of three metrics is shown below:
We found two previous works that address a similar task to ours. The CXA transformer by Jianing et al. proposes a novel cascaded cross-attention transformer block to fuse multimodal inputs, and then uses a transformer decoder to generate the predicted trajectory autoregressively. The LSTM-VAE is another popular method commonly used in various trajectory prediction tasks. All three models are trained on the same dataset and evaluated with the same data and metrics.
@article{wang2024egocentric,
  title={Egocentric Scene-aware Human Trajectory Prediction},
  author={Wang, Weizhuo and Liu, C. Karen and Kennedy III, Monroe},
  journal={arXiv preprint arXiv:2403.19026},
  year={2024}
}