Visual-Inertial Structure and Pose Estimation

We developed a method for registration in augmented reality that simultaneously tracks the position, orientation, and motion of the user’s head and estimates the three-dimensional (3-D) structure of the scene. The method fuses data from head-mounted cameras and head-mounted inertial sensors. Two Extended Kalman Filters (EKFs) are used: one estimates the motion of the user’s head, and the other estimates the 3-D locations of points in the scene. The two filters are coupled in a recursive feedback loop.

We developed our own AR system from scratch. (This was quite a while ago; no HoloLenses existed at that time!) We mounted an optical see-through display (Virtual i-O i-glasses) on a helmet, along with inertial sensors and cameras. The system was tethered to a PC, which processed the camera and inertial sensor data. We even had a mannequin head with a camera so that we could record what the user would see.

helmet1.jpg
optotrak.jpg

The LEDs on the helmet were used to obtain ground-truth data and to evaluate the accuracy of the system. They were tracked by a Northern Digital Optotrak 3020 optical position sensor mounted on the wall of the lab.

We used two coupled EKFs: one to estimate head motion and the other to estimate structure. Head motion was represented by a 15×1 state vector consisting of rotation angle, rotation velocity, translational position, translational velocity, and translational acceleration (three components each). Structure was represented by a 3N×1 state vector holding the estimated 3-D coordinates of the N tracked scene points.
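
As a rough illustration of how these state vectors can be laid out, here is a minimal NumPy sketch; the variable names and the helper function are ours, not taken from the original implementation:

```python
import numpy as np

N = 20  # example number of tracked scene points

# Head-motion state (15x1): rotation angle, rotation velocity, translational
# position, translational velocity, and translational acceleration.
x_motion = np.zeros(15)
angle      = x_motion[0:3]
angle_rate = x_motion[3:6]
position   = x_motion[6:9]
velocity   = x_motion[9:12]
accel      = x_motion[12:15]

# Structure state (3Nx1): stacked 3-D point coordinates [X1, Y1, Z1, X2, ...].
x_structure = np.zeros(3 * N)

def scene_point(x_structure, i):
    """Return the estimated 3-D coordinates of scene point i."""
    return x_structure[3 * i : 3 * i + 3]
```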

The idea is to take the output of the head motion filter and treat it as a known quantity when estimating the 3-D structure, and in turn to take the output of the structure estimation filter and treat it as a known quantity when estimating head motion. The two filters thus form a feedback loop.

EKF1.png
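
A minimal sketch of this feedback loop, with hypothetical class and method names (the actual filter equations are in the paper cited below), might look like the following:

```python
# Hypothetical skeleton of the coupled filters; the class and method names are
# placeholders, and the stubbed bodies (...) stand in for the EKF equations
# described in the paper.
class MotionEKF:
    def time_update(self, t): ...                           # project the 15x1 head-motion state to time t
    def measurement_update(self, features, points_3d): ...  # correct using image features and known 3-D points
    def pose(self): ...                                     # current head position and orientation

class StructureEKF:
    def time_update(self, t): ...
    def measurement_update(self, features, pose): ...       # correct using image features and a known head pose
    def points_3d(self): ...                                 # current 3-D point estimates

def on_camera_frame(motion, structure, features, t):
    # The head-motion filter treats the current structure estimate as known...
    motion.time_update(t)
    motion.measurement_update(features, structure.points_3d())
    # ...and the structure filter treats the updated head pose as known.
    structure.time_update(t)
    structure.measurement_update(features, motion.pose())
```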

A block diagram of the head motion estimation filter is shown below. We assume that the sensors are asynchronous and that their noise sources are mutually independent, so each sensor can be incorporated with its own measurement update. Whenever gyroscope, accelerometer, or camera data becomes available, the filter performs a time update that projects the state forward to the time of the new measurement, followed by a measurement update that corrects the state using that measurement. This process runs recursively for as long as measurement data keeps arriving.

Block diagram of head motion filter.

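To make the cycle concrete, here is a generic EKF time update and measurement update in NumPy. The specific state-transition model and the gyroscope, accelerometer, and camera measurement models (and their Jacobians) are given in the paper; the functions below are only a sketch of the structure of each step.

```python
import numpy as np

def time_update(x, P, f, F, Q):
    """Project state x and covariance P to the time of the next measurement.
    f: state-transition function, F: its Jacobian, Q: process-noise covariance."""
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def measurement_update(x, P, z, h, H, R):
    """Correct the predicted state with a new sensor measurement z.
    h: measurement function, H: its Jacobian at x, R: measurement-noise covariance."""
    y = z - h(x)                      # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

Because the sensors are asynchronous, the same pair of steps is run once per arriving gyroscope, accelerometer, or camera measurement, each sensor with its own measurement function h, Jacobian H, and noise covariance R.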

A block diagram of the structure estimation filter is shown below. Our implementation can use either one or two cameras. If two cameras are used, their inputs may be asynchronous, so each camera is incorporated with its own measurement update. Whenever data from either camera becomes available, the filter performs a time update that projects the state forward to the time of the new image, followed by a measurement update that corrects the structure estimate using the new image measurements.

Block diagram of structure estimation filter.

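The camera measurement for the structure filter is, in essence, the perspective projection of each estimated 3-D point into the image, with the head pose from the motion filter treated as a known camera pose. The sketch below assumes an ideal pinhole camera with focal length f and principal point (cx, cy); the actual camera model and Jacobians are in the paper.

```python
import numpy as np

def project_point(p_world, R_cw, t_cw, f, cx, cy):
    """Project a 3-D world point into pixel coordinates.
    R_cw, t_cw: rotation and translation taking world points into the camera
    frame (derived from the estimated head pose and the camera's mounting on
    the helmet); f, cx, cy: assumed pinhole intrinsics."""
    p_cam = R_cw @ p_world + t_cw        # point expressed in camera coordinates
    u = f * p_cam[0] / p_cam[2] + cx     # perspective projection
    v = f * p_cam[1] / p_cam[2] + cy
    return np.array([u, v])
```

The innovation for the measurement update is then the difference between each observed feature location and this predicted projection.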

Complete details of our system, along with experimental results, are given in our paper: Chai, L., W.A. Hoff, and T. Vincent, “Three-dimensional motion and structure estimation using inertial sensors and computer vision for augmented reality,” Presence: Teleoperators and Virtual Environments, Vol. 11, No. 5, pp. 474-492, 2002. (pdf)

In 2008 my graduate student John Steinbis extended and improved this system. John used a quaternion representation for orientation instead of Euler angles, developed a new type of fiducial target, and implemented a real-time (30 fps) system. Details are described in his thesis: J. Steinbis, "A New Vision & Inertial Pose Estimation System for Handheld Augmented Reality", MS thesis, Engineering Division, Colorado School of Mines, 2008. (pdf)
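
Quaternions avoid the singularities (gimbal lock) that can afflict Euler-angle representations, at the cost of having to renormalize the estimate. As a general illustration (not code from the thesis), propagating a quaternion from gyroscope rates looks roughly like this:

```python
import numpy as np

def propagate_quaternion(q, omega, dt):
    """Integrate a unit quaternion q = [w, x, y, z] forward by dt seconds,
    given the body angular rate omega = [wx, wy, wz] (rad/s) from the gyro."""
    wx, wy, wz = omega
    # Quaternion kinematics: q_dot = 0.5 * Omega(omega) * q
    Omega = np.array([[0.0, -wx,  -wy,  -wz],
                      [ wx,  0.0,  wz,  -wy],
                      [ wy, -wz,   0.0,  wx],
                      [ wz,  wy,  -wx,   0.0]])
    q = q + 0.5 * (Omega @ q) * dt
    return q / np.linalg.norm(q)   # renormalize to stay a unit quaternion
```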

The AR system was implemented on a handheld tablet with an attached camera and IMU.

The system ran in real time at 30 frames per second.