Self-learning visual perception for a mobile robot

Project proposer: Rudolf Mester, IDI

In this project, a motion perception system for a robot is to be developed. The main idea is that the ego-motion of a robot can be learned from the visual input in combination with data from an inertial measurement unit (IMU), which senses translational acceleration and rotational velocity. This is (hypothetically) the same way that humans and animals learn to perceive their motion: by combining visual input with signals from the inner ear (the sense of balance).

In the project, a suitable camera with a built-in IMU (Intel RealSense) shall be used to first collect extensive data (video and IMU recordings), and then, in a second step, to design an architecture that learns to derive the desired information (translation and rotation of the camera) from that data. We envisage very large data sets, on the order of 1 terabyte, as the relation to be learned is more complex than this simplified project description can convey.
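
As an illustration of the data collection step, the sketch below shows how synchronized video frames and IMU samples could be read from the camera using the pyrealsense2 Python wrapper. The stream resolutions, frame rates and IMU sample rates are placeholder assumptions and depend on the concrete RealSense model; the in-memory lists likewise only stand in for whatever on-disk storage format is eventually chosen.

    import numpy as np
    import pyrealsense2 as rs

    # Minimal recording sketch (assumption: a RealSense model with a colour
    # stream plus gyro and accelerometer, e.g. a D435i).  The stream settings
    # below are placeholders and must match the actual device.
    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
    config.enable_stream(rs.stream.gyro,  rs.format.motion_xyz32f, 200)
    config.enable_stream(rs.stream.accel, rs.format.motion_xyz32f, 250)
    pipeline.start(config)

    frames_log, imu_log = [], []        # in practice: stream to disk, not to memory
    try:
        for _ in range(1000):           # record a short sequence
            fs = pipeline.wait_for_frames()
            for f in fs:
                if f.is_motion_frame():
                    m = f.as_motion_frame().get_motion_data()   # gyro or accel sample
                    imu_log.append((f.get_timestamp(), m.x, m.y, m.z))
                elif f.is_video_frame():
                    img = np.asanyarray(f.get_data())           # H x W x 3 colour image
                    frames_log.append((f.get_timestamp(), img))
    finally:
        pipeline.stop()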

One important principle to be followed in this project is that we search for an architecture that is as sparse as possible while still being able to solve the given problem reliably. This means it is not intended to "simply" try a very deep architecture and solve the problem by computational overkill. The basis for this sparse and efficient solution is that quite a lot is known about motion perception, both in biological systems and in computer vision. The front end of the envisaged structure is a CNN with a spatio-temporal (that is: 3-dimensional) input layer, with 1 time dimension and 2 spatial dimensions. In other words, the primary information processing layer does not operate on only 2 images (as in many approaches), but on a stack of N images, N >> 2. This reflects the fact that the projection of most motions onto the retina (or camera image plane) is mostly a smooth curve; things do not oscillate wildly when we observe them, but move (mostly) smoothly across the image plane.
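
To make the front end concrete, the following PyTorch sketch shows one possible spatio-temporal input stage: a small stack of 3D convolutions over (time, height, width) that collapses the time axis into a coarse grid of latent descriptors. The number of frames N, the channel widths and the kernel sizes are illustrative assumptions, not design decisions.

    import torch
    import torch.nn as nn

    class SpatioTemporalFrontEnd(nn.Module):
        """Front-end CNN operating on a stack of N frames (1 time + 2 spatial dims)."""

        def __init__(self, n_frames=8, latent_channels=16):
            super().__init__()
            self.encoder = nn.Sequential(
                # input: (batch, 1, N, H, W) -- a grayscale frame stack
                nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
                nn.ReLU(inplace=True),
                # collapse the time dimension, keep a coarse spatial grid
                nn.Conv3d(64, latent_channels, kernel_size=(n_frames, 3, 3),
                          stride=(1, 2, 2), padding=(0, 1, 1)),
            )

        def forward(self, frame_stack):
            # frame_stack: (batch, 1, N, H, W)
            z = self.encoder(frame_stack)   # (batch, latent_channels, 1, H/4, W/4)
            return z.squeeze(2)             # coarse grid of latent motion descriptors

    # example: a stack of N = 8 frames of size 128 x 256
    z = SpatioTemporalFrontEnd()(torch.randn(2, 1, 8, 128, 256))   # -> (2, 16, 32, 64)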

The output of this CNN is a (coarse) estimate of the so-called "optical flow field", but in contrast to most conventional schemes, the output is not (only) a 2D motion vector, but comes in terms of a vector of latent variables that describe the local motion in a richer, more informative way. This first layer will be trained essentially like an auto-encoder network. During the training process, it will be investigated whether it is useful to include the measurements from the IMU in the loss function.
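
One conceivable way to organise this training is sketched below: a photometric reconstruction term stands in for whatever decoder is eventually chosen, and an optional second term ties a prediction derived from the latent field to the measured IMU data. Both the form and the weighting of these terms are assumptions that the project would have to investigate.

    import torch

    def frontend_loss(frames, reconstruction, imu_pred=None, imu_meas=None, imu_weight=0.1):
        """Auto-encoder style loss for the front-end CNN (illustrative sketch).

        frames, reconstruction : (B, 1, N, H, W) input frame stack and its reconstruction
        imu_pred, imu_meas     : optional (B, 6) predicted vs. measured accel + gyro
        """
        # reconstruction term: the latent "extended flow" field must carry enough
        # information to reproduce the input frame stack
        loss = torch.mean((reconstruction - frames) ** 2)
        # optional consistency term including the IMU measurements in the loss
        if imu_pred is not None and imu_meas is not None:
            loss = loss + imu_weight * torch.mean((imu_pred - imu_meas) ** 2)
        return loss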

This first layer, with a coarse subsampling of the "extended optical flow" field, subsequently feeds into a second network which, given the already trained first CNN layer, is trained to provide, together with the inertial measurement unit, the final ego-motion data, that is: the translational and rotational motion of the camera.
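
A possible shape of this second stage is sketched below: a small network that pools the coarse latent flow field, concatenates it with the IMU readings, and regresses a 6-parameter motion estimate (3 for translation, 3 for rotation). The channel widths and the rotation parametrisation are illustrative assumptions and left open by this description.

    import torch
    import torch.nn as nn

    class EgoMotionHead(nn.Module):
        """Second stage: coarse 'extended optical flow' field + IMU -> camera ego-motion."""

        def __init__(self, latent_channels=16, imu_dim=6):
            super().__init__()
            self.flow_encoder = nn.Sequential(
                nn.Conv2d(latent_channels, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.regressor = nn.Sequential(
                nn.Linear(128 + imu_dim, 64), nn.ReLU(inplace=True),
                nn.Linear(64, 6),   # 3 translation + 3 rotation parameters
            )

        def forward(self, latent_flow, imu):
            # latent_flow: (batch, latent_channels, h, w); imu: (batch, 6) = accel + gyro
            features = self.flow_encoder(latent_flow)
            return self.regressor(torch.cat([features, imu], dim=1))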

This project is well suited for a student (or two students) who not only have some initial knowledge about visual computing (e.g. from TDT4195 - Visual Computing Fundamentals) and deep learning (e.g. TDT4265 Computer Vision and Deep Learning) --- such knowledge is essential --- but are also interested in a thorough mathematical analysis and modeling of the problem at hand, and in a structured solution architecture that reflects all the known facts about the problem. So the ability and the willingness to approach the problem also from the mathematical and statistical side is a requirement.

In the ideal case, the student(s) will be supported in reporting their results not only in a thesis, but also in a conference paper. This is an option, but not a strict requirement.