Visually guided grasping robot MIRA

Abstract.

MIRA is a mobile robot capable of identifying and grasping a fruit on a table, taking the fruit's position into account. Its behaviour is guided by verbal instruction. The project focuses on the practical use of visual recognition and navigation in a continuous world, a task that is easy for humans yet traditionally difficult for machine intelligence.

Core components for docking: neural network.

The figure shows the neural network for the visually guided docking manoeuvre. Blue connections represent trained weights: the light blue ones are used only during training, while only the dark blue connections are used during performance. The dark rectangles show the neural activations (green if active) of the corresponding layers.

Core component 1: lower visual system.

The bottom-up recognition weights W and their feedback counterpart were trained using a sparse and topographic Helmholtz machine framework [Weber, C. (2001) Self-Organization of Orientation Maps, Lateral Connections, and Dynamic Receptive Fields in the Primary Visual Cortex (PS|PDF)]. As a result of training on real-world images they have become feature detectors: many neurons in the what area detect localised edges in the image, and some are colour selective (W as GIF).
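For illustration, the following Python sketch shows a classical wake-sleep update for a single-layer Helmholtz machine with binary units. It omits the sparse and topographic constraints of the cited framework; the patch vector x, the generative weight matrix G and the learning rate are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wake_sleep_step(x, W, G, lr=0.01):
    """One wake-sleep update for a one-layer Helmholtz machine.

    x : (n_pix,)        image patch with values in [0, 1]
    W : (n_hid, n_pix)  bottom-up recognition weights
    G : (n_pix, n_hid)  top-down generative (feedback) weights
    """
    # --- wake phase: recognise the real image, adapt the generative weights ---
    h = (sigmoid(W @ x) > rng.random(W.shape[0])).astype(float)   # sample hidden code
    x_rec = sigmoid(G @ h)                                        # top-down reconstruction
    G += lr * np.outer(x - x_rec, h)                              # delta rule on G

    # --- sleep phase: dream an image, adapt the recognition weights ---
    h_dream = (rng.random(W.shape[0]) < 0.1).astype(float)        # sparse fantasy code (fixed prior, a simplification)
    x_dream = (sigmoid(G @ h_dream) > rng.random(G.shape[0])).astype(float)
    h_rec = sigmoid(W @ x_dream)
    W += lr * np.outer(h_dream - h_rec, x_dream)                  # delta rule on W
    return W, G
```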

Core component 2: what-where association of the target object.

The lateral weights V within and between the what and the where areas were trained as a continuous associator network [Weber, C. and Wermter, S. (2003) Object Localization using Laterally Connected "What" and "Where" Associator Networks (PS|PDF)]. They associate the what neural activations (which contain information about the target fruit and the background) with the where location. During training, the where location of the fruit was given as a Gaussian activation hill centred at the position corresponding to the fruit's location within the image (the image and the where area are the same size, 24x16 units). During performance, the where activations are initially unknown and set to zero; through pattern completion via the V weights, the (hopefully) correct location emerges on the where area as a Gaussian hill of activation.
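The following sketch illustrates the two ingredients described here, assuming a flattened 24x16 where area: the Gaussian training target at the fruit's position, and the zero-initialised pattern completion through the lateral weights V. The transfer function, normalisation and number of relaxation steps are placeholders rather than the exact dynamics of the cited associator network.

```python
import numpy as np

W_IMG, H_IMG = 24, 16          # image / where-area size (24 x 16 units)

def gaussian_hill(cx, cy, sigma=1.5):
    """Training target on the where area: a Gaussian hill at the fruit position (cx, cy)."""
    xs, ys = np.meshgrid(np.arange(W_IMG), np.arange(H_IMG), indexing="xy")
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2)).ravel()

def complete_where(what, V_ww, V_lat, steps=20):
    """Pattern completion: infer the where hill from the what activations.

    what  : (n_what,)           activations of the what area
    V_ww  : (n_where, n_what)   lateral weights from the what to the where area
    V_lat : (n_where, n_where)  lateral weights within the where area
    """
    where = np.zeros(V_ww.shape[0])          # initialised to zero, as in the text
    for _ in range(steps):
        net = V_ww @ what + V_lat @ where    # input from the what area plus lateral support
        where = np.maximum(net, 0.0)         # simple rectification (placeholder transfer function)
        if where.max() > 0:
            where /= where.max()             # keep activations bounded
    return where.reshape(H_IMG, W_IMG)
```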

Core component 3: reinforcement-trained motor strategy.

The reinforcement-trained weights R drive the robot until the fruit target is perceived right in front of its grippers [Weber, C. and Wermter, S. and Zochios, A. (2003) Robot Docking with Neural Vision and Reinforcement (PS|PDF)]. The state of the robot is fully given by the visually perceived location of the target on the where area and by the robot's angle w.r.t. the table from which it has to grasp the fruit (note that it must approach the table perpendicularly so that it does not hit it with its sides). Both inputs are first expanded into a state space in which one neuron (and its immediate neighbours) represents each state. The weights to the critic allow every state to be evaluated (a state is valued higher the sooner the target can be reached from it). The weights R to the robot motors (forward, backward, left_turn, right_turn) drive the robot towards states with a better value, i.e. closer to the goal (mpeg video).
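A minimal actor-critic sketch of this scheme is shown below, with a place-coded state given by a single index and the four motor actions named as in the text. The state-space size, reward scheme, discount factor and learning rate are assumptions for the example; the cited paper defines the actual training setup.

```python
import numpy as np

ACTIONS = ["forward", "backward", "left_turn", "right_turn"]
N_STATES = 24 * 16 * 8            # e.g. where-position x discretised robot angle (assumed)

rng = np.random.default_rng(0)
critic_w = np.zeros(N_STATES)               # weights to the critic (one value per state)
R = np.zeros((len(ACTIONS), N_STATES))      # weights R to the motor units

def td_actor_critic_step(s, s_next, a, reward, gamma=0.9, lr=0.1):
    """One temporal-difference update for the place-coded state s (an index).

    s, s_next : state indices (one active neuron per state)
    a         : index of the chosen action
    reward    : 1 at the goal (target right in front of the grippers), else 0
    """
    td_error = reward + gamma * critic_w[s_next] - critic_w[s]
    critic_w[s] += lr * td_error            # improve the value estimate of the visited state
    R[a, s] += lr * td_error                # reinforce actions that led to better-valued states
    return td_error

def select_action(s, temperature=1.0):
    """Softmax action selection from the motor weights of the active state neuron."""
    p = np.exp(R[:, s] / temperature)
    p /= p.sum()
    return rng.choice(len(ACTIONS), p=p)
```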

Embedding behaviours into a demo: writing policies with Miro.

The following image shows a primitive Miro policy in which one "action pattern" (boxes in the image) after the other is activated in series. Each action pattern can contain several behaviours (grey). The visually guided docking described above is implemented in the "NNsimBehaviour" inside the "Docking" action pattern. Behaviours can be parameterised and thus reused in different variations. A schematic sketch of the policy's transition sequence is given after the event list below.

The sequence of events is:
SpeechRecog: do transition OffSpeechRecognition if any of the words "GET" or "ORANGE" has been recognised.
GotoTable: move forward until the infrared table sensors sense a table, and then do transition DoDockingTransition.
Docking: now hopefully the orange target is within the camera's field of view. Do the neurally defined behaviour trained as described above. Do transition OffNNsim if the orange is perceived at the front middle (defined on the where area) for 5 consecutive iterations.
Grasping: close gripper, lift gripper, do transition LeaveTableTransition.
LeaveTable: go backward a few centimetres, do transition OffStraightLimit.
Turn: turn 180 degrees, do transition OffStraightLimit.
Forward: go forward half a metre, do transition OffStraightLimit.
SpeechRec2: do transition OffSpeechRecognition if any of the words "OPEN" or "HAND" has been recognised.
Grasping: open gripper (to release the orange), do transition LeaveTableTransition.
Empty: the end.
However, we added another two action patterns. In
SpeechRec3 the robot recognised any of the words "THANK" and "YOU" and then made a transition to
SpeechGeneration, at which it said "YOU ARE WELCOME. THAT WAS EASY FOR A ROBOT LIKE ME".
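For illustration, the sketch below restates this sequence of action patterns and transitions as a plain Python state machine. It is not Miro's actual policy or behaviour engine API; the robot object and its query methods are hypothetical stand-ins for the transition conditions listed above.

```python
# Hypothetical sketch of the policy's transition structure. The `robot` object
# and all of its query methods are illustrative only; transition names from
# the Policy.xml sequence are noted in the comments.
POLICY = [
    ("SpeechRecog", lambda r: r.heard_any("GET", "ORANGE")),         # OffSpeechRecognition
    ("GotoTable",   lambda r: r.table_sensed()),                     # DoDockingTransition
    ("Docking",     lambda r: r.orange_at_front_middle_for(5)),      # OffNNsim
    ("Grasping",    lambda r: r.gripper_closed_and_lifted()),        # LeaveTableTransition
    ("LeaveTable",  lambda r: r.moved_backward_a_few_cm()),          # OffStraightLimit
    ("Turn",        lambda r: r.turned_degrees(180)),                # OffStraightLimit
    ("Forward",     lambda r: r.moved_forward_metres(0.5)),          # OffStraightLimit
    ("SpeechRec2",  lambda r: r.heard_any("OPEN", "HAND")),          # OffSpeechRecognition
    ("Grasping",    lambda r: r.gripper_opened()),                   # LeaveTableTransition
    ("SpeechRec3",  lambda r: r.heard_any("THANK", "YOU")),
    ("SpeechGeneration",
     lambda r: r.said("YOU ARE WELCOME. THAT WAS EASY FOR A ROBOT LIKE ME")),
]

def run_policy(robot):
    """Activate one action pattern after the other, as in the Policy.xml sequence."""
    for name, transition_done in POLICY:
        robot.activate(name)               # start the behaviours of this action pattern
        while not transition_done(robot):  # poll until the transition condition fires
            robot.step()
```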

Running the demo and implementation.

Four xterms (GIF) are opened to run the demo. Three of them start services on the robot: the video service for the camera, the speech service for speech recognition (Sphinx) and production (Festival), and finally the robotBase service for everything else: motors, range and table sensors, gripper, etc. The fourth xterm starts the BehaviourEngine, which loads the Policy.xml file. All behaviours are compiled into this program.
The directory structure (GIF) contains the following: the Policy.xml file, the Engine and Factory directories, which hold the executable BehaviourEngine file and a BehavioursFactory that collects the implemented behaviours, one directory for every behaviour, and a directory "SphinxTGplusMB" for the SphinxSpeech speech service. The other services (video and robotBase) reside in the Miro directory.

The demo with audience.

The demo won us (Cornelius Weber, Mark Elshaw, Alex Zochios, Chris Rowan, all members of the HIS centre led by Stefan Wermter) the MI-prize competition at AI-2003 in Cambridge. See a JPEG (or more) from the stage, or a post-prize-winning video (7.3 MB).

Acknowledgements

This work was made possible through the MirrorBot project.