By Martin Wojtczyk and Devy Tan-Wojtczyk
1. INTRODUCTION
This article gives a brief overview of Rover, then focuses on our implementation of the human-robot interface utilizing the Intel® Perceptual Computing SDK for gesture and face detection. For a short introduction to Rover’s features, see the Intel® Developer Zone video from Game Developers Conference 2014 in San Francisco:
Figure 1: Intel® Developer Zone interview with Rover at Game Developers Conference 2014.
Until recently, robots have either been relegated behind the closed doors of large industrial manufacturing plants or demonized in movies such as Terminator, where they were depicted as destroyers of the human race. Both stereotypes contribute to an unfounded fear of self-operating machines losing control and harming the living. But now vacuum-cleaning and lawn-mowing robots, among others, are beginning a new trend: service robots as dedicated helpers in environments shared with humans. The miniaturization and cost-effective production of range and localization sensors on the one hand, and the ever-increasing compute power of modern processors on the other, enable the creation of smart, sensing robots for domestic use cases.
In the future, robots will require intelligent interactions with
their environment, including adapting to human emotions.
State-of-the-art hardware and software, such as the Intel Perceptual
Computing SDK paired with the Creative* Interactive Gesture Camera, are
paving the way for smarter, connected devices, toys, and domestic
helpers [1, 2].
2. CUBOTIX ROVER
When Intel announced the Perceptual Computing Challenge in 2013, our
team, Devy and Martin Wojtczyk, brainstormed possible use cases
utilizing the Intel Perceptual Computing SDK. The combination of a
USB-powered camera with an integrated depth sensor and an SDK that
enables gesture recognition, face detection, and voice interaction
led us to build an autonomous, mobile, gesture-controlled, sensing robot called Rover. We were very excited to be selected for an award [3]. Since then, we have launched the website http://www. with updates on Rover and are in the process of building an open hardware community.
The Cubotix Rover is our attempt to use advanced robotic algorithms
to transform off-the-shelf hardware into a smart home robot, capable of
learning and understanding unknown environments without prior
programming. Instead of unintuitive control panels, the robot is instructed through gestures, natural language, and even facial expressions. Advanced robotic algorithms make Rover location-aware and enable it to plan collision-free paths.
2.1. Gesture Recognition

Figure 2: Showing a thumbs-up gesture makes Rover happy and mobilizes the robot. Photo courtesy California Academy of Sciences.
Hand gestures are a common form of communication among humans. Think of the police officer in the middle of a loud intersection in Times Square signaling stop with an open palm facing approaching traffic. Rover is equipped to recognize, respond to, and act on hand gestures captured through the 3D camera. You can mobilize the robot by gesturing thumbs-up, and in response it will also say “Let’s go!” The robot frowns when you gesture a thumbs-down. Gesturing a high-five prompts Rover to crack a joke, such as “If I had arms, I would totally high-five you.” Gesturing a peace sign prompts Rover to say “Peace.” These hand gestures and the resulting vocal responses are completely customizable and programmable.

Figure 3: Showing a thumbs-down gesture stops the robot and makes it sad. Photo courtesy California Academy of Sciences.
2.2. Facial Recognition
Facial expression is perhaps the most revealing and honest of all means of communication. Recognizing these expressions and responding to them, appropriately or not, can mean the difference between forming a bond with another human being and creating a division. With artificial intelligence, the gap separating machines and humans can begin to close if robots are able to empathize. By capturing facial expressions through the camera, Rover can detect smiles or frowns and respond appropriately. Rover knows when a human has come near it through its face detection algorithms and can greet them by saying “Hello, my name is Rover. What’s your name?”, to which most people have responded just as they would to another human being by saying “Hello, I’m ________”. After initiating the conversation, Rover utilizes the Perceptual Computing SDK’s face analysis features to distinguish three possible states of the person in front of the camera: happy, sad, or neutral, and can respond with an appropriate empathetic expression: “Why are you sad today?” or “Glad to see you happy today!” Moreover, the SDK’s face recognition allows Rover to learn and distinguish between individuals for a personalized experience.
3. HARDWARE ARCHITECTURE

Figure 4: Rover's
mobile LEGO* platform. Centrally located with glowing green buttons is
the LEGO Mindstorms* EV3 microcontroller, which is connected to the
servos that move the base. Also note the support structures and the
locking mechanism to mount a laptop.
Rover uses widely accessible and affordable off-the-shelf hardware
that many people may already own and can transform into a smart home
robot. It consists of a mobile LEGO platform that carries a depth-camera
and a laptop for perception, image processing, path-planning, and
human-robot interaction. The LEGO Mindstorms* EV3 set is a great tool for rapid prototyping of customized robot models. It includes a
microcontroller, sensors, and three servos with encoders, which allow
for easy calculation of travelled distances.
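For example, the distance travelled by a wheel can be derived directly from its encoder reading. The minimal sketch below assumes the EV3’s one-degree encoder resolution and uses a placeholder wheel diameter; the function and the constant are illustrative, not part of Rover’s actual code.

// Estimate the distance travelled by one wheel from its servo encoder.
// The EV3 encoder reports rotation in degrees; the wheel diameter below is
// a placeholder that must be measured on the actual robot.
double travelledDistance(int encoderDegrees, double wheelDiameterMeters = 0.056)
{
    const double pi = 3.14159265358979323846;
    const double circumference = pi * wheelDiameterMeters;
    return (encoderDegrees / 360.0) * circumference;
}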

Figure 5: Rover's
mobile platform with an attached Creative* Interactive Gesture Camera
for gesture recognition, face detection, and 3D perception.
The Creative Interactive Gesture Camera attached to the EV3 contains a QVGA depth sensor and an HD RGB image sensor. The 0.5 ft to 3.5 ft operating range of the depth sensor allows for 3D perception of objects and obstacles at close range. It is powered solely by the USB port and
doesn’t require an additional power supply, which makes it a good fit
for mobile use on a robot. Rover’s laptop, an Ultrabook™ with an Intel® Core i7 processor and a touch screen, is mounted on top of the mobile LEGO platform and interfaces with the camera and the LEGO microcontroller.
The laptop is powerful enough to perform face detection and gesture and
speech recognition and to evaluate the depth images in soft real time to
steer the robot and avoid obstacles. All depth images and encoder data
from the servos are filtered and combined into a map, which serves the
robot for indoor localization and collision-free path planning.

Figure 6: Complete
Rover assembly with the mobile LEGO* platform base, the Creative*
Interactive Gesture Camera in the front and the laptop attached and
locked in place.
4. SOFTWARE ARCHITECTURE

Figure 7: Rover's
software architecture with most components for perception, a couple of
planners, and a few application use cases. All of these building blocks
run simultaneously in multiple threads and communicate with each other
via messages. The green-tinted components utilize the Intel® Perceptual Computing SDK. All other modules are custom-built.
Rover’s control software is a multi-threaded application integrating a
graphical user interface implemented in the cross-platform application
framework Qt, a perception layer utilizing the Intel Perceptual
Computing SDK, and custom-built planning, sensing, and hardware
interface components. CMake*, a popular open-source build system, is
used to find all necessary dependencies, configure the project, and
create a Visual Studio* solution on Windows* [4, 5].
The application runs on an Ultrabook laptop with the Windows operating system, mounted directly on the mobile LEGO platform.
As shown in Figure 7, the application layer has three different use
case components: the visible and audible Human-Robot Interface, an
Exploration use case that lets Rover explore a new and unknown
environment, and a smartphone remote control of the robot. The planning
layer includes a collision-free path planner based on a learned map and a
task planner that decides when the robot should move, explore, or interact with the user. A larger number of components form the perception layer,
which is common for service robots as they have to sense their often
unknown environments and respond safely to unexpected changes.
Simultaneous Localization and Mapping (SLAM) and Obstacle Detection are
custom-built and based on the depth images from the Perceptual Computing
SDK, which also provides the functionality for gesture recognition,
face detection, and speech recognition.
The following sections briefly cover the Human-Robot Interface and
describe in more detail the implementation of gesture recognition and
face detection for the robot.
4.1. User Interface
The human-robot interface of Rover is implemented as a Qt5 application [6].
Qt includes tools for window and widget creation and commonly used
features, such as threads and futures for concurrent computations. The
main window depicts a stylized face consisting of two buttons for the robot’s eyes and mouth. Depending on the robot’s mood, the mouth forms a smile or a frown. When nobody interacts with the robot, it goes to sleep. When it detects a person in front of it, it wakes up and responds to gestures, which trigger actions. The robot’s main program launches several threads for the different Intel Perceptual Computing detection features. It utilizes Qt’s central signal/slot mechanism for communication between objects and threads [7]. Qt’s future classes are used for asynchronous speech output whenever the robot speaks [8].
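As an illustration, such asynchronous speech output can be built on QtConcurrent. The sketch below shows a helper like the SpeakAsync method used later in the control code; the blocking speak() routine it wraps is a placeholder assumption, not Rover’s actual implementation.

#include <QtConcurrent/QtConcurrent>
#include <QFuture>
#include <string>

// Hypothetical blocking text-to-speech routine; on Rover this would call
// into the Intel Perceptual Computing SDK's speech synthesis module.
static void speak(const std::wstring &sentence)
{
    // synthesize and play back the sentence here
}

// Run the blocking call in Qt's global thread pool so the GUI and the
// perception pipelines keep running while the robot talks.
QFuture<void> SpeakAsync(const std::wstring &sentence)
{
    return QtConcurrent::run(speak, sentence);
}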
4.2. Perception
The robot’s perception relies on the camera featuring a color and a
depth sensor. The camera is interfaced through the SDK, which enables
applications to easily integrate gesture, face, and speech recognition, as well as speech synthesis.
4.2.1. Gesture Recognition
Simple, easy-to-learn hand gestures, recognized using the SDK, trigger most of Rover’s actions. When a person shows a thumbs-up
gesture, the robot will look happy, say “Let’s go!” and can start
autonomous driving or another configured action. When the robot is shown
a thumbs-down gesture, it will put on a sad face, vocalize its
unhappiness, and stop mobile activities in its default configuration.
When shown a high-five, the robot will crack a joke. Rover responds
to all of the SDK’s default gestures, but here we will just focus on
these three: thumbs-up, thumbs-down, and high-five.
Rover’s gesture recognition is implemented in a class GesturePipeline, which runs in a separate thread and derives from the class UtilPipeline, part of the SDK’s pxcutils convenience library, and from QObject of the Qt framework. GesturePipeline implements the two virtual UtilPipeline functions OnGesture() and OnNewFrame() and emits a signal for each recognized gesture. The class also implements the two slots work() and cleanup(), which are required to move the pipeline into its own QThread. The resulting declaration of GesturePipeline is simple and similar to the provided gesture sample [9, 10]:
#ifndef GESTUREPIPELINE_H
#define GESTUREPIPELINE_H
#include <QObject>
#include "util_pipeline.h"

class GesturePipeline : public QObject, public UtilPipeline
{
    Q_OBJECT
public:
    GesturePipeline();
    virtual ~GesturePipeline();
    virtual void PXCAPI OnGesture(PXCGesture::Gesture *data);
    virtual bool OnNewFrame();
protected:
    PXCGesture::Gesture m_gdata;
signals:
    void gesturePoseThumbUp();
    void gesturePoseThumbDown();
    void gesturePoseBig5();
public slots:
    void work();
    void cleanup();
};
#endif /* GESTUREPIPELINE_H */
Listing: GesturePipeline.h
Besides the empty default constructor and destructor, the implementation in GesturePipeline.cpp is limited to the four methods mentioned above.
The method work() is executed when the pipeline thread is started as a
QThread object. It enables gesture processing from within UtilPipeline
and runs its LoopFrames() method to process the camera’s images and
recognize gestures in subsequent image frames. The implementation of
work() is as follows:
void GesturePipeline::work()
{
    EnableGesture(); // enable gesture processing in UtilPipeline
    if (!LoopFrames()) wprintf_s(L"Failed to initialize or stream data\n");
}
Listing: GesturePipeline.cpp – work()
The method cleanup() is called when the GesturePipeline thread is
terminated. In this case it does nothing and is implemented as an empty
function.
Once started via LoopFrames(), UtilPipeline calls OnNewFrame() for
every acquired image frame. To continue processing and recognizing
gestures, this function returns true on every call.
bool GesturePipeline::OnNewFrame()
{
    return true; // keep LoopFrames() processing subsequent frames
}
Listing: GesturePipeline.cpp – OnNewFrame()
OnGesture() is called from UtilPipeline when a gesture is recognized.
It queries the data parameter for activated gesture labels and emits an
appropriate Qt signal.
void PXCAPI GesturePipeline::OnGesture(PXCGesture::Gesture *data)
{
    if (!data->active) return;
    switch (data->label)
    {
    case PXCGesture::Gesture::LABEL_POSE_THUMB_UP:
        emit gesturePoseThumbUp();
        break;
    case PXCGesture::Gesture::LABEL_POSE_THUMB_DOWN:
        emit gesturePoseThumbDown();
        break;
    case PXCGesture::Gesture::LABEL_POSE_BIG5:
        emit gesturePoseBig5();
        break;
    default:
        break;
    }
}
Listing: GesturePipeline.cpp – OnGesture()
The emitted Qt signals would have little effect if they weren’t
connected to appropriate slots of the application’s main control thread
MainWindowCtrl. That class therefore declares a slot for each signal and implements the corresponding robot activity in it.
class MainWindowCtrl : public QObject
{
    Q_OBJECT
public slots:
    void gesturePoseThumbUp();
    void gesturePoseThumbDown();
    void gesturePoseBig5();
    // ...
};
Listing: MainWindowCtrl.h snippet declaration of gesture slots.
The implementation of the actions triggered by the abovementioned
gestures is fairly simple. The robot’s state variable is switched to
RUNNING or STOPPED, and the robot’s mood is switched between HAPPY and
SAD. Voice feedback is assigned accordingly and spoken asynchronously
via SpeakAsync, a method utilizing the QFuture class of the Qt framework
for asynchronous computation.
void MainWindowCtrl::gesturePoseThumbUp()
{
    std::wstring sentence(L"Let's go!");
    // ... set the state to RUNNING, the mood to HAPPY, and call SpeakAsync(sentence)
}

void MainWindowCtrl::gesturePoseThumbDown()
{
    std::wstring sentence(L"Aww");
    // ... set the state to STOPPED, the mood to SAD, and call SpeakAsync(sentence)
}

void MainWindowCtrl::gesturePoseBig5()
{
    std::wstring sentence(L"I would totally high five you, if I had arms.");
    // ... speak the sentence asynchronously via SpeakAsync(sentence)
}
Listing: MainWindowCtrl.cpp – gesture slot implementation.
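For reference, a filled-in version of the thumbs-up slot could look like the following sketch; the state and mood handling shown here (the enum values and the setMood() helper) are illustrative assumptions based on the description above, not Rover’s original source.

void MainWindowCtrl::gesturePoseThumbUp()
{
    std::wstring sentence(L"Let's go!");
    state = RUNNING;      // mobilize the robot
    setMood(HAPPY);       // show a smile in the GUI
    SpeakAsync(sentence); // voice feedback without blocking the GUI thread
}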
The only missing piece between the signals of GesturePipeline and the
slots of MainWindowCtrl is the setup procedure implemented in a
QApplication object, which creates the GesturePipeline thread and the
MainWindowCtrl object and connects the signals to the slots. The
following listing shows how to create a QThread object, move the
GesturePipeline to that thread, connect the thread’s start/stop signals
to the pipeline’s work()/cleanup() methods and the gesture signals to
the appropriate slots of the main thread.
gesturePipeline = new GesturePipeline;
gesturePipelineThread = new QThread(this);

connect(gesturePipelineThread, SIGNAL(started()),
        gesturePipeline, SLOT(work()));
connect(gesturePipelineThread, SIGNAL(finished()),
        gesturePipeline, SLOT(cleanup()));
gesturePipeline->moveToThread(gesturePipelineThread);

gesturePipelineThread->start();

connect(gesturePipeline, SIGNAL(gesturePoseThumbUp()),
        mainWindowCtrl, SLOT(gesturePoseThumbUp()));
connect(gesturePipeline, SIGNAL(gesturePoseThumbDown()),
        mainWindowCtrl, SLOT(gesturePoseThumbDown()));
connect(gesturePipeline, SIGNAL(gesturePoseBig5()),
        mainWindowCtrl, SLOT(gesturePoseBig5()));
Listing: Application.cpp – gesture setup
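As a side note, Qt 5 also offers a pointer-to-member connect syntax that is checked at compile time; the string-based SIGNAL()/SLOT() macros above could equivalently be written as, for example:

connect(gesturePipeline, &GesturePipeline::gesturePoseThumbUp,
        mainWindowCtrl, &MainWindowCtrl::gesturePoseThumbUp);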
4.2.2. Face Detection
When Rover stands still and nobody interacts with it, it closes its
eyes and goes to sleep. However, when a person shows up in front of the
robot, Rover will wake up and greet them. This functionality is realized
using the SDK’s face detector.
Face detection is implemented in a class FacePipeline that is structured very similarly to GesturePipeline and is based on the Face Detection sample in the SDK’s documentation [11].
It runs in a separate thread and is derived from the classes UtilPipeline and QObject. FacePipeline implements the virtual UtilPipeline function OnNewFrame() and emits one signal when at least one face is detected in the frame and another when no face is detected. It also implements the two slots work() and cleanup(), which are required to move the pipeline into its own QThread. Following is the declaration of FacePipeline:
05 | #include "util_pipeline.h" |
07 | class FacePipeline : public QObject, public UtilPipeline |
13 | virtual ~FacePipeline(); |
15 | virtual bool OnNewFrame(); |
19 | void noFaceDetected(); |
26 | #endif /* FACEPIPELINE_H */ |
Listing: FacePipeline.h
The constructor, destructor, and cleanup() methods are empty. The method work() calls LoopFrames() to start UtilPipeline’s processing loop.
void FacePipeline::work()
{
    if (!LoopFrames()) wprintf_s(L"Failed to initialize or stream data\n");
}
Listing: FacePipeline.cpp – work()
The method OnNewFrame is called by UtilPipeline for every acquired
frame. It queries the face analyzer module of the Intel Perceptual
Computing SDK, counts the number of detected faces, and emits the
appropriate signals.
bool FacePipeline::OnNewFrame()
{
    PXCFaceAnalysis* faceAnalyzer = QueryFace();
    int faces = 0;
    for (int fidx = 0; ; fidx++)
    {
        pxcUID fid = 0;
        pxcU64 timeStamp = 0;
        pxcStatus sts = faceAnalyzer->QueryFace(fidx, &fid, &timeStamp);
        if (sts < PXC_STATUS_NO_ERROR) break; // no more faces in this frame
        faces++;
    }
    if (faces > 0)
        emit faceDetected();
    else
        emit noFaceDetected();
    return true; // keep LoopFrames() running
}
Listing: FacePipeline.cpp – OnNewFrame()
Respective slots for the face detector are declared in the application’s main control thread:
class MainWindowCtrl : public QObject
{
    // ...
public slots:
    void faceDetected();
    void noFaceDetected();
};
Listing: MainWindowCtrl.h – declaration of face detector slots.
The implementations of the face detector slots update the robot’s sleep/awake state, its mood, and its program state. When no face is detected, a timer is started that puts the robot to sleep unless the robot is carrying out a task. This keeps the methods simple.
void MainWindowCtrl::faceDetected()
{
    // ... wake the robot up, update its mood, and greet the person
    state = FACE_DETECTED;
}

void MainWindowCtrl::noFaceDetected()
{
    if ((state != START) && (state != RUNNING))
    {
        startOrContinueAwakeTimeout(); // fall asleep after a timeout
    }
}
Listing: MainWindowCtrl.cpp – face detector slot implementation.
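The awake timeout described above can be realized with a single-shot QTimer. The following sketch assumes a member QTimer* awakeTimer, a goToSleep() slot, and a 30-second timeout, all of which are illustrative assumptions rather than Rover’s actual code.

void MainWindowCtrl::startOrContinueAwakeTimeout()
{
    if (!awakeTimer) {
        awakeTimer = new QTimer(this);
        awakeTimer->setSingleShot(true);
        connect(awakeTimer, SIGNAL(timeout()), this, SLOT(goToSleep()));
    }
    awakeTimer->start(30000); // restart the countdown on every call
}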
Similar to the gesture recognizer, the main application creates the
FacePipeline object, moves it into a Qt thread to run concurrently, and
connects the face detector signals to the appropriate slots of the main
control thread.
facePipeline = new FacePipeline;
facePipelineThread = new QThread(this);

connect(facePipelineThread, SIGNAL(started()),
        facePipeline, SLOT(work()));
connect(facePipelineThread, SIGNAL(finished()),
        facePipeline, SLOT(cleanup()));
facePipeline->moveToThread(facePipelineThread);

facePipelineThread->start();

connect(facePipeline, SIGNAL(faceDetected()),
        mainWindowCtrl, SLOT(faceDetected()));
connect(facePipeline, SIGNAL(noFaceDetected()),
        mainWindowCtrl, SLOT(noFaceDetected()));
Listing: Application.cpp – face detector setup.
5. RESULTS
Based on our observations at recent exhibitions in the U.S. and Europe, including Mobile World Congress, Maker Faire, CeBIT, the California Academy of Sciences, Robot Block Party, and the Game Developers Conference, people are ready and excited to try interacting with a robot. Google’s official plunge into the world of artificial intelligence and robotics has inspired the general public to look deeper and pay attention to the future of robotics.

Figure 8: Rover at Mobile World Congress surrounded by a group of people.
Fear and apprehension have been replaced by curiosity and enthusiasm. Controlling a machine has predominantly been done through dedicated hardware, unintuitive control panels, and workstations. That boundary is dissolving now that humans can communicate with machines through natural, instinctual interactions, thanks to advances in localization and mapping and in gesture and facial recognition. Visitors are astounded when they see they can control an autonomous mobile robot through hand gestures and facial expressions utilizing the Ultrabook, Intel Perceptual Computing SDK, and Creative Interactive Gesture Camera. We have encountered these responses across a very wide spectrum of people: young and old, men and women, domestic and international.
6. OUTLOOK
Unlike many consumer robots on the market today, Rover is capable of
mapping out its environment without any external hardware like a remote
control. It can independently locate specific rooms in a home, such as the kitchen, bathroom, and bedroom. If you’re at the office and need to check up on a sick child at home, you can simply command Rover to go to a specific room in your house without manually navigating it. Like a human, the robot has short- and long-term memory. Its long-term memory is stored in the form of a map that allows it to move independently: it can recognize and therefore maneuver around furniture, corners, and other architectural boundaries. Its short-term memory lets it recognize an object that unexpectedly darts in front of it, prompting it to stop until the 3D camera no longer detects any obstacles in its path. We look forward to sharing further details about robot localization, mapping, and path planning in future articles.
We see vast potential for widespread use and adoption of Perceptual Computing technology. Professions and industries that embody the “human touch,” from healthcare to hospitality, may reap the most benefit. Fundamentally, as human beings we all seek to understand and be understood, and the best technologies are those that make life easier, more efficient, or better in some impactful way. Simultaneous localization and mapping, working together with gesture and facial recognition, blurs the lines between humanity and machines, bringing us closer to the robots that can inhabit our realities and imaginations.
7. ABOUT THE AUTHORS

Figure 9: Devy and Martin Wojtczyk with Rover.
Devy Tan-Wojtczyk is co-founder of Cubotix. She brings over 10 years of business consulting experience with clients including UCLA, GE, Vodafone, Blue Cross of California, Roche, Cooking.com, and the New York City Department for the Aging. She holds a BA in International Development Studies from UCLA and an MSW with a focus on Aging from Columbia University. For fun one weekend, she led a newly formed cross-functional team consisting of an idea generator, two developers, and a designer through the business and marketing efforts at the 48-hour HP Intel Social Good Hackathon, which resulted in a cash award in recognition of technology, innovation, and social impact. Devy was also competitively selected to attend Y Combinator’s first ever Female Founders Conference.
Martin Wojtczyk is an award-winning software engineer and technology enthusiast. With his wife Devy he founded Cubotix http://www., a DIY community creating smart and affordable service robots for everybody. He graduated in computer science and earned his PhD (Dr. rer. nat.) in robotics from the Technical University of Munich (TUM) in Germany after years of research in the R&D department of Bayer HealthCare in Berkeley. Speaking engagements include Google DevFest West, Mobile World Congress, Maker Faire, and many others in the international software engineering and robotics community. Over the past 10 years he has developed the full software stack for several industrial autonomous mobile service robots. He has won multiple awards in global programming competitions, was recently featured on Makezine.com, and was recognized as an Intel Software Innovator.
8. RELATED CONTENT
[1] Intel Perceptual Computing SDK: https://software.intel.com/en-us/vcsource/tools/perceptual-computing-sdk/home
[2] Creative Interactive Gesture Camera Kit: http://click.intel.com/creative-interactive-gesture-camera-developer-kit.html
[3] Intel Perceptual Computing Showcase – Rover – A LEGO Self-Driving Car: https://software.intel.com/sites/campaigns/perceptualshowcase/lego-self-driving-car.htm
[4] CMake – http://
[5] Martin Wojtczyk and Alois Knoll. A
cross platform development workflow for C/C++ applications. In Herwig
Mannaert, Tadashi Ohta, Cosmin Dini, and Robert Pellerin, editors,
Software Engineering Advances, 2008. ICSEA ’08. The Third International
Conference, 224-9, Sliema, Malta, October 2008. IEEE Computer Society.
[6] Qt Project: http://
[7] Qt Project Documentation – Signals & Slots: http:///doc/qt-5/signalsandslots.html
[8] Qt Project Documentation – QFuture Class: http:///doc/qt-5/qfuture.html
[9] Intel Perceptual Computing SDK Documentation – UtilPipeline: https://software.intel.com/sites/landingpage/perceptual_computing/documentation/html/index.html?utilpipeline.html
[10] Intel Perceptual Computing SDK Documentation – Add Gesture Control: https://software.intel.com/sites/landingpage/perceptual_computing/documentation/html/index.html?tuthavok_add_gesture_control.html
[11] Intel Perceptual Computing SDK Documentation – Code Walkthrough Of Face Detection Sample: https://software.intel.com/sites/landingpage/perceptual_computing/documentation/html/index.html?tutface_code_explanation.html
Intel® RealSense™ Technology
First announced at CES 2014, Intel® RealSense™ technology is the new name and brand for what was Intel® Perceptual Computing technology, the intuitive user interface SDK with functions like speech recognition, gesture, hand and finger tracking, and facial recognition that Intel introduced in 2013. Intel RealSense Technology gives
developers additional features including scanning, modifying, printing,
and sharing in 3D plus major advances in augmented reality interfaces.
With these new features, users can naturally manipulate scanned 3D
objects using advanced hand- and finger-sensing technology.