Our goal for this project was to use the Sawyer collaborative robot to assist a human with eating.

This is exciting to us not only because of our interest in the involved technical challenges, but also because these challenges are directly applicable to real world problems. In the assistive technologies space, we believe that robots have a lot of potential in giving individuals such as quadriplegics the autonomy to perform everyday activities such as eating on their own. In the future, we hope robotic systems that can grasp and manipulate items like food will play a large part in allowing everybody – regardless of disability, background, or socioeconomic status – to live independent and meaningful lives.

As a proof-of-concept, we simplified this challenge into the task of feeding somebody a marshmallow. Several questions needed to be answered to make this happen.


We split our task of feeding an individual a marshmallow into a handful of distinct steps:

  1. Receive a “feed” command from the user
  2. Detect and localize a single marshmallow that’s within our robot arm’s workspace
  3. Detect and find the position of the human’s mouth
  4. Move the end effector to the position of the chosen marshmallow
  5. Grip the marshmallow
  6. Move the marshmallow to the user’s mouth

To complete all of this successfully, our assistive feeder had to be efficient, accurate, and interactively simple. We wanted to reliably detect and pick up marshmallows, consistently bring them to within inches of the user’s mouth position, and implement a straightforward control interface.

We used our personal Kinect2 instead of the provided Kinect1 because of its higher resolution, which was integral to achieving our desired accuracy. The Sawyer arm is also more precise than the Baxter arms. Finally, we used Sawyer's wrist camera rather than its head camera to help calibrate our system, since the head camera suffered from perspective skew.

We also decided to eliminate the use of AR tags during the actual feeding operation, since they would be unrealistic in a real-world deployment. This required some sort of RGBD camera, like the Kinect2, which supplies both a point cloud and a color image, letting us detect the marshmallows and the user's face directly.

Since the Kinect was our main means of sensing, we had to figure out how to transform its feed into the robot's base frame, so that Sawyer could know where everything was in relation to itself. We spent some time thinking through the best orientation for our human subject and the placement of both the Kinect and Sawyer.

Another design choice concerned how to interface with our system. The easiest approach would have been a pure command-line interface, but this felt too clunky for an assistive device. We instead considered a web interface and a voice-controlled interface, and chose the web interface as the most robust and effective option.

These are exactly the kinds of decisions made in industry applications: which hardware to use, where to place it, and how the human and robot should interact are all choices faced in real engineering projects. Our design choices resulted in a robust and efficient system.


Kinect Extrinsic Calibration

Before we can use the Kinect to accurately sense objects and people in the robot’s environment, we first need to identify where the sensor is located relative to the robot.

This calibration procedure not only needs to be accurate, but it also must be fast enough to do each time we set up our demo.

The procedure uses a tweaked version of the ar_track_alvar ROS package, which lets our Sawyer's gripper camera and our Kinect's depth camera each track a shared set of randomly placed AR tags.

calibration photo

Then, we try to find the optimal world => Kinect transformation, which minimizes the squared positional error between the two broadcasted tfs associated with each tag. If we define a matrix X whose rows are the positions of the tags in the Kinect frame and a matrix Y whose rows are the positions of these tags in the world frame, we can solve for the position and orientation of the Kinect as follows:

$$R^*, t^* = \arg\min_{R,\,t} \sum_i \lVert R x_i + t - y_i \rVert^2$$

Solving this optimization by setting its derivative to 0 gives us a solution built off the singular vectors of the covariance matrix: with centered matrices $\tilde{X}$ and $\tilde{Y}$ (each row minus its centroid), we take the SVD $\tilde{X}^T \tilde{Y} = U \Sigma V^T$ and recover

$$R^* = V U^T, \qquad t^* = \bar{y} - R^* \bar{x}$$

$R^*$ is the ideal orthonormal transformation matrix with a magnitude 1 determinant, but if our input data is extraordinarily bad this can theoretically also be a reflection and not a pure rotation. This can be rectified by checking the sign of the determinant and, when it's negative, flipping the sign of the last column of $V$ before forming $R^*$.
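As a concrete sketch, this least-squares alignment can be solved in a few lines of NumPy (function and variable names here are illustrative, not the ones in our calibration node):

```python
import numpy as np

def solve_kinect_extrinsics(X, Y):
    """Least-squares rigid transform mapping rows of X (tag positions in
    the Kinect frame) onto rows of Y (the same tags in the world frame),
    via SVD of the covariance matrix."""
    x_bar = X.mean(axis=0)
    y_bar = Y.mean(axis=0)
    H = (X - x_bar).T @ (Y - y_bar)          # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections: force det(R) = +1
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = y_bar - R @ x_bar
    return R, t
```

Given three or more non-collinear tags, `R` and `t` together give the pose of the Kinect relative to the robot's world frame.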

Marshmallow Pose Estimation

Once we have our depth camera calibrated, we can use it to search our robot's environment for marshmallows. For our project, we assume that these marshmallows are spread out on a table in front of the robot.

marshmallow tfs

Our marshmallow localization node parses the point cloud output by the Kinect, and applies a naive search algorithm to estimate the poses of the marshmallows on the table:

marshmallow flow chart
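The flow chart above describes the actual node; as one illustrative sketch of what such a naive search can look like, the following filters the cloud to points just above the table and greedily clusters them (the thresholds and function names here are invented for the example):

```python
import numpy as np

# Hypothetical workspace parameters, in the robot's base frame.
TABLE_Z = 0.0          # table surface height (m)
MARSH_HEIGHT = 0.03    # approximate marshmallow height (m)
CLUSTER_RADIUS = 0.04  # points this close in xy belong to one marshmallow (m)

def find_marshmallows(points):
    """Naive marshmallow search over an Nx3 point cloud (already
    transformed into the robot's base frame by the calibration step).
    Returns one estimated (x, y, z) centroid per detected marshmallow."""
    # 1. Keep only points slightly above the table surface.
    mask = (points[:, 2] > TABLE_Z + 0.005) & \
           (points[:, 2] < TABLE_Z + MARSH_HEIGHT + 0.02)
    candidates = points[mask]

    # 2. Greedy clustering: take an unvisited seed point and collect
    #    every remaining point within CLUSTER_RADIUS of it in xy.
    centroids = []
    unused = list(range(len(candidates)))
    while unused:
        seed = candidates[unused.pop(0)]
        close = [i for i in unused
                 if np.linalg.norm(candidates[i][:2] - seed[:2]) < CLUSTER_RADIUS]
        cluster = np.vstack([seed] + [candidates[i] for i in close])
        unused = [i for i in unused if i not in close]
        centroids.append(cluster.mean(axis=0))
    return centroids
```

Each centroid is then broadcast as a TF in the robot's base frame for the planning node to consume.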

Face Tracking

We simplify mouth detection to face detection, with the position of the mouth estimated to be toward the bottom center of the face's bounding rectangle. OpenCV provides a simple frontal face detection algorithm that uses Haar feature-based cascade classifiers to recognize potential faces in an image. The limitation is that the user's face must look very "face-like": pointed straight at the camera and not tilted in any direction. Since this is not ideal, we add a second algorithm, template matching, to keep tracking the face even when it isn't oriented properly. Whenever OpenCV's Haar cascade algorithm can no longer detect a face, we switch to the template matching phase.

We remember the previous face pattern, so that when we enter template matching, we search the video frame for that pattern and return the closest match as the face.
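The fallback logic can be sketched as follows. In the real node the detector is OpenCV's Haar cascade (`detectMultiScale`) and the matcher is `cv2.matchTemplate`; here a small NumPy sum-of-squared-differences matcher stands in so the sketch is self-contained:

```python
import numpy as np

def match_template_ssd(frame, template):
    """Slide the remembered face patch over a grayscale frame and return
    the (x, y) top-left corner with the lowest sum of squared differences
    (a stand-in for OpenCV's cv2.matchTemplate)."""
    fh, fw = frame.shape
    th, tw = template.shape
    best, best_pos = None, (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            ssd = np.sum((frame[y:y + th, x:x + tw] - template) ** 2)
            if best is None or ssd < best:
                best, best_pos = ssd, (x, y)
    return best_pos

def track_face(frame, detect_face, last_template):
    """Use the cascade detector when it fires; otherwise fall back to
    template matching against the last detected face patch."""
    box = detect_face(frame)  # e.g. a wrapper around detectMultiScale
    if box is not None:
        return box
    th, tw = last_template.shape
    x, y = match_template_ssd(frame, last_template)
    return (x, y, tw, th)
```

When the cascade starts detecting again, the remembered template is refreshed from the new detection.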

With the combination of these two algorithms, we're able to track the subject's face with much more precision and consistency. Once we know the 2D coordinates of the mouth in the RGB camera's frame, we can find the corresponding 3D coordinate in the Kinect point cloud, which is then broadcast as a transform to the planning node!

This image shows RViz on the left, where you can see the "face" transform indicating the 3D position of the mouth; on the right is the face detected from the Kinect's RGB camera, with a box and circle drawn around the user's face and mouth:

mouth tracking

Path Planning

In order to perform path planning so that the gripper moves to an appropriate position relative to the marshmallow or mouth, we broadcast TFs offset by a certain distance above the marshmallow or away from the mouth, all relative to the Sawyer base. These values were determined by measuring the gripper finger lengths and choosing offsets that worked well in testing. The TFs that we broadcasted were marshmallow_waypoint_goal (10 cm above the marshmallow), marshmallow_final_goal (the appropriate TF to put the marshmallow between the grippers), and face_gripper_goal (a comfortable offset from the mouth TF).
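Reduced to its essentials, each goal TF is just a fixed offset from a detected pose. A minimal sketch (the offset values below are illustrative, not the exact ones we measured):

```python
import numpy as np

# Offsets tuned by measuring the gripper fingers and testing; the
# numbers here are placeholders except for the 10 cm waypoint height.
WAYPOINT_OFFSET = np.array([0.0, 0.0, 0.10])   # 10 cm above the marshmallow
GRIP_OFFSET     = np.array([0.0, 0.0, 0.02])   # puts it between the fingers
MOUTH_OFFSET    = np.array([-0.08, 0.0, 0.0])  # comfortable standoff from mouth

def gripper_goals(marshmallow_pos, mouth_pos):
    """Compute the three goal positions we broadcast as TFs, all
    expressed relative to the Sawyer base frame."""
    return {
        "marshmallow_waypoint_goal": marshmallow_pos + WAYPOINT_OFFSET,
        "marshmallow_final_goal": marshmallow_pos + GRIP_OFFSET,
        "face_gripper_goal": mouth_pos + MOUTH_OFFSET,
    }
```

In the real node these positions are stamped with the base frame and rebroadcast continuously so the planner always sees fresh goals.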

In the ROS node that performs planning and movement, we listen for the aforementioned custom TFs to determine where the end effector should be moved. We divide the movement into three phases, each with its own path planning constraints: moving to the waypoint above the marshmallow, descending to the gripping position, and moving to the offset in front of the mouth.

These actions were abstracted in our web user interface into a "marshmallow" button, which moves to the waypoint and then to the marshmallow gripping position, and a "mouth" button, which moves to the mouth position.


Marshmallows are particularly challenging to grip because they are soft, deformable objects. Additionally, Sawyer's gripper does not always detect enough applied force when a marshmallow is present. To address these issues, we applied several techniques to maximize the probability of a successful grip.


The gripping action is abstracted into the “grip” button on the web interface.

Control Interface

web ui

To send commands to our robot, we connected it to a web service in the cloud.

A ROS node communicates with the server, and makes a corresponding ROS service call each time a button on our web interface is pressed.
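The node itself is essentially a thin dispatcher from button names to services. A minimal sketch, where the service names and the `call_service` wrapper are hypothetical (the real node would wrap each service in a `rospy.ServiceProxy`):

```python
# Hypothetical mapping from web-UI buttons to ROS service names.
BUTTON_TO_SERVICE = {
    "marshmallow": "/marshmello/move_to_marshmallow",
    "grip": "/marshmello/grip",
    "mouth": "/marshmello/move_to_mouth",
}

def handle_button(button, call_service):
    """Dispatch a button press from the web server to the matching ROS
    service; call_service stands in for a rospy.ServiceProxy call."""
    service = BUTTON_TO_SERVICE.get(button)
    if service is None:
        raise ValueError(f"unknown button: {button}")
    return call_service(service)
```

Keeping the mapping in one table makes it easy to add new commands (e.g. a voice-assistant intent) without touching the planning code.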

In the future, we could use this connection to the cloud to allow the robot to be commanded through a voice assistant, such as Amazon Alexa.

Running our project

At the start of our project, we discussed making a single launch file for the entire Mr. Marshmello stack: hardware drivers, MoveIt, calibration code, face tracking, etc. However, we soon realized that this would make development and debugging significantly more difficult. We wouldn't be able to, for example, kill and restart just one of our nodes without restarting the entire stack.

Instead, we split our project into several logically grouped launch files, and wrote a shell script to automatically launch each of them in named tmux windows. This was easy to run, yet also easy to debug:

tmux new-session -d -s marshmello

tmux rename-window 'sawyer'
tmux send-keys 'roslaunch marshmello_bringup sawyer_moveit.launch' C-m

tmux new-window -t marshmello -n 'rviz'
tmux send-keys 'roslaunch marshmello_bringup sawyer_rviz.launch' C-m

tmux new-window -t marshmello -n 'camera_drivers'
tmux send-keys 'roslaunch marshmello_bringup camera_drivers.launch' C-m



Our system could autonomously move to each marshmallow laid out arbitrarily on a table, grab it, and feed it to the subject. Each of these subtasks, like identifying the position of a marshmallow, moving the gripper there, picking up the marshmallow, and moving it back to the detected mouth position, functioned as expected.


josh eating

Our final system worked extremely well, meeting all of our initial design criteria.


The majority of our difficulties were related to path planning, and we ran into many problems while trying to interact with the robot. The first involved the MoveIt Python library: we faced a great deal of confusion figuring out how to use it, since it's not as well documented as its C++ counterpart. For instance, we struggled with end effector positions, relevant reference frames, and gripper controls. Most of these problems were solved through trial and error. Furthermore, MoveIt would crash constantly and was difficult to debug.

We also had issues with path planning constraints. We tried to implement as many realistic constraints as we could, but even when the constraints seemed flexible and realistic, the MoveIt path planning library would often fail to generate a valid plan. We ended up removing around half of our constraints to get a working result.

Additionally, we had difficulty with the gripping controls. We originally had many ideas for precise gripping, such as setting the force and velocity of the gripper, but none of the gripper controls documented online would work, and our code would crash. In the end we were limited to simply opening and closing the gripper. We also did not realize that the gripper had to be initialized before use, and that initialization required having no object between the fingers. In our first iterations we would initialize the gripper only when a marshmallow was already between the fingers, leaving the gripper uninitialized and ultimately crashing our code. Through some tinkering we found and fixed our mistake.

Outside of path planning, we had issues calibrating our Kinect. If the Kinect was calibrated incorrectly, the positions of the marshmallows would be off, guiding Sawyer to the wrong place. To mitigate this problem, we enforced a maximum error limit during calibration. We also had trouble getting the correct Kinect drivers onto the lab computers; they were only installed during the last couple of days of the project, so until then we tested from our personal computers.

Possible Improvements

Sawyer's default path planning can be quite hectic, and due to its placement in the lab, Sawyer's arm would often collide with the wall behind or next to it. We therefore set up constraints preventing it from hitting those walls or the table our marshmallows were placed on. But when we added a vertical constraint at the user's face position, the planner could no longer find a path to the marshmallow or to the face, even though we could manually move the arm along a valid path in gravity mode. Our solution was to remove the face constraint and simply make sure the subject was aware of where the arm was moving. This would be unacceptable in a real deployment, but because the arm moves slowly and our subjects could all anticipate its path, it was not a safety concern during our tests.

Another problem we encountered was detecting whether the gripper had successfully gripped the marshmallow. Since the necessary gripper position is fairly wide, the fingers can't always close far enough to detect the spring force from the compressed marshmallow. We currently have no way to detect a failed grip attempt with certainty. Given more time, we could move the arm to a default position in front of the head camera after an apparently failed attempt and check whether a marshmallow is visible; with this secondary check, we could determine with certainty whether the gripping sequence succeeded or needs to be redone.


Nandita Iyer


  • Computer Science, 2019
  • Software Engineering Intern @ Salesforce

Major contributions:

  • Path planning
  • Gripping

William Lu


  • Mechanical Engineering and Electrical Engineering and Computer Science, 2019
  • Software Engineering Intern @ Facebook

Major contributions:

  • Path planning
  • Gripping

Brent Yi


  • Computer Science, 2019
  • Hardware Engineering Intern @ Amazon Lab126

Major contributions:

  • Sensor calibration
  • Marshmallow pose estimation
  • Web interface (UI, server side)

Joshua Yuan


  • Computer Science, 2019
  • Software Engineering Intern @ Google

Major contributions:

  • Facial tracking
  • Web interface (ROS side)