Combining Self-Supervised Learning and Imitation
for Vision-Based Rope Manipulation



We present a system where a robot takes as input a sequence of images of a human manipulating a rope from an initial to goal configuration, and outputs a sequence of actions that can reproduce the human demonstration, using only monocular images as input. To perform this task, the robot learns a pixel-level inverse dynamics model of rope manipulation directly from images in a self-supervised manner, using more than 30K interactions with the rope collected autonomously by the robot. The human demonstration provides a high-level plan of what to do and the low-level inverse model is used to execute the plan. We show that by combining the high and low-level plans, the robot can successfully manipulate a rope into a variety of target shapes using only a sequence of human-provided images for direction.


The paper is available at: [pdf]


The data used for training the inverse dynamics model is available here. Instructions on how to access the data and use our validation set to evaluate the accuracy of the learnt model are included in an IPython notebook here.


This video explains our method and shows our experimental results.

Robot's Rope Manipulation Skills

The following videos show randomly sampled successes and failures of our robot manipulating the rope into L, S, and W shapes and tying knots.





Generalization Experiments

We tested the learnt model on two different ropes: (i) a stiffer white rope and (ii) a softer black rope. Our model successfully manipulates these ropes into "S" and "L" shapes. This is notable because the model was trained using data from only a single red rope. Although the model succeeds at manipulating different ropes into "S" and "L" shapes, it fails at forming knots or the "W" shape. One possible reason for this failure is that the white rope is too stiff to be bent into these shapes. Because the model was trained using only a single (green) background, it does not generalize to new backgrounds. If provided with substantially more robot-hours and a greater variety of ropes and environments, our model could in principle learn a more generalizable notion of rope manipulation.



The network architecture used is shown below and explained in the paper.


The classification outputs for the three variables are produced by an autoregressive architecture. For example, in the model we evaluated on the robot, we first output a classification prediction for p (pixel), then take the top prediction and feed it into the next prediction (theta), and so on for length, as shown in the image above. Below, we include the test accuracy of the top prediction for each variable and for different network architectures trained on the publicly available data of 30K actions.

Prediction Ordering    P (pixel) Accuracy    T (theta) Accuracy    L (length) Accuracy
None (Independent)     12%                   13%                   20%
P -> T -> L            10%                   16%                   29%
L -> T -> P            14%                   13%                   22%
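
The autoregressive prediction scheme described above can be illustrated with a minimal sketch. This is not the network from the paper: the linear heads, the random weights, the feature size, and the bin counts are all placeholder assumptions, and the names greedy_autoregressive_predict and make_random_heads are hypothetical. It only shows how the top prediction for one variable is appended to the input of the next one, for a given ordering such as P -> T -> L.

```python
import numpy as np


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v


def make_random_heads(feature_dim, n_classes, order, seed=0):
    """Build placeholder linear classifiers (hypothetical, untrained).

    Each head sees the image features plus the one-hot choices of all
    variables that come earlier in `order`.
    """
    rng = np.random.default_rng(seed)
    heads = {}
    in_dim = feature_dim
    for name in order:
        k = n_classes[name]
        heads[name] = (rng.normal(size=(k, in_dim)), np.zeros(k))
        in_dim += k  # the next head also receives this variable's one-hot
    return heads


def greedy_autoregressive_predict(features, heads, order):
    """Predict each variable in turn, feeding the top class forward."""
    context = features
    prediction = {}
    for name in order:
        W, b = heads[name]
        logits = W @ context + b
        choice = int(np.argmax(logits))  # top prediction for this variable
        prediction[name] = choice
        context = np.concatenate([context, one_hot(choice, W.shape[0])])
    return prediction


# Illustrative bin counts only (assumed, not taken from this page).
order = ("p", "theta", "length")
n_classes = {"p": 400, "theta": 36, "length": 10}
heads = make_random_heads(feature_dim=16, n_classes=n_classes, order=order)
action = greedy_autoregressive_predict(np.ones(16), heads, order)
print(action)
```

Changing `order` to ("length", "theta", "p") gives the L -> T -> P variant in the same way; the independent baseline would correspond to every head seeing only the image features.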

Link to videos

All videos above can be accessed here.

Link to prior work

Our previous project on model learning and large-scale data collection can be viewed here.

Website Template

The template for this website has been adopted from Carl Doersch.


For comments/questions, contact Pulkit Agrawal