Sparsh: Self-supervised touch representations
for vision-based tactile sensing

1FAIR at Meta, 2University of Washington, 3Carnegie Mellon University

* indicates equal contribution


We present Sparsh, a family of general touch representations, and TacBench, a standardized benchmark consisting of six touch-centric tasks ([T1]-[T6]) covering prominent problems in vision-based tactile sensing. In evaluation (middle), we find that Sparsh, trained with self-supervision on a dataset of 460k+ tactile images, generalizes across tasks (right) and sensors (left), outperforming task- and sensor-specific models (E2E). [T1]-[T5] and [T6] are trained with 33% and 50% of the labeled data, respectively.

Abstract

In this work, we introduce general-purpose touch representations for the increasingly accessible class of vision-based tactile sensors. Such sensors have led to many recent advances in robot manipulation, as they markedly complement vision, yet solutions today often rely on task- and sensor-specific handcrafted perception models. Collecting real data at scale with task-centric ground-truth labels, such as contact forces and slip, is a challenge further compounded by sensors of various form factors that differ in aspects like lighting and gel markings.

To tackle this, we turn to self-supervised learning (SSL), which has demonstrated remarkable performance in computer vision. We present Sparsh, a family of SSL models that can support various vision-based tactile sensors, alleviating the need for custom labels through pre-training on 460k+ tactile images with masking and self-distillation in pixel and latent spaces. We also build TacBench to facilitate standardized benchmarking across sensors and models, comprising six tasks ranging from comprehending tactile properties to enabling physical perception and manipulation planning. In evaluations, we find that SSL pre-training for touch representations outperforms task- and sensor-specific end-to-end training by 95.1% on average over TacBench, with Sparsh (DINO) and Sparsh (IJEPA) being the most competitive, indicating the merits of learning in latent space for tactile images.


Walkthrough Video


Overview of Sparsh

To address the scarcity of labeled and even unlabeled data in the tactile domain, we curate new and existing vision-based tactile sensor datasets totaling ~661k samples. Further, we investigate self-supervised learning approaches, such as masked autoencoding (MAE), self-distillation (DINO and DINOv2), and joint-embedding prediction (JEPA), to train the Sparsh family of foundation models for general-purpose touch representations. We find that background subtraction and tokenization of a small temporal window (~80 ms) of tactile images are crucial for learning generalizable and effective representations. We then evaluate the general-purpose utility of these representations on TacBench, a benchmark of six tasks spanning tactile perception.
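As a rough illustration of this preprocessing, the sketch below (in PyTorch) applies background subtraction to a short window of tactile frames and tokenizes it into ViT-style patches. The frame count, patch size, and embedding dimension here are illustrative assumptions, not the exact values from the released Sparsh code.

```python
# Minimal sketch of the tactile input preprocessing described above.
# Assumed details: number of frames per window, patch size, and embed_dim.
import torch
import torch.nn as nn

class TactilePreprocessor(nn.Module):
    """Background-subtract a short temporal window of tactile frames and
    tokenize it into ViT-style patches."""

    def __init__(self, img_size=224, patch_size=16, frames_per_window=2, embed_dim=768):
        super().__init__()
        in_channels = 3 * frames_per_window  # stack the RGB frames along channels
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

    def forward(self, frames, background):
        # frames: (B, T, 3, H, W) tactile window (~80 ms); background: (B, 3, H, W)
        diff = frames - background.unsqueeze(1)    # remove static gel/lighting pattern
        x = diff.flatten(1, 2)                     # (B, T*3, H, W)
        tokens = self.patch_embed(x)               # (B, D, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D) patch tokens

# Usage: the tokens feed into a ViT backbone trained with MAE / DINO / JEPA objectives.
frames = torch.rand(1, 2, 3, 224, 224)
background = torch.rand(1, 3, 224, 224)
tokens = TactilePreprocessor()(frames, background)  # -> (1, 196, 768)
```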


Normal and shear field decoding

Using Sparsh, one can obtain normal and shear force fields in real time from a stream of tactile sensor images. To predict these fields, we train a DPT decoder on (frozen) latent representations from Sparsh via photometric losses computed by warping tactile image frames using the predicted normal and shear fields. These fields can be used to estimate contact locations and normals, and could be fed into policies for robot manipulation tasks or used for state estimation.
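The sketch below illustrates one way such a photometric objective could be set up: the predicted shear field is treated as a per-pixel displacement used to warp one tactile frame toward the next with `grid_sample`. The exact warping direction and loss formulation in our decoder may differ; this is a sketch under those assumptions.

```python
# Minimal sketch of a photometric warping loss between consecutive tactile frames.
# Assumption: the predicted shear field acts as per-pixel flow in pixel units.
import torch
import torch.nn.functional as F

def photometric_loss(frame_t, frame_t1, shear_flow):
    """frame_t, frame_t1: (B, 3, H, W); shear_flow: (B, 2, H, W) predicted shear."""
    B, _, H, W = frame_t.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=frame_t.device),
                            torch.linspace(-1, 1, W, device=frame_t.device),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)  # (B, H, W, 2)
    # Convert pixel displacements to normalized grid coordinates.
    flow = shear_flow.permute(0, 2, 3, 1)
    flow = torch.stack((flow[..., 0] * 2 / (W - 1),
                        flow[..., 1] * 2 / (H - 1)), dim=-1)
    # Warp frame_t with the predicted field and compare against the next frame.
    warped = F.grid_sample(frame_t, base + flow, align_corners=True)
    return F.l1_loss(warped, frame_t1)
```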


Pose estimation

We also show that Sparsh representations capture relative object pose information. To investigate this, we follow the regression-by-classification paradigm: we probe the frozen model with an attentive probe to estimate SE(2) transformations of the object relative to the sensor. Further results can be found in our paper.
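The sketch below shows the regression-by-classification idea in PyTorch: an attentive probe pools frozen Sparsh tokens and classifies each SE(2) component (dx, dy, dθ) into discrete bins. The bin count, pose ranges, and probe architecture here are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch of an attentive probe for SE(2) pose estimation via classification.
# Assumptions: 21 bins per component, ±5 mm translation and ±10 deg rotation ranges.
import torch
import torch.nn as nn

class SE2ClassificationProbe(nn.Module):
    """Attentive probe that classifies binned SE(2) pose deltas from frozen Sparsh tokens."""

    def __init__(self, embed_dim=768, num_bins=21):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(embed_dim, num_bins) for _ in range(3)])

    def forward(self, tokens):
        # tokens: (B, N, D) frozen Sparsh representations
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)      # attentive pooling over tokens
        pooled = pooled.squeeze(1)
        return [head(pooled) for head in self.heads]  # per-component class logits

def pose_to_bins(delta_pose, num_bins=21, max_translation=5.0, max_rotation=10.0):
    """Discretize (dx [mm], dy [mm], dtheta [deg]) into class indices for cross-entropy."""
    bounds = torch.tensor([max_translation, max_translation, max_rotation])
    normalized = (delta_pose.clamp(-bounds, bounds) + bounds) / (2 * bounds)
    return (normalized * (num_bins - 1)).round().long()
```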


Bead Maze

Sparsh representations can also enable planning for manipulation. We adapt the Bead Maze task to robot policy learning, where the goal is to guide a bead from one end of a wire to the other, following its path. Here, we use Diffusion Policy to train policies from a set of teleoperated demonstrations on different maze patterns. Given the tactile images, the policy predicts joint angles for the Franka robot arm. We find that, in general, policies trained with Sparsh representations perform slightly better than policies whose encoders are trained end-to-end.
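As a rough sketch of how this could be wired up, the snippet below wraps a frozen Sparsh encoder into an observation encoder whose pooled features condition the policy. The interface names and dimensions are illustrative assumptions, and the Diffusion Policy training loop itself is omitted.

```python
# Minimal sketch: frozen Sparsh features as conditioning for a learned policy.
# Assumptions: the encoder maps a tactile window to (B, N, D) tokens; cond_dim is arbitrary.
import torch
import torch.nn as nn

class TactileObsEncoder(nn.Module):
    """Wraps a frozen Sparsh encoder to produce a compact conditioning vector
    for a policy that predicts Franka joint angles."""

    def __init__(self, sparsh_encoder, embed_dim=768, cond_dim=256):
        super().__init__()
        self.encoder = sparsh_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)                 # keep touch representations frozen
        self.project = nn.Linear(embed_dim, cond_dim)

    @torch.no_grad()
    def encode(self, tactile_window):
        # tactile_window: (B, T, 3, H, W) frames from the fingertip sensor
        tokens = self.encoder(tactile_window)       # (B, N, D) Sparsh tokens
        return tokens.mean(dim=1)                   # mean-pool into one feature per sample

    def forward(self, tactile_window):
        return self.project(self.encode(tactile_window))  # (B, cond_dim) policy conditioning
```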


Acknowledgments

We thank Ishan Misra and Mahmoud Assran for insightful discussions on SSL for vision that informed this work, and Changhao Wang, Dhruv Batra, Jitendra Malik, Luis Pineda, and Tess Hellebrekers for helpful discussions on the research.

Our implementation is based on the following repositories:


BibTeX

If you find our work useful, please consider citing our paper:

@inproceedings{higuera2024sparsh,
  title = {Sparsh: Self-supervised touch representations for vision-based tactile sensing},
  author = {Carolina Higuera and Akash Sharma and Chaithanya Krishna Bodduluri and Taosha Fan and Patrick Lancaster and Mrinal Kalakrishnan and Michael Kaess and Byron Boots and Mike Lambeta and Tingfan Wu and Mustafa Mukadam},
  booktitle = {8th Annual Conference on Robot Learning},
  year = {2024},
  url = {https://openreview.net/forum?id=xYJn2e1uu8}
}