Sign Language Recognition with Multi-modal Features

Abstract

We study the problem of automatically recognizing sign language from RGB videos and skeleton coordinates captured by Kinect, which is of great significance for communication between the deaf and hearing communities. In this paper, we propose a sign language recognition (SLR) system that uses data from two channels: gesture videos of the sign words and joint trajectories. In our framework, we extract two modalities of features to represent the hand shape videos and hand trajectories for recognition. The temporal variation of gestures is captured by a 3D CNN, and the activations of its fully connected layers are used as the representations of the sign videos. For trajectories, we use shape context to describe each joint and combine the descriptors into a feature matrix. A convolutional neural network is then applied to this matrix to generate a robust representation of the trajectories. Finally, we fuse these features and train an SVM classifier for recognition. We conduct experiments on a large-vocabulary sign language dataset with up to 500 words, and the results demonstrate the effectiveness of our proposed method.
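A minimal sketch of the fusion-and-classification stage described above, assuming the 3D CNN video features and the trajectory CNN features have already been extracted. The feature dimensions, sample counts, concatenation-based fusion, and SVM hyperparameters below are illustrative assumptions for a toy run, not the paper's actual configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy sizes for a quick demo (assumptions, not the paper's settings).
num_samples = 400        # number of sign video clips
num_classes = 20         # subset of the vocabulary for illustration
video_feat_dim = 256     # stand-in for 3D CNN fully connected activations
traj_feat_dim = 128      # stand-in for the trajectory CNN representation

rng = np.random.default_rng(0)

# Placeholders for the features produced by the two channels.
video_features = rng.normal(size=(num_samples, video_feat_dim)).astype(np.float32)
traj_features = rng.normal(size=(num_samples, traj_feat_dim)).astype(np.float32)
labels = rng.integers(0, num_classes, size=num_samples)

# Feature-level fusion by concatenation (one possible fusion scheme).
fused = np.concatenate([video_features, traj_features], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.2, random_state=0)

# Normalize, then train a linear SVM on the fused representation.
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="linear", C=1.0)
clf.fit(scaler.transform(X_train), y_train)

accuracy = clf.score(scaler.transform(X_test), y_test)
print(f"recognition accuracy: {accuracy:.3f}")
```

With real features in place of the random placeholders, the same concatenate-normalize-classify pipeline applies directly; other fusion strategies (e.g., score-level fusion of per-channel classifiers) are equally possible and are not ruled out by the abstract.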

Publication
In Pacific-Rim Conference on Multimedia