CN110705339A - C-C3D-based sign language identification method - Google Patents


Info

Publication number
CN110705339A
CN110705339A (application CN201910303476.5A)
Authority
CN
China
Prior art keywords
network
time sequence
sub
candidate
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910303476.5A
Other languages
Chinese (zh)
Inventor
赵宏伟 (Zhao Hongwei)
张卫山 (Zhang Weishan)
刘霞 (Liu Xia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN201910303476.5A priority Critical patent/CN110705339A/en
Publication of CN110705339A publication Critical patent/CN110705339A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a sign language recognition method based on C-C3D. A C3D network is taken as the main feature extraction network and improved: a temporal candidate box is defined by a pair of corner points, a variable-length three-dimensional convolution kernel is designed, and sign language gestures are recognized in a temporal candidate box classification and regression sub-network.

Description

C-C3D-based sign language identification method
Technical Field
The invention relates to the field of deep learning target detection and behavior recognition, in particular to a C-C3D-based sign language recognition method.
Background
The C-C3D-based sign language recognition method builds on deep learning techniques for target detection and behavior recognition. The techniques closest to the present invention are:
(1) Deep learning: with the rapid development of deep learning, new approaches have emerged for solving many practical problems. The strength of the convolutional neural network lies in its multilayer structure, which learns features automatically and at multiple levels: shallower convolutional layers have smaller receptive fields and learn features of local regions, while deeper convolutional layers have larger receptive fields and learn more abstract features. These abstract features are less sensitive to the size, position, and orientation of the target, which helps improve recognition performance. The network adapts well to geometric transformation, deformation, illumination, and similar variations of the target, effectively overcoming the recognition difficulty caused by variable target appearance. It extracts and analyzes features automatically from the data input into the network and therefore has strong generality and generalization capability.
(2) C3D network: the C3D network can be used to extract spatio-temporal features from video. Built on 3D convolution operations, the C3D network has 8 convolution layers and 4 pooling layers. All convolution kernels have size 3 × 3 × 3 with stride 1 × 1 × 1. In order not to shrink the temporal dimension too early, the first pooling layer has kernel size and stride 1 × 2 × 2, while all other pooling layers have kernel size 2 × 2 × 2 and stride 2 × 2 × 2. Finally, the network produces its output after two fully connected layers and a softmax layer. The input size of the network is 3 × 16 × 112 × 112, i.e., 16 frames are input at a time. Compared with a 2D network, a 3D network extracts features better, and paired with only a simple classifier it can outperform most existing algorithms. However, the C3D network can only process fixed-length video, and it can only classify a video; it cannot detect targets within the video.
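As a rough illustration of the pooling schedule described above, the shape of a clip can be traced through the pooling layers with a short calculation. This sketch simply follows the kernel and stride sizes quoted in this description; it is not a reference implementation of C3D:

```python
# Sketch: trace the (frames, height, width) size of a C3D-style input through
# the pooling schedule described above: first pool 1x2x2, remaining pools 2x2x2.

def shape_after_pools(frames, height, width, pools):
    """Apply each pooling stride by integer division; return the final shape."""
    t, h, w = frames, height, width
    for (pt, ph, pw) in pools:
        t, h, w = t // pt, h // ph, w // pw
    return t, h, w

# Pooling strides as stated in the text (first layer keeps the temporal length).
C3D_POOLS = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]

if __name__ == "__main__":
    # Standard C3D clip input: 16 frames of 112 x 112.
    print(shape_after_pools(16, 112, 112, C3D_POOLS))  # (2, 7, 7)
```

Note how the temporal dimension collapses from 16 to 2: doubling the input length would double it, which is why a fixed-input C3D cannot directly handle videos of arbitrary length.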
In order to make full use of the advantages of deep learning and remedy the shortcomings of C3D for sign language motion recognition, the C3D network is improved into a corner-point-based three-dimensional convolutional neural network, C-C3D, and a sign language recognition method based on C-C3D is provided.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention designs a corner-point-based three-dimensional convolutional neural network, C-C3D, and provides a sign language recognition method based on C-C3D.
The technical scheme of the invention is as follows:
Step (1): in the feature extraction sub-network of C-C3D, a C3D network is taken as the backbone; a video of arbitrary length is taken as input, and a feature map of the original video is obtained after a series of convolution, pooling, and activation operations in the feature extraction network;
Step (2): in the temporal candidate box extraction sub-network of C-C3D, a variable-length three-dimensional convolution is designed based on actual data, virtual data, and task characteristics; the position of a candidate box is determined by a pair of corner points, and candidate temporal segments that may contain a target are extracted;
Step (3): the temporal candidate box classification and regression sub-network of C-C3D selects candidate regions from the temporal candidate box extraction sub-network, extracts a fixed-size feature from each selected candidate region, and on the basis of this feature performs category judgment and temporal box regression on the candidate region;
Step (4): the temporal candidate box extraction sub-network and the classification and regression sub-network are combined, classification and regression are unified, and a joint loss function is designed.
The invention has the beneficial effects that:
(1) during candidate region extraction, the method determines the region where a gesture may exist through a pair of corner points, and designs a variable-length three-dimensional convolution;
(2) a new action recognition network, C-C3D, is designed, which can process videos of variable length and analyze the meaning of the signed gestures in the video.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a simplified model diagram of the C-C3D-based sign language recognition method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, the model of the C-C3D-based sign language recognition method is divided into three parts: a feature extraction sub-network, a temporal candidate box extraction sub-network, and a candidate box classification and regression sub-network.
The specific flow of the C-C3D-based sign language recognition method is described in detail below:
Step (1): in the feature extraction sub-network of C-C3D, a C3D network is taken as the backbone; a video of arbitrary length is taken as input, and a feature map of the original video is obtained after a series of convolution, pooling, and activation operations in the feature extraction network;
Step (2): in the temporal candidate box extraction sub-network of C-C3D, a variable-length three-dimensional convolution is designed based on actual data, virtual data, and task characteristics; the position of a candidate box is determined by a pair of corner points, and candidate temporal segments that may contain a target are extracted;
Step (3): the temporal candidate box classification and regression sub-network of C-C3D selects candidate regions from the temporal candidate box extraction sub-network, extracts a fixed-size feature from each selected candidate region, and on the basis of this feature performs category judgment and temporal box regression on the candidate region;
Step (4): the temporal candidate box extraction sub-network and the classification and regression sub-network are combined, classification and regression are unified, and a joint loss function is designed.
The C-C3D-based sign language recognition method provided by the invention takes a C3D network as the main feature extraction network, defines a temporal candidate box through a pair of corner points, designs a variable-length three-dimensional convolution kernel, and recognizes sign language gestures in the temporal candidate box classification and regression sub-network.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall fall within its protection scope.

Claims (1)

1. A C-C3D-based sign language recognition method, characterized in that a C3D network is used as the main feature extraction network, a temporal candidate box is defined through a pair of corner points, a variable-length three-dimensional convolution kernel is designed, and sign language gestures are recognized in a temporal candidate box classification and regression sub-network, the method comprising the following steps:
Step (1): in the feature extraction sub-network of C-C3D, a C3D network is taken as the backbone; a video of arbitrary length is taken as input, and a feature map of the original video is obtained after a series of convolution, pooling, and activation operations in the feature extraction network;
Step (2): in the temporal candidate box extraction sub-network of C-C3D, a variable-length three-dimensional convolution is designed based on actual data, virtual data, and task characteristics; the position of a candidate box is determined by a pair of corner points, and candidate temporal segments that may contain a target are extracted;
Step (3): the temporal candidate box classification and regression sub-network of C-C3D selects candidate regions from the temporal candidate box extraction sub-network, extracts a fixed-size feature from each selected candidate region, and on the basis of this feature performs category judgment and temporal box regression on the candidate region;
Step (4): the temporal candidate box extraction sub-network and the classification and regression sub-network are combined, and classification and regression jointly form a loss function.
CN201910303476.5A 2019-04-15 2019-04-15 C-C3D-based sign language identification method Pending CN110705339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910303476.5A CN110705339A (en) 2019-04-15 2019-04-15 C-C3D-based sign language identification method


Publications (1)

Publication Number Publication Date
CN110705339A (en) 2020-01-17

Family

ID=69193119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910303476.5A Pending CN110705339A (en) 2019-04-15 2019-04-15 C-C3D-based sign language identification method

Country Status (1)

Country Link
CN (1) CN110705339A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021373A (en) * 2014-05-27 2014-09-03 江苏大学 Semi-supervised speech feature variable factor decomposition method
US10110738B1 (en) * 2016-08-19 2018-10-23 Symantec Corporation Systems and methods for detecting illegitimate voice calls
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN109448307A (en) * 2018-11-12 2019-03-08 哈工大机器人(岳阳)军民融合研究院 A kind of recognition methods of fire disaster target and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hei Law et al., "CornerNet: Detecting Objects as Paired Keypoints", arXiv:1808.01244v2 [cs.CV] *
Huijuan Xu et al., "R-C3D: Region Convolutional 3D Network for Temporal Activity Detection", Proceedings of the IEEE International Conference on Computer Vision *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178344A (en) * 2020-04-15 2020-05-19 中国人民解放军国防科技大学 Multi-scale time sequence behavior identification method
CN113255570A (en) * 2021-06-15 2021-08-13 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation
CN113255570B (en) * 2021-06-15 2021-09-24 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation

Similar Documents

Publication Publication Date Title
CN109741331B (en) Image foreground object segmentation method
JP2022548438A (en) Defect detection method and related apparatus, equipment, storage medium, and computer program product
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN108345892B (en) Method, device and equipment for detecting significance of stereo image and storage medium
CN109145872B (en) CFAR and Fast-RCNN fusion-based SAR image ship target detection method
CN106897673B (en) Retinex algorithm and convolutional neural network-based pedestrian re-identification method
CN106022375B A clothes fashion recognition method based on Hu invariant moments and support vector machines
CN106127196A Facial expression classification and recognition method based on dynamic texture features
CN107066916B (en) Scene semantic segmentation method based on deconvolution neural network
CN109543548A (en) A kind of face identification method, device and storage medium
CN110569782A (en) Target detection method based on deep learning
CN109299303B (en) Hand-drawn sketch retrieval method based on deformable convolution and depth network
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN105931241A (en) Automatic marking method for natural scene image
CN110751619A (en) Insulator defect detection method
CN102136074B (en) Man-machine interface (MMI) based wood image texture analyzing and identifying method
CN105893941B (en) A kind of facial expression recognizing method based on area image
CN110705339A (en) C-C3D-based sign language identification method
Hu et al. RGB-D image multi-target detection method based on 3D DSF R-CNN
EP2790130A1 (en) Method for object recognition
CN116109678A (en) Method and system for tracking target based on context self-attention learning depth network
Song et al. Depth-aware saliency detection using discriminative saliency fusion
Van Hoai et al. Feeding Convolutional Neural Network by hand-crafted features based on Enhanced Neighbor-Center Different Image for color texture classification
CN104504715A (en) Image segmentation method based on local quaternion-moment characteristic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200117