CN114821764A - Gesture image recognition method and system based on KCF tracking detection

Gesture image recognition method and system based on KCF tracking detection

Info

Publication number
CN114821764A (application CN202210085768.8A)
Authority
CN (China)
Prior art keywords
gesture; tracking; KCF; detection; gesture image
Legal status
Pending (an assumption, not a legal conclusion)
Application number
CN202210085768.8A
Other languages
Chinese (zh)
Inventors
凌焕章, 董浩男, 周长玲, 李广涵
Assignee (original and current)
Harbin Engineering University
Application filed by Harbin Engineering University; priority to CN202210085768.8A


Classifications

    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 — Neural networks; architecture; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06T7/246 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments


Abstract

The invention belongs to the technical field of gesture image recognition and discloses a gesture image recognition method and system based on KCF tracking detection. The method comprises: performing recognition, segmentation, and feature extraction on dynamic gesture images; initializing the prediction by detecting the initial hand position with a deep-learning-based detector; and tracking the hand position with a KCF-based tracking algorithm. The method replaces matrix operations with element-wise products of vectors, accelerating computation. Because dense per-frame detection is no longer required once the hand has been located and accurately classified in the initial frame, the tracking algorithm markedly increases processing speed; the probability-based prediction method greatly simplifies computation, and the tracking-based gesture detection method significantly improves detection speed.

Description

Gesture image recognition method and system based on KCF tracking detection
Technical Field
The invention belongs to the technical field of gesture image recognition, and particularly relates to a gesture image recognition method and system based on KCF tracking detection.
Background
In recent years, after sustained theoretical exploration and practical innovation, gesture recognition technology has made great progress, and mature techniques can recognize simple dynamic gestures and meet practical requirements in specific application scenarios. However, several obstacles still stand in the way of a robust gesture recognition system, such as hand occlusion, mis-tracking of gestures, scalability of gesture databases, unpredictability of gesture motion, varying background illumination, and high computational cost, so existing algorithms still leave much to be solved in terms of real-time performance, computation speed, and recognition rate. The technical trends of future gesture recognition are as follows:
a) Three-dimensional gesture reconstruction based on point clouds. At present, the recognition rate of visual dynamic gestures is low because of illumination, rotation, and similar problems. Traditional gesture modeling typically models and analyzes gestures from their motion information or apparent features. Modeling based on motion information accounts for the spatiotemporal characteristics of gestures, but the resulting gesture edge information is blurry. Modeling based on apparent features is computationally cheaper, but does not adequately account for gesture deformation. Current RGB-D cameras treat gestures as two-dimensional structures, which makes appearance changes of the gesture model and gesture rotation challenging. RGB-D cameras are also expensive and require a simple background model for the gesture image, which limits their practicality. Point-cloud-based three-dimensional reconstruction provides accurate spatial support and can learn feature representations from two-dimensional projections of the gesture, so gesture targets that rotate out of the field of view or suffer severe self-occlusion can still be located stably.
b) Gesture tracking based on multi-region detection. Illumination change is a common problem in visual gesture tracking and severely degrades system performance. A multi-region detection strategy can detect several regions simultaneously on the basis of parallel computation, effectively enlarging the search region, accounting for visual perception under different layouts and region changes, and improving the saliency of the target gesture region.
c) Gesture tracking based on multi-granularity sparse representation. The gesture region usually occupies only a small proportion of the gesture image, so when analyzing the image both the global and the local features of the gesture should be considered to improve the expressiveness of gesture features. A tracking method based on multi-granularity sparse representation exploits not only the complementary strengths of global and local features but also the representational power of multiple patches at different granularities. Multi-granularity sparsity analysis of the gesture image takes the joint representation of target gestures of different sizes into account, improving the representational performance of gesture features.
d) Gesture tracking based on multi-stream deep similarity learning networks. CNNs are now widely used to learn robust feature representations and complex models from large-scale datasets, with corresponding progress in tasks such as object classification, target detection, and semantic segmentation. However, most current methods treat tracking as a binary classification problem, separating the target from the background with a trained classifier. During testing, the network is updated by gradient descent, but model updates are too slow for real-time operation. Moreover, a generic classifier such as a CNN model, trained in advance to discriminate objects of different classes, offers no advantage in discriminating objects of the same class, so the features it yields cannot adequately capture appearance changes of the target gesture. To better handle tracking of a particular target, the multi-stream deep similarity learning network treats the classification problem as an instance-level verification problem: target gestures in subsequent image frames are tracked with a generic deep similarity comparison model learned offline. Rather than transferring a feature space from a generic classification network, the model learns a feature space directly suited to similarity comparison with gesture instances, avoiding confusion among similar targets. The learned model localizes the target gesture in each frame directly, without online learning or parameter fine-tuning, yielding efficient discrimination of the target gesture.
e) Gesture classification based on adaptive convolutional neural networks. CNNs have been shown to achieve state-of-the-art performance in visual target tracking. However, existing CNN-based trackers typically train the network with whole target samples, and once the target encounters complications (such as occlusion, background clutter, or deformation), tracking performance degrades severely. An adaptive convolution kernel model captures the structure of the target gesture by designing a mask set to generate convolution kernels; an adaptive weighted fusion strategy over the kernels adapts to appearance changes of the target gesture and effectively strengthens the tracker's robustness. For a non-rigid target such as a gesture, this improves both saliency detection and classification accuracy.
The overall goal of gesture recognition is to interpret the semantics of the hand's position and posture. Researchers have proposed many gesture recognition techniques based on, for example, depth cameras, color cameras, distance sensors, data gloves, near-infrared cameras, wearable inertial sensors, or other sensing modalities. The cameras most commonly used to capture gesture images are RGB and TOF cameras. RGB obtains its colors by varying and superimposing the red (R), green (G), and blue (B) channels; the standard covers almost all colors perceivable by human vision and is one of the most widely used color systems today. TOF (time-of-flight) cameras emit infrared light (laser pulses) invisible to the human eye; the light reflects off objects back to the camera, and by computing the time or phase difference between emission and return, the camera accumulates distance and depth data from which a three-dimensional 3D model is obtained.
Gesture recognition based on traditional hand-crafted features has low recognition accuracy and struggles to recognize multiple gesture targets in one picture. A deep learning model is an artificial neural network with many layers and strong nonlinear modeling capability; compared with traditionally designed features, the features it learns from data through a general learning process express higher-level, more abstract internal characteristics.
Gesture tracking is an important link in gesture recognition. Its essence is to analyze a temporally continuous image sequence frame by frame and locate the tracked target within the interval of image change. Because gesture motion has many degrees of freedom and the target is relatively small, hand tracking is generally slow; increasing the tracking detection rate is therefore one of the key technical problems the present invention must solve.
Thanks to its strong self-learning capability, a neural-network-based method can automatically acquire high-level features of gesture samples and adapts well to general datasets. When a neural network is trained, low-level features have higher resolution and mainly correspond to gesture details, while high-level features correspond to abstract semantic information. In a gesture image, however, the gesture occupies only a small proportion of the frame. Improving the accuracy of dynamic gesture recognition while guaranteeing good timeliness is therefore another key technical problem to be solved by this project.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) At present, the recognition rate of visual dynamic gestures is low because of illumination, rotation, and similar problems; existing RGB-D cameras treat the gesture as a two-dimensional structure, making appearance changes of the gesture model and gesture rotation challenging; and RGB-D cameras are expensive and demand a simple background model for the gesture image, limiting their practicality.
(2) Existing gesture tracking based on multi-stream deep similarity learning networks updates the network by gradient descent, which is too slow for real-time operation; a generic classifier offers no advantage in distinguishing objects of the same class, and the features it yields cannot adequately capture appearance changes of the target gesture.
(3) Existing CNN-based trackers typically train the network with whole target samples, and once the target encounters complications (such as occlusion, background clutter, and deformation) tracking performance can degrade severely; moreover, when several TOF devices operate together, the infrared light of one enters the lenses of the others, causing mutual interference and distortion.
(4) Gesture recognition based on traditional features has low recognition accuracy and struggles to recognize multiple gesture targets in one picture; and because rapidly changing gestures have many degrees of freedom while the target is relatively small, hand tracking is generally slow.
The difficulties in solving the above problems and defects are:
(1) Cost. Compared with the original equipment, the feature information extracted by lower-cost equipment is less rich;
(2) Adaptability. A single configuration cannot satisfy different scenes. Designing an algorithm that converts simpler raw feature data into more general gesture features is therefore the difficult point of this scheme.
The significance of solving these problems and defects is:
(1) Faster training and detection for the gesture recognition algorithm. In addition, the detection module of this scheme uses a one-stage algorithm, which has a speed advantage over traditional two-stage algorithms.
(2) Stronger resistance of the gesture recognition algorithm to deformation and background interference. An algorithm with a high recall rate can accurately pick out the detection target in the image, reducing the influence of deformation and background on detection.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a gesture image recognition method and system based on KCF tracking detection.
The invention is realized in such a way that a gesture image recognition method based on KCF tracking detection comprises the following steps:
step one, respectively carrying out identification, segmentation and feature extraction on dynamic gesture images;
step two, prediction initialization: detecting an initial hand position using a deep learning network-based approach;
and step three, tracking the position of the hand by adopting a tracking algorithm based on KCF.
Further, in the first step, the gesture region in the image is segmented and extracted with a fully convolutional network (FCN). In the FCN structure, after the fully connected layers of Alexnet are converted into convolutional layers, the network has 8 convolutional layers, followed by a deconvolution (up-sampling) layer that enlarges the output and a crop layer that trims the result. The final output is a score layer giving the probability that each pixel of the image belongs to each class; the loss layer used is SoftmaxWithLoss:

$$p_{k}=\frac{e^{z_{k}}}{\sum_{j} e^{z_{j}}},\qquad \mathrm{Loss}=-\log p_{k}$$
Further, in step three, the KCF method has the following characteristics (a minimal sketch of the circulant construction is given after this list):
(1) A circulant matrix is used to generate a sufficient number of training samples for the classifier. The circulant matrix places copies of the current target along its diagonal; in practice, a base vector is shifted step by step, each shift producing a new vector, and the shifted vectors are stacked into a circulant matrix. The classifier is then trained by least squares with a regularization (penalty) term.
(2) The circulant matrix is Fourier-diagonalized using its special structure, which simplifies the problem and accelerates the solution of the linear regression.
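As a concrete illustration of item (1), the sketch below builds the circulant matrix of cyclic shifts of a one-dimensional base sample with NumPy. The function name is ours, not the patent's; it is a minimal sketch of the construction, not the patent's implementation.

```python
import numpy as np

def circulant_samples(x):
    """Stack all cyclic shifts of base sample x into a circulant matrix.

    Row i is x shifted right by i positions; together the rows serve as
    the dense set of virtual training samples used to fit the ridge
    regression classifier in KCF.
    """
    n = len(x)
    return np.stack([np.roll(x, i) for i in range(n)])

base = np.array([1.0, 2.0, 3.0, 4.0])
print(circulant_samples(base))
# [[1. 2. 3. 4.]
#  [4. 1. 2. 3.]
#  [3. 4. 1. 2.]
#  [2. 3. 4. 1.]]
```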
Further, for the one-dimensional linear regression problem $f(x_i)=\omega^{T}x_i$, where $(x_i, y_i)$ denotes the $i$-th sample and $\omega$ the corresponding weight, the regularized objective function is:

$$\min_{\omega}\sum_{i}\left(\omega^{T}x_{i}-y_{i}\right)^{2}+\lambda\|\omega\|^{2}\qquad(1)$$

The corresponding matrix expression is:

$$\min_{\omega}\|X\omega-y\|^{2}+\lambda\|\omega\|^{2}\qquad(2)$$

where $X=[x_{1},x_{2},\ldots,x_{n}]^{T}$, each row $x_i$ corresponding to one sample and $y_i$ to the label of that sample. Solving for the minimum means finding the extremum of the function; setting the derivative with respect to $\omega$ to zero,

$$X^{T}(X\omega-y)+\lambda\omega=0\qquad(3)$$

gives

$$\omega=(X^{T}X+\lambda I)^{-1}X^{T}y\qquad(4)$$

and, in complex space,

$$\omega=(X^{H}X+\lambda I)^{-1}X^{H}y\qquad(5)$$

where $X^{H}=(X^{*})^{T}$ is the complex conjugate (Hermitian) transpose. Because a circulant matrix can be diagonalized in Fourier space with the DFT matrix $F$:

$$X=F\,\operatorname{diag}(\hat{x})\,F^{H}\qquad(6)$$

where $\hat{x}$ is the discrete Fourier transform of the generating vector $x$. Substituting formula (6) into formula (5) yields:

$$X^{H}X+\lambda I=F\,\operatorname{diag}\!\left(\hat{x}^{*}\odot\hat{x}+\lambda\right)F^{H}\qquad(7)$$

$$\omega=F\,\operatorname{diag}\!\left(\frac{\hat{x}^{*}}{\hat{x}^{*}\odot\hat{x}+\lambda}\right)F^{H}y\qquad(8)$$

Transforming to the Fourier domain, we obtain:

$$\hat{\omega}=\frac{\hat{x}^{*}\odot\hat{y}}{\hat{x}^{*}\odot\hat{x}+\lambda}\qquad(9)$$

where $\odot$ denotes element-wise multiplication and $^{*}$ complex conjugation. The matrix inversion thus reduces to element-wise operations on Fourier-transformed vectors.
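A minimal NumPy sketch of this Fourier-domain solution follows, under the simplifying assumption of a single one-dimensional linear (non-kernelized) template with a Gaussian regression target; all function names are ours and this is an illustration of formula (9), not the patent's implementation.

```python
import numpy as np

def train(x, y, lam=1e-4):
    """Fit the ridge regression filter in the Fourier domain (formula 9).

    x : 1-D base sample (its cyclic shifts form the virtual training set)
    y : desired response for each cyclic shift (Gaussian peak at shift 0)
    """
    xf, yf = np.fft.fft(x), np.fft.fft(y)
    return np.conj(xf) * yf / (np.conj(xf) * xf + lam)

def detect(w_f, z):
    """Correlation response of filter w_f over all cyclic shifts of z."""
    return np.real(np.fft.ifft(w_f * np.fft.fft(z)))

n = 64
x = np.zeros(n); x[30:34] = 1.0        # toy training sample
idx = np.arange(n)
d = np.minimum(idx, n - idx)           # circular distance to shift 0
y = np.exp(-0.5 * (d / 2.0) ** 2)      # Gaussian regression target, peak at 0

w_f = train(x, y)
z = np.roll(x, 7)                      # test sample translated by 7 samples
resp = detect(w_f, z)
print("estimated shift:", np.argmax(resp))   # expect 7
```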
Further, in the third step, the tracking algorithm trains a classifier during tracking, judges the possible position of the target in the next frame, and updates the classifier by using the detected new target; in the detection process, the target area is a positive sample, and the non-target background is a negative sample, so that the position of the positive sample determines the position where the target may exist in the next frame.
Another object of the present invention is to provide a gesture image recognition system based on KCF tracking detection applying the gesture image recognition method based on KCF tracking detection, the gesture image recognition system based on KCF tracking detection comprising:
the image preprocessing module is used for respectively identifying and segmenting the dynamic gesture image and extracting the characteristics of the dynamic gesture image;
a prediction initialization module for detecting a position of an initial hand using a deep learning network based method;
and the position tracking module is used for tracking the position of the hand by adopting a KCF-based tracking algorithm.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
respectively carrying out identification, segmentation and feature extraction on the dynamic gesture image; detecting the position of an initial hand by using a method based on a deep learning network, and initializing the prediction; and tracking the position of the hand by adopting a KCF-based tracking algorithm.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
respectively carrying out identification, segmentation and feature extraction on the dynamic gesture image; detecting the position of an initial hand by using a method based on a deep learning network, and initializing the prediction; and tracking the position of the hand by adopting a KCF-based tracking algorithm.
Another object of the present invention is to provide an information data processing terminal for implementing the gesture image recognition system based on KCF tracking detection.
The invention also aims to provide an application of the gesture image recognition system based on KCF tracking detection in information equipment and system control.
The invention has the following advantages and positive effects: the gesture image recognition method based on KCF tracking detection replaces matrix operations with element-wise (dot) products of vectors, accelerating computation. With the tracking algorithm, once the hand has been located and accurately classified in the initial frame, processing speed increases markedly, because dense detection no longer has to run on every frame; the probability-based prediction method greatly simplifies computation.
Compared with detection-based gesture recognition, tracking-based gesture recognition significantly improves detection speed. Accordingly, during initialization a deep-learning-based detector locates the initial hand position and initializes the prediction, after which the hand position is followed by tracking. A tracking algorithm generally trains a classifier during tracking, predicts where the target may appear in the next frame, and updates the classifier with the newly detected target. During detection, the target region is the positive sample and the non-target background the negative sample, so the position of the positive sample determines where the target may appear in the next frame. The KCF approach has the following characteristics:
(1) A circulant matrix is used to generate a sufficient number of training samples for the classifier. The circulant matrix places copies of the current target along its diagonal; in practice, a base vector is shifted step by step, each shift producing a new vector, and the shifted vectors are stacked into a circulant matrix. The classifier is then trained by least squares with a regularization (penalty) term.
(2) The circulant matrix is Fourier-diagonalized using its special structure, which simplifies the problem and accelerates the solution of the linear regression.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a gesture image recognition method based on KCF tracking detection according to an embodiment of the present invention.
FIG. 2 is a block diagram of a gesture image recognition system based on KCF tracking detection according to an embodiment of the present invention;
in the figure: 1. an image preprocessing module; 2. a prediction initialization module; 3. a location tracking module.
Fig. 3 is a schematic diagram of a gesture recognition technology provided in an embodiment of the present invention.
Fig. 4 is a schematic diagram of a gesture recognition technology framework provided in an embodiment of the present invention.
Fig. 5(a) is a schematic diagram of a limitation on the size of an input image due to a full connection layer according to an embodiment of the present invention.
Fig. 5(b) is a schematic diagram of the input size not limited by the fully connected layer according to the embodiment of the present invention.
Fig. 6 is a schematic diagram illustrating an effect of a full link layer on an input image size according to an embodiment of the present invention.
Fig. 7 is a diagram illustrating an original image and a result of a calculated heat map according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a calculation result obtained by the full convolution neural network according to the embodiment of the present invention.
Fig. 9 is a schematic diagram of an Alexnet network structure according to an embodiment of the present invention.
Fig. 10 is a schematic diagram of an Alexnet-based FCN network structure according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating a gesture segmentation result as a mask covered on an original image under different environments according to an embodiment of the present invention.
FIG. 12 is a schematic diagram of a single-frame target tracking algorithm using a KCF-based algorithm according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides a gesture image recognition method and system based on KCF tracking detection, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the gesture image recognition method based on KCF tracking detection provided by the embodiment of the present invention includes the following steps:
s101, respectively identifying and segmenting dynamic gesture images and extracting features;
s102, prediction initialization: detecting an initial hand position using a deep learning network-based approach;
and S103, tracking the position of the hand by adopting a KCF-based tracking algorithm (a sketch of this detect-then-track loop follows).
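The following sketch shows how such a detect-then-track loop might be wired together with OpenCV's KCF tracker. The detector call is a placeholder for the deep-learning network of step S102, all names are ours, and API details vary across OpenCV builds (in some 4.x releases the KCF tracker lives under cv2.legacy).

```python
import cv2

def detect_hand(frame):
    """Placeholder for the deep-learning hand detector of step S102.

    Should return a bounding box (x, y, w, h), or None if no hand is found.
    """
    raise NotImplementedError

cap = cv2.VideoCapture(0)
tracker = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if tracker is None:
        bbox = detect_hand(frame)              # dense detection runs only here
        if bbox is not None:
            tracker = cv2.TrackerKCF_create()  # cv2.legacy.TrackerKCF_create() on some builds
            tracker.init(frame, bbox)
    else:
        ok, bbox = tracker.update(frame)       # cheap per-frame KCF update
        if ok:
            x, y, w, h = map(int, bbox)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        else:
            tracker = None                     # target lost: fall back to detection
    cv2.imshow("gesture tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:            # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```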
As shown in fig. 2, a gesture image recognition system based on KCF tracking detection provided by an embodiment of the present invention includes:
the image preprocessing module 1 is used for respectively identifying and segmenting the dynamic gesture image and extracting the characteristics of the dynamic gesture image;
a prediction initialization module 2 for detecting the position of the initial hand using a deep learning network-based method;
and the position tracking module 3 is used for tracking the position of the hand by adopting a KCF-based tracking algorithm.
The technical solution of the present invention is further described below with reference to specific examples.
1. Summary of the invention
The invention provides a dynamic fast gesture recognition method based on deep learning, designing and implementing a gesture recognition framework combined with finger pose estimation to address multi-view, multi-target, finger self-occlusion, and related problems in gesture recognition. By improving relevant parameters and network structures in deep convolutional network models and combining several improved deep convolutional gesture recognition models for finger pose estimation on gesture images, the gesture features output by the pose estimation model are fed, together with the original color image, into a classifier for gesture recognition. This improves recognition accuracy and yields a fast dynamic gesture recognition model with a high recognition rate that can recognize many types of gestures.
2. Research content
2.1 dynamic gesture fast recognition research based on deep learning
Among the various modes of communication, gestures are the most effective and common way of expressing meaning; because gestures are flexible in communication and operation, using the hand as an input device is very attractive for natural human-computer interaction. Gesture language divides into categorical gestures (static gestures) and motion patterns (dynamic gestures), which can be used to implement command-and-control interfaces. Advances in computer vision have the potential to provide a more natural, contact-free solution.
Vision-based dynamic gesture recognition generally divides into machine-learning-based and deep-learning-based approaches. The machine-learning approach must detect and segment the gesture and remove background noise, then carry out gesture modeling steps such as tracking, feature extraction, and gesture classification and recognition. The deep-learning approach composes simple nonlinear units to transform one level of feature representation into a more abstract one, automatically learning gesture features from the raw image for classification.
(1) Gesture detection and segmentation
Gesture detection and segmentation are the first steps of gesture recognition. During segmentation the gesture image must be separated from the background model, and the quality of the segmentation directly affects recognition accuracy. Gesture segmentation algorithms mainly comprise those based on depth sensors, those based on two-dimensional plane features, and those based on motion information.
Depth-sensor-based segmentation acquires the depth structure of the hand with a depth camera and segments the hand region from it. Segmentation based on two-dimensional plane features extracts the hand region from the apparent features of the gesture; common plane features include color space, edge information, and pixel information, and skin-color segmentation in a color space is a common algorithm of this kind, but in practice it cannot eliminate the influence of background, illumination, and similar conditions, and cannot build a sufficiently accurate skin-color model. Segmentation based on motion information models the image information and compares it against a background model to detect the target object. The rise of deep learning brings a new opportunity for gesture segmentation: training on massive data automatically learns the target gesture features, so the target gesture is detected and the corresponding segmentation follows from the detection. Compared with traditional methods, deep-learning-based segmentation needs no manual analysis of gesture features, making segmentation more convenient and giving it better application prospects. (A skin-color baseline of the kind just described is sketched below.)
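As an illustration of the color-space branch described above, a common baseline thresholds skin pixels in YCrCb space. The threshold values below are typical literature heuristics, not values from the patent, and the sketch is ours:

```python
import cv2
import numpy as np

def skin_mask(bgr):
    """Rough skin segmentation by thresholding the Cr/Cb channels.

    The Cr 133-173 / Cb 77-127 range is a commonly cited heuristic; as the
    surrounding text notes, it is fragile under background and illumination
    changes, which is what motivates the learned FCN segmentation.
    """
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # morphological opening removes speckle noise from the mask
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```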
The invention aims to design a practical gesture segmentation algorithm based on deep learning, which not only can segment hands and backgrounds, but also has the characteristics of strong robustness, high operation speed and the like for complex backgrounds and various interference factors.
(2) Dynamic gesture recognition
Dynamic gesture recognition is a recognition process with both temporal and spatial characteristics. If the definition of a gesture refers to hand-shape information, the task is dynamic gesture recognition including hand shape; otherwise it is dynamic gesture recognition without hand shape. Without hand shape, only the trajectory the hand sweeps through space matters, so the problem reduces to trajectory recognition; common methods are hidden Markov models, dynamic time warping, and time-axis compression. With hand shape, the spatial characteristics of the gesture include both trajectory and hand-shape features: if the current frame contains a preset static hand shape it is marked as a key frame, otherwise as a non-key frame, and gestures are defined and recognized by combining key frames with their motion trajectories.
The invention aims to design a dynamic gesture detection and recognition algorithm, which is based on RGB and TOF cameras to capture, detect and recognize gesture images, adopts image target detection and recognition algorithms such as deep learning and the like to research the influence of different resolutions on the real-time performance of the algorithm, and utilizes gesture (including limbs) instructions to control information equipment and systems aiming at fixed positions or specific areas.
2.2 gesture recognition research in Complex Environment
Complicated background information greatly interferes with gesture recognition, and dynamic gestures are harder to recognize than static ones: because they are in motion they carry more information, and recognition accuracy still needs improvement. When a deep neural network is trained on such a dataset, pre-segmenting training samples and post-processing the output into label sequences place a heavy burden on the user and the network. A dynamic gesture typically comprises 3 motion phases: preparation, core, and retraction. The main information lies in the core phase; different gestures may be similar in the preparation and retraction phases, which adds interference to recognition. The invention aims to develop a strongly adaptive gesture recognition algorithm that controls information equipment and systems with gesture (including limb) instructions for a fixed position or specific area in a complex environment.
3. Main function and technical index
3.1 major functions
The invention is based on the technologies of capturing, detecting and identifying gesture images by RGB and TOF cameras, adopts image target detection and identification algorithms such as deep learning and the like, studies the influence of different resolutions on the real-time performance of the algorithms, and realizes the control of information equipment and systems by using gesture (including limb) instructions aiming at fixed positions or specific areas. The main functions are as follows:
(1) dynamic gesture rapid recognition research based on deep learning
The machine learning-based visual dynamic gesture recognition needs to detect and segment gestures, remove background noise, and then perform gesture modeling processes such as gesture tracking, feature extraction, gesture classification and recognition.
a. Gesture detection and segmentation are the first steps of gesture recognition. During segmentation the gesture image must be separated from the background model, and the quality of the segmentation directly affects recognition accuracy; segmentation algorithms mainly comprise those based on depth sensors, two-dimensional plane features, and motion information.
b. Dynamic gesture recognition is a recognition process with temporal and spatial characteristics. Without hand shape, only the trajectory of the hand in space matters, so the problem reduces to trajectory recognition; with hand shape, the spatial characteristics include both trajectory and hand-shape features: if the current frame contains a preset static hand shape it is marked as a key frame, otherwise as a non-key frame, and gestures are defined and recognized by combining key frames with their motion trajectories.
(2) Gesture recognition research in complex environment
Complicated background information greatly interferes with gesture recognition, and dynamic gestures are harder to recognize than static ones: because they are in motion they carry more information, and recognition accuracy still needs improvement.
4. Design of the scheme
4.1 subject location
Gesture recognition means that an intelligent terminal device can recognize gestures captured by a camera and understand their meaning, thereby triggering the corresponding function. Gesture recognition technology provides a natural, intuitive channel for dialogue between a person and a terminal device. Among human-computer interaction technologies it imposes the fewest constraints on the user: when the hand serves as a direct input device of the computer, communication between person and computer no longer needs an intermediate medium, and the user can control the terminal device with defined gestures. Gesture recognition mainly comprises two parts:
1. and gesture segmentation, namely segmenting gestures in the image as the input of the gesture recognition model.
2. Gesture recognition: recognizing the meaning represented by the gesture image. The technology applies to human-human interaction as well as human-computer interaction; through it, a hearing person can understand the sign language of deaf-mute users, removing the language barrier.
4.2 general research concept
The gesture recognition technology mainly comprises the following steps: preprocessing, gesture segmentation, gesture feature extraction and identification model establishment. Wherein the preprocessing is to remove noise in the image; the gesture segmentation is to segment the gesture area in the image to extract the gesture area from a complex background, so that the influence of the background on a gesture recognition result is reduced, and meanwhile, the problem of multiple gestures in the image can be processed; the gesture feature extraction is to design and extract the features of the gesture and describe the gesture through the features; the identification model is established by training a gesture model according to the characteristics of different gestures.
A. Preprocessing
Gesture images inevitably pick up noise during acquisition, which seriously affects their segmentation and recognition, so preprocessing the image before gesture segmentation and recognition is particularly important. Common gesture image preprocessing mainly comprises denoising, binarization, and morphological processing of the gesture image; a typical chain is sketched below.
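A typical OpenCV preprocessing chain of the kind just listed might look as follows. The parameter values are illustrative choices, not the patent's, and the sketch is a minimal example:

```python
import cv2

def preprocess(gray):
    """Denoise, binarize, and morphologically clean a grayscale gesture image."""
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                     # suppress sensor noise
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu binarization
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)       # remove speckles
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)      # fill small holes
    return closed
```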
B. Gesture segmentation
When a gesture is recognized, a large amount of background information inevitably exists, and the introduction of the background information causes the accuracy of gesture recognition to be reduced. Therefore, the gesture is particularly important to be segmented before gesture recognition, and the gesture is extracted from a complex background, so that the influence of background factors is reduced, and the recognition accuracy is improved. Meanwhile, the problem of multiple gestures in the gesture images is solved, namely, one image contains multiple gestures, and multiple gesture images can be obtained through gesture segmentation.
C. Gesture feature extraction
Traditional machine learning requires manual design and extraction of features; local features commonly used by existing gesture recognition algorithms based on traditional machine learning include SIFT, SURF, HOG, and LBP descriptors. But because target features must be designed for a particular scene, a trained model may not be robust in other scenes, and researching new features costs great effort while the features remain non-general and cannot fully describe gestures. A convolutional neural network extracts target features automatically through its convolution kernels, building up high-dimensional features layer by layer, overcoming the shortcomings of traditional machine learning in feature extraction and yielding more comprehensive, richer features.
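For example, HOG features of the kind listed above can be computed with scikit-image. This is a sketch; the crop size and HOG parameters are common illustrative choices, not values from the patent:

```python
from skimage import color, io, transform
from skimage.feature import hog

def hog_features(path):
    """Return a HOG descriptor for a 64x64 grayscale gesture crop."""
    img = color.rgb2gray(io.imread(path))
    img = transform.resize(img, (64, 64))
    return hog(img,
               orientations=9,            # gradient orientation bins
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm="L2-Hys")
```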
D. Identification model building
Building a recognition model means obtaining a model from the features of the target object. Common gesture recognition models can currently be built as follows: with a support vector machine (SVM), which maps nonlinear features to a high dimension where they become linearly separable; with a random forest (RF), which builds a decision tree per feature subset and is targeted, overfitting-resistant, and accurate; or with a convolutional neural network, which extracts target features automatically through its convolution kernels, fits them by continuous training, and finally yields the model.
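Once feature vectors are available (e.g. the HOG descriptors sketched above), the first two modeling options reduce to a few lines with scikit-learn. A hedged sketch; hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def fit_and_compare(X, y):
    """X: (n_samples, n_features) gesture features, y: gesture labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    svm = SVC(kernel="rbf", C=10.0).fit(X_tr, y_tr)    # kernel maps features to a higher dimension
    rf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)  # ensemble of decision trees
    return (accuracy_score(y_te, svm.predict(X_te)),
            accuracy_score(y_te, rf.predict(X_te)))
```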
4.3 technical composition and working principle
The invention analyzes dynamic gesture recognition on the basis of deep learning. Deep learning is one of the research areas that has attracted the most attention in recent years and is now the most successful combination of big data and artificial intelligence. A deep learning model is a multilayer structure in which each layer consists of many identical neurons; each layer can be regarded as learning certain features, and higher layers learn more complex features on the basis of the lower ones. Deep-learning-based dynamic gesture recognition composes simple nonlinear units to transform one level of feature representation into a more abstract one, automatically learning gesture features from the raw image for classification.
The principle of the gesture recognition technology is shown in fig. 3, and the framework of the gesture recognition technology is shown in fig. 4.
4.4 Main technical index analysis
In order to make the dynamic gesture recognition have better effect, the key technical problems to be solved by the project are as follows:
(1) dynamic gesture segmentation techniques
Most gesture segmentation methods segment with manually designed features of the target image, which works against a simple background. Effective gesture features are hard to design in complex environments, however, making such methods difficult to apply in human-computer interaction systems. Although many segmentation algorithms have been proposed, none handles well the influence of illumination, background, skin-colored objects, and the like on segmentation accuracy, so robustness suffers under complex backgrounds and interference. Studying how to segment the gesture region from the background effectively is therefore one of the key technical problems of this project.
(2) Feature extraction technique
Gesture feature extraction is a key step in gesture recognition. Feature extraction is the conversion of a portion of interest in an input image into a compact set of feature vectors. In gesture recognition, the extracted gesture features should contain relevant information obtained from the input gesture and feature representation information distinguished from other gestures. The selection and design of the gesture features are not only related to the accuracy of gesture recognition, but also related to the complexity and the real-time performance of the system. Therefore, the selection and design of the gesture features are one of the key technical problems to be solved by the project.
(3) Tracking detection technique
Gesture tracking is an important link in gesture recognition. Its essence is to analyze a temporally continuous image sequence frame by frame and locate the tracked target within the interval of image change. Because gesture motion has many degrees of freedom and the target is relatively small, hand tracking is generally slow; increasing the tracking detection rate is therefore one of the key technical problems the present invention must solve.
(4) Dynamic gesture recognition techniques
Thanks to its strong self-learning capability, a neural-network-based method can automatically acquire high-level features of gesture samples and adapts well to general datasets. When a neural network is trained, low-level features have higher resolution and mainly correspond to gesture details, while high-level features correspond to abstract semantic information. In a gesture image, however, the gesture occupies only a small proportion of the frame. Improving the accuracy of dynamic gesture recognition while guaranteeing good timeliness is therefore one of the key technical problems of this project.
5. Key technology and solution
5.1 dynamic gesture recognition technique based on deep learning
Traditional gesture recognition must train a dedicated classifier for a given gesture dataset, and with limited training data it does not suit large datasets and is hard to popularize. Thanks to its strong self-learning capability, a neural-network-based method automatically acquires high-level features of gesture samples and suits general datasets. When a neural network is trained, low-level features have higher resolution, mainly corresponding to gesture details, while high-level features correspond to abstract semantic information. For gesture images, however, the gesture occupies only a small proportion of the frame, so to enrich gesture feature information and reach high recognition accuracy most methods design deeper networks, which consume a great deal of training time and memory and make real-time gesture recognition hard to guarantee.
The deep convolutional neural network in deep learning is an effective picture feature extraction method, and is a popular method in the field of image processing and target recognition because the deep convolutional neural network has translation and rotation invariance to image information.
Complicated background information greatly interferes with gesture recognition, and dynamic gestures are harder to recognize than static ones: because they are in motion they carry more information, and recognition accuracy still needs improvement. When a deep neural network is trained on such a dataset, pre-segmenting training samples and post-processing the output into label sequences place a heavy burden on the user and the network. A dynamic gesture typically comprises 3 motion phases: preparation, core, and retraction. The main information lies in the core phase; different gestures may be similar in the preparation and retraction phases, which adds interference to recognition.
Gesture classification
The gesture classification is to classify the extracted gesture space-time characteristics and is the last core link for realizing gesture recognition. Common dynamic gesture classification algorithms can be roughly classified into four types: an algorithm based on template matching, an algorithm based on state diagram transfer, an algorithm based on statistical learning and an algorithm based on a neural network.
TABLE 1 gesture Classification common Algorithm
[Table 1 comparing the four classes of algorithms appears as images in the original document.]
5.2 gesture segmentation technique based on full convolution neural network (FCN)
Deep convolutional networks achieve good results in classification tasks: mainstream classification models such as Alexnet, VGG, and GoogLeNet predict very accurately, and because the output of a deep network can be structured, a convolutional neural network can also be used for image segmentation, making dense predictions over the whole image and classifying every pixel. The fully convolutional network (FCN) performs transfer learning on a mainstream classification network by fine-tuning, then applies what was learned to the segmentation task. The main reasons a fully convolutional network can be used for image segmentation are:
(1) The network contains no fully connected structure; convolutional structure is used throughout, so the network accepts input of any size.
(2) The network introduces a deconvolution layer that restores the feature map to the size of the original image, enabling a prediction for every pixel.
(3) A skip architecture combines coarse segmentation results from deep layers with fine results from shallow layers to obtain the final refined segmentation.
The convolutional and pooling layers of a convolutional neural network accept inputs of any size. For a convolutional layer, the size of the output feature map is:

$$\text{outputsize}=\frac{\text{inputsize}-\text{kernelsize}}{\text{stride}}+1\qquad(1)$$

where inputsize is the input size, kernelsize the size of the convolution kernel, and stride the step size; different input sizes simply produce different output sizes (for example, Alexnet's first layer maps a 227-pixel side through an 11-pixel kernel at stride 4 to (227-11)/4+1 = 55). Neither convolutional nor pooling layers restrict the input size, but the input size becomes constrained once features feed a fully connected layer. When input reaches a fully connected layer, the feature map is stretched into a one-dimensional vector, and a fully connected network requires every node of one layer to connect to every node of the next; once the network structure is fixed, the number of nodes per layer is fixed, so the input and output of every layer are fixed, and computing forward layer by layer then requires the initial image size to be fixed as well. The FCN therefore uses convolutional layers in place of fully connected layers. The biggest difference between the two is that convolutional layers are locally connected and neurons of the same kernel share parameters, which makes them interconvertible: a fully connected layer viewed as a convolution corresponds to a huge sparse matrix, most of whose elements repeat, and a fully connected layer is converted into a convolutional layer by setting the kernel size equal to the input size.
Thus the fully connected layers of Alexnet can be converted into convolutional layers, with kernel configurations (4096, 1, 1) and (1000, 1, 1) respectively (the first fully connected layer becomes a convolution whose kernel matches its 6 x 6 input, following the rule above). Only convolutional layers remain in the whole network, hence the name fully convolutional network. (A sketch of this conversion follows.)
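In PyTorch terms, the conversion the text describes looks roughly like this. The sketch assumes the Alexnet dimensions given later in this section, where a 6x6x256 pool5 output feeds the first fully connected layer:

```python
import torch.nn as nn

# fully connected head: fixed 6*6*256 input -> 4096 -> 4096 -> 1000
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)

# equivalent convolutional head: accepts any spatial size >= 6x6 and
# produces a 1000-channel score map instead of a single score vector
conv_head = nn.Sequential(
    nn.Conv2d(256, 4096, kernel_size=6), nn.ReLU(),   # fc6 as a 6x6 convolution
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(),  # fc7 as a 1x1 convolution
    nn.Conv2d(4096, 1000, kernel_size=1),             # fc8 as a 1x1 convolution
)
```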
The effect of the fully connected layer on the input image size is shown in fig. 6.
The segmentation task is pixel-level prediction, and the final result is a feature map the same size as the input. The FCN upsamples the convolutional output with deconvolution layers until the feature map returns to the size of the original image, and each pixel of the output map carries a classification prediction. After multiple convolution and pooling operations the image becomes smaller and its resolution very low; the resulting feature map is called a heat map, a high-dimensional feature map of the whole image. The 1000 heat maps are upsampled, and for each pixel the maximum over the 1000 upsampled results is taken as that pixel's classification, thereby classifying the whole image.
Fig. 7 shows the original image and the calculated heat map result, and fig. 8 shows a schematic diagram of the calculated result obtained by the full convolution neural network.
Deconvolution follows the same principle as convolution but is not exactly its inverse operation. In practice, deconvolution enlarges the input by a given factor, fills the gaps with zeros, then convolves with a flipped version of the convolution kernel; it can be seen that deconvolution restores only the output's size, not its values. Deconvolution alone yields only a very coarse segmentation, so the FCN also proposes a framework that fuses segmentation results from several layers to improve the precision of the final result. After 5 rounds of convolution and pooling, the image resolution has been reduced by factors of 2, 4, 8, 16, and 32; the last layer's result can be enlarged 32x back to the original size, but that enlargement lacks detail, so the fourth layer's result is enlarged 16x and the third layer's 8x, and a weighted sum of these results yields a more accurate segmentation. Alexnet has eight layers in total: the first five are convolutional and the last three fully connected:
the first layer of convolution layer inputs 227 × 227 × 3 images, the convolution kernel is 11 × 11 × 96, namely 96 convolution kernels with the size of 11 × 11, the moving step size of the convolution kernel is 4, the distance is the distance between the centers of the receptive fields of adjacent neurons on the feature map, and the size of the output feature map is 55 × 55 × 96. The convolution calculation result passes through the activation function ReLU and then passes through the first pooling layer, and the feature map size is reduced to 27 × 27 × 96.
The first convolutional layer ends with local response normalization (LRN), which makes neurons compete: neurons with large response values are amplified and neurons with small response values are suppressed. Local response normalization improves generalization ability and thus improves the results.
In the second convolutional layer the kernel size is 5 × 5 × 256, and the size after the convolution operation is 27 × 27 × 256; after the ReLU activation function, local response normalization and the pooling layer, the result is 13 × 13 × 256.
The third convolutional layer has 384 kernels, each of size 3 × 3, and the output after ReLU is 13 × 13 × 384.
The fourth layer uses the same kernel size and number of channels as the third, and after ReLU its output is likewise 13 × 13 × 384.
The fifth layer's kernel size is 3 × 3 × 256; after the ReLU layer and the pooling layer, the output is 6 × 6 × 256. Three fully connected layers follow: the first two have 4096 neurons each and are followed by a ReLU layer and a dropout layer, the dropout layer setting the output of each node to 0 with probability 0.5. The last layer of neurons outputs a 1000-channel Softmax result indicating the probability that the image belongs to each category. The structure of Alexnet is shown in fig. 9.
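The feature-map sizes quoted above can be reproduced with the output-size formula; in the sketch below the padding values are assumptions based on the published Alexnet architecture:

```python
def output_size(i, k, s, p=0):
    """(input + 2*padding - kernel) // stride + 1"""
    return (i + 2 * p - k) // s + 1

size = 227
size = output_size(size, 11, 4)       # conv1 -> 55
size = output_size(size, 3, 2)        # pool1 -> 27
size = output_size(size, 5, 1, p=2)   # conv2 -> 27
size = output_size(size, 3, 2)        # pool2 -> 13
size = output_size(size, 3, 1, p=1)   # conv3 -> 13
size = output_size(size, 3, 1, p=1)   # conv4 -> 13
size = output_size(size, 3, 1, p=1)   # conv5 -> 13
size = output_size(size, 3, 2)        # pool5 -> 6
assert size == 6                      # matches the 6 x 6 x 256 output above
```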
An FCN network structure using Alexnet as the pre-base network is shown in fig. 10.
After the fully connected structure of Alexnet is converted into convolutional layers, the network has 8 convolutional layers, followed by a deconvolution layer, i.e. an upsampling layer, which enlarges the output result; the result is then cut with a crop layer. The final output is the score layer, which gives the probability that each pixel of the image belongs to each class. The loss layer used is SoftmaxWithLoss:
$$p_k = \frac{e^{z_k}}{\sum_j e^{z_j}} \qquad (2)$$

where $z_j$ is the score of class $j$ and $p_k$ is the Softmax probability of the true class $k$.
The final loss value is:

$$\mathrm{Loss} = -\log p_k \qquad (3)$$
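A tiny worked example of equations (2) and (3), assuming three classes with the true class at index 0:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])     # equation (2): class probabilities
loss = -math.log(p[0])           # equation (3): true class at index 0
print(round(p[0], 3), round(loss, 3))   # 0.659 0.417
```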
the results of the segmentation using the Alexnent-based network structure are shown in fig. 11.
In different environments, the Alexnet-based FCN network segments the hand accurately, with tidy segmentation edges. When the background is dark or bright, the segmentation result is not noticeably affected, nor is it affected when the hand is close to the face or moving. In addition, deep-learning-based gesture segmentation is robust to the gestures of different people: for example, in fig. 8, for the same gesture the hands of different people differ slightly and the rotation angles of the hands all differ, yet the segmentation result is unaffected. Gesture segmentation based on deep learning thus has high accuracy and robustness.
The results of the Alexnet-based FCN network after training are nevertheless not ideal. In terms of real-time performance the network is slow, taking about 630 ms per frame on the CPU and nearly 100 ms on the GPU; the per-frame segmentation is therefore poorly suited to practical use and to porting to other platforms. In terms of model size, the network model obtained from Alexnet is large, with a parameter size of roughly 217M, and such oversized model parameters limit its use. Finally, the feature extraction of the Alexnet network is mediocre, so the segmentation quality suffers: on the test set the true positive rate (TPR) is 95.2% and the false positive rate (FPR) is 3.5%, leaving room to improve accuracy. The FCN network thus has many areas that can be improved.
A convolutional neural network is usually accompanied by a large amount of computation, and as the number of layers increases the network converges more slowly during training; when the network is too deep, training may ultimately fail because the gradient vanishes, and the larger the network and the more parameters it has, the harder it is to tune and the longer training and inference take. Therefore, to improve the real-time performance of inference and reduce the complexity of the network parameters, the invention modifies the Alexnet-based FCN network structure as follows:
1. The number of network layers is reduced. Most hyper-parameters of the network remain consistent with the original network; apart from the deleted layers, the main modified hyper-parameters are the kernel size of the deconvolution layer, the sliding stride of the convolution kernel, and the offset value of the crop layer. These hyper-parameters are mainly related to the size of the input feature map.
2. The LRN layer is removed. The LRN layer was a major innovation of Alexnet: it normalizes using neighbouring pixels, an idea derived from lateral inhibition in biology, and can make the learned features more abstract and enhance generalization ability; it is mainly used after ReLU. Its formula is:
$$b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta} \qquad (4)$$

where $a^{i}_{x,y}$ is the activity of the neuron at position $(x, y)$ in kernel map $i$ and $N$ is the total number of kernel maps in the layer.
There are four hyper-parameters, k, n, α and β, whose values are chosen on the validation set. However, later networks such as VGG demonstrated that LRN actually contributes very little to the final result, and subsequent architectures gradually stopped using it; the invention therefore removes the LRN layer. A numpy sketch of equation (4) is given after this list of modifications.
3. A front-end base classification network with a stronger effect is selected. The accuracy of the segmentation network is effectively determined by the feature extraction capability of the front-end network, so the depth of the network must be increased; this conflicts with the real-time performance of the network, and a trade-off has to be made. The invention replaces the front-end network with VGG-16, the network that differs least from Alexnet: large convolution kernels are replaced by small ones, and all LRN layers are removed. The convolutional part of VGG has five sections, each performing two to three convolutions, with a max-pooling layer after each section to shrink the image.
The first section takes the original 224 × 224 × 3 image as input; after two convolutions with 3 × 3 kernels and one pooling, the output is 112 × 112 × 64.

The second section again uses 3 × 3 kernels; after two convolutions and one pooling, the output is 56 × 56 × 128.

The third, fourth and fifth sections perform the same operations: after convolutions with 3 × 3 kernels and a 1 × 1 kernel followed by pooling, the output feature maps shrink to 28 × 28 × 256, 14 × 14 × 512 and 7 × 7 × 512, respectively.
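As noted under modification 2, here is a numpy sketch of the LRN computation in equation (4); the default hyper-parameter values are those commonly reported for Alexnet and are assumptions here:

```python
import numpy as np

def lrn(a: np.ndarray, k=2.0, n=5, alpha=1e-4, beta=0.75) -> np.ndarray:
    """Local response normalization across channels, per equation (4).

    a: activations of shape (channels, height, width).
    """
    channels = a.shape[0]
    b = np.empty_like(a)
    for i in range(channels):
        lo, hi = max(0, i - n // 2), min(channels - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom   # large responses dominate their neighbours
    return b
```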
5.3 Gesture feature extraction technology
High-value features are extracted with a deep learning algorithm, the data features can be represented as vectors and the like, and gesture classification is then performed with a traditional machine learning classification algorithm. In experimental gesture pictures the gesture usually appears in a complex environment, with complex background factors, light that is too bright or too dark, or varying distance from the acquisition device. Gesture data is of uncertain size; when it is too large it needs compression, which can be done with the H.264 algorithm. The gesture data may also be preprocessed to filter noise. The experimental data is typically the small MNIST dataset and an ASL gesture dataset, or RGB gesture images captured directly against a complex indoor background.
Depth images, acquired via infrared emission and reception, make gesture segmentation easy and can improve recognition accuracy. A depth camera captures video of the hand movement, and the hand foreground is then segmented. Because the value of each pixel in a depth image corresponds to the distance from the camera to that point in the scene, segmentation is straightforward.
Hand segmentation methods include threshold segmentation, pixel clustering, and combinations of colour and depth images, among others. With enough training samples, a classifier of high accuracy can be trained.
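A minimal numpy sketch of the threshold-segmentation idea on a depth image (the near and far thresholds are illustrative assumptions):

```python
import numpy as np

def segment_hand(depth: np.ndarray, near: float, far: float) -> np.ndarray:
    """Threshold segmentation on a depth image.

    Each pixel stores the distance from the camera, so the hand, being the
    closest object within [near, far], separates cleanly from the background.
    """
    return (depth >= near) & (depth <= far)

# Example: keep everything between 0.3 m and 0.8 m in front of the camera.
mask = segment_hand(np.random.uniform(0.2, 3.0, (480, 640)), 0.3, 0.8)
```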
5.4 KCF-based tracking detection technology
Compared with detection-based gesture recognition, tracking-based gesture detection can significantly increase detection speed. Therefore, at initialization, a deep-learning-network-based detection method locates the initial hand position and initializes the prediction, after which the hand position is followed by tracking. A tracking algorithm generally trains a classifier during tracking, predicts where the target may appear in the next frame, and updates the classifier with the newly detected target. During detection, the target region is the positive sample and the non-target background the negative sample, so the position of the positive sample determines where the target may be in the next frame. The KCF process has the following characteristics:
(1) A circulant matrix is used to generate a sufficient number of training samples for the classifier. The circulant matrix produces multiple copies of the current target along its diagonals: in effect, a vector is shifted step by step to the right, each shift producing a new vector, and the shifted vectors are combined into a circulant matrix. The classifier is then trained by least squares with a penalty term added.
(2) The properties of the circulant matrix allow Fourier diagonalization of the matrix, which simplifies the problem and accelerates the solution of the linear regression; a numpy sketch after this list illustrates both properties.
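A short sketch of both properties under the numpy FFT convention (the base vector is illustrative):

```python
import numpy as np

# Property (1): cyclic shifts of one base sample x become the rows of a
# circulant matrix, giving a dense virtual training set for free.
x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)
X = np.stack([np.roll(x, i) for i in range(n)])   # each row is one shifted sample

# Property (2): the circulant matrix is diagonalized by the unitary DFT matrix,
# X = F . diag(fft(x)) . F^H, as in equation (9).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
assert np.allclose(X, F @ np.diag(np.fft.fft(x)) @ F.conj().T)
```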
For a linear regression problem $f(x_i) = \omega^T x_i$, where $(x_i, y_i)$ denotes the $i$-th sample and $\omega$ is the corresponding weight with regularization considered, the objective function is:
$$\min_{\omega} \sum_{i} \left(f(x_i) - y_i\right)^2 + \lambda \lVert \omega \rVert^2 \qquad (5)$$
the corresponding matrix expression is:
$$\min_{\omega} \lVert X\omega - y \rVert^2 + \lambda \lVert \omega \rVert^2 \qquad (6)$$
where $X = [x_1, x_2, \ldots, x_n]^T$, each row $x_i$ corresponding to a sample and $y_i$ to the label of that sample; solving for the minimum, i.e. the extremum of the function, gives:
$$\omega = (X^T X + \lambda I)^{-1} X^T y \qquad (7)$$
in complex space:
$$\omega = (X^H X + \lambda I)^{-1} X^H y \qquad (8)$$
where $X^H$ is the conjugate transpose of $X$. Because a circulant matrix can be diagonalized in Fourier space by the Fourier matrix:
$$X = F \,\mathrm{diag}(\hat{x})\, F^{H} \qquad (9)$$

where $F$ is the discrete Fourier transform matrix and $\hat{x}$ is the DFT of the generating vector $x$.
Bringing formula (9) into formula (8) yields:

$$X^H X + \lambda I = F \,\mathrm{diag}(\hat{x}^{*} \odot \hat{x} + \lambda)\, F^{H} \qquad (10)$$

$$\omega = F \,\mathrm{diag}\!\left(\frac{\hat{x}^{*}}{\hat{x}^{*} \odot \hat{x} + \lambda}\right) F^{H} y \qquad (11)$$

Using the definition of the discrete Fourier transform,

$$\hat{z} = \sqrt{n}\, F z$$

one can obtain

$$\hat{\omega} = \mathrm{diag}\!\left(\frac{\hat{x}^{*}}{\hat{x}^{*} \odot \hat{x} + \lambda}\right) \hat{y} \qquad (12)$$

$$\hat{\omega} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda} \qquad (13)$$

where $\hat{x}^{*}$ is the complex conjugate of $\hat{x}$, and $\odot$ and the fraction denote element-wise multiplication and division.
Replacing matrix operations with element-wise products of vectors greatly accelerates the computation. With a tracking algorithm, once the initial frame has located the hand and classified it accurately, the computation speed can be significantly increased, because dense detection no longer has to run on every frame, and the probability-based prediction greatly reduces the computation.
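The speed-up can be sketched in numpy: the Fourier-domain solve below reproduces the direct ridge-regression solution in O(n log n) instead of O(n^3). Note that with this row-shift and numpy FFT convention the numerator carries $\hat{x}$ rather than its conjugate; the conjugation in equation (13) depends on the shift direction and transform convention chosen:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 0.1
x = rng.normal(size=n)                            # base sample
y = rng.normal(size=n)                            # regression targets
X = np.stack([np.roll(x, i) for i in range(n)])   # circulant data matrix

# Direct ridge regression: w = (X^T X + lam*I)^(-1) X^T y
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fourier-domain solve: element-wise products replace matrix inversion.
x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
w_fft = np.fft.ifft(x_hat * y_hat / (np.conj(x_hat) * x_hat + lam)).real

assert np.allclose(w_direct, w_fft)
```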
For feature extraction and target tracking, linear separability is needed in a feature space of appropriate dimension, but in practice many problems are linearly inseparable in low dimension. Mathematically, separability can be achieved by ascending in dimension through a nonlinear mapping, i.e. mapping a low-dimensional linearly inseparable problem into a high-dimensional linearly separable one. Such nonlinear mappings, however, usually require the form and parameters of the mapping function to be determined, and since the mapping is high-dimensional, a very large number of parameters must be fixed. Using a kernel function avoids these problems while retaining the advantages of nonlinear mapping.
Let $x, z \in \xi^{n}$ and let the function $\phi: \xi^{n} \to \xi^{m}$, where $n < m$; the kernel function is then:

$$K(x, z) = \langle \phi(x), \phi(z) \rangle \qquad (14)$$
where $K(x, z)$ is the kernel function and $\langle \cdot, \cdot \rangle$ is the inner product. Clearly, by the definition of $\phi(x)$, variables in the low-dimensional space can be mapped into the high-dimensional space, and the inner product adds no new parameters; only the function $\phi$ needs to be handled.
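A concrete low-dimensional example: for the polynomial kernel $K(x, z) = \langle x, z \rangle^{2}$ on 2-D vectors, the mapping $\phi$ can be written out explicitly and equation (14) verified directly:

```python
import numpy as np

def phi(v):
    """Explicit 3-D feature mapping for the polynomial kernel (x . z)^2."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
# The kernel evaluates the high-dimensional inner product without building phi:
assert np.isclose(np.dot(x, z) ** 2, np.dot(phi(x), phi(z)))   # equation (14)
```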
Owing to the nature of the kernel function, the amount of computation can be greatly reduced without strictly determining the function $\phi$. For the function $f(x_i)$, the simplest linear form is:

$$f(x_i) = \langle w, x_i \rangle + b \qquad (15)$$
This linear mapping can improve the classification performance of the KCF algorithm, and once the overall mapping function is determined to be $\phi(x)$, the corresponding parameters can be determined. In that case $w$ is:
$$w = \sum_{i} \alpha_i\, \phi(x_i) \qquad (16)$$
Only $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots]^T$ needs to be determined. Since the algorithm samples the target object multiple times to obtain different samples, the matrix obtained by applying the kernel function to the combined samples is:
$$K_{ij} = k(x_i, x_j) \qquad (17)$$
The regression function that the algorithm needs to process can then be written as:

$$f(z) = \sum_{i} \alpha_i\, k(z, x_i) \qquad (18)$$
so the solution is

$$\alpha = (K + \lambda I)^{-1} y \qquad (19)$$
where $K$ is the kernel matrix resulting from the kernel function operation.
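A numpy sketch of equations (17) to (19) with a Gaussian kernel (the kernel choice, sample counts and $\sigma$ are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_ij = k(x_i, x_j) with a Gaussian kernel, equation (17)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))     # 20 samples drawn around the target
y = rng.normal(size=20)          # regression labels
lam = 0.1

K = gaussian_kernel_matrix(X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # equation (19)

# Regression response for a new sample z, per equation (18):
z = rng.normal(size=5)
k_z = np.exp(-np.sum((X - z) ** 2, axis=1) / 2)        # k(z, x_i), sigma = 1
f_z = k_z @ alpha
```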
The operations performed after sampling have been described above; to further reduce the algorithm's overhead, the sampling itself must also be optimized. KCF obtains a series of sub-images by cyclically shifting the image. Unlike the random sampling of other tracking algorithms, its training samples do not depend on a target centre set in the image: a series of samples is obtained simply by cyclic shifts, which reduces the time needed for sampling while guaranteeing the number of samples. From the above analysis, introducing the circulant matrix greatly reduces the complexity of sampling.
Fig. 12 shows the result of running the KCF algorithm on a video sequence, with the red box as the tracking target. At the start of the video the tracking box is not initialized and stays in the upper-left corner; when the hand enters the detection box, the box is initialized with the hand, and every subsequent frame tracks the gesture. When the hand is briefly lost, deformed, translated or zoomed, the tracking box still follows the gesture accurately without losing the target.
Numerical simulation:
The data set for the experiment comprises 16 different gestures, including gestures oriented left, right, front and back, among others. The 16 gestures are named zero, one, two, three, four, five, six, seven, eight, nine, OK, good, right, other, anti and slide.
The experimental data set was divided into 80 groups for training. For dynamic gestures, the FCN target detection algorithm is combined with the KCF target tracking algorithm. When the FCN recognizes the gesture type in a picture, it returns the coordinate box and the gesture class; this is set up in the code of the whole project, which calls the KCF target tracking algorithm directly when the corresponding dynamic gesture is recognized, and the content to be read is set directly on the algorithm platform.
For dynamic gestures, initial frames of dynamic gestures against relatively simple backgrounds are mainly used; the opencv library for python provides a corresponding KCF interface, and target tracking of dynamic gestures can be achieved by calling it directly.
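A minimal usage sketch of that interface; it assumes the opencv-contrib build (in some OpenCV 4.x versions the constructor lives under cv2.legacy instead), and the camera index and initial box, which in the full system would come from the FCN detection, are placeholders:

```python
import cv2

cap = cv2.VideoCapture(0)
tracker = cv2.TrackerKCF_create()

ok, frame = cap.read()
# In the full system this box comes from the FCN's initial hand detection;
# a hand-picked (x, y, w, h) box stands in for it here.
init_box = (200, 150, 120, 120)
tracker.init(frame, init_box)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)          # track the gesture frame by frame
    if found:
        x, y, w, h = map(int, box)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.imshow("KCF gesture tracking", frame)
    if cv2.waitKey(1) == 27:                    # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```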
As can be seen from the above figure, the training loss for dynamic gestures is much larger than that for static gestures, because dynamic gesture recognition is more complicated: static gestures are recognized directly from the camera, whereas dynamic gestures must be tracked in real time.
Under the combined action of the FCN and KCF algorithms, the recognition accuracy for static gestures exceeds 95% and the recognition rate for dynamic gestures exceeds 80%, reaching a usable level.
The above description is only for the purpose of illustrating the embodiments of the present invention, and the scope of the present invention should not be limited thereto, and any modifications, equivalents and improvements made by those skilled in the art within the technical scope of the present invention as disclosed in the present invention should be covered by the scope of the present invention.

Claims (10)

1. A gesture image recognition method based on KCF tracking detection is characterized by comprising the following steps:
step one, respectively carrying out identification, segmentation and feature extraction on dynamic gesture images;
step two, prediction initialization: detecting an initial hand position using a deep learning network-based approach;
and step three, tracking the position of the hand by adopting a KCF-based tracking algorithm.
2. The method for recognizing the gesture image based on the KCF tracking detection as claimed in claim 1, wherein in the first step, the gesture area in the image is segmented and extracted based on a fully convolutional neural network (FCN); in the FCN network structure, after the fully connected structure of Alexnet is converted into convolutional layers, the network has 8 convolutional layers, followed by a deconvolution layer, i.e. an upsampling layer, which enlarges the output result, and a crop layer is used to cut the result; the final output is a score layer representing the probability that each pixel of the image belongs to each category, and the loss layer adopted is SoftmaxWithLoss:
$$p_k = \frac{e^{z_k}}{\sum_j e^{z_j}} \qquad (1)$$

where $z_j$ is the score of class $j$ and $p_k$ is the Softmax probability of the true class $k$.
3. the method for recognizing gesture images based on KCF tracking detection as claimed in claim 1, wherein in step three, the KCF method comprises the following characteristics:
(1) a circulant matrix is used to generate a sufficient training set for the classifier; the circulant matrix produces multiple copies of the current target along its diagonals: a vector is shifted step by step to the right, each shift producing a new vector, and the shifted vectors are combined into a circulant matrix; the classifier is trained by least squares with a penalty term added;
(2) the property of the cyclic matrix is utilized to carry out Fourier diagonalization on the matrix, the problem solving is simplified, and the solving of the linear regression problem is accelerated.
4. The method for recognizing gesture images based on KCF tracking detection as claimed in claim 3, wherein for a linear regression problem $f(x_i) = \omega^T x_i$, where $(x_i, y_i)$ denotes the $i$-th sample and $\omega$ is the corresponding weight with regularization considered, the objective function is:
$$\min_{\omega} \sum_{i} \left(f(x_i) - y_i\right)^2 + \lambda \lVert \omega \rVert^2 \qquad (2)$$
the corresponding matrix expression is:
$$\min_{\omega} \lVert X\omega - y \rVert^2 + \lambda \lVert \omega \rVert^2 \qquad (3)$$
where $X = [x_1, x_2, \ldots, x_n]^T$, each row $x_i$ corresponding to a sample and $y_i$ to the label of that sample; solving for the minimum, i.e. the extremum of the function, gives:
$$\omega = (X^T X + \lambda I)^{-1} X^T y \qquad (4)$$
in complex space:
$$\omega = (X^H X + \lambda I)^{-1} X^H y \qquad (5)$$
where $X^H$ is the conjugate transpose matrix; because a circulant matrix can be diagonalized in Fourier space by the Fourier matrix:
$$X = F \,\mathrm{diag}(\hat{x})\, F^{H} \qquad (6)$$
bringing formula (6) into formula (5) yields:

$$X^H X + \lambda I = F \,\mathrm{diag}(\hat{x}^{*} \odot \hat{x} + \lambda)\, F^{H} \qquad (7)$$

$$\omega = F \,\mathrm{diag}\!\left(\frac{\hat{x}^{*}}{\hat{x}^{*} \odot \hat{x} + \lambda}\right) F^{H} y \qquad (8)$$

using the definition of the discrete Fourier transform:

$$\hat{z} = \sqrt{n}\, F z \qquad (9)$$

one can obtain

$$\hat{\omega} = \mathrm{diag}\!\left(\frac{\hat{x}^{*}}{\hat{x}^{*} \odot \hat{x} + \lambda}\right) \hat{y} \qquad (10)$$

$$\hat{\omega} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda} \qquad (11)$$
5. The method for recognizing gesture images based on KCF tracking detection as claimed in claim 1, wherein in step three, the tracking algorithm trains a classifier during tracking, judges the possible position of the target in the next frame, and updates the classifier with the detected new target; in the detection process, the target area is a positive sample, and the non-target background is a negative sample, so that the position of the positive sample determines the position where the target may exist in the next frame.
6. A gesture image recognition system based on KCF tracking detection applying the gesture image recognition method based on KCF tracking detection as claimed in any one of claims 1-5, characterized in that the gesture image recognition system based on KCF tracking detection comprises:
the image preprocessing module is used for respectively identifying and segmenting the dynamic gesture image and extracting the characteristics of the dynamic gesture image;
a prediction initialization module for detecting a position of an initial hand using a deep learning network based method;
and the position tracking module is used for tracking the position of the hand by adopting a KCF-based tracking algorithm.
7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:
respectively carrying out identification, segmentation and feature extraction on the dynamic gesture image; detecting the position of an initial hand by using a method based on a deep learning network, and initializing the prediction; and tracking the position of the hand by adopting a KCF-based tracking algorithm.
8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
respectively carrying out identification, segmentation and feature extraction on the dynamic gesture image; detecting the position of an initial hand by using a method based on a deep learning network, and initializing the prediction; and tracking the position of the hand by adopting a KCF-based tracking algorithm.
9. An information data processing terminal characterized by being used for implementing the gesture image recognition system based on KCF tracking detection as claimed in claim 6.
10. Use of a gesture image recognition system based on KCF tracking detection according to claim 6 in information device and system control.