CN108717524B - Gesture recognition system based on dual-camera mobile phone and artificial intelligence system - Google Patents

Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Info

Publication number
CN108717524B
CN108717524B (application CN201810402470.9A)
Authority
CN
China
Prior art keywords
image
gesture
depth
neural network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810402470.9A
Other languages
Chinese (zh)
Other versions
CN108717524A (en)
Inventor
邓琨 (Deng Kun)
孟昭鹏 (Meng Zhaopeng)
郑岩 (Zheng Yan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201810402470.9A
Publication of CN108717524A
Application granted
Publication of CN108717524B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/2163 Partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a gesture recognition system based on a dual-camera mobile phone and an artificial intelligence system, which recognizes human gestures using the dual-camera phone and machine learning. An image acquisition module acquires and stores two different original images produced by the two cameras' differing viewing angles, namely the color images of the left and right cameras together with an image containing depth information. An image preprocessing module crops the gesture area from the original image and obtains a depth image of the gesture area. A neural network training module trains on the acquired depth images with a deep neural network to obtain a neural network system with a recognition accuracy above 92%. A gesture detection and recognition module returns a gesture recognition result for the input gesture image to be recognized. Compared with the prior art, the system adds depth information, which carries more precise gesture information, so the recognition accuracy is higher.

Description

Gesture recognition system based on dual-camera mobile phone and artificial intelligence system
Technical Field
The invention relates to computer image processing and artificial intelligence, and in particular to a system and method for gesture recognition that acquires 3D images through binocular stereo vision.
Background
Human-computer interaction refers to the way humans and machines converse. From the earliest keyboard and mouse to today's cameras and various sensors, it has undergone great innovation and development. With the continuous development of VR technology, recognition of motion-based interaction has become a focus of new development. How to capture a user's gestures and recognize and interpret them is a complex problem.
With the continuous development of mobile phone software and hardware, dual cameras are becoming standard on mainstream phones. A phone with dual cameras offers better telephoto performance, and the cooperation of the two lenses provides camera-like background blurring, which works well for portrait photography. Moreover, using the binocular stereo vision of the two cameras, images and video with a 3D effect can be produced and depth image data of the scene can be obtained, allowing the 3D data to be applied to other specific scenarios.
The field of machine learning has improved and evolved continuously since 2006. In image processing, convolutional neural networks have achieved enormous practical success. A supervised deep learning model, the CNN (convolutional neural network), reduces the number of parameters through spatial computation schemes such as weight sharing and downsampling; fewer parameters mean fewer local minima, so training is more likely to find a good local optimum, improving the recognition rate to good effect.
Disclosure of Invention
Based on the prior art, the invention provides a gesture recognition system using a dual-camera mobile phone and an artificial intelligence system, serving as a novel means of human-computer interaction.
The gesture recognition system disclosed by the invention, based on a dual-camera mobile phone and an artificial intelligence system, recognizes human gestures using the dual-camera phone and machine learning, and comprises an image acquisition module, an image preprocessing module, a neural network training module and a gesture detection and recognition module; wherein:
the image acquisition module 100 is configured to acquire and store two different original images generated by the cameras' differing viewing angles, comprising the color images of the left and right cameras and an image containing depth information;
the image preprocessing module 200 is configured to crop the gesture area from an original image and obtain a depth image of the gesture area;
the neural network training module 300 is configured to train on the acquired depth images with a deep neural network to obtain a neural network system;
the gesture detection and recognition module 400 is configured to return a gesture recognition result for the input gesture image to be recognized;
the image acquisition module 100 simultaneously acquires the JPG image data of the two cameras. The JPG image comprises 3 parts: the color image shot by the left camera, the color image shot by the right camera, and a depth image obtained by preprocessing. JPG image segmentation is then performed according to the JPG file format specification: taking 0xFFD8 as the jpg file header and 0xFFDA as the start of an SOA format segment, the storage segments corresponding to the left and right camera images are extracted and stored separately; the depth image segment begins with 0x0065646F6600, whose ASCII characters spell "edof" and serve as the depth-segment flag, and is likewise extracted and stored separately;
the image preprocessing module 200 obtains the image with depth information from the original image and crops the gesture area in the depth image by a threshold segmentation method; the corresponding gesture area is cropped from the color image as the preliminary gesture segmentation result. The color image is converted from RGB space to HSV space, the image's color information is clustered with the kmeans machine learning clustering method, and the HSV-space image data are clustered into 3 classes: a white-background class, a gesture-area class and an 'other areas' class. After the classified gesture-area pixels are obtained, their pixel mean and variance are computed, and the corresponding precise gesture area in the color image is cropped by threshold segmentation according to that mean and variance. The precise color-image gesture area is then used to cut out the depth-image gesture area, yielding the final depth gesture image. Finally the depth gesture images are transformed and expanded, enlarging the training data set to roughly 30,000 depth images;
the neural network training module 300 trains a neural network on the gesture-area depth maps produced by the image preprocessing module. The network has 4 layers: the first layer is a convolutional neural network layer with 16 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, turning an input gray-scale map of size 72 × 96 into 16 feature maps of 36 × 48; the second layer is a convolutional neural network layer with 32 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, turning the 16 input feature maps of size 36 × 48 into 32 feature maps of 18 × 24; the third layer is a fully connected layer connecting the 32 output feature maps of 18 × 24 to 512 output neurons; the fourth layer is a softmax layer mapping the 512 input neurons to 9 output neurons representing the digits 1-9, the largest output being taken as the recognition result;
the gesture detection and recognition module 400 takes the gesture image, obtains a gesture depth map through the image preprocessing module, and inputs it into the neural network to obtain the prediction result.
Compared with the traditional technique of recognizing images from color images alone, the invention starts from the depth image and, exploiting the particular nature of gesture images, uses depth information for gesture recognition. Depth information carries more precise gesture information and therefore yields higher recognition accuracy.
Drawings
FIG. 1 is a functional block diagram of a gesture recognition system based on a dual-camera phone and an artificial intelligence system according to the present invention;
FIG. 2 is a schematic flow diagram of an image acquisition module;
FIG. 3 is a schematic flow diagram of an image pre-processing module;
FIG. 4 is a schematic diagram of the preliminary gesture segmentation of a depth image: (4-1) is the original depth image, (4-2) is the gesture-area depth image cropped by threshold segmentation, and (4-3) is the depth image after gray-scale stretching;
FIG. 5 is a diagram illustrating the effects of the embodiment: (5-1) is the original left-camera color image, (5-2) is the color image after preliminary gesture-area segmentation, (5-3) is the cropped and scaled color image, (5-4) is the blurred color image, (5-5) is the color image after precise gesture-area segmentation, (5-6) is the depth image after precise gesture-area segmentation, and (5-7) is the depth image after gray-scale inversion;
FIG. 6 is a schematic overall flow chart of a gesture recognition method based on a dual-camera mobile phone and an artificial intelligence system according to the present invention;
FIG. 7 is a diagram of a neural network model for depth images according to the present invention;
FIG. 8 is a schematic diagram of an original captured image, (8-1) is the original image, (8-2) is the left camera image, (8-3) is the right camera image, and (8-4) is the depth image;
FIG. 9 is a schematic view of depth images of 1-9 gestures;
FIG. 10 is a diagram illustrating the expansion effects for a number-1 gesture: (10-1) is the original image, (10-2) to (10-4) are scaling effects, (10-5) is a boundary-enhancement effect, (10-6) and (10-7) are maximum- and minimum-filter effects, (10-8) to (10-10) are rotation effects, (10-11) is a sharpening effect, and (10-12) is a smoothing effect.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a functional block diagram of the gesture recognition system based on a dual-camera phone and an artificial intelligence system according to the present invention. The system includes an image acquisition module 100, an image preprocessing module 200, a neural network training module 300, and a gesture detection and recognition module 400.
The image acquisition module exploits the fact that when the two rear cameras of a dual-camera phone shoot simultaneously, the positional difference between the cameras produces different viewing angles, so two slightly different images are acquired along with a scene depth-information image that the camera generates using the binocular stereo vision principle. The image preprocessing module 200 crops the preliminarily segmented gesture area from the depth image using a threshold method, segments the same gesture area in the corresponding color image, partitions the image regions with a clustering method to obtain a precise gesture image area, and removes the other image regions from the depth image accordingly, yielding a more precise depth image of the gesture area. The neural network training module 300 works from about 2,500 depth images collected in advance and manually labelled with the correct result; the sample images are expanded by image filtering operations such as flipping, blurring, sharpening, line smoothing and boundary enhancement, enlarging the training set to about 30,000 images, which are then used to train a deep neural network, producing a neural network system with a recognition accuracy above 92%. The gesture detection and recognition module 400 uses the built 4-layer neural network system: the gesture image to be recognized is processed by the image preprocessing module, fed into the neural network, and the gesture recognition result is returned. The module can be integrated into mobile, PC and web apps. In a practical gesture recognition application, the user takes a photo with the camera; the program acquires and preprocesses the photo, inputs it into the neural network, and feeds the network's prediction back to the user as the recognition result.
Fig. 2 shows a flow chart of the image acquisition module 100 according to the present invention. The image acquisition process comprises the following steps:
Step 1001, camera acquisition: the JPG image data shot by the two cameras are obtained simultaneously through the phone's large-aperture shooting interface. The JPG image contains 3 parts: the color images taken separately by the two cameras, and the depth image information obtained by the camera program's preprocessing. In the depth image, darker pixels indicate scene points closer to the camera and lighter pixels points farther away, as shown in FIG. 8: (8-1) is the original jpg image, (8-2) the left-camera color image, (8-3) the right-camera color image, and (8-4) the depth image;
Step 1002, JPG image segmentation. According to the JPG file format specification, 0xFFD8 in the image file identifies the JPG file header, 0xFFDA identifies the start of an SOA format segment, and 0xFFD9 or 0xFFD8 marks a segment's end. The two images shot by the left and right cameras are stored in two SOA format segments of the jpg file. Since precise color-image gesture segmentation requires the image of only one camera, either SOA segment may be extracted from the jpg source file. Similarly, the depth image format segment begins with 0x0065646F6600, whose ASCII characters spell "edof", and ends at 0xFFD9 or 0xFFD8; the edof storage segment is extracted from the original jpg file and saved as a new jpg file;
Step 1003, export the depth image, completing image acquisition. The export procedure is as follows (a code sketch follows the listing):
Image acquisition process:
(1) Color image acquisition
1. Check the jpg header 0xFFD8 identifier
2. Search for the first 0xFFDA identifier
3. Obtain the SOA format segment length
4. Cut out the SOA format segment and write it to the file x_n_rgb.jpg, where x is the hand-labelled gesture recognition result, i.e. a digit 1-9, and n is the picture serial number.
(2) Depth image acquisition
1. Check the jpg header 0xFFD8 identifier
2. Search for the 0x0065646F6600 (ASCII "edof") identifier
3. Obtain the edof format segment length
4. Cut out the edof format segment and write it to the file x_n_dep.jpg, where x is the hand-labelled gesture recognition result, i.e. a digit 1-9, and n is the picture serial number.
5. Flip the image so that the orientation of the depth image matches that of the color image.
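A minimal Python sketch of this extraction, under the assumptions stated above (each segment runs from its start marker to the next 0xFFD9 or 0xFFD8 tail); the function names and file handling are illustrative, not taken from the patent:

    def extract_segment(data: bytes, start_marker: bytes) -> bytes:
        """Return the bytes from start_marker up to the next 0xFFD9/0xFFD8 tail."""
        assert data[:2] == b"\xff\xd8", "missing jpg header 0xFFD8"
        start = data.find(start_marker)
        if start < 0:
            raise ValueError("marker not found")
        tails = [data.find(t, start + len(start_marker))
                 for t in (b"\xff\xd9", b"\xff\xd8")]
        tails = [t for t in tails if t >= 0]
        end = min(tails) + 2 if tails else len(data)
        return data[start:end]

    def split_dual_camera_jpg(path: str, x: int, n: int) -> None:
        """x: hand-labelled gesture digit (1-9); n: picture serial number."""
        data = open(path, "rb").read()
        color = extract_segment(data, b"\xff\xda")        # first SOA segment
        depth = extract_segment(data, b"\x00edof\x00")    # 0x0065646F6600 segment
        open(f"{x}_{n}_rgb.jpg", "wb").write(color)
        open(f"{x}_{n}_dep.jpg", "wb").write(depth)
        # The exported depth image must still be flipped (step 5 above) so
        # its orientation matches the color image.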
Fig. 3 is a flow chart of the image preprocessing module 200 according to the present invention. The image preprocessing flow comprises several key steps: preliminary segmentation of the depth-image gesture area by a threshold method, preliminary segmentation of the color-image gesture area, acquisition of hand-area pixel characteristics with kmeans, precise segmentation of the color-image gesture area by a threshold method, precise segmentation of the depth-image gesture area, and image transformation and expansion, as follows:
Step 2001, preliminary segmentation of the depth-image gesture area by a threshold method. The depth image information of the gesture is first obtained through the image acquisition module, and a gesture-area image is preliminarily cropped from the depth image by threshold segmentation:
Preliminary depth-image gesture segmentation:
1. Acquire the image's gray-value histogram statistics, i.e. count the number of pixels at each pixel value, for use in the subsequent steps.
2. Find the maximum gray value in the histogram whose pixel count also exceeds 0.1% of the picture's total pixels. The larger the gray value, the closer the object was to the camera when shot, and given the nature of gesture images this value can be taken as the part of the hand closest to the camera. Using 0.1% as the threshold effectively prevents isolated noise pixels from being mistaken for valid objects; the value was obtained through repeated tests, filters well, and eliminates individual noise pixels.
3. Given the nature of gesture images, 30 is taken as the threshold for gesture-area segmentation: after the maximum gray value is found, pixels whose gray value differs from it by more than 30 are filtered out. Experiments show that in close-range shots the pixel gray value changes by about 30 for every 20 cm of distance from the camera; since the depth of a hand gesture does not exceed 20 cm, the point closest to the camera is taken as the start of the hand region and everything more than 20 cm behind it is filtered out as background.
4. Apply gray-level stretching to the filtered image to enlarge the contrast of the depth values: the retained 30-value pixel range is stretched to the 0-255 range, increasing the differences in depth information for easier comparison.
5. Store the image temporarily as the result of preliminary depth-image gesture segmentation, as shown in FIG. 4.
The original depth image is shown in (4-1), the threshold-segmented depth image in (4-2), and the result of gray stretching in (4-3): after preliminary gesture segmentation the effective gesture area has been extracted and the depth-information contrast is pronounced.
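By way of illustration, steps 1 to 4 might be implemented as the following numpy sketch; it assumes an 8-bit depth map in which, as stated above, larger gray values mean closer, and all names are illustrative:

    import numpy as np

    def preliminary_gesture_segmentation(depth: np.ndarray) -> np.ndarray:
        """depth: 2-D uint8 depth map; returns the stretched gesture-area image."""
        hist = np.bincount(depth.ravel(), minlength=256)   # gray-value histogram
        valid = np.nonzero(hist > 0.001 * depth.size)[0]   # drop <0.1% noise bins
        g_max = int(valid.max())                           # closest hand point
        low = g_max - 30                                   # ~20 cm depth window
        clipped = np.clip(depth, low, g_max).astype(np.int32)
        stretched = (clipped - low) * (255.0 / 30.0)       # stretch 30 levels to 0-255
        return stretched.astype(np.uint8)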
Step 2002, preliminary segmentation of the color-image gesture area. Because of background interference or camera shake, the gesture-area depth map obtained by threshold segmentation may contain considerable noise, such as the arm, other parts of the body, or unrelated background pixels. The acquired color image is therefore used for further denoising; since the difference between the left and right camera images matters little for gesture segmentation, the left camera's color image is used by default. The preliminary gesture segmentation result from step 2001 is used to cut the corresponding gesture area out of the color image. The original color image is shown in FIG. 5-1 and the color image after preliminary gesture segmentation in FIG. 5-2.
Step 2003, kmeans acquisition of hand-area pixel characteristics. After preliminary gesture segmentation, the remaining irrelevant areas must be filtered out: as FIG. 5-2 shows, sleeves, shadows and other noise regions are still visible in the cropped gesture-area image. The hand region can therefore be segmented more precisely from the color image's color information, using the difference between skin color and background. The invention processes color information in the HSV color space, which locates colors by H (hue), S (saturation) and V (value, i.e. brightness). Hue ranges over 0-360 degrees and denotes the color category: red is 0 degrees, green 120 degrees, blue 240 degrees. Saturation ranges over 0-100% and denotes vividness: gray has 0% saturation, a pure color 100%. Value ranges over 0-100% and denotes brightness: 0% is black, 100% is white, and values in between express each color's brightness. Compared with RGB, HSV expresses brightness, hue and vividness intuitively and makes color comparison convenient, so different color regions differ more strongly and are easier to distinguish. The image's color information is clustered with the kmeans machine learning clustering method to obtain the average h, s and v values of the gesture area; the HSV-space image data are clustered into 3 classes, ideally a white-background class, a gesture-area class and an 'other areas' class. The detailed implementation steps are:
Kmeans acquisition of hand-region pixel characteristics:
1. Crop and scale the original color picture to reduce its pixel count and hence the computation. In practice gestures are generally concentrated near the center of the picture, with only slight deviations in a few cases. A strip 10 pixels wide is therefore cut from each of the top, bottom, left and right edges, and the length and width are then reduced proportionally by a factor of 40, scaling the image to 72 × 96 so that its total size drops to 6,912 pixels, greatly reducing the computation for pixel clustering and the neural network.
2. Blur the picture twice so that color transitions between regions become smoother, effectively reducing the influence of isolated bright or dark spots; the result is shown in FIG. 5-3.
3. Convert the pixel values from RGB to HSV using the standard RGB-to-HSV conversion, then run kmeans cluster analysis on the processed pixels. Kmeans is an unsupervised clustering machine learning algorithm whose main objective is to group similar samples automatically into one class. This step clusters the HSV-space image data into 3 classes, ideally yielding a white-background class, a gesture-area class and an 'other areas' class. By default the kmeans algorithm starts from 3 random initial centroids; for segmenting the preliminarily cropped color gesture image, the invention instead uses 3 customised initial centroids: white (hsv: 0°, 0%, 100%), black (hsv: 0°, 0%, 0%) and an average yellow skin tone (hsv: 60°, 90%, 60%). With these customised initial centroids, the centroids of the 3 final pixel classes stay closer to the initial ones and the clustering result is more accurate. The kmeans analysis divides the picture's pixels into 3 classes, and the class initialised with the average skin tone ends up containing the hand-region pixels; its centroid value represents the color characteristics of the hand region.
Step 2004, precise segmentation of the color-image gesture area by a threshold method. Step 2003 yields the color characteristics of the hand region, i.e. the centroid of the skin pixel class. With this centroid, threshold segmentation keeps as hand-region pixels those whose hsv values lie within 15°, 10% and 10% of the centroid and sets all other pixels to white; this threshold, obtained through repeated tests, segments well. The segmentation result is shown in FIG. 5-4. The picture is then passed through a minimum filter to remove isolated noise points, giving the final segmentation result shown in FIG. 5-5, where the white area is background and the central area is the gesture.
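Steps 2003-2004 might be sketched as follows in Python with OpenCV and scikit-learn (the libraries are an assumption; the patent does not name them). OpenCV's HSV scale (H in 0-179, S and V in 0-255) is assumed, so the patent's degree/percent centroids and the 15°/10%/10% tolerance are rescaled, and hue wraparound is ignored for simplicity:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def precise_color_segmentation(color_bgr: np.ndarray) -> np.ndarray:
        """color_bgr: the cropped and scaled 72 x 96 color image from steps 1-2."""
        img = cv2.GaussianBlur(color_bgr, (5, 5), 0)
        img = cv2.GaussianBlur(img, (5, 5), 0)              # blur twice (step 2)
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        pixels = hsv.reshape(-1, 3).astype(np.float32)
        # Customised initial centroids: white, black, average yellow skin tone
        # (hsv 60 deg / 90% / 60%), rescaled to OpenCV's 0-179 / 0-255 ranges.
        init = np.array([[0, 0, 255], [0, 0, 0], [30, 230, 153]], np.float32)
        km = KMeans(n_clusters=3, init=init, n_init=1).fit(pixels)
        skin = km.cluster_centers_[2]                       # skin-class centroid
        tol = np.array([15 / 2.0, 25.5, 25.5])              # 15 deg, 10%, 10%
        mask = np.all(np.abs(pixels - skin) <= tol, axis=1).reshape(hsv.shape[:2])
        out = np.full_like(color_bgr, 255)                  # white background
        out[mask] = color_bgr[mask]
        return cv2.erode(out, np.ones((3, 3), np.uint8))    # approximates the minimum filter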
Step 2005, precise segmentation of the depth-image gesture area. Using the precise color-image gesture segmentation from step 2004, the corresponding area is cut from the depth image and the rest is set to a white background, producing the precise depth-image gesture segmentation shown in FIG. 5-6. The picture's gray values are then inverted for the convenience of the neural network computation; the result is shown in FIG. 5-7.
Step 2006, transformation and expansion of the final depth gesture image. Because the number of acquired images is limited, the neural network training set would otherwise be too small and prone to large recognition errors, so the image count is increased and the training set enlarged by scaling, filtering, flipping and similar processing. FIG. 10 shows a depth gesture image and its expansions: (10-1) is the original depth gesture image, (10-2) a 105% scaled image, (10-3) a 110% scaled image, (10-4) a 115% scaled image, (10-5) a boundary-enhancement-filtered image, (10-6) a maximum-filtered image, (10-7) a minimum-filtered image, (10-8) an image rotated 90 degrees counterclockwise, (10-9) rotated 180 degrees, (10-10) rotated 270 degrees, (10-11) a sharpen-filtered image, and (10-12) a smooth-filtered image.
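A Pillow-based sketch of this expansion, producing the twelve variants of FIG. 10 (the library and exact filter choices are assumptions approximating the described effects):

    from PIL import Image, ImageFilter

    def expand(img: Image.Image) -> list:
        """Return the 12 variants of FIG. 10 for one depth gesture image."""
        w, h = img.size
        variants = [img]                                        # (10-1) original
        for s in (1.05, 1.10, 1.15):                            # (10-2)-(10-4) scaling
            big = img.resize((int(w * s), int(h * s)))
            left, top = (big.width - w) // 2, (big.height - h) // 2
            variants.append(big.crop((left, top, left + w, top + h)))
        variants.append(img.filter(ImageFilter.EDGE_ENHANCE))   # (10-5) boundary enhancement
        variants.append(img.filter(ImageFilter.MaxFilter(3)))   # (10-6) maximum filter
        variants.append(img.filter(ImageFilter.MinFilter(3)))   # (10-7) minimum filter
        for angle in (90, 180, 270):                            # (10-8)-(10-10) CCW rotations
            variants.append(img.rotate(angle))
        variants.append(img.filter(ImageFilter.SHARPEN))        # (10-11) sharpening
        variants.append(img.filter(ImageFilter.SMOOTH))         # (10-12) smoothing
        return variants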
The gesture-area depth maps produced by this preprocessing flow serve as the input of the neural network training module for training the network. As shown in FIG. 9, the invention collected the 9 gestures representing the digits 1-9 from 10 testers, about 2,500 images in total; after preprocessing and image expansion, 30,000 gesture depth images were obtained as the training set for the neural network system.
FIG. 7 shows the neural network structure designed for the present invention. The neural network system uses a 4-layer network structure comprising 2 convolutional layers with max pooling (maxpooling), 1 fully connected layer and 1 softmax output layer:
the first layer is a convolutional neural network layer, with 16 convolution kernels of 5 x 5, and 1 maximum subsampling operation of 2 x 2. Each input is a gray value matrix with the size being 1 and 72 × 96, the stride parameter of convolution is 1, and the padding parameter selection is the same as that of the original image; the pooling operation took 2 x 2 max power and the output was 16 matrices of 36 x 48.
The second layer is a convolutional neural network layer, consisting of 32 convolution kernels of 5 x 5, and 1 maximum subsampled convolution kernel of 2 x 2. The input is the output of the upper network structure, 16 matrixes of 36 x 48 are input, the stride parameter of convolution is 1, and the padding parameter selection is the same as the original image; the pooling operation took 2 x 2 max power and the output was 32 matrices of 18 x 24.
The third layer is a full connection layer, and the 32 output matrixes of 18 × 24 are fully connected to 512 output neurons;
and the fourth layer is a softmax layer, 512 input neurons are output to 9 output neurons, 9 numbers representing 1-9 are output, and the item with the largest output is taken as the recognition result.
Training uses a cross-entropy evaluation model: cross-entropy measures how poorly the predicted distribution matches the true labels and thus scores the prediction. The network is trained with the adaptive moment estimation method (Adam), a first-order optimization algorithm that can replace traditional stochastic gradient descent and iteratively updates the neural network weights from the training data. The main implementation code of the neural network is given in the following figures:
(The implementation code is reproduced in the original patent only as image figures GDA0003320810430000101, GDA0003320810430000111 and GDA0003320810430000121, not transcribed here.)
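Since the code figures are not transcribed, the following is a minimal tf.keras sketch of the four-layer network as the text describes it. The layer sizes, the cross-entropy loss and the Adam optimiser come from the patent; the ReLU activations and the Keras API are assumptions (the original figures appear to use the TensorFlow 1.x API):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu",
                               input_shape=(72, 96, 1)),     # 72 x 96 gray input
        tf.keras.layers.MaxPooling2D(2),                     # -> 36 x 48 x 16
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),                     # -> 18 x 24 x 32
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),       # fully connected layer
        tf.keras.layers.Dense(9, activation="softmax"),      # digits 1-9
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",    # cross-entropy model
                  metrics=["accuracy"])
    # model.fit(train_x, train_y, validation_data=(test_x, test_y), ...)
    # with labels 0-8 standing for the digits 1-9.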
After the code runs for 2,000 training steps, the recognition accuracy of the neural network model reaches 92.53%. The training output follows, where step is the training step, train_accuracy the training-set recognition accuracy, and test accuracy the test-set recognition accuracy:
step 0,train_accuracy 0.26
test accuracy 0.113806
step 100,train_accuracy 0.42
test accuracy 0.402985
step 200,train_accuracy 0.42
test accuracy 0.518657
step 300,train_accuracy 0.82
test accuracy 0.636194
step 400,train_accuracy 0.68
test accuracy 0.695895
step 500,train_accuracy 0.76
test accuracy 0.729478
step 600,train_accuracy 0.76
test accuracy 0.744403
step 700,train_accuracy 0.82
test accuracy 0.785448
step 800,train_accuracy 0.72
test accuracy 0.800373
step 900,train_accuracy 0.9
test accuracy 0.820895
step 1000,train_accuracy 0.96
test accuracy 0.845149
step 1100,train_accuracy 0.96
test accuracy 0.867537
step 1200,train_accuracy 0.98
test accuracy 0.873134
step 1300,train_accuracy 0.94
test accuracy 0.876866
step 1400,train_accuracy 0.98
test accuracy 0.882463
step 1500,train_accuracy 0.94
test accuracy 0.882463
step 1600,train_accuracy 0.96
test accuracy 0.897388
step 1700,train_accuracy 0.96
test accuracy 0.902985
step 1800,train_accuracy 0.96
test accuracy 0.893657
step 1900,train_accuracy 0.96
test accuracy 0.916045
step 2000,train_accuracy 0.98
test accuracy 0.925373
the gesture detection and recognition module 400 inputs the gesture image after being processed by the preprocessing module into the trained neural network, and returns a gesture recognition result. The module can be combined with practical application, for example, the module is integrated in a camera, a user takes a picture by using the camera, and a program acquires and preprocesses a taken picture, inputs the result into a neural network, obtains a neural network prediction result and feeds the result back to a user recognition result.
At present, applications of the dual-camera phone in the 3D field are very rare; the invention creatively applies the dual-camera phone to fields beyond photography. The scheme currently applies only to phones on the Android platform, and since Android does not yet expose an interface for acquiring dual-camera data, the data of a single camera cannot be obtained independently and the development interfaces of individual phone manufacturers must be used. The invention implements the dual-camera data acquisition scheme through the open interface of the Huawei phone series.
Because the gesture image area differs markedly from the background in the depth image, i.e. the distances from the camera of the gesture and of other parts differ clearly, the adopted threshold segmentation method achieves a good gesture-area cropping effect.
The digits 1 to 9 are predefined to represent 9 gestures, and about 2,500 sample images were acquired from 10 different testers in different background environments;
then, in the image preprocessing stage, the gesture area in each image is extracted along with its depth image, and about 30,000 gesture depth images are obtained through image expansion as the training set of the neural network system;
a deep convolutional neural network is implemented with the open-source neural network framework TensorFlow and trained on the sample set, yielding a neural network system with recognition accuracy above 92%;
in the recognition application stage, the deep neural network trained in the previous stage is applied in practice: after the user shoots gesture images with the two cameras and inputs them into the system, the system performs the same preprocessing on the images and then feeds them into the neural network system for recognition, obtaining the recognition result.

Claims (1)

1. A gesture recognition system based on a dual-camera mobile phone and an artificial intelligence system, which realizes recognition of human gestures by using the dual-camera mobile phone and machine learning, characterized by comprising an image acquisition module (100), an image preprocessing module (200), a neural network training module (300) and a gesture detection and recognition module (400); wherein:
the image acquisition module is used for acquiring and storing two different original images generated by the cameras' differing viewing angles, including the color images of the left and right cameras and an image containing depth information;
the image preprocessing module is used for cropping the gesture area from an original image and acquiring a depth image of the gesture area;
the neural network training module is used for training on the acquired depth images with a deep neural network to obtain a neural network system;
the gesture detection and recognition module is used for returning a gesture recognition result for the input gesture image to be recognized;
acquiring JPG image data of the two cameras by using the image acquisition module (100); the JPG image comprises 3 parts, namely the color image shot by the left camera, the color image shot by the right camera and a depth image obtained by preprocessing; JPG image segmentation processing is then carried out according to the JPG file format specification: taking 0xFFD8 as the jpg file header and 0xFFDA as the start of an SOA format segment, the storage segments corresponding to the left and right camera images are extracted and stored separately; the depth image segment, which starts with 0x0065646F6600, is then extracted and stored; the ASCII characters of this hexadecimal string spell "edof" and serve as the depth-segment flag;
acquiring an image with depth information from the original image by using the image preprocessing module (200), and cropping the gesture area in the depth image by a threshold segmentation method; cropping the corresponding gesture area from the color image as the preliminary gesture segmentation result; converting the color image from RGB space to HSV space, clustering the image's color information by the kmeans machine learning clustering method, and clustering the HSV-space image data into 3 classes, namely a white-background class, a gesture-area class and an 'other areas' class; after the classified gesture-area pixels are obtained, computing their pixel mean and variance, and cropping the corresponding precise gesture area in the color image by threshold segmentation according to the mean and variance; cutting out the depth-image gesture area by means of the precise color-image gesture area to obtain the final depth gesture image; and transforming and expanding the final depth gesture images to enlarge the training data set to roughly 30,000 depth images;
performing neural network training on the gesture-area depth maps obtained by the image preprocessing module by using the neural network training module (300), wherein the neural network is composed of 4 layers: the first layer is a convolutional neural network layer comprising 16 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, which turns an input gray-scale map of size 72 × 96 into 16 feature maps of 36 × 48; the second layer is a convolutional neural network layer comprising 32 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, which turns the 16 input feature maps of size 36 × 48 into 32 feature maps of 18 × 24; the third layer is a fully connected layer connecting the 32 output feature maps of 18 × 24 to 512 output neurons; the fourth layer is a softmax layer mapping the 512 input neurons to 9 output neurons representing the digits 1-9, the largest output being taken as the recognition result;
and by using the gesture detection and recognition module (400), the gesture depth map obtained through the image preprocessing module is input into the neural network to obtain the prediction result.
CN201810402470.9A 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system Expired - Fee Related CN108717524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810402470.9A CN108717524B (en) 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810402470.9A CN108717524B (en) 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Publications (2)

Publication Number Publication Date
CN108717524A CN108717524A (en) 2018-10-30
CN108717524B (en) 2022-05-06

Family

ID=63899399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810402470.9A Expired - Fee Related CN108717524B (en) 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Country Status (1)

Country Link
CN (1) CN108717524B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767488A (en) * 2019-01-23 2019-05-17 广东康云科技有限公司 Three-dimensional modeling method and system based on artificial intelligence
CN109948483B (en) * 2019-03-07 2022-03-15 武汉大学 Character interaction relation recognition method based on actions and facial expressions
CN110322546A (en) * 2019-05-14 2019-10-11 广东康云科技有限公司 Substation's three-dimensional digital modeling method, system, device and storage medium
CN110322545A (en) * 2019-05-14 2019-10-11 广东康云科技有限公司 Campus three-dimensional digital modeling method, system, device and storage medium
CN110322544A (en) * 2019-05-14 2019-10-11 广东康云科技有限公司 A kind of visualization of 3 d scanning modeling method, system, equipment and storage medium
CN110141232B (en) * 2019-06-11 2020-10-27 中国科学技术大学 Data enhancement method for robust electromyographic signal identification
CN110348323B (en) * 2019-06-19 2022-12-16 广东工业大学 Wearable device gesture recognition method based on neural network optimization
CN111079530A (en) * 2019-11-12 2020-04-28 青岛大学 Mature strawberry identification method
CN111429156A (en) * 2020-03-26 2020-07-17 北京九歌创艺文化艺术有限公司 Artificial intelligence recognition system for mobile phone and application thereof
CN113553877B (en) * 2020-04-07 2023-05-30 舜宇光学(浙江)研究院有限公司 Depth gesture recognition method and system and electronic equipment thereof
CN115147672A (en) * 2021-03-31 2022-10-04 广东高云半导体科技股份有限公司 Artificial intelligence system and method for identifying object types
CN113408443B (en) * 2021-06-24 2022-07-05 齐鲁工业大学 Gesture posture prediction method and system based on multi-view images

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710418A (en) * 2009-12-22 2010-05-19 上海大学 Interactive mode image partitioning method based on geodesic distance
CN104050682A (en) * 2014-07-09 2014-09-17 武汉科技大学 Image segmentation method fusing color and depth information
CN105825494A (en) * 2015-08-31 2016-08-03 维沃移动通信有限公司 Image processing method and mobile terminal
CN107300976A (en) * 2017-08-11 2017-10-27 五邑大学 A kind of gesture identification household audio and video system and its operation method
CN107563333A (en) * 2017-09-05 2018-01-09 广州大学 A kind of binocular vision gesture identification method and device based on ranging auxiliary
CN107622257A (en) * 2017-10-13 2018-01-23 深圳市未来媒体技术研究院 A kind of neural network training method and three-dimension gesture Attitude estimation method
CN107766842A (en) * 2017-11-10 2018-03-06 济南大学 A kind of gesture identification method and its application

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Probabilistic Combination of CNN and RNN Estimates for Hand Gesture Based Interaction in Car; Aditya Tewari et al.; 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct); 20171030; full text *
Dynamic gesture recognition based on a Kinect sensor (基于Kinect传感器的动态手势识别); Yu Xu (余旭); China Master's Theses Full-text Database, Information Science and Technology; 20140915; full text *
Research on image segmentation algorithms combining depth information (结合深度信息的图像分割算法研究); Pi Zhiming (皮志明); China Doctoral Dissertations Full-text Database, Information Science and Technology; 20131015; full text *

Also Published As

Publication number Publication date
CN108717524A (en) 2018-10-30

Similar Documents

Publication Publication Date Title
CN108717524B (en) Gesture recognition system based on dual-camera mobile phone and artificial intelligence system
KR102102161B1 (en) Method, apparatus and computer program for extracting representative feature of object in image
WO2019128507A1 (en) Image processing method and apparatus, storage medium and electronic device
CN106056064B (en) A kind of face identification method and face identification device
CN109284738B (en) Irregular face correction method and system
CN110929569B (en) Face recognition method, device, equipment and storage medium
CA3100642A1 (en) Multi-sample whole slide image processing in digital pathology via multi-resolution registration and machine learning
WO2019080203A1 (en) Gesture recognition method and system for robot, and robot
CN112037320B (en) Image processing method, device, equipment and computer readable storage medium
CN111989689A (en) Method for identifying objects within an image and mobile device for performing the method
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN109190456B (en) Multi-feature fusion overlook pedestrian detection method based on aggregated channel features and gray level co-occurrence matrix
CN110032932B (en) Human body posture identification method based on video processing and decision tree set threshold
CN111967319B (en) Living body detection method, device, equipment and storage medium based on infrared and visible light
CN110674759A (en) Monocular face in-vivo detection method, device and equipment based on depth map
CN110046544A (en) Digital gesture identification method based on convolutional neural networks
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN109816694A (en) Method for tracking target, device and electronic equipment
CN110991349A (en) Lightweight vehicle attribute identification method based on metric learning
CN112633221A (en) Face direction detection method and related device
CN111209873A (en) High-precision face key point positioning method and system based on deep learning
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN117252926B (en) Mobile phone shell auxiliary material intelligent assembly control system based on visual positioning
CN116630828B (en) Unmanned aerial vehicle remote sensing information acquisition system and method based on terrain environment adaptation
CN113128428A (en) Depth map prediction-based in vivo detection method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220506