CN108717524B - Gesture recognition system based on dual-camera mobile phone and artificial intelligence system - Google Patents

Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Info

Publication number
CN108717524B
CN108717524B (application CN201810402470.9A)
Authority
CN
China
Prior art keywords
image
gesture
depth
neural network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810402470.9A
Other languages
Chinese (zh)
Other versions
CN108717524A (en)
Inventor
邓琨 (Deng Kun)
孟昭鹏 (Meng Zhaopeng)
郑岩 (Zheng Yan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201810402470.9A
Publication of CN108717524A
Application granted
Publication of CN108717524B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/2163 Partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a gesture recognition system based on a dual-camera mobile phone and an artificial intelligence system, which recognizes human gestures using the dual-camera phone and machine learning. An image acquisition module acquires and stores two different original images produced by the two cameras' differing viewing angles, namely the color images of the left and right cameras together with an image containing depth information. An image preprocessing module crops the gesture area from the original image and obtains a depth image of the gesture area. A neural network training module trains on the acquired depth images with a deep neural network to obtain a neural network system with a recognition accuracy above 92%. A gesture detection and recognition module returns a gesture recognition result for the input gesture image to be recognized. Compared with the prior art, the system adds depth information, which carries more precise gesture information, so the recognition accuracy is higher.

Description

Gesture recognition system based on dual-camera mobile phone and artificial intelligence system
Technical Field
The invention relates to computer image processing and artificial intelligence, and in particular to a system and method for gesture recognition that acquires 3D images through binocular stereo vision.
Background
Human-computer interaction refers to the way humans and machines converse. From the earliest keyboard and mouse to today's cameras and various sensors, it has undergone great innovation and development. With the continuous development of VR technology, recognition of motion-based interaction has become a focus of new development. How to capture a user's gestures and recognize and interpret them is a complex problem.
With the continuous development of mobile phone software and hardware, dual cameras are becoming standard on mainstream phones. A phone with dual cameras offers better telephoto performance, and the cooperation of the two lenses provides camera-like background blurring, which works well for portrait photography. Moreover, using the binocular stereo vision of the two cameras, images and video with a 3D effect can be produced and depth image data of the scene can be obtained, allowing the 3D data to be applied to other specific scenarios.
The field of machine learning has improved and evolved continuously since 2006. In image processing, convolutional neural networks have achieved enormous practical success. A supervised deep learning model, the CNN (convolutional neural network), reduces the number of parameters through spatial computation schemes such as weight sharing and downsampling; fewer parameters mean fewer local minima, so training is more likely to find a good local optimum, improving the recognition rate to good effect.
Disclosure of Invention
Based on the prior art, the invention provides a gesture recognition system using a dual-camera mobile phone and an artificial intelligence system, serving as a novel means of human-computer interaction.
The gesture recognition system disclosed by the invention, based on a dual-camera mobile phone and an artificial intelligence system, recognizes human gestures using the dual-camera phone and machine learning, and comprises an image acquisition module, an image preprocessing module, a neural network training module and a gesture detection and recognition module; wherein:
the image acquisition module 100 is configured to acquire and store two different original images generated by the cameras' differing viewing angles, comprising the color images of the left and right cameras and an image containing depth information;
the image preprocessing module 200 is configured to crop the gesture area from an original image and obtain a depth image of the gesture area;
the neural network training module 300 is configured to train on the acquired depth images with a deep neural network to obtain a neural network system;
the gesture detection and recognition module 400 is configured to return a gesture recognition result for the input gesture image to be recognized;
the image acquisition module 100 simultaneously acquires the JPG image data of the two cameras. The JPG image comprises 3 parts: the color image shot by the left camera, the color image shot by the right camera, and a depth image obtained by preprocessing. JPG image segmentation is then performed according to the JPG file format specification: taking 0xFFD8 as the jpg file header and 0xFFDA as the start of an SOA format segment, the storage segments corresponding to the left and right camera images are extracted and stored separately; the depth image segment begins with 0x0065646F6600, whose ASCII characters spell "edof" and serve as the depth-segment flag, and is likewise extracted and stored separately;
the image preprocessing module 200 obtains the image with depth information from the original image and crops the gesture area in the depth image by a threshold segmentation method; the corresponding gesture area is cropped from the color image as the preliminary gesture segmentation result. The color image is converted from RGB space to HSV space, the image's color information is clustered with the kmeans machine learning clustering method, and the HSV-space image data are clustered into 3 classes: a white-background class, a gesture-area class and an 'other areas' class. After the classified gesture-area pixels are obtained, their pixel mean and variance are computed, and the corresponding precise gesture area in the color image is cropped by threshold segmentation according to that mean and variance. The precise color-image gesture area is then used to cut out the depth-image gesture area, yielding the final depth gesture image. Finally the depth gesture images are transformed and expanded, enlarging the training data set to roughly 30,000 depth images;
the neural network training module 300 trains a neural network on the gesture-area depth maps produced by the image preprocessing module. The network has 4 layers: the first layer is a convolutional neural network layer with 16 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, turning an input gray-scale map of size 72 × 96 into 16 feature maps of 36 × 48; the second layer is a convolutional neural network layer with 32 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, turning the 16 input feature maps of size 36 × 48 into 32 feature maps of 18 × 24; the third layer is a fully connected layer connecting the 32 output feature maps of 18 × 24 to 512 output neurons; the fourth layer is a softmax layer mapping the 512 input neurons to 9 output neurons representing the digits 1-9, the largest output being taken as the recognition result;
the gesture detection and recognition module 400 takes the gesture image, obtains a gesture depth map through the image preprocessing module, and inputs it into the neural network to obtain the prediction result.
Compared with the traditional technique of recognizing images from color images alone, the invention starts from the depth image and, exploiting the particular nature of gesture images, uses depth information for gesture recognition. Depth information carries more precise gesture information and therefore yields higher recognition accuracy.
Drawings
FIG. 1 is a functional block diagram of a gesture recognition system based on a dual-camera phone and an artificial intelligence system according to the present invention;
FIG. 2 is a schematic flow diagram of an image acquisition module;
FIG. 3 is a schematic flow diagram of an image pre-processing module;
FIG. 4 is a schematic diagram of the preliminary gesture segmentation of a depth image: (4-1) is the original depth image, (4-2) is the gesture-area depth image cropped by threshold segmentation, and (4-3) is the depth image after gray-scale stretching;
FIG. 5 is a diagram illustrating the effects of the embodiment: (5-1) is the original left-camera color image, (5-2) is the color image after preliminary gesture-area segmentation, (5-3) is the cropped and scaled color image, (5-4) is the blurred color image, (5-5) is the color image after precise gesture-area segmentation, (5-6) is the depth image after precise gesture-area segmentation, and (5-7) is the depth image after gray-scale inversion;
FIG. 6 is a schematic overall flow chart of a gesture recognition method based on a dual-camera mobile phone and an artificial intelligence system according to the present invention;
FIG. 7 is a diagram of a neural network model for depth images according to the present invention;
FIG. 8 is a schematic diagram of an original captured image, (8-1) is the original image, (8-2) is the left camera image, (8-3) is the right camera image, and (8-4) is the depth image;
FIG. 9 is a schematic view of depth images of 1-9 gestures;
FIG. 10 is a diagram illustrating the expansion effects for a number-1 gesture: (10-1) is the original image, (10-2) to (10-4) are scaling effects, (10-5) is a boundary-enhancement effect, (10-6) and (10-7) are maximum- and minimum-filter effects, (10-8) to (10-10) are rotation effects, (10-11) is a sharpening effect, and (10-12) is a smoothing effect.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a functional block diagram of the gesture recognition system based on a dual-camera phone and an artificial intelligence system according to the present invention. The system includes an image acquisition module 100, an image preprocessing module 200, a neural network training module 300, and a gesture detection and recognition module 400.
The image acquisition module exploits the fact that when the two rear cameras of a dual-camera phone shoot simultaneously, the positional difference between the cameras produces different viewing angles, so two slightly different images are acquired along with a scene depth-information image that the camera generates using the binocular stereo vision principle. The image preprocessing module 200 crops the preliminarily segmented gesture area from the depth image using a threshold method, segments the same gesture area in the corresponding color image, partitions the image regions with a clustering method to obtain a precise gesture image area, and removes the other image regions from the depth image accordingly, yielding a more precise depth image of the gesture area. The neural network training module 300 works from about 2,500 depth images collected in advance and manually labelled with the correct result; the sample images are expanded by image filtering operations such as flipping, blurring, sharpening, line smoothing and boundary enhancement, enlarging the training set to about 30,000 images, which are then used to train a deep neural network, producing a neural network system with a recognition accuracy above 92%. The gesture detection and recognition module 400 uses the built 4-layer neural network system: the gesture image to be recognized is processed by the image preprocessing module, fed into the neural network, and the gesture recognition result is returned. The module can be integrated into mobile, PC and web apps. In a practical gesture recognition application, the user takes a photo with the camera; the program acquires and preprocesses the photo, inputs it into the neural network, and feeds the network's prediction back to the user as the recognition result.
Fig. 2 shows a flow chart of the image acquisition module 100 according to the present invention. The image acquisition process comprises the following steps:
Step 1001, camera acquisition: the JPG image data shot by the two cameras are obtained simultaneously through the phone's large-aperture shooting interface. The JPG image contains 3 parts: the color images taken separately by the two cameras, and the depth image information obtained by the camera program's preprocessing. In the depth image, darker pixels indicate scene points closer to the camera and lighter pixels points farther away, as shown in FIG. 8: (8-1) is the original jpg image, (8-2) the left-camera color image, (8-3) the right-camera color image, and (8-4) the depth image;
Step 1002, JPG image segmentation. According to the JPG file format specification, 0xFFD8 in the image file identifies the JPG file header, 0xFFDA identifies the start of an SOA format segment, and 0xFFD9 or 0xFFD8 marks a segment's end. The two images shot by the left and right cameras are stored in two SOA format segments of the jpg file. Since precise color-image gesture segmentation requires the image of only one camera, either SOA segment may be extracted from the jpg source file. Similarly, the depth image format segment begins with 0x0065646F6600, whose ASCII characters spell "edof", and ends at 0xFFD9 or 0xFFD8; the edof storage segment is extracted from the original jpg file and saved as a new jpg file;
Step 1003, export the depth image, completing image acquisition. The export procedure is as follows (a code sketch follows the listing):
Image acquisition process:
(1) Color image acquisition
1. Check the jpg header 0xFFD8 identifier
2. Search for the first 0xFFDA identifier
3. Obtain the SOA format segment length
4. Cut out the SOA format segment and write it to the file x_n_rgb.jpg, where x is the hand-labelled gesture recognition result, i.e. a digit 1-9, and n is the picture serial number.
(2) Depth image acquisition
1. Check the jpg header 0xFFD8 identifier
2. Search for the 0x0065646F6600 (ASCII "edof") identifier
3. Obtain the edof format segment length
4. Cut out the edof format segment and write it to the file x_n_dep.jpg, where x is the hand-labelled gesture recognition result, i.e. a digit 1-9, and n is the picture serial number.
5. Flip the image so that the orientation of the depth image matches that of the color image.
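A minimal Python sketch of this extraction, under the assumptions stated above (each segment runs from its start marker to the next 0xFFD9 or 0xFFD8 tail); the function names and file handling are illustrative, not taken from the patent:

    def extract_segment(data: bytes, start_marker: bytes) -> bytes:
        """Return the bytes from start_marker up to the next 0xFFD9/0xFFD8 tail."""
        assert data[:2] == b"\xff\xd8", "missing jpg header 0xFFD8"
        start = data.find(start_marker)
        if start < 0:
            raise ValueError("marker not found")
        tails = [data.find(t, start + len(start_marker))
                 for t in (b"\xff\xd9", b"\xff\xd8")]
        tails = [t for t in tails if t >= 0]
        end = min(tails) + 2 if tails else len(data)
        return data[start:end]

    def split_dual_camera_jpg(path: str, x: int, n: int) -> None:
        """x: hand-labelled gesture digit (1-9); n: picture serial number."""
        data = open(path, "rb").read()
        color = extract_segment(data, b"\xff\xda")        # first SOA segment
        depth = extract_segment(data, b"\x00edof\x00")    # 0x0065646F6600 segment
        open(f"{x}_{n}_rgb.jpg", "wb").write(color)
        open(f"{x}_{n}_dep.jpg", "wb").write(depth)
        # The exported depth image must still be flipped (step 5 above) so
        # its orientation matches the color image.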
Fig. 3 is a flow chart of the image preprocessing module 200 according to the present invention. The image preprocessing flow comprises several key steps: preliminary segmentation of the depth-image gesture area by a threshold method, preliminary segmentation of the color-image gesture area, acquisition of hand-area pixel characteristics with kmeans, precise segmentation of the color-image gesture area by a threshold method, precise segmentation of the depth-image gesture area, and image transformation and expansion, as follows:
Step 2001, preliminary segmentation of the depth-image gesture area by a threshold method. The depth image information of the gesture is first obtained through the image acquisition module, and a gesture-area image is preliminarily cropped from the depth image by threshold segmentation:
Preliminary depth-image gesture segmentation:
1. Acquire the image's gray-value histogram statistics, i.e. count the number of pixels at each pixel value, for use in the subsequent steps.
2. Find the maximum gray value in the histogram whose pixel count also exceeds 0.1% of the picture's total pixels. The larger the gray value, the closer the object was to the camera when shot, and given the nature of gesture images this value can be taken as the part of the hand closest to the camera. Using 0.1% as the threshold effectively prevents isolated noise pixels from being mistaken for valid objects; the value was obtained through repeated tests, filters well, and eliminates individual noise pixels.
3. Given the nature of gesture images, 30 is taken as the threshold for gesture-area segmentation: after the maximum gray value is found, pixels whose gray value differs from it by more than 30 are filtered out. Experiments show that in close-range shots the pixel gray value changes by about 30 for every 20 cm of distance from the camera; since the depth of a hand gesture does not exceed 20 cm, the point closest to the camera is taken as the start of the hand region and everything more than 20 cm behind it is filtered out as background.
4. Apply gray-level stretching to the filtered image to enlarge the contrast of the depth values: the retained 30-value pixel range is stretched to the 0-255 range, increasing the differences in depth information for easier comparison.
5. Store the image temporarily as the result of preliminary depth-image gesture segmentation, as shown in FIG. 4.
The original depth image is shown in (4-1), the threshold-segmented depth image in (4-2), and the result of gray stretching in (4-3): after preliminary gesture segmentation the effective gesture area has been extracted and the depth-information contrast is pronounced.
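By way of illustration, steps 1 to 4 might be implemented as the following numpy sketch; it assumes an 8-bit depth map in which, as stated above, larger gray values mean closer, and all names are illustrative:

    import numpy as np

    def preliminary_gesture_segmentation(depth: np.ndarray) -> np.ndarray:
        """depth: 2-D uint8 depth map; returns the stretched gesture-area image."""
        hist = np.bincount(depth.ravel(), minlength=256)   # gray-value histogram
        valid = np.nonzero(hist > 0.001 * depth.size)[0]   # drop <0.1% noise bins
        g_max = int(valid.max())                           # closest hand point
        low = g_max - 30                                   # ~20 cm depth window
        clipped = np.clip(depth, low, g_max).astype(np.int32)
        stretched = (clipped - low) * (255.0 / 30.0)       # stretch 30 levels to 0-255
        return stretched.astype(np.uint8)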
Step 2002, preliminary segmentation of the color-image gesture area. Because of background interference or camera shake, the gesture-area depth map obtained by threshold segmentation may contain considerable noise, such as the arm, other parts of the body, or unrelated background pixels. The acquired color image is therefore used for further denoising; since the difference between the left and right camera images matters little for gesture segmentation, the left camera's color image is used by default. The preliminary gesture segmentation result from step 2001 is used to cut the corresponding gesture area out of the color image. The original color image is shown in FIG. 5-1 and the color image after preliminary gesture segmentation in FIG. 5-2.
Step 2003, kmeans acquisition of hand-area pixel characteristics. After preliminary gesture segmentation, the remaining irrelevant areas must be filtered out: as FIG. 5-2 shows, sleeves, shadows and other noise regions are still visible in the cropped gesture-area image. The hand region can therefore be segmented more precisely from the color image's color information, using the difference between skin color and background. The invention processes color information in the HSV color space, which locates colors by H (hue), S (saturation) and V (value, i.e. brightness). Hue ranges over 0-360 degrees and denotes the color category: red is 0 degrees, green 120 degrees, blue 240 degrees. Saturation ranges over 0-100% and denotes vividness: gray has 0% saturation, a pure color 100%. Value ranges over 0-100% and denotes brightness: 0% is black, 100% is white, and values in between express each color's brightness. Compared with RGB, HSV expresses brightness, hue and vividness intuitively and makes color comparison convenient, so different color regions differ more strongly and are easier to distinguish. The image's color information is clustered with the kmeans machine learning clustering method to obtain the average h, s and v values of the gesture area; the HSV-space image data are clustered into 3 classes, ideally a white-background class, a gesture-area class and an 'other areas' class. The detailed implementation steps are:
Kmeans acquisition of hand-region pixel characteristics:
1. Crop and scale the original color picture to reduce its pixel count and hence the computation. In practice gestures are generally concentrated near the center of the picture, with only slight deviations in a few cases. A strip 10 pixels wide is therefore cut from each of the top, bottom, left and right edges, and the length and width are then reduced proportionally by a factor of 40, scaling the image to 72 × 96 so that its total size drops to 6,912 pixels, greatly reducing the computation for pixel clustering and the neural network.
2. Blur the picture twice so that color transitions between regions become smoother, effectively reducing the influence of isolated bright or dark spots; the result is shown in FIG. 5-3.
3. Convert the pixel values from RGB to HSV using the standard RGB-to-HSV conversion, then run kmeans cluster analysis on the processed pixels. Kmeans is an unsupervised clustering machine learning algorithm whose main objective is to group similar samples automatically into one class. This step clusters the HSV-space image data into 3 classes, ideally yielding a white-background class, a gesture-area class and an 'other areas' class. By default the kmeans algorithm starts from 3 random initial centroids; for segmenting the preliminarily cropped color gesture image, the invention instead uses 3 customised initial centroids: white (hsv: 0°, 0%, 100%), black (hsv: 0°, 0%, 0%) and an average yellow skin tone (hsv: 60°, 90%, 60%). With these customised initial centroids, the centroids of the 3 final pixel classes stay closer to the initial ones and the clustering result is more accurate. The kmeans analysis divides the picture's pixels into 3 classes, and the class initialised with the average skin tone ends up containing the hand-region pixels; its centroid value represents the color characteristics of the hand region.
Step 2004, precise segmentation of the color-image gesture area by a threshold method. Step 2003 yields the color characteristics of the hand region, i.e. the centroid of the skin pixel class. With this centroid, threshold segmentation keeps as hand-region pixels those whose hsv values lie within 15°, 10% and 10% of the centroid and sets all other pixels to white; this threshold, obtained through repeated tests, segments well. The segmentation result is shown in FIG. 5-4. The picture is then passed through a minimum filter to remove isolated noise points, giving the final segmentation result shown in FIG. 5-5, where the white area is background and the central area is the gesture.
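Steps 2003-2004 might be sketched as follows in Python with OpenCV and scikit-learn (the libraries are an assumption; the patent does not name them). OpenCV's HSV scale (H in 0-179, S and V in 0-255) is assumed, so the patent's degree/percent centroids and the 15°/10%/10% tolerance are rescaled, and hue wraparound is ignored for simplicity:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def precise_color_segmentation(color_bgr: np.ndarray) -> np.ndarray:
        """color_bgr: the cropped and scaled 72 x 96 color image from steps 1-2."""
        img = cv2.GaussianBlur(color_bgr, (5, 5), 0)
        img = cv2.GaussianBlur(img, (5, 5), 0)              # blur twice (step 2)
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        pixels = hsv.reshape(-1, 3).astype(np.float32)
        # Customised initial centroids: white, black, average yellow skin tone
        # (hsv 60 deg / 90% / 60%), rescaled to OpenCV's 0-179 / 0-255 ranges.
        init = np.array([[0, 0, 255], [0, 0, 0], [30, 230, 153]], np.float32)
        km = KMeans(n_clusters=3, init=init, n_init=1).fit(pixels)
        skin = km.cluster_centers_[2]                       # skin-class centroid
        tol = np.array([15 / 2.0, 25.5, 25.5])              # 15 deg, 10%, 10%
        mask = np.all(np.abs(pixels - skin) <= tol, axis=1).reshape(hsv.shape[:2])
        out = np.full_like(color_bgr, 255)                  # white background
        out[mask] = color_bgr[mask]
        return cv2.erode(out, np.ones((3, 3), np.uint8))    # approximates the minimum filter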
Step 2005, precise segmentation of the depth-image gesture area. Using the precise color-image gesture segmentation from step 2004, the corresponding area is cut from the depth image and the rest is set to a white background, producing the precise depth-image gesture segmentation shown in FIG. 5-6. The picture's gray values are then inverted for the convenience of the neural network computation; the result is shown in FIG. 5-7.
Step 2006, transformation and expansion of the final depth gesture image. Because the number of acquired images is limited, the neural network training set would otherwise be too small and prone to large recognition errors, so the image count is increased and the training set enlarged by scaling, filtering, flipping and similar processing. FIG. 10 shows a depth gesture image and its expansions: (10-1) is the original depth gesture image, (10-2) a 105% scaled image, (10-3) a 110% scaled image, (10-4) a 115% scaled image, (10-5) a boundary-enhancement-filtered image, (10-6) a maximum-filtered image, (10-7) a minimum-filtered image, (10-8) an image rotated 90 degrees counterclockwise, (10-9) rotated 180 degrees, (10-10) rotated 270 degrees, (10-11) a sharpen-filtered image, and (10-12) a smooth-filtered image.
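A Pillow-based sketch of this expansion, producing the twelve variants of FIG. 10 (the library and exact filter choices are assumptions approximating the described effects):

    from PIL import Image, ImageFilter

    def expand(img: Image.Image) -> list:
        """Return the 12 variants of FIG. 10 for one depth gesture image."""
        w, h = img.size
        variants = [img]                                        # (10-1) original
        for s in (1.05, 1.10, 1.15):                            # (10-2)-(10-4) scaling
            big = img.resize((int(w * s), int(h * s)))
            left, top = (big.width - w) // 2, (big.height - h) // 2
            variants.append(big.crop((left, top, left + w, top + h)))
        variants.append(img.filter(ImageFilter.EDGE_ENHANCE))   # (10-5) boundary enhancement
        variants.append(img.filter(ImageFilter.MaxFilter(3)))   # (10-6) maximum filter
        variants.append(img.filter(ImageFilter.MinFilter(3)))   # (10-7) minimum filter
        for angle in (90, 180, 270):                            # (10-8)-(10-10) CCW rotations
            variants.append(img.rotate(angle))
        variants.append(img.filter(ImageFilter.SHARPEN))        # (10-11) sharpening
        variants.append(img.filter(ImageFilter.SMOOTH))         # (10-12) smoothing
        return variants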
The gesture-area depth maps produced by this preprocessing flow serve as the input of the neural network training module for training the network. As shown in FIG. 9, the invention collected the 9 gestures representing the digits 1-9 from 10 testers, about 2,500 images in total; after preprocessing and image expansion, 30,000 gesture depth images were obtained as the training set for the neural network system.
FIG. 7 shows the neural network structure designed for the present invention. The neural network system uses a 4-layer network structure comprising 2 convolutional layers with max pooling (maxpooling), 1 fully connected layer and 1 softmax output layer:
the first layer is a convolutional neural network layer, with 16 convolution kernels of 5 x 5, and 1 maximum subsampling operation of 2 x 2. Each input is a gray value matrix with the size being 1 and 72 × 96, the stride parameter of convolution is 1, and the padding parameter selection is the same as that of the original image; the pooling operation took 2 x 2 max power and the output was 16 matrices of 36 x 48.
The second layer is a convolutional neural network layer, consisting of 32 convolution kernels of 5 x 5, and 1 maximum subsampled convolution kernel of 2 x 2. The input is the output of the upper network structure, 16 matrixes of 36 x 48 are input, the stride parameter of convolution is 1, and the padding parameter selection is the same as the original image; the pooling operation took 2 x 2 max power and the output was 32 matrices of 18 x 24.
The third layer is a full connection layer, and the 32 output matrixes of 18 × 24 are fully connected to 512 output neurons;
and the fourth layer is a softmax layer, 512 input neurons are output to 9 output neurons, 9 numbers representing 1-9 are output, and the item with the largest output is taken as the recognition result.
Training uses a cross-entropy evaluation model: cross-entropy measures how poorly the predicted distribution matches the true labels and thus scores the prediction. The network is trained with the adaptive moment estimation method (Adam), a first-order optimization algorithm that can replace traditional stochastic gradient descent and iteratively updates the neural network weights from the training data. The main implementation code of the neural network is given in the following figures:
(The implementation code is reproduced in the original patent only as image figures GDA0003320810430000101, GDA0003320810430000111 and GDA0003320810430000121, not transcribed here.)
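Since the code figures are not transcribed, the following is a minimal tf.keras sketch of the four-layer network as the text describes it. The layer sizes, the cross-entropy loss and the Adam optimiser come from the patent; the ReLU activations and the Keras API are assumptions (the original figures appear to use the TensorFlow 1.x API):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu",
                               input_shape=(72, 96, 1)),     # 72 x 96 gray input
        tf.keras.layers.MaxPooling2D(2),                     # -> 36 x 48 x 16
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),                     # -> 18 x 24 x 32
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),       # fully connected layer
        tf.keras.layers.Dense(9, activation="softmax"),      # digits 1-9
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",    # cross-entropy model
                  metrics=["accuracy"])
    # model.fit(train_x, train_y, validation_data=(test_x, test_y), ...)
    # with labels 0-8 standing for the digits 1-9.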
After the code runs for 2,000 training steps, the recognition accuracy of the neural network model reaches 92.53%. The training output follows, where step is the training step, train_accuracy the training-set recognition accuracy, and test accuracy the test-set recognition accuracy:
step 0,train_accuracy 0.26
test accuracy 0.113806
step 100,train_accuracy 0.42
test accuracy 0.402985
step 200,train_accuracy 0.42
test accuracy 0.518657
step 300,train_accuracy 0.82
test accuracy 0.636194
step 400,train_accuracy 0.68
test accuracy 0.695895
step 500,train_accuracy 0.76
test accuracy 0.729478
step 600,train_accuracy 0.76
test accuracy 0.744403
step 700,train_accuracy 0.82
test accuracy 0.785448
step 800,train_accuracy 0.72
test accuracy 0.800373
step 900,train_accuracy 0.9
test accuracy 0.820895
step 1000,train_accuracy 0.96
test accuracy 0.845149
step 1100,train_accuracy 0.96
test accuracy 0.867537
step 1200,train_accuracy 0.98
test accuracy 0.873134
step 1300,train_accuracy 0.94
test accuracy 0.876866
step 1400,train_accuracy 0.98
test accuracy 0.882463
step 1500,train_accuracy 0.94
test accuracy 0.882463
step 1600,train_accuracy 0.96
test accuracy 0.897388
step 1700,train_accuracy 0.96
test accuracy 0.902985
step 1800,train_accuracy 0.96
test accuracy 0.893657
step 1900,train_accuracy 0.96
test accuracy 0.916045
step 2000,train_accuracy 0.98
test accuracy 0.925373
the gesture detection and recognition module 400 inputs the gesture image after being processed by the preprocessing module into the trained neural network, and returns a gesture recognition result. The module can be combined with practical application, for example, the module is integrated in a camera, a user takes a picture by using the camera, and a program acquires and preprocesses a taken picture, inputs the result into a neural network, obtains a neural network prediction result and feeds the result back to a user recognition result.
At present, applications of the dual-camera phone in the 3D field are very rare; the invention creatively applies the dual-camera phone to fields beyond photography. The scheme currently applies only to phones on the Android platform, and since Android does not yet expose an interface for acquiring dual-camera data, the data of a single camera cannot be obtained independently and the development interfaces of individual phone manufacturers must be used. The invention implements the dual-camera data acquisition scheme through the open interface of the Huawei phone series.
Because the gesture image area differs markedly from the background in the depth image, i.e. the distances from the camera of the gesture and of other parts differ clearly, the adopted threshold segmentation method achieves a good gesture-area cropping effect.
The digits 1 to 9 are predefined to represent 9 gestures, and about 2,500 sample images were acquired from 10 different testers in different background environments;
then, in the image preprocessing stage, the gesture area in each image is extracted along with its depth image, and about 30,000 gesture depth images are obtained through image expansion as the training set of the neural network system;
a deep convolutional neural network is implemented with the open-source neural network framework TensorFlow and trained on the sample set, yielding a neural network system with recognition accuracy above 92%;
in the recognition application stage, the deep neural network trained in the previous stage is applied in practice: after the user shoots gesture images with the two cameras and inputs them into the system, the system performs the same preprocessing on the images and then feeds them into the neural network system for recognition, obtaining the recognition result.

Claims (1)

1. A gesture recognition system based on a dual-camera mobile phone and an artificial intelligence system, which realizes recognition of human gestures by using the dual-camera mobile phone and machine learning, characterized by comprising an image acquisition module (100), an image preprocessing module (200), a neural network training module (300) and a gesture detection and recognition module (400); wherein:
the image acquisition module is used for acquiring and storing two different original images generated by the cameras' differing viewing angles, including the color images of the left and right cameras and an image containing depth information;
the image preprocessing module is used for cropping the gesture area from an original image and acquiring a depth image of the gesture area;
the neural network training module is used for training on the acquired depth images with a deep neural network to obtain a neural network system;
the gesture detection and recognition module is used for returning a gesture recognition result for the input gesture image to be recognized;
acquiring JPG image data of the two cameras by using the image acquisition module (100); the JPG image comprises 3 parts, namely the color image shot by the left camera, the color image shot by the right camera and a depth image obtained by preprocessing; JPG image segmentation processing is then carried out according to the JPG file format specification: taking 0xFFD8 as the jpg file header and 0xFFDA as the start of an SOA format segment, the storage segments corresponding to the left and right camera images are extracted and stored separately; the depth image segment, which starts with 0x0065646F6600, is then extracted and stored; the ASCII characters of this hexadecimal string spell "edof" and serve as the depth-segment flag;
acquiring an image with depth information from the original image by using the image preprocessing module (200), and cropping the gesture area in the depth image by a threshold segmentation method; cropping the corresponding gesture area from the color image as the preliminary gesture segmentation result; converting the color image from RGB space to HSV space, clustering the image's color information by the kmeans machine learning clustering method, and clustering the HSV-space image data into 3 classes, namely a white-background class, a gesture-area class and an 'other areas' class; after the classified gesture-area pixels are obtained, computing their pixel mean and variance, and cropping the corresponding precise gesture area in the color image by threshold segmentation according to the mean and variance; cutting out the depth-image gesture area by means of the precise color-image gesture area to obtain the final depth gesture image; and transforming and expanding the final depth gesture images to enlarge the training data set to roughly 30,000 depth images;
performing neural network training on the gesture-area depth maps obtained by the image preprocessing module by using the neural network training module (300), wherein the neural network is composed of 4 layers: the first layer is a convolutional neural network layer comprising 16 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, which turns an input gray-scale map of size 72 × 96 into 16 feature maps of 36 × 48; the second layer is a convolutional neural network layer comprising 32 convolution kernels of 5 × 5 and one 2 × 2 max-value subsampling, which turns the 16 input feature maps of size 36 × 48 into 32 feature maps of 18 × 24; the third layer is a fully connected layer connecting the 32 output feature maps of 18 × 24 to 512 output neurons; the fourth layer is a softmax layer mapping the 512 input neurons to 9 output neurons representing the digits 1-9, the largest output being taken as the recognition result;
and by using the gesture detection and recognition module (400), the gesture depth map obtained through the image preprocessing module is input into the neural network to obtain the prediction result.
CN201810402470.9A 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system Expired - Fee Related CN108717524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810402470.9A CN108717524B (en) 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810402470.9A CN108717524B (en) 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Publications (2)

Publication Number Publication Date
CN108717524A CN108717524A (en) 2018-10-30
CN108717524B (en) 2022-05-06

Family

ID=63899399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810402470.9A Expired - Fee Related CN108717524B (en) 2018-04-28 2018-04-28 Gesture recognition system based on dual-camera mobile phone and artificial intelligence system

Country Status (1)

Country Link
CN (1) CN108717524B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767488A (en) * 2019-01-23 2019-05-17 广东康云科技有限公司 Three-dimensional modeling method and system based on artificial intelligence
CN109948483B (en) * 2019-03-07 2022-03-15 武汉大学 Character interaction relation recognition method based on actions and facial expressions
CN110322546A (en) * 2019-05-14 2019-10-11 广东康云科技有限公司 Substation's three-dimensional digital modeling method, system, device and storage medium
CN110322545A (en) * 2019-05-14 2019-10-11 广东康云科技有限公司 Campus three-dimensional digital modeling method, system, device and storage medium
CN110322544A (en) * 2019-05-14 2019-10-11 广东康云科技有限公司 A kind of visualization of 3 d scanning modeling method, system, equipment and storage medium
CN110141232B (en) * 2019-06-11 2020-10-27 中国科学技术大学 Data enhancement method for robust electromyographic signal identification
CN110348323B (en) * 2019-06-19 2022-12-16 广东工业大学 Wearable device gesture recognition method based on neural network optimization
CN111079530A (en) * 2019-11-12 2020-04-28 青岛大学 Mature strawberry identification method
CN111429156A (en) * 2020-03-26 2020-07-17 北京九歌创艺文化艺术有限公司 Artificial intelligence recognition system for mobile phone and application thereof
CN113553877B (en) * 2020-04-07 2023-05-30 舜宇光学(浙江)研究院有限公司 Depth gesture recognition method and system and electronic equipment thereof
CN115147672A (en) * 2021-03-31 2022-10-04 广东高云半导体科技股份有限公司 Artificial intelligence system and method for identifying object types
CN113408443B (en) * 2021-06-24 2022-07-05 齐鲁工业大学 Gesture posture prediction method and system based on multi-view images

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710418A (en) * 2009-12-22 2010-05-19 上海大学 Interactive mode image partitioning method based on geodesic distance
CN104050682A (en) * 2014-07-09 2014-09-17 武汉科技大学 Image segmentation method fusing color and depth information
CN105825494A (en) * 2015-08-31 2016-08-03 维沃移动通信有限公司 Image processing method and mobile terminal
CN107300976A (en) * 2017-08-11 2017-10-27 五邑大学 A kind of gesture identification household audio and video system and its operation method
CN107563333A (en) * 2017-09-05 2018-01-09 广州大学 A kind of binocular vision gesture identification method and device based on ranging auxiliary
CN107622257A (en) * 2017-10-13 2018-01-23 深圳市未来媒体技术研究院 A kind of neural network training method and three-dimension gesture Attitude estimation method
CN107766842A (en) * 2017-11-10 2018-03-06 济南大学 A kind of gesture identification method and its application

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Probabilistic Combination of CNN and RNN Estimates for Hand Gesture Based Interaction in Car; Aditya Tewari et al.; 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct); 20171030; full text *
Dynamic gesture recognition based on a Kinect sensor (基于Kinect传感器的动态手势识别); Yu Xu (余旭); China Master's Theses Full-text Database, Information Science and Technology; 20140915; full text *
Research on image segmentation algorithms combining depth information (结合深度信息的图像分割算法研究); Pi Zhiming (皮志明); China Doctoral Dissertations Full-text Database, Information Science and Technology; 20131015; full text *

Also Published As

Publication number Publication date
CN108717524A (en) 2018-10-30

Similar Documents

Publication Publication Date Title
CN108717524B (en) Gesture recognition system based on dual-camera mobile phone and artificial intelligence system
KR102102161B1 (en) Method, apparatus and computer program for extracting representative feature of object in image
WO2019128507A1 (en) Image processing method and apparatus, storage medium and electronic device
CN106056064B (en) A kind of face identification method and face identification device
CN109284738B (en) Irregular face correction method and system
CN110929569B (en) Face recognition method, device, equipment and storage medium
CA3100642A1 (en) Multi-sample whole slide image processing in digital pathology via multi-resolution registration and machine learning
WO2019080203A1 (en) Gesture recognition method and system for robot, and robot
CN112037320B (en) Image processing method, device, equipment and computer readable storage medium
CN111989689A (en) Method for identifying objects within an image and mobile device for performing the method
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN109190456B (en) Multi-feature fusion overlook pedestrian detection method based on aggregated channel features and gray level co-occurrence matrix
CN110032932B (en) Human body posture identification method based on video processing and decision tree set threshold
CN111967319B (en) Living body detection method, device, equipment and storage medium based on infrared and visible light
CN110674759A (en) Monocular face in-vivo detection method, device and equipment based on depth map
CN110046544A (en) Digital gesture identification method based on convolutional neural networks
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN109816694A (en) Method for tracking target, device and electronic equipment
CN110991349A (en) Lightweight vehicle attribute identification method based on metric learning
CN112633221A (en) Face direction detection method and related device
CN111209873A (en) High-precision face key point positioning method and system based on deep learning
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN117252926B (en) Mobile phone shell auxiliary material intelligent assembly control system based on visual positioning
CN116630828B (en) Unmanned aerial vehicle remote sensing information acquisition system and method based on terrain environment adaptation
CN113128428A (en) Depth map prediction-based in vivo detection method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220506