CN108038455A - Bionic machine peacock image-recognizing method based on deep learning - Google Patents
Bionic machine peacock image-recognizing method based on deep learning
- Publication number
- CN108038455A CN108038455A CN201711374581.5A CN201711374581A CN108038455A CN 108038455 A CN108038455 A CN 108038455A CN 201711374581 A CN201711374581 A CN 201711374581A CN 108038455 A CN108038455 A CN 108038455A
- Authority
- CN
- China
- Prior art keywords
- face
- convolutional neural network
- sample
- deep learning
- Prior art date: 2017-12-19
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/166—Detection; Localisation; Normalisation using acquisition arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Abstract
The invention discloses an image recognition method for a bionic machine peacock based on deep learning, comprising the following steps: collecting a public face detection database as the training and verification image data set; designing a deep learning architecture based on convolutional neural networks and realizing a face detection function in the deep learning architecture; collecting live images captured by the camera of the bionic machine peacock to fine-tune the trained convolutional neural networks, realizing face detection in an indoor complex environment; obtaining empirical parameters to determine the position of the audience's clothing and counting the proportion occupied by each color. The invention realizes accurate and efficient face detection and color recognition for an entertainment bionic robot in a complex environment, with good robustness; the trained deep learning architecture is fine-tuned on the live images; finally, real-time face detection and clothing-color identification are performed on the live images captured by the camera. The method can be applied in science and technology museums, hotels and shopping malls for visitors' viewing and entertainment.
Description
Technical Field
The invention relates to the technical field of computer image recognition, and in particular to a bionic machine peacock image recognition method based on deep learning.
Background
The bionic robot is a product of combining bionics with application requirements in the robotics field. From the robotics perspective, the bionic robot represents an advanced stage of robot development, and biological characteristics provide many useful references for robot design. Existing bionic robots come in many varieties, such as bionic robotic fish and bionic robotic dogs, and are widely used in military and industrial fields, but they are so far rarely applied as entertainment robots. The visual system is an important component of a bionic robot, serving as its 'eyes'. This system generally captures information about the surrounding environment through a high-definition camera mounted on the robot, and analyzes and processes the captured images with existing computer vision algorithms to drive the motion control of the robot's actuators, for example the underwater fishing of a robotic fish or the gait planning of a robotic dog. At present, the visual systems of bionic robots rely on traditional image processing algorithms: image features of the current environment are first extracted, the extracted features are then described, and finally recognition, detection and other related tasks for the current environment are carried out on the basis of these feature descriptions. Although specific image features can work well in the bionic robot field, traditional image features depend on manual experience, and in a complex environment the feature extraction is highly complex and the robustness is low.
In view of the above, there is an urgent need for a bionic machine peacock visual recognition method based on deep learning, with good robustness and high real-time performance, that can be applied in places such as science and technology museums, superstores and hotels, combining education with entertainment.
Disclosure of Invention
In order to solve the above technical problem, the technical solution adopted by the invention is to provide an image recognition method for a bionic machine peacock based on deep learning, which comprises the following steps:
S1, collecting a public face detection database as the training and verification image data set;
S2, designing a deep learning architecture based on convolutional neural networks, and realizing a face detection function in the deep learning architecture;
S3, collecting live images captured by the camera of the bionic machine peacock to fine-tune the trained convolutional neural networks;
S4, testing with the deep learning architecture fine-tuned in step S3, realizing the face detection function in an indoor complex environment;
S5, according to the face frame position obtained in step S4 and the positional relation between the camera and the audience, obtaining empirical parameters to determine the position of the audience's clothing, and counting the proportion occupied by each color.
In the above method, the step S1 specifically includes the following steps:
S11, selecting the public WIDER FACE and CelebA datasets from face detection databases as training samples for face detection; normalizing the original images in the image data set to a uniform size;
the WIDER FACE and CelebA datasets provide a large amount of face detection data, with the position of the face frame annotated in each picture;
S12, randomly selecting a frame in each face image, and calculating the overlap (intersection-over-union, IOU) between the selected frame and the real frame;
s13, dividing the face detection data into three types, namely a face positive sample, a face negative sample and a face partial sample, wherein the proportion of the face positive sample, the face negative sample and the face partial sample is 1: 3: 1;
S14, generating file paths for the training samples and creating corresponding labels, where the label content comprises the positive/negative sample label and the offsets between the randomly generated frame and the real frame.
In the above method, in step S12, a selected frame with IOU > 0.65 is a face positive sample and one with IOU < 0.4 is a face negative sample; face partial samples are those with 0.4 ≤ IOU ≤ 0.65.
In the above method, the deep learning architecture is specifically as follows:
the deep learning framework is formed by cascading three convolutional neural networks, wherein the three convolutional neural networks are a first convolutional neural network PNet, a second convolutional neural network RNet and a third convolutional neural network ONet respectively;
the first convolutional neural network PNet inputs image blocks with the size of 12 × 12 × 3 in the training stage; it is a three-layer fully convolutional network containing no fully connected layer;
the second convolutional neural network RNet inputs image blocks with the size of 24 × 24 × 3; it is a four-layer convolutional network comprising three convolutional layers and one fully connected layer;
the third convolutional neural network ONet inputs image blocks with the size of 48 × 48 × 3; it is likewise a four-layer convolutional network comprising three convolutional layers and one fully connected layer;
first, a training sample is sent into the first convolutional neural network PNet, whose outputs are the sample category, namely face positive or face negative sample, and the face-localization predicted value;
the second convolutional neural network RNet is used to refine the output of the PNet network;
the third convolutional neural network ONet is used to refine the output of the RNet network and outputs the sample category, the face-localization predicted value and the face key point predicted values.
In the above method, the step S3 specifically includes the following steps:
S31, selecting a plurality of images collected on site by the bionic machine peacock, both with and without audience members, and dividing them into training samples and testing samples; selecting images from the training samples and taking random frames to obtain face positive samples, face negative samples and face partial samples;
S32, sending the determined training samples and labels into the deep learning architecture, and fine-tuning the parameters of the three convolutional neural networks in the deep learning architecture.
In the above method, the step S4 specifically includes the following steps:
S41, sending the test sample into the first convolutional neural network PNet, and obtaining candidate face frames and the four coordinates corresponding to each face frame;
S42, sending the candidate face frames output by the first convolutional neural network PNet into the second convolutional neural network RNet, which performs refinement screening on the candidate face frames, applies non-maximum suppression, and removes false detections;
S43, sending the screened output of the second convolutional neural network RNet to the third convolutional neural network ONet, which further refines and screens the candidate face frames and removes false detections, finally obtaining the face detection result, which comprises the face frame localization and the positional relation between the camera and the audience.
In the above method, the step S5 specifically includes the following steps:
S51, according to multiple measurements and calibrations of the camera, selecting a clothing candidate frame at an offset of 1.8 face frames from the detected face frame position in the image, the size of the clothing candidate frame being 2 face frames;
S52, setting the colors included in the proportion statistics to red, yellow, blue, green, purple, white, black and gray, and counting the proportion occupied by each color within the selected clothing candidate frame.
The invention provides an image recognition method for a bionic machine peacock based on deep learning, aiming to realize accurate and efficient face detection and color recognition for an entertainment bionic robot in a complex environment, with good robustness; the trained deep learning architecture is fine-tuned with live images acquired by the bionic machine peacock; finally, real-time face detection and clothing-color identification are performed on the live images captured by the camera. The method can be applied in science and technology museums, hotels and shopping malls for visitors' viewing and entertainment.
Drawings
FIG. 1 is a flow chart provided by the present invention;
FIG. 2 is a schematic structural diagram of a deep learning architecture according to the present invention; wherein,
(a) a first convolutional neural network PNet; (b) is a second convolutional neural network RNet; (c) is a third convolutional neural network ONet;
FIG. 3 is a block diagram of an image recognition process of a bionic machine peacock based on deep learning provided by the invention;
FIG. 4 is an exemplary diagram of an image acquired by the camera and the color proportion statistics.
Detailed Description
The invention is described in detail below with reference to specific embodiments and the accompanying drawings.
As shown in FIGS. 1-3, the invention provides an image recognition method of a bionic machine peacock based on deep learning, which comprises the following steps:
S1, collecting a public face detection database as the training and verification image data set. Step S1 specifically includes the following steps:
S11, selecting the public WIDER FACE and CelebA datasets from face detection databases as training samples for face detection. The WIDER FACE and CelebA datasets provide a large amount of face detection data, with the position of the face frame annotated in each picture; more than 200,000 face samples can be obtained in total.
S12, randomly selecting a frame in each face image, and calculating the overlap (intersection-over-union, IOU) between the selected frame and the real frame. A selected frame with IOU > 0.65 is defined as a face positive sample and one with IOU < 0.4 as a face negative sample; face partial samples are those with 0.4 ≤ IOU ≤ 0.65.
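For illustration, a minimal Python sketch of the IOU computation and of the sample-category rule just described follows; the (x1, y1, x2, y2) frame representation is an assumption made for the example, not a format stated in the patent.

```python
def iou(box_a, box_b):
    # Intersection-over-union of two frames given as (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def sample_category(random_box, true_box):
    # Label a randomly selected frame against the annotated real frame.
    overlap = iou(random_box, true_box)
    if overlap > 0.65:
        return "positive"   # face positive sample
    if overlap < 0.4:
        return "negative"   # face negative sample
    return "partial"        # face partial sample: 0.4 <= IOU <= 0.65
```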
S13, dividing the face detection data into three types, namely a face positive sample, a face negative sample and a face partial sample, wherein the proportion of the face positive sample, the face negative sample and the face partial sample is 1: 3: 1.
S14, generating file paths for the training samples and creating corresponding labels, where the label content comprises the positive/negative sample label and the offsets between the randomly generated frame and the real frame. The original images in the image data set are normalized to a uniform size, such as 800 × 600 × 3, and each image containing a face is labeled. For example, the line "0001.png 204 301 264 390" indicates that the image 0001.png contains a human face in the rectangular pixel region (204, 301, 264, 390); all labels are written into a txt file, with image paths and labels in one-to-one correspondence.
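As one possible reading of the label format above, the following sketch writes one txt line per training sample; the exact column layout and the reuse of the `sample_category` helper from the previous sketch are assumptions made for the example.

```python
def write_labels(samples, label_path="train_labels.txt"):
    # samples: list of (image_path, random_box, true_box) triples, where the
    # boxes are (x1, y1, x2, y2). Each output line pairs an image path with
    # its sample label and the offsets between random frame and real frame.
    with open(label_path, "w") as f:
        for img_path, rand_box, true_box in samples:
            label = sample_category(rand_box, true_box)
            offsets = [t - r for r, t in zip(rand_box, true_box)]
            f.write("%s %d %d %d %d %s %d %d %d %d\n"
                    % ((img_path,) + tuple(rand_box) + (label,) + tuple(offsets)))
```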
S2, designing a deep learning framework based on the convolutional neural network, and realizing the face detection function in the deep learning framework. The deep learning architecture is specifically as follows:
As shown in FIG. 2, the deep learning architecture is composed of three convolutional neural networks in cascade, named the first convolutional neural network PNet, the second convolutional neural network RNet and the third convolutional neural network ONet respectively; the first convolutional neural network PNet inputs image blocks with the size of 12 × 12 × 3 in the training phase, the second convolutional neural network RNet inputs image blocks with the size of 24 × 24 × 3, and the third convolutional neural network ONet inputs image blocks with the size of 48 × 48 × 3.
First, a training sample is sent into the first convolutional neural network PNet, a three-layer fully convolutional network containing no fully connected layer; its outputs are the sample category (face positive or negative sample) and the face-localization predicted value.
The second convolutional neural network RNet is a four-layer convolutional network comprising three convolutional layers and one fully connected layer. This network is used to refine the output of the PNet network.
The third convolutional neural network ONet is likewise a four-layer convolutional network comprising three convolutional layers and one fully connected layer. This network is used to refine the output of the RNet network and outputs the sample category (face positive or negative sample), the face-localization predicted value and the face key point predicted values.
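The cascade described above corresponds closely to the well-known MTCNN design; as a sketch of the first network PNet, here is an illustrative PyTorch version (the embodiment itself uses CAFFE). The channel widths 10, 16 and 32 are assumptions borrowed from the public MTCNN architecture and are not stated in the patent.

```python
import torch.nn as nn

class PNet(nn.Module):
    # Three-layer fully convolutional network with no fully connected layer;
    # trained on 12 x 12 x 3 image blocks, applicable to any size at test time.
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3), nn.PReLU(),   # 12 -> 10
            nn.MaxPool2d(kernel_size=2, stride=2),         # 10 -> 5
            nn.Conv2d(10, 16, kernel_size=3), nn.PReLU(),  # 5 -> 3
            nn.Conv2d(16, 32, kernel_size=3), nn.PReLU(),  # 3 -> 1
        )
        self.cls = nn.Conv2d(32, 2, kernel_size=1)  # face / non-face category
        self.box = nn.Conv2d(32, 4, kernel_size=1)  # face-localization values

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls(feat), self.box(feat)
```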
S3, acquiring live images captured by the camera of the bionic machine peacock, and fine-tuning the trained deep learning architecture. The captured live images include scenes both with and without audience members. Step S3 specifically includes the following steps:
S31, selecting 100,000 images collected on site by the bionic machine peacock, both with and without audience members, and dividing them into training samples and testing samples. Images from the training samples are selected and random frames taken to obtain face positive samples, face negative samples and face partial samples, specifically following steps S12-S14.
S32, sending the determined training samples and labels into the deep learning architecture, and fine-tuning the parameters of the three convolutional neural networks in the deep learning architecture.
S4, testing with the deep learning architecture fine-tuned in step S3, realizing the face detection function in an indoor complex environment. Step S4 specifically includes the following steps:
S41, the test sample is first sent into the first convolutional neural network PNet, obtaining candidate face frames and the four coordinates corresponding to each face frame.
S42, the candidate face frames output by the first convolutional neural network PNet are sent into the second convolutional neural network RNet, which performs refinement screening on the candidate face frames, applies non-maximum suppression, and removes false detections.
S43, the screened output of the second convolutional neural network RNet is sent to the third convolutional neural network ONet, whose operation is analogous to that of RNet: it further refines and screens the candidate face frames and removes false detections, finally yielding the face detection result, which comprises the face frame localization and the positional relation between the camera and the audience (the face key point positions).
Across the cascade of three convolutional networks, the number of layers gradually deepens and the size of the convolution kernels gradually increases, so the detection of the human face is also progressively refined. The first convolutional neural network PNet obtains rough face candidate frames, while the second convolutional neural network RNet and the third convolutional neural network ONet progressively screen the candidate face frames obtained by PNet and perform non-maximum suppression to obtain the final face detection result.
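A minimal sketch of the non-maximum suppression step used between the cascade stages follows, assuming candidate frames as (x1, y1, x2, y2, score) tuples and reusing the `iou` helper from the earlier sketch; the 0.7 overlap threshold is illustrative, not a value stated in the patent.

```python
def non_max_suppression(candidates, iou_threshold=0.7):
    # Keep the highest-scoring candidate face frames, discarding any frame
    # whose overlap with an already-kept frame exceeds iou_threshold.
    ordered = sorted(candidates, key=lambda b: b[4], reverse=True)
    kept = []
    for cand in ordered:
        if all(iou(cand[:4], k[:4]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept
```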
The deep learning architecture adopts a multi-task learning strategy to optimize the network and is driven by three paths of supervised learning signals: A. the cross-entropy loss of a softmax classifier learns whether the input image block is a face; B. the Euclidean regression loss of the rectangular frame learns the frame's position; C. a regression loss learns the positions of the five face key points.
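Combined, the three supervision paths can be sketched as a single training loss; the weighting of the three terms below is an assumption, since the patent does not state how the losses are balanced.

```python
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_target, box_pred, box_target,
                   lmk_pred, lmk_target, w_box=1.0, w_lmk=1.0):
    loss_cls = F.cross_entropy(cls_logits, cls_target)  # A: face / non-face
    loss_box = F.mse_loss(box_pred, box_target)         # B: rectangular frame
    loss_lmk = F.mse_loss(lmk_pred, lmk_target)         # C: five key points
    return loss_cls + w_box * loss_box + w_lmk * loss_lmk
```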
The construction of the deep learning architecture specifically comprises the following steps:
the whole network is trained by using CAFFE open source codes, an Intel core I7 processor is configured by using a computer, and a graphics card is configured as a NIVIDA GTX TITAN 1080TI GPU. The operating system is Linux 16.04.
The learning parameters of the network are: 100,000 iterations, 100 samples fed in per iteration, an initial learning rate of 0.01, a momentum term of 0.9, and stochastic gradient descent as the back-propagation mode.
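The stated hyperparameters map onto a CAFFE solver definition along the following lines; the file names and the snapshot settings are assumptions added for the sketch.

```python
# Sketch: emit a CAFFE solver file matching the stated learning parameters.
solver_text = """net: "peacock_face_train.prototxt"  # network script file (name assumed)
type: "SGD"          # stochastic gradient descent back-propagation
base_lr: 0.01        # initial learning rate
momentum: 0.9        # momentum (impulse) term
max_iter: 100000     # number of iterations
snapshot: 10000
snapshot_prefix: "peacock_face"
# the batch size of 100 per iteration is set in the net's data layer
"""
with open("solver.prototxt", "w") as f:
    f.write(solver_text)
```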
S5, according to the face frame position obtained in step S4 and the positional relation between the camera and the audience, empirical parameters are obtained to determine the position of the audience's clothing, and the proportion occupied by each color is counted. Step S5 specifically includes the following steps:
S51, according to multiple measurements and calibrations of the camera, a clothing candidate frame is selected at an offset of 1.8 face frames from the detected face frame position in the image, with the size of the clothing candidate frame being 2 face frames, as shown in FIG. 4, which is an exemplary diagram of the image collected by the camera and the resulting color proportion statistics. The camera of the bionic machine peacock is installed at a distance of 2.5 meters from the on-site audience.
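A sketch of this clothing candidate frame geometry follows, assuming the face frame is given as (x1, y1, x2, y2) in pixels and that the clothing region lies below the detected face; the 1.8 and 2.0 factors are the calibrated values quoted above.

```python
def clothing_candidate(face_box, img_w, img_h, offset_faces=1.8, size_faces=2.0):
    # Place a clothing candidate frame below a detected face frame: its centre
    # is offset downward by 1.8 face-frame heights, its size is 2 face frames,
    # and the result is clipped to the image boundary.
    x1, y1, x2, y2 = face_box
    fw, fh = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2.0
    cy = (y1 + y2) / 2.0 + offset_faces * fh
    half_w, half_h = size_faces * fw / 2.0, size_faces * fh / 2.0
    return (max(0, int(cx - half_w)), max(0, int(cy - half_h)),
            min(img_w, int(cx + half_w)), min(img_h, int(cy + half_h)))
```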
S52, the colors included in the proportion statistics are set to red, yellow, blue, green, purple, white, black and gray, and the proportion occupied by each color is counted within the selected clothing candidate frame.
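The colour statistics could be computed, for example, by assigning every pixel of the clothing candidate frame to the nearest of the eight named colours; the RGB anchor values and the nearest-colour rule below are assumptions, since the patent does not specify a colour model.

```python
from collections import Counter

# Illustrative RGB anchors for the eight colours named in step S52 (assumed).
REFERENCE_COLORS = {
    "red": (200, 30, 30),   "yellow": (220, 200, 40), "blue": (40, 60, 200),
    "green": (40, 160, 60), "purple": (130, 50, 160), "white": (240, 240, 240),
    "black": (20, 20, 20),  "gray": (128, 128, 128),
}

def color_proportions(pixels):
    # pixels: iterable of (r, g, b) tuples from the clothing candidate frame.
    # Returns the fraction of pixels assigned to each reference colour.
    counts = Counter()
    for pix in pixels:
        nearest = min(REFERENCE_COLORS,
                      key=lambda name: sum((p - q) ** 2 for p, q in
                                           zip(pix, REFERENCE_COLORS[name])))
        counts[nearest] += 1
    total = sum(counts.values()) or 1
    return {name: counts[name] / float(total) for name in REFERENCE_COLORS}
```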
The bionic machine peacock provided by the invention decides whether to spread its tail and how far to spread it according to the captured visual information (the number of audience members on site and the proportions of their clothing colors), thereby realizing its entertainment function.
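As a purely illustrative sketch of such a decision rule (the patent states only that the decision depends on the audience count and the clothing-colour proportions; every threshold below is invented for the example):

```python
def tail_display_decision(num_faces, color_props):
    # Return (spread?, extent in [0, 1]) from the captured visual information.
    if num_faces == 0:
        return False, 0.0
    # Example rule: spread wider for larger audiences and warmer clothing colours.
    warm = color_props.get("red", 0.0) + color_props.get("yellow", 0.0)
    extent = min(1.0, 0.2 * num_faces + warm)
    return True, extent
```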
The construction of the whole bionic machine peacock image recognition system is as follows:
the method is characterized in that a Logitech Pro C920 camera is mounted at the head of the bionic machinery peacock and used as an 'eye' of the bionic machinery peacock to capture a scene environment picture, wherein the distance between the 'eye' and an audience is controlled to be about 2.5 meters.
The MFC interface program is written in the Visual Studio 2013 development environment. The trained model files and the network script files from CAFFE are placed into the project, realizing the face detection and clothing-color identification functions of the bionic machine peacock.
The code implementing the image recognition function is packaged as an SDK library, so that the other functional modules of the bionic machine peacock, such as the motion execution module, the voice dialogue module, the face recognition module and the color screening module, can conveniently call one another. The overall functionality of the bionic machine peacock is then tested, and the robot is placed in a science and technology museum, hotel or shopping mall to perform its entertainment role.
The invention provides an image recognition method for a bionic machine peacock based on deep learning, aiming to realize accurate and efficient face detection and color recognition for an entertainment bionic robot in a complex environment with good robustness. First, a face detection library is used to create pre-training samples for the convolutional neural networks. Then the structure of the convolutional neural networks is designed, including the numbers of convolutional, pooling and fully connected layers and the parameters of each layer, such as the number and size of convolution kernels and the number of neurons output by the fully connected layers; the network structure and training scripts are written for the deep learning framework CAFFE. The trained deep learning architecture is fine-tuned with live images (with and without audience members) acquired by the bionic machine peacock. Finally, a test program is written to perform real-time face detection and clothing-color identification on the live images captured by the camera.
The present invention is not limited to the above-mentioned preferred embodiments; any structural change made under the teaching of the present invention, where the resulting technical solution is identical or similar to that of the present invention, falls within the protection scope of the present invention.
Claims (7)
1. The image recognition method of the bionic machine peacock based on deep learning is characterized by comprising the following steps:
S1, collecting a public face detection database as the training and verification image data set;
S2, designing a deep learning architecture based on convolutional neural networks, and realizing a face detection function in the deep learning architecture;
S3, collecting live images captured by the camera of the bionic machine peacock to fine-tune the trained convolutional neural networks;
S4, testing with the deep learning architecture fine-tuned in step S3, realizing the face detection function in an indoor complex environment;
S5, according to the face frame position obtained in step S4 and the positional relation between the camera and the audience, obtaining empirical parameters to determine the position of the audience's clothing, and counting the proportion occupied by each color.
2. The image recognition method according to claim 1, wherein the step S1 specifically includes the steps of:
S11, selecting the public WIDER FACE and CelebA datasets from face detection databases as training samples for face detection; normalizing the original images in the image data set to a uniform size;
the WIDER FACE and CelebA datasets provide a large amount of face detection data, with the position of the face frame annotated in each picture;
S12, randomly selecting a frame in each face image, and calculating the overlap (intersection-over-union, IOU) between the selected frame and the real frame;
s13, dividing the face detection data into three types, namely a face positive sample, a face negative sample and a face partial sample, wherein the proportion of the face positive sample, the face negative sample and the face partial sample is 1: 3: 1;
S14, generating file paths for the training samples and creating corresponding labels, where the label content comprises the positive/negative sample label and the offsets between the randomly generated frame and the real frame.
3. The image recognition method according to claim 2, wherein in step S12, a selected frame with IOU > 0.65 is a face positive sample and one with IOU < 0.4 is a face negative sample; face partial samples are those with 0.4 ≤ IOU ≤ 0.65.
4. The image recognition method of claim 1, wherein the deep learning architecture is specifically as follows:
the deep learning framework is formed by cascading three convolutional neural networks, wherein the three convolutional neural networks are a first convolutional neural network PNet, a second convolutional neural network RNet and a third convolutional neural network ONet respectively;
the first convolutional neural network PNet inputs image blocks with the size of 12 × 12 × 3 in the training stage; it is a three-layer fully convolutional network containing no fully connected layer;
the second convolutional neural network RNet inputs image blocks with the size of 24 × 24 × 3; it is a four-layer convolutional network comprising three convolutional layers and one fully connected layer;
the third convolutional neural network ONet inputs image blocks with the size of 48 × 48 × 3; it is likewise a four-layer convolutional network comprising three convolutional layers and one fully connected layer;
first, a training sample is sent into the first convolutional neural network PNet, whose outputs are the sample category, namely face positive or face negative sample, and the face-localization predicted value;
the second convolutional neural network RNet is used to refine the output of the PNet network;
the third convolutional neural network ONet is used to refine the output of the RNet network and outputs the sample category, the face-localization predicted value and the face key point predicted values.
5. The image recognition method according to claim 4, wherein the step S3 specifically includes the steps of:
S31, selecting a plurality of images collected on site by the bionic machine peacock, both with and without audience members, and dividing them into training samples and testing samples; selecting images from the training samples and taking random frames to obtain face positive samples, face negative samples and face partial samples;
S32, sending the determined training samples and labels into the deep learning architecture, and fine-tuning the parameters of the three convolutional neural networks in the deep learning architecture.
6. The image recognition method according to claim 5, wherein the step S4 specifically includes the steps of:
S41, sending the test sample into the first convolutional neural network PNet, and obtaining candidate face frames and the four coordinates corresponding to each face frame;
S42, sending the candidate face frames output by the first convolutional neural network PNet into the second convolutional neural network RNet, which performs refinement screening on the candidate face frames, applies non-maximum suppression, and removes false detections;
S43, sending the screened output of the second convolutional neural network RNet to the third convolutional neural network ONet, which further refines and screens the candidate face frames and removes false detections, finally obtaining the face detection result, which comprises the face frame localization and the positional relation between the camera and the audience.
7. The image recognition method according to claim 1, wherein the step S5 specifically includes the steps of:
S51, according to multiple measurements and calibrations of the camera, selecting a clothing candidate frame at an offset of 1.8 face frames from the detected face frame position in the image, the size of the clothing candidate frame being 2 face frames;
S52, setting the colors included in the proportion statistics to red, yellow, blue, green, purple, white, black and gray, and counting the proportion occupied by each color within the selected clothing candidate frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711374581.5A CN108038455A (en) | 2017-12-19 | 2017-12-19 | Bionic machine peacock image-recognizing method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711374581.5A CN108038455A (en) | 2017-12-19 | 2017-12-19 | Bionic machine peacock image-recognizing method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108038455A true CN108038455A (en) | 2018-05-15 |
Family
ID=62099866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711374581.5A Pending CN108038455A (en) | 2017-12-19 | 2017-12-19 | Bionic machine peacock image-recognizing method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038455A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070013791A1 (en) * | 2005-07-05 | 2007-01-18 | Koichi Kinoshita | Tracking apparatus |
CN105809198A (en) * | 2016-03-10 | 2016-07-27 | 西安电子科技大学 | SAR image target recognition method based on deep belief network |
CN105930796A (en) * | 2016-04-21 | 2016-09-07 | 中国人民解放军信息工程大学 | Single-sample face image recognition method based on depth self-encoder |
CN106022220A (en) * | 2016-05-09 | 2016-10-12 | 西安北升信息科技有限公司 | Method for performing multi-face tracking on participating athletes in sports video |
CN106658169A (en) * | 2016-12-18 | 2017-05-10 | 北京工业大学 | Universal method for segmenting video news in multi-layered manner based on deep learning |
CN107368670A (en) * | 2017-06-07 | 2017-11-21 | 万香波 | Stomach cancer pathology diagnostic support system and method based on big data deep learning |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108985147A (en) * | 2018-05-31 | 2018-12-11 | 成都通甲优博科技有限责任公司 | Object detection method and device |
CN109035221A (en) * | 2018-07-10 | 2018-12-18 | 成都先进金属材料产业技术研究院有限公司 | Macrostructure intelligence ranking method based on image recognition depth learning technology |
CN109685018A (en) * | 2018-12-26 | 2019-04-26 | 深圳市捷顺科技实业股份有限公司 | A kind of testimony of a witness method of calibration, system and relevant device |
CN110119738A (en) * | 2019-05-21 | 2019-08-13 | 中电健康云科技有限公司 | Indentation detection method, device, electronic equipment and readable storage medium storing program for executing |
CN110222623A (en) * | 2019-05-31 | 2019-09-10 | 深圳市恩钛控股有限公司 | Micro- expression analysis method and system |
CN110458005A (en) * | 2019-07-02 | 2019-11-15 | 重庆邮电大学 | It is a kind of based on the progressive invariable rotary method for detecting human face with pseudo-crystalline lattice of multitask |
CN110458005B (en) * | 2019-07-02 | 2022-12-27 | 重庆邮电大学 | Rotation-invariant face detection method based on multitask progressive registration network |
CN110458025A (en) * | 2019-07-11 | 2019-11-15 | 南京邮电大学 | A kind of personal identification and localization method based on binocular camera |
CN110458025B (en) * | 2019-07-11 | 2022-10-14 | 南京邮电大学 | Target identification and positioning method based on binocular camera |
CN111783653A (en) * | 2020-07-01 | 2020-10-16 | 创新奇智(西安)科技有限公司 | Trademark label detection method, detection model, device, equipment and storage medium |
CN111783653B (en) * | 2020-07-01 | 2024-04-26 | 创新奇智(西安)科技有限公司 | Trademark label detection method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180515 |