CN111275082A - Indoor object target detection method based on improved end-to-end neural network


Info

Publication number: CN111275082A
Application number: CN202010039334.5A
Authority: CN (China)
Prior art keywords: target, image, neural network, convolutional neural, training set
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈略峰, 吴敏, 曹卫华, 张平平
Current Assignee: China University of Geosciences
Original Assignee: China University of Geosciences
Application filed by: China University of Geosciences
Priority date / Filing date: 2020-01-14
Publication date: 2020-06-12
Family ID: 71003008

Classifications

    • G06F 18/24 (Pattern recognition; analysing; classification techniques)
    • G06F 18/214 (Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting)
    • G06N 3/045 (Neural networks; architectures, e.g. interconnection topology; combinations of networks)
    • G06N 3/08 (Neural networks; learning methods)
    • G06V 2201/07 (Indexing scheme relating to image or video recognition or understanding; target detection)


Abstract

The invention discloses an indoor object target detection method based on an improved end-to-end neural network. The method comprises: marking each target in a training set with labeling boxes to obtain the category and position information of each target in the training set; initializing a convolutional neural network and preprocessing the training set; dividing each preprocessed training-set image into M × N grid cells; selecting initial candidate boxes using the grid cells; detecting targets in each grid cell to obtain the class confidences of the target categories; setting the output of the convolutional neural network according to the class confidences to obtain the final prediction boxes; training the convolutional neural network to obtain the trained network; and testing the image of the target to be detected with the trained convolutional neural network to determine the category and location of the target object. The invention introduces a feature extraction scheme of pooling first and convolving afterwards into the neural network, reducing the loss of feature information while achieving fast indoor target detection.

Description

Indoor object target detection method based on improved end-to-end neural network
Technical Field
The invention relates to the field of image recognition, and in particular to an indoor object target detection method based on an improved end-to-end neural network.
Background
An intelligent robot must operate in a complex environment in which surroundings, climate, weather, illumination and scenery change in real time, and external factors such as pedestrians and obstacles with varying postures and unpredictable motions may be present during operation. These factors pose great challenges to the robot, which makes research on environment perception algorithms for intelligent mobile robots both significant and difficult. Indoor space is a common working scene for the intelligent emotional robot. Compared with outdoor environments, indoor environments are often more cluttered, making it harder for the robot to understand its surroundings. In addition, the individual demands of people in modern society give everyday articles diverse and varied shapes, which is another challenge for environment understanding. Building descriptions of the objects in the environment and of the relations between an object and its surroundings is of great significance for task execution by an emotional robot. For example, robot navigation requires recognizing and locating objects; face and gesture interaction requires perceiving the surrounding environment (including objects and people); and interacting persons must be recognized and tracked. Establishing environment perception is an essential step for the robot to build a cognitive model of its environment, and it provides information support for the robot's subsequent diversified operations. Scene objects typically include people, tables, chairs and the like; when they appear in the same scene, especially in a complex indoor environment, the difficulty of detection increases significantly. Therefore, accurately detecting objects in a complex indoor environment is one of the difficulties of environment perception technology.
Indoor object target detection comprises three parts: extraction of candidate boxes, detection of the target to be detected, and recognition and localization of the object target. Object detection technology has developed over decades of research and has made great progress in both detection accuracy and speed. Mainstream detection approaches include Deformable Part Models (DPM), deep networks (DN) and decision forests (DF). Traditional detection methods rely on hand-designed feature extractors, achieving object detection by training classifiers on features such as Haar features, Histograms of Oriented Gradients (HOG) and Local Binary Patterns (LBP). However, hand-designed detection features can hardly adapt to the large appearance changes of dynamic objects. Deep networks can learn features directly from image pixels, improving object detector performance. Deep networks are also widely applied in pedestrian detection; with the construction of large-scale training datasets and the continuous growth of hardware computing power, deep network architectures have achieved great success in diverse visual tasks. In target detection, the main approaches are two-stage detectors, such as the R-CNN (Region-CNN), Fast R-CNN and Faster R-CNN series, and one-stage detectors, such as YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector) and CornerNet, which have pushed the accuracy and speed of target detection to new heights. The YOLO neural network is one of the best target detection architectures at present, and it is particularly notable for its real-time detection performance.
Learning-based feature representation has received wide attention and research. Compared with hand-designed features, deep learning features are extracted directly from raw image pixels through a deep network structure, which turns the feature design problem into a network architecture problem. This greatly reduces unnecessary feature design details, and the high-level feature maps of a deep neural network exhibit certain semantic properties; deep learning based methods have achieved the best results in related international benchmarks such as PASCAL VOC and the ImageNet Large Scale Visual Recognition Challenge. Although deep learning yields more essential feature representations, training such networks requires a large amount of data because of the large number of parameters to learn, so the computation is heavy and needs further optimization.
Indoor object target detection can be applied to the environment-perception information processing of an emotional robot system, further improving the machine's cognitive and decision-analysis abilities and enhancing the intelligence and adaptability of human-computer interaction. In particular, by perceiving and reacting to the environment on the basis of visual information of different modalities, richer information can be obtained, creating conditions for higher-level machine intelligence.
Disclosure of Invention
The technical problem to be solved by the invention is the low processing speed and heavy computational load of the prior art; to this end, the invention provides an indoor object target detection method based on an improved end-to-end neural network.
The technical scheme adopted by the invention to solve this technical problem is as follows: an indoor object target detection method based on an improved end-to-end neural network is constructed, comprising the following steps:
S1, constructing an end-to-end convolutional neural network comprising a plurality of pooling layers for reducing image resolution, a plurality of convolutional layers for extracting image features, 1 fully connected layer and 1 classification output layer;
S2, acquiring a target image dataset, constructing a training set based on it, annotating each image in the training set with labeling boxes, and determining the category and position information of each predefined target in the training-set images;
S3, inputting the training set annotated with labeling boxes into the convolutional neural network constructed in step S1 and initializing the network; the input image data first passes through 1 pooling layer that reduces the image resolution, is then fed into the convolutional layer connected to that pooling layer for image feature extraction, has its feature vectors weighted and combined by the fully connected layer, and finally passes through the classification output layer, thereby preprocessing the training-set images;
S4, dividing each image in the preprocessed training set into M × N grid cells; selecting initial candidate boxes for each image using the M × N grid cells obtained by the division, each grid cell randomly generating B initial candidate boxes, for M × N × B initial candidate boxes in total;
S5, detecting the predefined targets in each grid cell obtained by the division to obtain M × N × B class confidences of the target categories; setting the output of the convolutional neural network according to the obtained class confidences and determining the final target prediction boxes;
S6, taking the training set annotated with labeling boxes as the input of the convolutional neural network and the target prediction boxes obtained in step S5 as its output, and training the convolutional neural network to obtain the convolutional neural network finally used for target detection;
S7, inputting the image on which target detection is to be performed into the convolutional neural network trained in step S6 and carrying out indoor object target detection.
Implementing the indoor object target detection method based on the improved end-to-end neural network yields the following beneficial effects:
1. the invention designs an improved end-to-end neural network model and introduces a feature extraction scheme of pooling first and convolving afterwards, reducing the loss of feature information while achieving fast indoor target detection;
2. while the model is improved and fine-tuned, detection and result optimization are carried out on a self-made picture dataset of the emotional robot's human-computer-interaction indoor environment and on the VOC2007 dataset, improving detection performance in the indoor environment;
3. the improved end-to-end neural network model of the invention is verified and analyzed through experimental results: the learning model can be converted from a general model into a specific model for target detection, and the scene categories related to environmental information can be continuously enriched, so that the dataset is enriched and applied to an emotional robot interaction system.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a diagram of the indoor object target detection process based on the improved end-to-end convolutional neural network model of the present invention;
FIG. 2 is a diagram of the improved end-to-end neural network model architecture;
FIG. 3 is a comparison graph of target recognition at different grid scales.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The detection method based on the improved end-to-end convolutional neural network model unifies candidate box extraction, feature extraction, target classification and target localization in a single neural network: the constructed network extracts candidate regions from the image and predicts the positions and probabilities of targets from the features of the whole image. The indoor-environment target detection problem is thereby converted into a regression problem, truly realizing end-to-end detection.
Please refer to FIG. 1, which illustrates the indoor object target detection process based on the improved end-to-end convolutional neural network model of the present invention. The input image is divided into M × N grid cells, and each cell is given B initial candidate boxes of different specifications; the parameters M, N and B are all positive integers greater than or equal to 1. As shown in FIG. 1, prediction candidate boxes are extracted via the convolutional network, and the number of candidate boxes per image is M × N × B.
The specific implementation steps are as follows:
Step 1: construct an end-to-end convolutional neural network comprising a plurality of pooling layers for reducing image resolution, a plurality of convolutional layers for extracting image features, 1 fully connected layer and 1 classification output layer. Referring to FIG. 2, the improved end-to-end convolutional neural network model provided in this embodiment includes 18 convolutional layers for extracting image features, 6 pooling layers for reducing image resolution, 1 classification output layer and 1 fully connected layer. With this structure, on the one hand, the loss of feature information is reduced by the fully connected layer attached to the convolutional layers; on the other hand, a feature extraction scheme of pooling first and convolving afterwards is introduced into the improved end-to-end network, reducing the loss of feature information while achieving fast indoor target detection. Finally, the convolutional neural network module provided by this embodiment converts the detection problem into a regression problem, truly realizing end-to-end detection.
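To make the pooling-before-convolution scheme concrete, the following sketch is a minimal PyTorch illustration under our own assumptions (the channel widths, kernel sizes and 5-stage depth are ours, not the patent's exact 18-convolution/6-pooling layout). It stacks stages in which a pooling layer halves the resolution before the subsequent convolutions extract features, ending in the fully connected layer and the output layer:

```python
import torch
import torch.nn as nn

class PoolThenConvBlock(nn.Module):
    """One 'pool first, then convolve' stage: halve the resolution, then extract features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(2, 2)                # reduce image pixels first
        self.conv = nn.Sequential(                    # then extract image features
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return self.conv(self.pool(x))

class ToyEndToEndDetector(nn.Module):
    """Illustrative skeleton: 5 pool-then-conv stages (448 -> 14), a fully
    connected layer, and an output layer sized M x N x B x 6 per the description."""
    def __init__(self, grid=14, boxes=2, vals=6):
        super().__init__()
        chans = [16, 32, 64, 128, 256, 256]           # assumed widths, not from the patent
        self.stem = nn.Conv2d(3, chans[0], 3, padding=1)
        self.stages = nn.Sequential(*[PoolThenConvBlock(chans[i], chans[i + 1])
                                      for i in range(5)])
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(chans[-1] * grid * grid, 1024),       # fully connected layer
            nn.LeakyReLU(0.1),
            nn.Linear(1024, grid * grid * boxes * vals),    # classification output layer
        )
        self.shape = (grid, grid, boxes, vals)

    def forward(self, x):                             # x: (batch, 3, 448, 448)
        return self.fc(self.stages(self.stem(x))).view(-1, *self.shape)

out = ToyEndToEndDetector()(torch.randn(1, 3, 448, 448))
print(out.shape)   # torch.Size([1, 14, 14, 2, 6])
```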
Step 2: acquiring a target image data set, constructing a training set based on the target image data set, labeling a labeling frame of each image in the training set, and determining the category and position information of each predefined target in the images of the training set;
the target image data set comprises an image data set for carrying out an indoor interaction environment of the emotional robot and a VOC2007 related data set, wherein the data set of the indoor environment is manufactured according to the two data sets, the data set is divided into a training set and a testing set, each image in the training set is labeled through labeling frame labeling software, each target in the images of the training set is determined, and the category and the position information of each target in the images of the training set are obtained; wherein:
in this step, the construction process of the training set and the test set is as follows: ten thousand images are selected from the collected indoor environment images and VOC data sets of the emotional robot as data sets, eight thousand images in the data sets are used as training sets, and the remaining two thousand images are used as testing sets; the training set is used for subsequent convolutional neural network training, and the test set is used for testing the accuracy of finally obtained positioning data when the training set is input into the trained convolutional neural network;
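A minimal sketch of this 8000/2000 split; the shuffling, seed and id format are our own illustrative choices, since the patent only fixes the counts:

```python
import random

def split_dataset(image_ids, train_n=8000, test_n=2000, seed=0):
    """Split the ten thousand collected images into training and test sets."""
    assert len(image_ids) >= train_n + test_n
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)                 # random but reproducible split
    return ids[:train_n], ids[train_n:train_n + test_n]

train_ids, test_ids = split_dataset([f"{i:06d}" for i in range(1, 10001)])
print(len(train_ids), len(test_ids))                 # 8000 2000
```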
In this step, the predefined targets are set as follows: according to the emotional robot's interaction scenes and objects, the predefined targets are four classes present in the images, namely pedestrians, chairs on which people sit, tables and computer displays. Since the existing PASCAL VOC 2007 covers 20 object classes, in this embodiment, to improve detection accuracy, unnecessary labels are removed to suit the recognition of target objects in the indoor environment; the 4 object classes defined above (chairs, tables, people and computer displays) are regarded as the predefined specific model. The output of the last layer (the classification output layer) of the convolutional neural network constructed in the invention corresponds directly to the labels, so this can be realized by controlling the number of outputs of that layer;
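The label reduction and the resulting output size can be sketched as follows. The mapping of the four indoor targets onto the VOC class names 'person', 'chair', 'diningtable' and 'tvmonitor' is our assumption, and the 6-values-per-box output layout is taken from step 6 below:

```python
# 20 PASCAL VOC 2007 classes, of which only 4 indoor targets are kept.
VOC_CLASSES = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
               "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
               "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
INDOOR_CLASSES = ["person", "chair", "diningtable", "tvmonitor"]  # assumed mapping

M, N, B = 14, 14, 2        # grid cells and candidate boxes per cell
VALS_PER_BOX = 6           # [X, Y, W, H, Conf(Object), Conf]

# The classification output layer's size follows directly from this layout:
print(M * N * B * VALS_PER_BOX)   # 2352 output values per image
```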
In this step, the labeling boxes are annotated as follows: each target in the training-set images (pedestrians, chairs on which people sit, tables and computer displays) is marked with a labeling box to obtain the category and position information of every indoor-environment object target in the training-set images. The category information is the class to which the name of the object target belongs, and the position information consists of the coordinates of the center point of the labeling box together with its width and height; the category and position information obtained at this point is stored in xml format in the Annotations folder.
After the completed xml annotation files are converted into txt files suitable for target detection with the improved end-to-end neural network model, a folder for storing the dataset is created under HOME, and three sub-folders named Annotations, ImageSets and JPEGImages are created inside it. The indoor image files are uniformly converted to 'jpg' format, renamed starting from '000001.jpg' following the official PASCAL VOC naming scheme, and finally stored in the JPEGImages folder.
Annotating the image data, i.e. labeling the category and position information of the targets, specifically comprises: saving the annotation information as a file of the same name in 'xml' format and storing it in the Annotations folder; generating a training sample set and a test sample set from the existing data in proportion, generating a 'train.txt' file and a 'test.txt' file that store the absolute path information of the training and test samples, and placing these 'txt' files in the Main folder under the ImageSets folder.
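The patent does not spell out the xml-to-txt conversion; the sketch below assumes the standard PASCAL VOC xml layout and a common normalized '<class> <cx> <cy> <w> <h>' txt line format:

```python
import xml.etree.ElementTree as ET

CLASSES = ["person", "chair", "diningtable", "tvmonitor"]  # assumed VOC names for the 4 targets

def voc_xml_to_txt(xml_path, txt_path):
    """Convert one PASCAL VOC annotation file into one txt line per kept box."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in CLASSES:              # drop the unnecessary VOC labels
            continue
        b = obj.find("bndbox")
        xmin, ymin = float(b.find("xmin").text), float(b.find("ymin").text)
        xmax, ymax = float(b.find("xmax").text), float(b.find("ymax").text)
        # Center coordinates plus width/height, normalized by the image size.
        cx, cy = (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h
        w, h = (xmax - xmin) / img_w, (ymax - ymin) / img_h
        lines.append(f"{CLASSES.index(name)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```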
Step 3: input the training set annotated with labeling boxes into the convolutional neural network constructed in step 1 and initialize the network; the input image data first passes through 1 pooling layer that reduces the image resolution, is then fed into the convolutional layer connected to that pooling layer for image feature extraction, has its feature vectors weighted and combined by the fully connected layer, and finally passes through the classification output layer, thereby preprocessing the training-set images; wherein:
the convolutional neural network is initialized, the training set annotated with labeling boxes is input into it, and the images in the training set are preprocessed. The preprocessing comprises one or more of rotation, contrast enhancement, tilting and scaling; after preprocessing the images carry a certain amount of distortion, and training on the distorted images increases the accuracy of the final image recognition.
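A minimal sketch of these four distortions using Pillow; the concrete angles, factors and probabilities are our own illustrative choices, and in practice the labeling boxes would have to be transformed together with the geometric distortions (omitted here):

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """Apply one or more of: rotation, contrast enhancement, tilt (shear), scaling."""
    if random.random() < 0.5:                            # rotation
        img = img.rotate(random.uniform(-15, 15))
    if random.random() < 0.5:                            # contrast enhancement
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.5))
    if random.random() < 0.5:                            # tilt via an affine shear
        shear = random.uniform(-0.2, 0.2)
        img = img.transform(img.size, Image.Transform.AFFINE,
                            (1, shear, 0, 0, 1, 0))
    if random.random() < 0.5:                            # scaling
        s = random.uniform(0.8, 1.2)
        img = img.resize((int(img.width * s), int(img.height * s)))
    return img
```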
Step 4: in this embodiment, each image in the preprocessed training set is divided into 14 × 14 grid cells; as in YOLO, the grid cells are used to detect the target objects, and the initial candidate boxes are selected with them. Each grid cell randomly generates two initial candidate boxes (alternatively, their widths and heights are defined in advance from experience), for 14 × 14 × 2 candidate boxes in total; the image size is the size specified by the neural network model.
This embodiment changes the grid partition produced by the multi-layer convolution and pooling from the original 7 × 7 to 14 × 14 to increase the size of the network feature map. FIG. 3 compares target recognition under different grid scales: the left side shows target recognition with a 7 × 7 grid and the right side with a 14 × 14 grid. As FIG. 3 shows, the system can predict only 1 target with the 7 × 7 grid, whereas the improved scheme proposed in this embodiment recognizes 2 targets with the 14 × 14 grid. When an image contains multiple target objects, and in particular small ones, the finer grid increases the ability to extract small-target features, so small targets can be recognized. The various targets are the elements that make up different environments, and environments can be distinguished through the recognition of these objects.
In the improved end-to-end convolutional neural network model of the invention, the selected input image size is smaller than that of the image to be detected, which guarantees processing speed and allows rapid class recognition; generally 448 × 448 or 416 × 416 is chosen.
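To make the grid responsibility rule concrete (the cell containing an object's center is the one that detects it, as stated in step 6 below), here is a short sketch under assumed normalized center coordinates; the function and variable names are ours:

```python
def responsible_cell(cx, cy, grid_m=14, grid_n=14):
    """Return the (row, col) of the grid cell containing the box center.

    cx, cy are the labeling-box center coordinates normalized to [0, 1).
    """
    col = min(int(cx * grid_n), grid_n - 1)
    row = min(int(cy * grid_m), grid_m - 1)
    return row, col

# Example: a chair centered at (0.53, 0.71) falls into cell (9, 7) of a 14x14 grid.
print(responsible_cell(0.53, 0.71))   # (9, 7)
```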
Step 5: detect the predefined targets in each grid cell obtained by the division to obtain the 14 × 14 × 2 class confidences of the target categories; set the output of the convolutional neural network according to the obtained class confidences and determine the final target prediction boxes.
In this step, the target prediction boxes are generated as follows (a code sketch follows this list):
(1) first, an initial detection box is generated from the initial preset coordinate points;
(2) next, a dynamic detection box is predicted: the generated detection box is iteratively refined to produce the latest detection box;
(3) then, the coincidence degree of the latest detection box is calculated; if it is greater than or equal to the preset coincidence threshold, the latest detection box is kept; if it is smaller than the threshold, the dynamic detection box prediction continues;
(4) finally, based on the coincidence degree of the detection boxes, the latest detection box that was kept is taken as the target prediction box for detecting the object.
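The patent gives no code for this loop; the sketch below is one possible reading of it, taking the coincidence degree to be the intersection-over-union ratio defined in this step and using a caller-supplied refine_step function as a stand-in for the network's iterative prediction:

```python
def iou(a, b):
    """Coincidence degree: intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-9)

def predict_box(initial_box, refine_step, threshold=0.5, max_iters=10):
    """Iteratively refine a detection box until its overlap with the previous
    iterate reaches the preset coincidence threshold, then keep it.
    threshold and max_iters are illustrative values, not from the patent."""
    box = initial_box
    for _ in range(max_iters):
        new_box = refine_step(box)           # stand-in for the network's prediction
        if iou(new_box, box) >= threshold:   # stable enough: keep as prediction box
            return new_box
        box = new_box                        # otherwise continue predicting
    return box
```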
In this step, the class confidence of the target category is calculated as follows:
target detection is carried out on the basis of the target prediction boxes, and for each target prediction box it is predicted whether the target to be discriminated is present; the discrimination result is given by the confidence $\mathrm{Conf(Object)}$, calculated as

$$\mathrm{Conf(Object)} = \Pr(\mathrm{Object}) \times \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$$

where $\Pr(\mathrm{Object})$ indicates whether an object falls into the grid cell corresponding to the candidate box. If an object falls into the cell, the target confidence of the corresponding candidate box is $\mathrm{Conf(Object)} = \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$; otherwise, the candidate box is determined to contain no object and $\mathrm{Conf(Object)} = 0$. The target confidence can therefore be written as

$$\mathrm{Conf(Object)} = \begin{cases} \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}, & \text{if an object falls into the cell} \\ 0, & \text{otherwise,} \end{cases}$$

where $\mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$ denotes the ratio of the intersection area to the union area of the predicted box and the ground-truth box:

$$\mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} = \frac{\mathrm{area}(B_{\mathrm{pred}} \cap B_{\mathrm{truth}})}{\mathrm{area}(B_{\mathrm{pred}} \cup B_{\mathrm{truth}})}$$
Step 6: take the training-set images annotated with labeling boxes in step 2 as the input of the convolutional neural network and the training-set images with the final target prediction boxes obtained in step 5 as its output, and train the convolutional neural network to obtain the final weights and the trained network. Training the convolutional neural network comprises the following steps:
(1) first, an image to be detected is received and resized according to a preset requirement to generate a first detection image; the first detection image is input into the convolutional neural network for matching and recognition, generating initial candidate boxes, classification identification information and the classification probability values corresponding to that information. During training, each picture in the dataset is annotated with the center coordinates of its objects; when an object falls into a certain grid cell, that cell is responsible for detecting the object, and the two candidate boxes generated by that cell share the category;
(2) next, based on the classification probability values, it is determined whether each initial candidate box has recognized the target object, and the initial candidate boxes that successfully recognize the target object are taken as target prediction boxes. Prediction and discrimination of the target object are performed on the obtained target prediction boxes: with the conditional probability of predicting the target object set to $\Pr(\mathrm{Person} \mid \mathrm{Object})$, the confidence $\mathrm{Conf}$ of the target object in a target prediction box is defined as

$$\mathrm{Conf} = \Pr(\mathrm{Person} \mid \mathrm{Object}) \times \Pr(\mathrm{Object}) \times \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$$

where $\Pr(\mathrm{Object})$ is used to judge whether an object falls into the grid cell corresponding to the target prediction box, and $\mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$ represents the ratio of the intersection area to the union area of the predicted box and the ground-truth box.
It should further be noted that if the probability of recognizing the object in a detection box exceeds the classification probability value, the indoor object is enclosed by the detection box and the object in the picture has been recognized; if the classification probability value is smaller than the preset classification probability threshold, recognition is repeated until the classification probability value exceeds that threshold. The neural network model performs multi-layer convolution operations on the image.
(3) finally, for each target prediction box, the probability of the target object and the position of the bounding box are predicted; the predicted values output for each target prediction box are
[X, Y, W, H, Conf(Object), Conf]
where X and Y are the offsets of the predicted box center relative to its grid cell boundary, and W and H are the width and height of the predicted box as ratios of the whole image. For each input image, the final network output is the M × N × B × [X, Y, W, H, Conf(Object), Conf] vector.
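A sketch of decoding this output back into absolute-coordinate boxes; this is a plain NumPy illustration under our assumptions about the normalization, since the patent only fixes the 6-value layout:

```python
import numpy as np

def decode_output(pred, img_w, img_h):
    """Decode an (M, N, B, 6) network output into absolute-coordinate boxes.

    Each box is [X, Y, W, H, Conf(Object), Conf]: X, Y are center offsets
    within the grid cell, W, H are width/height ratios of the whole image.
    """
    M, N, B, _ = pred.shape
    boxes = []
    for row in range(M):
        for col in range(N):
            for b in range(B):
                x, y, w, h, conf_obj, conf = pred[row, col, b]
                cx = (col + x) / N * img_w            # cell offset -> absolute center
                cy = (row + y) / M * img_h
                bw, bh = w * img_w, h * img_h         # ratios -> absolute size
                boxes.append((cx - bw / 2, cy - bh / 2,
                              cx + bw / 2, cy + bh / 2, conf_obj, conf))
    return boxes

# Example with random output for a 14x14 grid and 2 boxes per cell:
boxes = decode_output(np.random.rand(14, 14, 2, 6), img_w=448, img_h=448)
```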
Step 7: test the indoor-environment images of the test set with the trained convolutional neural network and the final weights to determine the target categories and locations in the indoor environment.
The invention provides an indoor object target detection method based on an improved end-to-end neural network model and uses a deep neural network to carry out object target detection experiments in the emotional robot interaction environment. The end-to-end neural network model is improved, and the experimental results are verified and analyzed. The experimental results show that the improved end-to-end model raises the average precision of object detection on the self-made dataset. Based on the deep neural network, the learning model can be converted from a general model into a specific model for target detection; the scene categories related to environmental information can be continuously enriched, so that the dataset is enriched and applied to an emotional robot interaction system.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (6)

1. An indoor object target detection method based on an improved end-to-end neural network, characterized by comprising the following steps:
S1, constructing an end-to-end convolutional neural network comprising a plurality of pooling layers for reducing image resolution, a plurality of convolutional layers for extracting image features, 1 fully connected layer and 1 classification output layer;
S2, acquiring a target image dataset, constructing a training set based on the target image dataset, annotating each image in the training set with labeling boxes, and determining the category and position information of each predefined target in the training-set images;
S3, inputting the training set annotated with labeling boxes into the convolutional neural network constructed in step S1 and initializing the network; the input data first passes through 1 pooling layer that adjusts the image resolution, is then fed into the convolutional layer connected to that pooling layer for image feature extraction, has its feature vectors weighted and combined by the fully connected layer, and outputs results through the classification output layer, thereby preprocessing the training-set images;
S4, dividing each image in the preprocessed training set into M × N grid cells; selecting initial candidate boxes for each image using the M × N grid cells obtained by the division, each grid cell randomly generating B initial candidate boxes, for M × N × B initial candidate boxes in total, wherein the parameters M, N and B are all positive integers greater than or equal to 1;
S5, detecting the predefined targets in each grid cell obtained by the division to obtain M × N × B class confidences of the target categories; setting the output of the convolutional neural network according to the obtained class confidences and determining the final target prediction boxes;
S6, taking the training set annotated with labeling boxes as the input of the convolutional neural network and the target prediction boxes obtained in step S5 as its output, and training the convolutional neural network to obtain the final convolutional neural network for indoor object target detection;
S7, inputting the image on which indoor object target detection is to be performed into the convolutional neural network trained in step S6 to obtain the target detection result.
2. The indoor object target detection method according to claim 1, wherein in step S2 the target image dataset comprises an image dataset of the emotional robot's indoor interaction environment and the VOC2007 dataset, and image annotation software is used to annotate each image in the training set with labeling boxes to obtain the category and position information of each target in the training-set images.
3. The indoor object target detection method according to claim 2, wherein the predefined targets are set, according to the emotional robot's interaction scenes and objects, to the pedestrians, chairs on which people sit, tables and computer displays contained in the image.
4. The indoor object target detection method according to claim 1, wherein in step S4 each preprocessed training-set image is divided into 14 × 14 grid cells; initial candidate boxes are selected with the grid cells, 2 initial candidate boxes being randomly generated in each grid cell, for 14 × 14 × 2 initial candidate boxes in total.
5. The indoor object target detection method according to claim 1, wherein in step S5 target detection is performed on the target prediction boxes, whether the target to be discriminated exists in each target prediction box is predicted on the basis of the confidence $\mathrm{Conf(Object)}$, and the confidence of any target prediction box in which no target exists is set to 0; wherein the confidence is defined by

$$\mathrm{Conf(Object)} = \Pr(\mathrm{Object}) \times \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$$

where $\Pr(\mathrm{Object})$ is used to judge whether an object falls into the grid cell corresponding to the target prediction box; if a target object exists in the grid cell, the target confidence is set to $\mathrm{Conf(Object)} = \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$; otherwise, it is determined that no target object exists in the target prediction box and the confidence is set to $\mathrm{Conf(Object)} = 0$; $\mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$ denotes the ratio of the intersection area to the union area of the predicted box and the ground-truth box.
6. The indoor object target detection method according to claim 1, wherein in step S6 the training of the convolutional neural network comprises the following steps:
S51, receiving an image to be detected and resizing it according to a preset requirement to generate a first detection image; inputting the first detection image into the convolutional neural network for matching and recognition to generate initial candidate boxes, classification identification information and the classification probability values corresponding to that information;
S52, determining, based on the classification probability values, whether each initial candidate box has recognized the target object, and taking the initial candidate boxes that successfully recognize the target object as target prediction boxes; performing prediction and discrimination of the target object on the obtained target prediction boxes, setting the conditional probability of predicting the target object to $\Pr(\mathrm{Person} \mid \mathrm{Object})$, and defining the confidence $\mathrm{Conf}$ of the target object in a target prediction box as

$$\mathrm{Conf} = \Pr(\mathrm{Person} \mid \mathrm{Object}) \times \Pr(\mathrm{Object}) \times \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$$

where $\Pr(\mathrm{Object})$ is used to judge whether an object falls into the grid cell corresponding to the target prediction box, and $\mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$ represents the ratio of the intersection area to the union area of the predicted box and the ground-truth box;
S53, predicting, for each target prediction box, the probability of the target object and the position of the bounding box, the predicted values output for each target prediction box being
[X, Y, W, H, Conf(Object), Conf]
where X and Y are the offsets of the predicted box center relative to its grid cell boundary and W and H are the width and height of the predicted box as ratios of the whole image; for each input image, the final network output is the M × N × B × [X, Y, W, H, Conf(Object), Conf] vector.

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2020-06-12)