WO2015078185A1

WO2015078185A1 - Convolutional neural network and target object detection method based on same

Info

Publication number: WO2015078185A1
Application number: PCT/CN2014/081676
Authority: WO
Inventors: 欧阳万里; 许春景; 刘健庄; 王晓刚
Original assignee: 华为技术有限公司
Priority date: 2013-11-29
Filing date: 2014-07-04
Publication date: 2015-06-04
Also published as: CN104680508B; CN104680508A

Abstract

A convolutional neural network and a target object detection method based on same. The convolutional neural network comprises: a feature extraction layer (21), a part detection layer (22), a deformation processing layer (23), a shielding processing layer (24) and a classifier (25). The convolutional neural network is united with the optimization feature extraction, part detection, deformation processing, shielding processing and classifier learning, the convolutional neural network is able to learn about the deformation of a target object via the deformation processing layer, and interaction is conducted on the deformation learning and the shielding processing, so that such interaction can increase the capability of the classifier to distinguish the target object from a non-target object according to learnt features.

Description

Convolutional neural network and target object detection method based on convolutional neural network

The present invention relates to data communication technologies, and more particularly to a convolutional neural network and a target object detection method based on a convolutional neural network.

Background technique

Object detection is one of the basic problems in machine vision. After detecting an object, it is convenient to store, analyze, 3D model, identify, track and search the object. Commonly used object detection, such as pedestrian detection, the purpose of pedestrian detection is to find the position and area of the pedestrian in the image. The main difficulty in pedestrian detection is the change of pedestrians in dressing, lighting, background, body deformation and occlusion. In the case of pedestrian detection, first, it is necessary to extract features that distinguish between pedestrians and non-pedestrians. The commonly used methods are Haar-like features and Histogram of Oriented Gradient (HOG). Secondly, since the movement of the pedestrian's body (such as the head, body, and legs) produces a change in the visual information of the pedestrian, a deformable model is proposed to deal with the deformation caused by the movement of the pedestrian. Again, in order to solve the loss of visual information due to occlusion, many methods of dealing with occlusion find the occluded portion of the picture in the picture to avoid using the occluded image information to determine whether there is a pedestrian in the given rectangle. Finally, the classifier is used to determine if a pedestrian is present in a given rectangle.

1 is a schematic diagram of a pedestrian detection method according to the prior art. As shown in FIG. 1, the pedestrian detection method of the prior art mainly includes the following steps: 1. Convolving an input image in a first stage, The result of the product is downsampled to obtain the output of the first stage; 2. The convolution and downsampling are continued according to the output of the first stage to obtain the output of the upper line in the second stage; 3. The output of the first stage is performed through the branch line. Sampling yields the output of the next row in the second phase; 4. Classifies according to the output of the second phase. In this method, the main feature is learning feature extraction. Each 歩 does not have a clear target for the processing result. Therefore, the output is unpredictable, and the body movement and occlusion of the pedestrian are not modeled. When the pedestrian image is deformed and occluded, the effect is poor. 2 is a schematic diagram of a method for pedestrian detection in prior art 2, which divides a pedestrian into a root node composed of a template of the entire pedestrian and a body part of the pedestrian (eg, The child nodes of the head, the upper part of the leg, or the lower part of the leg. The child node has a deformation constraint with the root node, for example, the head cannot be too far away from the body. As shown in FIG. 2, the prior art pedestrian detection method includes the following steps: 1. Feature extraction on an input image to obtain two feature maps with different resolutions; The low-resolution feature map is matched using the filter template as the root node to obtain the matched response. 3. The high-resolution feature map is matched using the filter template as the child node to obtain the matched response. The model in Figure 2 has 5 sub-nodes, so there are 5 sub-node filter templates, and 5 matched responses are obtained;

4. The response of the child node is corrected by the deformation constraint of the root node, and the corrected response is obtained;

5. Get an overall response to the presence of a pedestrian based on the response of the child node and the response of the root node. The prior art 2 can model the deformation of the object part and is more robust to body motion, but this technique uses artificially defined features when matching the template according to the feature map of the object, cannot automatically learn the feature, and cannot handle the occlusion. Case. SUMMARY OF THE INVENTION Embodiments of the present invention provide a convolutional neural network and a target object detection method based on a convolutional neural network, which are capable of processing deformation and occlusion of a target object.

A first aspect of the present invention provides a method for detecting a target object based on a convolutional neural network, the convolutional neural network comprising: a feature extraction layer, a part detection layer, a deformation processing layer, an occlusion processing layer, and a classifier;

The feature extraction layer performs pre-processing on the pixel value of the detection area according to the pixel value of the detection area in the extracted image, and performs feature extraction on the pre-processed image to obtain a feature map of the detection area;

The part detecting layer respectively detects the feature map of the detection area through the M filters, and outputs a response map corresponding to the M parts of the detection area, and each filter is used to detect one part, and each part corresponds to one response diagram. ;

Deformation processing layer respectively determines deformations of the M parts according to the response maps corresponding to the M parts, and determines a score map of the M parts according to deformations of the N parts; the occlusion processing layer is configured according to a score map of the M parts determines an occlusion corresponding to the M parts;

The classifier determines, according to an output result of the occlusion processing layer, that the detection area is No target object.

In a first possible implementation manner of the first aspect of the present invention, the feature extraction layer extracts a pixel value of the detection area in the image, and performs preprocessing on the pixel value in the detection area, including: the feature extraction layer Extracting a pixel value of the detection area in the image, converting the pixel value of the detection area into data of three channels, wherein the three channels are a first channel, a second channel, and a third channel;

The output data of the first channel corresponds to Y channel data of a YUV pixel value in the detection area;

The second channel is configured to reduce the size of the detection area to a quarter of the original size, and convert the reduced detection area into a YUV format, and convert the image to a YUV format by a Sobel edge operator. a first edge map of the detection area on the three channels Y, U, and V, wherein the Y, υ, and V channels respectively correspond to a first edge map, and the three first edge maps are taken a maximum value at each position in the middle, forming a second edge map, the three first edge maps and the second edge map having the same size, each being a quarter of the detection area, and the three a mosaic of the edge map and the second edge map as output data of the second channel;

The third channel is configured to reduce the size of the detection area to a quarter of the original size, and convert the reduced detection area into a YUV format, and convert the image to a YUV format by a Sobel edge operator. a detection area, respectively obtaining a first edge map of the detection area on three channels of Y, U, V, wherein the Y, U, V channels respectively correspond to a first edge map, and generate a third edge map, The data of each position of the third edge map is 0, and the three first edge maps and the third edge map have the same size, which are all a quarter of the detection area, and the three first edge maps are And a mosaic of the third edge map as output data of the third channel.

In a second possible implementation manner of the first aspect of the present invention, the part detecting layer includes three sub-layers, which are a first sub-layer, a second sub-layer, and a third sub-layer, respectively A sub-layer includes M1 filters, a second sub-layer of the portion detecting layer includes M2 filters, and a third sub-layer of the portion detecting layer includes M3 filters, where M1+M2+M3=M And M1 filters of the first sub-layer of the part detecting layer respectively detect M1 parts in the detection area to obtain M1 response pictures; The M2 filters of the second sub-layer of the part detecting layer respectively detect M2 parts in the detection area, and obtain M2 response patterns;

The M3 filters of the third sub-layer of the part detecting layer respectively detect M3 parts in the detection area to obtain M3 response maps.

In a third possible implementation manner of the first aspect of the present invention, the deformation processing layer respectively determines deformations of the M parts according to the response maps corresponding to the M parts, and according to the deformation of the M parts Determining the score map of the M parts, including:

The deformation processing layer obtains the shape of the Pth portion according to the formula (1) according to the response map corresponding to the M parts:

B _p = M _p + _∑ D _n ^ _p (1) where ^ denotes that the shape of the p-th part becomes a partial graph, l ≤ _p ≤ M, M _p denotes a response map corresponding to the p-th part, and N denotes a constraint condition of the p-th part, ^D w represents a score map corresponding to the n-th constraint condition, 1 ≤ ^ ≤ 0^ represents a weight corresponding to the n-th constraint condition; the deformation processing layer becomes a partial map according to the shape, Determine the score map of the Pth part according to formula (2):

=maxB(, (2) where s ) represents the value of ^ at the position (χ, y).

In a fourth possible implementation manner of the first aspect of the present invention, the occlusion processing layer includes three sub-layers, respectively a first sub-layer, a second sub-layer, and a third sub-layer, and the occlusion processing layer is The score maps of the M parts determine the occlusion corresponding to the M parts, including:

The occlusion processing layer determines a score map and visibility of the M portions on a sublayer of the occlusion processing layer;

The first sub-layer, the second sub-layer, and the third sub-layer of the occlusion processing layer are respectively according to formula (3),

(4), (5) Calculate the visibility of each part:

h;'^S((h') ^T w^ + c' _{+ g} ), 1=1, 2 (4) y = S((h ³ ) ^T w ^ds + b) (5) where, denotes the P a score map of the part on the first layer of the occlusion treatment layer, A weight matrix representing 4, indicating an offset of 4, indicating the visibility of the Pth portion on the first layer of the occlusion processing layer, ^tXl + expi-t)) - ¹ , indicating that the Pth portion is The visibility on the Zth sublayer of the occlusion processing layer, using ^ to represent the transfer matrix between ^ and ^, represents the jth column of ^, W ^ds represents the parameter of the linear classifier of the implicit variable ^, «representation matrix The transposition of X represents the output of the convolutional neural network. A second aspect of the present invention provides a convolutional neural network, including:

a feature extraction layer, configured to preprocess a pixel value of the detection area according to a pixel value of the detection area in the extracted image, and perform feature extraction on the preprocessed image to obtain a feature map of the detection area;

a part detecting layer, configured to respectively detect a feature map of the detection area by M filters, and output a response map corresponding to M parts of the detection area, each filter is used to detect one part, and each part corresponds to one response Figure

a deformation processing layer, configured to respectively determine deformations of the M parts according to the response maps corresponding to the M parts, and determine a score map of the M parts according to the deformation of the N parts; the occlusion processing layer, Determining the occlusion corresponding to the M parts according to the score map of the M parts;

And a classifier, configured to determine, according to an output result of the occlusion processing layer, whether there is a target object in the detection area.

In a first possible implementation manner of the second aspect of the present invention, the feature extraction layer includes three channels, which are a first channel, a second channel, and a third channel, respectively;

The second channel is configured to reduce the size of the detection area to a quarter of the original size, and convert the reduced detection area into a YUV format, and convert the image to YUV by Sobel edge operator filtering a detection area of the format, respectively obtaining a first edge map of the detection area on three channels of Y, U, V, wherein the Y, U, V channels respectively correspond to a first edge map, and the three first edges are taken The maximum value at each position in the figure constitutes a second edge map, and the three first edge maps and the second edge map are the same size, which are one quarter of the detection area. a size, the mosaic of the three first edge maps and the second edge map is used as output data of the second channel;

The third channel is configured to reduce the size of the detection area to a quarter of the original size, and convert the reduced detection area into a YUV format, and convert the image to YUV by Sobel edge operator filtering. a detection area of the format, respectively obtaining a first edge map of the detection area on three channels of Y, U, V, wherein the Y, υ, V channels respectively correspond to a first edge map, and generate a third edge map, The data of each position of the third edge map is 0, and the three first edge maps and the third edge map have the same size, all of which are one quarter of the detection area, and the three first edges are The figure and the mosaic of the third edge map serve as output data of the third channel.

In a second possible implementation manner of the second aspect of the present invention, the part detecting layer includes three sub-layers, which are a first sub-layer, a second sub-layer, and a third sub-layer, respectively A sub-layer includes M1 filters, a second sub-layer of the portion detecting layer includes M2 filters, and a third sub-layer of the portion detecting layer includes M3 filters, where M1+M2+M3=M The first sub-layer of the part detecting layer is configured to respectively detect M1 parts in the detection area by using M1 filters, and obtain M1 response patterns;

a second sub-layer of the part detecting layer is configured to respectively detect M2 parts in the detecting area by using M2 filters, and obtain M2 response patterns;

The third sub-layer of the part detecting layer is configured to respectively detect M3 parts in the detection area by M3 filters, and obtain M3 response patterns.

In a third possible implementation manner of the second aspect of the present invention, the deformation processing layer is specifically configured to:

Β _ρ = Μ _{ρ η} , _ρ . _η , _ρ ( 1 ) where ^ denotes that the shape of the ρth portion becomes a partial graph, ί ≤ _ρ ≤ Μ, Μ _ρ represents a response map corresponding to the ρth portion, and Ν represents a constraint condition of the ρth portion , ^D ",p represents the score map corresponding to the nth constraint condition, 1 ≤ ^ ≤ 0^ represents the weight corresponding to the nth constraint condition; the deformation processing layer becomes a partial map according to the shape, and is determined according to the formula (2) The score map of the third part: =maxB(, (2) where β ) represents the value at the position (χ, y).

In a fourth possible implementation manner of the second aspect, the occlusion processing layer includes three sub-layers, which are a first sub-layer, a second sub-layer, and a third sub-layer;

(4), (5) Calculate the visibility of each part:

= + g (3) h; ^l ^S((h') ^T w^ + c'+g' _s ' ), 1=1,2 (4) y^S(( ) ^T w ^ds + b) ( 5) wherein, the score map of the Pth portion on the first layer of the occlusion processing layer, ^ indicates the weight matrix of 4, and the offset of 4 indicates that the Pth portion is at the occlusion processing layer The visibility on the 1st layer, ^tXl + expi-t)) - ¹ , indicates the visibility of the Pth part on the Zth sublayer of the occlusion processing layer, and the transfer matrix between ^ and ^ is represented by ^ , represents the jth column of ^, represents the parameter of the linear classifier of the implied variable ^, « represents the transpose of the matrix X, and represents the output result of the convolutional neural network. An embodiment of the present invention provides a method subject, including:

The convolutional neural network and the object detection method based on convolutional neural network in the embodiment of the present invention are combined with optimized convolutional neural network model integrating optimization feature extraction, part detection, deformation processing, occlusion processing and classifier learning. Through the deformation processing layer, the convolutional neural network can learn the deformation of the target object, and the deformation learning and the occlusion processing interact, and the interaction can improve the ability of the classifier to distinguish the target object from the non-target object according to the learned feature. Illustration

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description of the drawings used in the embodiments or the prior art description will be briefly described below. Obviously, the drawings in the following description Are some embodiments of the invention, to those of ordinary skill in the art In other words, other drawings can be obtained based on these drawings without paying for creative labor.

1 is a schematic diagram of a pedestrian detection method according to prior art 1;

2 is a schematic diagram of a method for pedestrian detection in the prior art 2;

3 is a flow chart of an embodiment of a method for detecting a target object based on a convolutional neural network according to the present invention;

4 is a schematic view of a filter for detecting various parts of a body according to the present invention;

Figure 5 is a schematic diagram showing the results of the detection of the part detection layer;

6 is a schematic diagram of an operation flow of a deformation processing layer;

7 is a schematic view showing a processing procedure of an occlusion processing layer;

8 is a schematic diagram of detection results of a target object according to the present invention;

Figure 9 is a schematic view of the overall model of the present invention;

10 is a schematic structural view of an embodiment of a convolutional neural network according to the present invention;

11 is a schematic structural diagram of still another embodiment of a convolutional neural network according to the present invention;

The technical solutions in the embodiments of the present invention will be clearly and completely described in conjunction with the accompanying drawings in the embodiments of the present invention. It is a partial embodiment of the invention, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

3 is a flowchart of an embodiment of a method for detecting a target object based on a convolutional neural network according to the present invention. In this embodiment, a convolutional neural network includes: a feature extraction layer, a part detection layer, a deformation processing layer, an occlusion processing layer, and a classifier. As shown in FIG. 3, the method in this embodiment may include: Step 101: The feature extraction layer preprocesses the pixel values of the region according to the pixel values of the detection regions in the extracted image, and performs feature on the preprocessed image. Extraction, obtaining a feature map of the detection area.

In this embodiment, detecting the target object only refers to detecting whether there is a target object in the detection area, and the detection area may be an arbitrarily set area, for example, an image is divided into two rectangular frames, and each rectangular frame is As a detection area. The target object can be a pedestrian, steam Cars, animals, etc. In this embodiment, between the feature extraction of the image in the detection area, the image is pre-processed to eliminate some interference factors of the image, and any existing method, such as gradation transformation, may be adopted for the image pre-processing. , histogram correction, image smoothing, etc.

In this embodiment, the feature extraction layer extracts the pixel value of the detection area in the image, and converts the pixel value of the detection area into three channels of data, and the three channels are the first channel, the second channel, and the third channel, respectively. The data for each channel is acquired independently as an input part of the entire model.

Specifically, the output data of the first channel corresponds to the data of the Y channel of the YUV pixel value in the detection area.

The second channel is used to reduce the size of the detection area to a quarter of the original size, and converts the reduced detection area into a YUV format, and converts it into a detection area of the YUV format by Sobel edge operator filtering, respectively obtaining a detection area. The first edge map is on the three channels of Y, υ, V, wherein the Y, υ, and V channels respectively correspond to a first edge map, and the maximum values at each position in the three first edge maps are formed to form a second The edge map, the three first edge maps and the second edge map are the same size, all of which are one quarter of the detection area, and the mosaic map of the three first edge maps and the second edge map is used as the output data of the second channel. .

The third channel is used to reduce the size of the detection area to a quarter of the original size, and converts the reduced detection area into a YUV format, and converts it into a detection area of the YUV format by Sobel edge operator filtering, respectively obtaining a detection area. The first edge map on the three channels Y, U, V,

The Y, U, and V channels respectively correspond to a first edge map, and generate a third edge map. The data of each position of the third edge map is 0, and the three first edge maps and the third edge map have the same size, and both are detection regions. a quarter of the size, the mosaic of the three first edge map and the third edge map as the output data of the third channel;

The output data of the first channel, the second channel, and the third channel are used as pre-processed pixel values, and then the pre-processed image is subjected to feature extraction to obtain a feature map of the detection region, and the feature extraction layer can pass the direction gradient value. The feature maps of the abbreviated regions are extracted by means of a square map H0G, SIFT, Gabor, LBP, and the like.

Step 102: The part detecting layer respectively detects the feature map of the detection area through the M filters, and outputs a response map corresponding to the M parts of the detection area, and each filter is used to detect one part, and each part corresponds to one response picture.

The part detection layer can be regarded as a downsampling layer of the convolutional neural network system, through M The filter detects the feature map of the detection area separately, and obtains more detailed feature parts than the feature map. In this embodiment, the part detecting layer includes three sub-layers, which are a first sub-layer, a second sub-layer, and a third sub-layer, respectively, and the first sub-layer of the part detecting layer includes M1 filters, and the second part of the detecting layer The sublayer includes M2 filters, and the third sublayer of the part detection layer includes M3 filters, wherein M1, M2, and M3 are positive integers greater than 1, and M1+M2+M3=M, usually, for one In the case of a convolutional layer, the size of the corresponding filter is fixed. However, in the case of pedestrian detection, since the size of each part of the human body is different, the size of each filter may be different in this embodiment, and the present invention is not correct. This is a limitation.

M1 filters of the first sub-layer of the part detecting layer respectively detect M1 parts in the detection area to obtain M1 response pictures, and M2 filters of the second sub-layer of the part detecting layer respectively detect M2 in the detection area At the site, M2 response maps are obtained; M3 filters of the third sub-layer of the part detection layer respectively detect M3 parts in the detection area, and M3 response maps are obtained.

The following will be explained by a specific example. Suppose Ml is 6, M2 is 7, and M3 is 7, that is, the first sub-layer has 6 filters, the second sub-layer has 7 filters, and the third sub-layer has 7 The filter has a total of 20 filters. In this embodiment, the filters of each sub-layer are related to each other, the filter of the first sub-layer is smaller, and the filter of the second sub-layer is larger than the first sub-layer. The filter of the third sub-layer is larger than the filter of the first sub-layer, and the filter of the second sub-layer can be combined by the filter of the first sub-layer according to certain rules, and the filter of the third sub-layer The filter of the second sub-layer can be combined according to certain rules, as shown in FIG. 4, FIG. 4 is a schematic diagram of the filter for detecting various parts of the body according to the present invention, the first filter and the first sub-layer The two filters combine to obtain the first filter of the second sub-layer, the first filter of the first sub-layer and the third filter combine to obtain the second filter of the second sub-layer, but some filters Cannot be combined, such as the first filter of the first sublayer and The fifth filter is not combinable. The parameters of each filter are obtained when training the convolution network. In this step, only the respective filters and convolution images are convoluted to obtain 20 response graphs, each filtering. The device outputs a response map, and each response map corresponds to some parts of the target object to obtain the position of each part of the target object. Figure 5 is a schematic diagram showing the results of the detection of the part detection layer.

Step 103: The deformation processing layer determines the deformation of the M parts according to the response map corresponding to the M parts, and determines the score map of the M parts according to the deformation of the N parts.

The part detecting layer can detect some parts of the target object appearing in the detection area, In the actual image, the target object will be deformed due to the movement of various parts. For example, the movement of the pedestrian body (such as the head, body, and legs) will cause the deformation of the pedestrian's visual information. The deformation processing layer is to learn the various parts of the target object. The correlation relationship before the row change, the deformation processing layer extracts the M position positions and the scores which are most suitable for the human body from the M part detection response maps, thereby extracting the association between the respective parts.

The deformation processing layer determines the deformation of the M parts according to the response maps corresponding to the M parts, and determines the score maps of the M parts according to the deformation of the M parts, specifically:

First, the deformation processing layer obtains the shape of the M parts according to the formula (1) according to the response map corresponding to the M parts:

Where β _ρ indicates that the shape of the p-th part becomes a partial graph, 1 ≤ _ρ ≤ Μ, Μ _ρ represents the response map corresponding to the ρ-th part, Ν represents the constraint condition of the ρ-th part, and ^D w represents the n-th constraint condition Corresponding score graph, 1 ≤ ^ ≤ 0^ represents the weight corresponding to the nth constraint condition, where each constraint condition corresponds to one deformation, taking the first part as the human head as an example, the head movement usually has left The rotation, right turn, down, and up are deformed in four. Each constraint corresponds to one weight, and the weight is used to indicate the probability of each deformation of the head.

The shape of each part is calculated by the formula (1), and then the deformation processing layer becomes a sub-graph according to the shape, and the score map of the first part is determined according to the formula (2):

=maxB( , ( 2 ) where β ) represents the value of ^ at the position of (x, y), and the meaning of the above formula is to take the maximum value of the P-th shape into a partial map, and the position corresponding to the maximum value is the first P Part of the location, therefore,

The position of the P part can be expressed as. Figure 6 is a schematic diagram of the operation flow of the deformation processing layer. In the figure, ^M P represents the response map corresponding to the p-th part, and represents the first restriction condition of the P-th part, and represents the second restriction condition of the P-th part, indicating the P-th. The third constraint of the part, A represents the fourth constraint of the P part, ρ represents the weight corresponding to the first constraint, ⁰ ^ represents the weight corresponding to the second constraint, and ⁰ ^ represents the third restriction The weight corresponding to the condition, ⁰ ^ indicates the fourth limit The weight corresponding to the condition is then weighted, and then the respective constraint conditions and the response map corresponding to the p-th part are weighted and summed to obtain a shape of the P-th part, and then the shape becomes a coordinate corresponding to the maximum value in the partial image (X, y) Position as the best position for Part P.

Step 104: The occlusion processing layer determines the occlusion corresponding to the M parts according to the score map of the M parts.

The deformation processing layer gives the score map s ^{= {Sl} A''' ^} of each part, and determines the occlusion corresponding to each part according to the score map of each part. In this embodiment, the occlusion processing layer includes three sub-layers, which are a first sub-layer, a second sub-layer, and a third sub-layer, respectively, and the occlusion processing layer determines the occlusion corresponding to the M parts according to the score map of the M parts, specifically :

The occlusion processing layer determines the score map and visibility of the M parts on the sub-layer of the occlusion processing layer; the first sub-layer, the second sub-layer, and the third sub-layer of the occlusion processing layer are respectively according to formulas (3), (4) ),

(5) Calculate the visibility of each part:

= + g (3) h; ^l ^S((h') ^T w^ + c'+g' _s ' ), 1=1, 2 (4) y ^S((h ³ ) ^T w ^ds +b (5) where 4 indicates the score map of the Pth portion on the first layer of the occlusion processing layer, and the weight matrix indicated, indicating the offset of 4, indicating that the Pth portion is on the first layer of the occlusion processing layer. Visibility, ) is a type 8 function,

Indicates the visibility of the Pth part on the Zth sublayer of the occlusion processing layer, and the transfer matrix between ^ and ^, the jth column of ^, the parameter of the linear classifier of the implied variable ^, Represents a transpose of matrix X, representing the output of the convolutional neural network.

In this embodiment, only the hidden variables of adjacent layers are connected to each other, and each part may have multiple parent nodes and child nodes, and the visibility of each part is related to the visibility of other parts of the same layer. Expressed as having the same parent node, the visibility of the next layer is related to the visibility of several parts of the previous layer. As shown in FIG. 7, FIG. 7 is a schematic view of the processing process of the occlusion processing layer, and the visibility of the first two portions of the first layer is strongly correlated with the visibility of the second layer, because structurally, The two parts mentioned can be combined to obtain the part of the second layer, that is, the two parts of the previous layer have higher visibility in the image (the degree of matching of the parts is relatively high), which directly causes the latter layer to be combined by them. The visibility of the parts is also relatively high. In addition to the part of the previous layer, the visibility of the second layer is also related to the score of its own part, which is intuitively understood. Yes, when the matching score of a part is relatively high, the visibility is naturally higher. All parameters of the occlusion processing layer are learned by the backward propagation algorithm.

Step 105: The classifier determines whether there is a target object in the detection area according to an output result of the occlusion processing layer.

The occlusion processing layer determines the occlusion degree of each part according to the score map of each part, and the occlusion degree is embodied by visibility. The classifier determines whether there is a target object in the detection area according to the output result of the occlusion processing layer, and outputs the detection result. . As shown in Fig. 8, Fig. 8 is a schematic view showing the detection result of the target object of the present invention.

The method provided in this embodiment is a unified convolutional neural network model which combines optimized feature extraction, part detection, deformation processing, occlusion processing and classifier learning, and enables the convolutional neural network to learn the target object through the deformation processing layer. The deformation, and deformation learning and occlusion processing interact, which enhances the ability of the classifier to distinguish between pedestrians and non-pedestrians based on the learned features.

Before adopting the convolutional neural network-based target object detection method provided in the first embodiment, it is first necessary to pre-train the convolutional neural network to obtain parameters of each layer of the convolutional neural network. In the present invention, all of our parameters, including image features, deformation parameters, and visibility relationships, can be learned through a unified architecture. In the process of training such a multi-level network, a multi-level training strategy is adopted. First, a supervised learning method was used to learn a convolutional network with only one layer. In this process, a Gabor filter was used as the initial value of the filter. When the network of this layer is well learned, add a second layer, and then learn the two-tier network, and the network that only learned one layer before is treated as the initial value. Throughout the learning process, all parameters are learned using the method of backward propagation.

After each parameter is obtained through one pre-training, the parameters obtained by the learning can also be adjusted. The following is an example of parameter adjustment of the occlusion estimation layer. The prediction error updates all parameters by the backward propagation method, where the propagation for s The expression of the gradient is as follows: :", dL dL dh _ dL

Where Θ denotes the Hadamard product, and the Hadamard product computes as (" ^ΰ ""· ^{= t7} " , where L represents the loss function.

The loss function can have many forms. For example, for a squared error loss function, the expression is:

L = y _gnd - f / 2 ,

For a logarithmic error loss function, its expression is:

^L = y _gnd ^lo g ί + (ΐ- y _gnd ) iog(i - y)

Wherein, the actual result of the training sample is represented, and the output result obtained by the convolutional neural network of the present invention is expressed. If the value of the loss function does not satisfy the preset condition, the parameters are continuously trained until the loss function satisfies the preset condition. .

Based on the first embodiment, the second embodiment of the present invention will be described in detail by using a specific example. As shown in FIG. 9, FIG. 9 shows the overall model of the present invention. First, enter an image of size ⁸⁴ x ^72. The image consists of 3 layers. The first layer is convolved on the input image. The size of the partial sliding window is ^9x9 . The filtered 64-layer ^{76x is obtained.} The image of the size is then averaged according to the four surrounding pixels adjacent to each pixel point to obtain a 64-layer image of l ⁹ x l ⁵ size, and then the feature map of the image of the size of ¹⁹ × ^{1.5 is} extracted, These processes are completed by the feature extraction layer, and then the second layer convolution operation is performed on the extracted feature map by the location detection. The 20 filters are used to filter the image to obtain 20 parts response maps. Then, The deformation processing layer determines the score maps of 20 parts according to the response maps of 20 parts. Finally, the occlusion processing layer determines the occlusion corresponding to 20 parts according to the score map of 20 parts, and obtains 20 parts. Visibility The visibility determination portion 20 determines whether an object within the detection target area.

10 is a schematic structural diagram of an embodiment of a convolutional neural network according to the present invention. As shown in FIG. 10, the convolutional neural network provided by the present embodiment includes: a feature extraction layer 21, a portion detection layer 22, a deformation processing layer 23, and an occlusion processing layer. 24 and classifier 25.

The feature extraction layer 21 is configured to preprocess the pixel value of the detection area according to the pixel value of the detection area in the extracted image, and perform feature extraction on the preprocessed image to obtain a feature map of the detection area; The part detecting layer 22 is configured to respectively detect the feature map of the detection area by the M filters, and output a response map corresponding to the M parts of the detection area, each filter is used for detecting one part, and each part corresponds to one response picture;

The deformation processing layer 23 is configured to respectively determine deformations of the M parts according to the response maps corresponding to the M parts, and determine the score maps of the M parts according to the deformation of the N parts;

The occlusion processing layer 24 is configured to determine the occlusion corresponding to the M parts according to the score map of the M parts;

The classifier 25 is configured to determine whether there is a target object in the detection area according to an output result of the occlusion processing layer.

In this embodiment, the feature extraction layer 21 may include three channels, which are respectively a first channel, a second channel, and a third channel; wherein, the output data of the first channel corresponds to the Y channel data of the YUV pixel value in the detection area;

The second channel is configured to reduce the size of the detection area to a quarter of the original size, and convert the reduced detection area into a YUV format, and convert it into a detection area of the YUV format by Sobel edge operator filtering, and respectively detect The first edge map of the three channels Y, U, V, the Y, υ, V channels respectively correspond to a first edge map, taking the maximum value at each position in the three first edge maps to form a second edge The three first edge maps and the second edge maps are the same size, each of which is a quarter of the detection area, and the mosaic map of the three first edge maps and the second edge map is used as the output data of the second channel;

The third channel is configured to reduce the size of the detection area to a quarter of the original size, and convert the reduced detection area into a YUV format, and convert it into a detection area of the YUV format by Sobel edge operator filtering, and respectively detect The first edge map of the region on the Y, U, V channels, the Y, U, V channels respectively correspond to a first edge map, generating a third edge map, the data of each position of the third edge map is 0, three The first edge map and the third edge map are the same size, and each is a quarter of the detection area, and the mosaic of the three first edge maps and the third edge map is used as the output data of the third channel.

The part detecting layer 22 includes three sub-layers, which are a first sub-layer, a second sub-layer and a third sub-layer, respectively, the first sub-layer of the part detecting layer includes M1 filters, and the second sub-layer of the part detecting layer includes M2 Filter, the third sub-layer of the part detection layer comprises M3 filters, wherein M1+M2+M3=M; the first sub-layer of the part detection layer is used for separately detecting by Ml filters Measuring M1 parts in the detection area to obtain M1 response maps; the second sub-layer of the part detection layer is used for detecting M2 parts in the detection area by M2 filters respectively, and obtaining M2 response patterns; The third sub-layer is used to detect M3 parts in the detection area by M3 filters respectively, and obtain M3 response patterns.

The deformation processing layer 23 is specifically configured to: according to the response map corresponding to the M parts, respectively obtain the shape of the Pth part according to the formula (1):

Where β _ρ indicates that the shape of the p-th part becomes a partial graph, 1≤Ρ≤Μ Μ„table small ρ

For the corresponding response map, Ν denotes the constraint condition of the part, ^{D. /"} ν η water limiter corresponds to the score map, 1 ≤ ^ ≤ 0^ represents the weight corresponding to the nth constraint condition;

And according to the shape becomes a sub-graph, according to formula (2) to determine the score map of the first part

s„ = maxB (x,y) (2)

p where β ) represents the value of β _ρ at the position (x, y).

The occlusion processing layer 24 includes three sub-layers, respectively:

The first sub-layer, the second sub-layer, and the sub-layer of the occlusion processing layer calculate the visibility of each part according to formulas (3), (4), and (5), respectively:

(3) (4) y^S((h ³ ) ^T w ^cls + b) (5) where ^ represents the score map of the Pth part on the first layer of the occlusion treatment layer, table /", hi

The weight matrix, the small offset of the table, means: the first part of the P part in the occlusion processing layer

The visibility on the layer, ^tXl + expi-t)) - ¹ , indicates the visibility of the Pth part on the Zth sublayer of the occlusion processing layer, and the transfer matrix between ^ and ^ is represented by ^ The jth column, which represents the parameter of the linear classifier of the implied variable ^, "represents the transpose of matrix X, which represents the output of the convolutional neural network.

The convolutional neural network provided in this embodiment is used to implement the technical solution provided by the method embodiment shown in FIG. 3. The specific implementation manner and technical effects are similar, and details are not described herein again. 11 is a schematic structural diagram of still another embodiment of a convolutional neural network according to the present invention. As shown in FIG. 11, the convolutional neural network 300 of this embodiment includes: a processor 31 and a memory 32, and the processor 31 and the memory 32 are connected by a bus. The memory 32 stores execution instructions. When the convolutional neural network system 300 is in operation, the processor 31 communicates with the memory 32, and the processor 31 executes instructions to cause the convolutional neural network 300 to perform the convolutional neural network system based on the present invention. Target object detection method. In this embodiment, the feature extraction layer, the part detection layer, the deformation processing layer, the occlusion processing layer, and the classifier of the convolutional neural network may be implemented by the processor 31, and the functions of the layers are performed by the processor 31. specifically:

The processor 31 controls the feature extraction layer to preprocess the pixel values of the detection area according to the pixel values of the detection area in the extracted image, and performs feature extraction on the preprocessed image to obtain a feature map of the detection area;

The control part of the processor 31 detects the feature map of the detection area through the M filters, and outputs a response map corresponding to the M parts of the detection area, and each filter is used for detecting one part, and each part corresponds to one response picture;

The processor 31 controls the deformation processing layer to determine the deformation of the M parts according to the response maps corresponding to the M parts, and determines the score map of the M parts according to the deformation of the N parts;

The processor 31 controls the occlusion processing layer to determine the occlusion corresponding to the M parts according to the score map of the M parts;

The processor 31 controls the classifier to determine whether or not there is a target object in the detection area based on the output result of the occlusion processing layer.

In this embodiment, the feature extraction layer includes three channels, which are a first channel, a second channel, and a third channel, respectively.

The output data of the first channel corresponds to the Y channel data of the YUV pixel value in the detection area;

The second channel is used to reduce the size of the detection area to a quarter of the original size, and converts the reduced detection area into a YUV format, and converts it into a detection area of the YUV format by Sobel edge operator filtering, respectively obtaining a detection area. The first edge map on the Y, U, V channels, the Y, U, V channels respectively correspond to a first edge map, taking the maximum values at each position in the three first edge maps to form a second edge map The three first edge maps and the second edge maps have the same size, all of which are one quarter of the detection area, and the three first edge maps and the second edge map are spelled together. Connected as the output data of the second channel;

The third channel is used to reduce the size of the detection area to a quarter of the original size, and converts the reduced detection area into a YUV format, and converts it into a detection area of the YUV format by Sobel edge operator filtering, respectively obtaining a detection area. In the first edge map on the Y, U, V channels, the Y, U, and V channels respectively correspond to a first edge map, and a third edge map is generated, and the data of each position of the third edge map is 0, three The first edge map and the third edge map are the same size, and each is a quarter of the detection area, and the mosaic of the three first edge maps and the third edge map is used as the output data of the third channel.

The part detecting layer comprises three sub-layers, namely a first sub-layer, a second sub-layer and a third sub-layer, respectively, the first sub-layer of the part detecting layer comprises Ml filters, and the second sub-layer of the part detecting layer comprises M2 a filter, the third sub-layer of the part detecting layer includes M3 filters, wherein

M1+M2+M3=M; M1 filters of the first sub-layer of the part detection layer respectively detect M1 parts in the detection area to obtain M1 response pictures; M2 filters of the second sub-layer of the part detection layer M2 regions in the detection area are respectively detected, and M2 response maps are obtained; M3 filters in the third sub-layer of the part detection layer respectively detect M3 parts in the detection area, and M3 response maps are obtained.

In this embodiment, the deformation processing layer determines the deformation of the M parts according to the response maps corresponding to the M parts, and determines the score maps of the M parts according to the deformation of the M parts, specifically: the deformation processing layer corresponds to the M parts. The response graph, according to the formula (1), obtains the shape of the Pth part into a subgraph:

Wherein, the shape indicating the ρth portion becomes a partial graph, l≤p≤M, M _p represents a response map corresponding to the pth portion, N represents a restriction condition of the pth portion, ^D ",p represents an nth constraint condition Corresponding score graph, 1 ≤ ^ ≤ 0^ indicates the weight corresponding to the nth constraint condition;

The deformation processing layer is divided into graphs according to the shape, and the score map of the Pth portion is determined according to the formula (2):

=maxB( , ( 2 ) where β ) represents the value at the position of (x, y).

In this embodiment, the occlusion processing layer includes three sub-layers, which are a first sub-layer, a second sub-layer, and a third sub-layer, respectively, and the occlusion processing layer determines the occlusion corresponding to each part according to the score map of the one part, include:

The occlusion processing layer determines the score map and visibility of the M parts on the sub-layer of the occlusion processing layer; the first sub-layer, the second sub-layer, and the third sub-layer of the occlusion processing layer are respectively according to formulas (3), (4) ), (5) Calculate the visibility of each part:

= + g (3) + ), 1=1, 2 (4) y ^S((h ³ ) ^T w ^ds + b) (5) where, the Pth part is on the first layer of the occlusion layer The score map, ^ denotes the weight matrix of 4, denotes the offset of 4, and shows the visibility of the Pth part on the first layer of the occlusion processing layer, ^tXl + expi-t)) - ¹ , indicating the Pth The visibility of the parts on the Z-th sub-layer of the occlusion processing layer, the transfer matrix between ^ and ^, the j-th column of the ^, the parameter of the linear classifier of the implied variable ^, «representing the matrix X The transpose, which represents the output of the convolutional neural network. The convolutional neural network provided in this embodiment is used to implement the technical solution provided by the method embodiment shown in FIG. 3. The specific implementation manner and technical effects are similar, and details are not described herein again.

One of ordinary skill in the art will appreciate that all or a portion of the steps of implementing the various method embodiments described above can be accomplished by hardware associated with the program instructions. The aforementioned program can be stored in a computer readable storage medium. The program, when executed, performs the steps including the above-described method embodiments; and the foregoing storage medium includes: a medium that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.

It should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention, and are not intended to be limiting; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art The technical solutions described in the foregoing embodiments may be modified, or some or all of the technical features may be equivalently replaced; and the modifications or substitutions do not deviate from the technical solutions of the embodiments of the present invention. range.

Claims

claims

1. A target object detection method based on a convolutional neural network, characterized in that the convolutional neural network includes: a feature extraction layer, a part detection layer, a deformation processing layer, an occlusion processing layer and a classifier;

The feature extraction layer performs preprocessing on the pixel values of the detection area based on the extracted pixel values of the detection area in the image, and performs feature extraction on the preprocessed image to obtain a feature map of the detection area;

The part detection layer detects the feature maps of the detection area through M filters respectively, and outputs response maps corresponding to the M parts of the detection area. Each filter is used to detect one part, and each part corresponds to a response map. ;

The deformation processing layer determines the deformations of the M parts according to the response maps corresponding to the M parts, and determines the score maps of the M parts based on the deformations of the N parts; the occlusion processing layer determines the deformations of the M parts according to The score map of the M parts determines the occlusion corresponding to the M parts;

The classifier determines whether there is a target object in the detection area based on the output result of the occlusion processing layer.

2. The method according to claim 1, characterized in that: the feature extraction layer extracts pixel values of the detection area in the image, and preprocesses the pixel values in the detection area, including: the feature extraction layer extracts The pixel values of the detection area in the image are converted into data of three channels, and the three channels are respectively the first channel, the second channel, and the third channel;

Wherein, the output data of the first channel corresponds to the Y channel data of the YUV pixel value in the detection area;

The second channel is used to reduce the size of the detection area to a quarter of the original size, convert the reduced detection area into YUV format, and filter the conversion into YUV format through Sobel edge operator detection area, respectively obtain the first edge map of the detection area on the three channels Y, U, and V. The Y, U, and V channels respectively correspond to a first edge map, and obtain the three first edge maps. The maximum value at each position in , forms a second edge map. The three first edge maps and the second edge map are of the same size, and are all a quarter of the size of the detection area. The three first edge maps are A spliced image of an edge image and the second edge image is used as the second channel the output data;

The third channel is used to reduce the size of the detection area to a quarter of the original size, convert the reduced detection area into YUV format, and filter the conversion into YUV format through Sobel edge operator detection area, respectively obtain the first edge map of the detection area on the three channels Y, U, and V. The Y, U, and V channels respectively correspond to a first edge map, and a third edge map is generated, so The data at each position of the third edge map is 0, the three first edge maps and the third edge map are the same size, and are all a quarter of the size of the detection area, and the three first edge maps are and the third edge map as the output data of the third channel.

3. The method according to claim 2, characterized in that the part detection layer includes three sub-layers, namely a first sub-layer, a second sub-layer and a third sub-layer, and the first part of the part detection layer The sub-layer includes M1 filters, the second sub-layer of the part detection layer includes M2 filters, and the third sub-layer of the part detection layer includes M3 filters, where M1+M2+M3=M; The M1 filters of the first sub-layer of the part detection layer respectively detect M1 parts in the detection area, and obtain M1 response maps;

The M2 filters of the second sub-layer of the part detection layer respectively detect M2 parts in the detection area, and obtain M2 response maps;

The M3 filters of the third sub-layer of the part detection layer respectively detect M3 parts in the detection area, and obtain M3 response maps.

4. The method according to claim 1, wherein the deformation processing layer determines the deformation of the M parts according to the response maps corresponding to the M parts, and determines the deformation of the M parts according to the deformation of the M parts. The score maps of the M parts include:

The deformation processing layer obtains the deformation component diagram of the P-th part according to the response maps corresponding to the M parts according to formula (1):

Among them, β _ρ represents the deformation component diagram of the ρ-th part, 1≤ρ≤M, M _ρ represents the response diagram corresponding to the ρ-th part, N represents the restriction condition of the ρ-th part, and ^D w represents the response diagram of the ρ-th part. The score map corresponding to n restriction conditions, 1≤^≤0^ represents the weight corresponding to the nth restriction condition; the deformation processing layer determines the score of the Pth part according to the deformation score map according to the formula (2) picture: ^maxB (2) Among them, β ) represents the value at the (x, y) position.

5. The method according to claim 1, characterized in that the occlusion processing layer includes three sub-layers, namely a first sub-layer, a second sub-layer and a third sub-layer, and the occlusion processing layer is based on the The score map of M parts determines the occlusion corresponding to the M parts, including:

The occlusion processing layer determines the score map and visibility of the M parts on the sub-layer of the occlusion processing layer;

The first sub-layer, second sub-layer and third sub-layer of the occlusion processing layer are respectively formed according to formula (3),

(4), (5) Calculate the visibility of each part:

= + g (3)

y^S((h ³ ) ^T w ^cls + b) (5) where, represents the score map of the P-th part on the first layer of the occlusion processing layer, ^ represents the weight matrix of 4, represents the bias of 4 Position, represents the visibility of the P-th part on the first layer of the occlusion processing layer, ^tXl + expi-t))- ¹ , represents the visibility of the P-th part on the Z-th sub-layer of the occlusion processing layer For visibility, let ^ represent the transfer matrix between ^ and , represent the jth column of ^ , W ^ds represent the parameters of the linear classifier of the latent variable, « represent the transpose of the matrix X, represent the convolutional neural The output of the network.

6. A convolutional neural network, characterized by including:

The feature extraction layer is used to preprocess the pixel values of the detection area according to the pixel values of the detection area in the extracted image, and perform feature extraction on the preprocessed image to obtain a feature map of the detection area;

The part detection layer is used to detect the feature maps of the detection area through M filters, and output response maps corresponding to the M parts of the detection area. Each filter is used to detect one part, and each part corresponds to a response. picture;

Deformation processing layer, used to determine the M according to the response maps corresponding to the M parts. deformations of the N parts, and determine the score maps of the M parts based on the deformations of the N parts; an occlusion processing layer, used to determine the occlusion corresponding to the M parts based on the score maps of the M parts;

A classifier, used to determine whether there is a target object in the detection area based on the output result of the occlusion processing layer.

7. The convolutional neural network according to claim 6, wherein the feature extraction layer includes three channels, namely a first channel, a second channel, and a third channel;

The second channel is used to reduce the size of the detection area to a quarter of the original size, convert the reduced detection area into YUV format, and filter the conversion into YUV through Sobel edge operator format detection area, respectively obtain the first edge map of the detection area on the three channels Y, U, and V. The Y, U, and V channels respectively correspond to a first edge map, and obtain the three first edges. The maximum value at each position in the figure forms a second edge map. The three first edge maps and the second edge map are of the same size, and are all a quarter of the size of the detection area. The three first edge maps are The spliced image of the first edge image and the second edge image is used as the output data of the second channel;

The third channel is used to reduce the size of the detection area to a quarter of the original size, convert the reduced detection area into YUV format, and filter the conversion into YUV through Sobel edge operator Format detection area, obtain the first edge map of the detection area on the Y, U, and V channels respectively. The Y, υ, and V channels respectively correspond to a first edge map, and generate a third edge map, The data at each position of the third edge map is 0, the three first edge maps and the third edge map are the same size, and are all one-quarter of the size of the detection area. The three first edges are The spliced image of the image and the third edge image is used as the output data of the third channel.

8. The convolutional neural network according to claim 7, wherein the part detection layer includes three sub-layers, namely a first sub-layer, a second sub-layer and a third sub-layer, and the part detection layer The first sub-layer of includes M1 filters, the second sub-layer of the part detection layer includes M2 filters, and the third sub-layer of the part detection layer includes M3 filters, where,

M1+M2+M3=M; The first sub-layer of the part detection layer is used to detect Ml parts in the detection area through Ml filters respectively, and obtain Ml response maps;

The second sub-layer of the part detection layer is used to detect M2 parts in the detection area through M2 filters respectively, and obtain M2 response maps;

The third sub-layer of the part detection layer is used to respectively detect M3 parts in the detection area through M3 filters, and obtain M3 response maps.

9. The convolutional neural network according to claim 8, characterized in that the deformation processing layer is specifically used for:

Among them, ^ represents the deformation component diagram of the p-th part, l≤p≤M, M _p represents the response diagram corresponding to the p-th water part, N represents the restriction condition of the p-th part, ^D w represents the n-th The score map corresponding to the water restriction condition, l≤w≤N, C„^ represents the weight corresponding to the nth restriction condition; the deformation processing layer determines the P-th part according to the deformation score map according to the formula (2) Score chart of:

s _D = maxB (2)

p where, β ) represents the value of β _ρ at the (χ, y) position.

10. The convolutional neural network according to claim 8, wherein the occlusion processing layer includes three sub-layers, namely a first sub-layer, a second sub-layer, and a third sub-layer;

(4), (5) Calculate the visibility of each part:

^ w^ + c^+g'; ¹ ^ ¹ ), 1=1, 2 (4) y^S((h ³ ) ^T w ^c!s + b)

(5) Among them, ^ represents the score map of the P-th part on the first layer of the occlusion processing layer, the weight matrix of table/", represents the bias, represents the P-th part on the occlusion processing layer of

The visibility on layer 1, ^tXl + expi-t))- ¹ , represents the P-th part during the occlusion process Visibility on the Z-th sub-layer of the layer, ^ represents the transfer matrix between / ^ and / ^, represents the j-th column of ^, represents the parameters of the linear classifier of the latent variable ^, ^ represents the transformation of the matrix X Set, represents the output result of the convolutional neural network.