CN109214505A - Full convolution target detection method of densely connected convolutional neural network - Google Patents

Full convolution target detection method of densely connected convolutional neural network

Info

Publication number
CN109214505A
CN109214505A
Authority
CN
China
Prior art keywords
feature
layer
convolutional neural
network
neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810998184.3A
Other languages
Chinese (zh)
Other versions
CN109214505B (en)
Inventor
胡海峰
黄福强
王伟轩
张运鸿
孙永丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201810998184.3A
Publication of CN109214505A
Application granted
Publication of CN109214505B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the field of artificial intelligence, and more particularly to a full convolution target detection method of a densely connected convolutional neural network. To overcome the inability of existing methods to detect multi-scale targets accurately, the present invention provides a full convolution target detection method of a densely connected convolutional neural network, characterized in that multi-scale feature maps are used effectively for target detection, so that the convolutional neural network attains high accuracy when detecting targets of different scales within the same image.

Description

Full convolution target detection method of densely connected convolutional neural network
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a full convolution target detection method of a densely connected convolutional neural network.
Background technique
Convolutional neural networks detect features with a degree of invariance: after an object is translated or rotated, a convolutional neural network can still recognize it as the same object. For targets that occupy only a small area of the image, however, information is lost while the convolutional features are extracted, so such targets cannot be detected accurately. Recent research has found that using "multi-scale" feature representations effectively improves detection accuracy for targets of different scales. Attempts have been made to detect multi-scale targets with an image pyramid: an image is first rescaled to several scales, and the images at the different scales are then fed into the convolutional neural network. This approach, however, requires a very large amount of computation and memory and is therefore impractical.
Summary of the invention
To overcome the inability of existing methods to detect multi-scale targets accurately, the present invention provides a full convolution target detection method of a densely connected convolutional neural network.
To achieve the above object of the invention, the technical solution adopted is as follows:
A full convolution target detection method of a densely connected convolutional neural network, specifically comprising the following steps:
Step S1: construct the feature-extraction network DenseNet. The feature-extraction network is composed of multiple dense blocks and transition layers; the dense blocks allow the network to recognize the more discriminative visual features in an image. After the input image passes through the feature-extraction network, the features output by each dense block, which carry different semantics and different resolutions, are retained.
Step S2: construct the feature pyramid FPN. The per-layer features retained in step S1 are fed into the FPN and stacked by feature scale, forming a low-semantic feature pyramid whose scale increases from bottom to top. Starting from the lowest level, every layer's feature passes through a "parallel path" (a lateral connection) convolution to obtain higher semantics; at the same time, the convolved feature is upsampled to the scale of the layer above and merged with that layer's feature. The merged feature keeps propagating upward toward the pyramid top, and this step is repeated until the complete feature pyramid is built.
Step S3: construct the fully convolutional predictor (FCP) network. The FCP is a predictor that outputs target bounding-box information and class probabilities simultaneously, and it makes a separate prediction for the feature map of every scale in the feature pyramid. The predictor passes the input feature map through a convolutional neural network and outputs a tensor of size S*S*(B*5+C) as the prediction result, which is equivalent to dividing the original image into an S*S grid and predicting B bounding boxes for each grid cell. Each bounding box carries 5 values: the center-coordinate offsets (t_x, t_y), the width and height offsets (t_w, t_h), and the confidence t_0 of the predicted box; in addition, each grid cell predicts the probabilities of C target categories.
Step S4: train the overall network. Target images are acquired and fed into the network; the parameters of every layer are initialized the Xavier way; and using the loss function composed of bounding-box coordinate regression and object classification, the stochastic gradient descent algorithm computes the loss gradient while the backpropagation algorithm fine-tunes the parameters of all layers in the whole network.
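Before the preferred sub-steps are detailed, the following is a minimal PyTorch sketch of the starting point of step S1: reusing an existing trained densely connected network and keeping the per-dense-block feature maps instead of its classification output. The use of torchvision's densenet121 and its internal module names (denseblock1 to denseblock4) is an illustrative assumption; the patent only requires some pretrained densely connected convolutional neural network.

```python
import torch
import torchvision

def build_backbone():
    # Step S101 (sketch): start from an existing trained densely connected
    # model; torchvision's densenet121 is assumed here for illustration.
    model = torchvision.models.densenet121(weights="IMAGENET1K_V1")
    # Step S107 (sketch): keep only the convolutional trunk, i.e. drop the
    # global average-pooling layer and the fully connected classifier.
    return model.features

def extract_dense_block_features(features, x):
    """Run the trunk and retain each dense block's output (the C_k of the
    text), which carry different semantics and different resolutions."""
    retained = []
    for name, layer in features.named_children():
        x = layer(x)
        if name.startswith("denseblock"):
            retained.append(x)
    return retained

# usage sketch:
# feats = extract_dense_block_features(build_backbone(), torch.randn(1, 3, 416, 416))
```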
Preferably, the specific steps of step S1 are as follows:
Step S101: an existing trained densely connected convolutional neural network model is adapted to obtain a preliminary feature-extraction network model;
Step S102: in implementation, the densely connected convolutional neural network is divided into multiple dense blocks, and the different dense blocks are connected to one another by transition layers;
Step S103: a dense block contains multiple convolutional neural network layers, and the input of each layer is the superposition (channel-wise concatenation) of the outputs of all preceding layers in the same dense block; let the input of the l-th convolutional layer in a dense block be x_l and its output y_l, then x_l = (x_1 + y_1 + … + y_(l-1)) and y_l = H(x_l), where H(·) is defined as the activation function;
Step S104: H(·) is the activation function appended to every convolutional layer; here it is a composite operation in which the input x_l first passes through a BN operation, then a ReLU function, and finally a convolutional layer, whose result is the output of the whole activation function;
Step S105: since different dense blocks have different spatial sizes, they are connected by a transition layer; a transition layer takes the output of the preceding dense block as input, first applies a BN operation, then a convolutional neural network layer, and finally a pooling layer that adjusts the spatial size of the feature map to match the input of the next dense block; here the pooling layer is set to shrink the spatial size of the feature map to 1/n of the original;
Step S106: dense blocks and transition layers alternate repeatedly, so that the spatial size of the feature map shrinks after every dense block while its channel number grows; here the feature map output by the last convolutional layer of each dense block is denoted C_m;
Step S107: the global average-pooling layer and the fully connected classification layer of the existing densely connected convolutional neural network are removed, and the feature map output by the last convolutional layer of the last dense block is taken as the output of the feature-extraction network. (A code sketch of steps S103 to S105 follows this list.)
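A minimal PyTorch sketch of steps S103 to S105 follows. The growth rate, the 3×3 kernel and the pooling factor n = 2 are illustrative assumptions, and the '+' in x_l = (x_1 + y_1 + … + y_(l-1)) is realized as channel-wise concatenation, which matches the channel growth described in step S106.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One H(.) unit from step S104: BN -> ReLU -> 3x3 convolution."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3,
                              padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.relu(self.bn(x)))

class DenseBlock(nn.Module):
    """Step S103: the input of layer l is the superposition (channel-wise
    concatenation) of the block input and all earlier layer outputs."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            y = layer(torch.cat(features, dim=1))  # x_l = [x_1, y_1, ..., y_(l-1)]
            features.append(y)
        return torch.cat(features, dim=1)

class TransitionLayer(nn.Module):
    """Step S105: BN -> 1x1 convolution -> pooling that shrinks the spatial
    size to 1/n of the original (n = 2 assumed here)."""
    def __init__(self, in_channels, out_channels, n=2):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=n, stride=n)

    def forward(self, x):
        return self.pool(self.conv(self.bn(x)))
```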
Preferably, the specific steps of step S2 are as follows:
Step S201: the FPN is composed of a "bottom-up feature pyramid" and "parallel paths"; the FPN first obtains from the feature-extraction network the visual features of its layers, which carry different semantics and different scales, and then stacks them bottom-up to generate a pyramid of lower-semantic features;
Step S202: the feature map output in step S107 is taken as the first input of the FPN; the input feature map passes through a convolutional layer that adjusts its channel number to a constant d, and the channel-adjusted feature map becomes the lowest-level feature map of the feature pyramid; here the feature maps of the pyramid layers are denoted D_k, the lowest level being D_m;
Step S203: the "bottom-up path" in the FPN mainly upsamples the feature map of the pyramid layer below; the upsampling factor is n, the reciprocal of the 1/n reduction factor of the pooling layers in the feature-extraction network, so that the upsampled feature map has the same spatial size as the output of the corresponding dense block from step S1;
Step S204: the "parallel path" in the FPN takes the feature map output by each dense block in step S1 as input, and a convolutional layer then adjusts the channel number of the output feature map to d;
Step S205: steps S203 and S204 yield two feature maps that are identical in spatial size and channel number; the two feature maps are added element-wise and then passed through a convolutional layer that reduces the aliasing effect of the upsampling, which produces the feature map of the next pyramid layer; denoting the operations applied to the input in steps S203 and S204 by f(·) and g(·) respectively, D_m = g(C_m) and D_k = φ(f(D_(k+1)) + g(C_k)) for 0 < k < m, where φ(·) denotes the convolution operation of this step;
Step S206: steps S203, S204 and S205 are repeated so that the entire feature pyramid is built layer by layer, upward from the lowest pyramid level. (A code sketch of the merge rule follows this list.)
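The merge rule of steps S203 to S205, D_k = φ(f(D_(k+1)) + g(C_k)), might be sketched in PyTorch as below. The channel constant d = 256, nearest-neighbour upsampling for f(·), and a 3×3 kernel for the anti-aliasing convolution φ(·) are assumptions not fixed by the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """One round of steps S203-S205: upsample the pyramid level below,
    project the dense-block output through a lateral 1x1 convolution,
    add element-wise, and smooth to reduce upsampling aliasing."""
    def __init__(self, c_channels, d=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_channels, d, kernel_size=1)   # g(.) of step S204
        self.smooth = nn.Conv2d(d, d, kernel_size=3, padding=1)  # φ(.) of step S205

    def forward(self, d_below, c_k):
        # f(.) of step S203: upsample D_(k+1) to the spatial size of C_k
        up = F.interpolate(d_below, size=c_k.shape[-2:], mode="nearest")
        return self.smooth(up + self.lateral(c_k))  # D_k
```

Applying this module once per dense-block output, starting from the lowest level D_m = g(C_m), reproduces the loop of step S206.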
Preferably, the specific steps of step S3 are as follows:
Step S301: step S2 yields a feature pyramid whose feature scale increases layer by layer from bottom to top while the channel number of every layer stays the same, the ratio of the spatial sizes of adjacent layers' feature maps being the factor n; a predictor that simultaneously outputs target bounding-box information and class probabilities is built, and it acts on every layer of the feature pyramid, enabling the network to exploit feature maps of different scales;
Step S302: construction of the predictor that outputs target bounding-box information and class probabilities: taking a feature map of one pyramid layer as input, after processing by two fully connected layers it outputs a vector of size S*S*(B*5+C) as the prediction result, which is equivalent to dividing the original image into an S*S grid and predicting B bounding boxes per grid cell, each bounding box carrying 5 values: the center-coordinate offsets (t_x, t_y), the width and height offsets (t_w, t_h), and the confidence t_0 of the predicted box, plus the probabilities of C target categories predicted per grid cell;
Step S303: calculation of the coordinate values:
x = c_x + σ(t_x)
y = c_y + σ(t_y)
σ(t_0) = Pr(object) * IOU(b, object)
where x, y are the actual coordinates of the bounding-box center in the image; w, h are the width and height of the bounding box; (c_x, c_y) is the top-left coordinate of the grid cell; and p_w, p_h are the width and height of the input image, respectively. (A code sketch of the prediction head and the decoding follows.)
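To make steps S302 and S303 concrete, here is a hedged PyTorch sketch of one prediction head and of the coordinate decoding. The convolutional head (rather than the two fully connected layers mentioned in S302, chosen here in the spirit of the "fully convolutional predictor" naming), the values d = 256, B = 3 and C = 20, and the exponential width/height decoding borrowed from the standard YOLOv2 formulation (the corresponding formulas did not survive in the source text) are all assumptions.

```python
import torch
import torch.nn as nn

class FCPHead(nn.Module):
    """Step S302 sketch: map one pyramid feature map (d channels) to an
    S x S x (B*5 + C) prediction tensor."""
    def __init__(self, d=256, B=3, C=20):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(d, d, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(d, B * 5 + C, kernel_size=1),
        )

    def forward(self, feature):
        return self.head(feature)  # shape: (batch, B*5+C, S, S)

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Step S303 decoding. The x, y and confidence formulas follow the text;
    the exponential w, h decoding is an assumption borrowed from YOLOv2."""
    x = c_x + torch.sigmoid(t_x)
    y = c_y + torch.sigmoid(t_y)
    w = p_w * torch.exp(t_w)  # assumption: not recoverable from the source
    h = p_h * torch.exp(t_h)
    return x, y, w, h
```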
Preferably, the specific steps of step S4 are as follows:
Step S401, image acquisition: images containing targets of all kinds are collected from daily life as training images, and every image is processed to obtain the bounding-box and class information of the targets it contains;
Step S402: a cost function is established for each predicted quantity for training. For the bounding-box center coordinates, a formula is used as the cost function; for the bounding-box width and height, a formula is used as the cost function; and for the class prediction, a formula is used as the cost function, where λ_coord and λ_noobj balance the cost function between the bounding-box and probability costs, 1_i^obj indicates that a target appears in grid cell i, and 1_ij^obj indicates that the j-th bounding box of grid cell i is responsible for predicting the target, finally yielding the following overall cost function:
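The four formulas referenced in step S402 were images in the source and did not survive extraction. Given the symbols that do survive (λ_coord, λ_noobj and the indicators 1_i^obj, 1_ij^obj), the overall cost function is very likely the standard YOLO sum-squared-error loss; the LaTeX block below is a reconstruction under that assumption, not a verbatim recovery. In it, \hat{C}_i denotes the predicted box confidence (the patent's t_0 after the sigmoid), which overloads the C used earlier for the number of classes.

```latex
\mathcal{L} =
  \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}}
    \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right]
+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}}
    \left[ \left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2 \right]
+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i-\hat{C}_i \right)^2
+ \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i-\hat{C}_i \right)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c)-\hat{p}_i(c) \right)^2
```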
Step S403: the labeled data collected in step S401 are fed into the network; the parameters of every layer are initialized the Xavier way; and using the loss function composed of bounding-box coordinate regression and object classification, the stochastic gradient descent algorithm computes the loss gradient while the backpropagation algorithm fine-tunes the parameters of all layers in the whole network, thereby training the network.
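Step S403 might look as follows in PyTorch. The data loader, the combined loss function and the learning-rate and momentum values are placeholders; the patent fixes only Xavier initialization, stochastic gradient descent on the combined regression-plus-classification loss, and backpropagation fine-tuning of all layers.

```python
import torch
import torch.nn as nn

def init_xavier(module):
    """Initialize every layer's parameters the Xavier way (step S403)."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def train(network, data_loader, loss_fn, epochs=10, lr=1e-3):
    """Stochastic gradient descent on the combined bounding-box regression
    and classification loss; backpropagation fine-tunes all layers."""
    network.apply(init_xavier)
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, targets in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(network(images), targets)
            loss.backward()   # compute the loss gradient by backpropagation
            optimizer.step()  # fine-tune the parameters of every layer
```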
Preferably, in step S1, feature extraction is performed with a network structure in which dense blocks alternate with transition layers, which extracts the more discriminative feature maps in an image.
Preferably, the FPN network composed of the bottom-up feature pyramid and the "parallel paths" under the densely connected convolution can efficiently use both the high-semantic low-scale feature maps and the large-scale low-semantic feature maps, constructing a feature pyramid that combines high semantics, large scale and precise location information.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention provides a full convolution target detection method of a densely connected convolutional neural network, characterized in that multi-scale feature maps are used effectively for target detection, so that the convolutional neural network attains high accuracy when detecting targets of different scales within the same image.
Detailed description of the invention
Fig. 1 is the flowchart of the present invention.
Specific embodiment
The accompanying drawings are for illustration only and shall not be construed as limiting this patent;
The present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
As shown in Fig. 1, the present invention provides a full convolution target detection method of a densely connected convolutional neural network that proceeds through steps S1 to S4, together with the preferred sub-steps S101 to S403, exactly as set out in the Summary of the invention above.
Obviously, the above embodiment is merely an example given to illustrate the present invention clearly and is not a limitation on the embodiments of the present invention. On the basis of the above description, those of ordinary skill in the art may make other variations or changes in different forms; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (7)

1. A full convolution target detection method of a densely connected convolutional neural network, characterized by comprising the following steps:
Step S1: construct the feature-extraction network DenseNet. The feature-extraction network is composed of multiple dense blocks and transition layers; the dense blocks allow the network to recognize the more discriminative visual features in an image. After the input image passes through the feature-extraction network, the features output by each dense block, which carry different semantics and different resolutions, are retained;
Step S2: construct the feature pyramid FPN. The per-layer features retained in step S1 are fed into the FPN and stacked by feature scale, forming a low-semantic feature pyramid whose scale increases from bottom to top. Starting from the lowest level, every layer's feature passes through a "parallel path" convolution to obtain higher semantics; at the same time, the convolved feature is upsampled to the scale of the layer above and merged with that layer's feature. The merged feature keeps propagating upward toward the pyramid top, and this step is repeated until the complete feature pyramid is built;
Step S3: construct the fully convolutional predictor (FCP) network. The FCP is a predictor that outputs target bounding-box information and class probabilities simultaneously and makes a separate prediction for the feature map of every scale in the feature pyramid. The predictor passes the input feature map through a convolutional neural network and outputs a tensor of size S*S*(B*5+C) as the prediction result, which is equivalent to dividing the original image into an S*S grid and predicting B bounding boxes per grid cell. Each bounding box carries 5 values: the center-coordinate offsets (t_x, t_y), the width and height offsets (t_w, t_h), and the confidence t_0 of the predicted box; in addition, each grid cell predicts the probabilities of C target categories;
Step S4: train the overall network. Target images are acquired and fed into the network; the parameters of every layer are initialized the Xavier way; and using the loss function composed of bounding-box coordinate regression and object classification, the stochastic gradient descent algorithm computes the loss gradient while the backpropagation algorithm fine-tunes the parameters of all layers in the whole network.
2. The full convolution target detection method of a densely connected convolutional neural network according to claim 1, characterized in that the specific steps of step S1 are as follows:
Step S101: an existing trained densely connected convolutional neural network model is adapted to obtain a preliminary feature-extraction network model;
Step S102: in implementation, the densely connected convolutional neural network is divided into multiple dense blocks, and the different dense blocks are connected to one another by transition layers;
Step S103: a dense block contains multiple convolutional neural network layers, and the input of each layer is the superposition (channel-wise concatenation) of the outputs of all preceding layers in the same dense block; let the input of the l-th convolutional layer in a dense block be x_l and its output y_l, then x_l = (x_1 + y_1 + … + y_(l-1)) and y_l = H(x_l), where H(·) is defined as the activation function;
Step S104: H(·) is the activation function appended to every convolutional layer; here it is a composite operation in which the input x_l first passes through a BN operation, then a ReLU function, and finally a convolutional layer, whose result is the output of the whole activation function;
Step S105: since different dense blocks have different spatial sizes, they are connected by a transition layer; a transition layer takes the output of the preceding dense block as input, first applies a BN operation, then a convolutional neural network layer, and finally a pooling layer that adjusts the spatial size of the feature map to match the input of the next dense block; here the pooling layer is set to shrink the spatial size of the feature map to 1/n of the original;
Step S106: dense blocks and transition layers alternate repeatedly, so that the spatial size of the feature map shrinks after every dense block while its channel number grows; here the feature map output by the last convolutional layer of each dense block is denoted C_m;
Step S107: the global average-pooling layer and the fully connected classification layer of the existing densely connected convolutional neural network are removed, and the feature map output by the last convolutional layer of the last dense block is taken as the output of the feature-extraction network.
3. The full convolution target detection method of a densely connected convolutional neural network according to claim 2, characterized in that the specific steps of step S2 are as follows:
Step S201: the FPN is composed of a "bottom-up feature pyramid" and "parallel paths"; the FPN first obtains from the feature-extraction network the visual features of its layers, which carry different semantics and different scales, and then stacks them bottom-up to generate a pyramid of lower-semantic features;
Step S202: the feature map output in step S107 is taken as the first input of the FPN; the input feature map passes through a convolutional layer that adjusts its channel number to a constant d, and the channel-adjusted feature map becomes the lowest-level feature map of the feature pyramid; here the feature maps of the pyramid layers are denoted D_k, the lowest level being D_m;
Step S203: the "bottom-up path" in the FPN mainly upsamples the feature map of the pyramid layer below; the upsampling factor is n, the reciprocal of the 1/n reduction factor of the pooling layers in the feature-extraction network, so that the upsampled feature map has the same spatial size as the output of the corresponding dense block from step S1;
Step S204: the "parallel path" in the FPN takes the feature map output by each dense block in step S1 as input, and a convolutional layer then adjusts the channel number of the output feature map to d;
Step S205: steps S203 and S204 yield two feature maps that are identical in spatial size and channel number; the two feature maps are added element-wise and then passed through a convolutional layer that reduces the aliasing effect of the upsampling, which produces the feature map of the next pyramid layer; denoting the operations applied to the input in steps S203 and S204 by f(·) and g(·) respectively, D_m = g(C_m) and D_k = φ(f(D_(k+1)) + g(C_k)) for 0 < k < m, where φ(·) denotes the convolution operation of this step;
Step S206: steps S203, S204 and S205 are repeated so that the entire feature pyramid is built layer by layer, upward from the lowest pyramid level.
4. The full convolution target detection method of a densely connected convolutional neural network according to claim 2, characterized in that the specific steps of step S3 are as follows:
Step S301: step S2 yields a feature pyramid whose feature scale increases layer by layer from bottom to top while the channel number of every layer stays the same, the ratio of the spatial sizes of adjacent layers' feature maps being the factor n; a predictor that simultaneously outputs target bounding-box information and class probabilities is built, and it acts on every layer of the feature pyramid, enabling the network to exploit feature maps of different scales;
Step S302: construction of the predictor that outputs target bounding-box information and class probabilities: taking a feature map of one pyramid layer as input, after processing by two fully connected layers it outputs a vector of size S*S*(B*5+C) as the prediction result, which is equivalent to dividing the original image into an S*S grid and predicting B bounding boxes per grid cell, each bounding box carrying 5 values: the center-coordinate offsets (t_x, t_y), the width and height offsets (t_w, t_h), and the confidence t_0 of the predicted box, plus the probabilities of C target categories predicted per grid cell;
Step S303: calculation of the coordinate values:
x = c_x + σ(t_x)
y = c_y + σ(t_y)
σ(t_0) = Pr(object) * IOU(b, object)
where x, y are the actual coordinates of the bounding-box center in the image; w, h are the width and height of the bounding box; (c_x, c_y) is the top-left coordinate of the grid cell; and p_w, p_h are the width and height of the input image, respectively.
5. The full convolution target detection method of a densely connected convolutional neural network according to claim 1, characterized in that the specific steps of step S4 are as follows:
Step S401, image acquisition: images containing targets of all kinds are collected from daily life as training images, and every image is processed to obtain the bounding-box and class information of the targets it contains;
Step S402: a cost function is established for each predicted quantity for training; for the bounding-box center coordinates a formula is used as the cost function, for the bounding-box width and height a formula is used as the cost function, and for the class prediction a formula is used as the cost function, where λ_coord and λ_noobj balance the cost function between the bounding-box and probability costs, 1_i^obj indicates that a target appears in grid cell i, and 1_ij^obj indicates that the j-th bounding box of grid cell i is responsible for predicting the target, finally yielding the following overall cost function:
Step S403: the labeled data collected in step S401 are fed into the network; the parameters of every layer are initialized the Xavier way; and using the loss function composed of bounding-box coordinate regression and object classification, the stochastic gradient descent algorithm computes the loss gradient while the backpropagation algorithm fine-tunes the parameters of all layers in the whole network, thereby training the network.
6. The full convolution target detection method of a densely connected convolutional neural network according to claim 1, characterized in that in step S1, feature extraction is performed with a network structure in which dense blocks alternate with transition layers, which extracts the more discriminative feature maps in an image.
7. The full convolution target detection method of a densely connected convolutional neural network according to claim 1, characterized in that the FPN network composed of the bottom-up feature pyramid and the "parallel paths" under the densely connected convolution can efficiently use both the high-semantic low-scale and the large-scale low-semantic feature maps, constructing a feature pyramid with high semantics, large scale and precise location information.
CN201810998184.3A 2018-08-29 2018-08-29 Full convolution target detection method of densely connected convolution neural network Active CN109214505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810998184.3A CN109214505B (en) 2018-08-29 2018-08-29 Full convolution target detection method of densely connected convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810998184.3A CN109214505B (en) 2018-08-29 2018-08-29 Full convolution target detection method of densely connected convolution neural network

Publications (2)

Publication Number Publication Date
CN109214505A true CN109214505A (en) 2019-01-15
CN109214505B CN109214505B (en) 2022-07-01

Family

ID=64985668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810998184.3A Active CN109214505B (en) 2018-08-29 2018-08-29 Full convolution target detection method of densely connected convolution neural network

Country Status (1)

Country Link
CN (1) CN109214505B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170372201A1 (en) * 2016-06-22 2017-12-28 Massachusetts Institute Of Technology Secure Training of Multi-Party Deep Neural Network
CN106250812A (en) * 2016-07-15 2016-12-21 汤平 Vehicle model recognition method based on the fast R-CNN deep neural network
CN106844442A (en) * 2016-12-16 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Multi-modal recurrent neural network image description method based on FCN feature extraction
CN107437096A (en) * 2017-07-28 2017-12-05 北京大学 Image classification method based on a parameter-efficient deep residual network model
CN108182388A (en) * 2017-12-14 2018-06-19 哈尔滨工业大学(威海) Image-based moving target tracking method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815886A (en) * 2019-01-21 2019-05-28 南京邮电大学 Pedestrian and vehicle detection method and system based on improved YOLOv3
CN109871823A (en) * 2019-03-11 2019-06-11 中国电子科技集团公司第五十四研究所 Satellite image ship detection method combining rotating frame and context information
CN109871823B (en) * 2019-03-11 2021-08-31 中国电子科技集团公司第五十四研究所 Satellite image ship detection method combining rotating frame and context information
CN110009622A (en) * 2019-04-04 2019-07-12 武汉精立电子技术有限公司 Display panel appearance defect detection network and defect detection method thereof
CN110060274A (en) * 2019-04-12 2019-07-26 北京影谱科技股份有限公司 Visual target tracking method and device based on a deep densely connected neural network
CN110322509A (en) * 2019-06-26 2019-10-11 重庆邮电大学 Target positioning method, system and computer equipment based on hierarchical class activation graph
CN110322509B (en) * 2019-06-26 2021-11-12 重庆邮电大学 Target positioning method, system and computer equipment based on hierarchical class activation graph
CN110555371A (en) * 2019-07-19 2019-12-10 华瑞新智科技(北京)有限公司 Wild animal information acquisition method and device based on unmanned aerial vehicle
CN110689081A (en) * 2019-09-30 2020-01-14 中国科学院大学 Weakly supervised target classification and positioning method based on bifurcation learning
CN112581470A (en) * 2020-09-15 2021-03-30 佛山中纺联检验技术服务有限公司 Small target object detection method
CN112016535A (en) * 2020-10-26 2020-12-01 成都合能创越软件有限公司 Vehicle-mounted garbage traceability method and system based on edge calculation and block chain
CN112560778A (en) * 2020-12-25 2021-03-26 万里云医疗信息科技(北京)有限公司 DR image body part identification method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN109214505B (en) 2022-07-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant