WO2019144575A1

WO2019144575A1 - Fast pedestrian detection method and device

Info

Publication number: WO2019144575A1
Application number: PCT/CN2018/095058
Authority: WO
Inventors: 林倞; 尹森堂; 张冬雨; 王青
Original assignee: 中山大学
Priority date: 2018-01-24
Filing date: 2018-07-10
Publication date: 2019-08-01
Also published as: CN108399362A; CN108399362B

Abstract

The present invention discloses a fast pedestrian detection method and device. The method comprises the following steps: step S1, constructing a configurable deep model based on a convolutional neural network, and learning the parameters of the constructed network from training samples to obtain a model for the test process; and step S2, inputting test samples and, through the trained model, exploiting the way the receptive field of a neural network varies with depth so that different intermediate layers detect target objects in different scale ranges, thereby predicting the bounding boxes of the target objects in the images. By using different intermediate layers to detect target objects in specific scale ranges, the method better matches the receptive field to the object size and effectively improves the detection result.

Description

Fast pedestrian detection method and device

Technical field

The invention relates to the field of pedestrian detection technology, in particular to a fast pedestrian detection method and device for an embedded system based on deep learning.

Background art

As a branch of object detection in computer vision, pedestrian detection is of great practical significance. As image acquisition technology has matured and the cost of storage has fallen, more and more cameras are deployed in public places. With the rollout of autonomous driving and intelligent transportation, in-vehicle cameras also produce a huge amount of video. Traditional manual screening and processing is inefficient, consumes a great deal of manpower and material resources, and may introduce human error. In recent years, deep learning has achieved unprecedented breakthroughs in computer vision: its efficiency far exceeds that of manual work, and its accuracy has surpassed human performance in many fields. How to apply deep learning effectively to pedestrian detection has therefore attracted attention.

People are one of the most important targets in video surveillance and autonomous driving, and the primary task of pedestrian detection is to identify the presence of a human body and provide the corresponding annotation information. Because the quality of images captured in the real world is uneven, detecting small objects and occluded objects has always been difficult for pedestrian detection. In addition, in-vehicle cameras often capture blurred images, and scenes contain many objects that resemble pedestrians but are not. For embedded systems specifically, large neural network models with strong recognition capability are difficult to run efficiently on devices with limited computing resources, while embedded applications require real-time operation. Balancing detection accuracy and efficiency is therefore the top priority for fast pedestrian detection on embedded systems.

Summary of the invention

In order to overcome the deficiencies of the prior art described above, it is an object of the present invention to provide a fast pedestrian detection method and device which exploit the way the receptive field of a neural network varies with depth and use different intermediate layers to detect target objects in specific scale ranges, thereby better matching the receptive field to the object size and effectively improving the detection result.

Another object of the present invention is to provide a fast pedestrian detection method and device which adjust and train the VGG-16 network to obtain a Squeeze VGG-16 network that meets the requirements of embedded systems, thereby effectively reducing the number of parameters of the network model and improving computational efficiency.

A further object of the present invention is to provide a fast pedestrian detection method and device which enlarge the feature map of a specific network layer by deconvolution to enhance the detection of small objects, while, compared with conventional image enlargement, hardly increasing memory usage or computation.

Another object of the present invention is to provide a fast pedestrian detection method and device which use a region 1.5 times the size of the target object as a background semantic feature, and thereby achieve excellent performance in detecting blurred objects and distant small objects.

To achieve the above and other objects, the present invention provides a rapid pedestrian detection method comprising the following steps:

Step S1, constructing a configurable deep model based on a convolutional neural network, and learning the parameters of the constructed network from training samples to obtain a model for the testing process;

Step S2, inputting test samples and, through the trained model, exploiting the way the receptive field of the neural network varies with depth so that different intermediate layers detect target objects in different scale ranges, thereby predicting the bounding boxes of the target objects in the image.

Preferably, step S1 further comprises:

constructing a configurable deep model based on a convolutional neural network;

inputting training samples;

initializing the convolutional neural network and its parameters, including the weights and biases of the connections in each network layer;

using the forward propagation algorithm and the backward propagation algorithm to learn the parameters of the constructed network from the training samples, thereby obtaining the model used for the test process.

Preferably, the deep model comprises a multi-scale target candidate network and a target detection network. Based on the differences between the features extracted by different layers of the convolutional neural network, the target candidate network generates candidate boxes for target objects of different scales at the respective intermediate layers; the target detection network performs refined classification and detection on the basis of the candidate boxes output by the target candidate network.

Preferably, the convolutional neural network is formed by stacking convolutional layers, downsampling layers and an upsampling layer. A convolutional layer performs a convolution over the two-dimensional space of the input image or feature map to extract hierarchical features; a downsampling layer uses non-overlapping max-pooling, which extracts features that are invariant to shape and offset while reducing the size of the feature map and improving computational efficiency; the upsampling layer performs a deconvolution over the two-dimensional space of the input feature map to increase the resolution of the feature map.

Preferably, the deep model uses a Squeeze VGG-16 convolutional neural network as the backbone network, the Squeeze VGG-16 convolutional neural network having a network structure consisting of a conv1-1 layer immediately followed by 12 Fire module layers.

Preferably, on the basis of the Squeeze VGG-16 convolutional neural network and according to the characteristics of the convolutional layers, the target candidate network generates network branches at Fire9, Fire12, conv6 and the added pooling layer, each branch regressing candidate boxes for objects at a different scale.

Preferably, on the basis of the target candidate regions, the target detection network uses an image region that is a preset multiple of the size of the target candidate region as the background semantic information of the target, and upsamples the feature map of the Fire9 layer once as information that enhances the perception of small objects; the background semantic information and the upsampled information are passed through region-of-interest pooling to obtain fixed-size features, and a fully connected layer is then added to predict the category and regress the final candidate box.

Preferably, the training sample includes RGB image data and annotation information of a pedestrian area in the image, and the actual training image data is a small patch cropped according to the region where the pedestrian is located.

Preferably, in the backward propagation algorithm, the loss function between the bounding boxes predicted by forward propagation and the ground-truth bounding boxes of the image is computed first; its gradient with respect to the parameters W is then computed, and W is updated by a gradient descent algorithm so as to minimize the loss function.

Assume that M branches in the intermediate layers output target candidate regions, $l^m$ denotes the loss function of branch m, $\alpha_m$ denotes the weight of $l^m$, and $S = \{S^1, S^2, \ldots, S^M\}$ denotes the training samples grouped by the corresponding scale, where each sample consists of a training patch $X_i$ and its annotation $Y_i$. The loss function $L(W)$ can then be defined as:

$$L(W) = \sum_{m=1}^{M} \sum_{i \in S^m} \alpha_m \, l^m(X_i, Y_i \mid W)$$

To achieve the above objects, the present invention also provides a fast pedestrian detection device, comprising:

a training unit for constructing a configurable deep model based on a convolutional neural network, and learning the parameters of the constructed network from training samples to obtain a model for the testing process;

a detecting unit for inputting test samples and, through the trained model, exploiting the way the receptive field of the neural network varies with depth so that different intermediate layers detect target objects in different scale ranges, thereby predicting the bounding boxes of the target objects in the image.

Compared with the prior art, the fast pedestrian detection method and device of the present invention draw on network compression methods, adjusting and training the VGG-16 network to obtain a Squeeze VGG-16 network that meets the requirements of embedded systems, which effectively reduces the number of parameters of the network model and improves computational efficiency. In addition, to address the mismatch between receptive field and object size in traditional detection methods, the invention exploits the way the receptive field of a neural network varies with depth (that is, the deeper the layer, the larger the receptive field, and the better suited it is to detecting larger target objects) and uses different intermediate layers to detect target objects in specific scale ranges, which better matches the receptive field to the object size and effectively improves the detection result. Furthermore, to enhance the detection of small objects, the invention uses deconvolution to enlarge the feature map of a specific network layer, which, compared with conventional image enlargement, hardly increases GPU memory usage or computation. Finally, to enhance the detection of blurred objects, a region 1.5 times the size of the target object on the layer's feature map is added to the network as a background semantic feature, which gives excellent performance on blurred objects and distant small objects.

DRAWINGS

FIG. 1 is a flow chart showing the steps of a fast pedestrian detection method according to the present invention;

FIG. 2 is a schematic structural diagram of a Squeeze VGG-16 neural network according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a Fire module in a specific embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a target candidate network according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a target detection network according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a process of fast pedestrian detection in a specific embodiment of the present invention;

FIG. 7 is a system architecture diagram of a fast pedestrian detection device according to the present invention;

FIG. 8 is a detailed structural view of a training unit in a specific embodiment of the present invention;

FIG. 9 is a detailed structural view of a detecting unit in a specific embodiment of the present invention.

Detailed description

The embodiments of the present invention are described below by way of specific examples and the accompanying drawings, from which those skilled in the art can readily understand the advantages and effects of the present invention. The present invention may also be embodied or applied in other specific embodiments, and various modifications and changes may be made without departing from the spirit and scope of the invention.

FIG. 1 is a flow chart showing the steps of a fast pedestrian detection method according to the present invention. As shown in FIG. 1, the fast pedestrian detection method of the present invention includes the following steps:

In step S1, a configurable deep model based on a convolutional neural network is constructed, and the parameters of the constructed network are learned from training samples to obtain a model for the testing process. In a specific embodiment of the present invention, the deep model is composed of two sub-networks. The first sub-network is a multi-scale target candidate network, used to extract pedestrian features and propose candidate regions; specifically, based on the differences between the features extracted by different layers of the convolutional neural network, the target candidate network generates candidate boxes for pedestrians of different scales at the respective intermediate layers. The second sub-network is the target detection network, which improves the detection result; it shares parameters with the target candidate network and performs refined classification and detection on the basis of the candidate boxes. Specifically, step S1 further includes:

Step S100, constructing a configurable deep model based on a convolutional neural network.

The convolutional neural network is formed by stacking convolutional layers, downsampling layers and an upsampling layer. A convolutional layer performs a convolution over the two-dimensional space of the input image or feature map to extract hierarchical features. A downsampling layer uses non-overlapping max-pooling, which extracts features that are invariant to shape and offset while reducing the size of the feature map and improving computational efficiency. The upsampling layer performs a deconvolution over the two-dimensional space of the input feature map to increase the resolution of the feature map, and is mainly used in the target detection network to improve the detection result. In the specific embodiment of the present invention, a Squeeze VGG-16 convolutional neural network is adopted as the backbone network. As shown in FIG. 2, the Squeeze VGG-16 convolutional neural network uses a conv1-1 layer followed by 12 Fire module layers as the convolutional layers that extract features, and the pooling layers up to pool5 as the downsampling layers. A model pre-trained on the ImageNet dataset is used for initialization; that is, the present invention first pre-trains Squeeze VGG-16 on the ImageNet dataset and uses the result to initialize the network.

FIG. 3 is a schematic structural diagram of a Fire module according to an embodiment of the present invention. As shown in FIG. 3, the Fire module consists of two convolutional layers with 1 × 1 kernels and one convolutional layer with 3 × 3 kernels. The aim is to replace 3 × 3 convolution kernels with 1 × 1 kernels, each of which has 9 times fewer parameters. To avoid weakening the representational power of the network, not all kernels are replaced: some are 1 × 1 and some remain 3 × 3. A further benefit is that the number of input channels to the 3 × 3 kernels is reduced, which also reduces the number of parameters. Specifically, the Fire module first uses a 1 × 1 convolutional layer to reduce the number of input channels; then, following the GoogLeNet structure, it extracts features with 1 × 1 and 3 × 3 convolutional layers and finally concatenates the two sets of features. This greatly reduces the amount of computation and the number of model parameters.
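By way of illustration only, the parameter saving of a Fire module can be checked with a short count. The sketch below compares a plain 3 × 3 convolution with a Fire module of the same input and output width. The channel widths (256 in, 32 squeeze, 128 + 128 expand) and the function names are hypothetical values chosen for the example; the embodiment does not state them.

```python
def conv_params(in_ch, out_ch, k):
    """Number of weights plus biases in a k x k convolution."""
    return in_ch * out_ch * k * k + out_ch


def fire_params(in_ch, squeeze_ch, expand1_ch, expand3_ch):
    """Parameters of a Fire module: a 1x1 squeeze layer that reduces the
    channel count, followed by 1x1 and 3x3 expand layers whose outputs
    are concatenated."""
    squeeze = conv_params(in_ch, squeeze_ch, 1)
    expand1 = conv_params(squeeze_ch, expand1_ch, 1)
    expand3 = conv_params(squeeze_ch, expand3_ch, 3)
    return squeeze + expand1 + expand3


# Hypothetical widths, not taken from the embodiment.
plain = conv_params(256, 256, 3)
fire = fire_params(256, squeeze_ch=32, expand1_ch=128, expand3_ch=128)
print(plain, fire, round(plain / fire, 1))
```

With these assumed widths the Fire module needs roughly one twelfth of the parameters of the plain layer; the exact ratio depends entirely on the widths chosen.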

FIG. 4 is a schematic structural diagram of a target candidate network in a specific embodiment of the present invention. In this embodiment, the target candidate network is built on the Squeeze VGG-16 convolutional neural network. According to the characteristics of the convolutional layers, network branches are generated at four layers in total (Fire9, Fire12, conv6 and the added pooling layer), and each branch regresses candidate boxes for objects at a different scale. The Fire9 layer, however, is closer to the lower layers of the backbone network than the others, so its gradients have a larger influence and make the learning process unstable. A buffer layer is therefore added, shown as the det-conv layer in FIG. 4, which prevents the gradients of the detection branch from being back-propagated directly into the backbone layers.

The invention exploits the way the receptive field of a neural network varies with depth (that is, the deeper the layer, the larger the receptive field, and the better suited it is to detecting larger target objects) and uses different intermediate layers to detect target objects in specific scale ranges, which better matches the receptive field to the object size and effectively improves the detection result.
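The growth of the receptive field with depth can be illustrated numerically. The following is a minimal sketch using the standard recurrence for stacked convolution and pooling layers; the layer stack shown (a 3 × 3 convolution followed by a 2 × 2 stride-2 pooling, repeated four times) is a hypothetical stand-in and not the Squeeze VGG-16 configuration.

```python
def receptive_fields(layers):
    """layers: list of (name, kernel, stride).
    Returns a list of (name, receptive_field, jump), where jump is the
    distance in input pixels between neighbouring output positions."""
    rf, jump = 1, 1
    out = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride              # strided layers spread output positions apart
        out.append((name, rf, jump))
    return out


# Hypothetical stack, for illustration only.
stack = []
for i in range(1, 5):
    stack.append((f"conv{i}", 3, 1))
    stack.append((f"pool{i}", 2, 2))

for name, rf, jump in receptive_fields(stack):
    print(f"{name}: receptive field {rf} px, stride {jump} px")
```

The receptive field never shrinks as layers are added, which is why a branch attached to a deeper layer is the natural place to detect larger objects.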

FIG. 5 is a schematic structural diagram of a target detection network according to an embodiment of the present invention. The target detection network shares parameters with the target candidate network and gathers the candidate boxes of the target candidate network, in order to strengthen the detection network's ability to distinguish objects from the background. In a specific embodiment of the present invention, on the basis of the target candidate regions, the target detection network uses an image region 1.5 times the size of the target candidate region as the background semantic information of the target, and upsamples the feature map of the Fire9 layer once as information that enhances the perception of small objects. The background semantic information and the upsampled information are passed through ROI pooling to obtain fixed-size features, and a fully connected layer is then added to predict the category and regress the final candidate box. Specifically, the backbone CNN layers are connected to a proposal node that gathers the candidate box information obtained by the target candidate network. For the feature map of the Fire9 layer, W and H are the width and height of the input picture, cube 1 represents the mapping of the object region onto the feature map, and cube 2 represents the mapping of the context region onto the feature map, the context region being about 1.5 times the object region. To enhance the detection of small objects, the Fire9 layer is also upsampled once; then, as in the Fast R-CNN algorithm, region-of-interest pooling is used to obtain fixed-size features. The processed Fire9 features are concatenated with the features of the proposals, and a fully connected layer is then added to predict the category and regress the final candidate box, which is not described further here.
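As a rough illustration of the context region, a candidate box can be enlarged about its center and clipped to the image. The sketch below assumes that "1.5 times" applies to the width and height of the box (the text does not say whether side length or area is meant) and that boxes are given as (x, y, w, h) with (x, y) the top-left corner; the function name `context_box` is hypothetical.

```python
def context_box(x, y, w, h, scale=1.5, img_w=None, img_h=None):
    """Enlarge the box (x, y, w, h) about its center by `scale` in each
    dimension and, if image dimensions are given, clip it to the image.
    Assumption: the 1.5x factor is applied to width and height."""
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * scale, h * scale
    x0, y0 = cx - new_w / 2, cy - new_h / 2
    x1, y1 = cx + new_w / 2, cy + new_h / 2
    if img_w is not None:
        x0, x1 = max(0, x0), min(img_w, x1)
    if img_h is not None:
        y0, y1 = max(0, y0), min(img_h, y1)
    return (x0, y0, x1 - x0, y1 - y0)


print(context_box(100, 100, 40, 80))
print(context_box(0, 0, 40, 80, img_w=640, img_h=480))
```

Both the object box and its context box would then be mapped onto the feature map and pooled to a fixed size, as described above.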

In step S101, training samples are input.

The training process requires the ground-truth bounding boxes of the pedestrians in each image. To speed up training, the region containing an annotated pedestrian is cropped from the original image to form a patch; because the patch is smaller than the original image, training on patches is faster. Specifically, in the present invention, the input training samples include RGB image data and annotation information for the pedestrian regions in the image, and the image data actually used for training are small patches (image blocks) cropped according to the region where the pedestrian is located. Expressed mathematically, the training samples are

$$S = \{(X_i, Y_i)\}_{i=1}^{N}$$

where $X_i$ represents a patch of a training picture and N is the number of samples. In practical applications there are other categories in addition to pedestrians, such as background, bicycle rider and sitting person, giving K categories in total. The annotation $Y_i = (y_i, b_i)$ therefore consists of a category label $y_i \in \{0, 1, 2, \ldots, K\}$ and the bounding box coordinates

$$b_i = (b_i^x, b_i^y, b_i^w, b_i^h)$$

where $(b_i^x, b_i^y)$ is the coordinate of the upper-left corner of the bounding box and $(b_i^w, b_i^h)$ are the width and height of the bounding box.
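One possible way to build such a training sample is sketched below: a fixed-size patch is placed around an annotated box and the box coordinates are shifted into the patch's coordinate system. The 640 × 480 image and 448 × 448 patch sizes follow the sizes given for forward propagation in this description; centering the patch on the box, and the names `Sample` and `crop_around`, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    patch_origin: tuple  # (x, y) of the patch's top-left corner in the image
    patch_size: tuple    # (width, height) of the patch
    label: int           # category label y_i in {0, ..., K}
    box: tuple           # (bx, by, bw, bh) in patch coordinates


def crop_around(box, label, img_w, img_h, patch_w=448, patch_h=448):
    """Place a patch centered on `box`, shift it back inside the image if
    needed, and express the box relative to the patch.
    Assumes the image is at least as large as the patch."""
    bx, by, bw, bh = box
    cx, cy = bx + bw / 2, by + bh / 2
    px = min(max(0, int(cx - patch_w / 2)), img_w - patch_w)
    py = min(max(0, int(cy - patch_h / 2)), img_h - patch_h)
    return Sample((px, py), (patch_w, patch_h), label, (bx - px, by - py, bw, bh))


sample = crop_around((300, 200, 40, 80), label=1, img_w=640, img_h=480)
print(sample)
```

A real pipeline would also slice the pixel data for the patch and handle boxes that extend past the patch border; both are omitted here.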

Step S102, initializing the convolutional neural network and its parameters, including the weights and biases of the connections in each network layer. In particular, the present invention pre-trains the Squeeze VGG-16 convolutional neural network on the ImageNet dataset and uses the result to initialize the network.

Step S103, using a forward propagation algorithm and a backward propagation algorithm to learn the parameters of the constructed network from the training samples, thereby obtaining the model used for the testing process.

In the present invention, the forward propagation algorithm first normalizes the input image to 3×480×640 and crops a 3×448×448 patch, together with its annotation information, as the input of the convolutional neural network. After the convolutional layers, downsampling layers and rectified linear unit (ReLU) layers, the feature map at the Fire9 layer has size 512×60×80 and the feature map at the Fire12 layer has size 512×30×40; the feature maps of the two subsequent branches have sizes 512×15×20 and 512×8×10, respectively. On each feature map, a convolution produces the four coordinates and the category information of the target bounding boxes. Taking the Fire9 layer as an example, if only pedestrians and background are detected, the output is 6×60×80, where the 6 channels hold the background score, the pedestrian score and the four coordinates of the candidate box. In the target detection network, the candidate boxes obtained by each branch are gathered at the proposal node and combined with the features obtained by region-of-interest pooling of the Fire9 background semantic information and upsampled information, and the final bounding box regression and category prediction are then performed.
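The feature map sizes quoted above are consistent with a 480 × 640 input and branch strides of 8, 16, 32 and 64, with rounding up at the last branch. The short sketch below reproduces them under that assumption; the strides are inferred from the quoted sizes rather than stated in the text, and the function names are hypothetical.

```python
import math


def branch_shapes(height, width, strides, channels=512):
    """Feature map shape (channels, rows, cols) at each branch, assuming
    each branch downsamples the input by the given stride, rounding up."""
    return [(channels, math.ceil(height / s), math.ceil(width / s)) for s in strides]


def head_channels(num_foreground_classes):
    """Output channels of a branch head: one score per class including
    background, plus four bounding box coordinates."""
    return (num_foreground_classes + 1) + 4


# Strides inferred from the sizes quoted in the description (assumption).
shapes = branch_shapes(480, 640, strides=[8, 16, 32, 64])
print(shapes)
print(head_channels(1))
```

With pedestrian as the only foreground class, `head_channels(1)` gives the 6 output channels described for the Fire9 branch.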

In the present invention, the backward propagation algorithm first computes the loss function between the bounding boxes predicted by forward propagation and the ground-truth bounding boxes of the image.

Assume that M branches in the intermediate layers output target candidate regions (the receptive fields of M scales can approximately cover all target objects in the image), $l^m$ denotes the loss function of branch m, $\alpha_m$ denotes the weight of $l^m$, and $S = \{S^1, S^2, \ldots, S^M\}$ denotes the training samples grouped by the corresponding scale. The loss function $L(W)$ can then be defined as:

$$L(W) = \sum_{m=1}^{M} \sum_{i \in S^m} \alpha_m \, l^m(X_i, Y_i \mid W)$$

For a particular detection layer m, a sample contributes to the loss only if the target scale is within the range detectable by m, so the loss of a sample is defined as

$$l(X, Y \mid W) = L_{cls}(p(X), y) + \lambda \, [y \geq 1] \, L_{loc}(b, \hat{b})$$

where $p(X) = (p_0(X), \ldots, p_K(X))$ represents the probability distribution over the target categories; $\lambda$ is the balance coefficient; $[y \geq 1]$ equals 1 for non-background samples and 0 otherwise; b is the four coordinates of the ground-truth bounding box; and $\hat{b}$ is the bounding box coordinates obtained by forward propagation. In the loss function, the category term is defined with the cross-entropy loss function, that is,

$$L_{cls}(p(X), y) = -\log p_y(X) \qquad (3)$$

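The bounding box regression uses the smooth L1 criterion (a smoothed Manhattan distance), which applies the following function to the difference between each coordinate of b and $\hat{b}$:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$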
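For illustration, the loss terms can be written out in a few lines. The sketch below combines a cross-entropy category term, a smooth L1 box term and a weighted sum over branches. Several points are assumptions made for the example and are not stated in the text: the box term is summed over the four coordinates without normalization, it is applied only to non-background samples (y ≥ 1), and the balance coefficient defaults to 1.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def loc_loss(b, b_hat):
    """Box term: smooth L1 summed over the four coordinates (assumed sum)."""
    return sum(smooth_l1(p - t) for p, t in zip(b_hat, b))


def cls_loss(probs, y):
    """Cross-entropy of the predicted distribution against label y."""
    return -math.log(probs[y])


def sample_loss(probs, y, b, b_hat, lam=1.0):
    """Category term plus, for non-background samples only (assumption),
    the weighted box term."""
    loss = cls_loss(probs, y)
    if y >= 1:
        loss += lam * loc_loss(b, b_hat)
    return loss


def total_loss(branches, alphas):
    """branches[m] lists the (probs, y, b, b_hat) samples assigned to
    branch m; alphas[m] is that branch's weight."""
    return sum(
        alpha * sum(sample_loss(*s) for s in samples)
        for alpha, samples in zip(alphas, branches)
    )


pedestrian = ([0.2, 0.8], 1, (0, 0, 10, 10), (0.5, 0, 10, 12))
background = ([0.9, 0.1], 0, (0, 0, 0, 0), (3, 3, 3, 3))
print(total_loss([[pedestrian], [background]], alphas=[1.0, 0.5]))
```

In training, the gradient of this total with respect to the network parameters would be computed by back-propagation and used in a gradient descent update.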

Step S2, using the trained model and exploiting the way the receptive field of the neural network varies with depth, different intermediate layers detect target objects in different scale ranges, and the bounding boxes of the target objects (such as pedestrians) in the image are predicted.

Specifically, step S2 further includes:

Step S200, loading the trained model;

Step S201, inputting a test sample;

Step S202, using the trained model and exploiting the way the receptive field of the neural network varies with depth, using different intermediate layers to detect pedestrians in different scale ranges and predicting the bounding boxes of the pedestrians in the image. FIG. 6 is a schematic diagram of the process of fast pedestrian detection in a specific embodiment of the present invention. The target candidate network in the model, built on the Squeeze VGG-16 convolutional neural network, generates network branches at four layers in total (Fire9, Fire12, conv6 and the added pooling layer) according to the characteristics of the convolutional layers, and produces target candidate regions for objects at different scales (intermediate layer a, intermediate layer b, intermediate layer c). The target detection network then uses, on the basis of the target candidate regions, an image region 1.5 times the size of the target candidate region as the background semantic information of the target, and upsamples the feature map of the Fire9 layer once as information that enhances the perception of small objects. The background semantic information and the upsampled information are passed through region-of-interest pooling to obtain fixed-size features, followed by a fully connected layer that predicts the category and regresses the final candidate box. Preferably, in step S202, the feature map of the specific network layer is also enlarged by deconvolution.

The pedestrian detection method proposed by the invention is evaluated with two metrics: mean average precision (mAP) and frames per second (FPS). mAP measures how well the final detected regions match the ground-truth pedestrian regions, and is the average of the precision obtained under different intersection-over-union ratios; FPS is mainly an efficiency metric and refers to the number of pictures that can be processed per second.
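The two quantities underlying these metrics can be sketched briefly: the intersection-over-union ratio between a detected box and a ground-truth box, and the frame rate. Boxes are assumed to be given as (x, y, w, h) with (x, y) the top-left corner; the full mAP computation (matching detections to ground truth and averaging precision) is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def frames_per_second(num_images, elapsed_seconds):
    """Number of images processed per second of wall-clock time."""
    return num_images / elapsed_seconds


print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
print(frames_per_second(100, 4.0))
```

A detection is typically counted as correct when its intersection-over-union ratio with a ground-truth box exceeds a chosen threshold; the thresholds used for the reported mAP are not given in the text.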

FIG. 7 is a system architecture diagram of a fast pedestrian detection device according to the present invention. As shown in FIG. 7, the fast pedestrian detection device of the present invention includes:

The training unit 70 is configured to construct a configurable deep model based on a convolutional neural network, and to learn the parameters of the constructed network from training samples to obtain a model for the testing process. In a specific embodiment of the present invention, the deep model constructed by the training unit 70 is composed of two sub-networks. The first sub-network is a multi-scale target candidate network for extracting pedestrian features and proposing candidate regions; specifically, based on the differences between the features extracted by different layers of the convolutional neural network, the target candidate network generates candidate boxes for pedestrians of different scales at the respective intermediate layers. The second sub-network is the target detection network, which improves the detection result; it shares parameters with the target candidate network and performs refined classification and detection on the basis of the candidate boxes. Specifically, as shown in FIG. 8, the training unit 70 further includes:

The model construction unit 701 is configured to construct a configurable deep model based on a convolutional neural network.

The convolutional neural network is formed by stacking convolutional layers, downsampling layers and an upsampling layer. A convolutional layer performs a convolution over the two-dimensional space of the input image or feature map to extract hierarchical features. A downsampling layer uses non-overlapping max-pooling, which extracts features that are invariant to shape and offset while reducing the size of the feature map and improving computational efficiency. The upsampling layer performs a deconvolution over the two-dimensional space of the input feature map to increase the resolution of the feature map. In a specific embodiment of the invention, a Squeeze VGG-16 convolutional neural network is employed as the backbone network.

In a specific embodiment of the present invention, the target candidate network is built on the Squeeze VGG-16 convolutional neural network. According to the characteristics of the convolutional layers, network branches are generated at four layers in total (Fire9, Fire12, conv6 and the added pooling layer), and each branch regresses candidate boxes for objects at a different scale. The Fire9 layer, however, is closer to the lower layers of the backbone network than the others, so its gradients have a larger influence and make the learning process unstable. A buffer layer is therefore added, which prevents the gradients of the detection branch from being back-propagated directly into the backbone layers.

The target detection network shares parameters with the target candidate network and gathers the candidate boxes of the target candidate network, in order to strengthen the detection network's ability to distinguish objects from the background. In a specific embodiment of the present invention, on the basis of the target candidate regions, the target detection network uses an image region 1.5 times the size of the target candidate region as the background semantic information of the target, and upsamples the feature map of the Fire9 layer once as information that enhances the perception of small objects. The background semantic information and the upsampled information are passed through region-of-interest pooling to obtain fixed-size features, and a fully connected layer is then added to predict the category and regress the final candidate box. Specifically, the backbone CNN layers are connected to a proposal subnet; W and H are the width and height of the input picture, cube 1 represents the pooling of the object region, and cube 2 represents the pooling of the context region, which is about 1.5 times the object region. To enhance the detection of small objects, the Fire9 layer is upsampled once; then, as in the Fast R-CNN algorithm, region-of-interest pooling is used to obtain fixed-size features, and a fully connected layer is added to predict the category and regress the final candidate box.

The training sample input unit 702 is configured to input training samples.

Specifically, the training samples are

$$S = \{(X_i, Y_i)\}_{i=1}^{N}$$

where $X_i$ represents a patch of a training picture, N is the number of samples, and the annotation $Y_i = (y_i, b_i)$ consists of the category label $y_i$ and the bounding box coordinates

$$b_i = (b_i^x, b_i^y, b_i^w, b_i^h).$$

The initializing unit 703 is configured to initialize the convolutional neural network and its parameters, including the weights and biases of the connections in each network layer. In particular, the present invention pre-trains the Squeeze VGG-16 convolutional neural network on the ImageNet dataset and uses the result to initialize the network.

The sample training unit 704 is configured to use a forward propagation algorithm and a backward propagation algorithm to learn the parameters of the constructed network from the training samples, thereby obtaining the model used for the testing process.

The backward propagation algorithm first requires the loss function between the bounding box predicted by forward propagation and the ground-truth bounding box of the image.

The loss function has a classification term and a bounding-box regression term, where $p(X) = (p_0(X), \ldots, p_K(X))$ is the probability distribution over the target classes. The classification term uses the cross-entropy loss, i.e.

$L_{cls}(p(X), y) = -\log p_y(X)$

The regression of the bounding box uses the smooth L1 criterion.
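The two loss terms named above can be sketched as follows. The cross-entropy term follows the formula given in the text. The smooth L1 term assumes the piecewise form commonly used in Fast R-CNN-style detectors (quadratic below an absolute error of 1, linear above it), which is an assumption about this method rather than something the text confirms. The function names and the four-coordinate box encoding are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(p: Sequence[float], y: int) -> float:
    """L_cls(p(X), y) = -log p_y(X), where p is a probability distribution over classes."""
    return -math.log(p[y])


def smooth_l1(x: float) -> float:
    """Assumed smooth L1: 0.5 * x^2 if |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def box_regression_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of smooth L1 over the coordinates of a predicted and a ground-truth box."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))
```

The quadratic region keeps gradients small near the target, while the linear region limits the influence of large coordinate errors, which is the usual reason for preferring smooth L1 over a plain squared error in box regression.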

The detecting unit 71 is configured to input a test sample and, through the trained model, use different intermediate layers to detect target objects (such as pedestrians) in different scale ranges, predicting the bounding boxes of the target objects (such as pedestrians) in the image.

Specifically, as shown in FIG. 9, the detecting unit 71 further includes:

a model loading unit 710, configured to load the trained model;

a test sample input unit 711 for inputting a test sample;

The image prediction unit 712 is configured to use different intermediate layers of the trained model to detect pedestrians in different scale ranges and to predict the bounding boxes of the pedestrians in the image. Specifically, using the target candidate network in the model, which is based on the Squeeze VGG-16 convolutional neural network, the image prediction unit 712 generates network branches at four layers in total (Fire9, Fire12, conv6, and the added pooling layer) according to the convolutional layer features, and detects target candidate regions of objects at different scales. Then, using the target detection network, it takes an image region 1.5 times the size of the target candidate region as the background semantic information of that candidate region and performs one upsampling of the feature map of the Fire9 layer as information to enhance the perception of small objects. The background semantic information and the upsampled information are passed through region-of-interest pooling to obtain fixed-size features, and a fully connected layer is then added to perform classification and regression of the final candidate box.

In summary, the fast pedestrian detection method and device of the present invention draw on network compression methods, adjusting and training the VGG-16 network to obtain a Squeeze VGG-16 network that meets the requirements of an embedded system and effectively reduces the number of parameters in the network model. In addition, to address the mismatch between receptive field and object size in traditional detection methods, the present invention exploits the way the receptive field of a neural network varies with depth (that is, the deeper the layer, the larger the receptive field, and the better suited it is to detecting larger target objects) and uses different intermediate layers to detect target objects in specific scale ranges. This better matches the receptive field to the object size and effectively improves the detection result. To enhance the detection of small objects, the present invention uses deconvolution to enlarge the feature map of a specific network layer; compared with the conventional approach of enlarging the input image, this adds almost no GPU memory or computation. To enhance the detection of blurred objects, a region 1.5 times the size of the target object is taken on the feature map of that layer and added to the network as a background semantic feature, which gives excellent performance on blurred samples and on distant and small objects.
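The relationship between depth and receptive field mentioned above can be illustrated with a short calculation. The sketch below uses the standard recurrence for a chain of convolution and pooling layers (padding and dilation ignored); the layer stack `toy` is hypothetical and is not the Squeeze VGG-16 configuration.

```python
from typing import Iterable, List, Tuple


def receptive_fields(layers: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the receptive-field size (in input pixels) after each layer.

    Each layer is given as (kernel_size, stride).
    """
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Hypothetical stack: 3x3 conv, 2x2 pool, 3x3 conv, 2x2 pool, 3x3 conv.
toy = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_fields(toy))  # [3, 4, 8, 10, 18]
```

Because each unit in a deeper layer covers a larger input region, attaching detection branches at several depths lets shallow branches handle small pedestrians and deep branches handle large ones.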

The above-described embodiments are merely illustrative of the principles of the invention and its effects, and are not intended to limit the invention. Modifications and variations of the above-described embodiments can be made by those skilled in the art without departing from the spirit and scope of the invention. Therefore, the scope of protection of the invention should be as set forth in the claims.

Claims

A fast pedestrian detection method, comprising the following steps:

Step S1, constructing a configurable deep model based on a convolutional neural network, and learning the parameters of the constructed network using training samples to obtain a model for the testing process;

Step S2, inputting a test sample and, through the trained model, using different intermediate layers to detect target objects in different scale ranges, thereby predicting the bounding boxes of the target objects in the image.
A fast pedestrian detection method according to claim 1, wherein step S1 further comprises:

constructing a configurable deep model based on a convolutional neural network;

inputting training samples;

initializing the convolutional neural network and its parameters, including the weights and biases of each layer's connections in the network;

using a forward propagation algorithm and a backward propagation algorithm to learn the parameters of the constructed network from the training samples, thereby obtaining the model used in the testing process.
A fast pedestrian detection method according to claim 2, wherein the deep model comprises a multi-scale target candidate network and a target detection network; the target candidate network, based on the differences between the features extracted by different layers of the convolutional neural network, generates candidate boxes for target objects of different scales at intermediate layers; and the target detection network performs refined classification and detection on the basis of the candidate boxes output by the target candidate network.
A fast pedestrian detection method according to claim 3, wherein said convolutional neural network is formed by stacking convolutional layers, downsampling layers, and upsampling layers; the convolutional layer performs a two-dimensional convolution on the input image or feature map to extract hierarchical features; the downsampling layer uses a non-overlapping max-pooling operation to extract features that are invariant to shape and offset while reducing the size of the feature map and increasing computational efficiency; and the upsampling layer performs a two-dimensional deconvolution on the input feature map to increase the number of pixels in the feature map.
A fast pedestrian detection method according to claim 4, wherein said deep model uses a Squeeze VGG-16 convolutional neural network as a backbone network, and said Squeeze VGG-16 convolutional neural network uses a conv1-1 layer and the 12 Fire module layers that follow it as the feature extraction network structure.
A fast pedestrian detection method according to claim 5, wherein said target candidate network is based on said Squeeze VGG-16 convolutional neural network and, according to the convolutional layer features, generates network branches at Fire9, Fire12, conv6, and the added pooling layer to perform detection and regression of candidate boxes for objects at different scales.
A fast pedestrian detection method according to claim 5, wherein the target detection network takes, based on the target candidate region, an image region that is a given multiple of the size of the target candidate region as background semantic information for the target; the feature map of the Fire9 layer is upsampled once as information for enhancing the perception of small objects; the background semantic information and the upsampled information are passed through region-of-interest pooling to obtain fixed-size features; and a fully connected layer is then added to perform classification and regression of the final candidate box.
A fast pedestrian detection method according to claim 1, wherein the training sample comprises RGB image data and annotation information of the pedestrian regions in the image, and the image data actually used for training are small patches taken according to the region where the pedestrian is located.
A fast pedestrian detection method according to claim 1, wherein said backward propagation algorithm first obtains the loss function between the bounding box predicted by forward propagation and the ground-truth bounding box of the image, then finds its gradient with respect to the parameter W and updates W with a gradient descent algorithm to minimize the loss function. Assuming that M branches at intermediate layers output target candidate regions, $l_m$ denotes the loss function of branch m, $\alpha_m$ denotes the weight of $l_m$, and $S = \{S_1, S_2, \ldots, S_M\}$ denotes the target objects of the corresponding scales, the loss function can be defined as:
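A form consistent with the symbols defined in this claim is sketched here, assuming a weighted sum of the per-branch losses over the samples assigned to each scale. The name $\mathcal{L}(W)$, the per-sample arguments $(X_i, Y_i)$, and the learning rate $\eta$ are assumptions introduced for illustration and are not taken from the claim:

$$\mathcal{L}(W) = \sum_{m=1}^{M} \sum_{i \in S_m} \alpha_m \, l_m(X_i, Y_i \mid W), \qquad W \leftarrow W - \eta \, \nabla_W \mathcal{L}(W)$$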
A fast pedestrian detection system, comprising:

a training unit for constructing a configurable deep model based on a convolutional neural network, and learning the parameters of the constructed network using training samples to obtain a model for the testing process;

a detecting unit configured to input a test sample and, through the trained model, use different intermediate layers to detect target objects in different scale ranges, thereby predicting the bounding boxes of the target objects in the image.