CN112686207A - Urban street scene target detection method based on regional information enhancement - Google Patents

Urban street scene target detection method based on regional information enhancement

Info

Publication number
CN112686207A
Authority
CN
China
Prior art keywords
detection
network
target
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110085069.9A
Other languages
Chinese (zh)
Other versions
CN112686207B (en)
Inventor
张逞逞
郑全新
宁志勇
张磊
刘阳
董小栋
孟祥松
刘婷婷
江龙
郭俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tongfang Software Co Ltd
Original Assignee
Beijing Tongfang Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tongfang Software Co Ltd filed Critical Beijing Tongfang Software Co Ltd
Priority to CN202110085069.9A priority Critical patent/CN112686207B/en
Publication of CN112686207A publication Critical patent/CN112686207A/en
Application granted granted Critical
Publication of CN112686207B publication Critical patent/CN112686207B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

A city street scene target detection method based on regional information enhancement, relating to the fields of artificial intelligence and computer vision. The method comprises the following steps: 1) adding daytime data to the training data; 2) outputting a target position code segmask from the feature selection network, and outputting the category prediction module cls and the size regression module size from the detection output network Detection Block; 3) optimizing the detection algorithm: a) initializing the network model parameters; b) outputting the target category and detection frame in the forward pass, then filtering and outputting the final detection result. The invention designs a target detection deep learning network, trains an urban street scene detection model, and, combined with the static-frame target detection and dynamic-video target behavior analysis technologies of an intelligent video analysis system, builds a detection system suited to various events in full daytime and nighttime scenes, thereby completing the automatic detection of illegal events accurately and rapidly while effectively avoiding false detections and missed detections of targets.

Description

Urban street scene target detection method based on regional information enhancement
Technical Field
The invention relates to the fields of artificial intelligence and computer vision, and in particular to an intelligent target detection method based on image processing and video analysis technology, applied to camera monitoring scenes in urban streets.
Background
With the development of modern science and technology, cameras are used to realize efficient city supervision and are applied in city management work, helping city managers play a vital role in handling complex urban street emergencies. Currently, more and more researchers are focusing on automating the administration of urban street scenes. The purpose of urban scene visual supervision is to improve the interpretability of scene monitoring images and to give an intelligent urban management system the ability to correctly understand scene information, so as to improve the safety of urban streets, parking lots and communities. Meanwhile, cameras in night scenes are affected by uncontrollable factors such as bad weather and low illuminance, so common target detection methods cannot meet all-weather, full-scene supervision requirements.
In the field of computer vision, two different detection paradigms, anchor-based and anchor-free, can be used for target detection in complex scenes. The paper "CenterNet: Objects as Points", published in 2019 by Xingyi Zhou et al., for the first time truly converts the category regression of the target detection problem into finding the target center point: the detector uses Gaussian heat points as a keypoint estimate to fit the target center point, turning target detection into a standard keypoint estimation problem, from which other target attributes such as size, 3D position, orientation and even pose can be derived. Compared with BBox-based detectors, the model is end-to-end differentiable and the detection process is simpler, faster and more accurate, achieving the best balance between detection speed and accuracy. The Gaussian heat point idea therefore shows clear advantages for problems such as target segmentation and target tracking. The paper "See Clearer at Night: Towards Robust Nighttime Semantic Segmentation through Day-Night Image Conversion", published in 2019 by Kailun Yang et al., proposes using a generative adversarial network (GAN) to alleviate the low accuracy of semantic segmentation models applied in nighttime environments. The GAN-based nighttime semantic segmentation framework includes two approaches. The first uses a GAN to convert night images into day images and performs semantic segmentation with a robust model already trained on the daytime dataset; the second converts the daytime images in the dataset into nighttime images with a GAN to produce a model that is very robust under nighttime conditions and predicts the nighttime images directly. In the paper's experiments, the second method significantly improves the model's segmentation performance on night images. The method not only benefits the optimization of intelligent vehicles' visual perception, but can also be applied to various navigation assistance systems. The paper "A night target identification method based on infrared thermal imaging and YOLOv3", published in 2019 by Yishi et al., notes that infrared thermal images reflect object temperature information, are little affected by environmental conditions, and under specific conditions have strong application value in night security monitoring, driving assistance, shipping, military reconnaissance and the like. In recent years, the technology of detecting and identifying targets in images with artificial intelligence has advanced greatly and is widely applied in many fields. That paper proposes a night target identification method combining infrared thermal imaging image processing with artificial-intelligence target identification: a thermal imaging video is collected in real time and preprocessed to enhance its contrast and details, then specific targets in the processed thermal images are detected with the deep-learning target detection framework YOLOv3, and the detection results are output.
The test results show that the method achieves a high night target recognition rate and strong real-time performance; combining the advantages of infrared thermal imaging night monitoring with artificial-intelligence target detection, it has great application value for night target recognition and tracking.
In summary, for target detection in urban management scenes, reasonable image preprocessing combined with a well-designed target detection algorithm is an effective approach. However, while daytime target detection is largely solved, nighttime detection in complex scenes still has shortcomings: complex backgrounds easily cause false detections, missed detections and the like. How to improve detection capability and reduce false detections therefore remains a hot topic in complex-scene target detection research.
Meanwhile, existing detection algorithms also have shortcomings:
Traditional algorithms struggle to meet the requirements of recognizing and understanding urban scene surveillance, mainly because the color of day and night scenes differs greatly, scene complexity is very high, target edges are blurred, targets are occluded, and so on. These factors demand extremely strong generalization ability and accuracy from the algorithm; traditional algorithms cannot meet the target detection requirements even on a theoretical basis, and false and missed detections are difficult to eliminate in field deployment.
Deep learning algorithms with strong generalization ability can be designed to handle the varied forms of targets at night, but this also places high demands on network model design. In addition, most common deep learning algorithms are applied to daytime target detection with calibrated training samples, while static features of targets in nighttime surveillance scenes are sparse, so false and missed detections occur easily.
Disclosure of Invention
In view of the above shortcomings in the prior art, an object of the present invention is to provide a target detection network based on regional information enhancement and a detection method using it. The invention designs a target detection deep learning network, trains an urban street scene detection model, and, combined with the static-frame target detection and dynamic-video target behavior analysis technologies of an intelligent video analysis system, builds a detection system suited to various events in full daytime and nighttime scenes, thereby completing the automatic detection of illegal events accurately and rapidly while effectively avoiding false detections and missed detections of targets.
In order to achieve the above object, the technical solution of the present invention is implemented as follows:
a city street scene target Detection method based on regional information enhancement uses a target Detection network which comprises a first WiFPN1, a feature selection network Seg Block, a second WiFPN2, an up-sampling network UATB and a Detection output network Detection Block which are connected in sequence. The downsampling network backhaul comprises a feature fusion module I WiFPN1 and a feature fusion module II WiFPN 2. The method comprises the following steps:
1) scene pre-processing of image data
Add 40% daytime data to the training data; data enhancement during model training includes flipping, scaling, cropping, and color brightness and chroma enhancement; the pixel size of the input image is normalized to 448 x 256.
2) Network model design
An anchor-free target detection algorithm is adopted and a down-sampling and up-sampling network structure is designed; the feature selection network Seg Block outputs the target position code segmask, and the detection output network Detection Block outputs the category prediction module cls and the size regression module size.
3) Detection algorithm optimization
a) Training process
Initialize the network model parameters and set the learning target, learning rate and attenuation coefficient; perform iterative learning and parameter updates on the loss function through an optimization algorithm.
b) Reasoning process
Using cls and size, output the target category and detection frame in the forward pass; filter and output the final detection result by setting a threshold and applying a non-maximum suppression algorithm.
In the method for detecting the urban street scene target, the downsampling network Backbone extracts intermediate features, the feature selection network Seg Block optimizes the intermediate features, and the upsampling network UATB extracts predicted features.
In the above method for detecting urban street scene targets, the feature selection network Seg Block uses regional information for supervised learning. First, a learnable variable soft with values between 0 and 1 is designed, and soft is used to perform calibrated selective fusion of the three output features of feature fusion module WiFPN1; the features are compressed to a one-channel map by a 1 × 1 convolution, and the target position code segmask is output through the sigmoid activation function; a target segmask_gt whose value is 1 inside target regions is designed, the target position code segmask is optimized through a loss function, and finally the optimized segmask is multiplied element-wise with the three inputs of the feature selection network Seg Block to complete the selection of the bottom features of the down-sampling network Backbone.
In the method for detecting urban street scene targets, the UATB is an up-sampling network with composite multi-stage semantic features; each stage feature of the down-sampling network Backbone is used as signal input, and a dual-interaction attention module C_ATB performs the operation layer by layer until a prediction feature is obtained.
In the above method for detecting an object in a city street scene, the dual-interaction attention module C_ATB learns two-stage region information. The two input features are compressed into a shared single-channel attention feature AT through convolution, transposed convolution, Concat merging and a sigmoid activation function, then multiplied element-wise with the upper- and lower-layer input features to complete the first-stage feature interaction, realizing adjustment of the target spatial position; the intermediately adjusted features are merged and sent into an SENet network, which performs the second-stage selective fusion of the two-stage features at the channel level to complete the up-sampled output information.
In the method for detecting urban street scene targets, the detection output network Detection Block takes the 2x down-sampled output as a shared feature; a convolution with an n-channel kernel is applied and the feature values of the n channels are normalized by a sigmoid function and mapped to the category prediction module cls, and a convolution with a 2-channel kernel yields two channels that are respectively mapped to the size regression module size, corresponding to the size of the detection target.
By adopting the above detection method, compared with the prior art, the invention has the following advantages:
1. A network structure is designed based on the anchor-free target detection method, and the 2x down-sampled output is used for prediction, improving the detection of small targets.
2. A weakly supervised method attends to the spatial information of features, enhancing attention to the position of targets in the image.
3. Two WiFPN modules effectively and independently integrate shallow and deep features.
4. A simple UATB module completes the up-sampling, effectively using the feature information of each pyramid input layer for prediction.
The invention is further described with reference to the following figures and detailed description.
Drawings
FIG. 1 is a schematic diagram of a detection network used in the method of the present invention;
FIG. 2 is a schematic structural diagram of a Seg Block module in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a C _ ATB module according to an embodiment of the present invention.
Detailed Description
Referring to fig. 1, in the city street scene target detection method based on regional information enhancement of the present invention, the target detection network used comprises, connected in sequence, a first feature fusion module WiFPN1, a feature selection network Seg Block, a second feature fusion module WiFPN2, an up-sampling network UATB and a detection output network Detection Block. The method extracts intermediate features through the down-sampling network Backbone, which comprises feature fusion module I WiFPN1 and feature fusion module II WiFPN2: WiFPN1 extracts shallow network features and WiFPN2 extracts deep network features. A feature selection network Seg Block is designed to optimize the intermediate features, and an up-sampling network UATB is designed to extract the prediction features. The input video signal passes through the Backbone, Seg Block, UATB and Detection Block in sequence to output segmask, cls and size.
The invention relates to a city street scene target detection method based on regional information enhancement, which comprises the following steps:
1) scene pre-processing of image data
The pixel size of images in the urban scene is 1920 x 1080; to reduce severe deformation of small targets caused by excessive changes of the network input scale, the image is normalized to 448 x 256 pixels. To supplement nighttime target color and texture features, make the network model more robust to color interference, and reduce to some extent the influence of missing nighttime target features, 40% daytime data is added to the training data. Besides conventional flipping, scaling and cropping, data enhancement during training adds color brightness and chroma enhancement.
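For illustration, a minimal preprocessing sketch under the description above, assuming an OpenCV/NumPy pipeline; the jitter ranges and the `build_training_list` mixing helper are assumptions, not values from the patent:

```python
import random
import cv2
import numpy as np

INPUT_W, INPUT_H = 448, 256  # network input size stated in the patent

def preprocess(image_bgr):
    """Resize a 1920x1080 street-scene frame to the 448x256 network input."""
    return cv2.resize(image_bgr, (INPUT_W, INPUT_H), interpolation=cv2.INTER_LINEAR)

def augment(image_bgr):
    """Illustrative flip plus brightness/chroma jitter (ranges are assumptions)."""
    img = image_bgr
    if random.random() < 0.5:                       # horizontal flip
        img = cv2.flip(img, 1)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] *= random.uniform(0.7, 1.3)         # brightness (V channel)
    hsv[..., 1] *= random.uniform(0.7, 1.3)         # chroma (S channel)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def build_training_list(night_frames, day_frames, day_ratio=0.4):
    """Mix roughly 40% daytime frames into the night-dominated training set."""
    n_day = int(len(night_frames) * day_ratio / (1.0 - day_ratio))
    return night_frames + random.sample(day_frames, min(n_day, len(day_frames)))
```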
2) Network model design
In the invention, to solve the false detection and missed detection problems of street scene target detection, a new target detection network structure, ADetNet, is designed. ADetNet is an anchor-free target detection algorithm; it accelerates network convergence and increases the detection rate and accuracy of targets through three modules: the target position code segmask, the category prediction module cls and the size regression module size. The network structure comprises one Input and three outputs: segmask, cls and size. A Seg Block branch is added on top of the WiFPN1 module of the ADetNet network, outputting segmask information at 8x down-sampling; after the Detection Block module of the ADetNet network, the 2x down-sampled target class information cls and the 2x down-sampled detection frame information size are output.
Referring to fig. 1, the signal input represents the input of the ADetNet network; segmask, cls and size represent its three outputs; Conv represents its forward operation structure; Feature represents an intermediate operation result; the dashed box WiFPN1 is its shallow feature fusion module, the dashed box WiFPN2 its deep feature fusion module, and the dashed box UATB its attention up-sampling module. In the network structure, each downward operation performs one feature down-sampling and each upward operation performs one feature up-sampling; the ADetNet network uses 5 Conv stages to reduce the feature map size by 32x, each Conv completing one 2x down-sampling.
Here, Conv denotes the convolution operation, Add the element-wise addition operation, and Max pooling the maximum pooling operation.
The WiFPN module is a feature fusion module proposed in the EfficientDet paper, intended to deliver better upper- and lower-layer semantic features of the network to the detection part. The WiFPN module realizes weight-enhanced feature fusion of information features of different sizes, so that each layer of the down-sampling network performs a weighted operation on semantic features at a small computational cost. A WiFPN module with three inputs and three outputs is therefore built to organically integrate shallow and deep features: input features of different resolutions are fused, and an extra weight is added for each input during fusion so that the network learns the importance of each input feature. The output O of each layer is given by equation (1).
$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$$
(1)
where $I_i$ denotes the inputs of the three-layer WiFPN module and $w_i$ are learnable weights, which may be scalars (per feature), vectors (per channel), or multidimensional tensors (per pixel). A ReLU is applied after each $w_i$ to keep it non-negative, and setting $\epsilon$ to a small constant (e.g. 0.0001 as in EfficientDet) avoids numerical instability. Each normalized weight likewise lies between 0 and 1, and because no softmax operation is used here, this fusion is more efficient.
To further improve the fusion effect, separable depthwise convolutions are used for feature fusion, with batch normalization and activation added after each convolution. The WiFPN1 module performs attention enhancement on the 2x to 8x down-sampled network output features, optimizing several different-resolution input features of the shallow network and delivering better spatial semantic information to the deep network. The WiFPN2 module performs attention enhancement on the 8x to 32x down-sampled network output features, optimizing the deep network's multi-resolution inputs and accumulating better receptive-field information for the up-sampling operation. The up-sampling step itself can be implemented in various ways according to requirements: (1) deconvolution, (2) linear up-sampling, (3) linear up-sampling combined with a 1 × 1 convolution.
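A minimal PyTorch sketch of the fast normalized weighted fusion of equation (1), in the style of EfficientDet; treating the learnable weights as scalars (one per input) is just one of the options named above:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse K same-shape feature maps with learnable non-negative weights:
       O = sum_i w_i * I_i / (eps + sum_j w_j), w_i = relu(theta_i)  -- eq. (1)."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)        # keep weights >= 0
        w = w / (self.eps + w.sum())        # fast normalization, no softmax
        return sum(wi * x for wi, x in zip(w, inputs))

# usage: fuse three equally sized WiFPN1 levels
fuse = WeightedFusion(num_inputs=3)
feats = [torch.randn(1, 64, 56, 32) for _ in range(3)]
out = fuse(feats)                           # shape (1, 64, 56, 32)
```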
The shallow feature map carries rich semantic information. To fully mine texture features beneficial to target detection, weakly supervised segmentation is used as a branch to complete local feature enhancement. The method takes the output feature map of module WiFPN1 and the bounding-box-level segmentation ground truth as input, generates a semantic feature mapping mask of the same dimension, multiplies the WiFPN1 output feature map element-wise by this mask to obtain the local features to attend to, and finally weights and sums the local features with the underlying feature map using learnable weights before passing them downward. Based on this idea, the Seg Block module is designed, as shown in fig. 2.
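A minimal PyTorch sketch of the Seg Block logic as described (a learnable 0 to 1 gate soft, 1 × 1 compression to a single channel, sigmoid position code, element-wise re-weighting of the three inputs); equal channel counts and the exact fusion order are assumptions, since fig. 2 is not reproduced here:

```python
import torch
import torch.nn as nn

class SegBlock(nn.Module):
    """Weakly supervised region-attention branch over three WiFPN1 outputs."""
    def __init__(self, channels):
        super().__init__()
        # learnable selection gate, squashed into (0, 1) by a sigmoid
        self.soft = nn.Parameter(torch.zeros(3))
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 -> one channel

    def forward(self, feats):               # feats: three (B, C, H, W) maps
        gate = torch.sigmoid(self.soft)
        fused = sum(g * f for g, f in zip(gate, feats))       # calibrated fusion
        segmask = torch.sigmoid(self.to_mask(fused))          # target position code
        # multiply the mask back onto each input to select foreground features
        selected = [f * segmask for f in feats]
        return segmask, selected
```

During training, segmask would be driven toward the box-level target segmask_gt (value 1 inside object boxes) with the focal loss of equation (2) below.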
Unlike the CenterNet algorithm, which uses the 4x down-sampled feature output as the prediction layer, and considering that the 2x down-sampled feature output is more favorable for capturing feature information of small targets (less than 50 x 50 pixels), a novel up-sampling network UATB with composite multi-stage image semantic features is designed, built by combining C_ATB sub-modules (dual-interaction attention modules). Starting from the 32x down-sampling layer, a C_ATB module performs the up-sampling operation layer by layer on the down-sampled output features of each ADetNet stage until the 2x down-sampled feature is produced and used as the input of the Detection Block (prediction) module. A further role of the UATB module is to let the outputs of the two WiFPN modules flow and exchange feature information through the up-sampling process.
Fig. 3 abstractly shows the size change of each feature in the C_ATB network. As shown in the figure, the upper-layer input's length and width are twice those of the lower-layer input. The two input features are compressed into a shared single-channel attention feature AT through convolution, transposed convolution, Concat merging and a sigmoid activation function, then multiplied element-wise with the upper- and lower-layer input features to complete the first-stage feature interaction, realizing adjustment of the target spatial position. The intermediately adjusted features are merged and sent into an SENet network, which performs the second-stage selective fusion of the two-stage features at the channel level to complete the up-sampled output. The C_ATB network learns different weights over spatial positions and channels at different stages, adaptively adjusting the spatial position information and contextual semantic information of the input features.
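A minimal PyTorch sketch of one C_ATB step under the description above, where the upper input has twice the spatial size of the lower input; kernel sizes and the SENet reduction ratio are assumptions:

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation channel gating, used for the second-stage fusion."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class CATB(nn.Module):
    """Dual-interaction attention block: one 2x up-sampling step of UATB."""
    def __init__(self, c_up, c_low):
        super().__init__()
        self.compress_up = nn.Conv2d(c_up, 1, 3, padding=1)            # conv
        self.compress_low = nn.ConvTranspose2d(c_low, 1, 2, stride=2)  # transposed conv
        self.to_at = nn.Conv2d(2, 1, 3, padding=1)                     # after Concat
        self.up_low = nn.ConvTranspose2d(c_low, c_up, 2, stride=2)
        self.se = SELayer(2 * c_up)
        self.out = nn.Conv2d(2 * c_up, c_up, 1)

    def forward(self, f_up, f_low):
        # f_up: (B, c_up, 2H, 2W) upper input; f_low: (B, c_low, H, W) lower input.
        # Stage 1: a shared single-channel attention AT adjusts spatial positions.
        at = torch.sigmoid(self.to_at(torch.cat(
            [self.compress_up(f_up), self.compress_low(f_low)], dim=1)))
        f_up = f_up * at
        f_low = self.up_low(f_low) * at
        # Stage 2: merge and let SENet select features at the channel level.
        return self.out(self.se(torch.cat([f_up, f_low], dim=1)))
```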
The Detection Block module serves as the functional differentiation mechanism of the ADetNet network, taking the features before the last up-sampling step of the UATB module (2x down-sampled) as shared features to generate the final class prediction cls and size regression size. For an n-class target detection task, the class prediction cls is produced by a convolution with an n-channel kernel whose n output channels are normalized by a sigmoid function, each channel corresponding to one class of detection target. The size regression size is produced by a convolution with a 2-channel kernel, the two output channels corresponding to the detection target's width W and height H.
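A minimal PyTorch sketch of the two heads as described: an n-channel convolution with sigmoid for the class prediction cls and a 2-channel convolution for the size regression; the 3 × 3 kernel size is an assumption:

```python
import torch
import torch.nn as nn

class DetectionBlock(nn.Module):
    """Prediction heads on the shared 2x-downsampled feature map."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.cls_head = nn.Conv2d(channels, num_classes, 3, padding=1)  # n channels
        self.size_head = nn.Conv2d(channels, 2, 3, padding=1)           # 2 channels

    def forward(self, shared):                      # shared: (B, C, H/2, W/2)
        cls = torch.sigmoid(self.cls_head(shared))  # per-class probability maps
        size = self.size_head(shared)               # channel 0 -> W, channel 1 -> H
        return cls, size
```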
3) Detection algorithm optimization
a) Training process
Initialization of model parameters:
in the design of the whole network structure, the segmask module and the detection module composed of the cls module and the size module share one network backbone structure backbone. For an n-class object detection network, order
Figure 85241DEST_PATH_IMAGE006
Is an input image, which is wide W and high H. Its position code generates vector
Figure DEST_PATH_IMAGE007
The branch cls module in the detection module outputs the vector
Figure DEST_PATH_IMAGE009
Detection of the branch size module output vector
Figure 73925DEST_PATH_IMAGE010
. Learning objectives are set for the above three network outputs,and coding to obtain a corresponding target vector, and performing iterative learning through a loss function.
Optimizing the network weight:
For the segmask module output, the learning target is a key-region heatmap at 1/R of the original resolution, $S \in [0,1]^{\frac{W}{R} \times \frac{H}{R}}$, where R is the output stride, here R = 8; $S_{xy} = 1$ indicates a key region where a target is present and $S_{xy} = 0$ represents background. The Seg Block encoder-decoder network is used to predict $\hat{S}$ for image I. When training the segmask key-region prediction network, let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bbox of target k (class $c_k$); the mask_gt key points are dispersed onto the heatmap S as a rectangular thermal frame covering the down-scaled box. To reduce the computational burden, a single segmask prediction $\hat{S}$ is shared across all target classes.
The training objective function is the pixel-level logistic-regression focal loss:

$$L_{seg} = \frac{-1}{N} \sum_{xy} \begin{cases} (1-\hat{S}_{xy})^{\alpha} \log(\hat{S}_{xy}) & \text{if } S_{xy} = 1 \\ (1-S_{xy})^{\beta} \, (\hat{S}_{xy})^{\alpha} \log(1-\hat{S}_{xy}) & \text{otherwise} \end{cases}$$
(2)
where α and β are the hyper-parameters of the focal loss, set to 2 and 4 respectively in the experiments, and N is the number of key regions in input I; dividing by N normalizes all the focal-loss terms.
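A minimal PyTorch sketch of the focal loss of equation (2), with α = 2 and β = 4 as stated above; the same form is reused for the cls heatmap loss below, and the function name and tensor layout are illustrative:

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss over a [0, 1] heatmap (eq. 2)."""
    pos = gt.eq(1).float()                      # key-region / keypoint pixels
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1 - eps)             # numerical safety for log()
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    n = pos.sum().clamp(min=1)                  # number of key regions N
    return (pos_loss.sum() + neg_loss.sum()) / n
```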
For the cls class branch output of the detection module, the learning target is a keypoint Gaussian heatmap $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times n}$, where R is the detection output stride; a larger output size is used here for better prediction of small targets, with down-sampling factor R = 2. $Y_{xy,c} = 1$ indicates a detected keypoint and $Y_{xy,c} = 0$ represents background, i.e. each target sets one positive sample. The ADetNet fully convolutional encoder-decoder network predicts $\hat{Y}$ for image I. Let the ground-truth (GT) keypoint of a classification target of class c lie at position $p \in R^2$ in the original image; the corresponding keypoint at low resolution (after down-sampling) is $\tilde{p} = \lfloor p / R \rfloor$. Each GT keypoint is dispersed onto the heatmap Y through the Gaussian kernel

$$Y_{xy,c} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right),$$

where $\sigma_p$ is the target-scale-adaptive standard deviation. If two Gaussian functions of the same class c overlap (the same keypoint or object class), the element-wise maximum is taken. For the regression class loss, the focal loss of pixel-level logistic regression is used as the training objective function.
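A small NumPy sketch of how a GT keypoint could be splatted onto one class heatmap with a size-adaptive Gaussian, taking the element-wise maximum where Gaussians of the same class overlap; the 3σ support window and the function name are assumptions, and the center is assumed to lie inside the map:

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat one keypoint onto an (H, W) class heatmap, keeping the
    element-wise max where Gaussians of the same class overlap."""
    h, w = heatmap.shape
    cx, cy = int(center[0]), int(center[1])
    r = max(1, int(3 * sigma))                  # 3-sigma support window (assumed)
    xs = np.arange(max(0, cx - r), min(w, cx + r + 1))
    ys = np.arange(max(0, cy - r), min(h, cy + r + 1))
    gx, gy = np.meshgrid(xs, ys)
    g = np.exp(-((gx - cx) ** 2 + (gy - cy) ** 2) / (2 * sigma ** 2))
    region = heatmap[ys[0]:ys[-1] + 1, xs[0]:xs[-1] + 1]
    np.maximum(region, g, out=region)           # overlap rule: element-wise max
    return heatmap
```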
For the size branch output of the detection module, the learning target is directly the target's width and height. Let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bbox of target k (class $c_k$), with center position $p_k = \left(\frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2}\right)$. The keypoint estimate $\hat{Y}$ is used to obtain all center points, and in addition the size $s_k = (x_2^{(k)} - x_1^{(k)}, \; y_2^{(k)} - y_1^{(k)})$ is regressed for each target k. To reduce the computational burden, a single size prediction $\hat{D} \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is shared by all target classes, and an L1 loss is applied at the center-point positions:

$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{D}_{p_k} - s_k \right|$$
(3)
To balance the three losses, each term is multiplied by a coefficient, and the overall training objective is:

$$L = \lambda_{seg} L_{seg} + \lambda_{cls} L_{cls} + \lambda_{size} L_{size}$$
(4)

During training, the entire network prediction outputs 2 × n + 2 values at each position (i.e. n keypoint classes, n key-region classes, and the size w, h), and all outputs share one fully convolutional Backbone.
b) Reasoning process
Only the detection module of the model is needed to predict the object class and object size. Let $I \in R^{W \times H \times 3}$ be an input image of width W and height H. First, through the network's forward computation, the cls class branch of the model outputs the vector $\hat{Y} \in [0,1]^{0.5W \times 0.5H \times N}$, with width 0.5 W, height 0.5 H and N the number of classes; the value at each point of the vector represents the probability that a target appears there. The size branch of the model outputs the vector $\hat{D} \in R^{0.5W \times 0.5H \times 2}$, whose two feature channels map respectively to the width and height of the detection frame at the location corresponding to the class output. Then, for the class output, prediction results with correspondingly low values are filtered out by setting a threshold. Finally, redundant detection frames are removed with the non-maximum suppression algorithm NMS, and the final detection result is output.
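A sketch of this decoding step under the description above, using torchvision's NMS for brevity; the score and IoU thresholds are illustrative, and the size map is assumed here to be predicted in input-pixel units:

```python
import torch
from torchvision.ops import nms

def decode(cls_map, size_map, score_thresh=0.3, iou_thresh=0.5, down=2):
    """cls_map: (N, H, W) per-class probabilities; size_map: (2, H, W) box (w, h),
    assumed to be in input-pixel units."""
    boxes, scores, labels = [], [], []
    for c in range(cls_map.shape[0]):
        ys, xs = torch.nonzero(cls_map[c] > score_thresh, as_tuple=True)
        for y, x in zip(ys, xs):
            w, h = size_map[0, y, x], size_map[1, y, x]
            cx, cy = x.float() * down, y.float() * down   # back to input pixels
            boxes.append(torch.stack([cx - w / 2, cy - h / 2,
                                      cx + w / 2, cy + h / 2]))
            scores.append(cls_map[c, y, x])
            labels.append(c)
    if not boxes:
        return torch.empty(0, 4), torch.empty(0), torch.empty(0, dtype=torch.long)
    boxes, scores = torch.stack(boxes), torch.stack(scores)
    keep = nms(boxes, scores, iou_thresh)                 # drop redundant frames
    return boxes[keep], scores[keep], torch.tensor(labels)[keep]
```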
The embodiments of the invention are only used to explain the technical scheme of the application. Similar substitutions made by those skilled in the art on the basis of this application, for example replacing the deep-learning anchor-free detection method with another anchor-free target detection method, or replacing the Conv modules used here for image down-sampling with other fully convolutional encoder-decoder networks, mathematical models and the like, shall fall within the protection scope of this application.

Claims (6)

1. A city street scene target detection method based on regional information enhancement, wherein the target detection network used comprises, connected in sequence, a feature fusion module I WiFPN1, a feature selection network Seg Block, a feature fusion module II WiFPN2, an up-sampling network UATB and a detection output network Detection Block, the down-sampling network Backbone comprising the feature fusion module I WiFPN1 and the feature fusion module II WiFPN2, the method comprising the steps of:
1) scene pre-processing of image data
Adding 40% daytime data to the training data, wherein data enhancement during model training includes flipping, scaling, cropping, and color brightness and chroma enhancement; normalizing the pixel size of the input image to 448 x 256;
2) network model design
Adopting an anchor-free target detection algorithm, designing down-sampling and up-sampling network structures, outputting the position code segmask with the feature selection network Seg Block, and outputting the class cls and the detection frame size with the detection output network Detection Block;
3) detection algorithm optimization
a) Training process
Initializing network model parameters, and setting a learning target, a learning rate and an attenuation coefficient; iterative learning and parameter updating are carried out on the loss function through an optimization algorithm;
b) reasoning process
Outputting the target category and the detection frame in the forward direction by utilizing the cls and the size; and filtering and outputting a final detection result by setting a threshold and a non-maximum suppression algorithm.
2. The urban street scene target detection method based on regional information enhancement as claimed in claim 1, wherein the down-sampling network Backbone extracts intermediate features, the feature selection network Seg Block optimizes the intermediate features, and the up-sampling network UATB extracts the predicted features.
3. The urban street scene target detection method based on regional information enhancement as claimed in claim 1 or 2, wherein the feature selection network Seg Block extracts regional information by supervised learning: first, a learnable variable soft with values between 0 and 1 is designed, and soft is used to perform calibrated selective fusion of the three output features of feature fusion module WiFPN1; the features are compressed to a one-channel map by a 1 × 1 convolution, and the position code segmask is output through the sigmoid activation function; a target segmask_gt whose value is 1 inside target regions is designed, the position code segmask is optimized through a loss function, and finally the optimized position code segmask is multiplied element-wise with the pixel values of the three inputs of the feature selection network Seg Block to complete the selection of the bottom features of the down-sampling network Backbone.
4. The city street scene target detection method based on regional information enhancement as claimed in claim 1 or 2, wherein the UATB is an up-sampling network with composite multi-stage semantic features, each stage feature of the down-sampling network Backbone is used as signal input, and a dual-interaction attention module C_ATB performs the operation layer by layer until a prediction feature is obtained.
5. The urban street scene target detection method based on regional information enhancement as claimed in claim 4, wherein the dual-interaction attention module C_ATB learns two-stage region information: the two input features are compressed into a shared single-channel attention feature AT through convolution, transposed convolution, Concat merging and a sigmoid activation function, then multiplied element-wise with the upper- and lower-layer input features to complete the first-stage feature interaction, realizing adjustment of the target spatial position; the intermediately adjusted features are merged and sent into an SENet network, which performs the second-stage selective fusion of the two-stage features at the channel level to complete the up-sampled output information.
6. The urban street scene target detection method based on regional information enhancement as claimed in claim 1, wherein the detection output network Detection Block takes the 2x down-sampled output as a shared feature; a convolution with an n-channel kernel is applied and the feature values of the n channels are normalized by a sigmoid function and mapped to the class cls, and a convolution with a 2-channel kernel yields two channels that are respectively mapped to the size of the detection target.
CN202110085069.9A 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement Active CN112686207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110085069.9A CN112686207B (en) 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110085069.9A CN112686207B (en) 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement

Publications (2)

Publication Number Publication Date
CN112686207A true CN112686207A (en) 2021-04-20
CN112686207B CN112686207B (en) 2024-02-27

Family

ID=75458885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110085069.9A Active CN112686207B (en) 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement

Country Status (1)

Country Link
CN (1) CN112686207B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487600A (en) * 2021-07-27 2021-10-08 大连海事大学 Characteristic enhancement scale self-adaptive sensing ship detection method
CN113837305A (en) * 2021-09-29 2021-12-24 北京百度网讯科技有限公司 Target detection and model training method, device, equipment and storage medium
CN114565860A (en) * 2022-03-01 2022-05-31 安徽大学 Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN114581798A (en) * 2022-02-18 2022-06-03 广州中科云图智能科技有限公司 Target detection method and device, flight equipment and computer readable storage medium
CN115578615A (en) * 2022-10-31 2023-01-06 成都信息工程大学 Night traffic sign image detection model establishing method based on deep learning
CN115690704A (en) * 2022-09-27 2023-02-03 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115985102A (en) * 2023-02-15 2023-04-18 湖南大学深圳研究院 Urban traffic flow prediction method and equipment based on migration contrast learning
CN116630909A (en) * 2023-06-16 2023-08-22 广东特视能智能科技有限公司 Unmanned intelligent monitoring system and method based on unmanned aerial vehicle

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304798A (en) * 2018-01-30 2018-07-20 北京同方软件股份有限公司 The event video detecting method of order in the street based on deep learning and Movement consistency
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method
WO2020224406A1 (en) * 2019-05-08 2020-11-12 腾讯科技(深圳)有限公司 Image classification method, computer readable storage medium, and computer device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304798A (en) * 2018-01-30 2018-07-20 北京同方软件股份有限公司 The event video detecting method of order in the street based on deep learning and Movement consistency
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method
WO2020224406A1 (en) * 2019-05-08 2020-11-12 腾讯科技(深圳)有限公司 Image classification method, computer readable storage medium, and computer device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
范红超; 李万志; 章超权: "Traffic Sign Detection Based on Anchor-free" (基于Anchor-free的交通标志检测), Journal of Geo-Information Science, no. 01 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487600A (en) * 2021-07-27 2021-10-08 大连海事大学 Characteristic enhancement scale self-adaptive sensing ship detection method
CN113487600B (en) * 2021-07-27 2024-05-03 大连海事大学 Feature enhancement scale self-adaptive perception ship detection method
CN113837305A (en) * 2021-09-29 2021-12-24 北京百度网讯科技有限公司 Target detection and model training method, device, equipment and storage medium
US11823437B2 (en) 2021-09-29 2023-11-21 Beijing Baidu Netcom Science Technology Co., Ltd. Target detection and model training method and apparatus, device and storage medium
CN114581798A (en) * 2022-02-18 2022-06-03 广州中科云图智能科技有限公司 Target detection method and device, flight equipment and computer readable storage medium
CN114565860A (en) * 2022-03-01 2022-05-31 安徽大学 Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN115690704A (en) * 2022-09-27 2023-02-03 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115578615A (en) * 2022-10-31 2023-01-06 成都信息工程大学 Night traffic sign image detection model establishing method based on deep learning
CN115985102A (en) * 2023-02-15 2023-04-18 湖南大学深圳研究院 Urban traffic flow prediction method and equipment based on migration contrast learning
CN116630909A (en) * 2023-06-16 2023-08-22 广东特视能智能科技有限公司 Unmanned intelligent monitoring system and method based on unmanned aerial vehicle
CN116630909B (en) * 2023-06-16 2024-02-02 广东特视能智能科技有限公司 Unmanned intelligent monitoring system and method based on unmanned aerial vehicle

Also Published As

Publication number Publication date
CN112686207B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN112686207B (en) Urban street scene target detection method based on regional information enhancement
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN113902915B (en) Semantic segmentation method and system based on low-light complex road scene
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN113158738A (en) Port environment target detection method, system, terminal and readable storage medium based on attention mechanism
CN114565860A (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN110490155B (en) Method for detecting unmanned aerial vehicle in no-fly airspace
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
CN112651423A (en) Intelligent vision system
CN114821018B (en) Infrared dim target detection method for constructing convolutional neural network by utilizing multidirectional characteristics
CN112883887B (en) Building instance automatic extraction method based on high spatial resolution optical remote sensing image
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN114220126A (en) Target detection system and acquisition method
CN112560865A (en) Semantic segmentation method for point cloud under outdoor large scene
CN116258940A (en) Small target detection method for multi-scale features and self-adaptive weights
CN117157679A (en) Perception network, training method of perception network, object recognition method and device
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN115527096A (en) Small target detection method based on improved YOLOv5
Pashaei et al. Fully convolutional neural network for land cover mapping in a coastal wetland with hyperspatial UAS imagery
CN116977866A (en) Lightweight landslide detection method
CN115731517A (en) Crowd detection method based on Crowd-RetinaNet network
CN113642676B (en) Regional power grid load prediction method and device based on heterogeneous meteorological data fusion
CN113763356A (en) Target detection method based on visible light and infrared image fusion
Zhu et al. Small target detection algorithm based on multi-target detection head and attention mechanism
Alshammari et al. Multi-task learning for automotive foggy scene understanding via domain adaptation to an illumination-invariant representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant