CN112686207A - Urban street scene target detection method based on regional information enhancement - Google Patents

Urban street scene target detection method based on regional information enhancement

Info

Publication number
CN112686207A
Authority
CN
China
Prior art keywords
detection
network
target
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110085069.9A
Other languages
Chinese (zh)
Other versions
CN112686207B (en)
Inventor
张逞逞
郑全新
宁志勇
张磊
刘阳
董小栋
孟祥松
刘婷婷
江龙
郭俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tongfang Software Co Ltd
Original Assignee
Beijing Tongfang Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tongfang Software Co Ltd filed Critical Beijing Tongfang Software Co Ltd
Priority to CN202110085069.9A priority Critical patent/CN112686207B/en
Publication of CN112686207A publication Critical patent/CN112686207A/en
Application granted granted Critical
Publication of CN112686207B publication Critical patent/CN112686207B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

A city street scene target detection method based on regional information enhancement, relating to the fields of artificial intelligence and computer vision. The method comprises the following steps: 1) adding daytime data to the training data; 2) outputting a target position code segmask from the feature selection network, and outputting the category prediction module cls and the size regression module size from the detection output network Detection Block; 3) optimizing the detection algorithm: a) initializing the network model parameters; b) outputting the target category and detection frame in the forward pass, then filtering and outputting the final detection result. The invention designs a target detection deep learning network, trains an urban street scene detection model, and, combined with the static-frame target detection and dynamic-video target behavior analysis technologies of an intelligent video analysis system, builds a detection system suited to various events in full daytime and nighttime scenes, thereby completing the automatic detection of illegal events accurately and rapidly while effectively avoiding false detections and missed detections of targets.

Description

Urban street scene target detection method based on regional information enhancement
Technical Field
The invention relates to the fields of artificial intelligence and computer vision, and in particular to an intelligent target detection method based on image processing and video analysis technology, applied to camera monitoring scenes in urban streets.
Background
With the development of modern science and technology, cameras are used to realize efficient city supervision and are applied in city management work, helping city managers play a vital role in handling complex urban street emergencies. Currently, more and more researchers are focusing on automating the administration of urban street scenes. The purpose of urban scene visual supervision is to improve the interpretability of scene monitoring images and to give an intelligent urban management system the ability to correctly understand scene information, so as to improve the safety of urban streets, parking lots and communities. Meanwhile, cameras in night scenes are affected by uncontrollable factors such as bad weather and low illuminance, so common target detection methods cannot meet all-weather, full-scene supervision requirements.
In the field of computer vision, two different detection paradigms, anchor-based and anchor-free, can be used for target detection in complex scenes. The paper "CenterNet: Objects as Points", published in 2019 by Xingyi Zhou et al., for the first time truly converts the category regression of the target detection problem into finding the target center point: the detector uses Gaussian heat points as a keypoint estimate to fit the target center point, turning target detection into a standard keypoint estimation problem, from which other target attributes such as size, 3D position, orientation and even pose can be derived. Compared with BBox-based detectors, the model is end-to-end differentiable and the detection process is simpler, faster and more accurate, achieving the best balance between detection speed and accuracy. The Gaussian heat point idea therefore shows clear advantages for problems such as target segmentation and target tracking. The paper "See Clearer at Night: Towards Robust Nighttime Semantic Segmentation through Day-Night Image Conversion", published in 2019 by Kailun Yang et al., proposes using a generative adversarial network (GAN) to alleviate the low accuracy of semantic segmentation models applied in nighttime environments. The GAN-based nighttime semantic segmentation framework includes two approaches. The first uses a GAN to convert night images into day images and performs semantic segmentation with a robust model already trained on the daytime dataset; the second converts the daytime images in the dataset into nighttime images with a GAN to produce a model that is very robust under nighttime conditions and predicts the nighttime images directly. In the paper's experiments, the second method significantly improves the model's segmentation performance on night images. The method not only benefits the optimization of intelligent vehicles' visual perception, but can also be applied to various navigation assistance systems. The paper "A night target identification method based on infrared thermal imaging and YOLOv3", published in 2019 by Yishi et al., notes that infrared thermal images reflect object temperature information, are little affected by environmental conditions, and under specific conditions have strong application value in night security monitoring, driving assistance, shipping, military reconnaissance and the like. In recent years, the technology of detecting and identifying targets in images with artificial intelligence has advanced greatly and is widely applied in many fields. That paper proposes a night target identification method combining infrared thermal imaging image processing with artificial-intelligence target identification: a thermal imaging video is collected in real time and preprocessed to enhance its contrast and details, then specific targets in the processed thermal images are detected with the deep-learning target detection framework YOLOv3, and the detection results are output.
The test results show that the method achieves a high night target recognition rate and strong real-time performance; combining the advantages of infrared thermal imaging night monitoring with artificial-intelligence target detection, it has great application value for night target recognition and tracking.
In summary, for target detection in urban management scenes, reasonable image preprocessing combined with a well-designed target detection algorithm is an effective approach. However, while daytime target detection is largely solved, nighttime detection in complex scenes still has shortcomings: complex backgrounds easily cause false detections, missed detections and the like. How to improve detection capability and reduce false detections therefore remains a hot topic in complex-scene target detection research.
Meanwhile, existing detection algorithms also have shortcomings:
Traditional algorithms struggle to meet the requirements of recognizing and understanding urban scene surveillance, mainly because the color of day and night scenes differs greatly, scene complexity is very high, target edges are blurred, targets are occluded, and so on. These factors demand extremely strong generalization ability and accuracy from the algorithm; traditional algorithms cannot meet the target detection requirements even on a theoretical basis, and false and missed detections are difficult to eliminate in field deployment.
Deep learning algorithms with strong generalization ability can be designed to handle the varied forms of targets at night, but this also places high demands on network model design. In addition, most common deep learning algorithms are applied to daytime target detection with calibrated training samples, while static features of targets in nighttime surveillance scenes are sparse, so false and missed detections occur easily.
Disclosure of Invention
In view of the above shortcomings in the prior art, an object of the present invention is to provide a target detection network based on regional information enhancement and a detection method using it. The invention designs a target detection deep learning network, trains an urban street scene detection model, and, combined with the static-frame target detection and dynamic-video target behavior analysis technologies of an intelligent video analysis system, builds a detection system suited to various events in full daytime and nighttime scenes, thereby completing the automatic detection of illegal events accurately and rapidly while effectively avoiding false detections and missed detections of targets.
In order to achieve the above object, the technical solution of the present invention is implemented as follows:
a city street scene target Detection method based on regional information enhancement uses a target Detection network which comprises a first WiFPN1, a feature selection network Seg Block, a second WiFPN2, an up-sampling network UATB and a Detection output network Detection Block which are connected in sequence. The downsampling network backhaul comprises a feature fusion module I WiFPN1 and a feature fusion module II WiFPN 2. The method comprises the following steps:
1) scene pre-processing of image data
Add 40% daytime data to the training data; data enhancement during model training includes flipping, scaling, cropping, and color brightness and chroma enhancement; the pixel size of the input image is normalized to 448 x 256.
2) Network model design
An anchor-free target detection algorithm is adopted and a down-sampling and up-sampling network structure is designed; the feature selection network Seg Block outputs the target position code segmask, and the detection output network Detection Block outputs the category prediction module cls and the size regression module size.
3) Detection algorithm optimization
a) Training process
Initialize the network model parameters and set the learning target, learning rate and attenuation coefficient; perform iterative learning and parameter updates on the loss function through an optimization algorithm.
b) Reasoning process
Using cls and size, output the target category and detection frame in the forward pass; filter and output the final detection result by setting a threshold and applying a non-maximum suppression algorithm.
In the method for detecting the urban street scene target, the downsampling network Backbone extracts intermediate features, the feature selection network Seg Block optimizes the intermediate features, and the upsampling network UATB extracts predicted features.
In the above method for detecting urban street scene targets, the feature selection network Seg Block uses regional information for supervised learning. First, a learnable variable soft with values between 0 and 1 is designed, and soft is used to perform calibrated selective fusion of the three output features of feature fusion module WiFPN1; the features are compressed to a one-channel map by a 1 × 1 convolution, and the target position code segmask is output through the sigmoid activation function; a target segmask_gt whose value is 1 inside target regions is designed, the target position code segmask is optimized through a loss function, and finally the optimized segmask is multiplied element-wise with the three inputs of the feature selection network Seg Block to complete the selection of the bottom features of the down-sampling network Backbone.
In the method for detecting urban street scene targets, the UATB is an up-sampling network with composite multi-stage semantic features; each stage feature of the down-sampling network Backbone is used as signal input, and a dual-interaction attention module C_ATB performs the operation layer by layer until a prediction feature is obtained.
In the above method for detecting an object in a city street scene, the dual-interaction attention module C_ATB learns two-stage region information. The two input features are compressed into a shared single-channel attention feature AT through convolution, transposed convolution, Concat merging and a sigmoid activation function, then multiplied element-wise with the upper- and lower-layer input features to complete the first-stage feature interaction, realizing adjustment of the target spatial position; the intermediately adjusted features are merged and sent into an SENet network, which performs the second-stage selective fusion of the two-stage features at the channel level to complete the up-sampled output information.
In the method for detecting urban street scene targets, the detection output network Detection Block takes the 2x down-sampled output as a shared feature; a convolution with an n-channel kernel is applied and the feature values of the n channels are normalized by a sigmoid function and mapped to the category prediction module cls, and a convolution with a 2-channel kernel yields two channels that are respectively mapped to the size regression module size, corresponding to the size of the detection target.
By adopting the above detection method, compared with the prior art, the invention has the following advantages:
1. A network structure is designed based on the anchor-free target detection method, and the 2x down-sampled output is used for prediction, improving the detection of small targets.
2. A weakly supervised method attends to the spatial information of features, enhancing attention to the position of targets in the image.
3. Two WiFPN modules effectively and independently integrate shallow and deep features.
4. A simple UATB module completes the up-sampling, effectively using the feature information of each pyramid input layer for prediction.
The invention is further described with reference to the following figures and detailed description.
Drawings
FIG. 1 is a schematic diagram of a detection network used in the method of the present invention;
FIG. 2 is a schematic structural diagram of a Seg Block module in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a C _ ATB module according to an embodiment of the present invention.
Detailed Description
Referring to fig. 1, in the city street scene target detection method based on regional information enhancement of the present invention, the target detection network used comprises, connected in sequence, a first feature fusion module WiFPN1, a feature selection network Seg Block, a second feature fusion module WiFPN2, an up-sampling network UATB and a detection output network Detection Block. The method extracts intermediate features through the down-sampling network Backbone, which comprises feature fusion module I WiFPN1 and feature fusion module II WiFPN2: WiFPN1 extracts shallow network features and WiFPN2 extracts deep network features. A feature selection network Seg Block is designed to optimize the intermediate features, and an up-sampling network UATB is designed to extract the prediction features. The input video signal passes through the Backbone, Seg Block, UATB and Detection Block in sequence to output segmask, cls and size.
The invention relates to a city street scene target detection method based on regional information enhancement, which comprises the following steps:
1) scene pre-processing of image data
The pixel size of images in the urban scene is 1920 x 1080; to reduce severe deformation of small targets caused by excessive changes of the network input scale, the image is normalized to 448 x 256 pixels. To supplement nighttime target color and texture features, make the network model more robust to color interference, and reduce to some extent the influence of missing nighttime target features, 40% daytime data is added to the training data. Besides conventional flipping, scaling and cropping, data enhancement during training adds color brightness and chroma enhancement.
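For illustration, a minimal preprocessing sketch under the description above, assuming an OpenCV/NumPy pipeline; the jitter ranges and the `build_training_list` mixing helper are assumptions, not values from the patent:

```python
import random
import cv2
import numpy as np

INPUT_W, INPUT_H = 448, 256  # network input size stated in the patent

def preprocess(image_bgr):
    """Resize a 1920x1080 street-scene frame to the 448x256 network input."""
    return cv2.resize(image_bgr, (INPUT_W, INPUT_H), interpolation=cv2.INTER_LINEAR)

def augment(image_bgr):
    """Illustrative flip plus brightness/chroma jitter (ranges are assumptions)."""
    img = image_bgr
    if random.random() < 0.5:                       # horizontal flip
        img = cv2.flip(img, 1)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] *= random.uniform(0.7, 1.3)         # brightness (V channel)
    hsv[..., 1] *= random.uniform(0.7, 1.3)         # chroma (S channel)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def build_training_list(night_frames, day_frames, day_ratio=0.4):
    """Mix roughly 40% daytime frames into the night-dominated training set."""
    n_day = int(len(night_frames) * day_ratio / (1.0 - day_ratio))
    return night_frames + random.sample(day_frames, min(n_day, len(day_frames)))
```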
2) Network model design
In the invention, to solve the false detection and missed detection problems of street scene target detection, a new target detection network structure, ADetNet, is designed. ADetNet is an anchor-free target detection algorithm; it accelerates network convergence and increases the detection rate and accuracy of targets through three modules: the target position code segmask, the category prediction module cls and the size regression module size. The network structure comprises one Input and three outputs: segmask, cls and size. A Seg Block branch is added on top of the WiFPN1 module of the ADetNet network, outputting segmask information at 8x down-sampling; after the Detection Block module of the ADetNet network, the 2x down-sampled target class information cls and the 2x down-sampled detection frame information size are output.
Referring to fig. 1, the signal input represents the input of the ADetNet network; segmask, cls and size represent its three outputs; Conv represents its forward operation structure; Feature represents an intermediate operation result; the dashed box WiFPN1 is its shallow feature fusion module, the dashed box WiFPN2 its deep feature fusion module, and the dashed box UATB its attention up-sampling module. In the network structure, each downward operation performs one feature down-sampling and each upward operation performs one feature up-sampling; the ADetNet network uses 5 Conv stages to reduce the feature map size by 32x, each Conv completing one 2x down-sampling.
Here, Conv denotes the convolution operation, Add the element-wise addition operation, and Max pooling the maximum pooling operation.
The WiFPN module is a feature fusion module proposed in the EfficientDet paper, intended to deliver better upper- and lower-layer semantic features of the network to the detection part. The WiFPN module realizes weight-enhanced feature fusion of information features of different sizes, so that each layer of the down-sampling network performs a weighted operation on semantic features at a small computational cost. A WiFPN module with three inputs and three outputs is therefore built to organically integrate shallow and deep features: input features of different resolutions are fused, and an extra weight is added for each input during fusion so that the network learns the importance of each input feature. The output O of each layer is given by equation (1).
$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$$
(1)
where $I_i$ denotes the inputs of the three-layer WiFPN module and $w_i$ are learnable weights, which may be scalars (per feature), vectors (per channel), or multidimensional tensors (per pixel). A ReLU is applied after each $w_i$ to keep it non-negative, and setting $\epsilon$ to a small constant (e.g. 0.0001 as in EfficientDet) avoids numerical instability. Each normalized weight likewise lies between 0 and 1, and because no softmax operation is used here, this fusion is more efficient.
To further improve the fusion effect, separable depthwise convolutions are used for feature fusion, with batch normalization and activation added after each convolution. The WiFPN1 module performs attention enhancement on the 2x to 8x down-sampled network output features, optimizing several different-resolution input features of the shallow network and delivering better spatial semantic information to the deep network. The WiFPN2 module performs attention enhancement on the 8x to 32x down-sampled network output features, optimizing the deep network's multi-resolution inputs and accumulating better receptive-field information for the up-sampling operation. The up-sampling step itself can be implemented in various ways according to requirements: (1) deconvolution, (2) linear up-sampling, (3) linear up-sampling combined with a 1 × 1 convolution.
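A minimal PyTorch sketch of the fast normalized weighted fusion of equation (1), in the style of EfficientDet; treating the learnable weights as scalars (one per input) is just one of the options named above:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse K same-shape feature maps with learnable non-negative weights:
       O = sum_i w_i * I_i / (eps + sum_j w_j), w_i = relu(theta_i)  -- eq. (1)."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)        # keep weights >= 0
        w = w / (self.eps + w.sum())        # fast normalization, no softmax
        return sum(wi * x for wi, x in zip(w, inputs))

# usage: fuse three equally sized WiFPN1 levels
fuse = WeightedFusion(num_inputs=3)
feats = [torch.randn(1, 64, 56, 32) for _ in range(3)]
out = fuse(feats)                           # shape (1, 64, 56, 32)
```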
The shallow feature map carries rich semantic information. To fully mine texture features beneficial to target detection, weakly supervised segmentation is used as a branch to complete local feature enhancement. The method takes the output feature map of module WiFPN1 and the bounding-box-level segmentation ground truth as input, generates a semantic feature mapping mask of the same dimension, multiplies the WiFPN1 output feature map element-wise by this mask to obtain the local features to attend to, and finally weights and sums the local features with the underlying feature map using learnable weights before passing them downward. Based on this idea, the Seg Block module is designed, as shown in fig. 2.
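A minimal PyTorch sketch of the Seg Block logic as described (a learnable 0 to 1 gate soft, 1 × 1 compression to a single channel, sigmoid position code, element-wise re-weighting of the three inputs); equal channel counts and the exact fusion order are assumptions, since fig. 2 is not reproduced here:

```python
import torch
import torch.nn as nn

class SegBlock(nn.Module):
    """Weakly supervised region-attention branch over three WiFPN1 outputs."""
    def __init__(self, channels):
        super().__init__()
        # learnable selection gate, squashed into (0, 1) by a sigmoid
        self.soft = nn.Parameter(torch.zeros(3))
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 -> one channel

    def forward(self, feats):               # feats: three (B, C, H, W) maps
        gate = torch.sigmoid(self.soft)
        fused = sum(g * f for g, f in zip(gate, feats))       # calibrated fusion
        segmask = torch.sigmoid(self.to_mask(fused))          # target position code
        # multiply the mask back onto each input to select foreground features
        selected = [f * segmask for f in feats]
        return segmask, selected
```

During training, segmask would be driven toward the box-level target segmask_gt (value 1 inside object boxes) with the focal loss of equation (2) below.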
Unlike the CenterNet algorithm, which uses the 4x down-sampled feature output as the prediction layer, and considering that the 2x down-sampled feature output is more favorable for capturing feature information of small targets (less than 50 x 50 pixels), a novel up-sampling network UATB with composite multi-stage image semantic features is designed, built by combining C_ATB sub-modules (dual-interaction attention modules). Starting from the 32x down-sampling layer, a C_ATB module performs the up-sampling operation layer by layer on the down-sampled output features of each ADetNet stage until the 2x down-sampled feature is produced and used as the input of the Detection Block (prediction) module. A further role of the UATB module is to let the outputs of the two WiFPN modules flow and exchange feature information through the up-sampling process.
Fig. 3 abstractly shows the size change of each feature in the C_ATB network. As shown in the figure, the upper-layer input's length and width are twice those of the lower-layer input. The two input features are compressed into a shared single-channel attention feature AT through convolution, transposed convolution, Concat merging and a sigmoid activation function, then multiplied element-wise with the upper- and lower-layer input features to complete the first-stage feature interaction, realizing adjustment of the target spatial position. The intermediately adjusted features are merged and sent into an SENet network, which performs the second-stage selective fusion of the two-stage features at the channel level to complete the up-sampled output. The C_ATB network learns different weights over spatial positions and channels at different stages, adaptively adjusting the spatial position information and contextual semantic information of the input features.
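A minimal PyTorch sketch of one C_ATB step under the description above, where the upper input has twice the spatial size of the lower input; kernel sizes and the SENet reduction ratio are assumptions:

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation channel gating, used for the second-stage fusion."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class CATB(nn.Module):
    """Dual-interaction attention block: one 2x up-sampling step of UATB."""
    def __init__(self, c_up, c_low):
        super().__init__()
        self.compress_up = nn.Conv2d(c_up, 1, 3, padding=1)            # conv
        self.compress_low = nn.ConvTranspose2d(c_low, 1, 2, stride=2)  # transposed conv
        self.to_at = nn.Conv2d(2, 1, 3, padding=1)                     # after Concat
        self.up_low = nn.ConvTranspose2d(c_low, c_up, 2, stride=2)
        self.se = SELayer(2 * c_up)
        self.out = nn.Conv2d(2 * c_up, c_up, 1)

    def forward(self, f_up, f_low):
        # f_up: (B, c_up, 2H, 2W) upper input; f_low: (B, c_low, H, W) lower input.
        # Stage 1: a shared single-channel attention AT adjusts spatial positions.
        at = torch.sigmoid(self.to_at(torch.cat(
            [self.compress_up(f_up), self.compress_low(f_low)], dim=1)))
        f_up = f_up * at
        f_low = self.up_low(f_low) * at
        # Stage 2: merge and let SENet select features at the channel level.
        return self.out(self.se(torch.cat([f_up, f_low], dim=1)))
```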
The Detection Block module serves as the functional differentiation mechanism of the ADetNet network, taking the features before the last up-sampling step of the UATB module (2x down-sampled) as shared features to generate the final class prediction cls and size regression size. For an n-class target detection task, the class prediction cls is produced by a convolution with an n-channel kernel whose n output channels are normalized by a sigmoid function, each channel corresponding to one class of detection target. The size regression size is produced by a convolution with a 2-channel kernel, the two output channels corresponding to the detection target's width W and height H.
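A minimal PyTorch sketch of the two heads as described: an n-channel convolution with sigmoid for the class prediction cls and a 2-channel convolution for the size regression; the 3 × 3 kernel size is an assumption:

```python
import torch
import torch.nn as nn

class DetectionBlock(nn.Module):
    """Prediction heads on the shared 2x-downsampled feature map."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.cls_head = nn.Conv2d(channels, num_classes, 3, padding=1)  # n channels
        self.size_head = nn.Conv2d(channels, 2, 3, padding=1)           # 2 channels

    def forward(self, shared):                      # shared: (B, C, H/2, W/2)
        cls = torch.sigmoid(self.cls_head(shared))  # per-class probability maps
        size = self.size_head(shared)               # channel 0 -> W, channel 1 -> H
        return cls, size
```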
3) Detection algorithm optimization
a) Training process
Initialization of model parameters:
in the design of the whole network structure, the segmask module and the detection module composed of the cls module and the size module share one network backbone structure backbone. For an n-class object detection network, order
Figure 85241DEST_PATH_IMAGE006
Is an input image, which is wide W and high H. Its position code generates vector
Figure DEST_PATH_IMAGE007
The branch cls module in the detection module outputs the vector
Figure DEST_PATH_IMAGE009
Detection of the branch size module output vector
Figure 73925DEST_PATH_IMAGE010
. Learning objectives are set for the above three network outputs,and coding to obtain a corresponding target vector, and performing iterative learning through a loss function.
Optimizing the network weight:
For the segmask module output, the learning target is a key-region heatmap at 1/R of the original resolution, $S \in [0,1]^{\frac{W}{R} \times \frac{H}{R}}$, where R is the output stride, here R = 8; $S_{xy} = 1$ indicates a key region where a target is present and $S_{xy} = 0$ represents background. The Seg Block encoder-decoder network is used to predict $\hat{S}$ for image I. When training the segmask key-region prediction network, let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bbox of target k (class $c_k$); the mask_gt key points are dispersed onto the heatmap S as a rectangular thermal frame covering the down-scaled box. To reduce the computational burden, a single segmask prediction $\hat{S}$ is shared across all target classes.
The training objective function is the pixel-level logistic-regression focal loss:

$$L_{seg} = \frac{-1}{N} \sum_{xy} \begin{cases} (1-\hat{S}_{xy})^{\alpha} \log(\hat{S}_{xy}) & \text{if } S_{xy} = 1 \\ (1-S_{xy})^{\beta} \, (\hat{S}_{xy})^{\alpha} \log(1-\hat{S}_{xy}) & \text{otherwise} \end{cases}$$
(2)
where α and β are the hyper-parameters of the focal loss, set to 2 and 4 respectively in the experiments, and N is the number of key regions in input I; dividing by N normalizes all the focal-loss terms.
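A minimal PyTorch sketch of the focal loss of equation (2), with α = 2 and β = 4 as stated above; the same form is reused for the cls heatmap loss below, and the function name and tensor layout are illustrative:

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss over a [0, 1] heatmap (eq. 2)."""
    pos = gt.eq(1).float()                      # key-region / keypoint pixels
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1 - eps)             # numerical safety for log()
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    n = pos.sum().clamp(min=1)                  # number of key regions N
    return (pos_loss.sum() + neg_loss.sum()) / n
```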
For the cls class branch output of the detection module, the learning target is a keypoint Gaussian heatmap $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times n}$, where R is the detection output stride; a larger output size is used here for better prediction of small targets, with down-sampling factor R = 2. $Y_{xy,c} = 1$ indicates a detected keypoint and $Y_{xy,c} = 0$ represents background, i.e. each target sets one positive sample. The ADetNet fully convolutional encoder-decoder network predicts $\hat{Y}$ for image I. Let the ground-truth (GT) keypoint of a classification target of class c lie at position $p \in R^2$ in the original image; the corresponding keypoint at low resolution (after down-sampling) is $\tilde{p} = \lfloor p / R \rfloor$. Each GT keypoint is dispersed onto the heatmap Y through the Gaussian kernel

$$Y_{xy,c} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right),$$

where $\sigma_p$ is the target-scale-adaptive standard deviation. If two Gaussian functions of the same class c overlap (the same keypoint or object class), the element-wise maximum is taken. For the regression class loss, the focal loss of pixel-level logistic regression is used as the training objective function.
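A small NumPy sketch of how a GT keypoint could be splatted onto one class heatmap with a size-adaptive Gaussian, taking the element-wise maximum where Gaussians of the same class overlap; the 3σ support window and the function name are assumptions, and the center is assumed to lie inside the map:

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat one keypoint onto an (H, W) class heatmap, keeping the
    element-wise max where Gaussians of the same class overlap."""
    h, w = heatmap.shape
    cx, cy = int(center[0]), int(center[1])
    r = max(1, int(3 * sigma))                  # 3-sigma support window (assumed)
    xs = np.arange(max(0, cx - r), min(w, cx + r + 1))
    ys = np.arange(max(0, cy - r), min(h, cy + r + 1))
    gx, gy = np.meshgrid(xs, ys)
    g = np.exp(-((gx - cx) ** 2 + (gy - cy) ** 2) / (2 * sigma ** 2))
    region = heatmap[ys[0]:ys[-1] + 1, xs[0]:xs[-1] + 1]
    np.maximum(region, g, out=region)           # overlap rule: element-wise max
    return heatmap
```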
For the size branch output of the detection module, the learning target is directly the target's width and height. Let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bbox of target k (class $c_k$), with center position $p_k = \left(\frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2}\right)$. The keypoint estimate $\hat{Y}$ is used to obtain all center points, and in addition the size $s_k = (x_2^{(k)} - x_1^{(k)}, \; y_2^{(k)} - y_1^{(k)})$ is regressed for each target k. To reduce the computational burden, a single size prediction $\hat{D} \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is shared by all target classes, and an L1 loss is applied at the center-point positions:

$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{D}_{p_k} - s_k \right|$$
(3)
To balance the three losses, each term is multiplied by a coefficient, and the overall training objective is:

$$L = \lambda_{seg} L_{seg} + \lambda_{cls} L_{cls} + \lambda_{size} L_{size}$$
(4)

During training, the entire network prediction outputs 2 × n + 2 values at each position (i.e. n keypoint classes, n key-region classes, and the size w, h), and all outputs share one fully convolutional Backbone.
b) Reasoning process
Only the detection module of the model is needed to predict the object class and object size. Let $I \in R^{W \times H \times 3}$ be an input image of width W and height H. First, through the network's forward computation, the cls class branch of the model outputs the vector $\hat{Y} \in [0,1]^{0.5W \times 0.5H \times N}$, with width 0.5 W, height 0.5 H and N the number of classes; the value at each point of the vector represents the probability that a target appears there. The size branch of the model outputs the vector $\hat{D} \in R^{0.5W \times 0.5H \times 2}$, whose two feature channels map respectively to the width and height of the detection frame at the location corresponding to the class output. Then, for the class output, prediction results with correspondingly low values are filtered out by setting a threshold. Finally, redundant detection frames are removed with the non-maximum suppression algorithm NMS, and the final detection result is output.
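A sketch of this decoding step under the description above, using torchvision's NMS for brevity; the score and IoU thresholds are illustrative, and the size map is assumed here to be predicted in input-pixel units:

```python
import torch
from torchvision.ops import nms

def decode(cls_map, size_map, score_thresh=0.3, iou_thresh=0.5, down=2):
    """cls_map: (N, H, W) per-class probabilities; size_map: (2, H, W) box (w, h),
    assumed to be in input-pixel units."""
    boxes, scores, labels = [], [], []
    for c in range(cls_map.shape[0]):
        ys, xs = torch.nonzero(cls_map[c] > score_thresh, as_tuple=True)
        for y, x in zip(ys, xs):
            w, h = size_map[0, y, x], size_map[1, y, x]
            cx, cy = x.float() * down, y.float() * down   # back to input pixels
            boxes.append(torch.stack([cx - w / 2, cy - h / 2,
                                      cx + w / 2, cy + h / 2]))
            scores.append(cls_map[c, y, x])
            labels.append(c)
    if not boxes:
        return torch.empty(0, 4), torch.empty(0), torch.empty(0, dtype=torch.long)
    boxes, scores = torch.stack(boxes), torch.stack(scores)
    keep = nms(boxes, scores, iou_thresh)                 # drop redundant frames
    return boxes[keep], scores[keep], torch.tensor(labels)[keep]
```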
The embodiments of the invention are only used to explain the technical scheme of the application. Similar substitutions made by those skilled in the art on the basis of this application, for example replacing the deep-learning anchor-free detection method with another anchor-free target detection method, or replacing the Conv modules used here for image down-sampling with other fully convolutional encoder-decoder networks, mathematical models and the like, shall fall within the protection scope of this application.

Claims (6)

1. A city street scene target detection method based on regional information enhancement, wherein the target detection network used comprises, connected in sequence, a feature fusion module I WiFPN1, a feature selection network Seg Block, a feature fusion module II WiFPN2, an up-sampling network UATB and a detection output network Detection Block, the down-sampling network Backbone comprising the feature fusion module I WiFPN1 and the feature fusion module II WiFPN2, the method comprising the steps of:
1) scene pre-processing of image data
Adding 40% daytime data to the training data, wherein data enhancement during model training includes flipping, scaling, cropping, and color brightness and chroma enhancement; normalizing the pixel size of the input image to 448 x 256;
2) network model design
Adopting an anchor-free target detection algorithm, designing down-sampling and up-sampling network structures, outputting the position code segmask with the feature selection network Seg Block, and outputting the class cls and the detection frame size with the detection output network Detection Block;
3) detection algorithm optimization
a) Training process
Initializing network model parameters, and setting a learning target, a learning rate and an attenuation coefficient; iterative learning and parameter updating are carried out on the loss function through an optimization algorithm;
b) reasoning process
Outputting the target category and the detection frame in the forward direction by utilizing the cls and the size; and filtering and outputting a final detection result by setting a threshold and a non-maximum suppression algorithm.
2. The urban street scene target detection method based on regional information enhancement as claimed in claim 1, wherein the down-sampling network Backbone extracts intermediate features, the feature selection network Seg Block optimizes the intermediate features, and the up-sampling network UATB extracts the predicted features.
3. The urban street scene target detection method based on regional information enhancement as claimed in claim 1 or 2, wherein the feature selection network Seg Block extracts regional information by supervised learning: first, a learnable variable soft with values between 0 and 1 is designed, and soft is used to perform calibrated selective fusion of the three output features of feature fusion module WiFPN1; the features are compressed to a one-channel map by a 1 × 1 convolution, and the position code segmask is output through the sigmoid activation function; a target segmask_gt whose value is 1 inside target regions is designed, the position code segmask is optimized through a loss function, and finally the optimized position code segmask is multiplied element-wise with the pixel values of the three inputs of the feature selection network Seg Block to complete the selection of the bottom features of the down-sampling network Backbone.
4. The city street scene target detection method based on regional information enhancement as claimed in claim 1 or 2, wherein the UATB is an up-sampling network with composite multi-stage semantic features, each stage feature of the down-sampling network Backbone is used as signal input, and a dual-interaction attention module C_ATB performs the operation layer by layer until a prediction feature is obtained.
5. The urban street scene target detection method based on regional information enhancement as claimed in claim 4, wherein the dual-interaction attention module C_ATB learns two-stage region information: the two input features are compressed into a shared single-channel attention feature AT through convolution, transposed convolution, Concat merging and a sigmoid activation function, then multiplied element-wise with the upper- and lower-layer input features to complete the first-stage feature interaction, realizing adjustment of the target spatial position; the intermediately adjusted features are merged and sent into an SENet network, which performs the second-stage selective fusion of the two-stage features at the channel level to complete the up-sampled output information.
6. The urban street scene target detection method based on regional information enhancement as claimed in claim 1, wherein the detection output network Detection Block takes the 2x down-sampled output as a shared feature; a convolution with an n-channel kernel is applied and the feature values of the n channels are normalized by a sigmoid function and mapped to the class cls, and a convolution with a 2-channel kernel yields two channels that are respectively mapped to the size of the detection target.
CN202110085069.9A 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement Active CN112686207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110085069.9A CN112686207B (en) 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110085069.9A CN112686207B (en) 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement

Publications (2)

Publication Number Publication Date
CN112686207A true CN112686207A (en) 2021-04-20
CN112686207B CN112686207B (en) 2024-02-27

Family

ID=75458885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110085069.9A Active CN112686207B (en) 2021-01-22 2021-01-22 Urban street scene target detection method based on regional information enhancement

Country Status (1)

Country Link
CN (1) CN112686207B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487600A (en) * 2021-07-27 2021-10-08 大连海事大学 Characteristic enhancement scale self-adaptive sensing ship detection method
CN113837305A (en) * 2021-09-29 2021-12-24 北京百度网讯科技有限公司 Target detection and model training method, device, equipment and storage medium
CN114565860A (en) * 2022-03-01 2022-05-31 安徽大学 Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN114581798A (en) * 2022-02-18 2022-06-03 广州中科云图智能科技有限公司 Target detection method and device, flight equipment and computer readable storage medium
CN115578615A (en) * 2022-10-31 2023-01-06 成都信息工程大学 Night traffic sign image detection model establishing method based on deep learning
CN115690704A (en) * 2022-09-27 2023-02-03 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115985102A (en) * 2023-02-15 2023-04-18 湖南大学深圳研究院 Urban traffic flow prediction method and equipment based on migration contrast learning
CN116630909A (en) * 2023-06-16 2023-08-22 广东特视能智能科技有限公司 Unmanned intelligent monitoring system and method based on unmanned aerial vehicle

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304798A (en) * 2018-01-30 2018-07-20 北京同方软件股份有限公司 The event video detecting method of order in the street based on deep learning and Movement consistency
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method
WO2020224406A1 (en) * 2019-05-08 2020-11-12 腾讯科技(深圳)有限公司 Image classification method, computer readable storage medium, and computer device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304798A (en) * 2018-01-30 2018-07-20 北京同方软件股份有限公司 The event video detecting method of order in the street based on deep learning and Movement consistency
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method
WO2020224406A1 (en) * 2019-05-08 2020-11-12 腾讯科技(深圳)有限公司 Image classification method, computer readable storage medium, and computer device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
范红超; 李万志; 章超权: "Traffic Sign Detection Based on Anchor-free" (基于Anchor-free的交通标志检测), Journal of Geo-Information Science, no. 01 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487600A (en) * 2021-07-27 2021-10-08 大连海事大学 Characteristic enhancement scale self-adaptive sensing ship detection method
CN113487600B (en) * 2021-07-27 2024-05-03 大连海事大学 Feature enhancement scale self-adaptive perception ship detection method
CN113837305A (en) * 2021-09-29 2021-12-24 北京百度网讯科技有限公司 Target detection and model training method, device, equipment and storage medium
US11823437B2 (en) 2021-09-29 2023-11-21 Beijing Baidu Netcom Science Technology Co., Ltd. Target detection and model training method and apparatus, device and storage medium
CN114581798A (en) * 2022-02-18 2022-06-03 广州中科云图智能科技有限公司 Target detection method and device, flight equipment and computer readable storage medium
CN114565860A (en) * 2022-03-01 2022-05-31 安徽大学 Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN115690704A (en) * 2022-09-27 2023-02-03 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115578615A (en) * 2022-10-31 2023-01-06 成都信息工程大学 Night traffic sign image detection model establishing method based on deep learning
CN115985102A (en) * 2023-02-15 2023-04-18 湖南大学深圳研究院 Urban traffic flow prediction method and equipment based on migration contrast learning
CN116630909A (en) * 2023-06-16 2023-08-22 广东特视能智能科技有限公司 Unmanned intelligent monitoring system and method based on unmanned aerial vehicle
CN116630909B (en) * 2023-06-16 2024-02-02 广东特视能智能科技有限公司 Unmanned intelligent monitoring system and method based on unmanned aerial vehicle

Also Published As

Publication number Publication date
CN112686207B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN112686207B (en) Urban street scene target detection method based on regional information enhancement
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN113902915B (en) Semantic segmentation method and system based on low-light complex road scene
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN113158738A (en) Port environment target detection method, system, terminal and readable storage medium based on attention mechanism
CN114565860A (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN110490155B (en) Method for detecting unmanned aerial vehicle in no-fly airspace
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
CN112651423A (en) Intelligent vision system
CN114821018B (en) Infrared dim target detection method for constructing convolutional neural network by utilizing multidirectional characteristics
CN112883887B (en) Building instance automatic extraction method based on high spatial resolution optical remote sensing image
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN114220126A (en) Target detection system and acquisition method
CN112560865A (en) Semantic segmentation method for point cloud under outdoor large scene
CN116258940A (en) Small target detection method for multi-scale features and self-adaptive weights
CN117157679A (en) Perception network, training method of perception network, object recognition method and device
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN115527096A (en) Small target detection method based on improved YOLOv5
Pashaei et al. Fully convolutional neural network for land cover mapping in a coastal wetland with hyperspatial UAS imagery
CN116977866A (en) Lightweight landslide detection method
CN115731517A (en) Crowd detection method based on Crowd-RetinaNet network
CN113642676B (en) Regional power grid load prediction method and device based on heterogeneous meteorological data fusion
CN113763356A (en) Target detection method based on visible light and infrared image fusion
Zhu et al. Small target detection algorithm based on multi-target detection head and attention mechanism
Alshammari et al. Multi-task learning for automotive foggy scene understanding via domain adaptation to an illumination-invariant representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant