WO2022111352A1 - Target detection method and apparatus, storage medium, and terminal - Google Patents


Info

Publication number
WO2022111352A1
WO2022111352A1 · PCT/CN2021/131132 · CN2021131132W
Authority
WO
WIPO (PCT)
Prior art keywords
feature map; sample image; target; map; network model
Application number
PCT/CN2021/131132
Other languages
French (fr)
Chinese (zh)
Inventor
陈圣卫 (Chen Shengwei)
Original Assignee
展讯通信(上海)有限公司 (Spreadtrum Communications (Shanghai) Co., Ltd.)
Application filed by 展讯通信(上海)有限公司 (Spreadtrum Communications (Shanghai) Co., Ltd.)
Publication of WO2022111352A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Definitions

  • the present invention relates to the field of computer vision, and in particular, to a target detection method and device, a storage medium and a terminal.
  • Object detection is a challenging subject in the field of computer vision and is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection, aerospace, and autonomous driving. Driven by advances in related technologies and industry demand, the requirements on the efficiency and accuracy of target detection keep rising.
  • CNN (Convolutional Neural Network)
  • the technical problem solved by the present invention is to provide an efficient and accurate target detection method, so as to improve the detection accuracy of small-sized targets.
  • an embodiment of the present invention provides a target detection method.
  • The method includes: acquiring a sample image, where the sample image includes a preset target; extracting an initial feature map of the sample image, and performing semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate the target area and the background area of the sample image, the target area is an area containing the preset target, and the background area is an area not containing the preset target; training the detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is calculated from the sample image according to the initial feature map and is used to indicate the bounding box of the preset target; and using the trained detection network model to detect an image to be tested, so as to obtain a detection result of the preset target in the image to be tested.
  • each pixel of the initial feature map includes multiple channels
  • the initial feature map is obtained by down-sampling the sample image
  • performing semantic information enhancement processing on the initial feature map includes: Step 1: upsample the initial feature map by a factor of 2 to obtain a first feature map; Step 2: process the first feature map according to the channel attention mechanism to obtain a second feature map; Step 3: take the second feature map as the new initial feature map, and repeat Steps 1 to 3 until the total upsampling factor equals the downsampling factor; Step 4: perform a first convolution operation on the resulting feature map to obtain the first prediction map, where the number of channels of the first prediction map is 2.
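The four steps above can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the patented implementation: nearest-neighbor upsampling, a simple global-average/sigmoid channel weighting, and a random 1 × 1 projection stand in for the trained modules, and all function names are our own.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def channel_attention(x):
    """Weight each channel by a sigmoid of its global average (a simple
    squeeze-style attention; the patent does not fix the exact form)."""
    w = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2))))   # (C,) weights in (0, 1)
    return x * w[:, None, None]

def enhance(initial, down_factor, w_out):
    """Steps 1-3: upsample 2x + channel attention until the upsampling
    factor equals down_factor; Step 4: 1x1 convolution to 2 channels."""
    x, up = initial, 1
    while up < down_factor:
        x = channel_attention(upsample2x(x))
        up *= 2
    # a 1x1 convolution is a per-pixel linear map over channels
    return np.einsum('oc,chw->ohw', w_out, x)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))            # C=8, downsampled 4x
pred = enhance(feat, down_factor=4, w_out=rng.standard_normal((2, 8)))
print(pred.shape)                                  # (2, 64, 64)
```

With a 4× downsampled input, two upsampling passes restore the original resolution and the output has the 2 channels the first prediction map requires.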
  • Optionally, before processing the first feature map according to the channel attention mechanism, the method further includes: performing multiple second convolution operations on the first feature map, each second convolution operation using a 1 × 1 convolution kernel; and adding the result of the multiple second convolution operations to the first feature map to obtain a new first feature map.
  • Optionally, before processing the first feature map according to the channel attention mechanism, the method further includes: performing multiple third convolution operations on the first feature map, a 3 × 3 convolution kernel being used in at least one of them; and adding the result of the multiple third convolution operations to the first feature map to obtain a new first feature map.
  • Optionally, performing multiple second convolution operations on the first feature map includes: in each second convolution operation other than the first one, first applying batch normalization to the input and then applying the ReLU activation function.
  • Optionally, training the detection network model according to the first prediction map includes: Step A: calculate a first loss function value from the first prediction map, the sample image, and the first loss function, and calculate a second loss function value from the second prediction map, the sample image, and the second loss function; Step B: calculate the loss function value of the detection network model from the first and second loss function values; Step C: determine whether the loss function value exceeds a preset threshold; if so, adjust the parameters of the module for extracting the initial feature map in the detection network model, and if not, end the training of the detection network model; Step D: use the detection network model with adjusted parameters to extract the initial feature map again, and return to Step A.
  • Optionally, the first loss function is a Focal Loss function.
  • Optionally, the weight coefficient of the first loss function is determined by the ratio of preset targets to all targets in the sample image: the larger this ratio, the larger the weight coefficient of the first loss function.
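As a concrete illustration, the widely published Focal Loss formula and one plausible way to combine the two loss values (Steps A and B) are sketched below. Using the small-target ratio directly as a linear weight is our assumption; the patent only states that the first loss's weight grows with that ratio.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal Loss (Step A, first loss): down-weights easy examples so
    training focuses on hard, often small, targets."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class-balancing factor
    return float(np.mean(-at * (1 - pt) ** gamma * np.log(pt)))

def total_loss(l1, l2, small_target_ratio):
    """Step B: combine the two loss values. The linear weighting by the
    small-target ratio is our assumption."""
    w = small_target_ratio
    return w * l1 + (1 - w) * l2

l1 = focal_loss(np.array([1.0, 0.0]), np.array([1, 0]))  # perfect predictions
print(l1 < 1e-5, total_loss(1.0, 2.0, 0.25))             # True 1.75
```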
  • Optionally, before extracting the initial feature map of the sample image, the method further includes performing data enhancement on the sample image, where the data enhancement includes one or more of the following: adjusting the brightness and/or contrast of the sample image, rotating the sample image by a preset angle, and adding noise to the sample image.
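A minimal sketch of the three listed augmentations, assuming 8-bit images, 90-degree rotation steps, and Gaussian noise (the patent leaves the angle and the noise type open):

```python
import numpy as np

def augment(img, rng, brightness=0.2, contrast=0.2, angle_steps=1, noise_std=5.0):
    """Apply the three augmentations the embodiment lists: brightness and
    contrast jitter, rotation, and additive noise."""
    out = img.astype(np.float64)
    out = out * (1 + rng.uniform(-contrast, contrast))        # contrast
    out = out + rng.uniform(-brightness, brightness) * 255    # brightness
    out = np.rot90(out, k=angle_steps, axes=(0, 1))           # rotation
    out = out + rng.normal(0, noise_std, out.shape)           # noise
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (32, 48, 3), dtype=np.uint8)
aug = augment(img, rng)
print(aug.shape)   # rotated 90 degrees: (48, 32, 3)
```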
  • the detection network model is a single-step detection network model.
  • Correspondingly, an embodiment of the present invention further provides a target detection device, the device including: an acquisition module for acquiring a sample image, the sample image including a preset target; and a processing module for extracting the initial feature map of the sample image and performing semantic information enhancement processing on the initial feature map to obtain the first prediction map of the sample image, the first prediction map being used to indicate the target area and background area of the sample image.
  • the target area is an area that includes the preset target
  • the background area is an area that does not include the preset target;
  • a training module for training the detection network model according to the first prediction map and the second prediction map to obtain a trained detection network model, where the second prediction map is calculated from the sample image according to the initial feature map and is used to indicate the bounding box of the preset target;
  • a detection module for detecting the image to be tested with the trained detection network model, so as to obtain the detection result of the preset target in the image to be tested.
  • An embodiment of the present invention further provides a storage medium on which a computer program is stored; when run by a processor, the computer program executes the steps of the above target detection method.
  • An embodiment of the present invention further provides a terminal, including a memory and a processor, the memory storing a computer program that can run on the processor; when running the computer program, the processor executes the steps of the above target detection method.
  • An embodiment of the present invention provides a target detection method, the method including: acquiring a sample image, where the sample image includes a preset target; extracting an initial feature map of the sample image, and performing semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate the target area and the background area of the sample image, the target area is an area containing the preset target, and the background area is an area not containing the preset target; training the detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is calculated from the sample image according to the initial feature map and is used to indicate the bounding box of the preset target; and using the trained detection network model to detect the image to be tested, so as to obtain the detection result of the preset target in the image to be tested.
  • In this solution, the initial feature map of the sample image is first extracted, and then semantic information enhancement processing is performed on the initial feature map to obtain the first prediction map, which indicates the target area and the background area in the sample image.
  • The detection network model is then trained according to the first prediction map and the second prediction map, which indicates the bounding box of the preset target, so as to obtain the trained detection network model. Since the first prediction map can indicate the target area and the background area in the sample image, it can contain more semantic information about the preset target in the sample image.
  • Training the detection network model with the first prediction map lets the model learn the semantic information of the preset target well, so that when the image to be tested is subsequently detected, the detection result of the preset target can be obtained more accurately and with higher efficiency.
  • Further, the initial feature map is up-sampled and processed according to the channel attention mechanism until the total up-sampling factor equals the down-sampling factor used when extracting the initial feature map.
  • In this way, the first prediction map has the same size as the sample image, so that the loss function can subsequently be computed from the first prediction map and the sample image during training.
  • Each pixel in the initial feature map contains multiple channels. The channel attention mechanism is used to enhance semantic information: channels highly correlated with the preset target are strengthened and channels weakly correlated with it are weakened. Finally, a convolution operation is performed on the feature map and the number of channels of the first prediction map is set to 2, so that the first prediction map can intuitively reflect the semantic information of the preset target in a two-class manner (i.e., indicating the target area and the background area).
  • Further, the weight coefficient of the first loss function is determined by the ratio of preset targets to all targets in the sample image; the larger this ratio, the larger the weight coefficient of the first loss function. That is, the first prediction map obtained by the semantic information enhancement processing plays a greater role in training the detection network model, so that the detection network model performs well in detecting the preset target.
  • FIG. 1 is a schematic flowchart of a target detection method in an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a detection network model applicable to a target detection method in an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of the semantic information enhancement module in FIG. 2 .
  • FIG. 4 is a schematic structural diagram of the first residual module in FIG. 3 .
  • FIG. 5 is a schematic diagram of a first prediction map in an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a target detection apparatus in an embodiment of the present invention.
  • In the prior art, the convolutional neural networks used for small-sized target detection mainly include Feature Pyramid Networks (FPN), Generative Adversarial Networks (GAN), and Scale Normalization for Image Pyramids (SNIP).
  • FPN obtains more information of small-sized objects in the image by fusing features of different scales
  • GAN improves the detection accuracy by restoring the image information of small objects
  • SNIP uses multi-scale training and back-propagates gradients only for targets whose scale matches the pre-training scale, thereby improving detection accuracy.
  • To obtain more semantic information about small-sized targets during training, the usual approach is to deepen the network, that is, to increase the number of convolutional layers.
  • This approach requires greatly increasing the number of convolutional layers, which results in a complex, very deep network structure; subsequent detection of small-sized targets then takes a long time, so the practical performance of convolutional neural networks in small-sized target detection applications is not high.
  • an embodiment of the present invention provides a target detection method.
  • The method includes: acquiring a sample image, where the sample image includes a preset target; extracting an initial feature map of the sample image, and performing semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate the target area and the background area of the sample image, the target area is an area containing the preset target, and the background area is an area not containing the preset target; training the detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is calculated from the sample image according to the initial feature map and is used to indicate the bounding box of the preset target; and using the trained detection network model to detect an image to be tested, so as to obtain a detection result of the preset target in the image to be tested.
  • In this solution, the initial feature map of the sample image is first extracted, and then semantic information enhancement processing is performed on the initial feature map to obtain the first prediction map, which indicates the target area and the background area in the sample image.
  • The detection network model is then trained according to the first prediction map and the second prediction map, which indicates the bounding box of the preset target, so as to obtain the trained detection network model. Since the first prediction map can indicate the target area and the background area in the sample image, it can contain more semantic information about the preset target in the sample image.
  • the detection network model learns the semantic information of the preset target well, and can quickly and accurately obtain the detection result of the preset target when the image to be tested is subsequently detected.
  • FIG. 1 is a schematic flowchart of a target detection method in an embodiment of the present invention.
  • the target detection method may be performed by a terminal, and the terminal may be various appropriate terminals, such as a mobile phone, a computer, an Internet of Things device, etc., but is not limited thereto.
  • the method can be used to detect whether the image to be tested contains a preset target, and can also be used to detect the specific position and category of the preset target in the image to be tested, but it is not limited thereto.
  • the image to be tested may be an image collected in real time by the terminal, an image pre-stored on the terminal, or an image received from the outside by the terminal, but is not limited thereto.
  • the preset target may be determined by the terminal according to an instruction received from the outside in advance, or may be determined by the terminal recognizing the sample image through various appropriate models.
  • the target detection method shown in FIG. 1 may specifically include the following steps:
  • Step S101: acquire a sample image, where the sample image includes a preset target;
  • Step S102: extract the initial feature map of the sample image, and perform semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate the target area and the background area of the sample image, the target area is an area that includes the preset target, and the background area is an area that does not include the preset target;
  • Step S103: train the detection network model according to the first prediction map and the second prediction map to obtain a trained detection network model, where the second prediction map is calculated from the sample image according to the initial feature map and is used to indicate the bounding box of the preset target;
  • Step S104 Use the trained detection network model to detect the image to be tested, so as to obtain a detection result of the preset target in the image to be tested.
  • the terminal may acquire sample images from outside, or may select at least a part of the training set stored locally as sample images, and the sample images may include preset targets.
  • the preset target refers to a specific target object, for example, a traffic sign, a license plate, a human face, etc.
  • The preset target may be determined by the terminal according to an instruction received from the outside in advance, or may be determined by the terminal recognizing the sample image through various appropriate models.
  • The size of the preset target may not exceed a preset size, or may not exceed a preset proportion of the size of the sample image, but it is not limited thereto.
  • the preset size and preset ratio may be preset.
  • the sample image may include an identification graphic, and the identification graphic is used to indicate the position of the preset target in the sample image, and may also be used to indicate the category of the preset target in the sample image.
  • identification graphics of different shapes may be used to represent different types of preset targets.
  • step S102 before training the detection network model, the detection network model needs to be constructed first, and the detection network model may have various appropriate structures.
  • FIG. 2 is a schematic structural diagram of a detection network model applicable to a target detection method in an embodiment of the present invention.
  • a detection network model applicable to a target detection method in an embodiment of the present invention is described below in a non-limiting manner with reference to FIG. 2 .
  • the detection network model shown in FIG. 2 may include a feature extraction module 21 , a prediction module 22 , and a semantic information enhancement module 23 .
  • The detection network model may be a single-step detection network model (a network model in which the image to be tested only needs to be fed into the network once, and the detection result of the preset target is obtained directly in a single stage without a candidate-region proposal stage), or a two-step detection network model (a network model that first generates multiple candidate regions from the input image to be tested and then classifies those candidate regions), or any other appropriate network model, which is not limited here.
  • the detection network model is a single-step detection network model.
  • the feature extraction module 21 can be used to extract features in the sample image to obtain an initial feature map of the sample image.
  • the initial feature map may include position information of the preset target in the sample image, and may also include semantic information of the preset target.
  • The feature extraction module 21 can obtain an initial feature map by down-sampling the sample image multiple times. For example, the feature extraction module 21 downsamples the sample image by a factor of 2^n to obtain the initial feature map, where n is a positive integer.
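For illustration, 2^n downsampling can be mimicked with repeated 2 × 2 average pooling; a real feature extraction module would use strided or pooled convolutional layers instead.

```python
import numpy as np

def downsample_2n(img, n):
    """Downsample a (H, W) image by 2**n using repeated 2x2 average
    pooling (a stand-in for the strided convolutions of module 21)."""
    x = img.astype(np.float64)
    for _ in range(n):
        h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2   # even crop
        x = x[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return x

img = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
feat = downsample_2n(img, n=2)   # 2**2 = 4x downsampling
print(feat.shape)                # (16, 16)
```

Average pooling preserves the global mean, which makes the sketch easy to sanity-check.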
  • the initial feature map extracted by the feature extraction module 21 may be transmitted to the prediction module 22, and the prediction module 22 may perform calculation according to the initial feature map to obtain the second prediction map. Since the initial feature map contains the position information and semantic information of the preset target in the sample image, the second prediction map can be used to indicate the bounding box of the preset target in the sample image. Specifically, the prediction module 22 may calculate, but not limited to, the key point position, offset, and size of the preset target in the sample image according to the initial feature map.
  • the initial feature map extracted by the feature extraction module 21 can also be input to the semantic information enhancement module 23, and the semantic information enhancement module 23 performs semantic information enhancement processing on the initial feature map to obtain the first prediction map.
  • The semantic information enhancement module 23 includes at least one sub-module and a two-class prediction module, where the sub-module is used for up-sampling the initial feature map and extracting the semantic information of the preset target in it, and the two-class prediction module is used to indicate the target area and the background area in the first prediction map by binary classification, where the target area is an area containing the target and the background area is an area not containing the target, so as to enhance the semantic information of the first prediction map.
  • the number of the sub-modules also needs to be determined, and the number of the sub-modules is determined by the above-mentioned downsampling multiple.
  • When the feature extraction module 21 extracts the initial feature map of the sample image, it downsamples the sample image by a factor of 2^n, so the number of sub-modules is n, where n is a positive integer.
  • FIG. 3 shows a schematic structural diagram of the semantic information enhancement module 23 in FIG. 2 .
  • a non-limiting description of the semantic enhancement module in FIG. 2 is given below in conjunction with FIG. 3 .
  • FIG. 3 shows a schematic structural diagram of the semantic information enhancement module when the feature extraction module 21 performs 4 times downsampling on the sample image, which includes a first sub-module 31, a second sub-module 32 and a two-class prediction module 33.
  • both the first sub-module 31 and the second sub-module 32 include an upsampling module 34 , a first residual module 35 , a second residual module 36 , and a channel attention module 37 .
  • the upsampling module 34 can be used to upsample the initial feature map, so that the size of the first prediction map obtained by the semantic information enhancement module is consistent with the size of the sample image, so as to be used for training the loss function subsequently.
  • the first residual module 35 can be used to extract more feature information of the preset target in the images output by the upsampling module 34, and can also avoid gradient disappearance while extracting the feature information.
  • The first residual module 35 may include multiple convolutional layers. If each convolutional layer in the first residual module 35 adopts a 1 × 1 convolution kernel, the first residual module 35 can better extract the features of each pixel itself.
  • FIG. 4 shows a schematic structural diagram of the first residual module 35 in FIG. 3 .
  • a non-limiting description of the first residual module in FIG. 3 is given below in conjunction with FIG. 4 .
  • the first residual module 35 includes a first convolutional layer 41, a second convolutional layer 42 and a third convolutional layer 43.
  • the image output by the upsampling module 34 can be input to the first convolutional layer 41.
  • the first convolutional layer 41 can use a 1 ⁇ 1 convolution kernel.
  • the image output by the first convolution layer 41 can be input to the second convolution layer 42, and the second convolution layer 42 can use a 1 ⁇ 1 convolution kernel.
  • the second convolutional layer 42 may be a grouped convolutional layer.
  • the image output by the second convolution layer 42 can be input to the third convolution layer 43, and the third convolution layer 43 can use a 1 ⁇ 1 convolution kernel.
  • the output of the third convolution layer 43 can be added to the output of the first convolution layer 41 to obtain the output of the first residual module 35, which can avoid the problem of gradient disappearance in the detection network model.
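A toy version of this three-layer residual block, with each 1 × 1 convolution written as a per-pixel channel projection. Omitting the grouping of the middle layer and placing a ReLU between layers are our simplifications, not details fixed by the patent.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) map: a linear map over channels
    applied independently at each pixel."""
    return np.einsum('oc,chw->ohw', w, x)

def first_residual(x, w1, w2, w3):
    """Figure-4 style block: three 1x1 convolutions, with the third
    layer's output added to the first layer's output (the skip that
    counters gradient vanishing)."""
    y1 = conv1x1(x, w1)
    y2 = np.maximum(conv1x1(y1, w2), 0)   # ReLU between layers (assumed)
    y3 = conv1x1(y2, w3)
    return y1 + y3                        # residual addition

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
out = first_residual(x, *ws)
print(out.shape)   # (8, 16, 16)
```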
  • The second residual module 36 can also be used to extract more feature information of the preset target from the images output by the upsampling module 34, while likewise avoiding gradient vanishing.
  • The second residual module 36 may include multiple convolution layers, at least one of which uses a 3 × 3 convolution kernel, so the second residual module 36 can better extract the receptive-field features corresponding to each pixel.
  • the second residual module 36 may include a fourth convolutional layer, a fifth convolutional layer, and a sixth convolutional layer.
  • the image output by the upsampling module 34 can be input to the fourth convolution layer, and the fourth convolution layer can use a 1 ⁇ 1 convolution kernel.
  • the image output by the fourth convolutional layer can be input to the fifth convolutional layer, and the fifth convolutional layer can use a 3 ⁇ 3 convolution kernel.
  • the fifth convolutional layer may be a grouped convolutional layer.
  • the image output by the fifth convolutional layer can be input to the sixth convolutional layer, and the sixth convolutional layer can use a 1 ⁇ 1 convolution kernel.
  • the output of the sixth convolutional layer can be added with the output of the fourth convolutional layer to obtain the output of the second residual module 36, which can avoid the problem of gradient disappearance in the detection network model.
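To see why the 3 × 3 kernel matters, the sketch below implements the extreme grouped case of the fifth layer (one 3 × 3 kernel per channel, i.e., depthwise convolution); each output pixel now aggregates its 3 × 3 neighborhood, which is how the block widens the per-pixel receptive field.

```python
import numpy as np

def depthwise_conv3x3(x, k):
    """3x3 convolution with groups == channels (one (3, 3) kernel per
    channel of the (C, H, W) input), zero padding, stride 1."""
    c, h, w = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += pad[:, dy:dy + h, dx:dx + w] * k[:, dy, dx][:, None, None]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
k = np.zeros((4, 3, 3))
k[:, 1, 1] = 1.0                  # identity kernel: only the center tap
y = depthwise_conv3x3(x, k)
print(np.allclose(y, x))          # True: identity kernel reproduces input
```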
  • The channel attention module 37 can be used to process the image output by the upsampling module 34 according to the channel attention mechanism. Specifically, for each pixel in the image, an activation function can be used to determine the weight value of each channel: the more relevant a channel is to the preset target, the greater its weight value. The image output by the upsampling module 34 is then weighted by these channel weights, so that the image output by the channel attention module 37 can well represent the semantic information of the preset target with stronger directivity.
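One plausible reading of module 37 is squeeze-and-excitation style attention. The two-layer transform below is our assumption; the patent only requires an activation function that yields per-channel weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_attention(x, w1, w2):
    """Squeeze-and-excitation style channel attention: global average
    pool per channel, a small two-layer transform, sigmoid weights in
    (0, 1), then per-channel rescaling of the (C, H, W) input."""
    s = x.mean(axis=(1, 2))                  # squeeze: (C,)
    w = sigmoid(w2 @ np.maximum(w1 @ s, 0))  # excite: (C,) weights
    return x * w[:, None, None]              # rescale each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((4, 8))   # bottleneck to 4 (assumed ratio)
w2 = rng.standard_normal((8, 4))
y = se_attention(x, w1, w2)
print(y.shape)   # (8, 16, 16)
```

Because the weights lie in (0, 1), attention can only attenuate channels here; the "strengthening" the text describes is relative, i.e., relevant channels are attenuated less.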
  • The first residual module 35 and the second residual module 36 are optional. That is, the first sub-module 31 or the second sub-module 32 may include only the upsampling module 34 and the channel attention module 37, or may include the upsampling module 34 and the channel attention module 37 together with the first residual module 35 and/or the second residual module 36, but it is not limited thereto.
  • The two-class prediction module 33 can be used to adjust the number of channels of the image to 2. The two-class prediction module may include a seventh convolution layer whose number of filters is 2, so that the first prediction map output by the two-class prediction module can be used to indicate a target area and a background area, where the target area is an area containing the preset target and the background area is an area not containing the preset target.
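A 1 × 1 seventh convolution layer with 2 filters reduces to a per-pixel linear map to two scores; taking the argmax over the two channels then yields the binary target/background mask. The sketch below uses random weights purely for shape illustration.

```python
import numpy as np

def two_class_prediction(x, w):
    """Seventh convolution layer as a 1x1 convolution with 2 filters,
    giving per-pixel target/background scores; argmax over the two
    channels yields the binary mask the first prediction map encodes."""
    scores = np.einsum('oc,chw->ohw', w, x)    # (2, H, W) scores
    return scores, scores.argmax(axis=0)       # mask: 1 = target area

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64, 64))
scores, mask = two_class_prediction(x, rng.standard_normal((2, 8)))
print(scores.shape, mask.shape)   # (2, 64, 64) (64, 64)
```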
  • the terminal can input the sample image to the detection network model, and the detection network model can extract the initial feature map of the sample image, for example, the sample image can be down-sampled to extract the initial feature map .
  • each pixel of the initial feature map includes multiple channels.
  • the initial feature map may include position information of the preset target in the sample image, and may also include semantic information of the preset target.
  • the downsampling multiple is determined by the structure of the module for extracting the initial feature map, and the downsampling multiple for the sample image is the same as the downsampling multiple for the image to be tested.
  • the initial feature map can be extracted by using the feature extraction module 21 in FIG. 2 .
  • Before extracting the initial feature map, data enhancement may also be performed on the sample image; the data enhancement includes one or more of the following: adjusting the brightness and/or contrast of the sample image, rotating the sample image by a preset angle, and adding noise to the sample image, but it is not limited thereto.
  • the terminal can calculate the key point position, offset, and preset target size of the preset target in the sample image according to the initial feature map of the sample image, so as to obtain the second prediction map of the sample image, and the second prediction map can be used to indicate the bounding box of the preset target in the sample image.
  • semantic information enhancement processing may be performed on the initial feature map to obtain the first prediction map of the sample image.
  • the initial feature map can be transmitted to the semantic information enhancement module 23 in FIG. 2 for semantic information enhancement processing.
  • the initial feature map is obtained by downsampling the sample image
  • performing semantic information enhancement processing on the initial feature map may include: Step 1: upsample the initial feature map by a factor of 2 to obtain the first feature map; Step 2: process the first feature map according to the channel attention mechanism to obtain the second feature map; Step 3: use the second feature map as the new initial feature map; repeat Steps 1 to 3 until the cumulative upsampling multiple is equal to the downsampling multiple; Step 4: perform a first convolution operation on the initial feature map to obtain the first prediction map, wherein the number of channels of the first prediction map is 2.
  • the channel attention mechanism is used for processing, until the up-sampling factor of the initial feature map is the same as the down-sampling factor of the sample image.
  • the number of upsampling operations, or of processing passes through the channel attention mechanism, is determined by the downsampling multiple: if the downsampling multiple is 2^n, the number of upsampling operations (and of channel attention passes) is n, where n is a positive integer.
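The repeat-until-factors-match loop of Steps 1 to 3 can be sketched as follows. This is a minimal illustration assuming nearest-neighbor upsampling; the patent does not specify the interpolation method, and the channel attention step is only marked by a comment.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def restore_resolution(initial_feat, downsample_factor):
    """Repeat 2x upsampling until the cumulative upsampling factor
    equals the downsampling factor 2**n used during extraction."""
    feat = initial_feat
    upsampled = 1
    while upsampled < downsample_factor:
        feat = upsample2x(feat)   # Step 1
        # Step 2: channel attention would be applied to `feat` here
        upsampled *= 2            # Step 3: treat result as the new initial map
    return feat

fmap = np.zeros((28, 28, 64))            # e.g. a 224x224 image downsampled by 8
restored = restore_resolution(fmap, 8)
print(restored.shape)                    # (224, 224, 64)
```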
  • an activation function can be used to determine the weight value of each channel of a pixel: the more relevant a channel is to the preset target, the larger its weight value. Weighted calculation is then performed on the first feature map based on the weight value of each channel to obtain the second feature map, so the second feature map can represent the semantic information of the preset target more clearly and distinctly.
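The per-channel weighting described above can be sketched as a squeeze-and-excitation-style channel attention. The patent only specifies that an activation function produces a weight per channel, so the global-average-pooling "squeeze", the learned projection `w`, and the sigmoid below are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w):
    """feat: (H, W, C) first feature map; w: (C, C) learned projection.
    Produces the second feature map by reweighting each channel."""
    squeeze = feat.mean(axis=(0, 1))     # per-channel descriptor, shape (C,)
    weights = sigmoid(w @ squeeze)       # activation -> weight in (0, 1) per channel
    return feat * weights                # broadcast: scale each channel

feat = np.random.rand(8, 8, 4)
out = channel_attention(feat, np.eye(4))
print(out.shape)                         # (8, 8, 4)
```

Since each weight lies in (0, 1), channels weakly related to the target are attenuated while strongly related channels are (relatively) preserved.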
  • a first convolution operation may be performed on the initial feature map; for example, the first convolution operation may be performed using the two-class prediction module 33 in FIG. 3.
  • in the first convolution operation the number of filters used is 2, and the resulting first prediction map has 2 channels. Therefore, the first prediction map can indicate the target area and the background area, where the target area refers to the area of the sample image that includes the preset target, and the background area refers to the area of the sample image that does not include the preset target.
  • FIG. 5 is a schematic diagram of a first prediction map according to an embodiment of the present invention, in which the target area 51 and the target area 52 are areas including the preset target, and the background area 53 does not include the preset target.
  • the size of the initial feature map can be restored to the size of the sample image, and at the same time the important channels in each pixel, that is, those with larger correlation with the preset target, can be screened out. The first prediction map is then obtained through the first convolution operation, so that it can intuitively reflect the target area and the background area of the sample image in a two-class manner and thus contain rich semantic information.
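A 2-filter convolution followed by a per-pixel argmax yields a binary target/background map like the one in FIG. 5. In this sketch the feature map and filter values are invented purely for illustration:

```python
import numpy as np

def two_class_prediction(feat, w):
    """feat: (H, W, C); w: (2, C) filters of the first convolution.
    Returns an (H, W) map with 1 for the target area, 0 for background."""
    logits = np.einsum('hwc,oc->hwo', feat, w)   # first prediction map, 2 channels
    return logits.argmax(axis=-1)

feat = np.zeros((4, 4, 3))
feat[1:3, 1:3, 0] = 5.0                  # a blob in a channel correlated with the target
w = np.array([[0., 1., 1.],              # hypothetical background filter
              [1., 0., 0.]])             # hypothetical target filter
mask = two_class_prediction(feat, w)
print(mask.sum())                        # 4 target pixels
```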
  • multiple second convolution operations may be performed on the first feature map, wherein each second convolution operation may use a 1×1 convolution kernel; the final result of the multiple second convolution operations is added to the first feature map to obtain a new first feature map.
  • batch normalization is performed on the first feature map first, and then the relu activation function is applied, so as to further optimize the detection network model.
  • the number of second convolution operations is three, and the second of these may be calculated in a grouped-convolution manner.
  • the first residual module 35 shown in FIG. 4 may be used to perform multiple second convolution operations.
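A 1×1 convolution is just a per-pixel linear map over channels, so the first residual module can be sketched as three such maps whose result is added back to the input. This is a simplified sketch: batch normalization and the grouped second convolution are omitted, and the filter counts are arbitrary.

```python
import numpy as np

def conv1x1(feat, w):
    """1x1 convolution: (H, W, Cin) x (Cout, Cin) -> (H, W, Cout)."""
    return np.einsum('hwc,oc->hwo', feat, w)

def relu(x):
    return np.maximum(x, 0.0)

def residual_1x1_block(feat, w1, w2, w3):
    """Three successive 1x1 convolutions whose final result is added
    back to the input to obtain the new first feature map."""
    branch = conv1x1(relu(conv1x1(relu(conv1x1(feat, w1)), w2)), w3)
    return feat + branch

c = 4
feat = np.random.rand(6, 6, c)
zero = np.zeros((c, c))
# with all-zero filters the branch vanishes and the block is the identity,
# which makes the residual connection easy to verify
out = residual_1x1_block(feat, zero, zero, zero)
print(np.allclose(out, feat))            # True
```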
  • multiple third convolution operations may be performed on the first feature map, at least one of which uses a 3×3 convolution kernel; the final result of the multiple third convolution operations is added to the first feature map to obtain a new first feature map.
  • third convolution operations other than the first can use the relu activation function and perform batch normalization processing, so as to further optimize the detection network model.
  • the number of third convolution operations is three, wherein the second third convolution operation adopts a 3×3 convolution kernel, and the third convolution operations other than the second use a 1×1 convolution kernel.
  • the second third convolution operation is calculated using grouped convolution, and the other third convolution operations are calculated using ordinary convolution.
  • the second residual module 36 shown in FIG. 3 may be used to perform multiple third convolution operations.
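The grouped calculation mentioned for the second (3×3) third convolution splits the channels into groups and convolves each group independently, cutting the multiply count by the number of groups. The sketch below is an illustrative reimplementation; the group count, channel sizes, and the identity-kernel check are arbitrary choices, not values from the patent.

```python
import numpy as np

def grouped_conv3x3(feat, kernels):
    """Grouped 3x3 convolution, stride 1, zero 'same' padding.
    feat: (H, W, C); kernels: (G, Cg, Cg, 3, 3) with C = G * Cg.
    Each group of Cg channels is convolved independently, which cuts
    the multiplications by a factor of G versus ordinary convolution."""
    h, w, c = feat.shape
    g, cg = kernels.shape[0], kernels.shape[1]
    assert c == g * cg, "channel count must split evenly into groups"
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(feat)
    for gi in range(g):
        sl = slice(gi * cg, (gi + 1) * cg)
        for i in range(h):
            for j in range(w):
                patch = padded[i:i + 3, j:j + 3, sl].transpose(2, 0, 1)  # (Cg, 3, 3)
                out[i, j, sl] = np.einsum('oikl,ikl->o', kernels[gi], patch)
    return out

# Identity kernels (a centered delta per channel) leave the input unchanged,
# a quick sanity check of the indexing.
g, cg = 2, 2
kernels = np.zeros((g, cg, cg, 3, 3))
for gi in range(g):
    for o in range(cg):
        kernels[gi, o, o, 1, 1] = 1.0
feat = np.random.rand(5, 5, 4)
out = grouped_conv3x3(feat, kernels)
print(np.allclose(out, feat))            # True
```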
  • the detection network model is trained according to the first prediction map and the second prediction map to obtain a trained detection network model.
  • the loss function of the detection network model consists of a first loss function and a second loss function.
  • the loss function of the detection network model can be expressed by the following formula: Loss = λsemantic·Lsemantic + λmodel·Lmodel, where Loss is the loss function of the detection network model, Lsemantic is the first loss function, Lmodel is the second loss function, λsemantic is the weight coefficient of the first loss function, and λmodel is the weight coefficient of the second loss function.
  • training the detection network model may include the following steps: Step A: calculate a first loss function value according to the first prediction map, the sample image, and the first loss function, and calculate a second loss function value according to the second prediction map, the sample image, and the second loss function; Step B: calculate the loss function value of the detection network model according to the first loss function value and the second loss function value; Step C: determine whether the loss function value exceeds a preset threshold; if so, adjust the parameters of the module used to extract the initial feature map in the detection network model; if not, end the training of the detection network model; Step D: use the detection network model with adjusted parameters to extract the initial feature map of the sample image, perform semantic information enhancement processing on the initial feature map to obtain the first prediction map, and perform prediction on the sample image according to the initial feature map to obtain the second prediction map;
  • Steps A to E are repeatedly performed until it is determined in Step C that the loss function value does not exceed the preset threshold, that is, until Step C ends the training of the detection network model.
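The training steps above can be summarized as a schematic Python skeleton. The model methods (`extract_features`, `enhance`, `predict_boxes`, the two loss methods, and `update_feature_extractor`) are hypothetical stand-ins, not the patent's actual implementation; the dummy model exists only so the loop can run.

```python
def train_detection_model(model, sample_image, threshold,
                          lam_semantic, lam_model, max_iters=100):
    """Repeat Steps A-D until the combined loss no longer exceeds the threshold."""
    for _ in range(max_iters):
        feat = model.extract_features(sample_image)   # initial feature map
        pred1 = model.enhance(feat)                   # first prediction map
        pred2 = model.predict_boxes(feat)             # second prediction map
        # Step A: the two loss terms
        l_sem = model.semantic_loss(pred1, sample_image)
        l_mod = model.box_loss(pred2, sample_image)
        # Step B: Loss = lam_semantic * L_semantic + lam_model * L_model
        loss = lam_semantic * l_sem + lam_model * l_mod
        # Step C: stop once the loss no longer exceeds the preset threshold
        if loss <= threshold:
            break
        # Step D: adjust the parameters of the feature-extraction module
        model.update_feature_extractor(loss)
    return model

class _DummyModel:
    """Stand-in whose loss halves on every parameter update."""
    def __init__(self):
        self.loss_scale = 1.0
    def extract_features(self, img): return img
    def enhance(self, feat): return feat
    def predict_boxes(self, feat): return feat
    def semantic_loss(self, pred, img): return self.loss_scale
    def box_loss(self, pred, img): return self.loss_scale
    def update_feature_extractor(self, loss): self.loss_scale *= 0.5

m = train_detection_model(_DummyModel(), None, threshold=0.1,
                          lam_semantic=0.5, lam_model=0.5)
print(m.loss_scale)                                   # 0.0625
```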
  • the preset threshold value may be received by the terminal from the outside, or may be determined by the terminal through calculation. It can be seen that, in the process of training the detection network model, by repeatedly adjusting the parameters of the module used to extract the initial feature map (for example, the feature extraction module 21 in FIG. 2), the detection network model can learn sufficient semantic information of the preset target.
  • since the first prediction map indicates the target area and the background area by means of binary classification and thus contains rich semantic information, when calculating the first loss function value according to the first prediction map and the sample image, there is no need to deepen the detection network model in order to make it learn more semantic information.
  • the first loss function may be the FocalLoss function, that is, Lsemantic = -(1 - pt)^γ·log(pt), where pt equals p for the target area and 1 - p otherwise, p is the predicted probability of the detection network model for the preset target, and γ is the focusing parameter.
  • the first loss function is the FocalLoss function
  • the problem of unbalanced positive and negative samples in the sample image can be solved.
  • when the preset target is a small-sized target, since the number of small-sized targets in the sample image is generally small, using the FocalLoss function as the first loss function can avoid the problem of insufficient training due to the small number of small-sized targets.
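A minimal sketch of the focal loss for a single prediction, using the standard form from the focal loss literature, FL(pt) = -(1 - pt)^γ·log(pt); the patent does not state its γ or any α balancing factor, so γ = 2 below is just the common default.

```python
import math

def focal_loss(p, target, gamma=2.0):
    """Focal loss for one binary prediction.
    p: predicted probability of the preset target; target: 1 or 0.
    The (1 - p_t)**gamma factor down-weights easy, well-classified
    examples so abundant background pixels do not dominate training."""
    p_t = p if target == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cross_entropy(p, target):
    p_t = p if target == 1 else 1.0 - p
    return -math.log(p_t)

# An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.1)
easy, hard = focal_loss(0.9, 1), focal_loss(0.1, 1)
print(easy < cross_entropy(0.9, 1) and hard < cross_entropy(0.1, 1))  # True
print(hard / easy > cross_entropy(0.1, 1) / cross_entropy(0.9, 1))    # True
```

The second check shows why this helps imbalanced data: the ratio of hard-example loss to easy-example loss is much larger under focal loss than under plain cross-entropy.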
  • the weight coefficient of the first loss function is determined by the proportion of the preset target among all the targets in the sample image: the larger the proportion of the preset target, the larger the weight coefficient of the first loss function.
  • the terminal may first obtain the image to be tested, which may be an image collected by the terminal in real time, an image received from the outside in real time, or an image stored locally in advance, but is not limited thereto.
  • the trained detection network model is used to detect the image to be tested. If it is detected that the image to be tested contains the preset target, the position and range of the preset target can be output.
  • a bounding box (Bounding Box) can be used to mark the position and range of the preset target.
  • when performing multi-target detection, that is, when the preset target has multiple categories, the terminal can also identify the category information of the preset target at the same time.
  • the extracted feature map (which can be extracted, for example, by the feature extraction module 21 in FIG. 2) can contain rich semantic information, so that the detection result of the preset target can be obtained by performing calculation according to the feature map of the image to be tested (for example, using the prediction module 22 in FIG. 2).
  • the feature map of the image to be tested is not subjected to semantic information enhancement processing; that is, the prediction map used to indicate the bounding box of the preset target is calculated directly according to the feature map.
  • FIG. 6 is a schematic structural diagram of a target detection apparatus in an embodiment of the present invention.
  • the target detection apparatus in an embodiment of the present invention may include an acquisition module 61 , a processing module 62 , a training module 63 , and a detection module 64 .
  • the acquisition module 61 is used to acquire a sample image, and the sample image includes a preset target;
  • the processing module 62 is used to extract the initial feature map of the sample image and perform semantic information enhancement processing on the initial feature map to obtain the first prediction map of the sample image;
  • the first prediction map is used to indicate the target area and the background area of the sample image, the target area being the area containing the preset target and the background area being an area that does not contain the preset target;
  • the training module 63 is used to train the detection network model according to the first prediction map and the second prediction map, so as to obtain a trained detection network model, wherein the second prediction map is obtained by calculating the sample image according to the initial feature map, and the second prediction map is used to indicate the bounding box of the preset target;
  • the detection module 64 is used to detect the image to be tested using the trained detection network model, so as to obtain the detection result of the preset target in the image to be tested.
  • for more details on the working principle, working mode, and beneficial effects of the target detection apparatus in the embodiment of the present invention, reference may be made to the above related descriptions of FIGS. 1 to 5, which will not be repeated here.
  • An embodiment of the present invention further provides a storage medium on which a computer program is stored, and when the computer program is run by a processor, the steps of the target detection method described in FIG. 1 are executed.
  • the storage medium may be a computer-readable storage medium, for example, may include non-volatile memory (non-volatile) or non-transitory (non-transitory) memory, and may also include optical disks, mechanical hard disks, solid-state disks, and the like.
  • An embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores computer instructions that can be run on the processor, and the processor executes the steps of the target detection method described in FIG. 1 when running the computer instructions.
  • the terminal may be a terminal device such as a computer, a tablet computer, or a mobile phone, but is not limited thereto.
  • the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • by way of example but not limitation, many forms of RAM are available, such as:
  • SRAM: static random access memory
  • DRAM: dynamic random access memory
  • SDRAM: synchronous dynamic random access memory
  • DDR SDRAM: double data rate synchronous dynamic random access memory
  • ESDRAM: enhanced synchronous dynamic random access memory
  • SLDRAM: synchlink dynamic random access memory
  • DR RAM: direct rambus random access memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A target detection method and apparatus, a storage medium, and a terminal. The method comprises: acquiring a sample image; extracting an initial feature map of the sample image, and performing semantic information enhancement on the initial feature map, so as to obtain a first prediction map of the sample image, the first prediction map being used to indicate a target area and a background area of the sample image, and the target area being an area that comprises a preset target, and the background area being an area that does not comprise the preset target; and training a detection network model according to the first prediction map and a second prediction map, to obtain a trained detection network model, and detecting an image to be tested by using the trained detection network model, to obtain a detection result of the preset target in the image to be tested. In the technical solution of the present invention, a preset target in an image to be tested can be efficiently and accurately detected.

Description

Target detection method and apparatus, storage medium, and terminal
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 30, 2020, with application number 202011373448.X and entitled "Target detection method and apparatus, storage medium, and terminal", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer vision, and in particular, to a target detection method and apparatus, a storage medium, and a terminal.
Background
Target detection is currently a challenging subject in the field of computer vision, and is widely used in many fields such as robot navigation, intelligent video surveillance, industrial inspection, aerospace, and autonomous driving. Owing to the development of related technologies and the needs of industry, the requirements for the efficiency and accuracy of target detection are becoming ever higher.
With the rapid development of deep learning technology, more and more target detection is performed with convolutional neural networks (CNN), which have gradually replaced traditional image processing algorithms. Although the detection accuracy of convolutional neural networks in target detection tasks has repeatedly reached new highs, their detection accuracy for small-sized targets (for example, targets not exceeding a preset size, or targets whose size as a proportion of the image they belong to does not exceed a preset ratio) is not high; the detection accuracy for small-sized targets is usually only half that for normal-sized targets.
Therefore, there is an urgent need for an efficient and accurate target detection method to improve the detection accuracy of small-sized targets.
Summary of the Invention
The technical problem solved by the present invention is to provide an efficient and accurate target detection method, so as to improve the detection accuracy of small-sized targets.
To solve the above technical problem, an embodiment of the present invention provides a target detection method, the method including: acquiring a sample image, where the sample image includes a preset target; extracting an initial feature map of the sample image, and performing semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate a target area and a background area of the sample image, the target area being an area that contains the preset target and the background area being an area that does not contain the preset target; training a detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is obtained by performing calculation on the sample image according to the initial feature map and is used to indicate a bounding box of the preset target; and detecting an image to be tested using the trained detection network model to obtain a detection result of the preset target in the image to be tested.
Optionally, each pixel of the initial feature map includes multiple channels, and the initial feature map is obtained by downsampling the sample image; performing semantic information enhancement processing on the initial feature map then includes: Step 1: upsampling the initial feature map by a factor of 2 to obtain a first feature map; Step 2: processing the first feature map according to a channel attention mechanism to obtain a second feature map; Step 3: taking the second feature map as a new initial feature map, and repeating Steps 1 to 3 until the upsampling multiple is equal to the downsampling multiple; Step 4: performing a first convolution operation on the initial feature map to obtain the first prediction map, where the number of channels of the first prediction map is 2.
Optionally, before the first feature map is processed according to the channel attention mechanism, the method further includes: performing multiple second convolution operations on the first feature map, each second convolution operation using a 1×1 convolution kernel; and adding the results of the multiple second convolution operations to the first feature map to obtain a new first feature map.
Optionally, before the first feature map is processed according to the channel attention mechanism, the method further includes: performing multiple third convolution operations on the first feature map, at least one of the multiple third convolution operations using a 3×3 convolution kernel; and adding the results of the multiple third convolution operations to the first feature map to obtain a new first feature map.
Optionally, performing multiple second convolution operations on the first feature map includes: in each second convolution operation other than the first, first performing batch normalization on the first feature map, and then applying the relu activation function.
Optionally, the loss function of the detection network model is Loss = λsemantic·Lsemantic + λmodel·Lmodel, where Loss is the loss function of the detection network model, Lsemantic is the first loss function, Lmodel is the second loss function, λsemantic is the weight coefficient of the first loss function, and λmodel is the weight coefficient of the second loss function; training the detection network model according to the first prediction map then includes: Step A: calculating a first loss function value according to the first prediction map, the sample image, and the first loss function, and calculating a second loss function value according to the second prediction map, the sample image, and the second loss function; Step B: calculating the loss function value of the detection network model according to the first loss function value and the second loss function value; Step C: determining whether the loss function value exceeds a preset threshold; if so, adjusting the parameters of the module used to extract the initial feature map in the detection network model; if not, ending the training of the detection network model; Step D: extracting the initial feature map of the sample image using the detection network model with adjusted parameters, performing semantic information enhancement processing on the initial feature map to obtain the first prediction map, and performing prediction on the sample image according to the initial feature map to obtain the second prediction map; and repeating Steps A to E until it is determined in Step C that the loss function value does not exceed the preset threshold.
Optionally, the first loss function is the FocalLoss function.
Optionally, the weight coefficient of the first loss function is determined by the proportion of the preset target among all the targets in the sample image: the larger the proportion of the preset target, the larger the weight coefficient of the first loss function.
Optionally, before extracting the initial feature map of the sample image, the method further includes: performing data enhancement on the sample image, the data enhancement including one or more of the following: adjusting the brightness and/or contrast of the sample image, rotating the sample image by a preset angle, and adding noise to the sample image.
Optionally, the detection network model is a single-step detection network model.
To solve the above technical problem, an embodiment of the present invention further provides a target detection apparatus, the apparatus including: an acquisition module, configured to acquire a sample image, where the sample image includes a preset target; a processing module, configured to extract an initial feature map of the sample image and perform semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate a target area and a background area of the sample image, the target area being an area that contains the preset target and the background area being an area that does not contain the preset target; a training module, configured to train a detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is obtained by performing calculation on the sample image according to the initial feature map and is used to indicate a bounding box of the preset target; and a detection module, configured to detect an image to be tested using the trained detection network model to obtain a detection result of the preset target in the image to be tested.
An embodiment of the present invention further provides a storage medium on which a computer program is stored, where the computer program, when run by a processor, executes the steps of the above target detection method.
An embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor executes the steps of the above target detection method when running the computer program.
与现有技术相比,本发明实施例的技术方案具有以下有益效果:Compared with the prior art, the technical solutions of the embodiments of the present invention have the following beneficial effects:
本发明实施例提供一种目标检测方法,所述方法包括:获取样本图像,所述样本图像包括预设目标;提取所述样本图像的初始特征图,并对所述初始特征图进行语义信息增强处理,以得到所述样本图像的第一预测图,所述第一预测图用于指示所述样本图像的目标区域和背景区域,所述目标区域为包含所述预设目标的区域,所述背景区域为未包含所述预设目标的区域;根据所述第一预测图和第二预测图对检 测网络模型进行训练,以得到训练后的检测网络模型,其中,所述第二预测图是根据所述初始特征图对所述样本图像进行计算得到的,所述第二预测图用于指示所述预设目标的边界框;采用训练后的检测网络模型检测待测图像,以得到所述待测图像中所述预设目标的检测结果。本发明实施例的方案中,采用样本图像训练检测网络模型时,首先提取样本图像的初始特征图,然后对初始特征图进行语义信息增强处理,得到能够指示样本图像中的目标区域和背景区域的第一预测图,再根据第一预测图和能够指示预设目标边界框的第二预测图训练检测网络模型,以得到训练后的检测网络模型。由于第一预测图能够指示样本图像中的目标区域和背景区域,因此,第一预测图能够包含较多的样本图像中预设目标的语义信息,采用第一预测图训练检测网络模型,可以使得该检测网络模型很好地学习到预设目标的语义信息,在后续检测待测图像时,能够更准确地得到预设目标的检测结果,而且检测效率更高。An embodiment of the present invention provides a target detection method, the method includes: acquiring a sample image, where the sample image includes a preset target; extracting an initial feature map of the sample image, and performing semantic information enhancement on the initial feature map process to obtain a first prediction map of the sample image, the first prediction map is used to indicate the target area and background area of the sample image, the target area is the area containing the preset target, the The background area is an area that does not contain the preset target; the detection network model is trained according to the first prediction map and the second prediction map to obtain a trained detection network model, wherein the second prediction map is Calculated on the sample image according to the initial feature map, the second prediction map is used to indicate the bounding box of the preset target; the trained detection network model is used to detect the image to be tested to obtain the The detection result of the preset target in the image to be tested. 
In the solution of the embodiment of the present invention, when using the sample image to train the detection network model, the initial feature map of the sample image is first extracted, and then the semantic information enhancement processing is performed on the initial feature map to obtain the target area and the background area in the sample image. the first prediction map, and then train the detection network model according to the first prediction map and the second prediction map capable of indicating the preset target bounding box, so as to obtain the trained detection network model. Since the first prediction map can indicate the target area and the background area in the sample image, the first prediction map can contain more semantic information of the preset target in the sample image. Using the first prediction map to train the detection network model can make The detection network model learns the semantic information of the preset target well, and when the image to be tested is subsequently detected, the detection result of the preset target can be obtained more accurately, and the detection efficiency is higher.
进一步,本发明实施例中,对初始特征图进行语义信息增强处理时,对初始特征图进行上采样并根据通道注意力机制对其进行处理,直至上采样倍数与提取初始特征图时下采样的倍数相同。通过上采样处理,可以使得第一预测图的大小与样本图像相同,以便后续根据第一预测图和样本图像训练损失函数。初始特征图中每个像素点包含多个通道,采用通道注意力机制进行语义信息增强处理,强化与预设目标相关性大的通道,弱化与预设目标相关性小的通道,最后对初始特征图进行卷积运算,将第一预测图的通道数调整为2,使得第一预测图能够采用二分类的方式(也即,指示目标区域和背景区域)直观地体现预设目标的语义信息。Further, in the embodiment of the present invention, when the semantic information enhancement processing is performed on the initial feature map, the initial feature map is up-sampled and processed according to the channel attention mechanism, until the up-sampling multiple is equal to the down-sampling multiple when extracting the initial feature map. same. Through the upsampling process, the size of the first prediction map can be made the same as the sample image, so that the loss function can be subsequently trained according to the first prediction map and the sample image. Each pixel in the initial feature map contains multiple channels, and the channel attention mechanism is used to enhance semantic information, strengthen the channel with high correlation with the preset target, weaken the channel with small correlation with the preset target, and finally analyze the initial feature. Convolution operation is performed on the image, and the number of channels of the first prediction image is adjusted to 2, so that the first prediction image can intuitively reflect the semantic information of the preset target in a two-class manner (ie, indicating the target area and the background area).
Further, in an embodiment of the present invention, the weight coefficient of the first loss function is determined by the proportion of the preset target among all targets in the sample image: the larger this proportion, the larger the weight coefficient of the first loss function. In other words, the first prediction map obtained through the semantic information enhancement processing plays a greater role in training the detection network model, so that the detection network model achieves good performance in detecting the preset target.
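The weighting described above can be sketched as follows. This is a minimal illustration, not the patent's exact formula: the linear weighting scheme and the use of the raw target-count ratio as the weight coefficient are assumptions made for the example.

```python
# Minimal sketch of weighting the two training losses: the first loss (from
# the first prediction map) receives a weight proportional to the preset
# target's share of all targets in the sample image. The linear combination
# below is an illustrative assumption, not the patent's stated formula.

def total_loss(loss1, loss2, num_preset_targets, num_all_targets):
    w1 = num_preset_targets / num_all_targets  # larger share -> larger weight
    return w1 * loss1 + loss2

print(total_loss(0.8, 0.5, 3, 4))  # weight 0.75 -> 0.75 * 0.8 + 0.5 = 1.1
```

Under this scheme, a sample image in which most targets are of the preset type makes the semantic-enhancement branch dominate the training signal, matching the behavior described above.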
Description of Drawings
FIG. 1 is a schematic flowchart of a target detection method in an embodiment of the present invention.
FIG. 2 is a schematic structural diagram of a detection network model to which a target detection method in an embodiment of the present invention is applicable.
FIG. 3 is a schematic structural diagram of the semantic information enhancement module in FIG. 2.
FIG. 4 is a schematic structural diagram of the first residual module in FIG. 3.
FIG. 5 is a schematic diagram of a first prediction map in an embodiment of the present invention.
FIG. 6 is a schematic structural diagram of a target detection apparatus in an embodiment of the present invention.
Detailed Description
As described in the background art, there is an urgent need for an efficient and accurate target detection method that improves the detection accuracy of small-size targets.
Through research, the inventors of the present invention found that, in the prior art, the convolutional neural networks used for small-size target detection mainly include Feature Pyramid Networks (FPN), Generative Adversarial Networks (GAN), and Scale Normalization for Image Pyramids (SNIP). FPN fuses features at different scales to obtain more information about small-size targets in an image; GAN improves detection accuracy by restoring the image information of small targets; and SNIP, on the basis of multi-scale training, back-propagates gradients only for targets whose scale matches the pre-training scale, thereby improving detection accuracy.
Regardless of which convolutional neural network structure is used, improving detection performance for small-size targets requires that the network fully learn the semantic information of such targets during training. However, because small-size targets occupy a small proportion of the image, are often blurry, and have low resolution, the semantic information that a convolutional neural network can extract while learning their feature information is very limited. The network's ability to express the feature information of small-size targets is therefore weak.
To make a convolutional neural network obtain more semantic information of small-size targets, the network depth is usually increased, that is, more convolutional layers are added so that the network acquires more semantic information of small-size targets during training. However, this approach requires a large increase in the number of convolutional layers, resulting in a complex and deep network structure that takes a long time to detect small-size targets, so the convolutional neural network performs poorly in practical small-size target detection applications.
To solve the above technical problem, an embodiment of the present invention provides a target detection method. The method includes: acquiring a sample image, where the sample image includes a preset target; extracting an initial feature map of the sample image, and performing semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate a target region and a background region of the sample image, the target region being a region that contains the preset target and the background region being a region that does not contain the preset target; training a detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is computed from the sample image according to the initial feature map and is used to indicate the bounding box of the preset target; and detecting an image to be tested with the trained detection network model to obtain a detection result of the preset target in the image to be tested.
In the solution of this embodiment, when the sample image is used to train the detection network model, the initial feature map of the sample image is first extracted, semantic information enhancement processing is then performed on the initial feature map to obtain the first prediction map indicating the target region and the background region in the sample image, and the detection network model is then trained according to the first prediction map and the second prediction map capable of indicating the bounding box of the preset target, so as to obtain a trained detection network model. Since the first prediction map can indicate the target region and the background region in the sample image, it can contain rich semantic information of the preset target. Training the detection network model with the first prediction map therefore enables the model to learn the semantic information of the preset target well, so that the detection result of the preset target can be obtained quickly and accurately when an image to be tested is subsequently detected.
In order to make the above objects, features, and beneficial effects of the present invention more comprehensible, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a target detection method in an embodiment of the present invention. The target detection method may be performed by a terminal, which may be any suitable terminal, such as a mobile phone, a computer, or an Internet of Things device, but is not limited thereto. The method may be used to detect whether an image to be tested contains a preset target, or to detect the specific position and category of the preset target in the image to be tested, but is not limited thereto. The image to be tested may be an image captured by the terminal in real time, an image pre-stored on the terminal, or an image received by the terminal from the outside, but is not limited thereto. The preset target may be determined by the terminal according to an instruction received in advance from the outside, or determined by the terminal by recognizing sample images with any suitable model.
The target detection method shown in FIG. 1 may specifically include the following steps:
Step S101: acquiring a sample image, where the sample image includes a preset target;
Step S102: extracting an initial feature map of the sample image, and performing semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate a target region and a background region of the sample image, the target region being a region that contains the preset target and the background region being a region that does not contain the preset target;
Step S103: training a detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is computed from the sample image according to the initial feature map and is used to indicate the bounding box of the preset target;
Step S104: detecting an image to be tested with the trained detection network model to obtain a detection result of the preset target in the image to be tested.
In a specific implementation of step S101, the terminal may acquire sample images from the outside, or select at least a part of a locally stored training set as sample images, where the sample images may include the preset target.
Further, the preset target refers to a specific target object, for example, a traffic sign, a license plate, or a human face. The preset target may be determined by the terminal according to an instruction received in advance from the outside, or determined by the terminal by recognizing sample images with any suitable model.
In addition, further conditions may be imposed on the preset target. For example, on the basis of a specific target object, its size may be required not to exceed a preset size, or not to exceed a preset proportion of the size of the image, but this is not limiting. The preset size and preset proportion may be set in advance.
Further, the sample image may include an identification graphic, which is used to indicate the position of the preset target in the sample image and may also indicate the category of the preset target. For example, in a multi-target detection scenario (that is, when there are multiple preset targets), identification graphics of different shapes may represent different categories of preset targets.
In a specific implementation of step S102, before the detection network model is trained, it needs to be constructed; the detection network model may have any suitable structure.
Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a detection network model to which a target detection method in an embodiment of the present invention is applicable. A non-limiting description of this model is given below with reference to FIG. 2.
The detection network model shown in FIG. 2 may include a feature extraction module 21, a prediction module 22, and a semantic information enhancement module 23.
Further, the detection network model may be a single-step detection network model (that is, a model that obtains the detection result of the preset target in a single stage after the image to be tested is fed into the network, without a candidate-region proposal stage), a two-step detection network model (that is, a model that first generates multiple candidate regions from the input image and then classifies the candidate regions), or any other suitable network model; no limitation is imposed here. As a non-limiting example, the detection network model is a single-step detection network model.
Further, the feature extraction module 21 may be used to extract features from the sample image to obtain an initial feature map of the sample image. The initial feature map may include position information of the preset target in the sample image, and may also include semantic information of the preset target.
Further, the feature extraction module 21 may obtain the initial feature map by down-sampling the sample image multiple times. For example, the feature extraction module 21 down-samples the sample image by a factor of 2^n to obtain the initial feature map, where n is a positive integer.
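The spatial effect of this 2^n-fold down-sampling can be sketched as follows; the helper name and the assumption that the image dimensions are divisible by 2^n are illustrative, not from the patent.

```python
# Minimal sketch: spatial size of the initial feature map after the feature
# extraction module down-samples an H x W sample image by a factor of 2^n.
# Assumes H and W are multiples of 2^n (a common convention, assumed here).

def feature_map_size(h, w, n):
    factor = 2 ** n
    return h // factor, w // factor

print(feature_map_size(512, 512, 2))  # 4x down-sampling -> (128, 128)
```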
Further, the initial feature map extracted by the feature extraction module 21 may be transmitted to the prediction module 22, which may perform computations on it to obtain the second prediction map. Since the initial feature map contains the position information and semantic information of the preset target in the sample image, the second prediction map may be used to indicate the bounding box of the preset target in the sample image. Specifically, the prediction module 22 may compute, from the initial feature map, the key-point positions, offsets, and sizes of the preset target in the sample image, but is not limited thereto.
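One way key points, offsets, and sizes can be turned into a bounding box is sketched below. The patent does not fix the exact decoding formula; the CenterNet-style decoding used here (grid-cell center plus sub-pixel offset, scaled by the down-sampling stride, then expanded by the predicted size) is an illustrative assumption.

```python
# Minimal sketch of recovering a bounding box from the quantities the
# prediction module computes: a key point (cx, cy) on the down-sampled grid,
# a sub-pixel offset (dx, dy), and a target size (w, h). The decoding scheme
# is an assumption; the patent only names the predicted quantities.

def decode_box(cx, cy, dx, dy, w, h, stride):
    """Map a grid-cell key point back to image coordinates, then expand by size."""
    x = (cx + dx) * stride
    y = (cy + dy) * stride
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

print(decode_box(10, 8, 0.5, 0.25, 32, 16, stride=4))  # (26.0, 25.0, 58.0, 41.0)
```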
Further, the initial feature map extracted by the feature extraction module 21 may also be input to the semantic information enhancement module 23, which performs semantic information enhancement processing on it to obtain the first prediction map. The semantic information enhancement module 23 includes at least one sub-module and a two-class prediction module. The sub-modules are used to up-sample the initial feature map and to extract the semantic information of the preset target from it, and the two-class prediction module is used to indicate, by means of binary classification, the target region and the background region (the region that contains a target and the region that does not, respectively), so as to enhance the semantic information of the prediction.
Further, when the detection network model is constructed, the number of sub-modules also needs to be determined; it is determined by the down-sampling factor described above. Specifically, if the feature extraction module 21 down-samples the sample image by a factor of 2^n when extracting the initial feature map, the number of sub-modules is n, where n is a positive integer.
Referring to FIG. 3, FIG. 3 shows a schematic structural diagram of the semantic information enhancement module 23 in FIG. 2. A non-limiting description of the module is given below with reference to FIG. 3.
FIG. 3 shows the structure of the semantic information enhancement module when the feature extraction module 21 down-samples the sample image by a factor of 4; it includes a first sub-module 31, a second sub-module 32, and a two-class prediction module 33.
Further, the first sub-module 31 and the second sub-module 32 each include an up-sampling module 34, a first residual module 35, a second residual module 36, and a channel attention module 37.
Further, the up-sampling module 34 may be used to up-sample the initial feature map so that the first prediction map obtained through the semantic information enhancement module has the same size as the sample image, for subsequent use in training the loss function.
Further, the first residual module 35 may be used to extract more feature information of the preset target from the image output by the up-sampling module 34, and can also avoid vanishing gradients while extracting feature information. The first residual module 35 may include multiple convolutional layers; since each convolutional layer in it uses a 1×1 convolution kernel, the first residual module 35 can better extract the features of each individual pixel.
FIG. 4 shows a schematic structural diagram of the first residual module 35 in FIG. 3. A non-limiting description of the first residual module is given below with reference to FIG. 4.
The first residual module 35 includes a first convolutional layer 41, a second convolutional layer 42, and a third convolutional layer 43. The image output by the up-sampling module 34 may be input to the first convolutional layer 41, which may use a 1×1 convolution kernel. The image output by the first convolutional layer 41 may be input to the second convolutional layer 42, which may also use a 1×1 convolution kernel; as a non-limiting example, the second convolutional layer 42 may be a grouped convolutional layer. The image output by the second convolutional layer 42 may be input to the third convolutional layer 43, which may use a 1×1 convolution kernel.
Further, the output of the third convolutional layer 43 may be added to the output of the first convolutional layer 41 to obtain the output of the first residual module 35, which avoids the vanishing-gradient problem in the detection network model.
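The structure described above (three 1×1 convolutions with a skip connection from the first layer's output) can be sketched in plain Python. The weight shapes are illustrative assumptions, and the grouping of the second layer is omitted for brevity; a 1×1 convolution reduces to a per-pixel linear combination across channels.

```python
# Minimal sketch of the first residual module over a feature map stored as
# nested lists [channel][row][col]. Weights are illustrative assumptions;
# the grouped structure of the second layer is omitted here.

def conv1x1(fmap, weights):
    """1x1 convolution: a per-pixel linear combination across channels.
    weights[o][i] maps input channel i to output channel o."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weights[o][i] * fmap[i][y][x] for i in range(c_in))
              for x in range(w)] for y in range(h)]
            for o in range(len(weights))]

def first_residual_module(fmap, w41, w42, w43):
    """conv41 -> conv42 -> conv43, then add conv43's output to conv41's."""
    out41 = conv1x1(fmap, w41)
    out43 = conv1x1(conv1x1(out41, w42), w43)
    c, h, w = len(out41), len(out41[0]), len(out41[0][0])
    return [[[out41[o][y][x] + out43[o][y][x] for x in range(w)]
             for y in range(h)] for o in range(c)]
```

The final addition is the skip connection: even if the stacked convolutions contribute little, the gradient can still flow through the identity path, which is what prevents it from vanishing.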
Further, continuing to refer to FIG. 3, the second residual module 36 may also be used to extract more feature information of the preset target from the image output by the up-sampling module 34, and can likewise avoid vanishing gradients while extracting more feature information. The second residual module 36 may include multiple convolutional layers, at least one of which uses a 3×3 convolution kernel, so the second residual module 36 can better extract features of the receptive field corresponding to each pixel. In a non-limiting embodiment, the second residual module 36 may include a fourth convolutional layer, a fifth convolutional layer, and a sixth convolutional layer. The image output by the up-sampling module 34 may be input to the fourth convolutional layer, which may use a 1×1 convolution kernel; its output may be input to the fifth convolutional layer, which may use a 3×3 convolution kernel and, as a non-limiting example, may be a grouped convolutional layer; and the output of the fifth convolutional layer may be input to the sixth convolutional layer, which may use a 1×1 convolution kernel.
Further, the output of the sixth convolutional layer may be added to the output of the fourth convolutional layer to obtain the output of the second residual module 36, which avoids the vanishing-gradient problem in the detection network model.
Further, the channel attention module 37 may be used to process the image output by the up-sampling module 34 according to a channel attention mechanism. Specifically, for each pixel of the image, an activation function may be used to determine a weight value for each channel; the more relevant a channel is to the preset target, the larger its weight value. The image output by the up-sampling module 34 is then weighted by the channel weights, so that the image output by the channel attention module 37 can well represent the semantic information of the preset target, with stronger directivity.
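A minimal channel-attention step can be sketched as follows. The patent only states that an activation function produces per-channel weights; deriving each weight from the sigmoid of the channel's mean (a squeeze-and-excitation-style scheme without learned layers) is an illustrative assumption.

```python
import math

# Minimal sketch of channel attention over a [channel][row][col] feature map.
# How the per-channel scores are computed is an assumption: here each
# channel's weight is the sigmoid of its mean activation, so channels with
# stronger responses (assumed more target-relevant) are strengthened.

def channel_attention(fmap):
    weighted = []
    for channel in fmap:
        mean = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        w = 1.0 / (1.0 + math.exp(-mean))  # sigmoid -> weight in (0, 1)
        weighted.append([[w * v for v in row] for row in channel])
    return weighted
```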
It should be noted that the first residual module 35 and the second residual module 36 are optional; that is, the first sub-module 31 or the second sub-module 32 may include only the up-sampling module 34 and the channel attention module 37, or may include the up-sampling module 34 and the channel attention module 37 together with the first residual module 35 and/or the second residual module 36, but is not limited thereto.
Further, the two-class prediction module 33 may be used to adjust the number of channels of the image to 2. The two-class prediction module may contain a seventh convolutional layer whose number of filters is 2, so that the first prediction map output by the two-class prediction module can indicate the target region and the background region, where the target region is a region containing the preset target and the background region is a region not containing the preset target.
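The two-class prediction step can be sketched as a 1×1 convolution with two filters followed by a per-pixel argmax. The filter weights and the argmax read-out are illustrative assumptions; the patent specifies only that the layer has 2 filters so the output has 2 channels.

```python
# Minimal sketch of the two-class prediction module: two 1x1 filters turn a
# C-channel feature map into a 2-channel score map, and a per-pixel argmax
# yields a target/background mask. Filter weights are illustrative.

def two_class_predict(fmap, filters):
    """fmap: [channel][row][col]; filters: two weight vectors over channels.
    Returns a [row][col] mask: 1 = target region, 0 = background region."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    scores = [[[sum(filters[k][i] * fmap[i][y][x] for i in range(c))
                for x in range(w)] for y in range(h)] for k in range(2)]
    return [[1 if scores[1][y][x] > scores[0][y][x] else 0
             for x in range(w)] for y in range(h)]
```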
Continuing to refer to FIG. 1, after the detection network model is constructed, the terminal may input the sample image into it, and the detection network model may extract the initial feature map of the sample image, for example by down-sampling the sample image. Each pixel of the initial feature map includes multiple channels. The initial feature map may include position information of the preset target in the sample image, and may also include semantic information of the preset target.
It should be noted that the down-sampling factor is determined by the structure of the module that extracts the initial feature map, and the down-sampling factor applied to the sample image is the same as that applied to the image to be tested. As a further example, the feature extraction module 21 in FIG. 2 may be used to extract the initial feature map.
Further, before the initial feature map of the sample image is extracted, data enhancement may be performed on the sample image. The data enhancement includes one or more of the following: adjusting the brightness and/or contrast of the sample image, rotating the sample image by a preset angle, and adding noise to the sample image, but is not limited thereto.
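The three listed augmentations can be sketched as follows on a grayscale image stored as nested lists. The parameter names, the 90-degree rotation, and the uniform-noise model are illustrative choices, not from the patent.

```python
import random

# Minimal sketch of the listed augmentations on a grayscale image stored as
# [row][col] values in [0, 255]. All parameter choices are illustrative.

def adjust_brightness_contrast(img, alpha=1.0, beta=0.0):
    """out = clamp(alpha * pixel + beta): alpha scales contrast, beta shifts brightness."""
    return [[min(255.0, max(0.0, alpha * v + beta)) for v in row] for row in img]

def rotate_90(img):
    """Rotate the image by a preset angle of 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def add_noise(img, scale=5.0, seed=0):
    """Add uniform noise in [-scale, scale], clamped to the valid range."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, v + rng.uniform(-scale, scale))) for v in row]
            for row in img]
```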
Further, the terminal may compute, from the initial feature map of the sample image, the key-point positions, offsets, and sizes of the preset target in the sample image to obtain the second prediction map of the sample image, which may be used to indicate the bounding box of the preset target in the sample image.
Further, semantic information enhancement processing may be performed on the initial feature map to obtain the first prediction map of the sample image. For example, the initial feature map may be transmitted to the semantic information enhancement module 23 in FIG. 2 for semantic information enhancement processing.
In a non-limiting embodiment, where the initial feature map is obtained by down-sampling the sample image, performing semantic information enhancement processing on the initial feature map may include: Step 1: up-sampling the initial feature map by a factor of 2 to obtain a first feature map; Step 2: processing the first feature map according to the channel attention mechanism to obtain a second feature map; Step 3: taking the second feature map as a new initial feature map; repeating Steps 1 to 3 until the cumulative up-sampling factor equals the down-sampling factor; and Step 4: performing a first convolution operation on the initial feature map to obtain the first prediction map, where the number of channels of the first prediction map is 2.
Specifically, the image obtained from each up-sampling is processed with the channel attention mechanism until the factor by which the initial feature map has been up-sampled equals the factor by which the sample image was down-sampled. The number of up-sampling steps, and likewise the number of channel-attention steps, is determined by the down-sampling factor: if the down-sampling factor is 2^n, the number of up-sampling steps (and of channel-attention steps) is n, where n is a positive integer.
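The repeated Steps 1 to 3 can be sketched as a loop. Nearest-neighbor up-sampling and an identity placeholder for the attention step are illustrative assumptions; the patent does not fix the interpolation method.

```python
# Minimal sketch of the enhancement loop: repeat (2x up-sampling followed by
# channel attention) n times for a 2^n down-sampling factor, restoring the
# feature map to the sample image's spatial size. Nearest-neighbor
# up-sampling and the identity attention default are illustrative.

def upsample2x(fmap):
    """Nearest-neighbor 2x up-sampling of a [channel][row][col] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            doubled = [v for v in row for _ in (0, 1)]  # duplicate each column
            rows.append(doubled)
            rows.append(list(doubled))                  # duplicate each row
        out.append(rows)
    return out

def enhance(fmap, n, attention=lambda f: f):
    """Steps 1-3 repeated until the up-sampling factor reaches 2^n."""
    for _ in range(n):
        fmap = attention(upsample2x(fmap))
    return fmap

big = enhance([[[1.0]]], n=2)  # a 1x1 map restored to 4x4
```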
Further, when the channel attention mechanism is used to process the first feature map, for each pixel of the first feature map, an activation function may be used to determine a weight value for each channel of the pixel; the more relevant a channel is to the preset target, the larger its weight value. The first feature map is then weighted by the channel weights to obtain the second feature map, so that the second feature map can clearly represent the semantic information of the preset target, with stronger directivity.
Further, when the up-sampling factor equals the down-sampling factor, the first convolution operation may be performed on the initial feature map, for example by the two-class prediction module 33 in FIG. 3. In the first convolution operation, the number of filters used is 2 and the resulting first prediction map has 2 channels; the first prediction map can therefore indicate the target region (a region of the sample image that contains the preset target) and the background region (a region of the sample image that does not). Referring to FIG. 5, FIG. 5 is a schematic diagram of a first prediction map in an embodiment of the present invention, in which target region 51 and target region 52 are regions containing the preset target, while background region 53 does not contain the preset target.
Thus, through the semantic information enhancement processing, the initial feature map is restored to the size of the sample image while the important channels of each pixel (those highly correlated with the preset target) are selected, and the first convolution operation then yields the first prediction map. The first prediction map intuitively expresses the target region and the background region of the sample image in a two-class manner and therefore contains rich semantic information.
Further, continuing to refer to FIG. 1, before the first feature map is processed according to the channel attention mechanism, multiple second convolution operations may also be performed on the first feature map, each using a 1×1 convolution kernel; the final result of the multiple second convolution operations is added to the first feature map to obtain a new first feature map.
进一步地,第一次第二卷积运算之外的其他第二卷积运算中,均先对第一特征图进行批标准化处理,再使用relu激活函数。以使得检测网络模型更加优化。Further, in other second convolution operations other than the first second convolution operation, batch normalization is performed on the first feature map first, and then the relu activation function is used. In order to make the detection network model more optimized.
在一个非限制性实施例中,第二卷积运算的次数为3。其中,对第一特征图进行第二次第二卷积运算时,可以采用分组卷积的方式进行计算。In one non-limiting embodiment, the number of second convolution operations is three. Wherein, when performing the second second convolution operation on the first feature map, the calculation may be performed in a grouped convolution manner.
在另一个非限制性实施例中,可以采用图4所示的第一残差模块35进行多次第二卷积运算。In another non-limiting embodiment, the first residual module 35 shown in FIG. 4 may be used to perform multiple second convolution operations.
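A minimal sketch of the first residual module described above, implementing each 1×1 convolution as a per-pixel linear map over channels. The pre-activation ordering (batch normalization, then ReLU, before each second convolution operation other than the first), the simplified batch normalization, and all shapes are illustrative assumptions; the grouped-convolution variant of the second operation is omitted for brevity.

```python
import numpy as np

def conv1x1(x, w):
    # a 1x1 convolution is a per-pixel linear map over the channel axis
    h, wd, c = x.shape
    return (x.reshape(-1, c) @ w).reshape(h, wd, -1)

def batch_norm(x, eps=1e-5):
    # simplified batch normalization over the spatial dimensions
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def first_residual_module(x, ws):
    """Multiple second convolution operations with 1x1 kernels; every
    operation except the first is preceded by batch normalization and a
    ReLU activation, and the final result is added to the input to give
    the new first feature map."""
    out = conv1x1(x, ws[0])
    for w in ws[1:]:
        out = conv1x1(np.maximum(batch_norm(out), 0.0), w)
    return x + out  # residual addition

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))                           # assumed feature map
ws = [0.1 * rng.standard_normal((16, 16)) for _ in range(3)]  # 3 operations
y = first_residual_module(x, ws)
print(y.shape)  # (8, 8, 16)
```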
Further, before the first feature map is processed according to the channel attention mechanism, multiple third convolution operations may also be performed on the first feature map, at least one of which uses a 3×3 convolution kernel; the final result of the multiple third convolution operations is added to the first feature map to obtain a new first feature map.

Further, in the process of performing the multiple third convolution operations, each third convolution operation other than the first one may apply batch normalization and use the ReLU activation function, so that the detection network model is better optimized.

In a non-limiting embodiment, the number of third convolution operations is 3, where the second third convolution operation uses a 3×3 convolution kernel and the other third convolution operations use 1×1 convolution kernels. In addition, the second third convolution operation is computed using grouped convolution, while the other third convolution operations are computed using ordinary convolution.

In another non-limiting embodiment, the second residual module 36 shown in FIG. 3 may be used to perform the multiple third convolution operations.
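The grouped convolution used for the 3×3 operation above reduces its parameter count by the number of groups, since each filter only sees the input channels of its own group. A small sketch of the arithmetic follows; the channel count 64 and group count 8 are assumed values, not taken from the embodiments.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weight count of a k x k convolution layer; with grouped convolution
    each filter only sees c_in / groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

c = 64                                         # assumed channel count
ordinary = conv_params(3, c, c)                # ordinary 3x3 convolution
grouped = conv_params(3, c, c, groups=8)       # grouped 3x3 convolution
print(ordinary, grouped, ordinary // grouped)  # 36864 4608 8
```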
In a specific implementation of step S103, the detection network model is trained according to the first prediction map and the second prediction map to obtain a trained detection network model.

Specifically, the loss function of the detection network model consists of a first loss function and a second loss function, and may be expressed by the following formula:
Loss = λ_semantic · L_semantic + λ_model · L_model,

where Loss is the loss function of the detection network model, L_semantic is the first loss function, L_model is the second loss function, λ_semantic is the weight coefficient of the first loss function, and λ_model is the weight coefficient of the second loss function, with λ_semantic + λ_model = 1.
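The weighted combination above can be sketched as follows; the loss values and λ_semantic = 0.1 are illustrative numbers only (λ_semantic = 0.1 matches the 10% small-target embodiment described later).

```python
def total_loss(l_semantic, l_model, lam_semantic):
    """Loss = λ_semantic * L_semantic + λ_model * L_model, where the two
    weight coefficients sum to 1."""
    lam_model = 1.0 - lam_semantic
    return lam_semantic * l_semantic + lam_model * l_model

# illustrative first and second loss function values
print(round(total_loss(0.8, 0.4, lam_semantic=0.1), 4))  # 0.44
```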
Further, training the detection network model may include the following steps. Step A: calculate a first loss function value according to the first prediction map, the sample image, and the first loss function, and calculate a second loss function value according to the second prediction map, the sample image, and the second loss function. Step B: calculate the loss function value of the detection network model according to the first loss function value and the second loss function value. Step C: determine whether the loss function value exceeds a preset threshold; if so, adjust the parameters of the module used to extract the initial feature map in the detection network model; if not, end the training of the detection network model. Step D: extract the initial feature map of the sample image using the detection network model with adjusted parameters, perform semantic information enhancement processing on the initial feature map to obtain the first prediction map, and perform prediction on the sample image according to the initial feature map to obtain the second prediction map.
Steps A to D are repeated until it is determined in Step C that the loss function value does not exceed the preset threshold, that is, until Step C jumps to ending the training of the detection network model. The preset threshold may be received by the terminal from the outside, or may be determined by the terminal through calculation. It follows that, in the process of training the detection network model, by repeatedly adjusting the parameters of the module used to extract the initial feature map (for example, the feature extraction module 21 in FIG. 2), the detection network model can learn sufficient semantic information of the preset target. Since the first prediction map indicates the target area and the background area by means of binary classification and thus contains rich semantic information, when the first loss function value is calculated according to the first prediction map and the sample image, there is no need to deepen the detection network model for it to learn more semantic information.
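Steps A through D can be sketched as a loop. The model interface below (forward(), losses(), adjust_feature_extractor()) is entirely hypothetical, standing in for the modules of the detection network model; the dummy model only demonstrates the control flow.

```python
def train(model, sample_images, preset_threshold, max_iters=100):
    """Steps A-D: compute the two loss function values, combine them into
    the loss of the detection network model, and keep adjusting the
    feature-extraction parameters until the loss no longer exceeds the
    preset threshold (at which point Step C ends the training)."""
    for _ in range(max_iters):
        first_pred, second_pred = model.forward(sample_images)
        # Step A: first and second loss function values
        l_sem, l_mod = model.losses(first_pred, second_pred, sample_images)
        # Step B: loss function value of the detection network model
        loss = model.lam_semantic * l_sem + (1.0 - model.lam_semantic) * l_mod
        # Step C: end training once the loss does not exceed the threshold
        if loss <= preset_threshold:
            break
        # Step D: adjust the module that extracts the initial feature map
        model.adjust_feature_extractor(loss)
    return model

class DummyModel:
    """Stand-in for the detection network model, for demonstration only."""
    lam_semantic = 0.1
    def __init__(self):
        self.scale = 1.0
    def forward(self, images):
        return None, None               # first and second prediction maps
    def losses(self, first_pred, second_pred, images):
        return self.scale, self.scale   # pretend both losses shrink together
    def adjust_feature_extractor(self, loss):
        self.scale *= 0.5               # pretend each adjustment halves them

m = train(DummyModel(), [], preset_threshold=0.1)
print(m.scale)  # 0.0625
```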
As a non-limiting embodiment, the first loss function may be the Focal Loss function, that is:

L_semantic = FL(p_t) = -(1 - p_t)^γ · log(p_t), where p_t = p if y = 1, and p_t = 1 - p if y = 0,

where γ ≥ 0 is the focusing parameter of the Focal Loss; y = 1 indicates that the sample image is a positive sample, that is, the sample image contains the preset target; y = 0 indicates that the sample image is a negative sample, that is, the sample image does not contain the preset target; and p is the probability predicted by the detection network model for the preset target. It should be noted that when the first loss function is the Focal Loss function, the problem of imbalance between positive and negative samples in the sample images can be alleviated. When the preset target is a small-sized target, the number of small-sized targets in the sample images is generally small, so using the Focal Loss function as the first loss function avoids insufficient training caused by the small number of small-sized targets.
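A minimal numpy sketch of the standard Focal Loss under the definitions above; γ = 2.0 is a common value from the Focal Loss literature, not one stated in this description. The well-classified positive (p = 0.9) is strongly down-weighted relative to the harder negative (p = 0.3).

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal Loss: p_t = p for positive samples (y = 1) and 1 - p for
    negative samples (y = 0); the (1 - p_t)**gamma factor down-weights
    easy, well-classified examples."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.9, 0.3])  # predicted probabilities for the preset target
y = np.array([1, 0])      # a positive sample and a negative sample
vals = focal_loss(p, y)
print(np.round(vals, 4))  # [0.0011 0.0321]
```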
Further, the weight coefficient of the first loss function is determined by the proportion of the preset target among all targets in the sample images: the larger this proportion, the larger the weight coefficient of the first loss function, that is, the greater the role played by the first prediction map (obtained through the semantic information enhancement processing) in training the detection network model, so that the detection network model performs well in detecting the preset target. In a non-limiting embodiment, the preset target is a target whose size is smaller than 32×32; if the proportion of the preset target in the sample images is 10%, then λ_semantic = 0.1 and λ_model = 0.9.
In a specific implementation of step S104, the terminal may first acquire an image to be tested. The image to be tested may be collected by the terminal in real time, received from the outside in real time, or pre-stored locally, but is not limited thereto.

Further, the trained detection network model is used to detect the image to be tested. If it is detected that the image to be tested contains the preset target, the position and range of the preset target may be output; for example, the position and range of the preset target may be marked in the image to be tested with a bounding box.

Further, when multi-target detection is performed, that is, when the preset target has multiple categories, the terminal may also identify the category information of the preset target at the same time.

Since the trained detection network model has fully learned the semantic information of the preset target, when the image to be tested is detected, the extracted feature map (which may, for example, be extracted by the feature extraction module 21 in FIG. 2) contains rich semantic information; the detection result of the preset target can then be obtained by computing on the feature map of the image to be tested (for example, by the prediction module 22 in FIG. 2).

It should be noted that, when the image to be tested is detected, no semantic information enhancement processing is performed on its feature map; that is, the prediction map indicating the bounding box of the preset target is computed directly from the feature map.
Referring to FIG. 6, which shows a target detection apparatus in an embodiment of the present invention, the target detection apparatus may include an acquisition module 61, a processing module 62, a training module 63, and a detection module 64.

The acquisition module 61 is configured to acquire a sample image, the sample image including a preset target. The processing module 62 is configured to extract an initial feature map of the sample image and perform semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, where the first prediction map is used to indicate a target area and a background area of the sample image, the target area being an area containing the preset target and the background area being an area not containing the preset target. The training module 63 is configured to train the detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, where the second prediction map is computed from the sample image according to the initial feature map and is used to indicate a bounding box of the preset target. The detection module 64 is configured to detect an image to be tested using the trained detection network model to obtain a detection result of the preset target in the image to be tested.

For more details on the working principle, working mode, and beneficial effects of the target detection apparatus in this embodiment of the present invention, reference may be made to the above descriptions of FIG. 1 to FIG. 5, which will not be repeated here.
An embodiment of the present invention further provides a storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the target detection method described above with reference to FIG. 1 are executed. The storage medium may be a computer-readable storage medium and may include, for example, non-volatile or non-transitory memory, as well as optical discs, mechanical hard disks, solid-state drives, and the like.

An embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores computer instructions that can run on the processor, and the processor, when running the computer instructions, executes the steps of the target detection method described above with reference to FIG. 1. The terminal may be a terminal device such as a computer, a tablet computer, or a mobile phone, but is not limited thereto.
Specifically, in this embodiment of the present invention, the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

It should also be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The term "plurality" in the embodiments of the present application refers to two or more.

The descriptions such as "first" and "second" in the embodiments of the present application are used only to illustrate and distinguish the described objects; they imply no order, do not indicate any particular limitation on the number of devices in the embodiments of the present application, and do not constitute any limitation on the embodiments of the present application.

Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be subject to the scope defined by the claims.

Claims (13)

  1. A target detection method, characterized in that the method comprises:

    acquiring a sample image, the sample image including a preset target;

    extracting an initial feature map of the sample image, and performing semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, the first prediction map being used to indicate a target area and a background area of the sample image, the target area being an area containing the preset target, and the background area being an area not containing the preset target;

    training a detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, wherein the second prediction map is computed from the sample image according to the initial feature map, and the second prediction map is used to indicate a bounding box of the preset target; and

    detecting an image to be tested using the trained detection network model to obtain a detection result of the preset target in the image to be tested.
  2. The target detection method according to claim 1, characterized in that each pixel of the initial feature map includes a plurality of channels, and the initial feature map is obtained by downsampling the sample image; performing the semantic information enhancement processing on the initial feature map then comprises:

    step one: upsampling the initial feature map by a factor of 2 to obtain a first feature map;

    step two: processing the first feature map according to a channel attention mechanism to obtain a second feature map;

    step three: taking the second feature map as a new initial feature map;

    repeating step one to step three until the upsampling multiple is equal to the downsampling multiple; and

    step four: performing a first convolution operation on the initial feature map to obtain the first prediction map, wherein the number of channels of the first prediction map is 2.
  3. The target detection method according to claim 2, characterized in that, before the first feature map is processed according to the channel attention mechanism, the method further comprises:

    performing multiple second convolution operations on the first feature map, each second convolution operation using a 1×1 convolution kernel; and

    adding the result of the multiple second convolution operations to the first feature map to obtain a new first feature map.
  4. The target detection method according to claim 2 or 3, characterized in that, before the first feature map is processed according to the channel attention mechanism, the method further comprises:

    performing multiple third convolution operations on the first feature map, at least one of the multiple third convolution operations using a 3×3 convolution kernel; and

    adding the result of the multiple third convolution operations to the first feature map to obtain a new first feature map.
  5. The target detection method according to claim 3, characterized in that performing the multiple second convolution operations on the first feature map comprises:

    in each second convolution operation other than the first one, first performing batch normalization on the first feature map and then using the ReLU activation function.
  6. The target detection method according to claim 1, characterized in that the loss function of the detection network model is Loss = λ_semantic · L_semantic + λ_model · L_model, where Loss is the loss function of the detection network model, L_semantic is the first loss function, L_model is the second loss function, λ_semantic is the weight coefficient of the first loss function, and λ_model is the weight coefficient of the second loss function;

    training the detection network model according to the first prediction map then comprises:

    step A: calculating a first loss function value according to the first prediction map, the sample image, and the first loss function, and calculating a second loss function value according to the second prediction map, the sample image, and the second loss function;

    step B: calculating the loss function value of the detection network model according to the first loss function value and the second loss function value;

    step C: determining whether the loss function value exceeds a preset threshold; if so, adjusting the parameters of the module used to extract the initial feature map in the detection network model; if not, ending the training of the detection network model;

    step D: extracting the initial feature map of the sample image using the detection network model with adjusted parameters, performing semantic information enhancement processing on the initial feature map to obtain the first prediction map, and performing prediction on the sample image according to the initial feature map to obtain the second prediction map; and

    repeating step A to step D until it is determined in step C that the loss function value does not exceed the preset threshold.
  7. The target detection method according to claim 6, characterized in that the first loss function is the Focal Loss function.
  8. The target detection method according to claim 6, characterized in that the weight coefficient of the first loss function is determined by the proportion of the preset target among all targets in the sample image, and the larger the proportion of the preset target among all targets, the larger the weight coefficient of the first loss function.
  9. The target detection method according to claim 1, characterized in that, before the initial feature map of the sample image is extracted, the method further comprises:

    performing data enhancement on the sample image, the data enhancement including one or more of the following: adjusting the brightness and/or contrast of the sample image, rotating the sample image by a preset angle, and adding noise to the sample image.
  10. The target detection method according to claim 1, characterized in that the detection network model is a single-step detection network model.
  11. A target detection apparatus, characterized in that the apparatus comprises:

    an acquisition module, configured to acquire a sample image, the sample image including a preset target;

    a processing module, configured to extract an initial feature map of the sample image and perform semantic information enhancement processing on the initial feature map to obtain a first prediction map of the sample image, the first prediction map being used to indicate a target area and a background area of the sample image, the target area being an area containing the preset target, and the background area being an area not containing the preset target;

    a training module, configured to train a detection network model according to the first prediction map and a second prediction map to obtain a trained detection network model, wherein the second prediction map is computed from the sample image according to the initial feature map, and the second prediction map is used to indicate a bounding box of the preset target; and

    a detection module, configured to detect an image to be tested using the trained detection network model to obtain a detection result of the preset target in the image to be tested.
  12. A storage medium having a computer program stored thereon, characterized in that, when the computer program is run by a processor, the steps of the target detection method according to any one of claims 1 to 10 are executed.
  13. A terminal, comprising a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when running the computer program, executes the steps of the target detection method according to any one of claims 1 to 10.
PCT/CN2021/131132 — Target detection method and apparatus, storage medium, and terminal — filed 2021-11-17, priority date 2020-11-30, published as WO2022111352A1.

Application claiming priority: CN202011373448.X, filed 2020-11-30, published as CN112446378B (Target detection method and device, storage medium and terminal).

Family publications: CN112446378B (CN); WO2022111352A1 (WO).

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147709A (en) * 2022-07-06 2022-10-04 西北工业大学 Underwater target three-dimensional reconstruction method based on deep learning
CN116055895A (en) * 2023-03-29 2023-05-02 荣耀终端有限公司 Image processing method and related device
CN116071309A (en) * 2022-12-27 2023-05-05 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Method, device, equipment and storage medium for detecting sound scanning defect of component
CN116704206A (en) * 2023-06-12 2023-09-05 中电金信软件有限公司 Image processing method, device, computer equipment and storage medium
CN117649358A (en) * 2024-01-30 2024-03-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN117670755A (en) * 2024-01-31 2024-03-08 四川泓宝润业工程技术有限公司 Detection method and device for lifting hook anti-drop device, storage medium and electronic equipment
CN117994251A (en) * 2024-04-03 2024-05-07 华中科技大学同济医学院附属同济医院 Method and system for evaluating severity of diabetic foot ulcer based on artificial intelligence

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446378B (en) * 2020-11-30 2022-09-16 展讯通信(上海)有限公司 Target detection method and device, storage medium and terminal
CN113283453B (en) * 2021-06-15 2023-08-08 深圳大学 Target detection method, device, computer equipment and storage medium
CN114663904A (en) * 2022-04-02 2022-06-24 成都卫士通信息产业股份有限公司 PDF document layout detection method, device, equipment and medium
WO2023221013A1 (en) * 2022-05-19 2023-11-23 中国科学院深圳先进技术研究院 Small object detection method and apparatus based on feature fusion, device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190156144A1 (en) * 2017-02-23 2019-05-23 Beijing Sensetime Technology Development Co., Ltd Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
CN110751134A (en) * 2019-12-23 2020-02-04 长沙智能驾驶研究院有限公司 Target detection method, storage medium and computer device
CN111597945A (en) * 2020-05-11 2020-08-28 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN111738231A (en) * 2020-08-06 2020-10-02 腾讯科技(深圳)有限公司 Target object detection method and device, computer equipment and storage medium
CN112446378A (en) * 2020-11-30 2021-03-05 展讯通信(上海)有限公司 Target detection method and device, storage medium and terminal

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521592B (en) * 2011-11-30 2013-06-12 苏州大学 Multi-feature fusion salient region extracting method based on non-clear region inhibition
CN106529565B (en) * 2016-09-23 2019-09-13 北京市商汤科技开发有限公司 Model of Target Recognition training and target identification method and device calculate equipment
CN110765948A (en) * 2019-10-24 2020-02-07 长沙品先信息技术有限公司 Target detection and identification method and system based on unmanned aerial vehicle
CN111079632A (en) * 2019-12-12 2020-04-28 上海眼控科技股份有限公司 Training method and device of text detection model, computer equipment and storage medium
CN111259930B (en) * 2020-01-09 2023-04-25 南京信息工程大学 General target detection method of self-adaptive attention guidance mechanism
CN111626350B (en) * 2020-05-25 2021-05-18 腾讯科技(深圳)有限公司 Target detection model training method, target detection method and device
CN111914804A (en) * 2020-08-18 2020-11-10 中科弘云科技(北京)有限公司 Multi-angle rotation remote sensing image small target detection method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147709A (en) * 2022-07-06 2022-10-04 西北工业大学 Underwater target three-dimensional reconstruction method based on deep learning
CN115147709B (en) * 2022-07-06 2024-03-19 西北工业大学 Underwater target three-dimensional reconstruction method based on deep learning
CN116071309A (en) * 2022-12-27 2023-05-05 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Method, device, equipment and storage medium for detecting sound scanning defect of component
CN116071309B (en) * 2022-12-27 2024-05-17 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Method, device, equipment and storage medium for detecting sound scanning defect of component
CN116055895A (en) * 2023-03-29 2023-05-02 荣耀终端有限公司 Image processing method and related device
CN116055895B (en) * 2023-03-29 2023-08-22 荣耀终端有限公司 Image processing method and device, chip system and storage medium
CN116704206A (en) * 2023-06-12 2023-09-05 中电金信软件有限公司 Image processing method, device, computer equipment and storage medium
CN117649358A (en) * 2024-01-30 2024-03-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN117649358B (en) * 2024-01-30 2024-04-16 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN117670755A (en) * 2024-01-31 2024-03-08 四川泓宝润业工程技术有限公司 Detection method and device for lifting hook anti-drop device, storage medium and electronic equipment
CN117670755B (en) * 2024-01-31 2024-04-26 四川泓宝润业工程技术有限公司 Detection method and device for lifting hook anti-drop device, storage medium and electronic equipment
CN117994251A (en) * 2024-04-03 2024-05-07 华中科技大学同济医学院附属同济医院 Method and system for evaluating severity of diabetic foot ulcer based on artificial intelligence

Also Published As

Publication number Publication date
CN112446378B (en) 2022-09-16
CN112446378A (en) 2021-03-05

Similar Documents

Publication Title
WO2022111352A1 (en) Target detection method and apparatus, storage medium, and terminal
US11321593B2 (en) Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
CN110310264B (en) DCNN-based large-scale target detection method and device
CN108230329B (en) Semantic segmentation method based on multi-scale convolution neural network
WO2021244621A1 (en) Scenario semantic parsing method based on global guidance selective context network
Wang et al. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection
WO2016124103A1 (en) Picture detection method and device
CN110443258B (en) Character detection method and device, electronic equipment and storage medium
CN112949767B (en) Sample image increment, image detection model training and image detection method
CN110020658B (en) Salient object detection method based on multitask deep learning
CN110781980B (en) Training method of target detection model, target detection method and device
CN113139543A (en) Training method of target object detection model, target object detection method and device
CN111046971A (en) Image recognition method, device, equipment and computer readable storage medium
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN111062347B (en) Traffic element segmentation method in automatic driving, electronic equipment and storage medium
CN110991412A (en) Face recognition method and device, storage medium and electronic equipment
CN111695397A (en) Ship identification method based on YOLO and electronic equipment
Mu et al. Finding autofocus region in low contrast surveillance images using CNN-based saliency algorithm
CN110210314B (en) Face detection method, device, computer equipment and storage medium
CN114663654B (en) Improved YOLOv4 network model and small target detection method
CN116468702A (en) Chloasma assessment method, device, electronic equipment and computer readable storage medium
CN112308061B (en) License plate character recognition method and device
CN113989632A (en) Bridge detection method and device for remote sensing image, electronic equipment and storage medium
CN110738225B (en) Image recognition method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 21896853
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 21896853
Country of ref document: EP
Kind code of ref document: A1