CN114581652A - Target object detection method and device, electronic equipment and storage medium - Google Patents

Target object detection method and device, electronic equipment and storage medium

Info

Publication number
CN114581652A
CN114581652A
Authority
CN
China
Prior art keywords
anchor frame
input image
model
training
anchor
Prior art date
Legal status
Pending
Application number
CN202011381081.6A
Other languages
Chinese (zh)
Inventor
韩健稳
陶永俊
Current Assignee
Navinfo Co Ltd
Original Assignee
Navinfo Co Ltd
Priority date
Filing date
Publication date
Application filed by Navinfo Co Ltd
Priority to CN202011381081.6A
Publication of CN114581652A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides a target object detection method and apparatus, an electronic device and a storage medium. A feature map of an input image is acquired, the input image containing a target object; the feature map of the input image is processed according to a trained anchor frame generation model to obtain an anchor frame of the input image, the anchor frame generation model predicting the size and position of the anchor frame; and the feature map of the input image and the anchor frame of the input image are processed according to a trained output model to obtain an image of the target object. Because optimal model parameters are obtained through training and the anchor frame is generated by the trained anchor frame generation model, the generated anchor frame better matches the distribution of targets in the input image, so the detection model converges quickly and the accuracy of the detection results is improved.

Description

Target object detection method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of target detection, in particular to a target object detection method and device, electronic equipment and a storage medium.
Background
Target detection determines, from an image, the image region that contains a target; the detection result is then used for target classification, i.e., determining the class of the target in the image.
Target detection typically samples a large number of regions in an input image and then determines whether these regions contain objects of interest. During region sampling, multiple bounding boxes of different sizes and aspect ratios are generated centered on each pixel; these bounding boxes are called anchor frames. In a target detection model, several anchor frames with fixed scales and aspect ratios are usually predefined; these anchor frames are then used as windows and, with a chosen step size, slid over the image to generate anchor frames covering the whole image. A detection frame for the target is output by optimizing a loss function between the anchor frames and the real bounding box of the target.
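For illustration, the sketch below implements the sliding-window anchor generation just described; the scales, aspect ratios and step size are hypothetical example values, not values taken from this application:

```python
import itertools

def sliding_window_anchors(img_w, img_h, scales=(32, 64, 128),
                           ratios=(0.5, 1.0, 2.0), stride=16):
    """Tile the whole image with anchor frames (x1, y1, x2, y2).

    At every stride-spaced anchor point one box is emitted per
    (scale, ratio) pair, so the anchor set is fixed in advance rather
    than adapted to the image content.
    """
    anchors = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for s, r in itertools.product(scales, ratios):
                w, h = s * r ** 0.5, s / r ** 0.5   # r is the w/h aspect ratio
                anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Even a modest 640x480 image yields thousands of anchors:
print(len(sliding_window_anchors(640, 480)))  # 40 * 30 * 9 = 10800
```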
However, generating anchor frames by the sliding-window method requires the anchor frames to be defined manually; if they are defined inappropriately, convergence of the network model is affected and the accuracy of the detection results is reduced.
Disclosure of Invention
The embodiment of the application provides a target object detection method and device, electronic equipment and a storage medium, so as to provide a target detection scheme with higher accuracy.
In a first aspect, the present application provides a method for detecting a target object, the method comprising:
acquiring a feature map of an input image, wherein the input image comprises a target object;
processing the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image, wherein the anchor frame generation model predicts the size and the position of the anchor frame;
and processing the feature map of the input image and the anchor frame of the input image according to the trained output model to obtain the image of the target object.
Optionally, processing the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image, specifically including:
determining the central positions of a plurality of anchor frames according to the feature map of the input image and the trained position prediction module;
determining the aspect ratios of a plurality of anchor frames according to the feature map of the input image and the trained aspect ratio prediction module;
generating a plurality of anchor frames of the input image according to the central positions of the anchor frames, the aspect ratios of the anchor frames and the trained combination module;
the anchor frame generation model comprises a position prediction module, an aspect ratio prediction module and a combination module.
Optionally, determining the center positions of the plurality of anchor frames according to the feature map of the input image and the trained position prediction module specifically includes:
performing convolution processing on the feature map of the input image according to the first convolution function to obtain a first intermediate result;
activating the first intermediate result according to a first activation function to obtain the central positions of a plurality of anchor frames;
wherein the position prediction module comprises a first convolution function and a first activation function.
Optionally, determining aspect ratios of the plurality of anchor frames according to the feature map of the input image and the trained aspect ratio prediction module specifically includes:
performing convolution processing on the feature map of the input image according to a second convolution function to obtain a second intermediate result;
activating the second intermediate result according to a second activation function to obtain the aspect ratios of the plurality of anchor frames;
wherein the aspect ratio prediction module includes a second convolution function and a second activation function.
Optionally, the method further comprises:
processing the training image according to the untrained feature extraction model to obtain a training feature map;
processing the training feature map according to the untrained anchor frame generation model to obtain a training anchor frame;
processing the training feature map and the training anchor frame according to the untrained output model to obtain a training target area;
obtaining a model loss value according to the training target area and the labeled target area;
judging whether the model loss value meets a preset convergence condition, if so, outputting a trained feature extraction model, a trained anchor frame generation model and a trained output model;
if not, updating the model parameters of the feature extraction model, the model parameters of the anchor frame generation model and the model parameters of the output model by using the model loss value, and returning to the step of processing the training image according to the untrained feature extraction model;
wherein the labeled target area is obtained by labeling the training target in the training image.
Optionally, obtaining the model loss value according to the training target area and the labeled target area specifically includes:
respectively calculating an anchor frame loss value, a classification loss value and a regression loss value according to the training target area and the labeled target area;
and calculating the model loss value according to the classification loss value, the regression loss value and the anchor frame loss value.
Optionally, calculating the anchor frame loss value according to the training target area and the labeled target area specifically includes:
calculating the position offset between the center point of the bounding box of the training target and the center point of the anchor frame corresponding to the training target area;
calculating the intersection-over-union between the bounding box of the training target and the anchor frame according to the parameters of the bounding box of the training target and the parameters of the anchor frame corresponding to the training target area;
and obtaining the anchor frame loss according to the position offset and the intersection-over-union, wherein the bounding box of the training target is generated from the labeled target area.
In a second aspect, the present application provides an apparatus for detecting a target object, comprising:
an acquisition module, configured to acquire a feature map of an input image, wherein the input image comprises a target object;
a processing module, configured to process the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image, wherein the anchor frame generation model predicts the size and the position of the anchor frame;
the processing module is further used for processing the feature map of the input image and the anchor frame of the input image according to the trained output model to obtain an image of the target object.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, wherein when the program is executed, the processor performs the method of detecting a target object according to the first aspect and its optional implementations.
In a fourth aspect, the present application provides a computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method for detecting a target object according to the first aspect and its optional implementations.
The embodiments of the present application provide a target object detection method and apparatus, an electronic device and a storage medium. An anchor frame generation model is constructed and its optimal model parameters are obtained through training, so the sliding step size and the anchor frame size do not need to be set manually, and the problems of low model recall and reduced precision caused by improper settings of parameters such as the sliding step size and the anchor frame size are avoided. Moreover, because the anchor frame generation model is trained, the anchor frames it generates lie closer to the region where the target is located, which improves the accuracy of the target detection results.
Drawings
FIG. 1 is a schematic diagram of the distribution of manually set anchor frames in the prior art;
fig. 2 is a schematic flowchart of a target object detection method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a target object detection method according to another embodiment of the present application;
FIG. 4 is a schematic diagram of an architecture of a detection model according to another embodiment of the present application;
FIG. 5 is an architectural diagram of an anchor frame generation model according to another embodiment of the present application;
FIG. 6 is a schematic diagram of a distribution of anchor frames generated using an anchor frame generation model according to another embodiment of the present application;
fig. 7 is a schematic structural diagram of a target object detection apparatus according to another embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Target detection determines, from an image, the image region that contains a target; the detection result is then used for target classification, i.e., determining the class of the target in the image. Target detection typically samples a large number of regions in an input image and then determines whether these regions contain objects of interest. During region sampling, multiple bounding boxes of different sizes and aspect ratios are generated centered on each pixel; these bounding boxes are called anchor frames.
In the prior art, as shown in fig. 1, several anchor frames with fixed scales and aspect ratios are usually predefined; these anchor frames are then used as windows and, with a set step size, slid over the image to generate anchor frames covering the whole image. A detection frame for the target is output by optimizing a loss function between the anchor frames and the real bounding box of the target.
However, the scale and aspect ratio of the anchor frames are defined manually when anchor frames are generated by the sliding-window method. If they are defined inappropriately, the anchor frames cannot be matched reasonably with the real bounding box of the target, which produces a large number of negative samples and an imbalance between positive and negative samples; this affects convergence of the network model and reduces the accuracy and recall of the detection results.
The embodiments of the present application provide a target object detection method and apparatus, an electronic device and a storage medium, aiming to provide a more accurate target object detection method. The inventive concept of the present application is as follows: an anchor frame generation model is constructed and its optimal model parameters are obtained through training, so the sliding step size and the anchor frame size need not be set manually and the problems of low model recall and reduced precision caused by improper settings of such parameters are avoided; anchor frames are then generated by the trained anchor frame generation model, and because the generated anchor frames better match the distribution of targets in the input image, the detection model converges quickly and the accuracy of the detection results is improved.
As shown in fig. 2, the present application provides a target object detection method, the execution subject of which is an electronic device, for example a computer device. The target object detection method comprises the following steps:
and S101, acquiring a characteristic diagram of the input image.
The input image is the image on which target detection is performed, and it includes an image of the target object. The aim of this scheme is to accurately extract the image of the target object from the input image.
After the input image is obtained, the input image is processed by using a convolution network to obtain a feature map of the input image.
S102, processing the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image.
The anchor frame generation model predicts the size and position of an anchor frame from the feature map of the input image, and then obtains the anchor frame of the input image from that size and position. By training the model parameters of the anchor frame generation model, the model can accurately predict the anchor frame of each input image. The anchor frame generation model outputs the optimal anchor frame of the input image through multiple cycles.
S103, processing the feature map of the input image and the anchor frame of the input image according to the trained output model to obtain the image of the target object.
The output model fuses the feature map of the input image with the anchor frame to obtain the image of the target object. The specific fusion process is as follows: the feature map of the input image is matched with the anchor frame of the input image according to the position information of the feature map and of the anchor frame, yielding a feature map of the input image associated with the anchor frame; the position and size of the image region of the target object are determined from the position and size of the anchor frame; and the image of the target object is determined from the anchor-frame-associated feature map and the position and size of the image region of the target object. Only one cycle of the fusion process is described here; the image of the target object is output after an optimal fusion result has been obtained through multiple cycles.
In the target object detection method provided by this embodiment, the feature map of the input image is processed by the trained anchor frame generation model to obtain the anchor frame of the input image. Compared with the prior-art approach, in which the center of a sliding window serves as an anchor point and several anchor frames of fixed, differing sizes are placed at each anchor point, this embodiment does not require the sliding step size or the anchor frame size to be set manually, so the problems of low model recall and reduced precision caused by improper settings of such parameters do not arise. Because the anchor frame generation model is trained, the anchor frames it generates lie closer to the region where the target is located, improving the accuracy of the target detection results.
The application further provides a target object detection method, the execution subject of which is an electronic device, for example a computer device. The target object detection method comprises the following steps:
S201, processing the input image according to the trained feature extraction model to obtain a feature map of the input image.
The feature extraction model comprises a convolution network, and parameters in the convolution network are trained so that the convolution network can extract a feature map of an input image from the input image.
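As a minimal illustration of such a feature extraction model, the following PyTorch sketch wraps a ResNet-50 backbone; the choice of backbone and of the layer used as the feature map are assumptions made for this example only, not mandated by this application:

```python
import torch
import torchvision

class FeatureExtractor(torch.nn.Module):
    """Convolutional backbone mapping an input image to a feature map."""

    def __init__(self):
        super().__init__()
        # weights="IMAGENET1K_V1" would load pretrained weights; None keeps
        # the sketch self-contained without a download.
        resnet = torchvision.models.resnet50(weights=None)
        # Keep conv1 .. layer3 and drop layer4, the pooling and the fc head.
        self.body = torch.nn.Sequential(*list(resnet.children())[:-3])

    def forward(self, x):        # x: (B, 3, H, W)
        return self.body(x)      # feature map: (B, 1024, H/16, W/16)

feat = FeatureExtractor()(torch.rand(1, 3, 480, 640))
print(feat.shape)                # torch.Size([1, 1024, 30, 40])
```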
S202, processing the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image.
The anchor frame generation model comprises a position prediction module, an aspect ratio prediction module and a combination module. And obtaining the trained position prediction module, the trained aspect ratio prediction module and the trained combination module by training parameters of the position prediction module, the aspect ratio prediction module and the combination module in the anchor frame generation model.
The center positions of the anchor boxes are determined according to the feature map of the input image and the trained position prediction module. The aspect ratios of the plurality of anchor boxes are determined from the feature map of the input image and the trained aspect ratio prediction module. And generating a plurality of anchor frames of the input image according to the central positions of the anchor frames, the aspect ratios of the anchor frames and the trained combination module.
More specifically, the location prediction module includes a first convolution function and a first activation function. The parameters of the location prediction module include parameters of a first convolution function and parameters of a first activation function. And determining the optimal parameters of the first convolution function and the optimal parameters of the first activation function through training, and further obtaining the trained first activation function and the trained first convolution function. And performing convolution processing on the feature map of the input image by using the trained first convolution function to obtain a first intermediate result, and performing activation processing on the first intermediate result by using the trained first activation function to obtain the central positions of the anchor frames.
More specifically, the aspect ratio prediction module includes a second convolution function and a second activation function. The parameters of the aspect ratio prediction module include parameters of a second convolution function and parameters of a second activation function. And determining the optimal parameters of the second convolution function and the optimal parameters of the second activation function through training, and further obtaining the trained second activation function and the trained second convolution function. And performing convolution processing on the feature map of the input image by using the trained second convolution function to obtain a second intermediate result. And performing activation processing on the second intermediate result by using the trained second activation function to obtain the aspect ratios of the plurality of anchor frames.
After the center positions and the aspect ratios of the anchor frames are determined, the combination module combines them to obtain the anchor frames of the input image.
Only the process of the anchor frame generation model is described above, and it should be noted that the anchor frame generation model outputs the optimal anchor frame of the input image through multiple cycles.
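The following PyTorch sketch shows one plausible shape of such an anchor frame generation model, with a position prediction branch (convolution plus sigmoid), a size prediction branch (convolution plus ReLU) and a simple combination step; the 1 × 1 kernels, the confidence threshold and the single-image handling are illustrative assumptions, not a definitive implementation:

```python
import torch
import torch.nn as nn

class AnchorGenerationModel(nn.Module):
    """Predicts anchor frame centers and sizes from a feature map."""

    def __init__(self, in_channels, num_classes=1):
        super().__init__()
        # Position prediction module: first convolution + first activation.
        self.loc_conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.loc_act = nn.Sigmoid()   # per-cell anchor-center confidence
        # Size prediction module: second convolution + second activation.
        self.wh_conv = nn.Conv2d(in_channels, 2, kernel_size=1)
        self.wh_act = nn.ReLU()       # non-negative width and height

    def forward(self, feat, conf_thresh=0.5):
        conf = self.loc_act(self.loc_conv(feat))   # (B, N, Hf, Wf)
        wh = self.wh_act(self.wh_conv(feat))       # (B, 2, Hf, Wf)
        # Combination module (single image, class 0, for brevity): keep
        # cells whose confidence exceeds the threshold and attach the
        # width/height predicted at each kept cell.
        _, _, hf, wf = conf.shape
        ys, xs = torch.meshgrid(torch.arange(hf, dtype=feat.dtype),
                                torch.arange(wf, dtype=feat.dtype),
                                indexing="ij")
        keep = conf[0, 0] > conf_thresh
        anchors = torch.stack([xs[keep], ys[keep],
                               wh[0, 0][keep], wh[0, 1][keep]], dim=-1)
        return conf, wh, anchors   # anchors: (K, 4) as (cx, cy, w, h) in cells

model = AnchorGenerationModel(in_channels=256)
conf, wh, anchors = model(torch.rand(1, 256, 30, 40))
```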
S203, processing the feature map of the input image and the anchor frame of the input image according to the trained output model to obtain the image of the target object.
After obtaining the anchor frame of the input image, inputting the feature map of the input image and the anchor frame of the input image into the output model, and fusing the feature of the input image and the anchor frame by the output model to obtain the image of the target object.
In the target object detection method provided by this embodiment, the position prediction module predicts the center position of the anchor frame, the aspect ratio prediction module predicts the size of the anchor frame, and the combination module combines the center position and the size to obtain the anchor frame of the input image. The generated anchor frame lies closer to the region where the target is located, which improves the accuracy of the target detection result; and compared with the prior-art approach of manually setting parameters such as the sliding step size or the anchor frame size, the problems of low model recall and reduced precision caused by improper manual settings do not arise.
As shown in fig. 3, the present application provides a target object detection method, the execution subject of which is an electronic device, for example a computer device. The target object detection method comprises the following steps:
S301, acquiring a training image and the target area of the training target in the training image.
The training images and the target areas of the training targets in the training images are used as training samples for training the detection model. The training image is used as input data of the detection model, and the target area of the training target is used as output data of the detection model.
S302, the training image is processed by an untrained feature extraction model, an untrained anchor frame generation model and an untrained output model to obtain a training target area.
As shown in fig. 4, the feature extraction model is configured to perform feature extraction on an input image to obtain a feature map of the input image, the anchor frame generation model is configured to predict the feature map of the input image to obtain an anchor frame of the input image, and the output model is configured to fuse the feature map of the input image and the anchor frame to obtain an image of the target object.
And inputting the training image into an untrained feature extraction model for feature extraction to obtain a training feature map. Inputting the training characteristic diagram into an untrained anchor frame generation model for prediction to obtain a training anchor frame. And inputting the training characteristic diagram and the training anchor frame into an untrained output model for fusion processing to obtain a training target area.
More specifically, the anchor frame generation model includes a position prediction module, an aspect ratio prediction module and a combination module. The prediction process of the training feature map in the anchor frame generation model is specifically as follows: the training feature map is input into the untrained position prediction module for position prediction, yielding the center positions of a plurality of training anchor frames; the training feature map is input into the untrained aspect ratio prediction module for aspect ratio prediction, yielding the aspect ratios of the plurality of training anchor frames; and the center positions and the aspect ratios of the training anchor frames are input into the untrained combination module for fusion, which outputs the training anchor frames.
As shown in fig. 5, the position prediction module includes a first convolution function and a first activation function, and the prediction process of the training feature map in the position prediction module specifically includes: and carrying out convolution processing on the training feature map according to the untrained first convolution function to obtain a first intermediate result. And performing activation processing on the first intermediate result according to the untrained first activation function to obtain the central positions of the training anchor frames.
The aspect ratio prediction module comprises a second convolution function and a second activation function, and the prediction process of the training feature map in the aspect ratio prediction module is specifically as follows: the training feature map is convolved according to the untrained second convolution function to obtain a second intermediate result, and the second intermediate result is activated according to the untrained second activation function to obtain the aspect ratios of the plurality of training anchor frames.
S303, respectively calculating the category loss, the regression loss and the anchor frame loss according to the training target area and the target area.
The category loss and the regression loss are calculated according to the training target region and the target region by using a method in the prior art, and are not described herein again.
The following describes the process of calculating and obtaining the anchor frame loss according to the training target area and the target area:
the anchor frame losses include anchor frame position losses and aspect ratio losses. And firstly, respectively calculating to obtain the position loss and the height-to-width ratio loss of the anchor frame, and then superposing the position loss and the height-to-width ratio loss of the anchor frame to obtain the loss of the anchor frame.
The anchor frame position loss is calculated as follows: the real bounding box of the training target is determined from the target area, the position offset between the center of the real bounding box and the center of the anchor frame is calculated, and the anchor frame position loss is obtained from that offset.
The anchor frame aspect ratio loss is calculated as follows: the intersection-over-union between the real bounding box and the anchor frame is calculated from the bounding box parameters of the real bounding box and the anchor frame parameters, and the anchor frame aspect ratio loss is obtained from that intersection-over-union.
Preferably, the position offset multiplied by a first loss factor is taken as the anchor frame position loss, and the intersection-over-union between the real bounding box and the anchor frame scaled by a second loss factor is taken as the anchor frame aspect ratio loss.
The first loss factor and the second loss factor are determined from specified values of the anchor frame position loss and the anchor frame aspect ratio loss, and serve to bring the anchor frame position loss and the anchor frame aspect ratio loss to the same order of magnitude.
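The sketch below illustrates one way such an anchor frame loss could be computed; the use of (1 - IoU) as the aspect ratio term and the default factor values are assumptions for illustration only:

```python
import torch

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = torch.maximum(a[0], b[0]), torch.maximum(a[1], b[1])
    x2, y2 = torch.minimum(a[2], b[2]), torch.minimum(a[3], b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def anchor_loss(anchor, gt_box, k1=1.0, k2=1.0):
    """Anchor loss = scaled center offset + scaled IoU-based size term.

    k1 and k2 play the role of the first/second loss factors that bring
    the two terms to the same order of magnitude (illustrative defaults).
    """
    ac = torch.stack([(anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2])
    gc = torch.stack([(gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2])
    pos_loss = k1 * torch.linalg.norm(ac - gc)     # position offset term
    wh_loss = k2 * (1.0 - iou(anchor, gt_box))     # aspect ratio / size term
    return pos_loss + wh_loss

a = torch.tensor([10.0, 10.0, 50.0, 50.0])   # anchor frame
g = torch.tensor([12.0, 12.0, 48.0, 52.0])   # real bounding box
print(anchor_loss(a, g))
```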
In addition, when setting the anchor frames, they can be defined on a single feature map, i.e., all anchor frames are matched with the real bounding boxes directly on the top-level feature map of the network, and the classification and regression losses are calculated there; alternatively, they can be defined on feature maps of multiple scales, i.e., the anchor frames and the real bounding boxes are matched on the multi-scale feature maps, or on multi-scale feature maps fused by a Feature Pyramid Network (FPN), and the classification and regression losses are calculated accordingly.
Smaller anchor frames are set on the shallower feature maps and larger anchor frames on the higher-level feature maps, thereby exploiting multi-scale feature information: high-level features carry richer semantic information while low-level features carry richer detail, which improves the detection of small targets and the accuracy of the network's target position regression.
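As a small illustration of this per-level assignment (the level names and base sizes below are hypothetical):

```python
# Hypothetical per-level assignment: shallow (high-resolution) feature
# layers get small base anchors, deeper layers get large ones.
fpn_levels = ["ef1", "ef2", "ef3", "ef4"]            # shallow -> deep
base_sizes = {lvl: 32 * 2 ** i for i, lvl in enumerate(fpn_levels)}
print(base_sizes)   # {'ef1': 32, 'ef2': 64, 'ef3': 128, 'ef4': 256}
```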
S304, judging whether the model loss value meets a preset convergence condition, if so, entering S305, otherwise, entering S306.
Specifically, it is judged whether the model loss value is smaller than a preset value; if so, the model loss value is determined to meet the convergence condition, and if not, it is determined not to meet the convergence condition.
S305, outputting the trained feature extraction model, the trained anchor frame generation model and the trained output model, and exiting the loop.
If the model loss value meets the convergence condition, it can be determined that the model parameters of the current feature extraction model, the model parameters of the anchor frame generation model and the model parameters of the output model are optimal, and then the models under the current parameters are used as the trained feature extraction model, the trained anchor frame generation model and the trained output model.
S306, using the category loss, the regression loss and the anchor frame loss to optimize the model parameters of the feature extraction model, the model parameters of the anchor frame generation model and the model parameters of the output model, then returning to S301 to continue execution.
If the model loss value does not meet the convergence condition, the model parameters of the feature extraction model, the anchor frame generation model and the output model need to be optimized: the category loss, the regression loss and the anchor frame loss are backpropagated to the feature extraction model, the anchor frame generation model and the output model to optimize their model parameters.
The model parameters in the anchor frame generation model specifically include parameters of the first convolution function, parameters of the first activation function, parameters of the second convolution function, parameters of the second activation function, and parameters in the combination module.
The steps of processing the input image and outputting the image of the target object by using the trained feature extraction model, the anchor frame generation model and the output model have been described in detail in the above embodiments, and are not described herein again.
In the target object detection method provided by this embodiment, the feature extraction model, the anchor frame generation model and the output model are trained simultaneously: the overall category loss, regression loss and anchor frame loss of the three models are calculated and used to optimize the parameters of the feature extraction model, the anchor frame generation model and the output model, so that the three models converge quickly and the accuracy of the output target object image is improved.
The following describes, with reference to specific examples, a method for detecting a target object provided in an embodiment of the present application, where the method for detecting a target object includes:
S401, obtaining a training image and the target area of the training target in the training image.
A street-view image is used as the training image, with persons and vehicles as the training targets. The positions of the persons and vehicles in the street-view image are labeled as label = (x, y, w, h, c), where x and y are the coordinate values of the target's center point, w and h are the width and height of the target's bounding box, and c is the target class identifier.
S402, processing the training image through an untrained feature extraction model, an untrained anchor frame generation model and an untrained output model to obtain a training target area.
The feature extraction model can be composed of a backbone network and an enhancement module. The training image is passed through the backbone network to obtain a number of original feature layers {f1, f2, …, fn}, where each original feature layer has dimensions W × H × C and a larger subscript indicates a higher layer. The backbone network may be a pre-trained network such as ResNet, MobileNet, ShuffleNet or VGG. Starting from the highest feature layer fn, each feature layer is enhanced in turn to obtain the enhanced feature layers {ef1, ef2, …, efn}, which also have dimensions W × H × C. The feature enhancement uses an FPN network and a face detection network (Single Stage Headless face detector, SSH for short).
In this embodiment, the backbone network is a ResNet, and the feature enhancement uses FPN and SSH networks: the FPN fuses multi-scale information and the SSH fuses context information, yielding the enhanced features {ef1, ef2, …, efn}.
Anchor frame generation modules are arranged on each of the enhanced feature layers {ef1, ef2, …, efn}; they predict the position and scale information of the anchor frames, and the position and size information is combined to obtain the anchor frames.
With continued reference to FIG. 5, the specific process of generating the anchor frame features is as follows: the enhanced feature layer efi is passed through a 1 × 1 convolution followed by a sigmoid function, producing a feature map of dimensions W × H × N. This feature map represents the position confidence of the anchor frames, i.e., the probability that each point is predicted to be an anchor frame center, where N denotes the target type (for example, N = 1 represents a vehicle and N = 2 represents a person). The enhanced feature layer efi is also passed through a 1 × 1 convolution followed by a ReLU function, producing a feature map of dimensions W × H × 2; this feature map represents the width and height of the anchor frames. The feature map representing the position confidence and the feature map representing the width and height are combined to obtain the anchor frame information feature map, which is then passed through a 1 × 1 × 3 × C convolution to change its number of channels to C.
The processing procedure of the output module is specifically as follows: the anchor frame information feature map is convolved and added to the enhanced feature layer efi, yielding a feature layer a_efi fused with the anchor frame information, from which the image of the target object is output.
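A minimal sketch of this fusion step follows, assuming the anchor confidence and width/height maps are concatenated, projected to C channels by a 1 × 1 convolution and added element-wise to the enhanced feature layer (channel counts are illustrative):

```python
import torch
import torch.nn as nn

class AnchorFusion(nn.Module):
    """Fuses the anchor information feature map into an enhanced feature layer."""

    def __init__(self, channels, num_classes=1):
        super().__init__()
        # 1x1 conv lifting the (N + 2)-channel anchor info map to C channels.
        self.proj = nn.Conv2d(num_classes + 2, channels, kernel_size=1)

    def forward(self, ef_i, conf, wh):
        anchor_info = torch.cat([conf, wh], dim=1)   # (B, N + 2, H, W)
        return ef_i + self.proj(anchor_info)         # a_ef_i: (B, C, H, W)

fusion = AnchorFusion(channels=256, num_classes=1)
ef = torch.rand(1, 256, 30, 40)
a_ef = fusion(ef, torch.rand(1, 1, 30, 40), torch.rand(1, 2, 30, 40))
print(a_ef.shape)   # torch.Size([1, 256, 30, 40])
```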
As shown in FIG. 6, the anchor frames generated with the anchor frame generation model lie closer to the targets, i.e., match the feature map output by the feature extraction model more closely, which benefits the convergence of the detection model.
S403, respectively calculating the category loss, the regression loss and the anchor frame loss according to the training target area and the target area.
The real bounding box of the training target is determined from the target area. The label label(obj) = (x, y, w, h, c) is mapped onto the feature map of size W × H to obtain label'(obj) = (x', y', w', h', c).
The anchor frame position loss La(conf) is calculated with the following sample assignment: anchor frame positions falling within (x', y', 0.3w', 0.3h') are set to 1, indicating positive samples, and the region outside (x', y', 0.7w', 0.7h') is set to 0, indicating negative samples.
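A sketch of this assignment rule follows, under the assumption that the two tuples denote regions centered at (x', y') with extents scaled by 0.3 and 0.7, and that cells between the two regions are ignored:

```python
import torch

def assign_samples(h, w, cx, cy, bw, bh):
    """Label each feature map cell: 1 = positive, 0 = negative, -1 = ignored."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    inner = ((xs - cx).abs() <= 0.3 * bw / 2) & ((ys - cy).abs() <= 0.3 * bh / 2)
    outer = ((xs - cx).abs() <= 0.7 * bw / 2) & ((ys - cy).abs() <= 0.7 * bh / 2)
    labels = torch.full((h, w), -1.0)   # ring between the two regions: ignored
    labels[~outer] = 0.0                # outside the 0.7-scaled region: negative
    labels[inner] = 1.0                 # inside the 0.3-scaled region: positive
    return labels

print(assign_samples(8, 8, cx=4.0, cy=4.0, bw=6.0, bh=6.0))
```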
The anchor frames are matched with the bounding boxes, and the width and height of the anchor frames are compared with those of the bounding boxes through the intersection-over-union loss La(wh).
Classification and regression predictions are made on the feature layers a_efi; the classification loss L(score) and the regression loss L(box) are calculated, the model is trained with the total loss L = La(wh) + La(conf) + L(score) + L(box), and the detection model parameters are updated by backpropagation until the loss values converge.
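A schematic training step built around this total loss might look as follows; the model and the four loss terms are dummy placeholders standing in for the components defined above, so only the loss composition and the backpropagation step carry over:

```python
import torch

# Dummy model and loss terms standing in for the detector and the loss
# functions described above; only the total-loss composition and the
# backpropagation step illustrate the training procedure.
detector = torch.nn.Conv2d(3, 4, kernel_size=1)
mse = lambda p, t: ((p - t) ** 2).mean()
anchor_wh_loss = anchor_conf_loss = cls_loss = box_loss = mse

optimizer = torch.optim.SGD(detector.parameters(), lr=1e-3, momentum=0.9)
images, targets = torch.rand(2, 3, 32, 32), torch.rand(2, 4, 32, 32)

for step in range(3):                          # toy loop over one batch
    preds = detector(images)
    loss = (anchor_wh_loss(preds, targets)     # La(wh)
            + anchor_conf_loss(preds, targets) # La(conf)
            + cls_loss(preds, targets)         # L(score)
            + box_loss(preds, targets))        # L(box)
    optimizer.zero_grad()
    loss.backward()                            # backpropagate the total loss
    optimizer.step()                           # update detection model params
```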
S404, judging whether the model loss value meets the preset convergence condition; if so, proceed to S405, otherwise proceed to S406.
S405, outputting the trained feature extraction model, the trained anchor frame generation model and the trained output model.
S406, using the category loss, the regression loss and the anchor frame loss to optimize the model parameters of the feature extraction model, the model parameters of the anchor frame generation model and the model parameters of the output model.
In the target object detection method provided by this embodiment, an anchor frame generation module is introduced on multiple feature layers, the position and size of the anchor frames are learned online by the network, and the learned anchor frame information is fused into the feature map output by the feature extraction model. This improves the match between the feature map output by the feature extraction model and the anchor frames and alleviates the positive/negative sample imbalance in the output model.
As shown in fig. 7, the present application provides a target object detection apparatus 500, including:
an obtaining module 501, configured to obtain a feature map of an input image, where the input image includes a target object;
a processing module 502, configured to process the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image, wherein the anchor frame generation model predicts the size and the position of the anchor frame;
the processing module 502 is further configured to process the feature map of the input image and the anchor box of the input image according to the trained output model to obtain an image of the target object.
Optionally, the processing module 502 is specifically configured to:
determining the central positions of a plurality of anchor frames according to the feature map of the input image and the trained position prediction module;
determining the aspect ratios of a plurality of anchor frames according to the feature map of the input image and the trained aspect ratio prediction module;
generating a plurality of anchor frames of the input image according to the central positions of the anchor frames, the aspect ratios of the anchor frames and the trained combination module;
the anchor frame generation model comprises a position prediction module, an aspect ratio prediction module and a combination module.
Optionally, the processing module 502 is specifically configured to:
performing convolution processing on the feature map of the input image according to the first convolution function to obtain a first intermediate result;
activating the first intermediate result according to a first activation function to obtain the central positions of a plurality of anchor frames;
wherein the position prediction module comprises a first convolution function and a first activation function.
Optionally, the processing module 502 is specifically configured to:
performing convolution processing on the feature map of the input image according to a second convolution function to obtain a second intermediate result;
activating the second intermediate result according to a second activation function to obtain the aspect ratios of the plurality of anchor frames;
wherein the aspect ratio prediction module includes a second convolution function and a second activation function.
Optionally, the processing module 502 is further configured to:
processing the training image according to the untrained feature extraction model to obtain a training feature map;
processing the training feature map according to the untrained anchor frame generation model to obtain a training anchor frame;
processing the training feature map and the training anchor frame according to the untrained output model to obtain a training target area;
obtaining a model loss value according to the training target area and the labeled target area;
judging whether the model loss value meets a preset convergence condition, if so, outputting model parameters of the feature extraction model, model parameters of the anchor frame generation model and model parameters of the output model;
if not, updating the model parameters of the feature extraction model, the model parameters of the anchor frame generation model and the model parameters of the output model by using the model loss value, and returning to the step of processing the training image according to the untrained feature extraction model;
wherein the labeled target area is obtained by labeling the training target in the training image.
Optionally, obtaining the model loss value according to the training target area and the labeled target area specifically includes:
respectively calculating an anchor frame loss value, a classification loss value and a regression loss value according to the training target area and the labeled target area;
and calculating the model loss value according to the classification loss value, the regression loss value and the anchor frame loss value.
Optionally, the processing module 502 is specifically configured to:
calculating the position offset between the center point of the bounding box of the training target and the center point of the anchor frame corresponding to the training target area;
calculating the intersection-over-union between the bounding box of the training target and the anchor frame according to the parameters of the bounding box of the training target and the parameters of the anchor frame corresponding to the training target area;
and obtaining the anchor frame loss according to the position offset and the intersection-over-union, wherein the bounding box of the training target is generated from the labeled target area.
As shown in fig. 8, an electronic device 600 provided in another embodiment of the present application includes: a transmitter 601, a receiver 602, a memory 603, and a processor 604.
A transmitter 601 for transmitting instructions and data;
a receiver 602 for receiving instructions and data;
a memory 603 for storing computer-executable instructions;
a processor 604 for executing the computer-executable instructions stored in the memory to implement the steps of the target object detection method in the above embodiments; reference may be made to the related description in the foregoing method embodiments.
Alternatively, the memory 603 may be separate or integrated with the processor 604. When the memory 603 is separately provided, the electronic device further includes a bus for connecting the memory 603 and the processor 604.
An embodiment of the present application further provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the target object detection method described above is implemented.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the method embodiments described above. The aforementioned storage media include various media that can store program code, such as ROM, RAM, magnetic disks and optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of detecting a target object, comprising:
acquiring a feature map of an input image, wherein the input image comprises the target object;
processing the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image, wherein the size and the position of the anchor frame of the input image are predicted through the anchor frame generation model;
and processing the feature map of the input image and the anchor frame of the input image according to the trained output model to obtain an image of the target object.
2. The method according to claim 1, wherein processing the feature map of the input image according to the trained anchor frame generation model to obtain the anchor frame of the input image specifically includes:
determining the central positions of a plurality of anchor frames according to the feature map of the input image and the trained position prediction module;
determining the aspect ratios of the anchor frames according to the feature map of the input image and the trained aspect ratio prediction module;
generating a plurality of anchor frames of the input image according to the central positions of the anchor frames, the aspect ratios of the anchor frames and the trained combination module;
wherein the anchor frame generation model comprises the position prediction module, the aspect ratio prediction module, and the combination module.
3. The method according to claim 2, wherein determining the center positions of the plurality of anchor frames according to the feature map of the input image and the trained position prediction module specifically comprises:
performing convolution processing on the feature map of the input image according to a first convolution function to obtain a first intermediate result;
performing activation processing on the first intermediate result according to a first activation function to obtain the central positions of the anchor frames;
wherein the location prediction module comprises the first convolution function and the first activation function.
4. The method according to claim 2 or 3, wherein determining the aspect ratios of the plurality of anchor frames according to the feature map of the input image and the trained aspect ratio prediction module specifically comprises:
performing convolution processing on the feature map of the input image according to a second convolution function to obtain a second intermediate result;
activating the second intermediate result according to a second activation function to obtain the aspect ratios of the anchor frames;
wherein the aspect ratio prediction module comprises the second convolution function and the second activation function.
5. The method according to any one of claims 1 to 3, further comprising:
processing the training image according to an untrained feature extraction model to obtain a training feature map;
processing the training feature map according to an untrained anchor frame generation model to obtain a training anchor frame;
processing the training feature map and the training anchor frame according to an untrained output model to obtain a training target area;
obtaining a model loss value according to the training target area and a labeled target area;
judging whether the model loss value meets a preset convergence condition, if so, outputting a trained feature extraction model, a trained anchor frame generation model and a trained output model;
if not, updating the model parameters of the feature extraction model, the model parameters of the anchor frame generation model and the model parameters of the output model by using the model loss value, and returning to the step of processing the training image according to the untrained feature extraction model;
wherein the labeled target area is obtained by labeling a training target in the training image.
6. The method according to claim 5, wherein obtaining a model loss value according to the training target area and the labeled target area specifically comprises:
respectively calculating an anchor frame loss value, a classification loss value and a regression loss value according to the training target area and the labeled target area;
and calculating the model loss value according to the classification loss value, the regression loss value and the anchor frame loss value.
7. The method according to claim 6, wherein calculating the anchor frame loss value according to the training target area and the labeled target area specifically comprises:
calculating the position offset between the center point of the bounding box of the training target and the center point of the anchor frame corresponding to the training target area;
calculating the intersection-over-union between the bounding box of the training target and the anchor frame according to the parameters of the bounding box of the training target and the parameters of the anchor frame corresponding to the training target area;
and obtaining the anchor frame loss according to the position offset and the intersection-over-union, wherein the bounding box of the training target is generated from the labeled target area.
8. An apparatus for detecting a target object, comprising:
the acquisition module is used for acquiring a feature map of an input image, wherein the input image comprises the target object;
the processing module is used for processing the feature map of the input image according to the trained anchor frame generation model to obtain an anchor frame of the input image, wherein the size and the position of the anchor frame of the input image are predicted through the anchor frame generation model;
the processing module is further used for processing the feature map of the input image and the anchor frame of the input image according to the trained output model to obtain an image of the target object.
9. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, the processor being configured to perform the method of detecting a target object according to any one of claims 1 to 7 when the program is executed.
10. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, are configured to implement the method of detecting a target object according to any one of claims 1 to 7.
CN202011381081.6A 2020-12-01 2020-12-01 Target object detection method and device, electronic equipment and storage medium Pending CN114581652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011381081.6A CN114581652A (en) 2020-12-01 2020-12-01 Target object detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114581652A (en) 2022-06-03

Family

ID=81767122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011381081.6A Pending CN114581652A (en) 2020-12-01 2020-12-01 Target object detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114581652A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295678A (en) * 2016-07-27 2017-01-04 北京旷视科技有限公司 Neural metwork training and construction method and device and object detection method and device
US20200143205A1 (en) * 2017-08-10 2020-05-07 Intel Corporation Convolutional neural network framework using reverse connections and objectness priors for object detection
CN108038474A (en) * 2017-12-28 2018-05-15 深圳云天励飞技术有限公司 Method for detecting human face, the training method of convolutional neural networks parameter, device and medium
CN109241913A (en) * 2018-09-10 2019-01-18 武汉大学 In conjunction with the ship detection method and system of conspicuousness detection and deep learning
CN109117831A (en) * 2018-09-30 2019-01-01 北京字节跳动网络技术有限公司 The training method and device of object detection network
CN111950329A (en) * 2019-05-16 2020-11-17 长沙智能驾驶研究院有限公司 Target detection and model training method and device, computer equipment and storage medium
CN110751185A (en) * 2019-09-26 2020-02-04 高新兴科技集团股份有限公司 Training method and device of target detection model
CN111695609A (en) * 2020-05-26 2020-09-22 平安科技(深圳)有限公司 Target damage degree determination method, target damage degree determination device, electronic device, and storage medium
CN111753812A (en) * 2020-07-30 2020-10-09 上海眼控科技股份有限公司 Text recognition method and equipment
CN111738262A (en) * 2020-08-21 2020-10-02 北京易真学思教育科技有限公司 Target detection model training method, target detection model training device, target detection model detection device, target detection equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012217A1 (en) * 2022-07-15 2024-01-18 马上消费金融股份有限公司 Model training method and device, and target detection method and device
CN117095011A (en) * 2023-10-20 2023-11-21 南通华隆微电子股份有限公司 Diode detection method and system
CN117095011B (en) * 2023-10-20 2024-01-23 南通华隆微电子股份有限公司 Diode detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination