WO2022000855A1 - Target detection method and device - Google Patents
- Publication number
- WO2022000855A1 (PCT/CN2020/121337)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample
- target
- image
- frame
- information
Classifications
- G06F18/00 — Pattern recognition (PHYSICS; COMPUTING; ELECTRIC DIGITAL DATA PROCESSING)
- G06F18/24 — Pattern recognition; Analysing; Classification techniques
- G06N3/02 — Computing arrangements based on biological models; Neural networks
- G06N3/04 — Neural networks; Architecture, e.g. interconnection topology
- G06N3/045 — Neural networks; Combinations of networks
- G06N3/08 — Neural networks; Learning methods
Definitions
- the present invention relates to the technical field of target detection, and in particular, to a target detection method and device.
- in the related art, a target detection model is used to detect the image to be detected, and the obtained detection result generally includes the position information of the target detection frame corresponding to the detection target and the corresponding category probability information.
- during target detection, the model selects the target detection frame position information corresponding to the final output target from the position information of the multiple candidate detection frames predicted for the image to be detected, and uses the category probability information corresponding to the position information of each candidate detection frame for screening.
- the target detection frame position information corresponding to the final output target is thus obtained, wherein the category probability information is a confidence level indicating that the corresponding target belongs to a certain category.
- the invention provides a target detection method and device, so as to determine the accuracy of the detection frame corresponding to the target in the image, and then obtain a detection frame corresponding to the target with better accuracy.
- the specific technical solutions are as follows:
- an embodiment of the present invention provides a target detection method, the method comprising:
- a target detection result corresponding to the to-be-detected image is determined by using a pre-established target detection model and the to-be-detected image, wherein the target detection result includes: target detection frame position information corresponding to the detected target in the to-be-detected image and target frame quality information corresponding to the target detection frame position information; the pre-established target detection model is: a model obtained by training based on the sample image and its corresponding calibration information and the corresponding sample frame quality information; and the sample frame quality information corresponding to the sample image is: information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- the sample frame quality information corresponding to the sample image is: the ratio information of the intersection area and the union area between the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- the target detection result further includes: detection category information corresponding to the detection target in the to-be-detected image.
- the method further includes:
- the initial target detection model includes a feature extraction layer, a feature classification layer and a feature regression layer;
- the calibration information includes: calibration frame position information and calibration category information corresponding to the sample targets contained in the corresponding sample images;
- for each sample image, input the sample image into the feature extraction layer, and extract the sample image feature corresponding to the sample image;
- the sample image feature corresponding to the sample image and the prediction frame position information corresponding to the sample target in the sample image are input into the feature classification layer, and the prediction category information and prediction frame quality information corresponding to the sample target in the sample image are determined;
- the expression of the preset positioning quality focusing loss function is:
- LFL(i) = -|p_i - q_i|^γ · ((1 - p_i)·log(1 - q_i) + p_i·log(q_i))
- wherein LFL(i) denotes the loss value between the predicted frame quality information and the real frame quality information corresponding to the i-th sample target in the sample image;
- p_i represents the real frame quality information corresponding to the i-th sample target in the sample image;
- q_i represents the predicted frame quality information corresponding to the i-th sample target in the sample image;
- γ represents the preset parameter.
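As a sketch only, the positioning quality focusing loss for one sample target could be computed as below; the modulating factor |p - q|**gamma is an assumption consistent with the preset parameter γ defined above and with common quality focal loss formulations, and the function name and epsilon guard are illustrative:

```python
import math

def positioning_quality_focal_loss(p, q, gamma=2.0, eps=1e-12):
    """Loss between real frame quality p and predicted frame quality q
    for one sample target (both assumed to lie in [0, 1]).

    The abs(p - q) ** gamma modulating factor is an assumption based on
    standard quality focal loss formulations; gamma is the preset
    parameter from the text.
    """
    # Cross-entropy between the soft quality label p and prediction q.
    ce = -((1.0 - p) * math.log(1.0 - q + eps) + p * math.log(q + eps))
    # Down-weight sample targets whose prediction already matches p.
    return abs(p - q) ** gamma * ce
```

When the predicted quality equals the real quality, the modulating factor drives the loss to zero, which focuses training on sample targets whose frame quality is still badly predicted.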
- the sample frame quality information and sample category information corresponding to the sample image exist in the form of preset soft one-hot encoding, and the position of the sample frame quality information corresponding to the sample image in the preset soft one-hot encoding indicates the sample category information corresponding to the sample image.
- the step of using a pre-established target detection model and the to-be-detected image to determine the target detection result corresponding to the to-be-detected image includes:
- for each detection target in the to-be-detected image, based on a preset suppression algorithm and the target frame quality information corresponding to the position information of each candidate frame corresponding to the detection target, the position information of the candidate frame that satisfies a preset screening condition is determined from the position information of all candidate frames corresponding to the detection target and used as the target detection frame position information corresponding to the detection target, so as to obtain the target detection result corresponding to the image to be detected, wherein the preset screening condition is: the condition that the corresponding target frame quality information among the position information of the candidate frames corresponding to the detection target is the largest.
- an embodiment of the present invention provides a target detection device, the device comprising:
- an obtaining module configured to obtain an image to be detected
- a determination module configured to use a pre-established target detection model and the to-be-detected image to determine a target detection result corresponding to the to-be-detected image, wherein the target detection result includes: the target detection frame position information corresponding to the detected target in the to-be-detected image and the target frame quality information corresponding to the target detection frame position information; the pre-established target detection model is: a model obtained by training based on the sample image and its corresponding calibration information and the corresponding sample frame quality information; and the sample frame quality information corresponding to the sample image is: information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- the sample frame quality information corresponding to the sample image is: the ratio information of the intersection area and the union area between the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- the target detection result further includes: detection category information corresponding to the detection target in the to-be-detected image.
- the device further includes:
- a model training module configured to train and obtain the pre-established target detection model used to detect, from the to-be-detected image, the target detection frame position information and the target frame quality information corresponding to the target to be detected, wherein the model training module is specifically configured to obtain the initial target detection model, the initial target detection model including a feature extraction layer, a feature classification layer and a feature regression layer;
- the calibration information includes: calibration frame position information and calibration category information corresponding to the sample targets contained in the corresponding sample images;
- for each sample image, input the sample image into the feature extraction layer, and extract the sample image feature corresponding to the sample image;
- the sample image feature corresponding to the sample image and the prediction frame position information corresponding to the sample target in the sample image are input into the feature classification layer, and the prediction category information and prediction frame quality information corresponding to the sample target in the sample image are determined;
- the expression of the preset positioning quality focusing loss function is:
- LFL(i) = -|p_i - q_i|^γ · ((1 - p_i)·log(1 - q_i) + p_i·log(q_i))
- wherein LFL(i) denotes the loss value between the predicted frame quality information and the real frame quality information corresponding to the i-th sample target in the sample image;
- p_i represents the real frame quality information corresponding to the i-th sample target in the sample image;
- q_i represents the predicted frame quality information corresponding to the i-th sample target in the sample image;
- γ represents the preset parameter.
- the sample frame quality information and sample category information corresponding to the sample image exist in the form of preset soft one-hot encoding, and the position of the sample frame quality information corresponding to the sample image in the preset soft one-hot encoding indicates the sample category information corresponding to the sample image.
- the determining module is specifically configured to input the to-be-detected image into a feature extraction layer of a pre-established target detection model, and extract the to-be-detected image feature corresponding to the to-be-detected image;
- for each detection target in the to-be-detected image, based on a preset suppression algorithm and the target frame quality information corresponding to the position information of each candidate frame corresponding to the detection target, the position information of the candidate frame that satisfies a preset screening condition is determined from the position information of all candidate frames corresponding to the detection target and used as the target detection frame position information corresponding to the detection target, so as to obtain the target detection result corresponding to the image to be detected, wherein the preset screening condition is: the condition that the corresponding target frame quality information among the position information of the candidate frames corresponding to the detection target is the largest.
- a target detection method and device provided in the embodiments of the present invention obtain an image to be detected, and determine a target detection result corresponding to the to-be-detected image by using a pre-established target detection model and the to-be-detected image, wherein the target detection result includes: the target detection frame position information corresponding to the detection target in the image to be detected and the target frame quality information corresponding to the target detection frame position information.
- the pre-established target detection model is: a model obtained by training based on the sample image and its corresponding calibration information and the corresponding sample frame quality information; the sample frame quality information corresponding to the sample image is: information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- the pre-established target detection model obtained by training based on the sample image and its corresponding calibration information and the corresponding sample frame quality information has the function of predicting the quality corresponding to the target detection frame corresponding to the detection target in the image, and the sample frame quality information is the information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- through the pre-established target detection model, the frame quality information corresponding to the frame position information predicted for the target in the image can be used to screen out the frame position information with better frame quality information corresponding to the detection target as the target detection frame position information, so that the accuracy of the detection frame corresponding to the target in the image is determined, and a detection frame corresponding to the target with better accuracy is obtained.
- the ratio information of the intersection area and the union area is used as the quality information of the sample frame corresponding to the sample image, so that the pre-established target detection model can learn a prediction function for the frame quality information corresponding to frame position information that is more in line with the actual frame quality, and a basis is provided for screening frame position information based on frame quality information.
- the sample frame quality information and sample category information corresponding to the sample image exist in the form of preset soft one-hot encoding, and the position of the sample frame quality information corresponding to the sample image in the preset soft one-hot encoding indicates the sample category information corresponding to the sample image, which realizes the joint representation of category information and frame quality information.
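A minimal sketch of such a joint label, assuming a plain list representation (the function name and argument order are illustrative): the index of the single non-zero entry indicates the sample category, and the value stored at that index is the sample frame quality.

```python
def soft_one_hot_label(category_index, frame_quality, num_classes):
    """Joint representation of category and frame quality: the position
    of the non-zero entry encodes the sample category information, and
    its value (in [0, 1]) encodes the sample frame quality information."""
    label = [0.0] * num_classes
    label[category_index] = frame_quality
    return label
```

Unlike a hard one-hot label, the "hot" entry is the continuous frame quality rather than 1.0, so a single vector supervises both the classification output and the quality prediction.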
- specifically, the position information of the candidate frames corresponding to the image to be detected is determined through the feature extraction layer and the feature regression layer of the pre-established target detection model; then, combined with the feature classification layer of the pre-established target detection model, the detection category information and target frame quality information corresponding to each candidate frame position information corresponding to each detection target in the image to be detected are determined;
- based on the target frame quality information corresponding to the position information of each candidate frame, the position information of the candidate frame that satisfies the preset screening conditions is determined from the position information of all the candidate frames corresponding to the detection target, as the target detection frame position information corresponding to the detection target, in order to obtain the target detection result corresponding to the to-be-detected image.
- the selection and determination of the corresponding candidate frame position information is completed, so as to obtain frame position information with better detection position accuracy.
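The screening described above can be sketched as a greedy suppression pass, assuming the preset suppression algorithm is NMS-like and frames are given as [x1, y1, x2, y2] lists; the IoU helper, threshold value and function names are illustrative:

```python
def _iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def quality_guided_suppression(boxes, qualities, iou_threshold=0.5):
    """Keep, among overlapping candidate frames, the one whose target
    frame quality information is largest (the preset screening
    condition); return the indices of the kept frames."""
    order = sorted(range(len(boxes)), key=lambda i: qualities[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        # Suppress remaining candidates that overlap the kept frame.
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept
```

The design point is that candidates are ranked by predicted frame quality rather than by class probability, which is what lets the screening prefer the better-localized frame.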
- FIG. 1 is a schematic flowchart of a target detection method provided by an embodiment of the present invention.
- FIG. 2 is a schematic flowchart of the process of training to obtain a pre-established target detection model;
- FIG. 3 is a schematic diagram of category information and frame quality information jointly represented
- FIG. 4 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present invention.
- the invention provides a target detection method and device, so as to determine the accuracy of the detection frame corresponding to the target in the image, and then obtain a detection frame corresponding to the target with better accuracy.
- the embodiments of the present invention will be described in detail below.
- FIG. 1 is a schematic flowchart of a target detection method provided by an embodiment of the present invention. The method may include the following steps:
- the target detection method provided by the embodiment of the present invention can be applied to any electronic device with computing capability, and the electronic device can be a terminal or a server.
- the electronic device may be an in-vehicle device installed on a vehicle; the vehicle may also be provided with an image acquisition device that collects images of the environment in which the vehicle is located, and the electronic device is connected to the image acquisition device and can obtain the image collected by the image acquisition device as the image to be detected.
- the electronic device may be a non-vehicle device, and the electronic device may be connected to an image acquisition device that captures the target scene to obtain an image captured by the image capture device for the target scene as an image to be detected.
- the target scene can be a road scene or a square scene or an indoor scene, which is all possible.
- the to-be-detected image may be an RGB (Red Green Blue) image or an infrared image, which are all possible.
- the embodiment of the present invention does not limit the type of the image to be detected.
- S102 Determine a target detection result corresponding to the to-be-detected image by using the pre-established target detection model and the to-be-detected image.
- the target detection result includes: target detection frame position information corresponding to the detection target in the to-be-detected image and target frame quality information corresponding to the target detection frame position information; the pre-established target detection model is: a model obtained by training based on the sample image and its corresponding calibration information and the corresponding sample frame quality information; and the sample frame quality information corresponding to the sample image is: information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- the electronic device or its connected storage device locally stores a pre-established target detection model trained based on the sample image and its corresponding calibration information and the corresponding sample frame quality information, wherein, in the training process, the pre-established target detection model uses the preset positioning quality focusing loss function to adjust the corresponding model parameters.
- the sample frame quality information corresponding to the sample image is: information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- the pre-established target detection model trained based on the sample images and their corresponding calibration information and the corresponding sample frame quality information has the ability to predict the quality corresponding to each detection frame, that is, the frame quality information corresponding to the position information of each detection frame.
- the frame quality information can represent the accuracy of the corresponding detection frame position information obtained by detection.
- the frame quality information corresponding to the detection frame position information can be represented by a numerical value; the larger the numerical value, the more consistent the location area characterized by the detection frame position information is with the location area where the target is located. The training process of the pre-established target detection model will be described later.
- in detail, the electronic device inputs the image to be detected into the pre-established target detection model, uses the pre-established target detection model to extract the image features of the image to be detected, and obtains the to-be-detected image features; the pre-established target detection model regresses the to-be-detected image features, so that multiple candidate detection frames and their position information are obtained from the image to be detected.
- the pre-established target detection model then predicts, from the position information of the multiple candidate detection frames and the to-be-detected image features, the frame quality information corresponding to the position information of each candidate detection frame, and uses the frame quality information to screen the position information of the multiple candidate detection frames, screening out, for each detection target, the candidate detection frame position information whose frame quality information represents the better candidate detection frame.
- the target detection result is thereby obtained, including the target detection frame position information corresponding to the detection target in the to-be-detected image and the target frame quality information corresponding to the target detection frame position information.
- since the pre-established target detection model obtained by training based on the sample image and its corresponding calibration information and the corresponding sample frame quality information has the function of predicting the quality corresponding to the target detection frame corresponding to the detection target in the image, and the sample frame quality information is the information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model, the frame position information with better frame quality information corresponding to the detection target can be screened out as the target detection frame position information, so that the accuracy of the detection frame corresponding to the target in the image is determined, and a detection frame corresponding to the target with better accuracy is obtained.
- the sample frame quality information corresponding to the sample image is: the ratio information of the intersection area and the union area between the calibration frame position information in the calibration information corresponding to the sample image and the predicted frame position information corresponding to the sample image detected based on the initial target detection model corresponding to the pre-established target detection model.
- that is, the ratio information of the intersection area between the predicted frame position information corresponding to the sample image, detected based on the initial target detection model corresponding to the pre-established target detection model, and the calibration frame position information in the calibration information corresponding to the sample image, to the union area of the above two, is determined as the quality information of the sample frame corresponding to the sample image.
- the quality information of the sample frame corresponding to the sample image can be represented by a numerical value, and the value range of the numerical value can be [0, 1].
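That ratio is the standard intersection-over-union; a minimal sketch, assuming frames are given as [x1, y1, x2, y2] lists (the format and function name are assumptions):

```python
def sample_frame_quality(calibration_box, predicted_box):
    """IoU between the calibration frame and the predicted frame,
    used as the sample frame quality information; the result always
    lies in the [0, 1] range mentioned in the text."""
    ix1 = max(calibration_box[0], predicted_box[0])
    iy1 = max(calibration_box[1], predicted_box[1])
    ix2 = min(calibration_box[2], predicted_box[2])
    iy2 = min(calibration_box[3], predicted_box[3])
    # Intersection area (zero when the frames do not overlap).
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_c = (calibration_box[2] - calibration_box[0]) * (calibration_box[3] - calibration_box[1])
    area_p = (predicted_box[2] - predicted_box[0]) * (predicted_box[3] - predicted_box[1])
    # Union area = sum of the two areas minus the intersection.
    union = area_c + area_p - inter
    return inter / union if union > 0 else 0.0
```

Identical frames give a quality of 1.0 and disjoint frames give 0.0, matching the value range stated above.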
- the target detection result may further include: detection category information corresponding to the detection target in the image to be detected.
- the calibration information corresponding to the sample image may also include calibration category information, so that the pre-established target detection model obtained by training has the ability to predict the category of the target in the image.
- the method may further include:
- the initial target detection model includes a feature extraction layer, a feature classification layer and a feature regression layer;
- the calibration information includes: calibration frame position information and calibration category information corresponding to the sample target contained in the corresponding sample image.
- S206 For each sample image, input the sample image feature corresponding to the sample image and the prediction frame position information corresponding to the sample target in the sample image into the feature classification layer, and determine the prediction category information and prediction frame quality information corresponding to the sample target in the sample image.
- S207 For each sample image, determine the current loss value based on the preset positioning quality focusing loss function and the predicted frame quality information and real frame quality information corresponding to the sample targets in the sample image, as well as the preset category loss function and the prediction category information and calibration category information corresponding to the sample targets in the sample image.
- the electronic device may further include a process of training to obtain a pre-established target detection model.
- the electronic device obtains a plurality of sample images and their corresponding calibration information
- the sample images may include sample objects
- the calibration information corresponding to the sample images may include calibration frame position information corresponding to the sample objects in the sample image.
- the electronic device obtains an initial target detection model including a feature extraction layer, a feature regression layer and a feature classification layer; for each sample image, the sample image is input into the feature extraction layer, and the sample image features corresponding to the sample image are extracted; the sample image features corresponding to the sample image are input into the feature regression layer to obtain the prediction frame position information corresponding to the sample targets in the sample image; then, for each sample target in each sample image, the ratio information of the intersection area and the union area between the calibration frame position information corresponding to the sample target and the corresponding prediction frame position information is calculated and determined as the real frame quality information corresponding to the sample target; and for each sample image, the sample image features corresponding to the sample image and the prediction frame position information corresponding to the sample targets in the sample image are input into the feature classification layer to determine the prediction category information and prediction frame quality information corresponding to the sample targets in the sample image, with the real frame quality information corresponding to the sample targets serving as a kind of calibration information.
- for each sample image, the current first loss value is determined based on the preset positioning quality focusing loss function and the predicted frame quality information and real frame quality information corresponding to the sample targets in the sample image; the current second loss value is determined based on the preset category loss function and the prediction category information and calibration category information corresponding to the sample targets in the sample image; and the current loss value is then determined based on the current first loss value and the current second loss value.
- the preset optimization algorithm is used to adjust the model parameters of the feature extraction layer, feature regression layer and feature classification layer, and return to execute the For each sample image, the sample image is input into the feature extraction layer, and the sample image features corresponding to the sample image are extracted and obtained. If it is judged that the current loss value does not exceed the preset loss value threshold, it is determined that the initial target detection model has reached a convergence state, and an accurate image of the location area, category information and location information representing the location area of the detected target in the image can be detected. A pre-built object detection model with characteristic box quality information.
- the frame quality loss value corresponding to each sample target may be determined based on the preset localization quality focal loss function and the predicted frame quality information and real frame quality information corresponding to each sample target in the sample image, and the sum or average of the frame quality loss values corresponding to all sample targets in the sample image is determined as the current first loss value; the category loss value corresponding to each sample target is determined based on the preset category loss function and the predicted category information and calibration category information corresponding to each sample target in the sample image, and the sum or average of the category loss values corresponding to all sample targets in the sample image is determined as the current second loss value; the current loss value is then determined as the sum of the product of the current first loss value and its corresponding weight value and the product of the current second loss value and its corresponding weight value.
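The weighted aggregation described above can be sketched in a few lines of Python. This is a non-authoritative illustration: the function name, the equal default weights, and the `reduce` parameter are assumptions for the sketch, not terms from the patent.

```python
def combine_losses(frame_quality_losses, category_losses,
                   w_quality=1.0, w_category=1.0, reduce="mean"):
    """Aggregate per-target losses for one sample image into the current loss.

    The current first loss is the sum or average of the frame quality loss
    values; the current second loss is the sum or average of the category
    loss values; the current loss is their weighted sum.
    """
    agg = (lambda xs: sum(xs) / len(xs)) if reduce == "mean" else sum
    first_loss = agg(frame_quality_losses)    # frame quality branch
    second_loss = agg(category_losses)        # category branch
    return w_quality * first_loss + w_category * second_loss
```

For example, with per-target frame quality losses `[1.0, 3.0]` and category losses `[2.0, 4.0]`, averaging gives a first loss of 2.0 and a second loss of 3.0, so the current loss with unit weights is 5.0.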
- the preset optimization algorithm may include, but is not limited to, the gradient descent method.
- the sample targets may be vehicles, pedestrians, and traffic signs.
- the initial target detection model may be a deep learning-based neural network model.
- the preset category loss function may be any type of loss function in the related art that can calculate a loss value between category information, which is not limited in the embodiment of the present invention.
- the above-mentioned current loss value may also be determined in combination with the preset position loss function, the position information of the prediction frame corresponding to the sample target in the sample image, and the position information of the calibration frame.
- the preset position loss function may be any type of loss function in the related art that can calculate a loss value between frame position information, which is not limited in the embodiment of the present invention.
- the expression of the preset localization quality focal loss function (LFL, Localization Focal Loss) may be:
- LFL(i) = -((1 - p_i)log(1 - q_i) + p_i·log(q_i)) · |p_i - q_i|^γ
- where LFL(i) represents the first loss value between the predicted frame quality information and the real frame quality information corresponding to the i-th sample target in the sample image, p_i represents the real frame quality information corresponding to the i-th sample target in the sample image, q_i represents the predicted frame quality information corresponding to the i-th sample target in the sample image, and γ represents a preset parameter.
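The loss expression in claim 5 can be written out directly as a scalar function; the sketch below is an illustration (the function name `lfl` is mine), using the same symbols as the formula:

```python
import math

def lfl(p, q, gamma=2.0):
    """Localization quality focal loss for one sample target.

    p: real frame quality information (IoU of calibration and prediction
       frames), q: predicted frame quality information, both in (0, 1).
    Computes -((1-p)log(1-q) + p*log(q)) * |p - q|**gamma: a binary
    cross-entropy against the soft target p, scaled by a focusing factor
    that down-weights targets whose prediction is already close to p.
    """
    bce = -((1.0 - p) * math.log(1.0 - q) + p * math.log(q))
    return bce * abs(p - q) ** gamma
```

Note that when the predicted quality equals the real quality (q = p) the focusing factor is zero, so the loss vanishes; the further q drifts from p, the more the target contributes to training.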
- the electronic device may also use a batch of sample images to calculate the current loss value; that is, the current loss value may be determined using the preset localization quality focal loss function together with the predicted frame quality information and real frame quality information corresponding to the sample targets in multiple sample images, and the preset category loss function together with the predicted category information and calibration category information corresponding to the sample targets in those sample images.
- the category information and the frame quality information can be jointly represented.
- the sample frame quality information and the sample category information corresponding to the sample image exist in the form of a preset soft one-hot encoding, and the position of the sample frame quality information corresponding to the sample image within the preset soft one-hot encoding represents the sample category information corresponding to the sample image.
- FIG. 3 is an example diagram of the joint representation of category information and frame quality information, in which the frame quality information is represented by a numerical value in the range [0, 1]. As shown in FIG. 3, the value 0.9 represents the frame quality information corresponding to the detection frame position information, and its placement in the second slot of the encoding indicates that the preset target detection model predicts that the target corresponding to that detection frame position information belongs to the second category.
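The soft one-hot joint representation can be sketched as an encode/decode pair; the helper names below are illustrative, not part of the patent:

```python
def encode_soft_one_hot(category_index, frame_quality, num_classes):
    """Joint representation: a zero vector whose single non-zero entry is
    the frame quality value, placed at the category's position."""
    vec = [0.0] * num_classes
    vec[category_index] = frame_quality
    return vec

def decode_soft_one_hot(vec):
    """Recover (category_index, frame_quality): the position of the
    maximum entry gives the category, its value gives the quality."""
    quality = max(vec)
    return vec.index(quality), quality
```

For the FIG. 3 example, a quality of 0.9 in the second slot of a four-class encoding is `[0.0, 0.9, 0.0, 0.0]`, which decodes back to category index 1 with quality 0.9.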
- the S102 may include the following steps:
- the to-be-detected image features are input into the feature regression layer of the pre-established target detection model to determine the candidate frame position information corresponding to the to-be-detected image;
- for each detection target in the to-be-detected image, based on a preset suppression algorithm and the target frame quality information corresponding to each candidate frame position information of that detection target, the candidate frame position information satisfying a preset screening condition is determined from all candidate frame position information corresponding to the detection target and used as the target detection frame position information corresponding to that detection target, so as to obtain the target detection result corresponding to the to-be-detected image; the preset screening condition is the condition that, among the candidate frame position information corresponding to the detection target, the corresponding target frame quality information is the largest.
- the preset suppression algorithm may be NMS (Non Maximum Suppression, non-maximum suppression algorithm).
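A minimal sketch of NMS driven by the target frame quality information follows. This is an assumed implementation for illustration (function names and the 0.5 threshold are mine); the patent only specifies that overlapping candidates are screened so the frame with the largest quality information survives:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms_by_quality(boxes, qualities, iou_threshold=0.5):
    """Greedy non-maximum suppression ranked by frame quality: visit
    candidates from highest to lowest quality, keeping a candidate only
    if it does not heavily overlap an already-kept, higher-quality box."""
    order = sorted(range(len(boxes)), key=lambda i: qualities[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Ranking by predicted frame quality rather than by classification confidence is the point of the scheme: of two candidates covering the same target, the one whose predicted localization quality is higher is retained.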
- the electronic device determines the candidate frame position information corresponding to the to-be-detected image through the feature extraction layer and the feature regression layer of the pre-established target detection model, and then uses the feature classification layer of the pre-established target detection model, the
- to-be-detected image features and the candidate frame position information to determine the detection category information and target frame quality information corresponding to each candidate frame position information of each detection target in the to-be-detected image;
- based on the target frame quality information corresponding to each candidate frame position information, the candidate frame position information satisfying the preset screening conditions is determined from all candidate frame position information corresponding to the detection target and used as the target detection frame position information corresponding to that detection target, so as to obtain the target detection result corresponding to the to-be-detected image.
- in this way, the selection of the corresponding candidate frame position information is completed, so as to obtain frame position information with better positional accuracy.
- an embodiment of the present invention provides a target detection apparatus.
- the apparatus may include:
- an obtaining module 410 configured to obtain an image to be detected
- the determination module 420 is configured to use a pre-established target detection model and the to-be-detected image to determine a target detection result corresponding to the to-be-detected image, wherein the target detection result includes: a detected target in the to-be-detected image
- the pre-established target detection model is: a model obtained by training based on the sample image and its corresponding calibration information and the corresponding sample frame quality information
- the sample frame quality information corresponding to the sample image is: information determined based on the calibration frame position information in the calibration information corresponding to the sample image and the prediction frame position information corresponding to the sample image detected by the initial target detection model corresponding to the pre-established target detection model.
- the pre-established target detection model, obtained by training based on the sample images, their corresponding calibration information and the corresponding sample frame quality information, has the function of predicting the quality corresponding to the target detection frame of a detection target in an image.
- since the sample frame quality information is determined based on the calibration frame position information in the calibration information corresponding to the sample image and the prediction frame position information corresponding to the sample image detected by the initial target detection model corresponding to the pre-established target detection model,
- the frame position information with better frame quality information corresponding to each detection target can be screened out as the target detection frame position information, thereby improving
- the accuracy with which the detection frame corresponding to a target in the image is determined, and obtaining a detection frame with better accuracy.
- the sample frame quality information corresponding to the sample image is: the ratio of the intersection area to the union area between the calibration frame position information in the calibration information corresponding to the sample image and the prediction frame position information corresponding to the sample image detected by the initial target detection model corresponding to the pre-established target detection model.
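The intersection-over-union ratio used as sample frame quality can be computed directly from the two boxes. A minimal sketch, assuming `(x1, y1, x2, y2)` corner coordinates (the patent does not fix a box format, and the function name is mine):

```python
def frame_quality(calib_box, pred_box):
    """Sample frame quality information: intersection area divided by
    union area between the calibration frame and the prediction frame,
    both given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(calib_box[2], pred_box[2]) - max(calib_box[0], pred_box[0]))
    iy = max(0.0, min(calib_box[3], pred_box[3]) - max(calib_box[1], pred_box[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(calib_box) + area(pred_box) - inter
    return inter / union if union > 0 else 0.0
```

The ratio is 1.0 when the prediction frame coincides with the calibration frame and 0.0 when the two frames do not overlap, so it serves directly as a quality label in [0, 1].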
- the target detection result further includes: detection category information corresponding to the detection target in the to-be-detected image.
- the device further includes:
- the model training module (not shown in the figure) is configured to train and obtain the pre-established target detection model before the pre-established target detection model and the to-be-detected image are used to detect, from the to-be-detected image, the target detection frame position information corresponding to the to-be-detected target and the target frame quality information corresponding to that target detection frame position information, wherein the model training module is specifically configured to obtain the initial target detection model, wherein,
- the initial target detection model includes a feature extraction layer, a feature classification layer and a feature regression layer;
- the calibration information includes: calibration frame position information and calibration category information corresponding to the sample targets contained in the corresponding sample images;
- for each sample image, input the sample image into the feature extraction layer, and extract the sample image features corresponding to the sample image;
- the sample image feature corresponding to the sample image and the prediction frame position information corresponding to the sample target in the sample image are input into the feature classification layer, and the prediction category information and prediction frame corresponding to the sample target in the sample image are determined. quality information;
- the expression of the preset localization quality focal loss function is:
- LFL(i) = -((1 - p_i)log(1 - q_i) + p_i·log(q_i)) · |p_i - q_i|^γ
- where LFL(i) represents the first loss value between the predicted frame quality information and the real frame quality information corresponding to the i-th sample target in the sample image, p_i represents the real frame quality information corresponding to the i-th sample target in the sample image, q_i represents the predicted frame quality information corresponding to the i-th sample target in the sample image, and γ represents a preset parameter.
- the sample frame quality information and sample category information corresponding to the sample image exist in the form of a preset soft one-hot encoding, and the position of the sample frame quality information corresponding to the sample image within the preset soft one-hot encoding represents the sample category information corresponding to the sample image.
- the determination module 420 is specifically configured to input the to-be-detected image into the feature extraction layer of the pre-established target detection model, and extract the to-be-detected image features corresponding to the to-be-detected image;
- for each detection target in the to-be-detected image, based on the preset suppression algorithm and the target frame quality information corresponding to each candidate frame position information of that detection target, determine, from all candidate frame position information corresponding to the detection target, the candidate frame position information that satisfies the preset screening condition as the target detection frame position information corresponding to that detection target, so as to obtain the target detection result corresponding to the to-be-detected image, wherein the preset screening condition is the condition that, among the candidate frame position information corresponding to the detection target, the corresponding target frame quality information is the largest.
- the modules in the apparatus in the embodiment may be distributed in the apparatus in the embodiment according to the description of the embodiment, and may also be located in one or more apparatuses different from this embodiment with corresponding changes.
- the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules.
Abstract
Description
Claims (10)
- A target detection method, characterized in that the method comprises: obtaining an image to be detected; and determining a target detection result corresponding to the to-be-detected image by using a pre-established target detection model and the to-be-detected image, wherein the target detection result includes: target detection frame position information corresponding to a detection target in the to-be-detected image and target frame quality information corresponding to the target detection frame position information; the pre-established target detection model is a model obtained by training based on sample images, their corresponding calibration information and the corresponding sample frame quality information; and the sample frame quality information corresponding to a sample image is information determined based on the calibration frame position information in the calibration information corresponding to that sample image and the prediction frame position information corresponding to that sample image detected by the initial target detection model corresponding to the pre-established target detection model.
- The method according to claim 1, wherein the sample frame quality information corresponding to the sample image is: the ratio of the intersection area to the union area between the calibration frame position information in the calibration information corresponding to the sample image and the prediction frame position information corresponding to the sample image detected by the initial target detection model corresponding to the pre-established target detection model.
- The method according to claim 1 or 2, wherein the target detection result further comprises: detection category information corresponding to the detection target in the to-be-detected image.
- The method according to any one of claims 1-3, wherein before the step of detecting, from the to-be-detected image by using the pre-established target detection model and the to-be-detected image, the target detection frame position information corresponding to the to-be-detected target and the target frame quality information corresponding to the target detection frame position information, the method further comprises a process of training to obtain the pre-established target detection model, wherein the process includes: obtaining the initial target detection model, which includes a feature extraction layer, a feature classification layer and a feature regression layer; obtaining a plurality of sample images and the calibration information corresponding to the sample images, the calibration information including the calibration frame position information and calibration category information corresponding to the sample targets contained in the corresponding sample images; for each sample image, inputting the sample image into the feature extraction layer to extract the sample image features corresponding to the sample image; for each sample image, inputting the sample image features corresponding to the sample image into the feature regression layer to obtain the prediction frame position information corresponding to the sample targets in the sample image; for each sample target in each sample image, calculating the ratio of the intersection area to the union area between the calibration frame position information corresponding to the sample target and the corresponding prediction frame position information, and determining it as the real frame quality information corresponding to the sample target; for each sample image, inputting the sample image features corresponding to the sample image and the prediction frame position information corresponding to the sample targets in the sample image into the feature classification layer to determine the prediction category information and prediction frame quality information corresponding to the sample targets in the sample image; for each sample image, determining the current loss value based on the preset localization quality focal loss function, the predicted frame quality information and real frame quality information corresponding to the sample targets in the sample image, the preset category loss function, and the predicted category information and calibration category information corresponding to the sample targets in the sample image; judging whether the current loss value exceeds a preset loss value threshold; if so, adjusting the model parameters of the feature extraction layer, the feature regression layer and the feature classification layer, and returning to the step of inputting each sample image into the feature extraction layer to extract the sample image features corresponding to the sample image; if not, determining that the initial target detection model has reached a convergence state, and obtaining the pre-established target detection model.
- The method according to claim 4, wherein the expression of the preset localization quality focal loss function is: LFL(i) = -((1 - p_i)log(1 - q_i) + p_i·log(q_i)) · |p_i - q_i|^γ, where LFL(i) represents the first loss value between the predicted frame quality information and the real frame quality information corresponding to the i-th sample target in the sample image, p_i represents the real frame quality information corresponding to the i-th sample target in the sample image, q_i represents the predicted frame quality information corresponding to the i-th sample target in the sample image, and γ represents a preset parameter.
- The method according to claim 4, wherein the sample frame quality information and sample category information corresponding to the sample image exist in the form of a preset soft one-hot encoding, and the position of the sample frame quality information corresponding to the sample image within the preset soft one-hot encoding represents the sample category information corresponding to the sample image.
- The method according to any one of claims 1-3, wherein the step of determining the target detection result corresponding to the to-be-detected image by using the pre-established target detection model and the to-be-detected image comprises: inputting the to-be-detected image into the feature extraction layer of the pre-established target detection model to extract the to-be-detected image features corresponding to the to-be-detected image; inputting the to-be-detected image features into the feature regression layer of the pre-established target detection model to determine the candidate frame position information corresponding to the to-be-detected image; inputting the to-be-detected image features and the candidate frame position information into the feature classification layer of the pre-established target detection model to determine the detection category information and target frame quality information corresponding to each candidate frame position information of each detection target in the to-be-detected image; and, for each detection target in the to-be-detected image, based on a preset suppression algorithm and the target frame quality information corresponding to each candidate frame position information of that detection target, determining, from all candidate frame position information corresponding to the detection target, the candidate frame position information satisfying a preset screening condition as the target detection frame position information corresponding to that detection target, so as to obtain the target detection result corresponding to the to-be-detected image, wherein the preset screening condition is the condition that, among the candidate frame position information corresponding to the detection target, the corresponding target frame quality information is the largest.
- A target detection apparatus, characterized in that the apparatus comprises: an obtaining module configured to obtain an image to be detected; and a determination module configured to determine a target detection result corresponding to the to-be-detected image by using a pre-established target detection model and the to-be-detected image, wherein the target detection result includes: target detection frame position information corresponding to a detection target in the to-be-detected image and target frame quality information corresponding to the target detection frame position information; the pre-established target detection model is a model obtained by training based on sample images, their corresponding calibration information and the corresponding sample frame quality information; and the sample frame quality information corresponding to a sample image is information determined based on the calibration frame position information in the calibration information corresponding to that sample image and the prediction frame position information corresponding to that sample image detected by the initial target detection model corresponding to the pre-established target detection model.
- The apparatus according to claim 8, wherein the sample frame quality information corresponding to the sample image is: the ratio of the intersection area to the union area between the calibration frame position information in the calibration information corresponding to the sample image and the prediction frame position information corresponding to the sample image detected by the initial target detection model corresponding to the pre-established target detection model.
- The apparatus according to claim 8 or 9, wherein the target detection result further comprises: detection category information corresponding to the detection target in the to-be-detected image.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010601392.2A CN113935386A (en) | 2020-06-29 | 2020-06-29 | Target detection method and device |
CN202010601392.2 | 2020-06-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022000855A1 true WO2022000855A1 (en) | 2022-01-06 |
Family
ID=79272632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/121337 WO2022000855A1 (en) | 2020-06-29 | 2020-10-16 | Target detection method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113935386A (en) |
WO (1) | WO2022000855A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117173568A (en) * | 2023-09-05 | 2023-12-05 | 北京观微科技有限公司 | Target detection model training method and target detection method |
CN117636266A (en) * | 2024-01-25 | 2024-03-01 | 华东交通大学 | Method and system for detecting safety behaviors of workers, storage medium and electronic equipment |
CN117636266B (en) * | 2024-01-25 | 2024-05-14 | 华东交通大学 | Method and system for detecting safety behaviors of workers, storage medium and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106295678A (en) * | 2016-07-27 | 2017-01-04 | 北京旷视科技有限公司 | Neural metwork training and construction method and device and object detection method and device |
CN108268869A (en) * | 2018-02-13 | 2018-07-10 | 北京旷视科技有限公司 | Object detection method, apparatus and system |
WO2019028725A1 (en) * | 2017-08-10 | 2019-02-14 | Intel Corporation | Convolutional neural network framework using reverse connections and objectness priors for object detection |
CN109727275A (en) * | 2018-12-29 | 2019-05-07 | 北京沃东天骏信息技术有限公司 | Object detection method, device, system and computer readable storage medium |
CN111062413A (en) * | 2019-11-08 | 2020-04-24 | 深兰科技(上海)有限公司 | Road target detection method and device, electronic equipment and storage medium |
CN111241947A (en) * | 2019-12-31 | 2020-06-05 | 深圳奇迹智慧网络有限公司 | Training method and device of target detection model, storage medium and computer equipment |
- 2020-06-29 CN CN202010601392.2A patent/CN113935386A/en active Pending
- 2020-10-16 WO PCT/CN2020/121337 patent/WO2022000855A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN113935386A (en) | 2022-01-14 |
Similar Documents
Publication | Title |
---|---|
CN109977812B (en) | Vehicle-mounted video target detection method based on deep learning |
CN111353413B (en) | Low-missing-report-rate defect identification method for power transmission equipment |
KR101926561B1 (en) | Road crack detection apparatus of patch unit and method thereof, and computer program for executing the same |
CN109447169B (en) | Image processing method, training method and device of model thereof and electronic system |
WO2020078229A1 (en) | Target object identification method and apparatus, storage medium and electronic apparatus |
CN110263706B (en) | Method for detecting and identifying dynamic target of vehicle-mounted video in haze weather |
CN110084165B (en) | Intelligent identification and early warning method for abnormal events in open scene of power field based on edge calculation |
Bello-Salau et al. | Image processing techniques for automated road defect detection: A survey |
CN105788269A (en) | Unmanned aerial vehicle-based abnormal traffic identification method |
CN111985365A (en) | Straw burning monitoring method and system based on target detection technology |
CN112883921A (en) | Garbage can overflow detection model training method and garbage can overflow detection method |
CN110956104A (en) | Method, device and system for detecting overflow of garbage can |
CN113221804B (en) | Disordered material detection method and device based on monitoring video and application |
CN113642474A (en) | Hazardous area personnel monitoring method based on YOLOv5 |
CN111597901A (en) | Illegal billboard monitoring method |
KR102391853B1 (en) | System and Method for Processing Image Information |
CN110852164A (en) | YOLOv3-based method and system for automatically detecting illegal building |
WO2022000855A1 (en) | Target detection method and device |
CN112862150A (en) | Forest fire early warning method based on image and video multi-model |
CN114267082A (en) | Bridge side falling behavior identification method based on deep understanding |
CN113989626B (en) | Multi-class garbage scene distinguishing method based on target detection model |
Lam et al. | Real-time traffic status detection from on-line images using generic object detection system with deep learning |
CN114926791A (en) | Method and device for detecting abnormal lane change of vehicles at intersection, storage medium and electronic equipment |
CN116071711B (en) | Traffic jam condition detection method and device |
CN113971666A (en) | Power transmission line machine inspection image self-adaptive identification method based on depth target detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | EP: The EPO has been informed by WIPO that EP was designated in this application |
Ref document number: 20943586; Country of ref document: EP; Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | EP: PCT application non-entry in European phase |
Ref document number: 20943586; Country of ref document: EP; Kind code of ref document: A1 |
|
32PN | EP: Public notification in the EP bulletin as the address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.06.2023) |
|