WO2020151329A1

WO2020151329A1 - Target detection based identification box determining method and device and terminal equipment

Info

Publication number: WO2020151329A1
Application number: PCT/CN2019/118131
Authority: WO
Inventors: 徐锐杰
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-01-23
Filing date: 2019-11-13
Publication date: 2020-07-30
Also published as: CN109886997A; CN109886997B

Abstract

A target detection based identification box determining method and device, a terminal equipment, and a computer nonvolatile readable storage medium. The method comprises: acquiring an image to be detected comprising a target and performing identification box analysis on the image to be detected, to obtain at least one identification box to be detected; carrying out, according to each of the identification boxes to be detected, image capturing on the image to be detected to obtain captured images; and sequentially inputting the image to be detected and all the captured images into a pretrained identification box optimizing model to obtain a target identification box, wherein the target identification box is used for indicating a region where the target is located. A target identification box is generated by using the method and image features of the image to be detected as a combination, thus improving the accuracy of determining the target identification box and thereby improving the accuracy of target detection.

Description

Method, device and terminal equipment for determining recognition frame based on target detection

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on January 23, 2019, the application number is 201910064290.9, and the invention title is "Method, Apparatus, and Terminal Equipment for Identifying Frame Based on Target Detection". The reference is incorporated in this application.

Technical field

The present invention belongs to the field of data processing technology, and in particular relates to a method, a device, a terminal device, and a computer-readable storage medium for determining a recognition frame based on target detection.

Background technique

With the development of computer technology and the widespread application of computer vision principles, the use of computer technology to detect and track targets has shown a trend of heating up. Depending on the scene, the target can be a human face, a vehicle or a building, etc. How to accurately locate the target in the image is an urgent problem in target detection.

At present, images are usually detected by deep convolutional neural network algorithms. Based on the characteristics of the algorithm, multiple recognition frames are usually obtained after detection. Therefore, it is necessary to determine the optimal recognition frame from multiple recognition frames. In the prior art, when determining the optimal recognition frame, usually only the recognition frame is calculated, and the cross-to-combination algorithm is specifically applied to multiple recognition frames, and the recognition frame with the largest cross-to-combination ratio is regarded as the optimal recognition Frame, because the actual image is not combined in the calculation process, it is easy to lead to unreliable results. In summary, since the result of the cross-union operation in the prior art is not reliable, the determined optimal recognition frame cannot fit the target well, and the accuracy of target detection is low.

technical problem

The embodiments of the present application provide a method, a device, a terminal device, and a computer non-volatile readable storage medium for determining a recognition frame based on target detection to solve the problem of the low accuracy of the recognition frame determined in the prior art, which leads to target detection. The problem of low accuracy.

Technical solutions

The first aspect of the embodiments of the present application provides a method for determining a recognition frame based on target detection, including:

Acquiring a to-be-detected image containing a target, and performing identification frame analysis on the to-be-detected image to obtain at least one identification frame to be detected;

Intercepting the image to be detected according to each identification frame to be detected to obtain a intercepted image;

The image to be detected and all the intercepted images are sequentially input into the pre-trained recognition frame optimization model, and the sub-recognition frame output by the recognition frame optimization model corresponding to each input image is obtained, and the sub-recognition frame corresponding to each input image is obtained according to the previous one. The sub-recognition frame corrects the latter sub-recognition frame, and determines the last sub-recognition frame after the correction is completed as the target recognition frame, wherein the recognition frame optimization model is based on a preset sample image and corresponding The target recognition frame is obtained by manual labeling frame training, and the target recognition frame is used to indicate the area where the target is located in the image to be detected.

A second aspect of the embodiments of the present application provides an apparatus for determining a recognition frame based on target detection, including:

An analysis unit, configured to obtain an image to be detected containing a target, and perform identification frame analysis on the image to be detected to obtain at least one identification frame to be detected;

An interception unit, configured to intercept the image to be detected according to each identification frame to be detected to obtain a intercepted image;

The input unit is configured to sequentially input the image to be detected and all the intercepted images into a pre-trained recognition frame optimization model, and obtain the sub-recognition frame output by the recognition frame optimization model corresponding to each input image, and Correct the latter sub-recognition frame according to the previous sub-recognition frame, and determine the last sub-recognition frame after the correction is completed as the target recognition frame, wherein the recognition frame optimization model is preset The sample image and the corresponding manual labeling frame are trained, and the target recognition frame is used to indicate the area where the target is located in the image to be detected.

A third aspect of the embodiments of the present application provides a terminal device. The terminal device includes a memory, a processor, and computer-readable instructions that are stored in the memory and run on the processor. The processor The following steps are implemented when the computer-readable instruction is executed:

The fourth aspect of the embodiments of the present application provides a computer non-volatile readable storage medium, the computer non-volatile readable storage medium stores computer readable instructions, and the computer readable instructions are executed by a processor When implementing the following steps:

Beneficial effect

In the embodiment of the present application, at least one recognition frame to be detected is obtained by analyzing the image to be detected, the image to be detected is intercepted according to each recognition frame to be detected to obtain a captured image, and then the image to be detected and all the captured images are input into the pre-trained recognition The frame optimization model obtains the target recognition frame. This application analyzes the image to be detected and all intercepted images through the recognition frame optimization model, and generates the target recognition frame by combining the image characteristics, so that the generated target recognition frame is more suitable for the target in the image to be detected. The accuracy of the determined target recognition frame and the accuracy of target detection are improved.

Description of the drawings

FIG. 1 is a flowchart of an implementation of a method for determining a recognition frame based on target detection according to Embodiment 1 of the present application;

FIG. 2 is a flowchart of an implementation of a method for determining a recognition frame based on target detection according to Embodiment 2 of the present application;

FIG. 3 is a flowchart of an implementation of a method for determining a recognition frame based on target detection according to Embodiment 3 of the present application;

FIG. 4 is a flowchart of an implementation of a method for determining a recognition frame based on target detection according to Embodiment 4 of the present application;

FIG. 5 is a flowchart of an implementation of a method for determining a recognition frame based on target detection according to Embodiment 5 of the present application;

FIG. 6 is a network structure diagram of the inception framework provided by Embodiment 6 of the present application;

FIG. 7 is a structural diagram of the first inception structure provided by Embodiment 7 of the present application;

FIG. 8 is a structural diagram of a second inception structure provided by Embodiment 8 of the present application;

FIG. 9 is a structural diagram of a third inception structure provided by Embodiment 9 of the present application;

FIG. 10 is a structural block diagram of an apparatus for determining a recognition frame based on target detection according to Embodiment 10 of the present application;

FIG. 11 is a schematic diagram of a terminal device provided in Embodiment 11 of the present application.

Embodiments of the invention

In the following description, for the purpose of illustration rather than limitation, specific details such as a specific system structure and technology are proposed for a thorough understanding of the embodiments of the present application. However, those skilled in the art should be clear that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted to avoid unnecessary details from obstructing the description of this application.

In order to illustrate the technical solution described in the present application, specific embodiments are used for description below.

Fig. 1 shows an implementation process of a method for determining a recognition frame based on target detection provided by an embodiment of the present application, which is detailed as follows:

In S101, an image to be detected containing a target is acquired, and an identification frame analysis is performed on the image to be detected to obtain at least one identification frame to be detected.

Target detection is one of the core technologies in the field of computer vision. The purpose is to detect all targets in the image and determine the location of each target. In view of the situation that the recognition frame determined in the target detection process cannot fit the target well, in the embodiment of the present application, the image to be detected containing the target is first obtained, and the recognition frame analysis of the image to be detected is performed to obtain at least one recognition frame to be detected . Recognition frame analysis can be implemented based on open source target detection models, such as regional convolutional neural networks (Region- Convolutional Neural Network, R-CNN) model or Single Shot MultiBox Detector (SSD) model, etc., based on the target detection model, when the image to be detected is analyzed based on the recognition frame, the sliding window algorithm or selective search is first performed The image to be detected is divided into at least two recognition frames by the method, etc., and the target detection model calculates each of the separated recognition frames to obtain the confidence level of each recognition frame. The confidence level indicates that the target is located in the recognition frame. The higher the confidence, the higher the probability that the target is located in the recognition frame. The calculated confidence value depends on the architecture and weight of the target detection model in the actual application scenario, which is not done in the embodiment of this application. Repeat. After the confidence of each recognition frame is obtained, the recognition frame corresponding to the confidence higher than the confidence threshold is determined as the recognition frame to be detected. On the one hand, the amount of subsequent calculations is reduced, and on the other hand, the recognition frame to be detected and the target The correlation degree of is higher, so the accuracy of the subsequent determination of the target recognition frame can be improved. The confidence threshold can be determined according to the accuracy requirements for determining the target recognition frame. The higher the accuracy requirement, the greater the confidence threshold. For example, it can be set to 60%. It is worth mentioning that the specific types of targets to be detected in the embodiments of this application are not limited. For example, the targets can be human faces, vehicles, or trees, etc., but after the target type is selected, all targets in the embodiments of this application refer to The target type has been selected, and the recognition frame to be detected and other recognition frames in the embodiment of the present application are preferably rectangular frames.

Optionally, the image to be detected is normalized to a preset size, and the normalized image to be detected is zero-averaged. Before analyzing the recognition frame of the image to be detected, in order to improve the analysis effect, the size of the image to be detected is normalized to a preset size in advance. The preset size can be set freely, such as 299 (pixel length) × 299 (pixel width) ). On this basis, the normalized image to be detected is zero-averaged, which is convenient to improve the effect of the recognition frame analysis. Specifically, the average value of the original values of all pixels in the image to be detected is calculated, and then the image to be detected The value of each pixel in the image is updated to the difference between the original value of the pixel and the above average value (that is, the result of subtracting the average value from the original value). After the values of all pixels in the image to be detected are updated, Recognition frame analysis can be performed on the image to be detected. Through the above method, the unity of the image to be detected is improved, and the subsequent analysis effect is improved.

In S102, the image to be detected is intercepted according to each identification frame to be detected to obtain a intercepted image.

Each obtained recognition frame to be detected corresponds to a coordinate set. Each coordinate in the coordinate set is the coordinate of a corner of the recognition frame to be detected in the image to be detected. Therefore, after obtaining the recognition frame to be detected, according to each The to-be-detected recognition frame is intercepted from the to-be-detected image to obtain the intercepted image, which facilitates the subsequent determination of the best recognition frame.

In S103, the to-be-detected image and all the intercepted images are sequentially input into a pre-trained recognition frame optimization model, and the sub-recognition frame output by the recognition frame optimization model corresponding to each input image is obtained, and according to The former sub-recognition frame corrects the latter sub-recognition frame, and the last sub-recognition frame after the correction is determined as the target recognition frame, wherein the recognition frame optimization model is based on a preset sample The image and the corresponding manual labeling frame are trained, and the target recognition frame is used to indicate the area where the target is located in the image to be detected.

Compared with the traditional method using the cross-and-comparison algorithm to determine the target recognition frame, in the embodiment of the present application, the target recognition frame is corrected and determined in combination with the image characteristics of the image to be detected. Specifically, the image to be detected and all the intercepted images are sequentially input into the pre-trained recognition frame optimization model. For each input image, the sub-recognition frame corresponding to each image output by the recognition frame optimization model is obtained. The accuracy of the sub-recognition frame may be low and it cannot fit the target well. Therefore, the sub-recognition frame of the input next image is corrected according to the input sub-recognition frame of the previous image (according to the image input time from front to back) Order) until the image to be detected and all intercepted images are input to the recognition frame optimization model, and the last sub-recognition frame after the correction is determined as the target recognition frame, which is the area where the target is located. It is worth mentioning that the recognition frame optimization model is trained through preset sample images and corresponding manually labeled frames, so that sub-recognition frames can be obtained according to the specific characteristics of the image. The specific training process will be explained later.

In addition, the embodiment of the present application does not limit the modification method of the sub-recognition frame. For example, the coordinate set of the previous sub-recognition frame and the coordinate set of the current sub-recognition frame can be averaged (calculated separately for each corner). ), and update the calculated coordinate set to the coordinate set of the current sub-recognition frame, so as to realize the correction of the current sub-recognition frame. It is worth mentioning that the coordinate set of each sub-recognition frame is obtained by placing the sub-recognition frame in the image to be detected, that is, the coordinate set of the sub-recognition frame is relative to the image to be detected. For example, the image set to be detected includes and only includes the image to be detected PictureA, the images PictureB and PictureC are intercepted, and the coordinate set obtained after the neural network operation on PictureA is [top left point (100, 100), bottom left point ( 100, 50), the lower right point (200, 50), the upper right point (200, 100)], the coordinate set of the sub-recognition frame obtained after the neural network operation of PictureB is [upper left point (90, 90), The lower left corner point (90, 40), the lower right corner point (190, 40), the upper right corner point (190, 90)], the coordinate set of the sub-recognition frame obtained after the neural network operation of PictureC is [upper left corner point (85 , 90), the lower left corner point (85, 50), the lower right corner point (180, 50), the upper right corner point (180, 90)], then the sub-recognition frame in PictureB is first corrected based on the sub-recognition frame in PictureA , The coordinate set of the sub-recognition frame of the revised PictureB is [top left point (95, 95), bottom left point (95, 45), bottom right point (195, 45), top right point (195, 95) ], and then modify the sub-recognition frame of PictureC based on the sub-recognition frame of PictureB after the correction, and the coordinate set of the sub-recognition frame of the corrected PictureC is [upper left point (90, 92.5), lower left point (90, 47.5), the lower right corner point (187.5, 47.5), the upper right corner point (187.5, 92.5)], the revised PictureC sub-identification box is the target identification box. Of course, the above is only an example of modifying the sub-identification frame, and does not constitute a limitation to the embodiment of the present application.

It can be seen from the embodiment shown in FIG. 1 that, in the embodiment of the present application, at least one identification frame to be detected is obtained by analyzing the identification frame of the image to be detected, and the image to be detected is intercepted according to each identification frame to be detected to obtain the intercepted image, and then The image to be detected and all captured images are sequentially input into the preset recognition frame optimization model to obtain the target recognition frame. The embodiment of the present application determines the target recognition frame based on the image characteristics of the image to be detected, so that the target recognition frame can better fit the image to be detected The target in, improves the accuracy of the determined target recognition frame.

FIG. 2 shows a method based on the first embodiment of the present application, and is a method obtained by expanding the process before sequentially inputting the image to be detected and all the intercepted images into the pre-trained recognition frame optimization model. The embodiment of the present application provides an implementation flowchart of a method for determining a recognition frame based on target detection. As shown in FIG. 2, the method for determining a recognition frame may include the following steps:

In S201, acquire at least the two sample images and the corresponding manual annotation frame, wherein the sample image is an image containing the target, and the manual annotation frame is an image in the manually annotated sample image. State the area where the target is located.

In the embodiment of this application, for the training process of the optimization model of the recognition frame, first obtain at least two sample images and a manually labeled frame corresponding to each sample image. The sample image contains the target to be detected in the embodiment of this application. For images, the manual labeling box is the area where the target is located in the manually labelled sample image. The sample images can be freely selected by the user or can be directly retrieved from the open source image library. In order to improve the training effect of the model, the number of sample images should be large. For example, the number of sample images obtained in the embodiment of this application can be 1000 Zhang.

In S202, the identification frame analysis is performed on each of the sample images to obtain at least one sample identification frame, and the sample image is intercepted according to each sample identification frame to obtain a sample intercepted image.

For each obtained sample image, the recognition frame analysis is performed on it, and the method of performing recognition frame analysis in this step is the same as that in step S101. After the identification frame analysis, each sample image corresponds to at least one sample identification frame. In order to train the model, the corresponding sample image is intercepted according to each sample identification frame obtained to obtain the sample intercepted image.

In S203, a sample image set is constructed based on the sample image and all the corresponding sample intercepted images, and at least two of the sample image sets constructed are sequentially input to a preset basic model, and according to the input sample The manual annotation frame corresponding to the image set adjusts the weight of the basic model until the identification frame output by the adjusted basic model matches the manual annotation frame, and the adjusted basic model is determined The model is the recognition frame optimization model; wherein, each of the sample image sets constructed corresponds to one sample image.

A sample image set is constructed based on each sample image and all sample intercepted images corresponding to the sample image. After the construction is completed, the number of sample image sets is equal to the number of sample images, that is, each sample image set corresponds to a sample image. The sample image set only refers to the sample images and all sample intercepted images into a specific set, and does not specifically refer to a specific storage form. Then, the constructed at least two sample image sets are successively used as input parameters and input to the preset basic model. In the embodiment of the present application, the basic model can be implemented based on the inception framework, and the network structure of the inception framework is shown in FIG. 6. In Figure 6, "conv" refers to the convolution kernel, which is a filter matrix used to perform convolution operations on different windows in the image; "patch size" refers to the receptive field (equivalent to size) of the convolution kernel, If the receptive field of a certain convolution kernel is 3×3, it means that the length of the convolution kernel is 3 elements and the width is 3 elements; “stride” refers to the step length, when performing convolution operation, convolution The kernel slides on the images on the three channels (red channel, green channel and blue channel), and uses the weighted sum of the original values of the image pixels on the three channels and the convolution kernel as the convolution kernel The step length is the number of steps per slide of the convolution kernel (the number of pixels in the path); "input "size" refers to the size of the image as the input parameter of this layer, the last parameter in "input size" refers to the depth of the image, for example, "input size" is 299×299×3, it is limited as the input parameter of this layer The depth of the image is 3. In addition, "conv "Padded" refers to the convolution kernel that contains the boundary filling function; "pool" refers to the pooling layer, which is used to reduce the data volume of the input parameters of this layer, prevent overfitting, and keep the depth of the image unchanged; "linear "Refers to the linear layer, whose input parameter is the calculated unnormalized probability of each coordinate set; "softmax" refers to the classification output layer, which applies the softmax function in the neural network to calculate the The probability is normalized to complete the classification (in actual training, the coordinate set with the highest probability output by the "softmax" layer can be determined as the coordinate set output by the basic model this time).

In addition, in FIG. 6, “figure 5” refers to the first inception structure. The first inception structure does not limit the receptive field of the convolution kernel. In the entire basic model, the number of first inception structures is limited to three. Fig. 7 is a structural diagram of the first inception structure, where "Base" is the input layer of the first inception structure, and "Filter Concat" is the output layer of the first inception structure.

"Figure 6" in FIG. 6 refers to the second inception structure, and the specific structure diagram of the second inception structure is shown in FIG. 8. The second inception structure splits the 5*5 convolution kernel on the basis of the first inception structure, and specifically splits it into two 3*3 convolution kernels to reduce the amount of calculation and improve training efficiency. In the basic model, the number of second inception structures is limited to 5.

"Figure 7" in FIG. 6 refers to the third inception structure, and the specific structure diagram of the third inception structure is shown in FIG. 9. On the basis of the second inception structure, the third inception structure splits the n*n convolution kernel, which is specifically split into a 1*n convolution kernel and an n*1 convolution kernel, further reducing Small amount of calculation, where n is an integer greater than 1. In the basic model, the number of third inception structures is limited to two.

When constructing the basic model in advance, the weights in the basic model can be initialized (values can be set randomly within a preset range), and the weights are the specific values of each level (including each convolution kernel) in the basic model. After the sample image set is input to the basic model as input parameters, the basic model will perform calculations and output a recognition frame, which is the area where the predicted target is located in the sample image. In order to realize the training of the basic model and make the calculation results of the basic model more accurate, after each sample image set is input to the basic model, the weight of the basic model is adjusted according to the manual labeling box corresponding to the sample image set until the adjustment The recognition frame output by the latter basic model matches the manual labeling frame, and the specific weight adjustment method is described in detail later.

After the basic model training is completed, that is, the weight adjustment is completed, the basic model is used as the recognition frame optimization model, and the image to be detected and all intercepted images are input as input parameters into the recognition frame optimization model. Because the recognition frame optimization model has been The target to be detected has a good recognition effect, so the recognition frame output by the recognition frame optimization model is directly determined as the target recognition frame.

It can be seen from the embodiment shown in FIG. 2 that, in the embodiment of the present application, a preset basic model is trained based on manually annotated sample images, the trained basic model is used as the recognition frame optimization model, and the model is trained based on manual supervision to improve The fit of the recognition frame optimization model and the analysis method of the recognition frame is improved, so that the recognition frame optimization model has a better recognition effect on the image to be detected and the captured image, and further improves the accuracy of the determined target recognition frame.

As shown in Fig. 3, on the basis of the second embodiment of the present application, the constructed at least two sample image sets are sequentially input into the preset basic model, and the basic model is adjusted according to the manual annotation frame corresponding to the input sample image set. The weight adjustment is performed until the recognition frame output by the adjusted basic model matches the manual label frame. A method obtained after refinement. The embodiment of the present application provides an implementation flowchart of a method for determining a recognition frame based on target detection. As shown in FIG. 3, the method for determining a recognition frame may include the following steps:

In S301, the recognition frame corresponding to each input sample image set output by the basic model is determined as a basic recognition frame, and the difference between the basic recognition frame and the corresponding manual labeling frame is calculated parameter.

After each sample image set is input to the basic model, the recognition frame output by the basic model corresponding to the sample image set is obtained, and the recognition frame is determined as the basic recognition frame, and then the basic recognition frame is calculated corresponding to the sample image set The parameters of the difference between the manually labeled boxes. The embodiment of the application does not limit the calculation method of the difference parameter. For example, the difference between the coordinates of the basic identification frame and the manually marked frame at the four corner points can be calculated separately, and the average value of the four differences can be used as the difference parameter. It is also possible to import the difference between the coordinates of the basic identification box and the manual labeling box at the four corner points into a preset loss function (such as a quadratic cost function) to obtain the difference parameters. For ease of explanation, the following text contains the loss function Examples of ways.

In S302, if the difference parameter is greater than or equal to the preset expected parameter, weight adjustment is performed on the basic model based on the difference parameter, and the next basic recognition frame output by the basic model is repeatedly obtained, and The weight adjustment is performed on the basic model based on the updated difference parameter until the difference parameter is smaller than the expected parameter.

In order to measure the training situation of the basic model, the expected parameters are preset in the embodiments of this application. If the obtained difference parameter is greater than or equal to the expected parameter, then the basic model is weighted based on the difference parameter. The weight adjustment operation can be based on the neural network Open source gradient descent algorithm or back propagation algorithm and other algorithms are implemented. For example, since the basic recognition frame is calculated by multiple convolution kernels in the basic model, and each convolution kernel includes weights, the difference can be calculated when the difference parameter is greater than or equal to the expected parameter The partial derivative of the parameter and the weight of the convolution kernel constitutes a gradient vector of the difference parameter with respect to the weight, and then the value of the weight of the convolution kernel is adjusted based on the gradient vector, so that the difference parameter is as small as possible.

After completing the weight adjustment based on a difference parameter, input the next sample image set to the basic model, obtain the basic recognition frame output by the basic model, and update the value of the difference parameter. If the updated difference parameter is greater than or equal to the expected parameter, Then adjust the weight and repeat the above operation until the difference parameter is less than the expected parameter. Of course, when the difference parameter is less than the expected parameter, it can be determined that the basic recognition frame output by the basic model matches the manual labeling frame. Of course, the basic model can continue to be trained until all the sample image sets are input.

It can be seen from the embodiment shown in FIG. 3 that, in the embodiment of the present application, after the sample image set is input to the basic model, the difference parameter is calculated, and when the difference parameter is greater than or equal to the preset expected parameter, the difference parameter is compared to the basic model. The model performs weight adjustment and repeats the above operation until the difference parameter is less than the expected parameter, which improves the analysis accuracy of the identification frame optimization model.

As shown in Figure 4, on the basis of the second embodiment of the present application, at least two targets are obtained after the recognition frame analysis of the image to be detected, and each target corresponds to at least one recognition frame to be detected. The detection image and all intercepted images are sequentially input into the pre-trained recognition frame optimization model, and the sub recognition frame corresponding to each input image from the recognition frame optimization model is obtained, and the next sub recognition frame is performed according to the previous sub recognition frame Correction is a method obtained by refining the process of determining the last sub-recognition frame after correction as the target recognition frame. The embodiment of the present application provides an implementation flowchart of a method for determining a recognition frame based on target detection. As shown in FIG. 4, the method for determining a recognition frame may include the following steps:

In S401, the to-be-detected image is cut according to all the to-be-detected recognition frames corresponding to each of the targets to obtain a cut image, and the size of the cut image is scaled, wherein each cut image corresponds to a cut image. Mentioned goals.

In actual application scenarios, the image to be detected may contain at least two targets. For example, if the target is a vehicle and the image to be detected is a captured image of an intersection, the image to be detected may contain at least two vehicles. In view of the above situation, in the embodiment of the present application, the target recognition frame is determined separately for each target. Specifically, since each target identified corresponds to at least one recognition frame to be detected, all the target recognition frames corresponding to each target are calculated. The maximum coverage area of the frame is identified, and the image to be detected is cut according to the maximum coverage area to obtain a cut image, and each cut image corresponds to an identified target. Among them, when calculating the maximum coverage area, first obtain the union area of all the identification frames to be detected corresponding to the target, and then take the horizontal coordinate range covered by the union area as the horizontal coordinate range of the maximum coverage area, and then the union area The vertical coordinate range of the coverage is used as the vertical coordinate range of the maximum coverage area to construct the maximum coverage area. It is worth mentioning that the maximum coverage area is rectangular, so the maximum coverage area is obtained based on the union area. In fact, the shape may be irregular The union area of is completed to the maximum coverage area of the rectangle.

After the cut image is obtained, since the size of the cut image may not meet the standard size of the input parameters of the recognition frame optimization model (for example, the standard size is the preset size, specifically 299×299, and the size of the cut image is 199×199), Therefore, the size of each cut image obtained is scaled until the cut image reaches the standard size. It is worth mentioning that, unlike the cut image, if the above-mentioned intercepted image and sample intercepted image do not meet the standard size of the recognition frame optimization model, the area where the excess size in the intercepted image and sample intercepted image is located can be blanked. Processing or gray-scale processing to prevent the subsequent size confusion when determining the recognition frame.

In S402, the scaled cut image and all the corresponding intercepted images are sequentially input to the recognition frame optimization model, and the recognition frame output by the recognition frame optimization model is restored in size, and the size is restored The identification frame is determined as the target identification frame corresponding to the cut image.

After the cut image is scaled, input the scaled cut image and all the corresponding cut images (here the cut image is obtained by cutting the cut image according to the recognition frame to be detected corresponding to the cut image). The weight-adjusted recognition frame optimization model will restore the size of the recognition frame output by the recognition frame optimization model (here the recognition frame is the last corrected recognition frame output by the recognition frame optimization model). The scale and steps of the size restoration In S401, the scale of the size of the cut image is reversed. For example, when the size of the cut image is reduced, the cut image is reduced to one-half. In this step, the recognition frame is expanded to the center of the recognition frame. Twice the original. The recognition frame after the size restoration is determined as the target recognition frame corresponding to the cut image, and the target recognition frame is re-placed in the initial image to be detected, so as to facilitate subsequent target detection operations. The number of target recognition frames finally obtained in the embodiment of the present application is equal to the number of recognized targets, that is, one target recognition frame corresponds to one target.

It can be seen from the embodiment shown in FIG. 4 that in the embodiment of the present application, the image to be detected is cut according to each identified target to obtain a cut image, and the cut image is scaled in size, and the cut image after the scale is scaled and the corresponding All the intercepted images of, are sequentially input to the recognition frame optimization model, and the size of the recognition frame output by the recognition frame optimization model is restored to obtain the target recognition frame. The embodiment of the present application determines the target recognition frame separately for each recognized target, which improves target recognition The pertinence and accuracy of the frame determination prevents poor calculation results after inputting the image to be detected with at least two targets into the recognition frame optimization model.

As shown in FIG. 5, it is based on the second embodiment of the present application, and includes at least two recognition frame optimization models and at least two preset attribute features, and each recognition frame optimization model is composed of corresponding attribute features. Based on the training of the sample image set, the process of sequentially inputting the image to be detected and all the intercepted images into the pre-trained optimization model of the recognition frame is refined. The embodiment of the present application provides an implementation flowchart of a method for determining a recognition frame based on target detection. As shown in FIG. 5, the method for determining a recognition frame may include the following steps:

In S501, the attribute feature corresponding to the image to be detected is determined as a target feature.

In the embodiment of this application, at least two attribute features can be set in advance, and the basic model is separately trained according to each attribute feature to obtain the recognition frame optimization model. Preferably, the number of attribute features and the final recognition frame optimization model Are equal in number. When training each basic model, only at least two sample image sets corresponding to the same attribute feature are input to the basic model. For ease of explanation, assuming that the attribute features include male and female, two basic models are set in advance, at least two sample image sets corresponding to males are input into one of the basic models for weight adjustment, and the final weight-adjusted basic model is used as The recognition frame optimization model inputs at least two sample image sets corresponding to women to another basic model for weight adjustment, and the final weight adjustment is performed on the basic model as another recognition frame optimization model. After obtaining the trained recognition frame optimization model, obtain the attribute feature corresponding to the image to be detected, and determine the attribute feature as the target feature. The target feature can be customized in advance by the user, or open source analysis components can be introduced to analyze Target characteristics.

In S502, the image to be detected and all the captured images are sequentially input to the recognition frame optimization model corresponding to the target feature.

After determining the target feature, input the image to be detected and all the intercepted images into the weight-adjusted recognition frame optimization model corresponding to the target feature, because the recognition frame optimization model has a good analysis effect on the image corresponding to the target feature Therefore, the accuracy of the target recognition frame determined subsequently is improved.

It can be seen from the embodiment shown in FIG. 5 that in the embodiment of the present application, the attribute feature corresponding to the image to be detected is determined as the target feature, and the image set to be detected is input into the weight-adjusted recognition frame optimization corresponding to the target feature The model, through targeted training of the recognition frame optimization model, and targeted input of the image set to be detected into the corresponding recognition frame optimization model, improves the accuracy of target detection.

It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.

Corresponding to the method for determining a recognition frame based on target detection in the above embodiment, FIG. 10 shows a structural block diagram of a device for determining a recognition frame based on target detection provided by an embodiment of the present application. Referring to FIG. 10, the device for determining a recognition frame include:

The analyzing unit 101 is configured to obtain a to-be-detected image containing a target, and perform an identification frame analysis on the to-be-detected image to obtain at least one to-be-detected identification frame;

The interception unit 102 is configured to intercept the image to be detected according to each identification frame to be detected to obtain a intercepted image;

The input unit 103 is configured to sequentially input the image to be detected and all the intercepted images into a pre-trained recognition frame optimization model, and obtain sub-recognition frames output by the recognition frame optimization model corresponding to each input image, The latter sub-recognition frame is corrected according to the previous sub-recognition frame, and the last sub-recognition frame after the correction is determined to be the target recognition frame, wherein the recognition frame optimization model is a preset The target recognition frame is used to indicate the area where the target is located in the image to be detected.

Optionally, the input unit 103 further includes:

The acquiring unit is configured to acquire at least the two sample images and the corresponding manual labeling frame, wherein the sample image is an image containing the target, and the manual labeling frame is the sample image manually labeled The area where the target is located;

A sample interception unit, configured to perform identification frame analysis on each sample image to obtain at least one sample identification frame, and intercept the sample image according to each sample identification frame to obtain a sample intercepted image;

The construction unit is configured to construct a sample image set based on the sample image and all the corresponding sample intercepted images, input at least two of the constructed sample image sets into a preset basic model in sequence, and according to the input The manual labeling frame corresponding to the sample image set adjusts the weight of the basis until the identification frame output by the adjusted basic model matches the manual labeling frame, and the adjusted basis is determined The model is the recognition frame optimization model; wherein, each of the sample image sets constructed corresponds to one sample image.

Optionally, the building unit includes:

The calculation unit is configured to determine the recognition frame corresponding to each input sample image set output by the basic model as a basic recognition frame, and calculate the difference between the basic recognition frame and the corresponding manual annotation frame Difference parameter

The weight adjustment unit is configured to, if the difference parameter is greater than or equal to a preset expected parameter, adjust the weight of the basic model based on the difference parameter, and repeatedly obtain the next basic recognition frame output by the basic model , And adjust the weight of the basic model based on the updated difference parameter until the difference parameter is smaller than the expected parameter.

Optionally, if at least two targets are obtained after the recognition frame analysis of the image to be detected, and each target corresponds to at least one recognition frame to be detected, the input unit includes:

The cutting unit is used to cut the to-be-detected image according to all the to-be-detected recognition frames corresponding to each of the targets to obtain a cut image, and perform size scaling on the cut image, wherein each cut image corresponds to one The target

The determining unit is configured to input the scaled cut image and all the corresponding intercepted images into the recognition frame optimization model in sequence, and restore the size of the recognition frame output by the recognition frame optimization model to restore the size The subsequent recognition frame is determined as the target recognition frame corresponding to the cut image.

Optionally, at least two recognition frame optimization models and at least two preset attribute features are included, and each recognition frame optimization model is obtained by training a sample image set corresponding to the same attribute feature, the input unit includes:

A feature determining unit, configured to determine the attribute feature corresponding to the image to be detected as a target feature;

For the input unit, it is used to input the to-be-detected image and all the intercepted images into the recognition frame optimization model corresponding to the target feature in sequence.

Optionally, the analysis unit 101 further includes:

The normalization unit is used to normalize the to-be-detected image to a preset size, and perform zero-averaging on the normalized to-be-detected image.

Therefore, the device for determining a recognition frame based on target detection provided by the embodiment of the present application inputs the image to be detected and all intercepted images into a pre-trained recognition frame optimization model to obtain the target recognition frame, which improves the accuracy of the determined target recognition frame. And the accuracy of target detection.

Fig. 11 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 11, the terminal device 11 of this embodiment includes: a processor 110, a memory 111, and computer-readable instructions 112 stored in the memory 111 and running on the processor 110, for example, based on target detection The identification box determines the program. When the processor 110 executes the computer-readable instruction 112, the steps in the above-mentioned embodiments of the method for determining a recognition frame based on target detection are implemented, such as steps S101 to S103 shown in FIG. 1. Alternatively, when the processor 110 executes the computer-readable instruction 112, the functions of the units in the foregoing embodiments of the device for determining a recognition frame based on target detection are implemented, for example, the functions of the units 101 to 103 shown in FIG.

Exemplarily, the computer-readable instruction 112 may be divided into one or more units, and the one or more units are stored in the memory 111 and executed by the processor 110 to complete the application . The one or more units may be an instruction segment of a series of computer-readable instructions capable of completing specific functions, and the instruction segment is used to describe the execution process of the computer-readable instruction 112 in the terminal device 11. For example, the computer-readable instruction 112 may be divided into an analysis unit, an interception unit, and an input unit, and the specific functions of each unit are as follows:

An analysis unit, configured to obtain an image to be detected, and analyze the recognition frame of the image to be detected to obtain at least one recognition frame to be detected;

The terminal device 11 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The terminal device may include, but is not limited to, a processor 110 and a memory 111. Those skilled in the art can understand that FIG. 11 is only an example of the terminal device 11, and does not constitute a limitation on the terminal device 11. It may include more or less components than shown in the figure, or a combination of certain components, or different components. For example, the terminal device may also include input and output devices, network access devices, buses, etc.

The so-called processor 110 may be a central processing unit (Central Processing Unit, CPU), it can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit (ASIC), ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.

The memory 111 may be an internal storage unit of the terminal device 11, such as a hard disk or a memory of the terminal device 11. The memory 111 may also be an external storage device of the terminal device 11, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) equipped on the terminal device 11. Card, Flash Card (Flash Card, FC), etc. Further, the memory 111 may also include both an internal storage unit of the terminal device 11 and an external storage device. The memory 111 is used to store the computer-readable instructions and other programs and data required by the terminal device. The memory 111 can also be used to temporarily store data that has been output or will be output.

A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer. In a readable storage medium, when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (Read-Only Memory, ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still implement the foregoing The technical solutions recorded in the examples are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of protection of this application.

Claims

A method for determining a recognition frame based on target detection is characterized in that it includes:

Acquiring a to-be-detected image containing a target, and performing identification frame analysis on the to-be-detected image to obtain at least one identification frame to be detected;

Intercepting the image to be detected according to each identification frame to be detected to obtain a intercepted image;

The image to be detected and all the intercepted images are sequentially input into the pre-trained recognition frame optimization model, and the sub-recognition frame output by the recognition frame optimization model corresponding to each input image is obtained, and the sub-recognition frame corresponding to each input image is obtained according to the previous one. The sub-recognition frame corrects the latter sub-recognition frame, and determines the last sub-recognition frame after the correction is completed as the target recognition frame, wherein the recognition frame optimization model is based on a preset sample image and corresponding The target recognition frame is obtained by manual labeling frame training, and the target recognition frame is used to indicate the area where the target is located in the image to be detected.
5. The method for determining a recognition frame according to claim 1, wherein before inputting the to-be-detected image and all the intercepted images sequentially into a pre-trained recognition frame optimization model, the method further comprises:

Acquire at least two of the sample images and the corresponding manual labeling frame, wherein the sample image is an image containing the target, and the manual labeling frame is the manually labelled sample image where the target is located area;

Performing identification frame analysis on each of the sample images to obtain at least one sample identification frame, and intercepting the sample image according to each of the sample identification frames to obtain a sample intercepted image;

A sample image set is constructed based on the sample image and all the corresponding sample intercepted images, and at least two of the sample image sets constructed are sequentially input into a preset basic model, and according to the input sample image set corresponding to The manual labeling frame adjusts the weight of the basic model until the recognition frame output by the adjusted basic model matches the manual labeling frame, and it is determined that the adjusted basic model is the Recognition frame optimization model; wherein, each of the sample image sets constructed corresponds to one sample image.
The method for determining a recognition frame according to claim 2, wherein the constructed at least two sample image sets are sequentially input to a preset basic model, and all the sample image sets corresponding to the input The manual labeling frame adjusts the weight of the basic model until the identification frame output by the adjusted basic model matches the manual labeling frame, including:

Determining a recognition frame corresponding to each input sample image set output by the basic model as a basic recognition frame, and calculating a difference parameter between the basic recognition frame and the corresponding manual labeling frame;

If the difference parameter is greater than or equal to the preset expected parameter, the basic model is weighted based on the difference parameter, and the next basic recognition frame output by the basic model is repeatedly obtained, and based on the updated The difference parameter adjusts the weight of the basic model until the difference parameter is smaller than the expected parameter.
The method for determining the recognition frame of claim 2, wherein if at least two of the targets are obtained after the recognition frame analysis of the image to be detected, and each target corresponds to at least one of the recognition to be detected Box, the image to be detected and all the intercepted images are sequentially input into the pre-trained recognition frame optimization model, and the sub-recognition frame output by the recognition frame optimization model corresponding to each input image is obtained, and Correcting the latter sub-recognition frame according to the previous sub-recognition frame, and determining the last sub-recognition frame after the correction as the target recognition frame includes:

Cutting the to-be-detected image according to all to-be-detected recognition frames corresponding to each of the targets to obtain a cut image, and scaling the cut image, wherein each cut image corresponds to one target;

The cut image after the size scaling and all the corresponding intercepted images are sequentially input to the recognition frame optimization model, the recognition frame output by the recognition frame optimization model is restored to the size, and the recognition after the size is restored The frame is determined as the target recognition frame corresponding to the cut image.
The method for determining a recognition frame according to claim 2, wherein if it includes at least two optimization models of the recognition frame and at least two preset attribute characteristics, and each of the recognition frame optimization models corresponds to the same If the sample image set of the attribute characteristics is obtained through training, then the step of sequentially inputting the image to be detected and all the intercepted images into a pre-trained recognition frame optimization model includes:

Determining the attribute feature corresponding to the image to be detected as a target feature;

The image to be detected and all the intercepted images are sequentially input to the recognition frame optimization model corresponding to the target feature.
8. The method for determining a recognition frame according to claim 1, characterized in that, before the performing recognition frame analysis on the image to be detected, the method further comprises:

The image to be detected is normalized to a preset size, and the normalized image to be detected is zero-averaged.
A device for determining a recognition frame based on target detection is characterized in that it comprises:

An analysis unit, configured to obtain an image to be detected containing a target, and perform identification frame analysis on the image to be detected to obtain at least one identification frame to be detected;

An interception unit, configured to intercept the image to be detected according to each identification frame to be detected to obtain a intercepted image;

The input unit is configured to sequentially input the image to be detected and all the intercepted images into a pre-trained recognition frame optimization model, and obtain the sub-recognition frame output by the recognition frame optimization model corresponding to each input image, and Correct the latter sub-recognition frame according to the previous sub-recognition frame, and determine the last sub-recognition frame after the correction is completed as the target recognition frame, wherein the recognition frame optimization model is preset The sample image and the corresponding manual labeling frame are trained, and the target recognition frame is used to indicate the area where the target is located in the image to be detected.
The device for determining a recognition frame based on target detection according to claim 7, wherein the input unit comprises:

The acquiring unit is configured to acquire at least the two sample images and the corresponding manual labeling frame, wherein the sample image is an image containing the target, and the manual labeling frame is the sample image manually labeled The area where the target is located;

A sample interception unit, configured to perform identification frame analysis on each sample image to obtain at least one sample identification frame, and intercept the sample image according to each sample identification frame to obtain a sample intercepted image;

The construction unit is configured to construct a sample image set based on the sample image and all the corresponding sample intercepted images, input at least two of the constructed sample image sets into a preset basic model in sequence, and according to the input The manual labeling frame corresponding to the sample image set adjusts the weight of the basis until the identification frame output by the adjusted basic model matches the manual labeling frame, and the adjusted basis is determined The model is the recognition frame optimization model; wherein, each of the sample image sets constructed corresponds to one sample image.
The device for determining a recognition frame based on target detection according to claim 8, wherein the construction unit comprises:

The calculation unit is configured to determine the recognition frame corresponding to each input sample image set output by the basic model as a basic recognition frame, and calculate the difference between the basic recognition frame and the corresponding manual annotation frame Difference parameter

The weight adjustment unit is configured to, if the difference parameter is greater than or equal to a preset expected parameter, adjust the weight of the basic model based on the difference parameter, and repeatedly obtain the next basic recognition frame output by the basic model , And adjust the weight of the basic model based on the updated difference parameter until the difference parameter is smaller than the expected parameter.
The device for determining a recognition frame based on target detection according to claim 8, wherein when at least two targets are obtained after the recognition frame analysis of the image to be detected, and each target corresponds to at least one recognition frame to be detected, the The input unit includes:

The cutting unit is used to cut the to-be-detected image according to all the to-be-detected recognition frames corresponding to each of the targets to obtain a cut image, and perform size scaling on the cut image, wherein each cut image corresponds to one The target

The determining unit is configured to input the scaled cut image and all the corresponding intercepted images into the recognition frame optimization model in sequence, and restore the size of the recognition frame output by the recognition frame optimization model to restore the size The subsequent recognition frame is determined as the target recognition frame corresponding to the cut image.
A terminal device, which is characterized by comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, and the processor executes the computer-readable instructions as follows step:

Acquiring a to-be-detected image containing a target, and performing identification frame analysis on the to-be-detected image to obtain at least one identification frame to be detected;

Intercepting the image to be detected according to each identification frame to be detected to obtain a intercepted image;

The image to be detected and all the intercepted images are sequentially input into the pre-trained recognition frame optimization model, and the sub-recognition frame output by the recognition frame optimization model corresponding to each input image is obtained, and the sub-recognition frame corresponding to each input image is obtained according to the previous one. The sub-recognition frame corrects the latter sub-recognition frame, and determines the last sub-recognition frame after the correction is completed as the target recognition frame, wherein the recognition frame optimization model is based on a preset sample image and corresponding The target recognition frame is obtained by manual labeling frame training, and the target recognition frame is used to indicate the area where the target is located in the image to be detected.
The terminal device according to claim 11, characterized in that, before sequentially inputting the to-be-detected image and all the intercepted images into a pre-trained recognition frame optimization model, the method further comprises:

Acquire at least two of the sample images and the corresponding manual labeling frame, wherein the sample image is an image containing the target, and the manual labeling frame is the manually labelled sample image where the target is located area;

Performing identification frame analysis on each of the sample images to obtain at least one sample identification frame, and intercepting the sample image according to each of the sample identification frames to obtain a sample intercepted image;

A sample image set is constructed based on the sample image and all the corresponding sample intercepted images, and at least two of the sample image sets constructed are sequentially input into a preset basic model, and according to the input sample image set corresponding to The manual labeling frame adjusts the weight of the basic model until the recognition frame output by the adjusted basic model matches the manual labeling frame, and it is determined that the adjusted basic model is the Recognition frame optimization model; wherein, each of the sample image sets constructed corresponds to one sample image.
The terminal device according to claim 12, wherein the constructed at least two sample image sets are sequentially input into a preset basic model, and the artificial image sets corresponding to the input sample image sets are inputted in sequence. The labeling frame adjusts the weight of the basic model until the identification frame output by the adjusted basic model matches the manual labeling frame, including:

Determining a recognition frame corresponding to each input sample image set output by the basic model as a basic recognition frame, and calculating a difference parameter between the basic recognition frame and the corresponding manual labeling frame;

If the difference parameter is greater than or equal to the preset expected parameter, the basic model is weighted based on the difference parameter, and the next basic recognition frame output by the basic model is repeatedly obtained, and based on the updated The difference parameter adjusts the weight of the basic model until the difference parameter is smaller than the expected parameter.
The terminal device according to claim 12, wherein if at least two of the targets are obtained after identification frame analysis of the image to be detected, and each target corresponds to at least one identification frame to be detected, Then, the to-be-detected image and all the intercepted images are sequentially input to the pre-trained recognition frame optimization model, and the sub-recognition frame output by the recognition frame optimization model corresponding to each input image is obtained, and according to the previous One of the sub-recognition frames corrects the latter sub-recognition frame, and the last sub-recognition frame after the correction is determined as the target recognition frame includes:

Cutting the to-be-detected image according to all to-be-detected recognition frames corresponding to each of the targets to obtain a cut image, and scaling the cut image, wherein each cut image corresponds to one target;

The cut image after the size scaling and all the corresponding intercepted images are sequentially input to the recognition frame optimization model, the recognition frame output by the recognition frame optimization model is restored to the size, and the recognition after the size is restored The frame is determined as the target recognition frame corresponding to the cut image.
The terminal device according to claim 12, wherein if it includes at least two optimization models of the recognition frame and at least two preset attribute characteristics, and each of the optimization models of the recognition frame corresponds to the same The sample image set of the attribute characteristics is obtained through training, and then the input of the image to be detected and all the intercepted images into a pre-trained recognition frame optimization model in sequence includes:

Determining the attribute feature corresponding to the image to be detected as a target feature;

The image to be detected and all the intercepted images are sequentially input to the recognition frame optimization model corresponding to the target feature.
A computer non-volatile readable storage medium, the computer non-volatile readable storage medium storing computer readable instructions, characterized in that, when the computer readable instructions are executed by a processor, the following steps are implemented:

Acquiring a to-be-detected image containing a target, and performing identification frame analysis on the to-be-detected image to obtain at least one identification frame to be detected;

Intercepting the image to be detected according to each identification frame to be detected to obtain a intercepted image;

The image to be detected and all the intercepted images are sequentially input into the pre-trained recognition frame optimization model, and the sub-recognition frame output by the recognition frame optimization model corresponding to each input image is obtained, and the sub-recognition frame corresponding to each input image is obtained according to the previous one. The sub-recognition frame corrects the latter sub-recognition frame, and determines the last sub-recognition frame after the correction is completed as the target recognition frame, wherein the recognition frame optimization model is based on a preset sample image and corresponding The target recognition frame is obtained by manual labeling frame training, and the target recognition frame is used to indicate the area where the target is located in the image to be detected.
The computer non-volatile readable storage medium according to claim 16, characterized in that, before inputting the to-be-detected image and all the intercepted images sequentially into a pre-trained recognition frame optimization model, the method further comprises:

Acquire at least two of the sample images and the corresponding manual labeling frame, wherein the sample image is an image containing the target, and the manual labeling frame is the manually labelled sample image where the target is located area;

Performing identification frame analysis on each of the sample images to obtain at least one sample identification frame, and intercepting the sample image according to each of the sample identification frames to obtain a sample intercepted image;

A sample image set is constructed based on the sample image and all the corresponding sample intercepted images, and at least two of the sample image sets constructed are sequentially input into a preset basic model, and according to the input sample image set corresponding to The manual labeling frame adjusts the weight of the basic model until the recognition frame output by the adjusted basic model matches the manual labeling frame, and it is determined that the adjusted basic model is the Recognition frame optimization model; wherein, each of the sample image sets constructed corresponds to one sample image.
The computer non-volatile readable storage medium according to claim 17, wherein the constructed at least two sample image sets are sequentially input into a preset basic model, and the input samples are The manual labeling frame corresponding to the image set adjusts the weight of the basic model until the identification frame output by the adjusted basic model matches the manual labeling frame, including:

Determining a recognition frame corresponding to each input sample image set output by the basic model as a basic recognition frame, and calculating a difference parameter between the basic recognition frame and the corresponding manual labeling frame;

If the difference parameter is greater than or equal to the preset expected parameter, the basic model is weighted based on the difference parameter, and the next basic recognition frame output by the basic model is repeatedly obtained, and based on the updated The difference parameter adjusts the weight of the basic model until the difference parameter is smaller than the expected parameter.
The computer non-volatile readable storage medium of claim 17, wherein if at least two of the targets are obtained after the identification frame analysis of the image to be detected, and each of the targets corresponds to at least one For the recognition frame to be detected, the image to be detected and all the intercepted images are sequentially input into a pre-trained recognition frame optimization model, and the output of the recognition frame optimization model corresponding to each input image is obtained Sub-recognition frame, and correcting the latter sub-recognition frame according to the previous sub-recognition frame, and determining the last sub-recognition frame after the correction as the target recognition frame, including:

Cutting the to-be-detected image according to all to-be-detected recognition frames corresponding to each of the targets to obtain a cut image, and scaling the cut image, wherein each cut image corresponds to one target;

The cut image after the size scaling and all the corresponding intercepted images are sequentially input to the recognition frame optimization model, the recognition frame output by the recognition frame optimization model is restored to the size, and the recognition after the size is restored The frame is determined as the target recognition frame corresponding to the cut image.
The computer non-volatile readable storage medium according to claim 17, wherein if it includes at least two optimization models of the recognition frame and at least two preset attribute characteristics, and each recognition frame is optimized The model is obtained by training the sample image set corresponding to the same attribute feature, and then sequentially inputting the to-be-detected image and all the intercepted images into a pre-trained recognition frame optimization model includes:

Determining the attribute feature corresponding to the image to be detected as a target feature;

The image to be detected and all the intercepted images are sequentially input to the recognition frame optimization model corresponding to the target feature.