CN116524333A - Training method of blur detection model, blur detection method, device and equipment - Google Patents

Training method of blur detection model, blur detection method, device and equipment

Info

Publication number
CN116524333A
Authority
CN
China
Prior art keywords
image
training
target object
network
virtual focus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310509880.4A
Other languages
Chinese (zh)
Inventor
丁苗高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202310509880.4A priority Critical patent/CN116524333A/en
Publication of CN116524333A publication Critical patent/CN116524333A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
            • G06V 10/70 Arrangements using pattern recognition or machine learning
              • G06V 10/764 Arrangements using classification, e.g. of video objects
              • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06V 10/82 Arrangements using neural networks
          • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
            • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method of a blur detection model, together with a blur detection method, apparatus, and device, wherein the blur detection model comprises a virtual focus prediction network and a classification network. The virtual focus prediction network is trained first, yielding a trained virtual focus prediction network. A first training object image and its corresponding blur label are then acquired, and the first training object image is input into the trained virtual focus prediction network, which extracts image features of the first training object image; these image features include the relative depth features of the image. The extracted image features are input into the classification network to obtain a target object blur detection result, and the classification network is trained using a detection result loss computed from the target object blur detection result and the blur label. Whether an image is blurred is related to the relative depth at which it was imaged; because the image features in this method fuse in the relative depth information of the image, the classification network trained on these features achieves a good detection effect.

Description

Training method of blur detection model, blur detection method, device and equipment
Technical Field
The present disclosure relates to the field of image processing, and in particular, to a training method of a blur detection model, and a blur detection method, apparatus, and device.
Background
Typically, after an image is acquired, the target object in the image may be blurred, which can hinder subsequent processing of that object. For example, if the commodity in a commodity image is blurred, the commodity cannot be clearly distinguished when it is recommended, degrading the recommendation effect. It is therefore necessary to perform blur detection on an image to determine whether the target object in the image is blurred.
Current blur detection methods perform blur detection on each individual pixel in an image, but per-pixel blur results cannot truly reflect whether the target object as a whole is blurred. A better blur detection method is therefore needed.
Disclosure of Invention
In order to solve the above technical problem, the application provides a training method, apparatus, and device for a blur detection model, as well as a blur detection method, apparatus, and device. These make it possible to train a blur detection model with better performance and, based on that model, to perform blur detection on a target object in a target object image and obtain an accurate target object blur detection result (the target object may be, for example, a commodity).
In order to achieve the above purpose, the technical solution provided by the application is as follows:
The application provides a training method of a blur detection model, wherein the blur detection model comprises a virtual focus prediction network and a classification network, and the method comprises the following steps:
obtaining a trained virtual focus prediction network; the trained virtual focus prediction network is used for extracting image features of a target object image input into it; the image features include relative depth features of the target object image;
acquiring a first training object image and a blur label corresponding to the first training object image;
inputting the first training object image into the trained virtual focus prediction network to obtain image features of the first training object image; the image features include relative depth features of the first training object image;
inputting the image features into the classification network to obtain a target object blur detection result output by the classification network;
calculating a detection result loss according to the target object blur detection result and the blur label;
and training the classification network according to the detection result loss.
Optionally, the training process of the virtual focus prediction network includes:
inputting a second training object image into the virtual focus prediction network, and acquiring the image features of the second training object image extracted by the virtual focus prediction network together with a predicted virtual focus image output by the network; the pixel values in the predicted virtual focus image predict whether the pixels at the same positions in the second training object image lie within the depth of field of the imaging device;
calculating an image loss from the predicted virtual focus image and the real virtual focus image corresponding to the second training object image; the pixel values in the real virtual focus image indicate whether the pixels at the same positions in the second training object image lie within the depth of field of the imaging device;
acquiring relative depth features of the second training object image;
calculating a feature loss from the image features and the relative depth features;
training the virtual focus prediction network according to the image loss and the feature loss;
the image features of a target object image extracted by the trained virtual focus prediction network then include the relative depth features of that target object image.
Optionally, the acquiring of the relative depth features of the second training object image includes:
inputting the second training object image into a trained relative depth estimation network, and acquiring the relative depth features of the second training object image extracted by the relative depth estimation network;
the training process of the relative depth estimation network includes:
inputting a third training object image into the relative depth estimation network, and acquiring predicted relative depth features of the third training object image output by the network;
calculating a depth loss from the predicted relative depth features and the actual relative depth features of the third training object image;
and training the relative depth estimation network according to the depth loss.
Optionally, the virtual focus prediction network is a multi-scale fully convolutional network comprising a plurality of first upsampling layers and a plurality of first downsampling layers; the image features are multi-scale image features and include the features output by each first upsampling layer.
The relative depth estimation network is likewise a multi-scale fully convolutional network comprising a plurality of second upsampling layers and a plurality of second downsampling layers; the relative depth features are multi-scale relative depth features and include the features output by each second upsampling layer.
The application further provides a blur detection method, which comprises the following steps:
acquiring at least one target object image;
inputting the target object image into a blur detection model to obtain a target object blur detection result output by the blur detection model; the target object blur detection result indicates whether the target object in the target object image is blurred;
the blur detection model consists of a virtual focus prediction network and a classification network, and is trained according to the training method of the blur detection model described above.
Optionally, the acquiring of at least one target object image includes:
acquiring a video frame to be detected;
inputting the video frame to be detected into a target object detection network to obtain a target object detection result for the video frame;
and cropping the video frame to be detected based on the target object detection result to obtain at least one target object image from the video frame.
The application further provides a training apparatus for a blur detection model, wherein the blur detection model comprises a virtual focus prediction network and a classification network, and the apparatus comprises:
a first acquisition unit for obtaining a trained virtual focus prediction network; the trained virtual focus prediction network is used for extracting image features of a target object image input into it, the image features including relative depth features of the target object image;
a second acquisition unit for acquiring a first training object image and a blur label corresponding to the first training object image;
a first input unit for inputting the first training object image into the trained virtual focus prediction network to obtain image features of the first training object image, the image features including relative depth features of the first training object image;
a second input unit for inputting the image features into the classification network to obtain a target object blur detection result output by the classification network;
a calculating unit for calculating a detection result loss according to the target object blur detection result and the blur label;
and an execution unit for training the classification network according to the detection result loss.
The application further provides a blur detection apparatus, which comprises:
an acquisition unit for acquiring at least one target object image;
an input unit for inputting the target object image into a blur detection model to obtain a target object blur detection result output by the blur detection model; the target object blur detection result indicates whether the target object in the target object image is blurred;
the blur detection model consists of a virtual focus prediction network and a classification network, and is trained according to the training method of the blur detection model described above.
The application provides an electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of training a blur detection model as described in any one of the above, or the method of blur detection as described in any one of the above.
The present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of training a blur detection model as described in any one of the above, or a method of blur detection as described in any one of the above.
According to the above technical solutions, the application has the following beneficial effects:
The application provides a training method for a blur detection model comprising a virtual focus prediction network and a classification network. The virtual focus prediction network is trained first, yielding a trained virtual focus prediction network that extracts image features of any target object image input into it, the image features including relative depth features of that image. A first training object image and its corresponding blur label are then acquired, and the classification network is trained using them; the blur label indicates whether the target object in the first training object image is blurred. In a specific implementation, the first training object image is input into the trained virtual focus prediction network, and the image features of the first training object image extracted by that network are obtained. These image features include the relative depth features of the first training object image, which represent its relative depth information. The image features are then input into the classification network to obtain a target object blur detection result, and the classification network is trained using the detection result loss computed from the target object blur detection result and the blur label.
It is known that whether an image is blurred is correlated with the relative depth at which it was imaged. In this method the virtual focus prediction network is trained in advance, and the image features it extracts fuse in the relative depth information of the first training object image; that is, relative depth information is introduced into the blur detection process. A classification network trained on these image features therefore detects well, improving the detection accuracy of the blur detection model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present application, and that a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a training method of a blur detection model according to an embodiment of the present application;
FIG. 2 is a flowchart of training a virtual focus prediction network according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a blur detection model according to an embodiment of the present application;
FIG. 4 is a flowchart of a blur detection method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a training apparatus for a blur detection model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a blur detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments are described in further detail below with reference to the accompanying drawings.
In order to facilitate understanding and explanation of the technical solutions provided in the embodiments of the present application, the background technology related to the embodiments of the present application is first described.
Typically, after an image is acquired, the target object in the image may be blurred, which can hinder subsequent processing of that object. For example, when a commodity is recommended within a movie or television episode, if background blurring exists in a video frame of the episode, the commodity in that frame (i.e., the target object in the video frame) may be blurred. Recommending a blurred commodity makes the commodity hard to distinguish and degrades the recommendation effect. It is therefore necessary to perform blur detection on the target object in an image to determine whether it is blurred.
Currently, blur detection methods in the related art perform blur detection on each individual pixel in an image, and per-pixel results cannot truly reflect whether the target object is blurred. A better blur detection method is therefore needed.
On this basis, the embodiments of the present application provide a training method for a blur detection model comprising a virtual focus prediction network and a classification network. The virtual focus prediction network is trained first, yielding a trained network that extracts image features, including relative depth features, of any target object image input into it. A first training object image and its corresponding blur label, which indicates whether the target object in that image is blurred, are then acquired and used to train the classification network. In a specific implementation, the first training object image is input into the trained virtual focus prediction network to obtain its image features, including its relative depth features, which represent its relative depth information. These image features are input into the classification network to obtain a target object blur detection result, and the classification network is trained using the detection result loss computed from the target object blur detection result and the blur label.
It is known that whether an image is blurred is correlated with the relative depth at which it was imaged. In this method the virtual focus prediction network is trained in advance, and the image features it extracts fuse in the relative depth information of the first training object image; that is, relative depth information is introduced into the blur detection process. A classification network trained on these image features therefore detects well, improving the detection accuracy of the blur detection model.
It should be noted that the above drawbacks of the existing solutions are the result of the applicant's practice and careful study. Accordingly, the discovery process of the above problems, and the solutions proposed below by the embodiments of the present application for them, should all be regarded as contributions made by the applicant in the course of the present application.
In order to facilitate understanding of the present application, a training method of a blur detection model provided in an embodiment of the present application is described below with reference to the accompanying drawings. The training method of the blur detection model may be implemented by a terminal device or a server, for example.
Referring to fig. 1, the figure is a flowchart of a training method of a blur detection model according to an embodiment of the present application. As shown in fig. 1, the method may include S101-S106:
S101: obtaining a trained virtual focus prediction network; the trained virtual focus prediction network is used for extracting image features of a target object image input into it; the image features include relative depth features of the target object image.
In the embodiment of the application, the blur detection model comprises a virtual focus prediction network and a classification network, and the training process of the blur detection model comprises the training of each. In a specific implementation, the virtual focus prediction network is trained first, yielding the trained virtual focus prediction network; the classification network is then trained. Once both are trained, model training is complete, and the trained blur detection model comprises the trained virtual focus prediction network and the trained classification network.
The virtual focus prediction network is a network that, given a target object image as input, outputs a virtual focus image for it. A target object image is an image containing a target object. The output virtual focus image is a binary image whose pixel values indicate whether the pixels at the same positions in the target object image lie within the depth of field of the imaging device. The imaging device is, for example, an image capture device; the target object image can be considered to have been acquired (e.g., shot) by it. The depth of field is the range of subject distances in front of the imaging device within which a sharp image can be acquired.
In some optional examples, the virtual focus prediction network may be a fully convolutional network, i.e., a network in which all layers are convolutional. The specific type of fully convolutional network is not limited and can be chosen according to actual requirements.
In one possible implementation, the embodiment of the present application provides a training method for the virtual focus prediction network; see S201 to S205 below. During training of the virtual focus prediction network, the relative depth features of the input image are used as supervision information, so that after training, the image features of an input image extracted by the network include that image's relative depth features; see the training process of the virtual focus prediction network below for details. Here, the input image is the image fed to the virtual focus prediction network during its training.
On this basis, after a target object image is input into the trained virtual focus prediction network, the image features of the target object image extracted by the network can be obtained, and these include the image's relative depth features. Training the classification network on image features that include relative depth features therefore gives a better training effect and a more accurate trained blur detection model.
Here, image features are a collection of attribute information characterizing the properties or content of an image; they may be represented as feature vectors or as feature maps. Likewise, the relative depth features may be represented as feature vectors or feature maps, without limitation. The relative depth features represent the image's relative depth information. In this embodiment, depth refers to the distance between each pixel in the image and the capturing source (such as the imaging device). Because the absolute depth of a pixel from the capturing source cannot be accurately annotated manually, only which pixels are closer to and which are farther from the capturing source can be annotated; a quantity expressing closer to versus farther from the capturing source is the relative depth. For example, with relative depth normalized to [0, 1], given two relative depths of 0.5 and 0.6, the first indicates a pixel closer to the source and the second a pixel farther from it.
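For illustration only, a minimal sketch of such a normalization, assuming a dense depth map is already available as a NumPy array (the function name and the min-max scheme are assumptions for this sketch, not prescribed by the application):

```python
import numpy as np

def to_relative_depth(depth_map: np.ndarray) -> np.ndarray:
    """Min-max normalize a depth map to relative depths in [0, 1].

    Smaller values mean closer to the capturing source and larger values
    farther from it; only the ordering is meaningful, not the absolute scale.
    """
    d_min, d_max = float(depth_map.min()), float(depth_map.max())
    if d_max == d_min:  # degenerate case: constant depth everywhere
        return np.zeros_like(depth_map, dtype=float)
    return (depth_map - d_min) / (d_max - d_min)
```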
As an optional example, the virtual focus prediction network includes a target network layer, and the image features of the target object image input into the trained virtual focus prediction network are extracted by that target network layer, which is a layer inside the network. For the target network layer, see the virtual focus prediction network training method in S201 to S205 below.
S102: acquiring a first training object image and a blur label corresponding to the first training object image.
The first training object image is an image containing a target object, where the target object is the object in the image whose blurriness the user needs to judge. For example, when the application scenario is commodity recommendation in movies and television episodes, the target object may be a commodity in a video frame of an episode.
The first training object image may be obtained by cropping from a video frame of a movie or television episode. As an optional example, the video frame may be input into a trained target object detection network; the detection network detects the target object in the video frame, and the image containing the target object is then cropped out. The target object detection network may be a neural network that detects the target object in the video frame based on an object detection algorithm.
In practical applications, when the target object detection network detects a target object in a video frame, it frames the object with a pre-selection box. On this basis, the cropped image containing the target object (i.e., the target object image) may be the image framed by the pre-selection box, or an image obtained by expanding the framed region outward by a surrounding margin, without limitation. A sketch of such a crop is given below.
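A minimal sketch of such a crop, assuming the detection network returns a pre-selection box as (x1, y1, x2, y2) pixel coordinates (the 10% margin is an illustrative choice, not a value from the application):

```python
import numpy as np

def crop_object(frame: np.ndarray, box, margin: float = 0.1) -> np.ndarray:
    """Crop a detected target object from a video frame (an H x W x C array),
    expanding the pre-selection box outward by a relative margin and
    clamping the result to the frame boundaries."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * margin, (y2 - y1) * margin
    x1, y1 = max(0, int(x1 - dx)), max(0, int(y1 - dy))
    x2, y2 = min(w, int(x2 + dx)), min(h, int(y2 + dy))
    return frame[y1:y2, x1:x2]
```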
As another optional example, the first training object image may also be obtained by manually cropping it from a video frame of a movie or television episode.
The blur label corresponding to the first training object image indicates whether the target object in that image is blurred. After the first training object image is acquired, its blur label can be obtained by manual annotation.
Illustratively, the blur label may take the value 1 or 0: 1 indicates that the target object is blurred, and 0 that it is sharp. It should be understood that the embodiment of the present application does not limit the specific values of the blur label or their meanings, which may be determined according to the actual situation.
It will be appreciated that the first training object image and its corresponding blur label are used to train the classification network in the blur detection model.
S103: inputting the first training object image into the trained virtual focus prediction network to obtain the image features of the first training object image; the image features include relative depth features of the first training object image.
After the virtual focus prediction network has been trained, the classification network may be trained, based on the first training object image and its corresponding blur label.
It can be understood that, because the network parameters of the trained virtual focus prediction network are already good parameters obtained through training, they can be frozen while the classification network is trained. Freezing means directly using the trained virtual focus prediction network's parameters during classification network training, without retraining the virtual focus prediction network. In this step, for example, the trained virtual focus prediction network can be used directly.
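In PyTorch terms, freezing could look like the following minimal sketch (the two modules are illustrative placeholders, not the patented networks):

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the trained virtual focus prediction network
# and the classification network.
virtual_focus_net = nn.Conv2d(3, 1, kernel_size=3, padding=1)
classifier = nn.Linear(16, 1)

# Freeze the virtual focus prediction network: its parameters receive no
# gradients and are never updated while the classification network trains.
for p in virtual_focus_net.parameters():
    p.requires_grad = False
virtual_focus_net.eval()  # also fixes BatchNorm/Dropout behavior

# Hand only the classification network's parameters to the optimizer.
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
```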
In a specific implementation, the first training object image is input into the trained virtual focus prediction network. During the processing between receiving the first training object image and outputting its corresponding virtual focus image, the network performs feature extraction on the first training object image, thereby producing the image features of the first training object image, which include its relative depth features.
It is known that an imaging device images sharply within the depth of field and blurs in front of and behind it; the depth of field is the range of relative depths within which imaging is sharp. Thus, the relative imaging depth of the pixels in an image bears on whether the target object in the image is blurred. For this reason, the embodiment of the present application fuses the relative depth features of the image into the image features; training the classification network on image features that include relative depth features gives a better training effect and a more accurate trained blur detection model.
S104: inputting the image features into the classification network to obtain the target object blur detection result output by the classification network.
The image features, which include the relative depth features of the first training object image, are input into the classification network, and the target object blur detection result output by the classification network is obtained.
As an optional example, the target object blur detection result predicts whether the target object in the first training object image is blurred. For example, it may be 1 or 0: 1 indicates the target object is predicted to be blurred, 0 that it is predicted to be sharp.
As another optional example, the target object blur detection result may instead express the likelihood that the target object in the first training object image is blurred, represented as a probability. A blur threshold may then be set, and whether the target object is blurred is determined by comparing the detection result to the threshold: a result greater than the threshold indicates the target object is blurred; a result less than or equal to it indicates the target object is sharp. For example, with a blur threshold of 0.6 and a detection result of 0.8, the target object is regarded as blurred, as in the sketch below.
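Purely as an illustration of this thresholding (the 0.6 value mirrors the example above and is not prescribed):

```python
def is_blurred(prob: float, blur_threshold: float = 0.6) -> bool:
    """Map a predicted blur probability to a binary blurred/sharp decision."""
    return prob > blur_threshold

print(is_blurred(0.8))  # True: 0.8 > 0.6, so the target object is judged blurred
```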
It can be understood that the embodiment of the present application does not limit the form of the target object blur detection result, which may be determined according to the actual situation.
Illustratively, the classification network in the embodiment of the present application may be a neural network; its specific structure is not limited, provided it can carry out the classification task.
S105: calculating the detection result loss according to the target object blur detection result and the blur label.
It can be understood that the target object blur detection result is a predicted value and the blur label a true value. The detection result loss is calculated from the two and measures the gap between them.
In a specific implementation, a loss function over the target object blur detection result and the blur label can be constructed; once both are obtained, they are fed into the loss function and a loss value is computed. That loss value is the detection result loss.
It will be appreciated that the embodiment of the present application does not limit the specific formula of the loss function in this step, which may be set as required.
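As one assumed possibility only, since the application leaves the formula open, binary cross-entropy between the predicted blur probability and the 0/1 blur label would serve:

```python
import torch

loss_fn = torch.nn.BCELoss()  # assumes the classifier outputs a probability

prediction = torch.tensor([0.8])  # target object blur detection result
label = torch.tensor([1.0])       # blur label: 1 = blurred, 0 = sharp
detection_result_loss = loss_fn(prediction, label)
```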
S106: training the classification network according to the detection result loss.
In a specific implementation, after the detection result loss is obtained, the classification network is trained with it and the classification network's parameters are adjusted. After each adjustment, whether a first preset condition is satisfied is checked; if so, classification network training is finished and the trained classification network is obtained. If not, the image features of the first training object image are input into the classification network with the adjusted parameters, a new target object blur detection result is obtained, the detection result loss is recalculated from the new result and the blur label, and the network parameters are adjusted again, until the first preset condition is reached.
As an optional example, the first preset condition is that a first number of training iterations is reached or that the detection result loss falls within a first preset range. Neither the first number of iterations nor the first preset range is limited; both can be chosen according to the actual situation. When the first preset condition is reached, the detection result loss is close to 0, the gap between the target object blur detection result and the blur label is small enough that the output is approximately equal to the label, and classification network training ends. The trained classification network, combined with the trained virtual focus prediction network, yields the trained blur detection model.
The trained blur detection model takes a target object image as input, outputs a target object blur detection result, and thereby determines whether the target object in the target object image is blurred. Because the image features fed to the classification network during training include relative depth features, the training of the classification network takes the relative depth information of the image's pixels into account, giving a better training effect and a better-performing blur detection model.
Based on the above description of S101 to S106, the embodiment of the present application provides a training method for a blur detection model comprising a virtual focus prediction network and a classification network. The virtual focus prediction network is trained first, yielding a trained network that extracts image features, including relative depth features, of any target object image input into it. A first training object image and its corresponding blur label, which indicates whether the target object in the image is blurred, are then acquired and used to train the classification network. In a specific implementation, the first training object image is input into the trained virtual focus prediction network, and the image features extracted by it, including the relative depth features representing the image's relative depth information, are obtained. These image features are input into the classification network to obtain a target object blur detection result, and the classification network is trained using the detection result loss computed from the target object blur detection result and the blur label.
It is known that whether an image is blurred is correlated with the relative depth at which it was imaged. In this method the virtual focus prediction network is trained in advance, and the image features it extracts fuse in the relative depth information of the first training object image; that is, relative depth information is introduced into the blur detection process. A classification network trained on these image features therefore detects well, improving the detection accuracy of the blur detection model.
In order to facilitate understanding of the virtual focus prediction network provided in the embodiments of the present application, a training procedure of the virtual focus prediction network is described below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flowchart of training a virtual focus prediction network according to an embodiment of the present application. As shown in fig. 2, the method may include the following steps S201 to S205:
S201: inputting the second training object image into the virtual focus prediction network, and acquiring the image features of the second training object image extracted by the network together with the predicted virtual focus image output by the network; the pixel values in the predicted virtual focus image predict whether the pixels at the same positions in the second training object image lie within the depth of field of the imaging device.
The second training object image is the input image used when training the virtual focus prediction network; it may be the same image as the first training object image or a different one, without limitation. It is obtained similarly to the first training object image, which is not repeated here.
The second training object image is input into the virtual focus prediction network, which both extracts the image features of the second training object image and outputs a virtual focus image. During training, the network outputs a predicted virtual focus image, i.e., a prediction of the virtual focus image. It can be understood that the predicted virtual focus image has the same size as the second training object image, and its pixel values predict whether the pixels at the same positions, i.e., the same pixel coordinates, in the second training object image lie within the depth of field of the imaging device. For example, the pixel value of pixel A in the predicted virtual focus image predicts whether pixel a, at the same pixel position in the second training object image, lies within the depth of field of the imaging device.
It will be appreciated that the imaging device images sharply within the depth of field and blurs in front of and behind it. The depth of field comprises a front depth of field and a rear depth of field, and the point between the two is the focus point, so the depth of field can be understood as a focus region. The virtual focus image is a binary image: for example, a pixel value of 1 indicates that the pixel at the same position in the second training object image lies within the depth of field (or focus region) of the imaging device and is sharp; a value of 0 indicates it does not, and the pixel is blurred.
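A minimal sketch of how such a binary mask relates to the depth of field, assuming a relative depth map and an annotated in-focus interval [near, far] are available (the application itself obtains the real virtual focus image by manual annotation):

```python
import numpy as np

def virtual_focus_mask(rel_depth: np.ndarray, near: float, far: float) -> np.ndarray:
    """Return 1 where a pixel's relative depth lies inside the depth of field
    (in focus, sharp) and 0 where it lies outside (out of focus, blurred)."""
    return ((rel_depth >= near) & (rel_depth <= far)).astype(np.uint8)
```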
As an optional example, the virtual focus prediction network includes a target network layer. After the second training object image is input into the virtual focus prediction network, its image features are extracted by the target network layer. It can be understood that the embodiment of the present application does not limit which layer of the virtual focus prediction network is the target network layer; it may be determined according to the actual situation.
As an optional example, the virtual focus prediction network may be a fully convolutional network, such as a multi-scale fully convolutional network. In that case, the image features of the second training object image are multi-scale image features.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a blur detection model according to an embodiment of the present application. The virtual focus prediction network shown in fig. 3 is a multi-scale fully convolutional network. When training the virtual focus prediction network, the input image in fig. 3 is specifically the second training object image.
In some optional examples, the multi-scale fully convolutional network includes a plurality of first upsampling layers and a plurality of first downsampling layers. In this case, the target network layer in the virtual focus prediction network may comprise the plurality of first upsampling layers.
Specifically, the processing of the multi-scale fully convolutional network comprises downsampling followed by upsampling: downsampling is performed by the first downsampling layers and upsampling by the first upsampling layers. Downsampling shrinks the input image. Upsampling, also known as image interpolation or image magnification, expands the image or feature map so that the final output has the same size as the second training object image.
As shown in fig. 3, downsampling is implemented by four convolutional layers (the four first downsampling layers), and upsampling likewise by four convolutional layers (the four first upsampling layers). It can be understood that the number of convolutional layers in the multi-scale fully convolutional network is not limited in the embodiments of the present application; fig. 3 is only an example.
When the virtual focus prediction network is a multi-scale fully convolutional network, the image features comprise the feature map output by each upsampling convolutional layer, i.e., by each first upsampling layer. Since each first upsampling layer outputs features at a different scale, the image features are multi-scale image features. For example, the multi-scale image features in fig. 3 comprise the four scales of features of the second training object image output by the four upsampling convolutional layers.
It can be seen that when the virtual focus prediction network has the structure shown in fig. 3, once it is trained, the image features fed to the classification network during classification network training comprise the four scales of features of the first training object image. The four scales of features are input into the classification network directly, or concatenated and then input.
It can be understood that when the virtual focus prediction network is a multi-scale fully convolutional network, the extracted image features are multi-scale. Features at different scales represent different feature information, and more scales mean richer feature information, so after the virtual focus prediction network is trained, a classification network trained on the multi-scale image features performs better. A sketch of such a network follows.
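A minimal PyTorch sketch in this spirit, with four downsampling and four upsampling stages (layer widths and kernel sizes are illustrative assumptions, not the patented architecture); it returns both the predicted virtual focus map and the feature map produced by each upsampling layer:

```python
import torch
import torch.nn as nn

class MultiScaleFCN(nn.Module):
    """Four strided convolutions downsample the input; four upsampling stages
    restore its resolution. The four upsampled feature maps form the
    multi-scale image features; a 1x1 convolution yields the virtual focus map.
    Input height and width are assumed divisible by 16."""

    def __init__(self, in_ch: int = 3, base: int = 16):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.down = nn.ModuleList()
        prev = in_ch
        for c in chs:  # four first downsampling layers
            self.down.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.up = nn.ModuleList()
        for c in reversed(chs):  # four first upsampling layers
            self.up.append(nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(prev, c, 3, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.head = nn.Conv2d(prev, 1, 1)  # per-pixel virtual focus logit

    def forward(self, x):
        for layer in self.down:
            x = layer(x)
        feats = []  # one feature map per upsampling layer: four scales
        for layer in self.up:
            x = layer(x)
            feats.append(x)
        return torch.sigmoid(self.head(x)), feats
```

An analogous structure, with its own second downsampling and upsampling layers, could stand in for the relative depth estimation network described below.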
S202: calculating the image loss from the predicted virtual focus image and the real virtual focus image corresponding to the second training object image; the pixel values in the real virtual focus image indicate whether the pixels at the same positions in the second training object image lie within the depth of field of the imaging device.
The real virtual focus image is a label image whose pixel values truly indicate whether the pixels at the same positions in the second training object image lie within the depth of field of the imaging device. In practical applications, the real virtual focus image can be obtained by manual annotation.
After the predicted virtual focus image is obtained, the image loss is calculated from it and the real virtual focus image corresponding to the second training object image. The image loss measures the gap between the predicted and real virtual focus images and is used to train the virtual focus prediction network.
As an optional example, an image loss function may be constructed first; after the predicted and real virtual focus images are acquired, they are fed into the image loss function and the image loss is computed. It can be understood that the embodiment of the present application does not limit the specific formula of the image loss function in this step, which may be set as required.
S203: acquiring the relative depth features of the second training object image.
In an optional embodiment, the embodiment of the present application provides a specific implementation for acquiring the relative depth features of the second training object image, including:
inputting the second training object image into a trained relative depth estimation network, and acquiring the relative depth features of the second training object image extracted by that network.
The relative depth estimation network outputs relative depth features, which can later supervise the training of the virtual focus prediction network. The relative depth features of the second training object image represent its relative depth information.
The relative depth estimation network is pre-trained and is used only during training of the virtual focus prediction network. Once the virtual focus prediction network is trained, the relative depth estimation network is needed neither in the subsequent training nor in the application of the blur detection model.
It should be noted that when the virtual focus prediction network is a multi-scale fully convolutional network, the relative depth estimation network is also a multi-scale fully convolutional network with the same number of scales. As an optional example, the relative depth estimation network comprises a plurality of second upsampling layers and a plurality of second downsampling layers, the number of second upsampling layers matching the number of first upsampling layers and the number of second downsampling layers matching the number of first downsampling layers. Moreover, when the relative depth estimation network is a multi-scale fully convolutional network, the relative depth features are multi-scale relative depth features, likewise output during the network's upsampling stage; specifically, when the multi-scale image features comprise the output of each first upsampling layer, the multi-scale relative depth features comprise the output of each second upsampling layer.
In one possible implementation manner, the embodiment of the present application provides a specific implementation manner of training a relative depth estimation network, which includes:
inputting the third training object image into a relative depth estimation network, and obtaining the predicted relative depth characteristics of the third training object image output by the relative depth estimation network;
calculating depth loss according to the predicted relative depth characteristics and the actual relative depth characteristics of the third training object image;
based on the depth loss, a relative depth estimation network is trained.
It is to be understood that the third training object image may be the same image as the first training object image, the second training object image, or the like, or may be different images, which are not limited herein. The method for obtaining the third training target image is similar to the method for obtaining the first training target image, and will not be described here again.
The predicted relative depth feature is a predicted value and the actual relative depth feature is a tag value. In practical applications, the actual relative depth features may be obtained by manual annotation. The depth loss is used to measure the gap between the predicted relative depth feature and the actual relative depth feature. In specific implementation, a depth loss function can be constructed first, and after the predicted relative depth feature and the actual relative depth feature are obtained, the predicted relative depth feature and the actual relative depth feature are input into the depth loss function, so that the depth loss is obtained. It should be noted that, the embodiment of the present application does not limit a specific formula of the depth loss function, and may be set according to actual situations.
The depth loss is used to train the relative depth estimation network and adjust its network parameters. It is then judged whether a third preset condition is met; if so, training of the relative depth estimation network ends. If not, the step of inputting the third training target object image into the relative depth estimation network and obtaining the predicted relative depth features output by the network, together with the subsequent steps, is executed repeatedly until the third preset condition is reached. The third preset condition is that a third number of training iterations is reached or the depth loss falls within a third preset range. Neither the third number of training iterations nor the third preset range is limited; both may be selected according to the actual situation.
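A hedged sketch of this pre-training loop follows, reusing the MultiScaleFCN skeleton above. The L1 depth loss, the Adam optimizer, and the stopping thresholds are assumptions made for illustration; the application leaves the depth loss formula, the third iteration count, and the third preset range open.

```python
import torch
import torch.nn.functional as F

def train_depth_net(depth_net, loader, max_iters=10000, loss_threshold=1e-3):
    """loader yields (third training target object image, actual relative
    depth map) pairs; the actual maps play the role of label values."""
    opt = torch.optim.Adam(depth_net.parameters(), lr=1e-4)
    it = 0
    for img, actual_depth in loader:
        pred_depth, _ = depth_net(img)           # predicted relative depth
        depth_loss = F.l1_loss(pred_depth, actual_depth)
        opt.zero_grad()
        depth_loss.backward()
        opt.step()
        it += 1
        # third preset condition: iteration budget reached or loss in range
        if it >= max_iters or depth_loss.item() < loss_threshold:
            break
    return depth_net
```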
S204: from the image features and the relative depth features, feature loss is calculated.
Both the image features and the relative depth features may be represented by feature maps. When the image features are represented as feature maps, the relative depth features must also be represented as feature maps, so that the two are expressed in a consistent form.
In this step, the image features are specifically those of the second training target object image, and they are predicted values. So that the relative depth features can serve as supervision information for training the virtual focus prediction network, the embodiment of the present application treats the relative depth features output by the relative depth estimation network as label values. During training of the virtual focus prediction network, the image features, or a selected subset of them, are driven toward the relative depth features; which subset is used is not limited and may be chosen according to actual needs.
Specifically, the feature loss is calculated from the image features and the relative depth features. When the full image features are to approach the relative depth features, the feature loss measures the gap between the image features and the relative depth features; when only a subset of the image features is to approach the relative depth features, it measures the gap between that subset and the relative depth features. Training the virtual focus prediction network and adjusting its network parameters controls the magnitude of the feature loss, thereby driving the image features, or the selected subset, toward the relative depth features.
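A minimal sketch of this feature loss is given below, assuming both networks follow the MultiScaleFCN skeleton above so that the two feature lists align scale by scale. The per-scale mean-squared error and the equal weighting across scales are assumptions; the application does not fix the distance measure.

```python
import torch.nn.functional as F

def feature_loss(image_feats, depth_feats):
    """image_feats: multi-scale image features from the virtual focus
    prediction network (predicted values); depth_feats: multi-scale relative
    depth features from the frozen depth network (label values)."""
    return sum(F.mse_loss(f_img, f_dep.detach())
               for f_img, f_dep in zip(image_feats, depth_feats))
```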
S205: and training the virtual focus prediction network according to the image loss and the characteristic loss.
After the image loss and the feature loss are obtained, the virtual focus prediction network is trained based on them and its network parameters are adjusted. It is then judged whether a second preset condition is met; if so, training ends and the trained virtual focus prediction network is obtained. If not, step S201 and the subsequent steps are repeated until the second preset condition is reached, yielding the trained virtual focus prediction network. The second preset condition is that a second number of training iterations is reached or the training loss falls within a second preset range. Neither the second number of training iterations nor the second preset range is limited; both may be selected according to the actual situation.
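The combined training step might look like the following sketch, reusing the feature_loss function from above. The binary cross-entropy image loss and the equal weighting of the two losses are assumptions; the application leaves both the image loss formula and the loss weighting open.

```python
import torch
import torch.nn.functional as F

def train_focus_step(focus_net, depth_net, img, real_virtual_focus, opt):
    """One training step: img is a second training target object image,
    real_virtual_focus its per-pixel depth-of-field label map (0/1 floats)."""
    pred_focus, image_feats = focus_net(img)
    with torch.no_grad():                      # frozen supervision network
        _, depth_feats = depth_net(img)
    image_loss = F.binary_cross_entropy_with_logits(pred_focus,
                                                    real_virtual_focus)
    feat_loss = feature_loss(image_feats, depth_feats)
    loss = image_loss + feat_loss              # equal weighting is assumed
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```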
It can be understood that training the virtual focus prediction network with the image loss enables the trained network to output more accurate virtual focus images, and hence to extract more accurate image features. Training it with the feature loss ensures that the image features of target object images extracted by the trained network include the relative depth features of those images, i.e. the relative depth information is fused in, which in turn improves the subsequent training of the classification network. Specifically, if during training a target network layer of the virtual focus prediction network extracts the image features of the second training target object image, then that same target network layer extracts the image features of the target object image when the trained network is used.
Based on the above related content of S201-S205: the second training target object image is input into the virtual focus prediction network, and the image features of the second training target object image extracted by the network and the predicted virtual focus image output by the network are obtained. The image loss is calculated from the predicted virtual focus image and the real virtual focus image corresponding to the second training target object image. The second training target object image is input into the trained relative depth estimation network, and the relative depth features extracted by that network are obtained. The feature loss is calculated from the image features and the relative depth features. The virtual focus prediction network is then trained according to the image loss and the feature loss. In this method, the relative depth estimation network is introduced into the training of the virtual focus prediction network, and the relative depth features it extracts serve as supervision information, so that the image features extracted by the trained virtual focus prediction network fuse the relative depth information of the input image, which facilitates the subsequent training of the classification network.
Based on the above method embodiments, after the fuzzy detection model is trained, it is applied to perform blur detection on target objects in video frames. Referring to fig. 4, fig. 4 is a flowchart of a blur detection method according to an embodiment of the present application; as shown in fig. 4, the method may include S401-S402:
s401: at least one target image is acquired.
In one possible implementation, the embodiment of the present application provides a specific implementation of acquiring at least one target object image, including:
acquiring a video frame to be detected;
inputting the video frame to be detected into a target object detection network, and obtaining a target object detection result of the video frame to be detected;
based on the target object detection result, cropping the video frame to be detected to obtain at least one target object image in the video frame to be detected.
Specifically, the video frame to be detected is input into the target object detection network to obtain the target object detection result of the video frame. The target object detection result may be a preselected frame marked in the video frame to be detected, in which the target object is framed. Then, based on the target object detection result, the video frame to be detected is cropped to obtain at least one target object image. The cropped target object image may be the image framed by the preselected frame, or that image expanded by a partial surrounding region.
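As a sketch, the crop step could be implemented as follows, assuming the detector returns preselected frames as (x1, y1, x2, y2) pixel coordinates and the frame is an H×W×3 array; the margin parameter implements the optional expansion by a surrounding partial region.

```python
def crop_targets(frame, boxes, margin=0.1):
    """frame: H x W x 3 image array; boxes: preselected frames from the
    target object detection network. Returns one target object image per
    box, optionally expanded by `margin` of the box size on each side."""
    h, w = frame.shape[:2]
    crops = []
    for x1, y1, x2, y2 in boxes:
        dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
        xa, ya = max(0, x1 - dx), max(0, y1 - dy)
        xb, yb = min(w, x2 + dx), min(h, y2 + dy)
        crops.append(frame[ya:yb, xa:xb])      # clamp to frame boundaries
    return crops
```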
The video frame to be detected may be one or more video frames in a movie. The target object detection network may be, for example, a commodity detection network; the target object to be detected is then a commodity in the video frame, and the acquired target object image is a commodity image.
S402: Inputting the target object image into the fuzzy detection model to obtain a target object fuzzy detection result output by the fuzzy detection model; the target object fuzzy detection result is used for indicating whether the target object in the target object image is blurred.
The fuzzy detection model consists of a virtual focus prediction network and a classification network; the fuzzy detection model is trained according to the training method of the fuzzy detection model described in any of the embodiments above.
In a specific implementation, the target object image is first input into the virtual focus prediction network, and the image features of the target object image extracted by the virtual focus prediction network are acquired. The image features of the target object image are then input into the classification network to obtain the target object fuzzy detection result output by the classification network.
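A hedged sketch of this inference path follows. The classifier interface, the choice of the final-scale feature, the sigmoid threshold of 0.5, and the string labels are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def detect_blur(focus_net, classifier, target_img):
    """target_img: a C x H x W tensor for one target object image;
    classifier: the trained classification network, assumed here to map the
    final-scale image features to a single blur logit."""
    _, image_feats = focus_net(target_img.unsqueeze(0))
    logit = classifier(image_feats[-1])        # classify on image features
    return "blurred" if torch.sigmoid(logit).item() > 0.5 else "clear"
```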
When the target object fuzzy detection result is blurred, the target object in the video frame to be detected is marked as blurred; when the result is clear, the target object is marked as clear. When the target object is a commodity, the video frames can be processed according to the fuzzy detection results so that out-of-focus commodities appear less often in the video frames, commodity recommendations better meet user needs, and user experience is improved.
It can be appreciated that, for the technical implementation of the virtual focus prediction network and the classification network, reference may be made to the above method embodiments, which are not described herein.
It will be appreciated by those skilled in the art that, in the methods of the above specific embodiments, the written order of the steps does not imply a strict order of execution; the actual order should be determined by the functions of the steps and their possible internal logic.
Based on the training method of the fuzzy detection model provided by the above method embodiments, the embodiment of the present application further provides a training device for the fuzzy detection model, which is described below with reference to the accompanying drawings. Because the principle by which the device solves the problem is similar to that of the training method of the fuzzy detection model in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 5, the structure diagram of a training device for a fuzzy detection model according to an embodiment of the present application is shown, where the fuzzy detection model includes a virtual focus prediction network and a classification network. As shown in fig. 5, the training device of the blur detection model includes:
A first obtaining unit 501, configured to obtain a virtual focus prediction network after training is completed; the virtual focus prediction network after training is used for extracting the image characteristics of the target object image input into the virtual focus prediction network; the image features include relative depth features of the target image;
a second obtaining unit 502, configured to obtain a first training target object image and a fuzzy label corresponding to the first training target object image;
a first input unit 503, configured to input the first training target image into the trained virtual focus prediction network, and obtain an image feature of the first training target image; the image features include relative depth features of the first training object image;
a second input unit 504, configured to input the image feature into a classification network, and obtain a target object fuzzy detection result output by the classification network;
a calculating unit 505, configured to calculate a detection result loss according to the target object fuzzy detection result and the fuzzy label;
and an execution unit 506, configured to train the classification network according to the detection result loss.
In an alternative embodiment, the apparatus further comprises: virtual focus prediction network training unit;
The virtual focus prediction network training unit comprises:
the first input subunit is used for inputting a second training target object image into the virtual focus prediction network, and acquiring the image features of the second training target object image extracted by the virtual focus prediction network and the predicted virtual focus image output by the virtual focus prediction network; the pixel values of the pixel points in the predicted virtual focus image are used for predicting whether the pixel points at the same positions in the second training target object image are within the depth of field corresponding to the imaging device;
the first calculating subunit is used for calculating an image loss according to the predicted virtual focus image and the real virtual focus image corresponding to the second training target object image; the pixel values of the pixel points in the real virtual focus image are used for indicating whether the pixel points at the same positions in the second training target object image are within the depth of field corresponding to the imaging device;
a first obtaining subunit, configured to obtain a relative depth feature of the second training target object image;
a second computing subunit for computing a feature loss from the image feature and the relative depth feature;
a first training subunit, configured to train the virtual focus prediction network according to the image loss and the feature loss;
The image features of the target object image extracted by the virtual focus prediction network after training comprise the relative depth features of the target object image.
In an alternative embodiment, the first acquisition subunit includes:
the second input subunit is used for inputting the second training target object image into a relative depth estimation network after training is completed, and acquiring the relative depth characteristics of the second training target object image extracted by the relative depth estimation network;
the apparatus further comprises: a relative depth estimation network training unit;
the relative depth estimation network training unit comprises:
the second acquisition subunit is used for inputting a third training target object image into the relative depth estimation network, and acquiring the predicted relative depth features of the third training target object image output by the relative depth estimation network;
a third calculation subunit, configured to calculate a depth loss according to the predicted relative depth feature and an actual relative depth feature of the third training target object image;
and the second training subunit is used for training the relative depth estimation network according to the depth loss.
In an optional implementation, the virtual focus prediction network is a multi-scale full convolution network comprising a plurality of first upsampling layers and a plurality of first downsampling layers, and the image features are multi-scale image features comprising the image features output by each first upsampling layer;
The relative depth estimation network is a multi-scale full convolution network, the relative depth estimation network comprises a plurality of second up-sampling layers and a plurality of second down-sampling layers, and the relative depth features are multi-scale relative depth features and comprise image features output by each second up-sampling layer.
Based on the blur detection method provided by the above method embodiment, the embodiment of the present application further provides a blur detection device, which is described below with reference to the accompanying drawings. Because the principle by which the device solves the problem is similar to that of the blur detection method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a blur detection device according to an embodiment of the present application. As shown in fig. 6, the blur detection device includes:
an acquiring unit 601, configured to acquire at least one target image;
an input unit 602, configured to input the target object image into a fuzzy detection model and obtain a target object fuzzy detection result output by the fuzzy detection model; the target object fuzzy detection result is used for indicating whether the target object in the target object image is blurred;
The fuzzy detection model consists of a virtual focus prediction network and a classification network; the fuzzy detection model is trained according to the training method of the fuzzy detection model.
In an alternative embodiment, the obtaining unit 601 includes:
the first acquisition subunit is used for acquiring the video frame to be detected;
the second acquisition subunit is used for inputting the video frame to be detected into a target object detection network and acquiring a target object detection result of the video frame to be detected;
and the cropping subunit is used for cropping the video frame to be detected based on the target object detection result, to obtain at least one target object image in the video frame to be detected.
In addition, an embodiment of the present application provides an electronic device, including:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the training method of the fuzzy detection model described in any of the above embodiments, or the blur detection method described in any of the above embodiments.
In addition, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the training method of the fuzzy detection model described in any of the above embodiments, or the blur detection method described in any of the above embodiments.
From the above description of the embodiments, those skilled in the art will clearly understand that all or part of the steps of the methods in the above embodiments may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present application.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts, the embodiments may refer to one another. Since the method disclosed in an embodiment corresponds to the system disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the system.
It should also be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of training a fuzzy detection model, the fuzzy detection model comprising a virtual focus prediction network and a classification network, the method comprising:
obtaining a virtual focus prediction network after training is completed; the virtual focus prediction network after training is used for extracting the image characteristics of the target object image input into the virtual focus prediction network; the image features include relative depth features of the target image;
acquiring a first training target object image and a fuzzy label corresponding to the first training target object image;
inputting the first training target object image into the virtual focus prediction network after training is completed, and obtaining the image characteristics of the first training target object image; the image features include relative depth features of the first training object image;
inputting the image characteristics into a classification network, and obtaining a target object fuzzy detection result output by the classification network;
calculating the loss of the detection result according to the fuzzy detection result of the target object and the fuzzy label;
training the classification network according to the detection result loss.
2. The method of claim 1, wherein the virtual focus prediction network training process comprises:
inputting a second training target object image into the virtual focus prediction network, and acquiring the image features of the second training target object image extracted by the virtual focus prediction network and a predicted virtual focus image output by the virtual focus prediction network; the pixel values of the pixel points in the predicted virtual focus image are used for predicting whether the pixel points at the same positions in the second training target object image are within the depth of field corresponding to the imaging device;
calculating an image loss according to the predicted virtual focus image and the real virtual focus image corresponding to the second training target object image; the pixel values of the pixel points in the real virtual focus image are used for indicating whether the pixel points at the same positions in the second training target object image are within the depth of field corresponding to the imaging device;
acquiring the relative depth characteristics of the second training target object image;
calculating a feature loss from the image features and the relative depth features;
training the virtual focus prediction network according to the image loss and the feature loss;
the image features of the target object image extracted by the virtual focus prediction network after training comprise the relative depth features of the target object image.
3. The method of claim 2, wherein the acquiring the relative depth features of the second training object image comprises:
inputting the second training target object image into a relative depth estimation network after training is completed, and acquiring relative depth characteristics of the second training target object image extracted by the relative depth estimation network;
the training process of the relative depth estimation network comprises the following steps:
inputting a third training target object image into the relative depth estimation network, and acquiring predicted relative depth characteristics of the third training target object image output by the relative depth estimation network;
calculating depth loss according to the predicted relative depth characteristics and the actual relative depth characteristics of the third training target object image;
and training the relative depth estimation network according to the depth loss.
4. The method of claim 1, wherein the virtual focus prediction network is a multi-scale full convolution network, the virtual focus prediction network comprising a plurality of first upsampling layers and a plurality of first downsampling layers, the image features being multi-scale image features comprising image features output by each of the first upsampling layers;
The relative depth estimation network is a multi-scale full convolution network, the relative depth estimation network comprises a plurality of second up-sampling layers and a plurality of second down-sampling layers, and the relative depth features are multi-scale relative depth features and comprise image features output by each second up-sampling layer.
5. A method of blur detection, the method comprising:
acquiring at least one target object image;
inputting the target object image into a fuzzy detection model to obtain a target object fuzzy detection result output by the fuzzy detection model; the target object fuzzy detection result is used for indicating whether a target object in the target object image is blurred;
the fuzzy detection model consists of a virtual focus prediction network and a classification network; the fuzzy detection model is trained according to the training method of the fuzzy detection model of any one of claims 1-4.
6. The method of claim 5, wherein the acquiring at least one target object image comprises:
acquiring a video frame to be detected;
inputting the video frame to be detected into a target object detection network, and obtaining a target object detection result of the video frame to be detected;
and based on the target object detection result, cropping the video frame to be detected to obtain at least one target object image in the video frame to be detected.
7. A training apparatus for a blur detection model, the blur detection model comprising a virtual focus prediction network and a classification network, the apparatus comprising:
the first acquisition unit is used for acquiring a virtual focus prediction network after training is completed; the virtual focus prediction network after training is used for extracting the image characteristics of the target object image input into the virtual focus prediction network; the image features include relative depth features of the target image;
the second acquisition unit is used for acquiring a first training object image and a fuzzy label corresponding to the first training object image;
the first input unit is used for inputting the first training target object image into the virtual focus prediction network after training is completed, and obtaining the image characteristics of the first training target object image; the image features include relative depth features of the first training object image;
the second input unit is used for inputting the image characteristics into a classification network and obtaining a target object fuzzy detection result output by the classification network;
The calculating unit is used for calculating the loss of the detection result according to the fuzzy detection result of the target object and the fuzzy label;
and the execution unit is used for training the classification network according to the detection result loss.
8. A blur detection device, the device comprising:
an acquisition unit configured to acquire at least one target image;
the input unit is used for inputting the target object image into a fuzzy detection model to obtain a target object fuzzy detection result output by the fuzzy detection model; the target object fuzzy detection result is used for indicating whether a target object in the target object image is blurred;
the fuzzy detection model consists of a virtual focus prediction network and a classification network; the fuzzy detection model is trained according to the training method of the fuzzy detection model of any one of claims 1-4.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the training method of the fuzzy detection model according to any one of claims 1-4, or the blur detection method according to any one of claims 5-6.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the training method of the fuzzy detection model according to any one of claims 1-4, or the blur detection method according to any one of claims 5-6.

