Detailed Description
The present application will be described in further detail below with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein merely illustrate the invention and do not restrict it. It should also be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with each other in the absence of conflict. The present application is described in detail below with reference to the embodiments and the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which the method for generating a model or the apparatus for generating a model of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a video recording application, a video playing application, a voice interaction application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented either as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
When the terminal devices 101, 102, 103 are hardware, an image capturing device may be mounted thereon. The image acquisition device can be various devices capable of realizing the function of acquiring images, such as a camera, a sensor and the like. The user may capture video using an image capture device on the terminal device 101, 102, 103.
The server 105 may be a server that provides various services, such as a video processing server for storing, managing, or analyzing videos uploaded by the terminal devices 101, 102, 103. The video processing server may obtain a sample set containing a large number of samples. The samples in the sample set may include sample videos, first annotation information indicating whether a sample video belongs to a low-quality video, and second annotation information indicating the low-quality category of a sample video that belongs to a low-quality video. In addition, the video processing server may train an initial model using the samples in the sample set and may store the training result (e.g., the generated low-quality video detection model). In this way, after a user uploads a video using the terminal devices 101, 102, and 103, the server 105 can detect whether the uploaded video is a low-quality video and can perform operations such as pushing prompt information.
The server 105 may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
It should be noted that the method for generating the model provided in the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for generating the model is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating a model according to the present application is shown. The method for generating the model comprises the following steps:
step 201, a sample set is obtained.
In the present embodiment, the execution subject of the method for generating a model (e.g., the server 105 shown in Fig. 1) may obtain a sample set in a variety of ways. For example, the execution subject may obtain an existing sample set from another server used for storing samples (e.g., a database server) through a wired or wireless connection. As another example, a user may collect samples via a terminal device (e.g., the terminal devices 101, 102, 103 shown in Fig. 1). In this way, the execution subject may receive the samples collected by the terminal and store them locally, thereby generating a sample set. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a ZigBee connection, a UWB (Ultra-Wideband) connection, and other wireless connection means now known or developed in the future.
Here, the sample set may include a large number of samples. A sample may include a sample video and first annotation information indicating whether the sample video belongs to a low-quality video. For example, when the sample video belongs to a low-quality video, the first annotation information may be "1"; when it does not, the first annotation information may be "0". When the sample video in a sample belongs to a low-quality video, the sample further includes second annotation information indicating the low-quality category of the sample video.
Note that a low-quality video generally refers to a video whose quality falls below an acceptable standard. For example, low-quality videos may include, but are not limited to, blurred videos, black-screen videos, recorded-screen videos, and the like. Accordingly, the low-quality categories may include, but are not limited to, a blurred video category, a black-screen video category, a recorded-screen video category, and the like.
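To make the annotation scheme concrete, the sketch below shows what a sample might look like in code. The field names and category labels are purely illustrative assumptions, not a format prescribed by the present application:

```python
# Hypothetical sample records illustrating the annotation scheme described above.
sample_low_quality = {
    "video": "clip_001.mp4",           # path to the sample video
    "is_low_quality": 1,               # first annotation: 1 = low quality, 0 = not
    "low_quality_category": "blurred"  # second annotation: present only for low-quality videos
}

sample_normal = {
    "video": "clip_002.mp4",
    "is_low_quality": 0,               # not low quality, so no second annotation
}

def has_second_annotation(sample):
    """A sample carries second annotation info only if it is a low-quality video."""
    return sample.get("is_low_quality") == 1 and "low_quality_category" in sample
```

Under this convention, the presence of the second annotation can be checked uniformly for any sample, which matters later when computing the two-part loss.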
At step 202, samples are taken from a sample set.
In this embodiment, the execution subject may extract samples from the sample set acquired in step 201 and perform the training steps of steps 203 to 206. The present application does not limit the manner of extracting samples. For example, samples may be extracted randomly, or the samples to be extracted may be taken from the sample set in a specified order.
Step 203, inputting the frames in the sample video in the extracted sample into the initial model, and respectively obtaining the probability that the sample video belongs to the low-quality video and each low-quality category.
In this embodiment, the execution subject may input the frames in the sample video in the sample extracted in step 202 to the initial model, and the initial model may output the probability that the sample video belongs to the low-quality video by performing feature extraction, analysis, and the like on the frames in the video, and may output the probability that the sample video belongs to each low-quality category. The probability that a sample video belongs to each low-quality category may be understood as a conditional probability that the sample video belongs to each low-quality category when the sample video belongs to a low-quality video.
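Since the category probabilities are conditional on the video being low quality, an unconditional (joint) category probability can be recovered by multiplication. A minimal illustration with made-up numbers:

```python
# P(category) = P(low quality) * P(category | low quality); all values made up.
p_low = 0.8                          # model output: probability the video is low quality
p_cat_given_low = [0.5, 0.3, 0.2]    # conditional probability per low-quality category

p_cat_joint = [p_low * p for p in p_cat_given_low]

# The joint category probabilities sum to P(low quality) itself.
assert abs(sum(p_cat_joint) - p_low) < 1e-12
```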
In the present embodiment, the initial model may be any of various models, created based on machine learning techniques, that have an image feature extraction function and a classification function. The initial model can extract the features of the frames in a video, perform processing such as fusion and analysis on the extracted features, and finally output the probability that the sample video belongs to a low-quality video and the probability of each low-quality category. In practice, during training, the probabilities output by the initial model are often inaccurate. The purpose of training the initial model is to make the probabilities it outputs after training more accurate.
By way of example, the initial model may be a convolutional neural network using any of various existing structures (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.). In practice, a convolutional neural network (CNN) is a feed-forward neural network whose artificial neurons can respond to surrounding units within their coverage range; it performs excellently on image processing, so a convolutional neural network can be used to extract frame features from the sample video. In this example, the constructed convolutional neural network may contain convolutional layers, pooling layers, a feature fusion layer, fully connected layers, and the like. The convolutional layers may be used to extract image features. The pooling layers may be used to down-sample the incoming information. The feature fusion layer may be configured to fuse the obtained image features (for example, in the form of a feature matrix or a feature vector) corresponding to each frame. For example, feature values at the same position in the feature matrices corresponding to different frames may be averaged to perform feature fusion, thereby generating a fused feature matrix. The fully connected layers may be used to classify the resulting features.
It will be appreciated that the initial model may output both the probability that the sample video belongs to a low-quality video and the probability that it belongs to each low-quality category. Thus, the fully connected layer may be made up of two parts: one part may output the probability that the sample video belongs to a low-quality video, and the other part may output the probability that the sample video belongs to each low-quality category. In practice, each part may use an independent softmax function to compute its probabilities.
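A minimal sketch of the two-part output head described above, using a plain-Python softmax. The logit values are invented for illustration; in the actual model they would come from the fully connected layer:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Part 1: two logits -> (low quality, not low quality).
# Part 2: one logit per low-quality category (blurred, black screen, recorded screen).
# Logit values here are made up for illustration only.
binary_logits = [2.0, 0.5]
category_logits = [1.0, 0.2, -0.3]

p_low_quality = softmax(binary_logits)[0]  # P(video is low quality)
p_categories = softmax(category_logits)    # conditional P(category | low quality)
```

Because the two parts apply softmax independently, each output normalizes to 1 on its own; the category probabilities do not compete with the binary decision.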
The initial model may be another model having an image feature extraction function and a classification function, and is not limited to the above example, and the specific model structure is not limited herein.
And step 204, determining a loss value of the sample based on the extracted labeling information in the sample, the obtained probability and a pre-established loss function.
In this embodiment, the execution subject may determine the loss value of the sample based on the annotation information in the extracted sample (including the first annotation information and the second annotation information), the obtained probabilities, and a pre-established loss function. In practice, a loss function can be used to measure the degree of inconsistency between the information output by the initial model (e.g., a probability) and the true value (e.g., the annotation information). In general, the smaller the value of the loss function (the loss value), the better the robustness of the model. The loss function may be set according to actual requirements.
In this embodiment, the loss function may be set to take two parts of the loss into account (e.g., it may be set to the sum of the two parts, or to a weighted combination of them). One part of the loss may be used to characterize the degree of difference between the probability, output by the initial model, that the sample video belongs to a low-quality video and the true value (e.g., when the first annotation information indicates that the sample video is a low-quality video, the true value is 1; otherwise, it is 0). The other part of the loss may be used to characterize the degree of difference between the probability, output by the initial model, that the sample video belongs to the low-quality category indicated by the second annotation information and the true value (e.g., 1). It should be noted that, when the extracted sample does not contain second annotation information, this part of the loss may be set to a preset value (e.g., 0). In practice, each of the two parts can be calculated using the cross-entropy loss.
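The two-part loss can be sketched as follows. This is an illustrative pure-Python reconstruction under the cross-entropy choice mentioned above; the function names, the weights `w1`/`w2`, and the clamping constant are assumptions:

```python
import math

def cross_entropy(p_true_class):
    """Cross-entropy loss for the probability the model assigns to the true class."""
    return -math.log(max(p_true_class, 1e-12))  # clamp to avoid log(0)

def sample_loss(p_low, first_annotation, p_target_category=None, w1=1.0, w2=1.0):
    """Two-part loss: binary low-quality loss plus an optional category loss.

    p_low: model probability that the video is low quality.
    p_target_category: model probability for the annotated category, or None
    when the sample carries no second annotation (that part is then 0).
    The weights w1/w2 reflect that the application allows a plain sum
    or a weighted combination of the two parts.
    """
    p_true = p_low if first_annotation == 1 else 1.0 - p_low
    loss1 = cross_entropy(p_true)
    loss2 = cross_entropy(p_target_category) if p_target_category is not None else 0.0
    return w1 * loss1 + w2 * loss2
```

For example, `sample_loss(0.9, 1, 0.8)` scores a low-quality sample with second annotation, while `sample_loss(0.9, 0)` scores a normal sample using only the first part.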
In some optional implementations of this embodiment, the executing entity may determine the loss value of the sample according to the following steps:
In the first step, the first annotation information in the extracted sample and the probability that the sample video belongs to a low-quality video are input to a pre-established first loss function to obtain a first loss value. Here, the first loss function may be used to characterize the degree of difference between the probability, output by the initial model, that the sample video belongs to a low-quality video and the first annotation information. In practice, the first loss function may use the cross-entropy loss.
In the second step, in response to determining that the second label information is not included in the extracted sample, the first loss value may be determined as a loss value of the extracted sample.
Optionally, in the foregoing implementation manner, in response to determining that the extracted sample includes the second annotation information, the executing entity may perform the following steps to determine a loss value of the sample: first, a low-quality category indicated by the second label information in the extracted sample may be taken as the target category. Then, the second annotation information included in the extracted sample and the probability that the sample video output by the initial model belongs to the target category may be input to a second loss function established in advance to obtain a second loss value. Here, the second loss function may be used to characterize a degree of difference between a probability that the sample video output by the initial model belongs to the target class (i.e., the low-quality class indicated by the second annotation information) and a true value (e.g., 1). In practice, the second loss function may also use cross-entropy loss. Then, the sum of the first loss value and the second loss value may be determined as the loss value of the extracted sample. Here, the loss value of the sample may also be obtained in other manners. For example, a result of weighting the first loss value and the above-described second loss value is determined as the loss value of the extracted sample. Wherein the weights may be preset by the technician as needed.
Step 205, based on the comparison of the loss value and the target value, it is determined whether the initial model is trained completely.
In this embodiment, the execution subject may determine whether training of the initial model is complete based on a comparison of the determined loss value with a target value. As an example, the execution subject may determine whether the loss value has converged; when the loss value is determined to have converged, it may be determined that the initial model at this point has been trained. As yet another example, the execution subject may first compare the loss value with the target value. In response to determining that the loss value is less than or equal to the target value, it may count, among the loss values determined in a preset number of most recent training steps (e.g., the last 100 steps), the ratio of those less than or equal to the target value to that preset number. When this ratio is greater than a preset ratio (e.g., 95%), it may be determined that training of the initial model is complete. It should be noted that multiple samples (at least two) may be extracted in step 202. In that case, the loss value of each sample may be calculated through the operations described in steps 202 to 204, and the execution subject may compare the loss value of each sample with the target value to determine whether each loss value is less than or equal to the target value. It is also noted that the target value can generally be used to represent the ideal degree of inconsistency between the predicted value and the true value; that is, when the loss value is less than or equal to the target value, the predicted value may be considered close to the true value. The target value can be set according to actual requirements.
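The ratio-based completion check in the second example can be sketched as below. The window size (100) and ratio (95%) mirror the figures given above, while the function name is an assumption:

```python
def training_complete(recent_losses, target, window=100, ratio=0.95):
    """Decide training completion: among the last `window` loss values,
    the fraction at or below `target` must exceed `ratio`.
    Defaults follow the examples in the text (last 100 steps, 95%)."""
    window_losses = list(recent_losses)[-window:]
    if len(window_losses) < window:
        return False  # not enough history yet to decide
    hits = sum(1 for v in window_losses if v <= target)
    return hits / window > ratio
```

A check of this form tolerates occasional outlier batches, unlike stopping on a single loss value dipping below the target.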
It is noted that, in response to determining that training of the initial model is complete, execution may continue to step 206. In response to determining that training is not complete, the parameters of the initial model may be updated based on the determined loss value, samples may be re-extracted from the sample set, and the training step may be continued using the initial model with the updated parameters. Here, the gradient of the loss value with respect to the model parameters may be found using a back-propagation algorithm, and the model parameters may then be updated based on the gradient using a gradient descent algorithm. It should be noted that the back-propagation algorithm, the gradient descent algorithm, and machine learning methods are well-known technologies that are widely researched and applied at present, and are not described in detail here. It should also be noted that the present application does not limit the manner of sample extraction. For example, when the sample set contains a large number of samples, the execution subject may extract a sample that has not been extracted before.
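The update rule (back propagation for the gradient, then gradient descent on the parameters) reduces, for a single parameter, to the loop below. The quadratic stand-in loss, learning rate, and step count are illustrative assumptions:

```python
# Minimize a stand-in loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# In the actual method, the gradient would come from back propagation
# through the initial model rather than a closed form.
w = 0.0                            # initial parameter value
learning_rate = 0.1
for _ in range(200):
    grad = 2 * (w - 3.0)           # gradient of the loss w.r.t. w
    w = w - learning_rate * grad   # gradient descent update

# w converges toward 3.0, the minimizer of the stand-in loss
assert abs(w - 3.0) < 1e-6
```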
And step 206, in response to determining that the training of the initial model is completed, determining the trained initial model as a low-quality video detection model.
In this embodiment, in response to determining that the initial model training is completed, the executing entity may determine the trained initial model as a low-quality video detection model. The low-quality video detection model can detect whether the video is a low-quality video or not, and can detect a low-quality category of the low-quality video.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating a model according to the present embodiment. In the application scenario of fig. 3, a terminal device 301 used by a user may have a model training application installed thereon. After a user opens the application and uploads a sample set or a storage path of the sample set, a server 302 providing background support for the application may run a method for generating a low quality video detection model, comprising:
first, a sample set may be obtained. Wherein the samples in the sample set may include a sample video 303, first annotation information 304 for indicating whether the sample video belongs to a low-quality video, and second annotation information 305 for indicating a low-quality category of the sample video belonging to the low-quality video. Thereafter, samples may be extracted from the set of samples, performing the following training steps: inputting frames in the sample video in the extracted samples into the initial model 306, and respectively obtaining the probability that the sample video belongs to the low-quality video and each low-quality category; determining a loss value 307 of the sample based on the labeling information in the extracted sample, the obtained probability and a pre-established loss function; and determining whether the initial model is trained completely based on the comparison between the loss value and the target value. If the initial model training is complete, the trained initial model is determined to be a low-quality video detection model 308.
The method provided by the above embodiment of the present application may extract a sample from a sample set by obtaining the sample set to perform training of an initial model. The samples in the sample set may include sample videos, first annotation information indicating whether the sample videos belong to low-quality videos, and second annotation information indicating low-quality categories of the sample videos belonging to the low-quality videos. In this way, the frame of the sample video in the extracted samples is input to the initial model, so that the probability that the sample video output by the initial model belongs to the low-quality video and each low-quality category can be obtained. Then, based on the labeling information in the extracted sample, the obtained probability and a pre-established loss function, the loss value of the sample can be determined. Thereafter, based on the comparison of the loss value with the target value, it may be determined whether the initial model is trained. If the initial model training is completed, the trained initial model can be determined to be a low-quality video detection model. Thus, a model can be obtained that can be used for low-quality video detection, which helps to improve the efficiency of low-quality video detection.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for generating a model is shown. The process 400 of the method for generating a model includes the steps of:
step 401, a sample set is obtained.
In this embodiment, an executing agent of the method for generating a model (e.g., server 105 shown in FIG. 1) may obtain a sample set. Wherein the sample may include a sample video and first annotation information indicating whether the sample video belongs to a low-quality video. When a sample video in the sample belongs to a low-quality video, the sample further includes second annotation information indicating a low-quality category of the sample video.
At step 402, samples are taken from a sample set.
In the present embodiment, the execution subject may extract samples from the sample set acquired in step 401 and perform the training steps of steps 403 to 410. The present application does not limit the manner of extracting samples. For example, samples may be extracted randomly, or the samples to be extracted may be taken from the sample set in a specified order.
Step 403, inputting the frames in the sample video in the extracted sample into the initial model, and obtaining the probability that the sample video belongs to the low-quality video and each low-quality category respectively.
In this embodiment, the execution subject may input the frames in the sample video in the sample extracted in step 402 to the initial model, and the initial model may output the probability that the sample video belongs to the low-quality video by performing feature extraction, analysis, and the like on the frames in the video, and may output the probability that the sample video belongs to each low-quality category.
In this embodiment, the initial model may use a convolutional neural network created based on machine learning techniques. The constructed convolutional neural network may include convolutional layers, pooling layers, a feature fusion layer, fully connected layers, and the like. The fully connected layer may be made up of two parts: one part may output the probability that the sample video belongs to a low-quality video, and the other part may output the probability that the sample video belongs to each low-quality category. In practice, each part may use an independent softmax function to compute its probabilities.
Step 404, inputting the first annotation information in the extracted sample and the probability that the sample video belongs to the low-quality video into a first loss function established in advance to obtain a first loss value.
In this embodiment, the execution subject may input the first annotation information in the extracted sample and the probability, output in step 403, that the sample video belongs to a low-quality video to a pre-established first loss function, so as to obtain a first loss value. Here, the first loss function may be used to characterize the degree of difference between the probability, output by the initial model, that the sample video belongs to a low-quality video and the first annotation information. In practice, the first loss function may use the cross-entropy loss.
Step 405, determining whether the extracted sample contains the second labeling information.
In this embodiment, the execution subject may determine whether the extracted sample includes the second annotation information. If not, step 406 may be performed to determine the loss value of the sample. If so, steps 407 and 408 may be performed to determine the loss value of the sample.
Step 406, in response to determining that the second annotation information is not included in the extracted sample, determining the first loss value as the loss value of the extracted sample.
In this embodiment, in response to determining that the second label information is not included in the extracted sample, the execution body may determine the first loss value as the loss value of the extracted sample.
Step 407, in response to determining that the extracted sample includes the second annotation information, taking the low-quality category indicated by the second annotation information in the extracted sample as the target category, and inputting the second annotation information included in the extracted sample and the probability that the sample video belongs to the target category to a second loss function established in advance to obtain a second loss value.
In this embodiment, in response to determining that the extracted sample includes the second annotation information, the execution subject may take the low-quality category indicated by the second annotation information in the extracted sample as the target category, and input the second annotation information included in the extracted sample and the probability that the sample video belongs to the target category to a second loss function established in advance, so as to obtain a second loss value. Here, the second loss function may be used to characterize the degree of difference between the probability that the sample video output by the initial model belongs to the target class and the true value (e.g., 1). In practice, the second loss function may also use cross-entropy loss.
Step 408, determine a sum of the first loss value and the second loss value as a loss value of the extracted sample.
In this embodiment, the execution body may determine a sum of the first loss value and the second loss value as a loss value of the extracted sample.
Step 409, determining whether the initial model is trained based on the comparison of the loss value and the target value.
In this embodiment, the execution subject may determine whether training of the initial model is complete based on a comparison of the determined loss value with a target value. As an example, the execution subject may determine whether the loss value has converged; when the loss value is determined to have converged, it may be determined that the initial model at this point has been trained. As another example, the execution subject may first compare the loss value with the target value. In response to determining that the loss value is less than or equal to the target value, it may count, among the loss values determined in a preset number of most recent training steps (e.g., the last 100 steps), the ratio of those less than or equal to the target value to that preset number. When this ratio is greater than a preset ratio (e.g., 95%), it may be determined that training of the initial model is complete. It should be noted that the target value can generally be used to represent the ideal degree of inconsistency between the predicted value and the true value; that is, when the loss value is less than or equal to the target value, the predicted value may be considered close to the true value. The target value can be set according to actual requirements.
It is noted that, in response to determining that training of the initial model is complete, execution may continue to step 410. In response to determining that training is not complete, the parameters of the initial model may be updated based on the determined loss value, samples may be re-extracted from the sample set, and the training step may be continued using the initial model with the updated parameters. Here, the gradient of the loss value with respect to the model parameters may be found using a back-propagation algorithm, and the model parameters may then be updated based on the gradient using a gradient descent algorithm. It should be noted that the back-propagation algorithm, the gradient descent algorithm, and machine learning methods are well-known technologies that are widely researched and applied at present, and are not described in detail here. It should also be noted that the present application does not limit the manner of sample extraction. For example, when the sample set contains a large number of samples, the execution subject may extract a sample that has not been extracted before.
Step 410, in response to determining that the initial model training is complete, determining the trained initial model as a low-quality video detection model.
In this embodiment, in response to determining that the initial model training is completed, the executing entity may determine the trained initial model as a low-quality video detection model. The low-quality video detection model can detect whether the video is a low-quality video or not, and can detect a low-quality category of the low-quality video.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating a model in the present embodiment highlights the manner in which the loss value is calculated. Training the initial model based on a loss value calculated in this manner enables the trained model to detect both low-quality videos and the low-quality categories of those videos. Meanwhile, using the trained low-quality video detection model for video detection improves the detection speed for low-quality videos as well as the detection effect for low-quality categories.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a model, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 5, the apparatus 500 for generating a model according to the present embodiment includes: an obtaining unit 501 configured to obtain a sample set, where a sample may include a sample video and first annotation information indicating whether the sample video belongs to a low-quality video, and, when the sample video in a sample belongs to a low-quality video, the sample further includes second annotation information indicating the low-quality category of the sample video; and a training unit 502 configured to extract samples from the sample set and perform the following training steps: inputting the frames of the sample video in the extracted sample into an initial model to obtain, respectively, the probability that the sample video belongs to a low-quality video and the probability of each low-quality category; determining the loss value of the sample based on the annotation information in the extracted sample, the obtained probabilities, and a pre-established loss function; determining whether training of the initial model is complete based on a comparison of the loss value with a target value; and, in response to determining that training of the initial model is complete, determining the trained initial model as a low-quality video detection model.
In some optional implementations of this embodiment, the training unit 502 may be further configured to: inputting the first annotation information in the extracted sample and the probability that the sample video belongs to a low-quality video into a pre-established first loss function to obtain a first loss value; and, in response to determining that the extracted sample does not contain second annotation information, determining the first loss value as the loss value of the extracted sample.
In some optional implementations of this embodiment, the training unit 502 may be further configured to: in response to determining that the extracted sample contains second annotation information, taking the low-quality category indicated by the second annotation information as a target category, and inputting the second annotation information in the extracted sample and the probability that the sample video belongs to the target category into a pre-established second loss function to obtain a second loss value; and determining the sum of the first loss value and the second loss value as the loss value of the extracted sample.
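Assuming, purely for illustration, that the pre-established first loss function is a binary cross-entropy on the low-quality probability and the second loss function is the negative log-probability of the annotated category (the application does not fix specific loss functions), the combined loss value of one sample could be computed as:

```python
import math

def sample_loss(first_label, p_low, category_probs=None, target_category=None):
    """Loss value per the described scheme: a first loss from the low-quality
    probability, plus a second loss on the target category when the sample
    carries second annotation information. The concrete loss functions here
    (cross-entropy / negative log-likelihood) are assumed choices."""
    eps = 1e-12  # numerical guard against log(0)
    # First loss: binary cross-entropy between the first annotation and p_low.
    first_loss = -(first_label * math.log(p_low + eps)
                   + (1 - first_label) * math.log(1 - p_low + eps))
    if target_category is None:
        # No second annotation information: the first loss is the sample's loss.
        return first_loss
    # Second loss: negative log-probability of the annotated low-quality category.
    second_loss = -math.log(category_probs[target_category] + eps)
    # The sample's loss is the sum of the first and second loss values.
    return first_loss + second_loss
```

For a normal sample only the first term is returned; for a low-quality sample both terms contribute, which matches the two configurations described above.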
In some optional implementations of this embodiment, the apparatus may further include an updating unit (not shown in the figure). The updating unit may be configured to, in response to determining that training of the initial model is not complete, update parameters of the initial model based on the determined loss value, re-extract a sample from the sample set, and continue to perform the training steps using the initial model with updated parameters as the initial model.
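The loop formed by the training unit and the updating unit might be sketched as follows. Here `compute_loss` and `update_parameters` are placeholder callables, and stopping once the loss value reaches the target value is one possible reading of "the comparison between the loss value and the target value"; none of these specifics are fixed by the application.

```python
import random

def train(initial_model, sample_set, target_value, update_parameters,
          compute_loss, max_steps=1000):
    """Sketch of the described flow: extract a sample, compute its loss,
    finish when the loss reaches the target value, otherwise update the
    model's parameters and repeat with the updated model as the initial model."""
    model = initial_model
    for _ in range(max_steps):
        sample = random.choice(sample_set)      # (re-)extract a sample from the sample set
        loss = compute_loss(model, sample)
        if loss <= target_value:                # comparison of loss value and target value
            return model                        # trained model = low-quality video detection model
        model = update_parameters(model, sample)  # update parameters, continue the training steps
    return model
```

As a toy usage, a scalar "model" whose update halves its distance to an optimum converges under this loop, illustrating the stop-or-update control flow.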
In the apparatus provided by the above embodiment of the present application, the obtaining unit 501 obtains a sample set, and the training unit 502 may extract samples from the sample set to train the initial model. The samples in the sample set may include sample videos, first annotation information indicating whether a sample video belongs to a low-quality video, and second annotation information indicating the low-quality category of a sample video that belongs to a low-quality video. The training unit 502 inputs the frames of the sample video in an extracted sample into the initial model, so as to obtain the probabilities, output by the initial model, that the sample video belongs to a low-quality video and to each low-quality category. Then, based on the annotation information in the extracted sample, the obtained probabilities, and a pre-established loss function, the loss value of the sample can be determined. Thereafter, based on a comparison of the loss value with a target value, it may be determined whether training of the initial model is complete. If training is complete, the trained initial model can be determined as a low-quality video detection model. A model usable for low-quality video detection is thus obtained, which helps to improve the efficiency of low-quality video detection.
Referring to fig. 6, a flow diagram 600 of one embodiment of a method for detecting low quality video is provided. The method for detecting low-quality video may comprise the steps of:
step 601, receiving a low-quality video detection request containing a target video.
In this embodiment, an execution body of the method for detecting low-quality video (e.g., the server 105 shown in fig. 1, or another server storing a low-quality video detection model) may receive a low-quality video detection request containing a target video. Here, the target video is a video to be subjected to low-quality video detection. The target video may be stored in advance in the execution body, or may be transmitted to the execution body by other electronic devices (e.g., the terminal devices 101, 102, 103 shown in fig. 1).
Step 602, inputting the frames in the target video into the low-quality video detection model to obtain the detection result.
In this embodiment, the execution body may input frames of the target video into the low-quality video detection model to obtain a detection result. The detection result may include the probability that the target video belongs to a low-quality video. The low-quality video detection model may be generated using the method for generating a low-quality video detection model described in the embodiment of fig. 2 above. For the specific generation process, reference may be made to the related description of that embodiment, which is not repeated here.
Step 603, in response to determining that the probability that the target video belongs to the low-quality video is greater than a first preset threshold, determining that the target video is the low-quality video.
In this embodiment, in response to determining that the probability that the target video belongs to a low-quality video is greater than the first preset threshold, the execution body may determine that the target video is a low-quality video.
In some optional implementations of this embodiment, the detection result may further include the probability that the target video belongs to each low-quality category. After determining that the target video is a low-quality video, the execution body may further perform the following operations:
First, in response to receiving a low-quality category detection request, the probability that the target video belongs to a low-quality video may be taken as a first probability. Then, for each low-quality category, the product of the probability that the target video belongs to that low-quality category and the first probability is determined, and this product is determined as the probability that the target video belongs to that low-quality category.
Then, each low-quality category whose probability is greater than a second preset value may be determined as a low-quality category of the target video. In this way, the low-quality categories of the target video can be determined.
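Steps 601 through 603, together with the optional category step, can be sketched as below. The concrete threshold values, the dictionary-based detection result, and the `detect` function name are illustrative assumptions.

```python
def detect(p_low, category_probs, first_threshold=0.5, second_threshold=0.3):
    """Sketch of the described detection flow: the target video is a
    low-quality video when p_low exceeds the first preset threshold; each
    category probability is then multiplied by p_low (the first probability)
    and compared against the second preset value."""
    if p_low <= first_threshold:
        # Probability does not exceed the first preset threshold: not low-quality.
        return False, []
    # Product of each category probability and the first probability.
    joint = {c: p * p_low for c, p in category_probs.items()}
    # Categories whose resulting probability exceeds the second preset value.
    categories = [c for c, p in joint.items() if p > second_threshold]
    return True, categories
```

For example, with `p_low = 0.9` and category probabilities `{"blurry": 0.8, "jittery": 0.1}` (hypothetical category names), only "blurry" survives the second threshold, since 0.8 × 0.9 = 0.72 while 0.1 × 0.9 = 0.09.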
It should be noted that the method for detecting low-quality video according to the present embodiment may be used to test the low-quality video detection model generated by the foregoing embodiments, and the model can then be further optimized according to the test results. The method may also be a practical application of the low-quality video detection model generated by the above embodiments. Adopting that model for low-quality video detection is beneficial to improving the model's performance; meanwhile, performing low-quality video detection with the model increases the detection speed for low-quality videos and improves the detection of low-quality categories.
With continuing reference to fig. 7, the present application provides an embodiment of an apparatus for detecting low-quality video, as an implementation of the method illustrated in fig. 6 above. This apparatus embodiment corresponds to the method embodiment shown in fig. 6, and the apparatus can be applied in various electronic devices.
As shown in fig. 7, the apparatus 700 for detecting low-quality video according to this embodiment includes: a first receiving unit 701 configured to receive a low-quality video detection request containing a target video; an input unit 702 configured to input frames of the target video into a low-quality video detection model to obtain a detection result, where the detection result includes the probability that the target video belongs to a low-quality video; and a first determining unit 703 configured to determine that the target video is a low-quality video in response to determining that the probability is greater than a first preset threshold.
In some optional implementations of this embodiment, the detection result may further include the probability that the target video belongs to each low-quality category, and the apparatus may further include a second receiving unit and a second determining unit (not shown in the figure). The second receiving unit may be configured to, in response to receiving a low-quality category detection request, take the probability that the target video belongs to a low-quality video as a first probability, determine, for each low-quality category, the product of the probability that the target video belongs to that low-quality category and the first probability, and determine the product as the probability that the target video belongs to that low-quality category. The second determining unit may be configured to determine a low-quality category whose probability is greater than a second preset value as a low-quality category of the target video.
It will be understood that the units described in the apparatus 700 correspond to the various steps of the method described with reference to fig. 6. Thus, the operations, features, and resulting advantages described above with respect to the method are also applicable to the apparatus 700 and the units included therein, and are not repeated here.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage portion 808 including a hard disk and the like; and a communication portion 809 including a network interface card such as a LAN card, a modem, or the like. The communication portion 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage portion 808 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an obtaining unit and a training unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the obtaining unit may also be described as "a unit that obtains a sample set".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: obtaining a sample set; extracting samples from the sample set, and executing the following training steps: inputting frames in a sample video in the extracted sample into an initial model, and respectively obtaining the probability that the sample video belongs to a low-quality video and each low-quality category; determining a loss value of the sample based on the extracted labeling information in the sample, the obtained probability and a pre-established loss function; determining whether the initial model is trained based on the comparison of the loss value and the target value; in response to determining that the initial model training is complete, determining the trained initial model as a low-quality video detection model.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.