WO2022171011A1 - 视频审核模型训练方法、视频审核方法及相关装置 (Video review model training method, video review method, and related apparatus) - Google Patents

Video review model training method, video review method, and related apparatus

Info

Publication number
WO2022171011A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
sample image
sub-model
image
video
Prior art date
Application number
PCT/CN2022/074703
Other languages
English (en)
French (fr)
Inventor
丘林
眭哲豪
Original Assignee
百果园技术(新加坡)有限公司
丘林
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司 and 丘林
Publication of WO2022171011A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • the embodiments of the present application relate to the technical field of video review, for example, to a video review model training method, a video review method, and a related device.
  • Video content review can help enterprises screen out illegal images, videos, text, and other content on their platforms. Through video content review, violating content can be filtered and deleted, so as to build a green and safe network environment for users.
  • However, the live broadcast scene is complex and particular:
  • the scene in a live broadcast is complex and changeable, and there are often multiple objects in view;
  • live screenshots are affected by lighting, camera equipment, etc., resulting in poor and blurry image quality;
  • objects such as mobile phones, walkie-talkies and microphones in the live broadcast scene have visual features similar to those of violating objects at certain viewing angles, so the precision of videos sent for manual review is not high;
  • finally, in the online real-data scenario, the ratio of positive samples to negative samples is severely imbalanced.
  • These aspects eventually lead to the false positive (FP) problem when a video review model is used to review videos:
  • the video review model cannot accurately distinguish negative samples from positive samples, and the accuracy of video review is low.
  • Embodiments of the present application provide a video review model training method, a video review method, an apparatus, an electronic device, and a storage medium, so as to avoid the situation in the related art where the video review model has difficulty distinguishing positive samples from negative samples, resulting in low review accuracy.
  • In a first aspect, an embodiment of the present application provides a method for training a video review model, including: acquiring a first sample image and a classification label of the first sample image; initializing a video review model, where the video review model includes a first-level sub-model and a second-level sub-model; training the first-level sub-model using the first sample image, and calculating, according to the classification label, the classification loss rate of the first-level sub-model on the first sample image; and, in response to determining that the classification loss rate is greater than a preset value, training the second-level sub-model using the first sample image.
  • In a second aspect, an embodiment of the present application provides a video review method, including: acquiring a video image from a video to be reviewed, and inputting the video image into a pre-trained video review model to obtain a score that the video image belongs to a violating image.
  • The video review model includes a first-level sub-model and a second-level sub-model, where the first-level sub-model is configured to predict a first score that the video image belongs to a violating image and, in response to determining that the first score is less than a preset value, output the first score; the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict a second score that the video image belongs to a violating image, and output the second score.
  • The video review model is trained by the video review model training method described in the first aspect.
  • an embodiment of the present application provides a video review model training device, including:
  • a sample acquisition module configured to acquire a first sample image and a classification label of the first sample image;
  • a model initialization module configured to initialize a video review model, where the video review model includes a first-level sub-model and a second-level sub-model;
  • a first-level sub-model training module configured to train the first-level sub-model using the first sample image, and to calculate, according to the classification label, the classification loss rate of the first-level sub-model on the first sample image;
  • a second-level sub-model training module configured to train the second-level sub-model using the first sample image in response to determining that the classification loss rate is greater than a preset value.
  • an embodiment of the present application provides a video review device, including:
  • a video image acquisition module configured to acquire video images from the video to be reviewed;
  • a model prediction module configured to input the video image into a pre-trained video review model to obtain a score that the video image belongs to a violating image, where the score includes a first score and a second score;
  • the video review model includes a first-level sub-model and a second-level sub-model;
  • the first-level sub-model is configured to predict a first score that the video image belongs to a violating image and, in response to determining that the first score is less than a preset value, output the first score;
  • the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict a second score that the video image belongs to a violating image, and output the second score;
  • a review module configured to review the video to be reviewed in response to determining that the score is greater than a preset threshold;
  • the video review model is trained by the video review model training method described in the first aspect.
  • an embodiment of the present application provides an electronic device, the electronic device including:
  • one or more processors; and
  • a storage device configured to store one or more programs, where, when the one or more programs are executed by the one or more processors, the one or more processors implement the video review model training method described in the first aspect of the present application and/or the video review method described in the second aspect.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the video review model training method described in the first aspect of the present application and/or the video review method described in the second aspect.
  • FIG. 1 is a flow chart of steps of a method for training a video review model provided by an embodiment of the present application
  • FIG. 2A is a flowchart of steps of a method for training a video review model provided by another embodiment of the present application;
  • FIG. 2B is a schematic structural diagram of a video review model according to an embodiment of the present application;
  • FIG. 2C is a schematic diagram of DenseNet in an embodiment of the present application;
  • FIG. 2D is a schematic diagram of a residual module in an embodiment of the present application;
  • FIG. 2E is a schematic diagram of the first-level sub-model and the second-level sub-model in an embodiment of the present application;
  • FIG. 2F is a schematic diagram of an attention mechanism module in an embodiment of the present application;
  • FIG. 3 is a flowchart of steps of a video review method provided by an embodiment of the present application.
  • FIG. 4 is a structural block diagram of a video review model training device provided by an embodiment of the present application.
  • FIG. 5 is a structural block diagram of a video review apparatus provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • The embodiment of the present application can be applied to the situation in which a video review model is trained to review videos.
  • The method may be performed by a video review model training device.
  • The video review model training device can be implemented by hardware or software and is integrated in the electronic device provided by the embodiment of the present application.
  • The video review model training method of the embodiment of the present application can include the following steps:
  • A sample image may refer to an image used for training the video review model.
  • The sample image may contain violating objects, such as guns, knives, violent or terrorist content, etc.
  • The classification label of the sample image may be a label expressing whether the sample image is a normal image or a violating image.
  • For example, the classification label can be 0 when the sample image is a normal image, and 1 when the sample image is a violating image.
  • In practice, multiple original images may be acquired first; image enhancement and normalization are performed on each original image to obtain multiple sample images, and the classification labels of the sample images are determined based on a labeling operation. Exemplarily, multiple video images can be captured from multiple live videos as original images; the brightness, contrast, and sharpness of each original image are then adjusted to enhance it, and each original image is resized to a uniform size, such as 224×224 pixels. Finally, the pixel values of the image are normalized to obtain a sample image, and the classification label of the sample image is marked based on a manual judgment of whether it contains a violating object: if the sample image contains a violating object, its classification label is 1; otherwise, its classification label is 0.
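As an illustrative sketch of the preprocessing just described: the 224×224 target size comes from the text, while the nearest-neighbour resize and the [0, 1] pixel scaling are assumptions, since the embodiment does not fix the exact resize or normalization scheme.

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 224) -> np.ndarray:
    # Nearest-neighbour resize to size x size, then scale pixels to [0, 1].
    # The 224-pixel target follows the text; the scaling is an assumed
    # normalization, not the embodiment's exact scheme.
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows][:, cols]
    return resized.astype(np.float32) / 255.0
```

A captured frame of any size then yields a uniform 224×224 float sample ready for labeling and training.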
  • The video review model includes a cascaded first-level sub-model and second-level sub-model. The first-level sub-model is configured to predict a first score that the sample image belongs to a violating image, and the second-level sub-model is configured to, in response to determining that the first score is greater than a preset value, predict a second score that the sample image belongs to a violating image.
  • The first-level sub-model and the second-level sub-model may be classification neural networks; exemplarily, they may be VGG, a residual neural network (ResNet), a dense convolutional network (DenseNet), or other classification neural networks.
  • The first-level sub-model and the second-level sub-model can be constructed, and their model parameters initialized.
  • A first sample image can be randomly extracted from a plurality of first sample images and input into the initialized first-level sub-model to obtain a score that the first sample image belongs to a violating image; based on this score and the classification label of the first sample image, the classification loss rate of the first-level sub-model on the first sample image is calculated.
  • The absolute value of the difference between the score and the classification label can be directly calculated as the classification loss rate, or the mean square error can be used; the classification loss rate may also be calculated in other ways, and the embodiment of the present application does not limit the way of calculating it.
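The two loss options named above can be written as a small helper; the function name and `mode` parameter are illustrative, not from the embodiment.

```python
def classification_loss(score: float, label: int, mode: str = "abs") -> float:
    # Two of the options described in the text: the absolute value of the
    # difference between score and label, or the squared error.
    if mode == "abs":
        return abs(score - label)
    return (score - label) ** 2
```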
  • The model parameters of the first-level sub-model can be adjusted according to the classification loss rate; for example, gradients can be calculated from the classification loss rate and used to update the parameters.
  • When the classification loss rate of the first-level sub-model on the first sample image is greater than the preset value, it means that the first sample image is a difficult sample, hard to distinguish as positive or negative.
  • The first sample image can then be used to train the second-level sub-model, so that the second-level sub-model learns to distinguish whether a difficult sample image is a positive sample or a negative sample.
  • In response to determining that the classification loss rate is greater than the preset value, the first sample image is input into the second-level sub-model to obtain the second-level sub-model's classification loss rate, and the model parameters of the second-level sub-model are adjusted according to that loss rate until a preset number of iterations is reached or the second-level sub-model's classification loss rate falls below a preset threshold, yielding the trained second-level sub-model.
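The routing of difficult samples to the second-level sub-model reduces to a filter over per-sample first-level losses; the names below are illustrative.

```python
def select_hard_samples(samples, losses, preset_value):
    # First sample images whose first-level classification loss exceeds the
    # preset value are the "difficult" samples routed to train the
    # second-level sub-model.
    return [s for s, loss in zip(samples, losses) if loss > preset_value]
```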
  • The video review model in the embodiment of the present application includes a first-level sub-model and a second-level sub-model. After the video review model is initialized, the first-level sub-model is trained using the first sample image, and the classification loss rate of the first-level sub-model on the first sample image is calculated according to the classification label; in response to determining that the classification loss rate is greater than the preset value, the first sample image is used to train the second-level sub-model.
  • Since a first sample image with a high classification loss rate is a difficult sample, using it to train the second-level sub-model lets the second-level sub-model learn to distinguish difficult samples; ultimately the entire video review model can accurately distinguish positive and negative samples and accurately determine whether violating images exist in a video, improving the accuracy of video review.
  • FIG. 2A is a flowchart of steps of a video review model training method provided by another embodiment of the present application.
  • the embodiment of the present application is refined on the basis of the foregoing embodiment.
  • the video review model training method can include the following steps:
  • A plurality of video images may be captured from a video;
  • a plurality of first sample images may be obtained after image enhancement and normalization of the plurality of video images;
  • the classification label of a first sample image may be obtained based on manual annotation. In an example, the classification label is 0 when the first sample image does not include a violating object, and 1 when it does.
  • A certain number of images can also be randomly selected from a network image library as sample images; obtaining sample images is not limited to capturing video images from videos, and the embodiment of the present application does not limit the manner of obtaining the first sample image.
  • The video review model in this embodiment of the present application includes a cascaded first-level sub-model and second-level sub-model.
  • The first-level sub-model is configured to predict a first score that the sample image belongs to a violating image;
  • the second-level sub-model is configured to, in response to determining that the first score is greater than a preset value, predict a second score that the sample image belongs to a violating image.
  • The model parameters of the first-level sub-model and the second-level sub-model can be initialized.
  • The first-level sub-model can be DenseNet; Figure 2C is a schematic diagram of DenseNet.
  • In DenseNet, the network layers are densely connected: each network layer accepts the outputs of all previous network layers as additional input, so that each layer can reuse the output features of all layers before it, realizing feature reuse and improving efficiency.
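A minimal sketch of this dense-connection pattern, with `layer_fns` standing in for real convolutional layers; it illustrates the feature reuse described above, not the embodiment's exact architecture.

```python
import numpy as np

def dense_block(x, layer_fns):
    # Each layer receives the channel-wise concatenation of the block input
    # and all previous layers' outputs, and the block output concatenates
    # everything -- the DenseNet feature-reuse pattern.
    features = [x]
    for fn in layer_fns:
        features.append(fn(np.concatenate(features, axis=-1)))
    return np.concatenate(features, axis=-1)
```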
  • The second-level sub-model can be ResNet. ResNet reduces the difficulty of training deep networks through residual learning, introducing a residual module on the basis of a fully convolutional network.
  • Figure 2D shows a schematic diagram of the residual module.
  • Each residual module contains two paths: one is a direct shortcut path for the input features, and the other performs two or three convolution operations on the input features to obtain their residual; finally, the features on the two paths are added together.
  • The residual module reduces the difficulty of training deep networks and makes features easier to extract.
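The two-path structure can be sketched as follows; real residual modules use 3×3 convolutions with batch normalization, whereas this sketch assumes simple 1×1 channel mixes (`w1`, `w2`) to stay small.

```python
import numpy as np

def residual_block(x, w1, w2):
    # Path 1: identity shortcut. Path 2: two channel-mixing "convolutions"
    # with ReLU producing the residual. The two paths are added at the end,
    # as described in the text.
    h = np.maximum(x @ w1, 0.0)         # residual path, first conv + ReLU
    return np.maximum(x + h @ w2, 0.0)  # add shortcut, final ReLU
```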
  • those skilled in the art may also set network types of the first-level sub-model and the second-level sub-model according to actual needs, which are not limited in the embodiments of the present application.
  • Rough training can be performed on the first-level sub-model and the second-level sub-model first; that is, they are trained on a specified number of first sample images, and after a certain number of training iterations, a rough first-level sub-model and a rough second-level sub-model are obtained.
  • Figure 2E shows the network structure of the first-level sub-model and the second-level sub-model.
  • The first-level sub-model and the second-level sub-model include five groups of convolutional layers, with a pooling layer between every two groups for spatial dimensionality reduction. Multiple consecutive 3×3 convolution operations are used within the same group of convolutional layers, and the number of convolution kernels increases from 64 in the first group to 512 in the last; within a group, every layer has the same number of convolution kernels. The last group of convolutional layers is followed by two fully connected layers, and the fully connected layers are followed by a classification layer.
  • The first-level sub-model and the second-level sub-model may be convolutional neural networks with an attention mechanism module added; that is, an attention mechanism module is inserted after some of the convolutional layers of the first-level and second-level sub-models to replace the pooling layer.
  • Figure 2F shows a schematic diagram of the attention mechanism module, which includes a channel attention sub-module and a spatial attention sub-module.
  • The first sample image is input into the first-level sub-model. For a convolutional layer connected to an attention mechanism module, the output features of the convolutional layer are input into the attention mechanism module, and the final output feature of the attention mechanism module is fed to the next convolutional layer; the output features of the last convolutional layer then pass through the fully connected layers and the classification layer in turn to obtain a first score that the first sample image belongs to a violating image.
  • Then return to the step of inputting a first sample image into the first-level sub-model until the specified number of first sample images have been input, so as to train the first-level sub-model a certain number of times and obtain the rough first-level sub-model.
  • The output features of the convolutional layer are input into the channel attention sub-module of the attention mechanism module to obtain the channel features, and the channel features are multiplied with the output features of the convolutional layer to obtain the intermediate features.
  • The intermediate features are input into the spatial attention sub-module of the attention mechanism module to obtain the spatial features, and the spatial features are multiplied with the intermediate features to obtain the final output features of the attention mechanism module, which are fed to the next convolutional layer.
  • In the channel attention sub-module, the output features of the convolutional layer pass through a maximum pooling layer and an average pooling layer, and then through a shared perceptron to output channel feature 1 and channel feature 2; after channel feature 1 and channel feature 2 are added, a sigmoid activation yields the final channel features of the channel attention sub-module.
  • The channel features output by the channel attention sub-module are multiplied by the output features of the convolutional layer to obtain the intermediate features, which serve as the input features of the spatial attention sub-module.
  • In the spatial attention sub-module, the intermediate features pass through a maximum pooling layer and an average pooling layer respectively, then undergo a convolution operation, and finally a sigmoid activation yields the final spatial features of the spatial attention sub-module; the spatial features are multiplied with the intermediate features to obtain the final output features of the entire attention mechanism module.
  • The final output features of the entire attention mechanism module are input into the next convolutional layer, and finally the classification layer of the first-level sub-model outputs the first score that the first sample image belongs to a violating image.
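The channel-then-spatial attention flow can be sketched in numpy. Two simplifying assumptions: the shared perceptron is reduced to two weight matrices (`w1`, `w2`), and the convolution over the pooled maps in the spatial sub-module is replaced by an equal-weight sum; neither is the embodiment's exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_module(x, w1, w2):
    # x: (H, W, C) output features of the preceding convolutional layer.
    # Channel attention: max- and average-pool over space, pass both pooled
    # vectors through the shared perceptron, add, then sigmoid.
    mx = np.maximum(x.max(axis=(0, 1)) @ w1, 0.0) @ w2
    av = np.maximum(x.mean(axis=(0, 1)) @ w1, 0.0) @ w2
    channel = sigmoid(mx + av)
    mid = x * channel                    # intermediate features
    # Spatial attention: max and mean over the channel axis; the conv on the
    # pooled maps is assumed here to be an equal-weight combination.
    mask = sigmoid(0.5 * (mid.max(axis=2) + mid.mean(axis=2)))
    return mid * mask[..., None]         # final output features
```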
  • S204 Calculate the classification loss rate of the first sample image by using the first score of the first sample image and the classification label.
  • The classification layer of the first-level sub-model outputs the first score that the first sample image belongs to a violating image.
  • The first score may be a probability value;
  • the first score and the classification label are used to calculate the classification loss rate of the first-level sub-model on the first sample image.
  • For example, the absolute value of the difference between the predicted value and the classification label can be calculated as the classification loss rate, or a mean square error loss function can be used to calculate it.
  • The model parameters of the first-level sub-model are adjusted according to the classification loss rate.
  • The first sample image can then be used to roughly train the second-level sub-model; after this iteration of training the second-level sub-model, return to roughly training the first-level sub-model with the specified number of first sample images, until the first-level sub-model has been roughly trained with all of the specified number of first sample images, obtaining the rough first-level sub-model and the rough second-level sub-model.
  • The process of roughly training the second-level sub-model may refer to the rough training of the first-level sub-model in S203-S204, which will not be described in detail here.
  • The heat map expresses the mapping relationship between the first score (with which the first-level sub-model predicts that the first sample image belongs to a violating image) and the sensitive areas in the first sample image; that is, it shows which areas of the first sample image the first-level sub-model's prediction is most sensitive to.
  • All the first sample images can be input into the trained rough first-level sub-model to obtain a second score that each first sample image belongs to a violating image, and a heat map of the first sample image is generated based on the gradient-weighted class activation map (Grad-CAM) and the second score.
  • In Grad-CAM, the heat map is obtained as a linear combination of the feature maps of a convolutional layer, weighted by the gradients of the score with respect to those feature maps.
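A standard Grad-CAM computation consistent with this description uses the spatially averaged gradients as per-channel weights and takes the ReLU of the linear combination; this is generic Grad-CAM, not the embodiment's exact code, and the normalization to [0, 1] is an assumption.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    # feature_maps, gradients: (H, W, K) from a convolutional layer, where
    # gradients are d(score)/d(feature_maps). Channel weights are the
    # spatially averaged gradients; the heat map is the ReLU of the
    # weighted linear combination, normalized to [0, 1].
    weights = gradients.mean(axis=(0, 1))            # one weight per channel
    cam = np.maximum((feature_maps * weights).sum(axis=2), 0.0)
    return cam / cam.max() if cam.max() > 0 else cam
```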
  • The first sample image may be represented as H×W×3, where H is the number of pixels in the height direction, W is the number of pixels in the width direction, and 3 is the number of RGB channels of the first sample image.
  • A fourth channel with a value of 0 is added to the first sample image, giving an H×W×4 representation whose fourth channel is all zeros.
  • The pixel values of the heat map can then be used as the values of the fourth channel, so that the heat map and the first sample image are stitched together to obtain the second sample image of size H×W×4, in which the fourth channel carries the pixel values of the heat map.
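The stitching step is a channel concatenation; the function name is illustrative.

```python
import numpy as np

def stitch_heatmap(image, heatmap):
    # image: (H, W, 3) RGB first sample image; heatmap: (H, W) from Grad-CAM.
    # The heat-map pixel values become the fourth channel, producing the
    # H x W x 4 second sample image described in the text.
    return np.concatenate([image, heatmap[..., None]], axis=2)
```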
  • The fourth-channel values of a specified number of second sample images may be randomly set to 0 to obtain third sample images, and the second sample images and third sample images are used to train the rough first-level sub-model to obtain the final trained first-level sub-model.
  • Specifically, the pixel values of the highlighted parts of some second sample images can be set to 0; that is, fourth-channel values greater than a preset threshold are set to 0 to obtain third sample images. Randomly selected second sample images and third sample images are then used to iteratively train the rough first-level sub-model until the number of training iterations reaches a preset number or the loss rate falls below a preset threshold, obtaining the trained first-level sub-model.
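The highlight-suppression step above amounts to thresholding the fourth channel; names are illustrative.

```python
import numpy as np

def mask_highlights(sample, threshold):
    # Zero the fourth-channel (heat-map) values above the preset threshold,
    # turning a second sample image into a third sample image whose most
    # sensitive (highlighted) regions are suppressed.
    out = sample.copy()
    out[..., 3] = np.where(out[..., 3] > threshold, 0.0, out[..., 3])
    return out
```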
  • The score with which each first sample image belongs to a violating image can be obtained, and the classification loss rate of each first sample image calculated from that score; first sample images whose classification loss rate is higher than the preset value are set as fourth sample images. A fourth sample image is a difficult sample that the first-level sub-model struggles to distinguish as positive or negative.
  • The fourth sample image can be input into the trained rough second-level sub-model to obtain a third score for the fourth sample image, and a heat map of the fourth sample image is generated based on Grad-CAM and the third score.
  • The generation process is the same as that of the heat map of the first sample image and will not be described in detail here.
  • The pixel values of the heat map can be used as the fourth-channel values of the fourth sample image, so as to stitch the heat map and the fourth sample image together and obtain a fifth sample image.
  • For this step, please refer to S207, which will not be described in detail here.
  • Using the fifth sample image to train the rough second-level sub-model may refer to the training of the rough first-level sub-model in S208, which will not be described in detail here.
  • The last convolutional layer of the rough second-level sub-model adopts a variable (deformable) convolution kernel, which changes the receptive field of the second-level sub-model so that it can better learn the features of violating objects and enhance its ability to discriminate them.
  • The video review model in the embodiment of the present application includes a first-level sub-model and a second-level sub-model. After the video review model is initialized, the first-level sub-model is trained using the first sample image, and the classification loss rate of the first-level sub-model on the first sample image is calculated according to the classification label; in response to determining that the classification loss rate is greater than the preset value, the first sample image is used to train the second-level sub-model.
  • Since a first sample image with a high classification loss rate is a difficult sample, using it to train the second-level sub-model lets the second-level sub-model learn to distinguish difficult samples; ultimately the entire video review model can accurately distinguish positive and negative samples and accurately determine whether violating images exist in a video, improving the accuracy of video review.
  • rough training can speed up the model convergence.
  • heat maps are added to the sample images to provide weakly supervised data for model training and improve the classification accuracy of images by the video review model.
  • adding an attention mechanism module to the first-level sub-model and the second-level sub-model makes the model pay attention to the local area of the offending object in the image, which is beneficial to improve the ability of the video review model to detect the offending object.
  • the last convolutional layer of the second-level sub-model adopts a variable convolution kernel, so that the second-level sub-model can better learn the characteristics of the offending objects and improve the ability of the second-level sub-model to identify the offending objects.
  • Randomly setting the pixel values of highlighted areas in the heat map to 0 not only avoids model overfitting but also improves the model's ability to identify occluded violating objects, improving the model's robustness in identifying them.
  • FIG. 3 is a flowchart of steps of a video review method provided by an embodiment of the present application.
  • the embodiment of the present application is applicable to the case where a trained video review model is used to review videos.
  • the video review device may be implemented by hardware or software, and integrated in the electronic equipment provided by the embodiment of the present application.
  • The video review method of the embodiment of the present application may include the following steps:
  • the video to be reviewed may be a short video.
  • the video to be reviewed may be a live video on a live-broadcast platform, a short video on a short-video platform, or even a long video.
  • after the video to be reviewed is determined, a certain number of video images can be captured from it: for example, video images can be obtained from the video at a certain sampling rate, or at a certain time interval. The embodiments of the present application do not limit the manner of obtaining video images from the video to be reviewed.
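As an illustration of the interval-based sampling mentioned above, the sketch below computes which frame indices to capture from a clip. This is a hypothetical helper, not part of the embodiment: the function name, the 2-second interval, and the index arithmetic are all assumptions.

```python
def frame_indices(total_frames, fps, interval_s):
    """Indices of frames sampled every `interval_s` seconds.

    Illustrative only: the embodiment does not fix a sampling
    strategy, so the interval and rounding here are assumptions.
    """
    step = max(1, round(fps * interval_s))  # frames between two samples
    return list(range(0, total_frames, step))

# e.g. a 10-second clip at 30 fps, sampled every 2 seconds
print(frame_indices(300, 30.0, 2.0))  # -> [0, 60, 120, 180, 240]
```

A sampling-rate-based variant would differ only in how `step` is derived.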
  • the video review model of the embodiment of the present application can be trained by the video review model training method of the previous embodiment.
  • the video review model includes cascaded first-level and second-level sub-models. A video image is first input into the first-level sub-model to obtain a first score indicating that the image is an offending image. If the first score is less than the preset value, the video review model outputs the first score; if it is greater than the preset value, the video image is input into the second-level sub-model to obtain a second score, which is then output.
  • if the score of the video image is greater than the preset threshold, the video image very likely contains an offending object; the user ID of the video to be reviewed and the video image can then be sent to the back end, where the video is reviewed manually.
  • the video review model in this embodiment of the present application includes a first-level sub-model and a second-level sub-model.
  • the video image of the video to be reviewed is first input into the first-level sub-model to obtain a first score indicating that the image is an offending image. If the first score is less than the preset value, the video review model outputs the first score; if the first score is greater than the preset value, the video image is input into the second-level sub-model to obtain a second score, which is then output.
  • the video review model uses two cascaded sub-models. During training, the first-level sub-model predicts each first sample image and the classification loss rate is computed; a first sample image whose classification loss rate is greater than the preset value is a hard sample whose positive or negative label is difficult to distinguish. Such hard samples are used to train the second-level sub-model, which thereby learns to distinguish hard samples, so that the whole video review model can accurately separate positive and negative samples, accurately determine that offending images exist in a video, and improve the accuracy of videos sent for manual review.
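The cascaded scoring at inference time can be sketched as below. The sub-models are stand-ins here (any callables returning a score in [0, 1]); the parameter name `preset` and the 0.5 default are assumptions, since the embodiment only speaks of "a preset value".

```python
def cascade_score(image, first_model, second_model, preset=0.5):
    """Two-stage scoring: stop at stage one unless the image is hard.

    `first_model` / `second_model` stand in for the trained sub-models;
    here they are any callables mapping an image to a score in [0, 1].
    """
    first = first_model(image)
    if first < preset:          # confidently scored: output the first score
        return first
    return second_model(image)  # hard case: defer to the second-level sub-model

# toy stand-ins for the sub-models
easy = cascade_score("img", lambda x: 0.1, lambda x: 0.9)
hard = cascade_score("img", lambda x: 0.8, lambda x: 0.95)
print(easy, hard)  # -> 0.1 0.95
```

Only images the first-level sub-model finds ambiguous pay the cost of the second-level model, which is the design rationale of the cascade.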
  • FIG. 4 is a structural block diagram of a video review model training device provided by an embodiment of the present application. As shown in FIG. 4 , the video review model training device in the embodiment of the present application includes:
  • a sample acquisition module 401 configured to acquire a first sample image and a classification label of the first sample image
  • the model initialization module 402 is configured to initialize a video review model, and the video review model includes a first-level sub-model and a second-level sub-model;
  • the first-level sub-model training module 403 is configured to train the first-level sub-model with the first sample image, and to compute, according to the classification label, the classification loss rate of the first-level sub-model in classifying the first sample image;
  • the second-level sub-model training module 404 is configured to train the second-level sub-model with the first sample image in response to determining that the classification loss rate is greater than a preset value.
  • the video review model training apparatus provided by the embodiments of the present application can execute the video review model training methods provided by the foregoing embodiments of the present application, and has functional modules and beneficial effects corresponding to the execution methods.
  • FIG. 5 is a structural block diagram of a video review apparatus provided by an embodiment of the present application. As shown in FIG. 5 , the video review apparatus of the embodiment of the present application may include the following modules:
  • a video image acquisition module 501 configured to acquire video images from the video to be reviewed
  • the model prediction module 502 is configured to input the video image into a pre-trained video review model to obtain a score indicating that the video image is an offending image, wherein the score includes a first score and a second score, the video review model includes a first-level sub-model and a second-level sub-model, the first-level sub-model is configured to predict the first score of the video image being an offending image and, in response to determining that the first score is less than a preset value, to output the first score, and the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict the second score of the video image being an offending image and output the second score;
  • the review module 503 is configured to review the video to be reviewed in response to determining that the score is greater than a preset threshold
  • the video review model is trained by the video review model training method described in the foregoing embodiment.
  • the video review apparatus provided by the embodiment of the present application can execute the video review method provided by the embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method.
  • the electronic device may include: a processor 601 , a storage device 602 , a display screen 603 with a touch function, an input device 604 , an output device 605 and a communication device 606 .
  • the number of processors 601 in the electronic device may be one or more, and one processor 601 is taken as an example in FIG. 6 .
  • the processor 601 , storage device 602 , display screen 603 , input device 604 , output device 605 , and communication device 606 of the electronic device may be connected by a bus or in other ways. In FIG. 6 , the connection by a bus is taken as an example.
  • the electronic device is configured to execute the video review model training method and/or the video review method provided by any of the embodiments of the present application.
  • Embodiments of the present application further provide a computer-readable storage medium, where the instructions in the storage medium, when executed by the processor of the device, enable the device to execute the video review model training method described in the foregoing method embodiments, and/or , the video review method.
  • the computer-readable storage medium may be a non-transitory computer-readable storage medium.


Abstract

Embodiments of the present application disclose a video review model training method, a video review method, and related apparatus. The video review model training method includes: acquiring a first sample image and a classification label of the first sample image; initializing a video review model, the video review model including a first-level sub-model and a second-level sub-model; training the first-level sub-model with the first sample image, and computing, according to the classification label, a classification loss rate of the first-level sub-model in classifying the first sample image; and, in response to determining that the classification loss rate is greater than a preset value, training the second-level sub-model with the first sample image.

Description

Video review model training method, video review method, and related apparatus
This application claims priority to Chinese patent application No. 202110181850.6, filed with the China Patent Office on February 9, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the technical field of video review, and for example to a video review model training method, a video review method, and related apparatus.
Background
With the explosive growth of the mobile Internet and the implementation of network security laws, content platform operators face sterner tests: on the one hand, malicious users are increasing; on the other hand, supervision of offending content in videos is being strengthened. Video content review helps enterprises screen offending images, videos, and text on their platforms; through video content review, offending content can be filtered out and deleted, building a green and safe network environment for users.
With the application of machine learning, related technologies usually review videos with a trained video review model. However, live-streaming scenes are complex and special. First, live scenes are complex and changeable and contain multiple objects. Second, live screenshots are affected by lighting and camera equipment and suffer from poor image quality and blur. Third, objects in live scenes such as mobile phones, walkie-talkies, and microphones have visual features similar to those of offending objects, so the videos sent for manual review have low precision. Finally, in real online data, the ratio of positive to negative samples is extremely imbalanced. Together these factors cause false-positive (FP) problems when a video review model reviews videos: the model cannot precisely distinguish negative examples from positive examples, and review accuracy is low.
Summary
Embodiments of the present application provide a video review model training method, a video review method, an apparatus, an electronic device, and a storage medium, to avoid the situation in the related art where a video review model has difficulty distinguishing positive examples from negative examples, resulting in low review accuracy.
In a first aspect, an embodiment of the present application provides a video review model training method, including:
acquiring a first sample image and a classification label of the first sample image;
initializing a video review model, the video review model including a first-level sub-model and a second-level sub-model;
training the first-level sub-model with the first sample image, and computing, according to the classification label, a classification loss rate of the first-level sub-model in classifying the first sample image;
in response to determining that the classification loss rate is greater than a preset value, training the second-level sub-model with the first sample image.
In a second aspect, an embodiment of the present application provides a video review method, including:
acquiring a video image from a video to be reviewed;
inputting the video image into a pre-trained video review model to obtain a score indicating that the video image is an offending image; where the score includes a first score and a second score, the video review model includes a first-level sub-model and a second-level sub-model, the first-level sub-model is configured to predict the first score of the video image being an offending image and, in response to determining that the first score is less than a preset value, output the first score, and the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict the second score of the video image being an offending image and output the second score;
in response to determining that the score is greater than a preset threshold, reviewing the video to be reviewed;
where the video review model is trained by the video review model training method of the first aspect.
In a third aspect, an embodiment of the present application provides a video review model training apparatus, including:
a sample acquisition module configured to acquire a first sample image and a classification label of the first sample image;
a model initialization module configured to initialize a video review model, the video review model including a first-level sub-model and a second-level sub-model;
a first-level sub-model training module configured to train the first-level sub-model with the first sample image and to compute, according to the classification label, a classification loss rate of the first-level sub-model in classifying the first sample image;
a second-level sub-model training module configured to train the second-level sub-model with the first sample image in response to determining that the classification loss rate is greater than a preset value.
In a fourth aspect, an embodiment of the present application provides a video review apparatus, including:
a video image acquisition module configured to acquire a video image from a video to be reviewed;
a model prediction module configured to input the video image into a pre-trained video review model to obtain a score indicating that the video image is an offending image, where the score includes a first score and a second score, the video review model includes a first-level sub-model and a second-level sub-model, the first-level sub-model is configured to predict the first score of the video image being an offending image and, in response to determining that the first score is less than a preset value, output the first score, and the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict the second score of the video image being an offending image and output the second score;
a review module configured to review the video to be reviewed in response to determining that the score is greater than a preset threshold;
where the video review model is trained by the video review model training method of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
a storage device configured to store one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the video review model training method of the first aspect of the present application and/or the video review method of the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the video review model training method of the first aspect of the present application and/or the video review method of the second aspect.
Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of a video review model training method provided by an embodiment of the present application;
FIG. 2A is a flowchart of the steps of a video review model training method provided by another embodiment of the present application;
FIG. 2B is a schematic structural diagram of a video review model of an embodiment of the present application;
FIG. 2C is a schematic diagram of DenseNet in an embodiment of the present application;
FIG. 2D is a schematic diagram of a residual module in an embodiment of the present application;
FIG. 2E is a schematic diagram of the first-level sub-model and the second-level sub-model in an embodiment of the present application;
FIG. 2F is a schematic diagram of the attention mechanism module in an embodiment of the present application;
FIG. 3 is a flowchart of the steps of a video review method provided by an embodiment of the present application;
FIG. 4 is a structural block diagram of a video review model training apparatus provided by an embodiment of the present application;
FIG. 5 is a structural block diagram of a video review apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts of the structure related to the application rather than all of it. Where there is no conflict, the embodiments of the application and the features in the embodiments may be combined with one another.
FIG. 1 is a flowchart of the steps of a video review model training method provided by an embodiment of the present application. The embodiment is applicable to training a video review model to review videos. The method may be executed by the video review model training apparatus of the embodiments of the present application, which may be implemented in hardware or software and integrated into the electronic device provided by the embodiments. For example, as shown in FIG. 1, the video review model training method may include the following steps:
S101. Acquire a first sample image and a classification label of the first sample image.
In the embodiments of the present application, a sample image is an image used to train the video review model and may contain an offending object, for example a gun, a knife, or violent or terrorist content. The classification label of a sample image expresses whether the image is a normal image or an offending image; in one example the label is 0 for a normal image and 1 for an offending image.
In an exemplary embodiment of the present application, multiple original images may be acquired first, each original image is subjected to image enhancement and normalization to obtain multiple sample images, and the classification labels of the sample images are determined by an annotation operation. For example, multiple video images may be captured from multiple live-stream videos as original images; the brightness, contrast, and sharpness of each original image are adjusted to enhance it; the original images are resized to a uniform size, e.g. 224 pixels in both length and width; finally, the pixel values are normalized to obtain the sample images, and each sample image is manually judged for offending objects and labelled 1 if it contains an offending object and 0 otherwise.
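The resize-and-normalize step just described can be sketched as below. This is a minimal stand-in, not the embodiment's actual pipeline: nearest-neighbour resizing and [0, 1] scaling are assumptions (the embodiment does not mandate an interpolation method or normalization scheme), and the function name is hypothetical.

```python
import numpy as np

def preprocess(image, size=224):
    """Nearest-neighbour resize to size x size and scale pixels to [0, 1].

    Assumed stand-in for the resize + normalization described above.
    """
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    resized = image[rows][:, cols]
    return resized.astype(np.float32) / 255.0

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
out = preprocess(frame)
print(out.shape)  # -> (224, 224, 3)
```

Brightness/contrast/sharpness enhancement would precede this step; only the geometric and value normalization is shown.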
S102. Initialize a video review model, the video review model including a first-level sub-model and a second-level sub-model.
In the embodiments of the present application, the video review model includes a cascaded first-level sub-model and second-level sub-model. The first-level sub-model is configured to predict a first score of a sample image being an offending image; the second-level sub-model is configured to predict a second score of the sample image being an offending image in response to determining that the first score is greater than a preset value. For example, the two sub-models may be classification neural networks such as VGG, Residual Neural Network (ResNet), or Dense Convolutional Network (DenseNet). Before training the video review model, the first-level and second-level sub-models may be constructed and their model parameters initialized.
S103. Train the first-level sub-model with the first sample image, and compute, according to the classification label, the classification loss rate of the first-level sub-model in classifying the first sample image.
For example, a first sample image may be drawn at random from the multiple first sample images and input into the initialized first-level sub-model to obtain a score of the image being an offending image; the classification loss rate of the first-level sub-model on the first sample image is then computed from this score and the classification label. For instance, the absolute value of the difference between the score and the classification label may be used directly as the classification loss rate, or the mean squared error between them, or the loss rate may be computed in some other way; the embodiments of the present application do not restrict how the classification loss rate is computed.
After a first sample image is input to train the first-level sub-model and the classification loss rate is computed, the model parameters of the first-level sub-model may be adjusted according to the loss rate. For example, a gradient may be computed from the classification loss rate, gradient descent applied to the model parameters, and training of the first-level sub-model iterated until a preset number of iterations is reached or the classification loss rate falls below a preset threshold, yielding the trained first-level sub-model.
S104. In response to determining that the classification loss rate is greater than a preset value, train the second-level sub-model with the first sample image.
After each training iteration of the first-level sub-model, if its classification loss rate on the first sample image is greater than the preset value, that first sample image is a hard sample whose positive or negative label is difficult to distinguish, and it can be used to train the second-level sub-model so that the second-level sub-model learns to classify hard samples as positive or negative. For example, the first sample images whose classification loss rate is greater than the preset value are input into the second-level sub-model to obtain the second-level sub-model's classification loss rate, and the second-level sub-model's parameters are adjusted accordingly, until a preset number of iterations is reached or the second-level sub-model's classification loss rate falls below a preset threshold, yielding the trained second-level sub-model.
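The hard-example routing described above can be sketched as follows. This is schematic only: `stage1_step` and `stage2_step` are hypothetical callables standing in for one training step of each sub-model (returning the model's score), and the loss function and preset value are assumptions.

```python
def train_cascade(samples, labels, stage1_step, stage2_step, loss_fn, preset=0.5):
    """Route hard examples (stage-one loss > preset) to stage two.

    Schematic of the routing logic only, not a real optimizer loop:
    the *_step callables are assumed to train and return a score.
    """
    hard = []
    for x, y in zip(samples, labels):
        score = stage1_step(x, y)          # train stage one on every sample
        if loss_fn(score, y) > preset:     # hard sample: loss too high
            stage2_step(x, y)              # also train stage two on it
            hard.append(x)
    return hard

# toy run: stage one always predicts 0.9, so the label-0 sample is "hard"
abs_loss = lambda s, y: abs(s - y)
hard = train_cascade([1, 2, 3], [1, 0, 1],
                     lambda x, y: 0.9, lambda x, y: None, abs_loss)
print(hard)  # -> [2]
```

In the embodiment the loss could equally be mean squared error; only the thresholding against the preset value matters for the routing.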
The video review model of the embodiments of the present application includes a first-level sub-model and a second-level sub-model. After the video review model is initialized, the first-level sub-model is trained with the first sample images, and its classification loss rate in classifying each first sample image is computed from the classification label; in response to determining that the loss rate is greater than a preset value, the second-level sub-model is trained with that first sample image. The embodiments use two cascaded sub-models: the first-level sub-model's predictions yield the classification loss rate of each first sample image, and because a first sample image whose loss rate exceeds the preset value is a hard sample whose positive or negative label is difficult to distinguish, these hard samples can be used to train the second-level sub-model so that it learns to separate hard samples. Ultimately the whole video review model can accurately distinguish positive from negative samples, accurately determine that offending images exist in a video, and improve the accuracy of videos sent for manual review.
FIG. 2A is a flowchart of the steps of a video review model training method provided by another embodiment of the present application, which refines the preceding embodiment. For example, as shown in FIG. 2A, the method may include the following steps:
S201. Acquire a first sample image and a classification label of the first sample image.
In an exemplary embodiment of the present application, multiple video images may be captured from a video and subjected to image enhancement and normalization to obtain multiple first sample images, whose classification labels are obtained by manual annotation; in one example the label is 0 when the first sample image contains no offending object and 1 when it does. Of course, a certain number of images may also be drawn at random from an online image library as sample images, rather than only capturing video images from videos; the embodiments do not restrict how first sample images are acquired.
S202. Initialize a video review model, the video review model including a first-level sub-model and a second-level sub-model.
As shown in FIG. 2B, the video review model of this embodiment includes a cascaded first-level sub-model and second-level sub-model: the first-level sub-model is configured to predict a first score of a sample image being an offending image, and the second-level sub-model is configured to predict a second score in response to determining that the first score is greater than a preset value. Before training the video review model, the model parameters of both sub-models may be initialized.
For example, the first-level sub-model may be a DenseNet; FIG. 2C is a schematic diagram of DenseNet. In DenseNet all network layers are interconnected, i.e. each layer takes all preceding layers as additional input, so that each layer can reuse the output features of all the layers before it, achieving feature reuse and improving efficiency. The second-level sub-model may be a ResNet, which reduces the difficulty of training deep networks through residual learning. ResNet introduces residual modules on the basis of a fully convolutional network; FIG. 2D is a schematic diagram of a residual module. Each residual module contains two paths: one path is a direct shortcut for the input features, and the other applies two or three convolution operations to the input features to obtain their residual; finally the features of the two paths are summed. Residual modules lower the difficulty of training deep networks and make features easier to extract. Of course, when implementing the embodiments, those skilled in the art may set the network types of the first-level and second-level sub-models according to actual needs; the embodiments do not restrict this.
S203. Coarsely train the first-level sub-model with a specified number of first sample images to obtain a coarse first-level sub-model and a first score of each first sample image being an offending image.
In the embodiments of the present application, when training the video review model, the first-level and second-level sub-models may first be coarsely trained: a specified number of first sample images are used to train the first-level and second-level sub-models for a certain number of iterations, yielding a coarse first-level sub-model and a coarse second-level sub-model.
FIG. 2E shows the network structure of the first-level and second-level sub-models: each includes five groups of convolutional layers, with a pooling layer between every two groups for spatial down-sampling. Within a group, several consecutive 3x3 convolution operations are applied; the number of convolution kernels grows from 64 in the first group to 512 in the last, and is the same within a group. The last group of convolutional layers is followed by two fully connected layers, and the fully connected layers are followed by a classification layer. Of course, in practice those skilled in the art may set any number of groups of convolutional layers, and any number of layers per group and kernel sizes; the embodiments do not restrict this.
In an exemplary embodiment of the present application, the first-level and second-level sub-models may be convolutional neural networks augmented with an attention mechanism module: after some of the convolutional layers of the sub-models, an attention mechanism module is inserted in place of a pooling layer. As shown in FIG. 2F, the attention mechanism module includes a channel attention sub-module and a spatial attention sub-module.
When coarsely training the first-level sub-model with the specified number of first sample images, a first sample image is input into the first-level sub-model; for a convolutional layer connected to an attention mechanism module, the layer's output features are input into the attention mechanism module, and the module's final output features are fed into the next convolutional layer. The output features of the last convolutional layer pass through the fully connected layers and the classification layer in turn to yield the first score of the first sample image being an offending image; the procedure then returns to the step of inputting a first sample image into the first-level sub-model, until the specified number of first sample images have been input, thereby training the first-level sub-model for a certain number of iterations and obtaining the coarse first-level sub-model.
As shown in FIG. 2F, in the attention mechanism module the output features of the convolutional layer are input into the module's channel attention sub-module to obtain channel features; the channel features are multiplied by the convolutional layer's output features to obtain intermediate features; the intermediate features are input into the module's spatial attention sub-module to obtain spatial features; and the spatial features are multiplied by the intermediate features to obtain the module's final output features, which are fed into the next convolutional layer.
In detail, as shown in FIG. 2F, inside the channel attention sub-module the convolutional layer's output features pass through a max-pooling layer and an average-pooling layer, then through a perceptron that outputs channel feature 1 and channel feature 2; the two channel features are summed and passed through a sigmoid activation to obtain the channel attention sub-module's final channel features. These channel features are multiplied by the convolutional layer's output features to obtain the intermediate features, which serve as the input of the spatial attention sub-module. In the spatial attention sub-module, the intermediate features pass through a max-pooling layer and an average-pooling layer, followed by a convolution operation, and finally a sigmoid activation yields the sub-module's final spatial features; multiplying the spatial features by the intermediate features gives the final output features of the whole attention mechanism module, which are input into the next convolutional layer. Finally, the classification layer of the first-level sub-model outputs the first score of the first sample image being an offending image.
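The channel-then-spatial attention flow described above can be sketched in numpy as below. This is a simplified illustration, not the embodiment's module: `w1`/`w2` stand in for the shared perceptron, and the spatial branch's convolution is replaced by a plain per-pixel combination of the pooled maps for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_block(feat, w1, w2):
    """Channel then spatial attention over an (H, W, C) feature map.

    Schematic only: w1/w2 play the role of the shared perceptron and
    the spatial branch's convolution is elided.
    """
    # channel attention: pool over space, pass both pools through the MLP, sum
    avg = feat.mean(axis=(0, 1))
    mx = feat.max(axis=(0, 1))
    ch = sigmoid((avg @ w1) @ w2 + (mx @ w1) @ w2)        # (C,) channel weights
    mid = feat * ch                                        # intermediate features
    # spatial attention: pool over channels, squash to one (H, W) map
    sp = sigmoid(mid.mean(axis=2) + mid.max(axis=2))
    return mid * sp[..., None]                             # final output features

feat = np.random.rand(4, 4, 8)
out = attention_block(feat, np.random.rand(8, 4), np.random.rand(4, 8))
print(out.shape)  # -> (4, 4, 8)
```

The output keeps the input shape, so the module can replace a pooling layer's position in the network (with down-sampling handled elsewhere).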
S204. Compute the classification loss rate of the first sample image from the first score of the first sample image and the classification label.
In the embodiments of the present application, the classification layer of the first-level sub-model outputs the first score of the first sample image being an offending image; this first score may be a probability value, and the classification loss rate of the first-level sub-model on the first sample image can be computed from the first score and the classification label. In one example, the absolute value of the difference between the prediction and the classification label serves as the classification loss rate; a loss function such as mean squared error may also be used.
Note that after each training iteration of the first-level sub-model, its model parameters are adjusted according to the classification loss rate.
S205. After each first sample image is used to coarsely train the first-level sub-model, in response to determining that the classification loss rate is greater than the preset value, coarsely train the second-level sub-model with that first sample image to obtain a coarse second-level sub-model, until the specified number of first sample images have been used to coarsely train the first-level sub-model.
In the embodiments of the present application, after each training iteration of the first-level sub-model, if its classification loss rate on the first sample image is greater than the preset value, the first sample image can be determined to be a hard sample whose positive or negative label is difficult to distinguish, and it is used to coarsely train the second-level sub-model. After that iteration, training returns to coarsely training the first-level sub-model with the next first sample image, until the entire specified number of first sample images has been used, yielding the coarse first-level sub-model and the coarse second-level sub-model. Coarse training of the second-level sub-model follows the process of coarsely training the first-level sub-model in S203-S204 and is not detailed again here.
S206. Acquire a heat map of the first sample image.
In the embodiments of the present application, the heat map expresses the mapping between the first score predicted by the first-level sub-model and the sensitive regions of the first sample image, i.e. which regions of the first sample image the predicted offending-image score is most sensitive to.
In one example, all first sample images may be input into the trained coarse first-level sub-model to obtain a second score of each first sample image being an offending image, and the heat map of each first sample image is generated from the second score based on Gradient-weighted Class Activation Mapping (Grad-CAM).
For example, the partial derivatives of the second score with respect to all pixels A_ij of the feature map output by the fully connected layer of the first-level sub-model may be computed; taking the global average of these derivatives over the width and height dimensions of the feature map gives the sensitivity of the offending object in the first sample image to the K-th channel of that feature map; finally, a weighted linear combination of the per-channel sensitivities at each pixel yields the heat map. For details, refer to the Grad-CAM heat-map generation method in the related art, which is not elaborated here.
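The channel-weighting step of Grad-CAM just described can be sketched in numpy. This assumes the activations and their gradients with respect to the offending-image score have already been extracted from the coarse sub-model (how they are obtained is framework-specific and not shown); the [0, 1] rescaling at the end is an assumption for use as a fourth image channel.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map from an (H, W, K) activation map and the
    gradients of the offending-image score with respect to it."""
    weights = gradients.mean(axis=(0, 1))                        # per-channel sensitivity
    cam = np.maximum((activations * weights).sum(axis=2), 0.0)   # ReLU of weighted sum
    if cam.max() > 0:
        cam /= cam.max()                                         # scale to [0, 1]
    return cam

acts = np.random.rand(7, 7, 16)
grads = np.random.rand(7, 7, 16)
heat = grad_cam(acts, grads)
print(heat.shape)  # -> (7, 7)
```

The resulting map is low-resolution (the feature-map size); it would be upsampled to the sample image's size before being attached as a channel.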
S207. Concatenate the heat map with the first sample image to obtain a second sample image.
In one example, the first sample image can be expressed as H x W x 3, where H is the number of pixels along the image's length, W the number of pixels along its height, and 3 the RGB channel data. On this basis, a fourth channel with value 0 is added to the first sample image, i.e. the image is expressed as H x W x 3 x 0. After the heat map of the first sample image is generated, its pixel values can be taken as the values of the first sample image's fourth channel, concatenating the heat map with the first sample image into a second sample image H x W x 3 x 1, where 1 denotes the heat-map pixel values.
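The concatenation just described amounts to stacking the heat map as a fourth channel, which in numpy is one call; the helper name below is hypothetical.

```python
import numpy as np

def attach_heat_map(image, heat):
    """Stack an (H, W) heat map onto an (H, W, 3) RGB image as a
    fourth channel, giving the H x W x 4 second sample image."""
    return np.concatenate([image, heat[..., None]], axis=2)

img = np.random.rand(224, 224, 3)
heat = np.random.rand(224, 224)
sample = attach_heat_map(img, heat)
print(sample.shape)  # -> (224, 224, 4)
```

The sub-model's first convolutional layer would correspondingly accept 4 input channels instead of 3.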
S208. Train the coarse first-level sub-model with the second sample images to obtain the finally trained first-level sub-model.
In an exemplary embodiment, the fourth-channel values of a randomly specified number of second sample images may be set to 0 to obtain third sample images, and the coarse first-level sub-model is trained with the second and third sample images to obtain the finally trained first-level sub-model. For example, the pixel values of the highlighted parts of some second sample images may be set to 0, i.e. the fourth-channel values greater than a preset threshold are set to 0, giving third sample images; the coarse first-level sub-model is then iteratively trained on second and third sample images drawn at random, until the number of iterations reaches a preset count or the loss rate falls below a preset threshold, yielding the trained first-level sub-model.
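The highlight-zeroing augmentation just described can be sketched as below. It is applied to a randomly chosen subset of second sample images (that batch-level choice is not shown); the 0.7 cut-off is an assumed stand-in for the embodiment's preset threshold.

```python
import numpy as np

def occlude_highlights(sample, threshold=0.7):
    """Zero fourth-channel (heat-map) values above `threshold`,
    hiding the most salient region of an (H, W, 4) second sample
    image to produce a third sample image."""
    out = sample.copy()
    heat = out[..., 3]
    out[..., 3] = np.where(heat > threshold, 0.0, heat)
    return out

sample = np.zeros((2, 2, 4))
sample[..., 3] = [[0.9, 0.2], [0.8, 0.5]]
occluded = occlude_highlights(sample)
print(occluded[..., 3].tolist())  # -> [[0.0, 0.2], [0.0, 0.5]]
```

Hiding the strongest heat-map evidence forces the model to rely on context as well, which is how the augmentation both curbs over-fitting and helps with occluded offending objects.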
S209. Determine, from the first sample images, fourth sample images whose classification loss rate is greater than the preset value.
After the first sample images are input into the coarse first-level sub-model, a score of each first sample image being an offending image is obtained, from which the classification loss rate of each first sample image can be computed; the first sample images whose loss rate is greater than the preset value become the fourth sample images, i.e. the hard samples that the first-level sub-model finds difficult to classify as positive or negative.
S210. Acquire heat maps of the fourth sample images.
For example, the fourth sample images may be input into the trained coarse second-level sub-model to obtain third scores of the fourth sample images, and heat maps of the fourth sample images are generated from the third scores based on Grad-CAM; refer to the acquisition of heat maps of the first sample images in S206, not detailed again here.
S211. Concatenate the heat maps with the fourth sample images to obtain fifth sample images.
For example, the pixel values of a heat map may be taken as the fourth-channel values of the corresponding fourth sample image to concatenate the heat map with the fourth sample image; refer to S207 for details, not detailed again here.
S212. Train the coarse second-level sub-model with the fifth sample images to obtain the finally trained second-level sub-model.
Training the coarse second-level sub-model with the fifth sample images follows the training of the coarse first-level sub-model in S208 and is not detailed again here.
In an exemplary embodiment of the present application, the last convolutional layer of the coarse second-level sub-model uses a deformable convolution kernel, so the receptive field of the second-level sub-model is variable, enabling the second-level sub-model to learn the features of offending objects and strengthening its ability to discriminate them.
The video review model of this embodiment includes a first-level sub-model and a second-level sub-model. After initialization, the first-level sub-model is trained with the first sample images and its classification loss rate is computed from the classification labels; in response to determining that the loss rate is greater than the preset value, the second-level sub-model is trained with the same first sample images. Because a first sample image whose loss rate exceeds the preset value is a hard sample whose positive or negative label is difficult to distinguish, the hard samples can be used to train the second-level sub-model so that it learns to separate hard samples; ultimately the whole video review model can accurately distinguish positive from negative samples, accurately determine that offending images exist in a video, and improve the accuracy of videos sent for manual review.
For example, the two sub-models are first coarsely trained with the first sample images and then trained with sample images concatenated with heat maps: on the one hand, coarse training speeds up model convergence; on the other hand, the heat maps added to the sample images provide weakly supervised data for model training and raise the video review model's classification accuracy.
For example, adding an attention mechanism module to the first-level and second-level sub-models makes the model attend to the local region of the offending object in the image, which helps improve the video review model's ability to detect offending objects.
For example, the last convolutional layer of the second-level sub-model uses a deformable convolution kernel, so the second-level sub-model can better learn the features of offending objects and discriminate them more reliably.
For example, randomly setting the pixel values of highlighted heat-map regions to 0 both avoids model over-fitting and improves the model's ability to identify occluded offending objects, making that identification more robust.
FIG. 3 is a flowchart of the steps of a video review method provided by an embodiment of the present application, applicable to reviewing videos with a trained video review model. The method may be executed by the video review apparatus of the embodiments, which may be implemented in hardware or software and integrated into the electronic device provided by the embodiments. For example, as shown in FIG. 3, the method may include the following steps:
S301. Acquire a video image from a video to be reviewed.
In the embodiments of the present application, the video to be reviewed may be a short video; for example, it may be a live video on a live-streaming platform, a short video on a short-video platform, or of course a long video. After the video to be reviewed is determined, a certain number of video images may be captured from it, for example at a certain sampling rate or at a certain time interval; the embodiments do not restrict the manner of obtaining video images from the video to be reviewed.
S302. Input the video image into a pre-trained video review model to obtain a score of the video image being an offending image, where the score includes a first score and a second score; the first-level sub-model is configured to predict the first score of the video image being an offending image and, in response to determining that the first score is less than a preset value, output the first score, and the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict the second score of the video image being an offending image and output the second score.
The video review model of this embodiment may be trained by the video review model training method of the preceding embodiments. The model includes a cascaded first-level sub-model and second-level sub-model: the video image first enters the first-level sub-model to obtain the first score of the image being an offending image; if the first score is less than the preset value, the model outputs the first score; if the first score is greater than the preset value, the image enters the second-level sub-model, which obtains and outputs the second score.
S303. In response to determining that the score is greater than a preset threshold, review the video to be reviewed.
If the score of the video image is greater than the preset threshold, the image very likely contains an offending object; the user ID of the video to be reviewed and the video image can be sent to the back end, where the video is reviewed manually.
During training, the first-level sub-model's predictions yield the classification loss rate of the first sample images; since first sample images whose loss rate exceeds the preset value are hard samples whose positive or negative label is difficult to distinguish, those hard samples can be used to train the second-level sub-model so that it learns to distinguish hard samples. Ultimately the whole cascaded video review model can accurately separate positive and negative samples, accurately determine that offending images exist in a video, and improve the accuracy of videos sent for manual review.
FIG. 4 is a structural block diagram of a video review model training apparatus provided by an embodiment of the present application. As shown in FIG. 4, the apparatus includes:
a sample acquisition module 401 configured to acquire a first sample image and a classification label of the first sample image;
a model initialization module 402 configured to initialize a video review model, the video review model including a first-level sub-model and a second-level sub-model;
a first-level sub-model training module 403 configured to train the first-level sub-model with the first sample image and to compute, according to the classification label, the classification loss rate of the first-level sub-model in classifying the first sample image;
a second-level sub-model training module 404 configured to train the second-level sub-model with the first sample image in response to determining that the classification loss rate is greater than a preset value.
The video review model training apparatus provided by the embodiments of the present application can execute the video review model training methods provided by the preceding embodiments, and has the functional modules and beneficial effects corresponding to the executed methods.
FIG. 5 is a structural block diagram of a video review apparatus provided by an embodiment of the present application. As shown in FIG. 5, the apparatus may include the following modules:
a video image acquisition module 501 configured to acquire a video image from a video to be reviewed;
a model prediction module 502 configured to input the video image into a pre-trained video review model to obtain a score of the video image being an offending image, where the score includes a first score and a second score, the video review model includes a first-level sub-model and a second-level sub-model, the first-level sub-model is configured to predict the first score of the video image being an offending image and, in response to determining that the first score is less than a preset value, output the first score, and the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict the second score of the video image being an offending image and output the second score;
a review module 503 configured to review the video to be reviewed in response to determining that the score is greater than a preset threshold;
where the video review model is trained by the video review model training method of the preceding embodiments.
The video review apparatus provided by the embodiments of the present application can execute the video review method provided by the embodiments, and has the functional modules and beneficial effects corresponding to the executed method.
Referring to FIG. 6, a schematic structural diagram of an electronic device in one example of the present application is shown. As shown in FIG. 6, the electronic device may include a processor 601, a storage device 602, a display screen 603 with a touch function, an input device 604, an output device 605, and a communication device 606. The number of processors 601 in the electronic device may be one or more; one processor 601 is taken as an example in FIG. 6. The processor 601, storage device 602, display screen 603, input device 604, output device 605, and communication device 606 of the electronic device may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 6. The electronic device is configured to execute the video review model training method and/or the video review method provided by any embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by the processor of a device, the device can execute the video review model training method and/or the video review method described in the above method embodiments. The computer-readable storage medium may be a non-transitory computer-readable storage medium.
Note that, as the apparatus, electronic-device, and storage-medium embodiments are substantially similar to the method embodiments, they are described briefly; refer to the relevant parts of the method embodiments for details.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present application. In this specification, schematic expressions of these terms do not necessarily refer to the same embodiment or example; moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Claims (16)

  1. A video review model training method, comprising:
    acquiring a first sample image and a classification label of the first sample image;
    initializing a video review model, the video review model comprising a first-level sub-model and a second-level sub-model;
    training the first-level sub-model with the first sample image, and computing, according to the classification label, a classification loss rate of the first-level sub-model in classifying the first sample image;
    in response to determining that the classification loss rate is greater than a preset value, training the second-level sub-model with the first sample image.
  2. The video review model training method according to claim 1, wherein acquiring the first sample image and the classification label of the first sample image comprises:
    acquiring an original image;
    performing image enhancement and normalization on the original image to obtain the first sample image;
    determining the classification label of the first sample image based on an annotation operation, the classification label indicating that the first sample image is a normal image or an offending image.
  3. The video review model training method according to claim 1, wherein training the first-level sub-model with the first sample image and computing, according to the classification label, the classification loss rate of the first-level sub-model in classifying the first sample image comprises:
    coarsely training the first-level sub-model with a specified number of first sample images to obtain a coarse first-level sub-model and a first score of each first sample image being an offending image;
    computing the classification loss rate of the first sample image from the first score of the first sample image and the classification label;
    acquiring a heat map of the first sample image;
    concatenating the heat map with the first sample image to obtain a second sample image;
    training the coarse first-level sub-model with the second sample image to obtain a finally trained first-level sub-model.
  4. The video review model training method according to claim 3, wherein the first-level sub-model comprises convolutional layers, an attention mechanism module, fully connected layers, and a classification layer;
    coarsely training the first-level sub-model with the specified number of first sample images to obtain the coarse first-level sub-model and the first score of each first sample image being an offending image comprises:
    inputting the first sample image into the first-level sub-model; for a convolutional layer connected to the attention mechanism module, inputting the output features of the convolutional layer into the attention mechanism module to obtain the final output features of the attention mechanism module for input into the next convolutional layer;
    passing the output features of the last convolutional layer through the fully connected layers and the classification layer in turn to obtain the first score of the first sample image being an offending image, and returning to the step of inputting the first sample image into the first-level sub-model until the specified number of first sample images have been input into the first-level sub-model.
  5. The video review model training method according to claim 4, wherein, for a convolutional layer connected to the attention mechanism module, inputting the output features of the convolutional layer into the attention mechanism module to obtain the final output features of the attention mechanism module for input into the next convolutional layer comprises:
    inputting the output features of the convolutional layer into a channel attention sub-module of the attention mechanism module to obtain channel features;
    multiplying the channel features by the output features of the convolutional layer to obtain intermediate features;
    inputting the intermediate features into a spatial attention sub-module of the attention mechanism module to obtain spatial features;
    multiplying the spatial features by the intermediate features to obtain the final output features of the attention mechanism module for input into the next convolutional layer.
  6. The video review model training method according to claim 4, wherein acquiring the heat map of the first sample image comprises:
    inputting all first sample images into the coarse first-level sub-model to obtain second scores of the first sample images being offending images;
    generating heat maps of the first sample images from the second scores based on Gradient-weighted Class Activation Mapping (Grad-CAM).
  7. The video review model training method according to claim 4, wherein concatenating the heat map with the first sample image to obtain the second sample image comprises:
    concatenating the pixel values of the heat map onto a fourth channel of the first sample image to obtain the second sample image, the first, second, and third channels of the second sample image respectively being the RGB values of the second sample image.
  8. The video review model training method according to claim 7, wherein training the coarse first-level sub-model with the second sample image to obtain the finally trained first-level sub-model comprises:
    randomly setting the fourth-channel values of a specified number of second sample images to 0 to obtain third sample images;
    training the coarse first-level sub-model with the second sample images and the third sample images to obtain the finally trained first-level sub-model.
  9. The video review model training method according to any one of claims 3-8, wherein, in response to determining that the classification loss rate is greater than the preset value, training the second-level sub-model with the first sample image comprises:
    after each first sample image is used to coarsely train the first-level sub-model, in response to determining that the classification loss rate is greater than the preset value, coarsely training the second-level sub-model with that first sample image to obtain a coarse second-level sub-model, until the specified number of first sample images have been used to coarsely train the first-level sub-model;
    determining, from the first sample images, fourth sample images whose classification loss rate is greater than the preset value;
    acquiring heat maps of the fourth sample images;
    concatenating the heat maps with the fourth sample images to obtain fifth sample images;
    training the coarse second-level sub-model with the fifth sample images to obtain a finally trained second-level sub-model.
  10. The video review model training method according to claim 9, wherein acquiring the heat maps of the fourth sample images comprises:
    inputting the fourth sample images into the trained coarse second-level sub-model to obtain third scores of the fourth sample images being offending images;
    generating the heat maps of the fourth sample images from the third scores based on Grad-CAM.
  11. The video review model training method according to claim 9, wherein the convolution kernel of the last convolutional layer of the second-level sub-model is a deformable convolution kernel.
  12. A video review method, comprising:
    acquiring a video image from a video to be reviewed;
    inputting the video image into a pre-trained video review model to obtain a score of the video image being an offending image; wherein the score comprises a first score and a second score, the video review model comprises a first-level sub-model and a second-level sub-model, the first-level sub-model is configured to predict the first score of the video image being an offending image and, in response to determining that the first score is less than a preset value, output the first score; the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict the second score of the video image being an offending image and output the second score;
    in response to determining that the score is greater than a preset threshold, reviewing the video to be reviewed;
    wherein the video review model is trained by the video review model training method according to any one of claims 1-11.
  13. A video review model training apparatus, comprising:
    a sample acquisition module configured to acquire a first sample image and a classification label of the first sample image;
    a model initialization module configured to initialize a video review model, the video review model comprising a first-level sub-model and a second-level sub-model;
    a first-level sub-model training module configured to train the first-level sub-model with the first sample image and to compute, according to the classification label, a classification loss rate of the first-level sub-model in classifying the first sample image;
    a second-level sub-model training module configured to train the second-level sub-model with the first sample image in response to determining that the classification loss rate is greater than a preset value.
  14. A video review apparatus, comprising:
    a video image acquisition module configured to acquire a video image from a video to be reviewed;
    a model prediction module configured to input the video image into a pre-trained video review model to obtain a score of the video image being an offending image, wherein the score comprises a first score and a second score, the video review model comprises a first-level sub-model and a second-level sub-model, the first-level sub-model is configured to predict the first score of the video image being an offending image and, in response to determining that the first score is less than a preset value, output the first score, and the second-level sub-model is configured to, in response to determining that the first score is greater than the preset value, predict the second score of the video image being an offending image and output the second score;
    a review module configured to review the video to be reviewed in response to determining that the score is greater than a preset threshold;
    wherein the video review model is trained by the video review model training method according to any one of claims 1-11.
  15. An electronic device, comprising:
    one or more processors;
    a storage device configured to store one or more programs,
    which, when executed by the one or more processors, cause the one or more processors to implement the video review model training method according to any one of claims 1-11 and/or the video review method according to claim 12.
  16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the video review model training method according to any one of claims 1-11 and/or the video review method according to claim 12.
PCT/CN2022/074703 2021-02-09 2022-01-28 Video review model training method, video review method and related apparatus WO2022171011A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110181850.6A CN112818888B (zh) 2021-02-09 Video review model training method, video review method and related apparatus
CN202110181850.6 2021-02-09

Publications (1)

Publication Number Publication Date
WO2022171011A1 true WO2022171011A1 (zh) 2022-08-18

Family

ID=75864970

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074703 WO2022171011A1 (zh) 2021-02-09 2022-01-28 视频审核模型训练方法、视频审核方法及相关装置

Country Status (2)

Country Link
CN (1) CN112818888B (zh)
WO (1) WO2022171011A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471698A * 2022-09-06 2022-12-13 湖南经研电力设计有限公司 Method and system for classifying remote-sensing images of power transmission and transformation projects based on a deep learning network
CN118333690A * 2024-06-17 2024-07-12 浦江三思光电技术有限公司 Intelligent advertisement review method, system, and terminal

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818888B (zh) 2021-02-09 2024-09-06 广州市百果园信息技术有限公司 Video review model training method, video review method and related apparatus
CN113590944B (zh) 2021-07-23 2024-01-19 北京达佳互联信息技术有限公司 Content search method and apparatus
CN114022800A (zh) 2021-09-27 2022-02-08 百果园技术(新加坡)有限公司 Model training method, offending live-stream identification method, apparatus, device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110191356A * 2019-06-06 2019-08-30 北京字节跳动网络技术有限公司 Video review method, apparatus, and electronic device
CN111225234A * 2019-12-23 2020-06-02 广州市百果园信息技术有限公司 Video review method, video review apparatus, device, and storage medium
CN111385602A * 2018-12-29 2020-07-07 广州市百果园信息技术有限公司 Multi-level multi-model video review method, medium, and computer device
CN112818888A * 2021-02-09 2021-05-18 广州市百果园信息技术有限公司 Video review model training method, video review method and related apparatus

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403198B * 2017-07-31 2020-12-22 广州探迹科技有限公司 Official-website identification method based on cascaded classifiers
CN109145766B * 2018-07-27 2021-03-23 北京旷视科技有限公司 Model training method and apparatus, recognition method, electronic device, and storage medium
CN109784293B * 2019-01-24 2021-05-14 苏州科达科技股份有限公司 Multi-class target object detection method, apparatus, electronic device, and storage medium
CN109934226A * 2019-03-13 2019-06-25 厦门美图之家科技有限公司 Key-region determination method, apparatus, and computer-readable storage medium
CN111090776B * 2019-12-20 2023-06-30 广州市百果园信息技术有限公司 Video review method, apparatus, review server, and storage medium
CN111143612B * 2019-12-27 2023-06-27 广州市百果园信息技术有限公司 Video review model training method, video review method, and related apparatus
CN112052877B * 2020-08-06 2024-04-09 杭州电子科技大学 Fine-grained image classification method based on a cascaded enhancement network
CN112131978B * 2020-09-09 2023-09-01 腾讯科技(深圳)有限公司 Video classification method and apparatus, electronic device, and storage medium
CN112132196B * 2020-09-14 2023-10-20 中山大学 Cigarette-pack defect identification method combining deep learning and image processing


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471698A * 2022-09-06 2022-12-13 湖南经研电力设计有限公司 Method and system for classifying remote-sensing images of power transmission and transformation projects based on a deep learning network
CN115471698B (zh) * 2022-09-06 2023-06-30 湖南经研电力设计有限公司 Method and system for classifying remote-sensing images of power transmission and transformation projects based on a deep learning network
CN118333690A * 2024-06-17 2024-07-12 浦江三思光电技术有限公司 Intelligent advertisement review method, system, and terminal

Also Published As

Publication number Publication date
CN112818888B (zh) 2024-09-06
CN112818888A (zh) 2021-05-18

Similar Documents

Publication Publication Date Title
WO2022171011A1 (zh) Video review model training method, video review method and related apparatus
JP6994588B2 (ja) 顔特徴抽出モデル訓練方法、顔特徴抽出方法、装置、機器および記憶媒体
CN109784293B (zh) 多类目标对象检测方法、装置、电子设备、存储介质
WO2019085905A1 (zh) 图像问答方法、装置、系统和存储介质
US8792722B2 (en) Hand gesture detection
US8750573B2 (en) Hand gesture detection
WO2018054329A1 (zh) 物体检测方法和装置、电子设备、计算机程序和存储介质
CN112052831B (zh) 人脸检测的方法、装置和计算机存储介质
CN108780508A (zh) 用于归一化图像的系统和方法
CN112348036A (zh) 基于轻量化残差学习和反卷积级联的自适应目标检测方法
CN113434716B (zh) 一种跨模态信息检索方法和装置
CN113627504B (zh) 基于生成对抗网络的多模态多尺度特征融合目标检测方法
WO2022178833A1 (zh) 目标检测网络的训练方法、目标检测方法及装置
Pedraza et al. Really natural adversarial examples
CN111428740A (zh) 网络翻拍照片的检测方法、装置、计算机设备及存储介质
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
Zhao et al. Improved algorithm for face mask detection based on Yolo-V4
Zhao et al. Scene-adaptive crowd counting method based on meta learning with dual-input network DMNet
CN115205157B (zh) 图像处理方法和系统、电子设备和存储介质
CN113361336B (zh) 基于注意力机制的视频监控场景下行人视图属性的定位与识别方法
CN116612355A (zh) 人脸伪造识别模型训练方法和装置、人脸识别方法和装置
CN114064973B (zh) 视频新闻分类模型建立方法、分类方法、装置及设备
CN115294636A (zh) 一种基于自注意力机制的人脸聚类方法和装置
Meng et al. A Novel Steganography Algorithm Based on Instance Segmentation.
Shi et al. Semantic-driven context aggregation network for underwater image enhancement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22752172

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22752172

Country of ref document: EP

Kind code of ref document: A1