CN112926429A - Machine audit model training method, video machine audit method, device, equipment and storage medium

Machine audit model training method, video machine audit method, device, equipment and storage medium

Info

Publication number
CN112926429A
Authority
CN
China
Prior art keywords
video
model
scene
loss value
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110189459.0A
Other languages
Chinese (zh)
Inventor
罗雄文
卢江虎
项伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bigo Technology Singapore Pte Ltd
Priority to CN202110189459.0A
Publication of CN112926429A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The embodiments of the invention provide a machine review model training method, a video machine review method, an apparatus, a device and a storage medium. The machine review model training method comprises the following steps: obtaining a first training sample; using a pre-trained multi-label scene recognition model to recognize the probabilities that the first training sample belongs to the scene corresponding to each label, and calculating a scene adversarial loss value based on the probabilities; calculating a loss value of the machine review model based on the scene adversarial loss value in the process of training the machine review model with the first training sample; and determining that training of the machine review model is finished when the loss value of the machine review model meets a preset condition. During training of the machine review model, the influence of scene factors on video violations is taken into account: the pre-trained scene recognition model provides scene control information for adversarial learning, which improves the adaptability of the machine review model to complex scenes and thus its recognition accuracy.

Description

Machine audit model training method, video machine audit method, device, equipment and storage medium
Technical Field
The invention belongs to the field of Internet technology, and in particular relates to a machine review model training method, a video machine review method, an apparatus, a device and a storage medium.
Background
With the spread and growth of video services on the Internet and on mobile devices, the volume of uploaded videos increases day by day, and the corresponding review workload grows with it. The quality of video review determines the scale to which a video service can grow and its attractiveness to users.
Because the volume of videos online is huge, most review pipelines use machine review (automatic review) and human review (manual review) together: a machine first finds a candidate set of violating videos, and the candidate set is then pushed to reviewers for multiple rounds of manual review. Machine review is fast and accurate, and it can cut the human review workload to a tenth or less, so it is a key link in the review process. Faced with a huge volume of videos to review, improving the precision of the machine review stage is an important part of the review safety net and is of great significance for reducing labor cost.
At present, mainstream video machine review methods train a visual classification/detection model with a deep learning algorithm and use the model to recognize the degree of violation of a video. Such review models generally come in two kinds: one merges all violation types into a single large "penalty" class; the other splits each violation type into single-level or multi-level labels, with each label handled by its own small machine review model. However, because only the image features of the video are considered during training, the accuracy with which such machine review models recognize the degree of violation of a video is low.
Disclosure of Invention
In view of this, the invention provides a method, a device, equipment and a storage medium for machine audit model training and video machine audit, which solve the problem of low identification precision of the existing machine audit model to a certain extent.
In a first aspect, an embodiment of the present invention provides a machine review model training method, where the method includes: obtaining a first training sample; using a pre-trained multi-label scene recognition model to recognize the probabilities that the first training sample belongs to the scene corresponding to each label, and calculating a scene adversarial loss value based on the probabilities; calculating a loss value of the machine review model based on the scene adversarial loss value in the process of training the machine review model with the first training sample; and determining that training of the machine review model is finished when the loss value of the machine review model meets a preset condition.
In a second aspect, an embodiment of the present invention provides a video machine review method, where the method includes: obtaining a pre-trained machine review model, the machine review model being trained with the above machine review model training method; and inputting a video to be audited into the machine review model to obtain a recognition result output by the machine review model, the recognition result indicating the degree of violation of the video to be audited.
In a third aspect, an embodiment of the present invention provides a machine review model training apparatus, where the apparatus includes: a first acquisition module, configured to obtain a first training sample; a first recognition module, configured to use a pre-trained multi-label scene recognition model to recognize the probabilities that the first training sample belongs to the scene corresponding to each label, and to calculate a scene adversarial loss value based on the probabilities; and a training module, configured to calculate a loss value of the machine review model based on the scene adversarial loss value in the process of training the machine review model with the first training sample, and to determine that training of the machine review model is finished when the loss value of the machine review model meets a preset condition.
In a fourth aspect, an embodiment of the present invention provides a video machine review apparatus, where the apparatus includes: a second acquisition module, configured to obtain a pre-trained machine review model, the machine review model being trained with the above machine review model training method; and a second recognition module, configured to input a video to be audited into the machine review model to obtain a recognition result output by the machine review model, the recognition result indicating the degree of violation of the video to be audited.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the machine review model training method described above or the video machine review method described above.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the machine review model training method described above or the video machine review method described above.
According to the machine review model training method of the embodiments of the invention, a first training sample is obtained; a pre-trained multi-label scene recognition model is used to recognize the probabilities that the first training sample belongs to the scene corresponding to each label, and a scene adversarial loss value is calculated based on the probabilities; in the process of training the machine review model with the first training sample, the loss value of the machine review model is calculated based on the scene adversarial loss value; and training of the machine review model is determined to be finished when the loss value of the machine review model meets a preset condition. Thus, during training of the machine review model, the influence of scene factors on video violations is taken into account: the pre-trained scene recognition model provides scene control information for adversarial learning, and the loss value of the machine review model is calculated in combination with the scene adversarial loss value rather than only from the image features of the training samples, which improves the adaptability of the machine review model to complex scenes and thus its recognition accuracy.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 is a flowchart illustrating the steps of a machine review model training method according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating the steps of a video machine review method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a video review process according to an embodiment of the present invention.
Fig. 4 is a flow chart of a video machine review process according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of an edge detection enhancement process according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of co-shot templates according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a process of matching a filtered image with a co-shot template according to an embodiment of the present invention.
Fig. 8 is a diagram illustrating a process of removing special effects according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of a training process of a self-made special effect recognition model according to an embodiment of the present invention.
Fig. 10 is a schematic diagram of the feature extraction part of a self-made special effect recognition model according to an embodiment of the present invention.
FIG. 11 is a schematic diagram of the principle of a deep separation convolution according to an embodiment of the present invention.
FIG. 12 is a schematic diagram of a channel shuffle according to an embodiment of the present invention.
Fig. 13 is a schematic diagram of a pixel-to-pixel special effect suppression mapping model according to an embodiment of the present invention.
FIG. 14 is a diagram illustrating a machine review model training process according to an embodiment of the present invention.
FIG. 15 is a schematic diagram of a dual rail attention mechanism according to an embodiment of the present invention.
FIG. 16 is a block diagram of a machine review model training apparatus according to an embodiment of the present invention.
Fig. 17 is a block diagram of a video machine review apparatus according to an embodiment of the present invention.
Fig. 18 is a block diagram of an electronic device of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The machine review models of the "garbage-bin" type (all violation types merged into one large penalty class) and the fine-grained violation type (each violation type split into single-level or multi-level labels) are limited by the difficulty of collecting sensitive violating videos: training data is scarce, so violation-discovery precision is low, and the push volume of the machine review stage has to be increased to ensure that most violating videos are recalled. In addition, such models generally lack the ability to learn complex scenes, because violations can occur in a wide variety of scenes; since these machine review models make no use of normal data, they are easily disturbed by the changing scenes of videos, so accuracy improvement hits a bottleneck.
In view of these problems, the embodiments of the invention improve the adaptability of the machine review model to complex scenes by incorporating a scene excitation mechanism. However, simply combining a scene recognition model with a violation recognition model and training the two independently cuts the association between the scene recognition model and the machine review model, and errors easily accumulate because of their different probability distributions. For example, the scene recognition model and the violation recognition model may be connected in series so that scenes irrelevant to the current violation type are filtered out, reducing the probability of misprediction; or scene data may be used to pre-train the machine review model and thereby guide its learning of violation data, with the two sharing part of the weights; or the scores given by the scene recognition model and the violation recognition model may be integrated or post-processed, with pushing decided by the final score. In the embodiments of the present invention, by contrast, the pre-trained scene recognition model provides scene control information for adversarial learning while the machine review model is being trained; the machine review model and the scene recognition model mutually promote each other's precision, and their training is integrated, which avoids the error accumulation caused by different training states and randomness.
First, some terms mentioned in the embodiments of the present invention are explained.
A convolutional neural network: a mapping function for extracting image or video features, composed of stacked feature-extraction and mapping operations; its mapping parameters are obtained through end-to-end training and are used to summarize the image layers for discrimination and detection.
Convolutional layer: the basic feature-extraction component of a convolutional neural network, composed of a weighted summation of pixel values within a specific receptive field followed by a nonlinear activation, equivalent to one layer of a perceptron.
Pooling: an operation that summarizes features over a specific range, usually an aggregation over a specific dimension such as the spatial or channel dimension.
Grouped convolution: the feature maps of different channels are divided into several groups and convolved in parallel, which effectively improves memory efficiency.
ShuffleNet: a lightweight residual neural network whose key characteristics are the use of grouped convolution inside residual blocks and channel shuffling across groups of feature maps to increase information exchange.
Residual block: a convolutional neural network module that uses a skip connection, or a new convolution operation on the bypass, to improve back-propagation efficiency; it enables very deep convolutional networks and usually has two or more branches.
Bottleneck: a three-layer residual block structure that changes the number of input channels to reduce the computation of the intermediate convolutional layer (the reverse also occurs in special cases); it replaces the residual block of a conventional residual neural network.
ResNet50: a residual neural network built from bottleneck residual blocks, named for its 50 convolutional layers.
Attention mechanism: a means of fitting the importance of different partial features.
ResNeSt101: a residual neural network that uses a grouped attention mechanism to enhance the key features of different groups of image channels while retaining the characteristics of a residual block.
DenseNet161: a neural network with dense channel connections, which has stronger feature-extraction capability than a conventional residual block.
Cross entropy: a loss function for ordinary classification problems, used to train classification networks.
ArcFace Loss: a neural network loss that applies a margin penalty to hard samples, used for training classification tasks with a large number of classes; it works better combined with cross entropy.
The following embodiments are described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating the steps of a machine review model training method according to an embodiment of the present invention.
As shown in fig. 1, the machine review model training method may include the following steps:
Step 101, obtaining a first training sample.
Step 102, using a pre-trained multi-label scene recognition model, recognizing the probabilities that the first training sample belongs to the scene corresponding to each label, and calculating a scene adversarial loss value based on the probabilities.
Step 103, calculating a loss value of the machine review model based on the scene adversarial loss value in the process of training the machine review model with the first training sample.
Step 104, determining that training of the machine review model is finished when the loss value of the machine review model meets a preset condition.
The first training sample is used to train the machine review model. The first training sample may include a first sample video and first annotation information corresponding to the first sample video, where the first annotation information indicates whether the first sample video is a violating video.
If the machine review model were trained on its own, then in the process of training it with the first training sample, the loss value of the machine review model would be calculated from the model's output for the first sample video and the corresponding first annotation information, that is, a loss value based only on the image features of the first training sample.
In the embodiments of the invention, the machine review model is instead trained in combination with a scene recognition model. A high-precision multi-label scene recognition model is trained in advance on collected data. Because some video pictures have the attributes of several scenes at once, such as a family scene, a dining scene, a game scene or a war scene, the labels of the multi-label scene recognition model are not mutually exclusive, and the model can attribute a video picture to several scenes. Using this pre-trained multi-label scene recognition model, the probabilities that the first training sample belongs to the scene corresponding to each label are recognized, and a scene adversarial loss value is calculated based on the probabilities: for each first training sample, the probabilities that it belongs to the scene corresponding to each label are recognized, and a single scene adversarial loss value is then computed over the probabilities of all first training samples.
In the process of training the machine review model with the first training sample, the loss value of the machine review model is computed jointly from the loss based on the image features of the training samples and the scene adversarial loss value. Training of the machine review model is determined to be finished when this loss value meets a preset condition; for example, the preset condition may be that the loss value is smaller than a preset loss threshold. A machine review model trained in this way adapts better to complex scenes.
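As a concrete illustration of this joint loss, the following is a minimal PyTorch-style sketch; the cross-entropy classification loss, the weighting factor adv_weight and the function name are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def combined_review_loss(review_logits, violation_labels, scene_adv_loss, adv_weight=0.1):
    """Combine the image-feature classification loss of the machine review model
    with the scene adversarial loss (adv_weight is an assumed hyperparameter)."""
    image_feature_loss = F.cross_entropy(review_logits, violation_labels)
    return image_feature_loss + adv_weight * scene_adv_loss
```

Under this sketch, training would stop once the combined value falls below the preset loss threshold.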
After the machine review model is obtained through training, the machine review model can be used for carrying out machine review operation on the video.
Fig. 2 is a flowchart illustrating the steps of a video machine review method according to an embodiment of the present invention.
As shown in fig. 2, the video machine review method may include the following steps:
Step 201, obtaining a pre-trained machine review model.
Step 202, inputting the video to be audited into the machine review model, and obtaining the recognition result output by the machine review model.
The video to be audited is a video that needs to be checked by the machine review model for violations. After the machine review model has been trained as described above, each video to be audited is input into the machine review model, and the recognition result output by the model is obtained through its internal recognition operations. The recognition result indicates the degree of violation of the video to be audited. For example, the recognition result may be a violation score of the video to be audited: when the violation score exceeds a preset score threshold, the video to be audited is determined to be a violating video; otherwise, it is determined to be a non-violating video.
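A minimal inference sketch, assuming a PyTorch-style model that outputs one violation logit per frame; the frame aggregation (taking the maximum frame score) and the threshold value are illustrative assumptions:

```python
import torch

@torch.no_grad()
def review_video(review_model, frames, score_threshold=0.5):
    """Score a batch of decoded frames and flag the video as violating when its
    violation score exceeds the preset threshold (assumed aggregation: max frame)."""
    review_model.eval()
    frame_scores = torch.sigmoid(review_model(frames)).flatten()
    video_score = frame_scores.max().item()
    return video_score, video_score > score_threshold
```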
During training of the machine review model, the influence of scene factors on video violations is taken into account: the pre-trained scene recognition model provides scene control information for adversarial learning, and the loss value of the machine review model is calculated in combination with the scene adversarial loss value rather than only from the image features of the training samples. This improves the adaptability of the machine review model to complex scenes and thus its recognition accuracy.
The embodiments of the invention are applied to the machine review stage on the review side, where the video to be audited goes through the machine review process performed by the machine review model. Videos that the machine review considers violating form a candidate set of violating videos pushed to reviewers for human review, and the violation attribute of each video is finally confirmed in the human review stage. For videos that the machine review considers normal, a reporting channel remains open: videos that are subsequently reported also form a set pushed to reviewers for human review, so that important violating videos are not missed. In addition, the embodiments of the invention add an interference suppression stage, which reduces the interference, in judging the violation type, of extra information added to the video by the business side or the user (such as co-shot layouts, special effects and clips), making up for the weak interference resistance of the machine review model.
Fig. 3 is a schematic diagram of a video review process according to an embodiment of the present invention. As shown in fig. 3, the video review process may include: the video to be audited, the machine review stage, and the human review stage.
Firstly, a video to be audited is obtained and is audited. For example, a video to be audited may be obtained from a video stored in a database, or a video uploaded by a user in real time may be used as the video to be audited, and so on.
The video to be audited then enters the machine review stage. The machine review stage of the embodiments of the invention may comprise three parts: co-shot video decomposition, video special effect suppression, and violating video recognition. The first two parts suppress the interference of co-shot layouts and video special effects with violating video recognition, and the third part is a machine review model combined with a scene recognition strategy for recognizing possibly violating videos. In the co-shot video decomposition part, co-shot recognition is performed on the video to be audited; if it is a co-shot video, it is split into several independent videos, and otherwise it is left unprocessed. The video special effect suppression part is divided, according to how the effect was applied, into suppression of business-side special effect templates and suppression of user self-made special effects. In the violating video recognition part, the video to be audited, now containing only source information, is recognized, and violation recognition is completed by the machine review model trained in combination with the scene recognition model. Videos recognized as violating in the machine review stage are pushed to the human review stage as a candidate set of violating videos; videos recognized as normal are not pushed, although the set of videos that were not pushed but were later reported can also be pushed to the human review stage.
The video to be audited then enters the human review stage, where a first review, a second review, and so on up to a final review are performed, and the reviewer finally confirms the violation attribute of the video to be audited so that subsequent actions can be taken.
Fig. 4 is a flow chart of a video machine review process according to an embodiment of the present invention.
As shown in fig. 4, the video machine review process performs three functions: co-shot video decomposition, video special effect suppression, and violating video recognition and pushing. Before the deep-learning machine review model judges the violation type of a video, some of the video's interference factors are removed.
Specifically, a co-shot video formed by combining several video pictures in a certain layout is split and restored into several independent videos, and each independent video is input into the machine review model separately to judge the violation type. This greatly reduces the complexity of the pictures fed to the machine review model and helps improve precision.
Furthermore, if special effects have been added to a video, they are erased or the original is restored, depending on the type of effect. For a video that uses a business-side special effect template, the corresponding special effect fields and source video information are looked up directly in the information base, and the source video information is loaded to restore the original video. For other videos, which do not use a business-side special effect template, a classification model for user self-made special effects first judges whether a self-made effect has been added; videos with self-made effects then enter a pixel-to-pixel special effect suppression mapping model (pixel2pixel or p2p special effect suppression mapping model) that erases or weakens the effect in the picture.
Because most of the video additional interference information is eliminated, the machine review model can more easily mine violation information from the video with the special effect suppressed, and the precision is further improved.
After the processing of the two interference suppression modules, the video frames are submitted to the machine review model, which, having been trained in combination with a scene recognition strategy, gives the violation score for each violation type; the candidate set of violating videos is then pushed to reviewers for manual review according to the violation scores and the push ratio. Unlike the common approach of simply combining a scene recognition model with a violation recognition model, the violation recognition model of the embodiments of the invention is obtained through adversarial learning against the scene recognition model, and its precision is higher than that of directly combining the two models; the detailed design is described below.
Specifically, as shown in fig. 4, the video machine review process may include the following steps:
Step 401, acquiring a decoded video frame.
The video to be audited is obtained and decoded to obtain decoded video frames, which are processed in the subsequent steps.
In the embodiments of the invention, before the video to be audited is input into the machine review model for recognition, it undergoes interference suppression processing, and the processed video is then input into the machine review model.
Optionally, the interference suppression processing of the video to be audited may include: recognizing whether the video to be audited is a co-shot video, splitting it into independent videos if it is, and leaving it unprocessed if it is not; and/or recognizing whether the video to be audited is a special-effect video, suppressing the special effects if it is, and leaving it unprocessed if it is not. Here "and/or" means at least one: the interference suppression processing may include only the co-shot processing, only the special-effect processing, or both.
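The overall interference-suppression flow can be sketched as below; every helper object here (co_shot_detector, effect_detector, effect_suppressor) is a hypothetical placeholder for the components described in the following steps:

```python
def suppress_interference(video, co_shot_detector, effect_detector, effect_suppressor):
    """Sketch of the interference-suppression stage: split a co-shot video into
    independent videos first, then suppress special effects in each of them."""
    if co_shot_detector.is_co_shot(video):
        videos = co_shot_detector.split(video)
    else:
        videos = [video]
    cleaned = []
    for v in videos:
        if effect_detector.has_effect(v):
            v = effect_suppressor.suppress(v)
        cleaned.append(v)
    return cleaned
```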
In the embodiments of the present invention, the interference suppression processing includes both co-shot video processing and special-effect video processing, and the order between the two is not limited. Steps 402 to 405 below are the co-shot video processing (co-shot video decomposition), and steps 406 to 409 below are the special-effect video processing (video special effect suppression).
Step 402, performing edge detection enhancement processing on the video frame.
To recognize a co-shot video, some video frames are extracted from the video to be audited, uniformly or at random. For each extracted video frame, edge detection enhancement is performed to obtain a corresponding filtered image, and the filtered image is matched against the preset co-shot templates. If at least one extracted video frame is matched successfully, the video to be audited is determined to be a co-shot video; if all extracted video frames fail to match, it is determined not to be a co-shot video.
A co-shot video is essentially a combination of the pictures of several videos, and the combined pictures have obvious boundaries between them, so edge detection enhancement can be performed on the extracted video frames. To avoid the shortcomings of a single filter operator in edge detection, the embodiments of the invention further reduce noise and smooth the image by combining non-maximum suppression and bilinear interpolation on top of the image filtering.
Fig. 5 is a schematic diagram of an edge detection enhancement process according to an embodiment of the present invention. As shown in fig. 5, edge detection enhancement of a video frame may include: preliminary filtering of the video frame with an isotropic Sobel operator to obtain a preliminary filtered image; non-maximum suppression of the gradient intensity of the preliminary filtered image; and denoising and smoothing with bilinear interpolation to obtain the filtered image corresponding to the video frame.
In the preliminary filtering of the video frame with the isotropic Sobel operator, the horizontal template Gx and the vertical template Gy of the Sobel operator are applied in a sliding-window manner to compute, within each pixel's receptive field, the gradient magnitude

G = √(Gx² + Gy²)

and the gradient direction

θ = arctan(Gy / Gx).

These gradient magnitudes, weighted by the gradient directions, form the preliminary filtered image. The isotropic Sobel operator is direction-invariant, its position weighting coefficients are more accurate, and it avoids mutual interference between gradients in different directions. In the non-maximum suppression of the gradient intensity of the preliminary filtered image, regions where templates overlap heavily are suppressed and only the gradient information of the receptive field with the largest gradient magnitude is kept, which reduces false recognition of content edges and reduces noise. Finally, in the denoising and smoothing with bilinear interpolation, some pixel values are corrected by bilinear interpolation so that the filtered edges are smoother and continuous and noise is removed, yielding the edge-enhanced filtered image.
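A minimal sketch of the preliminary filtering step, assuming grayscale frames and the standard isotropic Sobel kernels with √2 center weights; non-maximum suppression and bilinear smoothing are omitted here:

```python
import numpy as np
from scipy.ndimage import convolve

SQRT2 = np.sqrt(2.0)
# Isotropic Sobel kernels: the sqrt(2) center weights give direction invariance.
GX = np.array([[-1.0, 0.0, 1.0],
               [-SQRT2, 0.0, SQRT2],
               [-1.0, 0.0, 1.0]])
GY = GX.T

def preliminary_filter(gray_frame):
    """Return the gradient magnitude G = sqrt(Gx^2 + Gy^2) and the gradient
    direction theta = arctan(Gy / Gx) for one grayscale frame."""
    gx = convolve(gray_frame.astype(np.float64), GX)
    gy = convolve(gray_frame.astype(np.float64), GY)
    magnitude = np.hypot(gx, gy)
    direction = np.arctan2(gy, gx)
    return magnitude, direction
```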
Step 403, matching the filtered image with the co-shot templates.
In the embodiments of the invention, at least one co-shot template can be preset based on practical experience. Fig. 6 is a schematic diagram of co-shot templates according to an embodiment of the present invention; it shows six preset co-shot templates, but in practice more kinds of templates may be configured.
After edge detection enhancement, the resulting filtered image can be matched against the different co-shot templates. Specifically, for each co-shot template, the first pixels at the boundary-line positions of that template are extracted from the template, the second pixels at the same positions are extracted from the filtered image, and the distance between the first and second pixels is computed. If the minimum of the computed distances is smaller than a preset distance threshold, the filtered image is determined to match the co-shot template corresponding to that minimum distance; if the minimum distance is greater than or equal to the preset distance threshold, the filtered image is determined not to match any co-shot template. Since there may be several first pixels and several second pixels, the distance is computed by forming a first pixel vector from the first pixels and a second pixel vector from the second pixels and then computing the distance between the two vectors.
Fig. 7 is a schematic diagram of the process of matching a filtered image with a co-shot template according to an embodiment of the present invention. As shown in fig. 7, matching a filtered image against a co-shot template may include: retaining the pixels of the filtered image according to the different co-shot templates; computing the Euclidean distance between each retained pixel image and the corresponding co-shot template; and taking the minimum Euclidean distance: if it is smaller than the preset distance threshold, the filtered image is judged to match the co-shot template corresponding to that minimum. As can be seen from the co-shot templates in fig. 6, the layout of a co-shot video in a template is relatively fixed and its boundaries are obvious, so the pixel-retaining operation keeps, on the filtered image, the pixel values at the positions corresponding to the boundary lines of the template and sets all other pixel values to 0. A pixel vector is computed from the retained pixel map and from the pixels at the corresponding boundary-line positions of each co-shot template, and the Euclidean distance between each pair of pixel vectors is computed. The co-shot template with the minimum Euclidean distance is taken, and if that distance is below the preset distance threshold, the match with that template is judged successful and the video to be audited is determined to be a co-shot video; otherwise the video to be audited is not a co-shot video.
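A minimal sketch of the template-matching step; the representation of a co-shot template as (boundary pixel coordinates, boundary pixel values) and the distance threshold are illustrative assumptions:

```python
import numpy as np

def match_co_shot_template(filtered_img, templates, dist_threshold):
    """Match an edge-enhanced frame against preset co-shot templates.
    templates: list of (coords, template_vals), where coords is a (K, 2) integer
    array of boundary-line pixel positions and template_vals their pixel values
    in the template. Returns the index of the matched template, or None."""
    best_idx, best_dist = None, float("inf")
    for idx, (coords, template_vals) in enumerate(templates):
        frame_vals = filtered_img[coords[:, 0], coords[:, 1]]   # retained pixels
        dist = np.linalg.norm(frame_vals - template_vals)       # Euclidean distance
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx if best_dist < dist_threshold else None
```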
Step 404, determining whether the video is a co-shot video. If yes, go to step 405; if not, go to step 406.
If at least one extracted video frame is matched successfully, the video to be audited is determined to be a co-shot video; if all extracted video frames fail to match, it is determined not to be a co-shot video.
Step 405, splitting the co-shot video according to the co-shot template.
After the video to be audited is judged to be a co-shot video, each of its video frames is split according to the co-shot template that matched it (the template corresponding to the minimum distance that matched the filtered image): the pixel coordinates of the boundary-line positions are used to crop the video frame, and the cropped frames are then combined into frame sets, yielding at least two independent videos split out of the video to be audited.
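A minimal sketch of the splitting step, assuming the matched template's boundary lines have already been converted into rectangular crop boxes; the box format is an illustrative assumption:

```python
def split_co_shot_frame(frame, crop_boxes):
    """Split one co-shot frame into independent sub-frames.
    crop_boxes: list of (top, bottom, left, right) pixel coordinates derived from
    the matched co-shot template, e.g. [(0, H, 0, W // 2), (0, H, W // 2, W)]
    for a side-by-side layout."""
    return [frame[top:bottom, left:right] for (top, bottom, left, right) in crop_boxes]
```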
Even after the co-shot processing, special effects may still be present in the video to be audited. Such effects generally appear as extra information added to the video and still interfere with extracting the video's actual violation information. The embodiments of the invention provide a video special effect processing flow according to the possible sources of the effects, to solve the problem of special effect interference. In general, short-video special effects have two sources: one is the business-side special effect templates, where a user uploads a video and selects a template to generate the special-effect video; the other is self-made effects created by the user. To recognize a special-effect video, the video information of the video to be audited is obtained and checked for a special effect information field: if the field exists, the video to be audited is determined to be a special-effect video containing a business-side special effect template; if not, a pre-trained self-made special effect recognition model is used to recognize whether it is a special-effect video containing a self-made effect.
Step 406, determining whether a business-side special effect template is included. If yes, go to step 407; if not, go to step 408.
Step 407, eliminating the special effect according to the business-side special effect template, then go to step 410.
Since the information of the source videos and materials is generally kept in a storage unit, if the video to be audited contains a business-side special effect template, a special effect information field will exist in its video information; this field may include, among other things, the type of the business-side special effect template used. Whether the effect comes from a business-side special effect template can therefore be confirmed from the special effect information field in the video information.
If the video to be audited is a special-effect video containing a business-side special effect template, the effect can be eliminated according to the specific template it contains, with different elimination methods for different template types.
Fig. 8 is a diagram illustrating a process of removing special effects according to an embodiment of the present invention. Fig. 8 lists only two elimination methods for business-side special effect templates; other template types can be handled in corresponding similar ways. As shown in fig. 8, during effect elimination, information is first extracted from the business-side special effect information field to determine the type of the template; the effect is then eliminated in the way appropriate to that type. For example, for a video-level dynamic special effect template, the storage information (such as the storage location) of the source video is queried from the database according to the VID (video identification) of the video and the template type, and the source video is fetched from the storage unit according to that storage information. For a photo-background special effect template (a special-effect background template for displaying photos), the set of photos uploaded by the user into the background template is taken out and encoded into a frame set in display order. Because eliminating a business-side special effect template involves no computationally intensive decoding operations, this part of effect elimination is fast.
Step 408, judging whether a self-made special effect is included. If yes, go to step 409; if not, go to step 410.
If the video information of the video to be audited has no special effect information field, it is further recognized whether the video is a special-effect video containing a self-made effect.
In the embodiments of the invention, a self-made special effect recognition model is trained in advance to recognize whether the video to be audited contains a self-made effect. The self-made special effect recognition model is trained as follows: obtaining a second training sample; training a multi-class self-made special effect recognition model with the second training sample, and using the feature extraction part of the trained multi-class model as the feature extraction part of a binary-class self-made special effect recognition model to be trained; then, keeping the parameters of that feature extraction part unchanged, training the binary-class model with the training sample and taking the trained binary-class model as the self-made special effect recognition model. The second training sample may include a second sample video and second annotation information indicating whether the second sample video contains a special effect and the special-effect video group to which the contained effect belongs.
Video special effects have a group characteristic: the same effect can be applied repeatedly to different videos, so videos using the same self-made effect can be placed in the same group. Exploiting this, and in order to obtain better video feature extraction, the embodiments of the invention first train the feature extraction part of the network with a multi-class head, then keep the feature extraction part fixed and train the final classification model that judges whether a video contains a self-made effect. Fig. 9 is a schematic diagram of the training process of the self-made special effect recognition model according to an embodiment of the present invention. As shown in fig. 9, a multi-class head is used as the output, with each class corresponding to the set of videos under one self-made effect (a special-effect video group) plus one set of videos with no effect, and the training of this multi-class model is guided by ArcFace Loss. After the model converges, the weight parameters of the network's feature extraction part are kept unchanged; all videos in the special-effect video groups are merged into one large class (the special-effect group), the videos without effects form the other class (the no-effect group), and a binary classification model is trained under conventional cross-entropy loss; the converged binary model is the final self-made special effect recognition model. This multi-task training improves the sensitivity of the feature extraction part to special effects and also speeds up model convergence.
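A two-stage training sketch in PyTorch style; the optimizer, learning rate and the use of plain cross entropy in place of ArcFace Loss in stage one are simplifying assumptions:

```python
import torch
import torch.nn as nn

def train_effect_recognizer(backbone, group_head, binary_head,
                            group_loader, binary_loader, epochs=1, lr=1e-2):
    """Stage 1: train the feature extractor with a multi-class head over the
    per-effect video groups. Stage 2: freeze the feature extractor and train a
    binary effect / no-effect head with cross entropy."""
    criterion = nn.CrossEntropyLoss()

    opt1 = torch.optim.SGD(list(backbone.parameters()) + list(group_head.parameters()), lr=lr)
    for _ in range(epochs):
        for frames, group_labels in group_loader:
            opt1.zero_grad()
            criterion(group_head(backbone(frames)), group_labels).backward()
            opt1.step()

    for p in backbone.parameters():        # keep feature-extraction weights fixed
        p.requires_grad = False

    opt2 = torch.optim.SGD(binary_head.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, has_effect in binary_loader:
            opt2.zero_grad()
            criterion(binary_head(backbone(frames)), has_effect).backward()
            opt2.step()
```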
Fig. 10 is a schematic diagram of the feature extraction part of the self-made special effect recognition model according to an embodiment of the present invention.
In the self-made special effect recognition model, to speed up discrimination, the feature extraction part is designed to save even more computation on top of the lightweight ShuffleNet. The 50-layer ultra-lightweight ShuffleNet50 feature extraction part has half as many channels per layer as a conventional bypass-branch network, reducing convolution operations by about 70%. Each bypass convolution module consists of an S-Block and an S-projection Block, whose detailed structure is shown on the right of fig. 10; following the principle of grouped convolution, each convolution branch in the module handles the convolution of part of the image channels, further reducing the number of convolution operations. In addition, unlike conventional ShuffleNet, the embodiments of the invention adopt the "inverse bottleneck" idea to reduce information loss during feature extraction, keeping the number of output channels at the bottleneck exit equal to the number of input channels and easing memory access overhead.
The self-made special effect recognition model also inherits other lightweight design ideas: 3 × 3 convolutions uniformly use depthwise separable convolution (DW Conv), and channel shuffling and linear 1 × 1 convolutions are used at the end of grouped convolutions to strengthen information exchange between channels and avoid bias. FIG. 11 is a schematic diagram of the principle of depthwise separable convolution according to an embodiment of the present invention. As shown in fig. 11, DW Conv splits an ordinary convolution into a single-channel spatial convolution and a cross-channel 1 × 1 convolution, cutting computation by about 90% while achieving equally good feature extraction. FIG. 12 is a schematic diagram of channel shuffling according to an embodiment of the present invention. As shown in fig. 12, the channel shuffle operation promotes the exchange of information between convolution groups by shuffling the order of the channels. Note that in the convolution block at each entry of the network, the feature maps are not grouped; instead DW Conv is performed in two branches with the same number of channels, because this block performs image down-sampling and the number of channels must be doubled as the feature map size decreases.
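A minimal PyTorch sketch of the two building blocks mentioned above, depthwise separable 3 × 3 convolution and channel shuffling; channel counts are illustrative and the full S-Block structure is not reproduced:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Shuffle channels across groups so information is exchanged between the
    grouped-convolution branches."""
    n, c, h, w = x.size()
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise (per-channel spatial) convolution followed by a linear 1x1
    cross-channel (pointwise) convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```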
If the video to be audited is not a co-shot video, it is input into the trained self-made special effect recognition model, which outputs a recognition result indicating whether the video contains a self-made effect. If the video to be audited is a co-shot video, each independent video split from it is input into the trained self-made special effect recognition model separately, and the model outputs a recognition result indicating whether each independent video contains a self-made effect.
Step 409, suppressing the self-made special effect by mapping.
If the video to be audited is a special-effect video containing a self-made effect, it is pushed to the pixel2pixel special effect suppression mapping model, which performs a pixel-to-pixel suppression mapping on the video to weaken or eliminate the display of the effect in the picture. Because this mapping requires full-pixel computation and is expensive, first recognizing whether a self-made effect is present effectively reduces the overall computational overhead of effect suppression.
Fig. 13 is a schematic diagram of the pixel-to-pixel special effect suppression mapping model according to an embodiment of the present invention. As shown in fig. 13, the model is a network whose input and output sizes are symmetric. It consists of two parts, a feature summarization operation and an image mapping operation, which are split into two networks trained adversarially against each other in a game-like manner to learn the effect-suppression mapping. The feature summarization part applies a series of convolutions, down-sampling, attention mechanisms and similar operations to a video frame containing the self-made effect; the image mapping part applies a series of deconvolutions, up-sampling, heatmap strategies and similar operations to obtain the video frame with the effect suppressed.
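A toy encoder/decoder with symmetric input and output sizes, standing in for the pixel-to-pixel suppression mapping network; the depth, channel counts and activations are illustrative assumptions, and the attention, heatmap and adversarial-training parts are not reproduced:

```python
import torch.nn as nn

class EffectSuppressor(nn.Module):
    """Feature summarization (encoder) followed by image mapping (decoder);
    input and output are both (N, 3, H, W) with H and W divisible by 4."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```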
Note that when recognizing whether the video to be audited contains a self-made effect, each video frame in the video is recognized, and each frame gets its own result indicating whether it contains a self-made effect. Effect suppression is applied to frames recognized as containing a self-made effect; frames recognized as not containing one are left unprocessed.
Step 410, using the machine review model to recognize the violation scores of all video frames.
After the two interference suppression modules, the single video with effects suppressed undergoes violation recognition by the scene-excitation-based machine review model.
FIG. 14 is a diagram illustrating a machine review model training process according to an embodiment of the present invention.
Unlike simply combining a scene recognition model with a machine review model, the embodiments of the invention use the trained multi-label scene recognition model to "excite" the training of the machine review model through the non-association between scenes and violating objects; that is, the multi-label scene recognition model improves the recognition precision of the machine review model through adversarial learning. First, data is collected to train a high-precision multi-label scene recognition model. As shown in fig. 14, the embodiments of the invention use DenseNet161, with its dense channel connections, as the multi-label scene recognition model. Because some pictures have the attributes of several scenes at once, such as a family scene, a dining scene, a game scene or a war scene, the labels of the multi-label scene recognition model are not mutually exclusive, and the model can attribute a video picture to several scenes; the scene recognition network realizes this simply by applying sigmoid activation to the non-mutually-exclusive classes. After the multi-label scene recognition model is trained to convergence, it provides the non-association information between scenes and violating objects during training of the machine review model, guiding that training in the form of a loss function.
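A minimal sketch of the non-mutually-exclusive scene head; the backbone (DenseNet161 in the patent) is abstracted into a feature vector, and the feature dimension is an illustrative assumption:

```python
import torch
import torch.nn as nn

class MultiLabelSceneHead(nn.Module):
    """Per-scene sigmoid outputs instead of a mutually exclusive softmax, so a
    picture can be attributed to several scenes at once."""
    def __init__(self, feat_dim, num_scenes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_scenes)

    def forward(self, features):
        return torch.sigmoid(self.fc(features))

# Non-exclusive labels are trained with binary cross entropy per scene:
criterion = nn.BCELoss()
```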
The training process of the machine check model of the embodiment of the invention can comprise the following steps A1-A4:
A1, a first training sample is obtained.
The first training sample may include a first sample video and first annotation information corresponding to the first sample video, where the first annotation information is used to indicate whether the first sample video is an illegal video.
A2, recognizing and obtaining the probability that the first training sample belongs to the scene corresponding to each label respectively by using a pre-trained multi-label scene recognition model, and calculating a scene confrontation loss value based on the probability.
For each first training sample, the first training sample is input into the multi-label scene recognition model to obtain, from the model output, the probability that the first training sample belongs to the scene corresponding to each label.
After the probability that the first training sample belongs to the scene corresponding to each label is identified, the association relationship between the first training sample and the scene can be determined based on the probability. For example, if the probability that a first training sample belongs to the scene corresponding to a certain label is greater than a preset probability threshold, the association relationship between the first training sample and that scene is determined to be association; if the probability is less than or equal to the preset probability threshold, the association relationship is determined to be non-association.
For example, the association relationship between the first training sample and the scene may be shown in the following table:
            Family scene   Catering scene   ……   Cartoon scene   Game scene
Sample 1         0               1          ……         1              1
Sample 2         0               0          ……         1              0
……              ……              ……          ……        ……             ……
Sample n         1               0          ……         0              1
The table records the association relationship between the violation type of the first training sample and each scene, where 1 indicates that the two are not associated and 0 indicates that the two may be associated. For example, a gun appearing in a game is not a violation, so the gun violation type is not associated with the game scene, and the corresponding entry is 1.
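A minimal sketch of deriving such an association matrix from the scene probabilities, following the thresholding rule described above, is given below; the threshold value of 0.5 is an illustrative assumption.

```python
# Sketch: turn scene probabilities into the association table above.
import torch

def build_association(scene_probs: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """scene_probs: (num_sample, num_scene) probabilities from the multi-label model.
    Returns M with M[i, j] = 1 when sample i is NOT associated with scene j."""
    associated = scene_probs > threshold          # probability above threshold -> associated
    return (~associated).long()                   # 1 = not associated, 0 = possibly associated

M = build_association(torch.tensor([[0.9, 0.2], [0.4, 0.7]]))
# tensor([[0, 1],
#         [1, 0]])
```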
Based on the probability that each first training sample belongs to the scene corresponding to each label, a scene confrontation loss value for exciting the training of the machine review model can be calculated. In the embodiment of the invention, the scene confrontation loss value can be calculated by the following Formula 1:
[Formula 1: scene confrontation loss value L; the original formula is provided as an image and is not reproduced in the text]
In Formula 1, L represents the scene confrontation loss value, P_{i,j} represents the probability that the i-th first training sample belongs to the j-th scene, P_{i,k} represents the probability that the i-th first training sample belongs to the k-th scene, num_sample represents the total number of first training samples, num_scene represents the total number of scenes, M(i,j) represents the association relationship between the i-th first training sample and the j-th scene, and M(i,j)=1 indicates that the i-th first training sample is not associated with the j-th scene.
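Since Formula 1 itself is only available as an image, the following sketch is a hedged guess at its general shape rather than the exact expression of the embodiment: it accumulates, for each sample, the probability mass assigned to non-associated scenes (M(i,j)=1), normalized by the sample's total scene probability and averaged over the samples.

```python
# Hedged guess at the form of the scene confrontation loss (not the patent's exact formula).
import torch

def scene_confrontation_loss(P: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """P: (num_sample, num_scene) scene probabilities P_{i,j};
    M: (num_sample, num_scene) with 1 = not associated, 0 = possibly associated."""
    num_sample = P.shape[0]
    # probability mass on non-associated scenes, normalized by the P_{i,k} sum per sample
    per_sample = (M * P).sum(dim=1) / P.sum(dim=1).clamp_min(1e-8)
    return per_sample.sum() / num_sample

L_adv = scene_confrontation_loss(torch.tensor([[0.8, 0.1], [0.3, 0.6]]),
                                 torch.tensor([[0.0, 1.0], [1.0, 0.0]]))
```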
A3, calculating a loss value of the machine review model based on the scene confrontation loss value in the process of training the machine review model by using the first training sample.
As shown in fig. 14, the machine review model in the embodiment of the present invention is a cascade model, and may include a cascaded primary model and secondary model. The primary model is a residual neural network Resnet50 with halved channels and is mainly used to reduce the computation of the secondary model: only videos preliminarily judged to be violating by the primary model are further judged by the secondary model. The residual network offers a good balance of performance and efficiency, and as the primary model it can filter out most normal videos. The secondary model is an integrated model, including the channel-halved residual neural network Resnet50 and a de-densified residual neural network ResNeSt101. The Resnet50 of the primary model is obtained by splitting it off from the secondary model, so the Resnet50 of the secondary model is the same as that of the primary model, and during training the Resnet50 of the primary model is trained as part of the secondary integrated model as a whole. Different from the conventional ResNeSt101, in the embodiment of the present invention, 1x1 convolution is used instead of the fully connected operation when the grouped attention map is calculated, which accelerates the inference speed of the network; after the grouped attention operation, a channel shuffling operation similar to ShuffleNet is performed among the channels of each large group to further enhance the information exchange among channels.
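For reference, a minimal sketch of the ShuffleNet-style channel shuffling operation mentioned above is given below; the group count used in the example is illustrative.

```python
# Sketch of ShuffleNet-style channel shuffling between channel groups.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so information is exchanged between them."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)      # (N, g, C/g, H, W)
    x = x.transpose(1, 2).contiguous()            # swap the group and per-group channel dims
    return x.view(n, c, h, w)

shuffled = channel_shuffle(torch.rand(1, 8, 4, 4), groups=2)
```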
Step A3 may include steps A31-A34:
A31, inputting the first training sample into the secondary model to obtain a first output of the channel-halved residual neural network and a second output obtained by integrating the channel-halved residual neural network and the de-densified residual neural network.
Step A31 may include steps A311 to A313:
and A311, in the residual error neural network with the halved channel, performing feature extraction on the first training sample to obtain a first feature map group, performing a double-track attention mechanism operation on the first feature map group to obtain a first attention enhancement feature map group, and performing a first processing operation on the first attention enhancement feature map group to obtain the first output.
In the embodiment of the invention, a dual-track attention mechanism operation is used to enhance the key information areas of an image from the two dimensions of space and channel, reducing the interference of useless features on the final judgment. The principle of the dual-track attention mechanism is shown in fig. 15. For the feature map group output by the last convolution block, attention weighting sequences of the spatial dimension and the channel dimension are obtained through two routes respectively; each weighting sequence is used to multiply the feature maps, and the results are added pixel by pixel to obtain the final attention enhancement map group. When solving the weighting sequence of the channel dimension, the feature map group is first summarized into a one-dimensional feature sequence by Channel-level Global Average Pooling (CGAP), then mapped through two fully connected layers and activated with sigmoid (considering that the attention weights are not mutually exclusive in category), so that an attention weight is obtained for each channel, and each attention weight is multiplied by all pixel values on that channel. When solving the weighting sequence of the spatial dimension, the feature map group is first summarized into a single-channel heat map (thermodynamic diagram) by a cross-channel 1x1 convolution, then a 3x3 convolution is applied followed by a pixel-wise sigmoid activation, so that an attention weight is obtained at each spatial position, and each attention weight is multiplied by the pixel values at the corresponding pixel position across all channels.
Accordingly, performing a dual-rail attention mechanism operation on the first set of feature maps to obtain a first set of attention enhancing feature maps may include: converting the first feature map group into a feature sequence with one-dimensional space dimensionality, calculating the one-dimensional feature sequence to obtain an attention weight value corresponding to each channel, and multiplying pixel values on the channels by the attention weight values corresponding to the channels respectively aiming at each channel to obtain a first map group; converting the first characteristic graph group into a thermodynamic diagram with a single channel dimension, calculating the thermodynamic diagram with the single channel to obtain an attention weight value corresponding to each spatial position, and multiplying a pixel value on the spatial position by the attention weight value corresponding to the spatial position respectively aiming at each spatial position to obtain a second graph group; and adding the first map group and the second map group pixel by pixel to obtain the first attention enhancement feature map group.
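A hedged PyTorch sketch of the dual-track attention operation described above is given below; the hidden width of the two fully connected layers (the reduction ratio) is an assumption of the sketch.

```python
# Sketch of the dual-track attention operation: a channel track (CGAP + two FC layers
# + sigmoid) and a spatial track (1x1 conv + 3x3 conv + sigmoid), added pixel by pixel.
import torch
import torch.nn as nn

class DualTrackAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # channel track: weights are not mutually exclusive, hence sigmoid
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # spatial track: cross-channel 1x1 conv -> single-channel heat map -> 3x3 conv -> sigmoid
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Conv2d(1, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (N, C, H, W) feature map group
        n, c, _, _ = x.shape
        w_channel = self.channel_fc(x.mean(dim=(2, 3)))        # (N, C) per-channel weights
        branch_c = x * w_channel.view(n, c, 1, 1)              # multiply every pixel on the channel
        w_spatial = self.spatial_conv(x)                       # (N, 1, H, W) per-position weights
        branch_s = x * w_spatial                               # multiply across all channels
        return branch_c + branch_s                             # pixel-by-pixel addition

enhanced = DualTrackAttention(channels=2048)(torch.rand(2, 2048, 7, 7))
```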
As shown in fig. 14, the process of performing the first processing operation on the first attention enhancing feature map group may include global avg pool (global average pooling operation), FC (full connection layer operation), and Softmax (classification operation).
And A312, in the de-densified residual error neural network, performing feature extraction on the first training sample to obtain a second feature map group, and performing a double-track attention mechanism operation on the second feature map group to obtain a second attention enhancement feature map group.
The process of performing a dual-rail attention mechanism operation on the second feature map set to obtain a second attention enhancement feature map set may include: converting the second feature map group into a feature sequence with one-dimensional space dimensionality, calculating the one-dimensional feature sequence to obtain an attention weight value corresponding to each channel, and multiplying pixel values on the channels by the attention weight values corresponding to the channels respectively aiming at each channel to obtain a third map group; converting the second characteristic graph group into a thermodynamic diagram with a single channel dimension, calculating the thermodynamic diagram of the single channel to obtain an attention weight value corresponding to each spatial position, and multiplying a pixel value on the spatial position by the attention weight value corresponding to the spatial position respectively aiming at each spatial position to obtain a fourth graph group; and adding the third map group and the fourth map group pixel by pixel to obtain the second attention enhancement feature map group.
And A313, integrating the first attention enhancement feature map group and the second attention enhancement feature map group to obtain an integrated feature map group, and performing a second processing operation on the integrated feature map group to obtain a second output.
As shown in fig. 14, in the process of obtaining the integrated feature map set by integrating the first attention enhanced feature map set and the second attention enhanced feature map set, global avg pool (global average pooling operation) is performed on the first attention enhanced feature map set, global avg pool (global average pooling operation) is performed on the second attention enhanced feature map set, and then Concat (integration operation) is performed to obtain the integrated feature map set. The process of performing the second processing operation on the integrated feature map group may include FC (full connectivity layer operation) and Softmax (classification operation).
A32, when the first output is smaller than a preset score threshold, calculating a first loss value based on the first output, and taking the first loss value as the loss value of the channel-halved residual neural network; when the first output is greater than or equal to the preset score threshold, calculating a second loss value based on the first output, calculating a third loss value based on the second output and the scene confrontation loss value, and taking the sum of the second loss value and the third loss value as the loss value of the channel-halved residual neural network.
A33, using the third loss value as the loss value of the de-densified residual neural network.
In order to enable the Resnet50 in the integrated model to be split off, during actual training the Resnet50 part can separately calculate a cross-entropy loss value to strengthen its optimization. The back-propagation of the integrated model's loss gradient is controlled through a label selection operation: when the primary model score of a sample is lower than the preset score threshold, the integrated model loss gradient of that sample does not need to be back-propagated.
When the first output is smaller than the preset score threshold, a first loss value is calculated based on the first output, where the first loss value may be a cross-entropy loss value calculated based on the first output, and the first loss value is taken as the loss value of the channel-halved residual neural network. When the first output is greater than or equal to the preset score threshold, a second loss value is calculated based on the first output, where the second loss value may be a cross-entropy loss value calculated based on the first output; a third loss value is calculated based on the second output and the scene confrontation loss value, and the sum of the second loss value and the third loss value is taken as the loss value of the channel-halved residual neural network.
And regarding the de-densified residual error neural network, taking the third loss value calculated based on the second output and the scene confrontation loss value as the loss value of the de-densified residual error neural network.
After the scene confrontation loss value is added, the violation tendency of some violating videos is weakened under the constraint of the scene, so their violation scores decrease and they are no longer pushed, while more seriously violating videos are pushed instead. The complete loss function is calculated from the cross-entropy loss value and the scene confrontation learning loss value: L = L_ce + L_Adversarial, where L_ce is the cross-entropy loss value and L_Adversarial is the scene confrontation learning loss value. Therefore, in the process of calculating the third loss value based on the second output and the scene confrontation loss value, a cross-entropy loss value, namely the integration loss value (Ensemble Loss), is calculated based on the second output, this cross-entropy loss value is then added to the scene confrontation loss value, and the sum is taken as the third loss value.
And A34, taking the sum of the loss value of the residual neural network with the halved channel and the loss value of the de-densified residual neural network as the loss value of the machine trial model.
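The following sketch illustrates, under stated assumptions, how the loss values of steps A32 to A34 can be combined; the score threshold, the use of class index 1 as the violation class, and the per-sample label selection mask are illustrative assumptions rather than values fixed by the embodiment.

```python
# Hedged sketch of the loss combination of steps A32-A34 (followed literally: the third
# loss appears both in the channel-halved loss and in the de-densified loss).
import torch
import torch.nn.functional as F

def review_model_loss(primary_logits, ensemble_logits, labels, scene_adv_loss,
                      score_threshold: float = 0.5):
    """primary_logits / ensemble_logits: (N, 2); labels: (N,); scene_adv_loss: scalar."""
    primary_ce = F.cross_entropy(primary_logits, labels)                 # cross entropy of the primary output
    primary_score = torch.softmax(primary_logits, dim=1)[:, 1]           # assumed violation-class score
    hard = primary_score >= score_threshold                              # label selection operation
    if not hard.any():
        return primary_ce                                                # easy batch: primary loss only (A32, first case)
    ensemble_ce = F.cross_entropy(ensemble_logits[hard], labels[hard])   # Ensemble Loss on hard samples
    third_loss = ensemble_ce + scene_adv_loss                            # third loss = L_ce + L_Adversarial
    channel_halved_loss = primary_ce + third_loss                        # A32, second case
    dedensified_loss = third_loss                                        # A33
    return channel_halved_loss + dedensified_loss                        # A34: loss of the machine review model
```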
A4, determining that the training of the machine review model is finished under the condition that the loss value of the machine review model meets the preset condition.
For example, the preset condition may be that the loss value of the machine review model is smaller than the preset loss threshold value at least twice in succession.
The structure of the trained machine review model is shown at the bottom of fig. 14. In the process of identifying the video to be audited, the video to be audited is input into the machine review model. In the machine review model, identification is first carried out by the primary model, the channel-halved Resnet50, to obtain a first output of the primary model, for example a violation score. When the first output (violation score) is smaller than the preset score threshold, the first output is taken as the output of the machine review model, yielding the violation score of the video to be audited; when the first output (violation score) is greater than or equal to the preset score threshold, identification is carried out by the secondary model (the channel-halved Resnet50 integrated with the de-densified ResNeSt101) to obtain a second output, for example a violation score, and the second output is taken as the output of the machine review model, yielding the violation score of the video to be audited.
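A minimal sketch of this cascade routing is given below; the model callables and the score threshold are placeholders, and each model is assumed to return a scalar violation score for a frame.

```python
# Sketch of cascade inference: cheap primary model first, integrated model only for hard frames.
import torch

@torch.no_grad()
def review_frame(frame, primary_model, secondary_model, score_threshold: float = 0.5):
    primary_score = primary_model(frame)       # violation score from the channel-halved Resnet50
    if primary_score < score_threshold:
        return primary_score                   # most normal frames stop here
    return secondary_model(frame)              # hard frames go through the integrated model
```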
In step 411, the violation score of the video to be audited is calculated by using the aggregation policy.
In the process of identifying the video to be audited by using the machine audit model, each video frame in the video to be audited is identified, and one video frame corresponds to an identification result used for indicating the violation degree of the video frame. For example, the identification result indicating the violation degree of the video frame may be a violation score corresponding to the video frame.
After the violation score corresponding to each video frame is obtained, the score of the video to be audited can be calculated based on an aggregation policy. For example, the average of the violation scores of all video frames is calculated and taken as the violation score of the video to be audited; or the maximum of the violation scores of all video frames is selected and taken as the violation score of the video to be audited, and so on.
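A minimal sketch of the two aggregation policies mentioned above (mean or maximum over frame scores):

```python
# Sketch of the aggregation policy for turning frame scores into a video score.
def aggregate_video_score(frame_scores, policy: str = "mean") -> float:
    if policy == "mean":
        return sum(frame_scores) / len(frame_scores)
    return max(frame_scores)                      # "max": the worst frame decides the video score

video_score = aggregate_video_score([0.1, 0.8, 0.3], policy="max")   # 0.8
```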
And step 412, sorting according to the violation scores, and pushing the violation video candidate set according to a pushing ratio.
For each video to be audited, the corresponding violation score is obtained; the videos to be audited are sorted according to their violation scores, a portion of the violating videos is selected as the violation video candidate set according to a push ratio, and the candidate set is pushed into the human audit link.
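A minimal sketch of step 412, assuming an illustrative push ratio:

```python
# Sketch: rank videos by violation score and push the top fraction to human audit.
def select_candidates(video_scores: dict, push_ratio: float = 0.05):
    ranked = sorted(video_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_k = max(1, int(len(ranked) * push_ratio))
    return [video_id for video_id, _ in ranked[:top_k]]   # candidate set sent to human audit

candidates = select_candidates({"v1": 0.91, "v2": 0.12, "v3": 0.55})
```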
In the embodiment of the invention, the two interference suppression modules of co-shooting decomposition and special effect suppression are added before violation identification, which greatly reduces the interference of additional video information and improves the violation identification precision. The edge detection enhancement algorithm is improved, and the anti-noise performance and the smooth consistency of edge detection are improved, thereby improving the precision of co-shooting type detection. In the special effect suppression module, different types of special effect removal processing are comprehensively covered, and both the special effect templates provided by the business side and the self-made special effects of users can be effectively suppressed or removed. When training the special effect classification model, the group characteristics of special effects are combined and multi-task training is adopted: the training of the two-class model is guided by a multi-class feature extraction module, and the model is further designed for acceleration on the basis of a lightweight model, so that performance is basically maintained while network parameters are reduced. The pixel-to-pixel special effect suppression mapping model keeps the input and output sizes consistent, removing the special effect while preserving the video characteristics. When training the machine review model, the trained multi-label scene recognition model is used to excite the learning of the violation recognition model in a confrontation learning manner, so that the recognition capability of the violation recognition model when a scene and a violating object are not associated is improved, reducing erroneous deductions and improving precision. The dual-track attention mechanism operation is used innovatively; its attention dimensions are wider than those of the general attention mechanism operation, and its attention activation areas are more accurate. The advantages of the multi-level models are exploited as much as possible in each module, reducing the amount of video computation handled by the large-parameter model and thus reducing the overall calculation overhead.
Optionally, the edge detection enhancement algorithm in the co-shooting decomposition module may also be replaced by an operator with less computation, and bilinear interpolation may be replaced by dual-threshold detection, to further reduce the computation. The self-made special effect recognition model can be run directly in the multi-classification mode, and all classes other than the non-special-effect video group can be sent to the pixel-to-pixel special effect suppression mapping model for special effect suppression; ShuffleNet with fewer layers can also be used as the self-made special effect recognition model, and even if a video with a weak special effect picture is pushed to the violation recognition model, it has little influence on the result. The scene excitation scheme can be replaced by other loss functions, and the number of classes of the scene model can be richer, although this brings larger labeling overhead. Lightweight models can be used as the scene recognition model to speed up training. The machine review model can use other combinations of two-level models to build the cascade model, and a non-local attention mechanism with less computation can be used to reduce the computation of the attention mechanism. The machine review model can also be combined again with the multi-label scene recognition model when used online, for example using the multi-label scene recognition model as a filter, to further reduce the computation.
FIG. 16 is a block diagram of an audit model training apparatus according to an embodiment of the present invention. As shown in fig. 16, the apparatus may include:
a first obtaining module 161, configured to obtain a first training sample;
the first identification module 162 is configured to identify, by using a multi-label scene identification model trained in advance, probabilities that the first training samples belong to scenes corresponding to labels respectively, and calculate a scene countermeasure loss value based on the probabilities;
a training module 163, configured to calculate a loss value of the trial model based on the scene confrontation loss value in the process of training the trial model by using the first training sample; and determining that the training of the machine review model is finished under the condition that the loss value of the machine review model meets the preset condition.
Optionally, the machine review model comprises a cascaded primary model and secondary model; the primary model comprises a channel-halved residual neural network; the secondary model comprises the channel-halved residual neural network and a de-densified residual neural network.
Optionally, the training module 163 comprises: an output obtaining unit, configured to input the first training sample into the secondary model, so as to obtain a first output of the channel halved residual error neural network and a second output of the channel halved residual error neural network and the de-densified residual error neural network; a first calculating unit, configured to calculate a first loss value based on the first output when the first output is smaller than a preset score threshold, and use the first loss value as a loss value of a residual neural network with the half-reduced channel; when the first output is larger than or equal to a preset fraction threshold value, calculating a second loss value based on the first output, calculating a third loss value based on the second output and the scene confrontation loss value, and taking the sum of the second loss value and the third loss value as the loss value of the residual neural network with the halved channel; a second calculation unit, configured to use the third loss value as a loss value of the de-densified residual neural network; and the determining unit is used for taking the sum of the loss value of the residual error neural network with the halved channel and the loss value of the de-densified residual error neural network as the loss value of the machine trial model.
Optionally, the output obtaining unit includes: a first output obtaining unit, configured to perform feature extraction on the first training sample in the residual neural network with halved channels to obtain a first feature map set, perform a dual-rail attention mechanism operation on the first feature map set to obtain a first attention enhancement feature map set, and perform a first processing operation on the first attention enhancement feature map set to obtain the first output; a second output obtaining unit, configured to perform feature extraction on the first training sample in the de-densified residual neural network to obtain a second feature map set, and perform a dual-rail attention mechanism operation on the second feature map set to obtain a second attention enhancement feature map set; and integrating the first attention enhancement feature map group and the second attention enhancement feature map group to obtain an integrated feature map group, and performing second processing operation on the integrated feature map group to obtain second output.
Optionally, the first output obtaining unit includes: a first calculating subunit, configured to convert the first feature map group into a feature sequence with a one-dimensional spatial dimension, calculate the one-dimensional feature sequence to obtain an attention weight corresponding to each channel, and multiply, for each channel, a pixel value on the channel by the attention weight corresponding to the channel, respectively, to obtain a first map group; the second calculating subunit is configured to convert the first feature map group into a thermodynamic diagram with a single channel dimension, calculate the thermodynamic diagram of the single channel to obtain an attention weight corresponding to each spatial position, and multiply, for each spatial position, the pixel value at the spatial position by the attention weight corresponding to the spatial position, respectively, to obtain a second map group; and the third calculation subunit is used for adding the first graph group and the second graph group pixel by pixel to obtain the first attention enhancement feature graph group.
Optionally, the apparatus further comprises: a determining module, configured to determine an association relationship between the first training sample and the scene based on the probability; calculating a scene opposition loss value based on the probability by:
[Formula 1: scene opposition loss value L; the original formula is provided as an image and is not reproduced in the text]
wherein L represents the scene opposition loss value, P_{i,j} represents the probability that the i-th first training sample belongs to the j-th scene, P_{i,k} represents the probability that the i-th first training sample belongs to the k-th scene, num_sample represents the total number of first training samples, num_scene represents the total number of scenes, M(i,j) represents the association relationship between the i-th first training sample and the j-th scene, and M(i,j)=1 indicates that the i-th first training sample is not associated with the j-th scene.
The machine check model training device provided by the embodiment of the invention has the corresponding functional module for executing the machine check model training method, can execute the machine check model training method provided by the embodiment of the invention, and can achieve the same beneficial effects.
Fig. 17 is a block diagram of a video screening apparatus according to an embodiment of the present invention. As shown in fig. 17, the apparatus may include:
a second obtaining module 171, configured to obtain a pre-trained machine review model;
the second identification module 172 is configured to input the video to be audited into the machine audit model, so as to obtain an identification result output by the machine audit model; and the identification result is used for indicating the violation degree of the video to be audited.
Optionally, the apparatus further comprises: the suppression module is used for carrying out interference suppression processing on the video to be audited before the video to be audited is input into the machine audit model; the second identification module is specifically configured to input the video to be audited after the interference suppression processing into the machine audit model.
Optionally, the suppression module comprises: the device comprises a co-shooting inhibiting unit, a video processing unit and a video processing unit, wherein the co-shooting inhibiting unit is used for identifying whether the video to be audited is a co-shooting video or not, and splitting the video to be audited into independent videos when the video to be audited is the co-shooting video; and/or the special effect suppression unit is used for identifying whether the video to be audited is a special effect video or not, and suppressing the special effect in the video to be audited when the video to be audited is the special effect video.
Optionally, the beat suppression unit includes: the extraction subunit is used for extracting video frames from the video to be audited; the matching subunit is used for carrying out edge detection enhancement processing on the video frame to obtain a filtering image corresponding to the video frame and matching the filtering image with a preset close-shot template; and the co-shooting determining subunit is used for determining that the video to be audited is the co-shooting video when the matching is successful.
Optionally, the matching subunit is specifically configured to, for each close-up template, extract a first pixel corresponding to a boundary position of the current close-up template from the current close-up template, extract a second pixel corresponding to a position that is the same as the boundary position of the current close-up template from the filtered image, and calculate a distance between the first pixel and the second pixel; and if the minimum distance in the calculated distances is smaller than a preset distance threshold, determining that the matching between the filtering image and the snapshot template corresponding to the minimum distance is successful.
Optionally, the special effect suppression unit includes: the judging subunit is used for acquiring the video information of the video to be audited and judging whether a special effect information field exists in the video information; the special effect determining subunit is configured to determine, when the judging subunit judges that the special effect information field exists, that the video to be audited is a special effect video containing a business-side special effect template; and the special effect identification subunit is used for identifying, by using a pre-trained self-made special effect recognition model, whether the video to be audited is a special effect video containing a self-made special effect when the judging subunit judges that the field does not exist.
Optionally, the homemade special effect recognition model is trained through the following modules: the third acquisition module is used for acquiring a second training sample; the first training module is used for training a multi-classification self-made special effect recognition model to be trained by utilizing the second training sample, and a feature extraction part in the trained multi-classification self-made special effect recognition model is used as a feature extraction part of the two-classification self-made special effect recognition model to be trained; and the second training module is used for keeping the parameters of the feature extraction part of the to-be-trained classified self-made special effect recognition model unchanged, training the to-be-trained classified self-made special effect recognition model by using the training sample, and taking the trained classified self-made special effect recognition model as the self-made special effect recognition model.
The video machine auditing device provided by the embodiment of the invention has the corresponding functional module for executing the video machine auditing method, can execute the video machine auditing method provided by the embodiment of the invention, and can achieve the same beneficial effects.
In another embodiment provided by the present invention, there is also provided an electronic device. The electronic device may include a processor and a storage device storing a program; when the processor executes the program, it implements each process of the machine review model training method embodiment or each process of the video machine review method embodiment, and can achieve the same technical effects, which are not repeated here to avoid repetition. For example, as shown in fig. 18, the electronic device may specifically include: a processor 181, a storage device 182, a display screen 183 with a touch function, an input device 184, an output device 185, and a communication device 186. The number of processors 181 in the electronic device may be one or more, and one processor 181 is taken as an example in fig. 18. The processor 181, the storage device 182, the display screen 183, the input device 184, the output device 185, and the communication device 186 of the electronic device may be connected by a bus or other means.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform the method for training a review model as described in any of the above embodiments or perform the method for video review as described in any of the above embodiments.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions, which when run on a computer, causes the computer to perform the computer review model training method described in any of the above embodiments or the video computer review method described in any of the above embodiments.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (17)

1. A method for training an audit model, the method comprising:
obtaining a first training sample;
recognizing the probability that the first training sample belongs to the scene corresponding to each label respectively by using a pre-trained multi-label scene recognition model, and calculating a scene confrontation loss value based on the probability;
calculating a loss value of the computer trial model based on the scene confrontation loss value in the process of training the computer trial model by using the first training sample;
and determining that the training of the machine review model is finished under the condition that the loss value of the machine review model meets the preset condition.
2. The method of claim 1, wherein the screening model comprises a cascaded primary model and a secondary model; the primary model comprises a residual error neural network with half-reduced channels; the secondary model includes a halved residual neural network and a de-densified residual neural network for the channel.
3. The method of claim 2, wherein the calculating a loss value of the computer review model based on the scenario opposition loss value during the training of the computer review model using the first training sample comprises:
inputting the first training sample into the secondary model to obtain a first output of the channel halved residual error neural network and a second output of the channel halved residual error neural network and the de-densified residual error neural network;
when the first output is smaller than a preset fraction threshold value, calculating a first loss value based on the first output, and taking the first loss value as a loss value of the residual error neural network with the halved channel; when the first output is larger than or equal to a preset fraction threshold value, calculating a second loss value based on the first output, calculating a third loss value based on the second output and the scene confrontation loss value, and taking the sum of the second loss value and the third loss value as the loss value of the residual neural network with the halved channel;
taking the third loss value as a loss value of the de-densified residual neural network;
and taking the sum of the loss value of the residual error neural network with the halved channel and the loss value of the de-densified residual error neural network as the loss value of the machine review model.
4. The method of claim 3, wherein inputting the first training sample into the secondary model results in a first output of the channel halved residual neural network and a second output of the channel halved residual neural network and the de-densified residual neural network integrated, comprising:
in the residual error neural network with the halved channel, performing feature extraction on the first training sample to obtain a first feature map group, performing double-track attention mechanism operation on the first feature map group to obtain a first attention enhancement feature map group, and performing first processing operation on the first attention enhancement feature map group to obtain a first output;
in the de-densified residual error neural network, performing feature extraction on the first training sample to obtain a second feature map set, and performing double-track attention mechanism operation on the second feature map set to obtain a second attention enhancement feature map set;
and integrating the first attention enhancement feature map group and the second attention enhancement feature map group to obtain an integrated feature map group, and performing second processing operation on the integrated feature map group to obtain second output.
5. The method of claim 4, wherein the operating a dual rail attention mechanism on the first set of feature maps results in a first set of attention enhancement feature maps comprising:
converting the first feature map group into a feature sequence with one-dimensional space dimensionality, calculating the one-dimensional feature sequence to obtain an attention weight value corresponding to each channel, and multiplying pixel values on the channels by the attention weight values corresponding to the channels respectively aiming at each channel to obtain a first map group;
converting the first characteristic graph group into a thermodynamic diagram with a single channel dimension, calculating the thermodynamic diagram with the single channel to obtain an attention weight value corresponding to each spatial position, and multiplying a pixel value on the spatial position by the attention weight value corresponding to the spatial position respectively aiming at each spatial position to obtain a second graph group;
and adding the first map group and the second map group pixel by pixel to obtain the first attention enhancement feature map group.
6. The method of claim 1,
after the probability that the first training sample belongs to the scene corresponding to each label respectively is obtained by recognition through the pre-trained multi-label scene recognition model, the method further comprises the following steps: determining an incidence relation between the first training sample and the scene based on the probability;
calculating a scene opposition loss value based on the probability by:
[Formula 1: scene opposition loss value L; the original formula is provided as an image and is not reproduced in the text]
wherein L represents the scene opposition loss value, P_{i,j} represents the probability that the i-th first training sample belongs to the j-th scene, P_{i,k} represents the probability that the i-th first training sample belongs to the k-th scene, num_sample represents the total number of first training samples, num_scene represents the total number of scenes, M(i,j) represents the association relationship between the i-th first training sample and the j-th scene, and M(i,j)=1 indicates that the i-th first training sample is not associated with the j-th scene.
7. A video screening method, the method comprising:
acquiring a pre-trained machine review model; the trial model is trained by the method of any one of claims 1 to 6;
inputting a video to be audited into the machine audit model to obtain an identification result output by the machine audit model; and the identification result is used for indicating the violation degree of the video to be audited.
8. The method of claim 7,
before inputting the video to be audited into the machine audit model, the method further comprises the following steps: performing interference suppression processing on the video to be audited;
the inputting the video to be audited into the machine audit model comprises: and inputting the video to be audited after interference suppression processing into the machine audit model.
9. The method according to claim 8, wherein the performing interference suppression processing on the video to be audited includes:
identifying whether the video to be audited is a video in time;
when the video to be audited is a close shot video, splitting the video to be audited into independent videos;
and/or the presence of a gas in the gas,
identifying whether the video to be audited is a special effect video;
and when the video to be audited is the special-effect video, inhibiting the special effect in the video to be audited.
10. The method according to claim 9, wherein the identifying whether the video to be audited is a video in-tune comprises:
extracting video frames from the video to be audited;
performing edge detection enhancement processing on the video frame to obtain a filtering image corresponding to the video frame, and matching the filtering image with a preset close-shot template;
and when the matching is successful, determining the video to be audited as a snap-shot video.
11. The method of claim 10, wherein matching the filtered image with a preset snap template comprises:
for each close-up template, extracting a first pixel corresponding to the position of the boundary of the current close-up template from the current close-up template, extracting a second pixel corresponding to the position same as the position of the boundary of the current close-up template from the filtered image, and calculating the distance between the first pixel and the second pixel;
and if the minimum distance in the calculated distances is smaller than a preset distance threshold, determining that the matching between the filtering image and the snapshot template corresponding to the minimum distance is successful.
12. The method according to claim 9, wherein the identifying whether the video to be audited is a special effect video comprises:
acquiring video information of the video to be audited, and judging whether a special effect information field exists in the video information;
if the video exists, determining the video to be audited as a special effect video containing a business side special effect template;
and if not, identifying whether the video to be audited is a special effect video containing the self-made special effect or not by using a pre-trained self-made special effect identification model.
13. The method of claim 12, wherein the homemade special effects recognition model is trained by:
obtaining a second training sample;
training a multi-classification self-made special effect recognition model to be trained by utilizing the second training sample, and taking a feature extraction part in the trained multi-classification self-made special effect recognition model as a feature extraction part of a two-classification self-made special effect recognition model to be trained;
keeping the parameters of the feature extraction part of the to-be-trained two-class self-made special effect recognition model unchanged, training the to-be-trained two-class self-made special effect recognition model by using the training sample, and taking the trained two-class self-made special effect recognition model as the self-made special effect recognition model.
14. An audit model training apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a first training sample;
the first identification module is used for identifying and obtaining the probability that the first training sample belongs to the scene corresponding to each label respectively by utilizing a pre-trained multi-label scene identification model, and calculating a scene countermeasure loss value based on the probability;
the training module is used for calculating a loss value of the computer review model based on the scene confrontation loss value in the process of training the computer review model by using the first training sample; and determining that the training of the machine review model is finished under the condition that the loss value of the machine review model meets the preset condition.
15. A video screening apparatus, comprising:
the second acquisition module is used for acquiring a pre-trained machine review model; the trial model is trained by the method of any one of claims 1 to 6;
the second identification module is used for inputting the video to be audited into the machine audit model to obtain an identification result output by the machine audit model; and the identification result is used for indicating the violation degree of the video to be audited.
16. An electronic device, comprising:
a processor, a memory, and a computer program stored on the memory and executed on the processor; the processor, when executing the program, implements the machine review model training method of any one of claims 1 to 6, or the video machine review method of any one of claims 7 to 13.
17. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the machine review model training method of any of claims 1 to 6 or the video machine review method of any of claims 7 to 13.
CN202110189459.0A 2021-02-19 2021-02-19 Machine audit model training method, video machine audit method, device, equipment and storage medium Pending CN112926429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110189459.0A CN112926429A (en) 2021-02-19 2021-02-19 Machine audit model training method, video machine audit method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110189459.0A CN112926429A (en) 2021-02-19 2021-02-19 Machine audit model training method, video machine audit method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112926429A true CN112926429A (en) 2021-06-08

Family

ID=76169852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110189459.0A Pending CN112926429A (en) 2021-02-19 2021-02-19 Machine audit model training method, video machine audit method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112926429A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140126779A1 (en) * 2012-11-03 2014-05-08 Greg Duda System for license plate identification in low-quality video
CN109284784A (en) * 2018-09-29 2019-01-29 北京数美时代科技有限公司 A kind of content auditing model training method and device for live scene video
CN109558901A (en) * 2018-11-16 2019-04-02 北京市商汤科技开发有限公司 A kind of semantic segmentation training method and device, electronic equipment, storage medium
US20200364520A1 (en) * 2019-05-13 2020-11-19 International Business Machines Corporation Counter rare training date for artificial intelligence
US20210026446A1 (en) * 2019-07-26 2021-01-28 Samsung Electronics Co., Ltd. Method and apparatus with gaze tracking
CN111222450A (en) * 2020-01-02 2020-06-02 广州虎牙科技有限公司 Model training method, model training device, model live broadcast processing equipment and storage medium
CN112102177A (en) * 2020-07-27 2020-12-18 中山大学 Image deblurring method based on compression and excitation mechanism neural network
CN112257851A (en) * 2020-10-29 2021-01-22 重庆紫光华山智安科技有限公司 Model confrontation training method, medium and terminal

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
RACHAEL HWEE LING SIM et al.: "Collaborative Machine Learning with Incentive-Aware Model Rewards", arXiv:2010.12797v1, pages 1-17
STANISLAV PROTASOV et al.: "Using deep features for video scene detection and annotation", Signal Image and Video Processing, vol. 12, no. 5, pages 1-9
LIU Kun: "Application of deep-learning-based station logo detection in online video review", Wireless Internet Technology, vol. 15, no. 15, pages 36-38
ZHU Yilin: "Design and implementation of a video-based abnormal crowd behavior detection system for public places", China Master's Theses Full-text Database, Social Sciences I, pages 113-56
HUANG Lu: "Research on license plate super-resolution and recognition algorithms in compressed surveillance video", China Master's Theses Full-text Database, Information Science and Technology, pages 138-1642

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452912A (en) * 2021-06-25 2021-09-28 山东新一代信息产业技术研究院有限公司 Pan-tilt camera control method, device, equipment and medium for inspection robot
CN113762382A (en) * 2021-09-07 2021-12-07 百果园技术(新加坡)有限公司 Model training and scene recognition method, device, equipment and medium
CN113762382B (en) * 2021-09-07 2024-03-08 百果园技术(新加坡)有限公司 Model training and scene recognition method, device, equipment and medium
WO2023056889A1 (en) * 2021-10-09 2023-04-13 百果园技术(新加坡)有限公司 Model training and scene recognition method and apparatus, device, and medium

Similar Documents

Publication Publication Date Title
Deng et al. Image aesthetic assessment: An experimental survey
Fu et al. Fast crowd density estimation with convolutional neural networks
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN112926429A (en) Machine audit model training method, video machine audit method, device, equipment and storage medium
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
KR20160143494A (en) Saliency information acquisition apparatus and saliency information acquisition method
CN105184260B (en) A kind of image characteristic extracting method and pedestrian detection method and device
CN101971190A (en) Real-time body segmentation system
CN113065474B (en) Behavior recognition method and device and computer equipment
CN110298297A (en) Flame identification method and device
CN107633226A (en) A kind of human action Tracking Recognition method and system
CN110263712A (en) A kind of coarse-fine pedestrian detection method based on region candidate
CN101364263A (en) Method and system for detecting skin texture to image
CN107239759A (en) A kind of Hi-spatial resolution remote sensing image transfer learning method based on depth characteristic
CN106845513A (en) Staff detector and method based on condition random forest
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN111027377A (en) Double-flow neural network time sequence action positioning method
CN106874825A (en) The training method of Face datection, detection method and device
CN110929635A (en) False face video detection method and system based on face cross-over ratio under trust mechanism
CN111696136A (en) Target tracking method based on coding and decoding structure
CN111126401A (en) License plate character recognition method based on context information
CN111462090A (en) Multi-scale image target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination