CN111325067A - Illegal video identification method and device and electronic equipment - Google Patents


Info

Publication number: CN111325067A
Authority: CN (China)
Prior art keywords: video, violation, preset, matrix, feature
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201811536558.6A
Other languages: Chinese (zh)
Other versions: CN111325067B (en)
Inventors: 苏驰, 刘弘也
Current Assignee: Beijing Kingsoft Cloud Network Technology Co Ltd; Beijing Kingsoft Cloud Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Kingsoft Cloud Network Technology Co Ltd; Beijing Kingsoft Cloud Technology Co Ltd
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd and Beijing Kingsoft Cloud Technology Co Ltd
Priority to CN201811536558.6A
Publication of CN111325067A
Application granted
Publication of CN111325067B
Active legal status
Anticipated expiration (legal status)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Abstract

Embodiments of the invention provide an illegal video identification method and apparatus, and an electronic device. A video to be identified is identified based on a preset recognition model to obtain a recognition result, wherein the recognition model comprises a first recognition model and a second recognition model, and the recognition result comprises a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model. If at least one of the first recognition result and the second recognition result indicates a violation, the video is determined to be a violation video. The first recognition model is obtained by training an initial convolutional neural network model with initial sample data in advance, and the second recognition model is obtained by training the first recognition model with updated data of the initial sample data in advance. Because the first recognition model memorizes the initial sample data while the second recognition model learns the updated data, illegal videos can be identified stably and reliably without being degraded by long-term update evolution, and the probability of missed detection is reduced.

Description

Illegal video identification method and device and electronic equipment
Technical Field
The invention relates to the technical field of video identification, and in particular to an illegal video identification method and apparatus, and an electronic device.
Background
With the rapid development of the video industry, the number of videos subject to supervision and review has grown explosively, and manually watching every video to identify violations can no longer meet demand. Meanwhile, since video identification is essentially an image identification process, computer vision technology can be introduced to identify video frames and thereby automate video identification, meeting the supervision requirements of massive numbers of videos.
In video identification technology, identifying a video frame by frame requires a large amount of computation and makes identification far too slow. Most existing video identification methods therefore perform frame-extraction inspection based on standard image identification technology. The adopted technical scheme can be summarized as follows: sample the video by extracting frames, input each sampled image into a pre-trained convolutional neural network for detection to obtain a violation confidence that the image belongs to a violation, and mark the image or the video as violating when the confidence is greater than a set threshold.
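For illustration only, this prior-art pipeline might be sketched as follows; the PyTorch interface, the helper name flag_video_frames, and the 0.5 threshold are assumptions for the example, not part of the patent:

```python
import torch

def flag_video_frames(frames, model, threshold=0.5):
    """Prior-art style check: score each sampled frame independently.

    frames: iterable of (3, H, W) float tensors sampled from the video.
    model:  a pre-trained CNN whose single output logit indicates violation.
    """
    model.eval()
    with torch.no_grad():
        for frame in frames:
            confidence = torch.sigmoid(model(frame.unsqueeze(0))).item()
            if confidence > threshold:
                return True  # mark the image, and hence the video, as violating
    return False
```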
However, such a pre-trained convolutional neural network often cannot reliably identify new violation videos that never appeared in its sample data. If the new violation videos are incorporated to update the original sample data, and the original convolutional neural network is retrained with the updated sample data, then after long-term update evolution the retrained network may in turn fail to identify the original violation videos. For example, if during a certain period the original violation videos rarely appear while a large number of new violation videos emerge, the updated sample data will contain far more new violation videos than original ones, and the network trained on it will no longer identify the original violation videos well.
Therefore, how to identify illegal videos stably and reliably, without being affected by long-term update evolution, is a problem to be solved by existing video identification technology.
Disclosure of Invention
Embodiments of the invention aim to provide an illegal video identification method and apparatus, and an electronic device, so as to identify illegal videos stably and reliably without being affected by long-term update evolution, and to reduce the probability of missed detection. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a method for identifying an illegal video, where the method includes:
identifying a video to be identified based on a preset identification model to obtain an identification result; wherein the recognition model comprises a first recognition model and a second recognition model; the recognition result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
determining that the video violates if at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
In a second aspect, an embodiment of the present invention provides an apparatus for identifying an illegal video, where the apparatus includes:
the identification module is used for identifying the video to be identified based on a preset identification model to obtain an identification result; wherein the recognition model comprises a first recognition model and a second recognition model; the recognition result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
a determining module, configured to determine that the video violates when at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing all the steps of the illegal video identification method provided by the first aspect when executing the program stored on the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for identifying an illegal video provided in the first aspect.
According to the illegal video identification method and apparatus and the electronic device provided by the embodiments of the invention, a video to be identified is identified based on a preset recognition model to obtain a recognition result. The recognition model comprises a first recognition model and a second recognition model, and the recognition result comprises a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model. If at least one of the first recognition result and the second recognition result indicates a violation, the video is determined to violate. The first recognition model is obtained by training an initial convolutional neural network model with initial sample data in advance, and the second recognition model is obtained by training the first recognition model with updated data of the initial sample data in advance. Because the preset recognition model comprises both the first recognition model and the second recognition model, the first recognition model memorizes the initial sample data while the second recognition model, trained on the updated data, identifies the new violation videos corresponding to that updated data. The original violation videos corresponding to the initial sample data are thus not forgotten while new violation videos are still identified. Therefore, illegal videos can be identified stably and reliably without being affected by long-term update evolution, and the probability of missed detection is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a schematic flowchart of an illegal video identification method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a training method for a preset recognition model according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of an illegal video identification method according to another embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an illegal video identification apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, a method for identifying an illegal video according to an embodiment of the present invention is described below.
It should be noted that the method for identifying an illegal video provided in the embodiment of the present invention may be applied to an electronic device capable of performing data processing, where the electronic device includes a desktop computer, a portable computer, an internet television, an intelligent mobile terminal, a wearable intelligent terminal, a server, and the like.
As shown in fig. 1, a flow of the method for identifying an illegal video according to an embodiment of the present invention may include:
s101, identifying the video to be identified based on a preset identification model to obtain an identification result. The preset identification model comprises a first identification model and a second identification model; the recognition result comprises: a first recognition result obtained based on the first recognition model, and a second recognition result obtained based on the second recognition model.
The first recognition model is obtained by training the initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
Specifically, the initial sample data is a sample data set, collected before the first recognition model is trained, that contains violation videos and legal videos. The updated data of the initial sample data may be a sample data set obtained by incorporating new violation videos and updating the initial sample data once, when new violation videos appear for the first time; or a sample data set obtained by incorporating new violation videos and updating the initial sample data multiple times. The first recognition model and the second recognition model may be trained with existing methods for training convolutional neural networks, such as a mini-batch stochastic gradient descent algorithm.
In order to obtain a first recognition result and a second recognition result based on the preset recognition model, identifying the video to be identified based on the preset recognition model to obtain the recognition result may specifically include the following steps A1 to A2:
A1, identifying the video to be identified based on the first recognition model to obtain a first recognition result;
A2, identifying the video to be identified based on the second recognition model to obtain a second recognition result.
This embodiment does not limit the execution order of steps A1 and A2: they may be executed simultaneously, step A1 may be executed before step A2, or step A2 may be executed before step A1.
It can be understood that, since the first recognition model is trained in advance on the initial sample data and the second recognition model is obtained by training the first recognition model on the updated data of the initial sample data, the first recognition result indicates whether the video to be identified is a violation video of the kind contained in the initial sample data, and the second recognition result indicates whether it is a violation video of the kind contained in the updated data.
That is, the first recognition model identifies whether the video to be identified matches the feature information of the violation videos contained in the initial sample data, i.e., whether its violation type is a type of violation video contained in the initial sample data; and the second recognition model identifies whether the video matches the feature information of the violation videos contained in the updated data of the initial sample data, i.e., whether its violation type is a type of violation video contained in the updated data.
In addition, the first recognition result and the second recognition result may specifically be violation confidences characterizing the probability that the video to be identified is a violation video, or identifiers indicating that the video is violating or legal, for example 1 or 0.
Of course, the first recognition model and the second recognition model may identify the input video by performing frame-extraction inspection on the video to be identified based on existing standard image identification technology, or by performing video segment inspection on the video to be identified based on that technology.
S102, if at least one of the first recognition result and the second recognition result indicates a violation, determining that the video violates.
If exactly one of the first recognition result and the second recognition result indicates a violation, the video to be identified has been recognized as a violation video by the first recognition model, or by the second recognition model.
For example, suppose the violation videos contained in the initial sample data are characterized by naked body parts, and the violation videos contained in the updated data of the initial sample data are characterized by violating human actions. If the video to be identified shows a naked body part, the first recognition result is a violation and the second recognition result is not; if the video to be identified shows a violating human action, the second recognition result is a violation and the first recognition result is not.
If both the first recognition result and the second recognition result indicate violations, the video to be identified has been recognized as a violation video by both the first recognition model and the second recognition model.
For example, with the same sample characteristics as above, if the video to be identified shows a naked body part and also contains a violating action, both the first recognition result and the second recognition result are violations.
As can be seen, if at least one of the first recognition result and the second recognition result indicates a violation, the video to be identified has been recognized as violating by at least one of the first and second recognition models, and the video can therefore be determined to violate.
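As a sketch only, the decision logic of S101 and S102 might look as follows; the callable model interfaces, the helper name identify_video, and the 0.5 threshold are illustrative assumptions:

```python
import torch

def identify_video(clip, first_model, second_model, threshold=0.5):
    """Sketch of S101-S102: determine a violation if either model flags one.

    first_model:  trained in advance on the initial sample data.
    second_model: obtained by further training the first model on the
                  updated data of the initial sample data.
    clip: tensor of sampled frames, e.g. shape (M, 3, H, W).
    """
    with torch.no_grad():
        first_violates = torch.sigmoid(first_model(clip)).item() > threshold
        second_violates = torch.sigmoid(second_model(clip)).item() > threshold
    # S102: the video violates if at least one recognition result violates
    return first_violates or second_violates
```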
According to the illegal video identification method provided by the embodiments of the invention, because the preset recognition model comprises the first recognition model and the second recognition model, the first recognition model memorizes the initial sample data while the second recognition model, trained on the updated data of the initial sample data, identifies the new violation videos corresponding to that updated data. The original violation videos corresponding to the initial sample data are thus not forgotten while new violation videos are still identified. Therefore, illegal videos can be identified stably and reliably without being affected by long-term update evolution, and the probability of missed detection is reduced.
Optionally, as shown in fig. 2, a flow of a training method of a recognition model preset in the embodiment of fig. 1 of the present invention may include:
s201, inputting the collected multiple sample images into an initial convolutional neural network model for training, and obtaining the confidence coefficient of the prediction violation of the video segment formed by the multiple sample images.
The prediction violation confidence is the probability, output by the initial convolutional neural network model after processing the input sample images, that the video segment formed by the sample images belongs to a violation video; it is the model's detection result for the sample images.
S202, judging, with a preset error function, whether the convolutional neural network model at the current training stage has converged, based on the obtained prediction violation confidence and the pre-labeled class information indicating whether each sample image violates. If it has converged, step S203 is performed; if not, steps S204 to S205 are performed.
S203, determining the convolutional neural network model in the current training stage as a preset recognition model.
The specific method for judging convergence with the preset error function may be to minimize the preset error function: when the function reaches its minimum value, the model at the current training stage has converged; when it has not, the model has not converged.
The preset error function measures the difference between the pre-labeled class information (whether each sample image violates) and the detection result of the convolutional neural network model at the current training stage on the sample images; the smaller the difference, the more accurate the detection result. When the preset error function reaches its minimum, the model's detection results on the sample images match the pre-labeled class information. Therefore, when the convolutional neural network model at the current training stage converges, it may be determined as the preset recognition model.
S204, adjusting the model parameters of the convolutional neural network model at the current training stage by using a preset gradient function and a stochastic gradient descent algorithm.
S205, inputting the collected multiple sample images into the adjusted convolutional neural network model, and repeating the steps of training and adjusting the model parameters until the adjusted convolutional neural network converges.
The stochastic gradient descent algorithm adjusts the model parameters of the convolutional neural network model at the current training stage so that, after the adjustment, the detection result improves: its difference from the pre-labeled class information decreases until convergence is reached.
Accordingly, the steps of training and adjusting the model parameters are repeated until the model at the current training stage converges. Of course, each round of training is performed on the convolutional neural network model with the most recently adjusted parameters.
Meanwhile, it can be understood that both the first recognition model and the second recognition model can be obtained by training in the manner of the embodiment of fig. 2. The difference is that when training the first recognition model, the model in step S201 is the initial convolutional neural network and the sample images in step S201 are the initial sample data; when training the second recognition model, the model in step S201 is the first recognition model and the sample images in step S201 are the updated data of the initial sample data.
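For illustration, the S201 to S205 training loop might be sketched as below; the binary cross-entropy loss, learning rate, and convergence tolerance are assumptions, since the patent only specifies a preset error function and a stochastic gradient descent algorithm:

```python
import torch
import torch.nn as nn

def train_recognition_model(model, sample_batches, lr=1e-3, tol=1e-4, max_epochs=100):
    """Sketch of S201-S205. To obtain the first recognition model, pass the
    initial CNN and the initial sample data; to obtain the second recognition
    model, pass the trained first model and the updated sample data."""
    criterion = nn.BCEWithLogitsLoss()          # assumed preset error function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for clips, labels in sample_batches:    # labels: float, 1.0 = violation, 0.0 = legal
            optimizer.zero_grad()
            confidence = model(clips).squeeze(-1)   # S201: prediction violation confidence
            loss = criterion(confidence, labels)    # S202: compare with pre-labeled classes
            loss.backward()
            optimizer.step()                        # S204: SGD parameter adjustment
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:       # S202/S203: convergence check
            break
        prev_loss = epoch_loss                      # S205: repeat until convergence
    return model
```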
Furthermore, in practical application, if a vulgar action appears in a video, the video also belongs to violation videos. However, action features are reflected in the overall features of a video segment composed of multiple image frames. A convolutional neural network that identifies individual image frames can only recognize some of the single frames composing a vulgar action; it cannot recognize the overall features of the video segment, so the vulgar action is difficult to identify and the violation video is missed.
For this reason, optionally, at least one of steps A1 and A2 in the embodiment of fig. 1 of the present invention may specifically include the following steps A11 to A14:
a11, acquiring a plurality of image frames from the video to be identified.
Specifically, the plurality of image frames may be acquired from the video to be identified at a preset period, yielding image frames at equal intervals. Since an action is formed by consecutive image frames, and consecutive frames without intervals differ little from one another, equally spaced frames preserve the frames reflecting the action features as much as possible while avoiding the huge data volume, and hence slow processing, caused by collecting every consecutive frame.
For example, among all image frames composing a person's action of drinking water in the video to be identified, frames 1 to 5 may show the person's hand touching the cup, frames 6 to 15 the person picking the cup up, and frames 16 to 25 the person drinking. When frames are collected at a preset period, one may obtain the 5th frame A (hand touching the cup), the 10th frame B and the 15th frame C (picking the cup up), and the 20th frame D and the 25th frame E (drinking), so that the action feature of drinking is reflected in comparatively few frames.
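A minimal sketch of such equal-interval acquisition, assuming OpenCV and a hypothetical period of 5 frames:

```python
import cv2

def sample_frames(video_path, period=5):
    """A11: collect every `period`-th frame from the video to be identified."""
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % period == period - 1:   # e.g. frames 5, 10, 15, ... (1-based)
            frames.append(frame)           # BGR image of shape (H, W, 3)
        index += 1
    capture.release()
    return frames
```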
A12, respectively extracting the features of a plurality of image frames to obtain an image frame feature matrix of each image frame.
Features may be extracted from the image frames in various ways. For example, each image frame may be processed by a preset convolutional neural network trained in advance on a plurality of sample images, which may form sample videos belonging to violation videos. Alternatively, features may be extracted from each of the image frames with a feature extraction algorithm such as HOG (Histogram of Oriented Gradients) or LBP (Local Binary Pattern).
Any feature extraction algorithm that can extract the violating and non-violating features of an image can be used in the present invention; this embodiment does not limit this.
For example, feature extraction is performed on the image frame a, the image frame B, the image frame C, the image frame D, and the image frame E, respectively, to obtain an image feature matrix a of the image frame a, an image feature matrix B of the image frame B, an image feature matrix C of the image frame C, an image feature matrix D of the image frame D, and an image feature matrix E of the image frame E.
Optionally, in a specific application, the feature extraction algorithm for extracting the violating and non-violating features of the image may be used as a sub-network of the recognition model preset in the embodiment of fig. 1 of the present invention, and then step A12 may specifically include:
and respectively extracting the features of the plurality of image frames by using a feature extraction sub-network based on a preset recognition model to obtain an image frame feature matrix of each image frame.
For example, image frames A and B are each input into the feature extraction sub-networks of the first recognition model F1 and the second recognition model F2, obtaining image frame feature matrices a1 and a2 for frame A, and b1 and b2 for frame B.
And A13, splicing the image frame feature matrixes to obtain a video segment feature matrix.
It will be appreciated that a plurality of image frames may constitute a video segment, and the features of the video segment need to reflect how the image frames composing it change along the time dimension. Therefore, the image frame feature matrices may be spliced into a video segment feature matrix that reflects the features of the video segment composed of those frames.
Illustratively, the image frames acquired during violation video identification are three-channel color images, so each image frame feature matrix is a three-dimensional feature matrix. Splicing the image frame feature matrices therefore means splicing them into a four-dimensional feature matrix: for example, the image frame feature matrices of shape (c, h, w) of M image frames are spliced into the video segment feature matrix of shape (M, c, h, w) of the video segment composed of those M frames, where h is the length of the matrix, w is the width, and c is the number of channels.
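As a sketch of steps A12 and A13 under the (M, c, h, w) convention above; the small convolutional backbone here is an assumption standing in for any feature extraction network:

```python
import torch
import torch.nn as nn

# Assumed feature extraction sub-network: any CNN backbone that maps a
# (3, H, W) frame to a (c, h, w) feature matrix would do here.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
)

def build_clip_feature_matrix(frames):
    """A12 + A13: per-frame feature matrices (c, h, w), spliced to (M, c, h, w)."""
    with torch.no_grad():
        frame_features = [feature_extractor(f.unsqueeze(0)).squeeze(0)
                          for f in frames]       # each: (c, h, w)
    return torch.stack(frame_features, dim=0)    # video segment matrix (M, c, h, w)
```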
And A14, identifying the video segment feature matrix to determine whether the video is illegal.
Identifying the video segment feature matrix to determine whether the video violates means determining whether the video features reflected by the matrix violate. The matrix may therefore be identified with any classification algorithm, obtaining a recognition result that the video feature matrix is violating or legal.
For example, the classification algorithm may be a classifier model, such as a SoftMax classifier or a binary classifier model. Of course, the classification algorithm is trained in advance with a plurality of sample images containing violating and non-violating content. Any classification algorithm that can distinguish violating features from non-violating ones can be used with the present invention; this embodiment does not limit this.
Since the video segment feature matrix is obtained by splicing the image frame feature matrices of multiple image frames, it reflects the overall features of the video segment composed of those frames, including the action features within the segment. Therefore, compared with identifying only single image frames of a video, identifying the video segment feature matrix can recognize not only naked images in single frames but also vulgar actions in the segment, reducing the probability that violation videos are missed because vulgar actions cannot be identified.
It should be noted that, since identifying multiple image frames is less efficient than identifying a single frame, when efficiency of violation video identification takes priority over reducing missed detections, one of the first and second recognition models may identify video segments of the video to be identified using steps A11 to A14 above, while the other uses an existing single-frame identification method. Of course, for a model performing single-frame recognition, one image frame may be selected from the plurality of frames and input into the model to obtain its recognition result for that frame.
When reducing the probability of missed detection takes priority over identification efficiency, the first recognition model and the second recognition model may each identify video segments of the video to be identified using steps A11 to A14 above.
Optionally, in a specific application, any classification algorithm for identifying the video segment feature matrix may be used as a sub-network of the preset recognition model in the embodiment of fig. 1 of the present invention, and then step A14 may specifically include:
and identifying the video segment feature matrix based on a classifier sub-network of a preset identification model to determine whether the video is illegal.
Specifically, the classifier sub-network of the preset recognition model may identify the video segment feature matrix by taking the matrix as input and outputting either a confidence that the matrix violates or an identifier indicating that it violates.
Optionally, the identifying, by the classifier subnetwork based on the preset identification model, the video segment feature matrix to determine whether the video is illegal may include:
inputting the video segment characteristic matrix into a classifier sub-network of a preset recognition model to obtain violation confidence of the video;
and if the violation confidence coefficient meets the preset violation condition, determining that the video is violated.
The classifier sub-network obtains the confidence of the video segment corresponding to the input video feature matrix, which is used as the confidence that the video belongs to a violation type. The preset violation condition may specifically be that the violation confidence lies in a preset confidence interval, or that the violation confidence is not less than a preset confidence threshold. The preset confidence interval and the preset confidence threshold are determined when training of the classification algorithm finishes.
For example, M image frames are collected from the video to be identified at a preset period, where M is greater than 1 and each frame is a three-channel RGB image of width W and height H. The M image frames are respectively input into the feature extraction sub-network of the preset recognition model; through the operation of that sub-network, the image frame feature matrices (c, h, w) of the M frames are extracted and spliced into the video segment feature matrix f1 of shape (M, c, h, w) of the video segment composed of the M frames, where h is the length of the matrix, w is the width, and c is the number of channels. The video segment feature matrix f1 (M, c, h, w) is then input into the classifier sub-network of the preset recognition model; through the operation of the classifier sub-network, the violation confidence of the video segment corresponding to f1 (M, c, h, w) is obtained as the violation confidence of the video.
Optionally, the step of inputting the video segment feature matrix into a classifier subnetwork of a preset recognition model to obtain the violation confidence of the video may specifically include:
performing transposition processing on the video segment feature matrix to obtain a transposed video segment feature matrix;
inputting the transposed video segment feature matrix into a preset first full-connection function to obtain an output, and taking that output as the input of a logistic regression loss function to obtain the violation confidence of the video.
The transposition processing transposes the matrix representing the video segment features into a form convenient for input into the first full-connection function. For example, the video segment feature matrix f1 (M, c, h, w) is transposed to obtain the transposed video segment feature matrix f2 (c, M, h, w). The logistic regression loss function may specifically be a sigmoid activation function.
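A minimal sketch of this confidence computation under the shape conventions above; the single-linear-layer full-connection function and the flattening strategy are assumptions:

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Transpose f1 (M, c, h, w) -> f2 (c, M, h, w), then a full-connection
    function followed by a sigmoid yields the violation confidence."""
    def __init__(self, c, M, h, w):
        super().__init__()
        self.fc = nn.Linear(c * M * h * w, 1)   # assumed first full-connection function

    def forward(self, f1):
        f2 = f1.permute(1, 0, 2, 3)             # transposed video segment matrix (c, M, h, w)
        logit = self.fc(f2.reshape(1, -1))      # flatten for the full-connection layer
        return torch.sigmoid(logit).item()      # violation confidence in [0, 1]
```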
It will be appreciated that, in the transposed video feature matrix used to obtain the violation confidence, the features represented by different elements correlate with violating features to different degrees. For example, if element 1 represents a wall feature and element 2 represents a human body feature, element 1 correlates weakly with violating features while element 2 correlates strongly.
Therefore, in order to better identify the elements highly correlated with violating features, optionally, after the step of transposing the video segment feature matrix to obtain the transposed video segment feature matrix, the illegal video identification method of the embodiments of the present invention may further include the following steps B1 to B3:
and B1, inputting the feature matrix of the transposed video segment into a attention mechanism sub-network of a preset identification model to obtain a space-time response weight matrix.
The attention mechanism sub-network may specifically be a function that extracts the degree of correlation between each element of the transposed video segment feature matrix and violating features, so that the space-time response weight matrix can subsequently be used to weight the transposed matrix in step B2.
B2, weighting the transposed video segment feature matrix with the space-time response weight matrix to obtain a video feature vector.
Illustratively, the space-time response weight matrix P1 of shape (M, h, w) is used to weight the transposed video segment feature matrix f2 (c, M, h, w), obtaining the video feature vector v whose i-th component is

    v_i = Σ_j Σ_(k,l) P1(j, k, l) · f2(i, j, k, l)

where j denotes the j-th image frame in the video segment formed by the M image frames, (k, l) denotes the rectangular area with coordinates (k, l) in the image frame, and i denotes the i-th dimension of the c-dimensional video feature vector.
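In code, and assuming the shapes above, this weighting is a single tensor contraction (a sketch, not the patent's exact computation):

```python
import torch

def weight_segment_features(f2, P1):
    """B2: video feature vector v_i = sum_j sum_{k,l} P1[j,k,l] * f2[i,j,k,l].

    f2: transposed video segment feature matrix, shape (c, M, h, w)
    P1: space-time response weight matrix, shape (M, h, w)
    """
    return torch.einsum('imkl,mkl->i', f2, P1)   # c-dimensional video feature vector
```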
Correspondingly, the above-mentioned step of inputting the transposed video segment feature matrix into the preset first full-connection function and taking the output as the input of the logistic regression loss function to obtain the violation confidence of the video then includes:
and B3, inputting the video feature vector into a preset first full-connection function to obtain an output, and taking the output as the input of the logistic regression loss function to obtain the violation confidence of the video.
Illustratively, the video feature vector v is input into the preset first full-connection function, and the resulting output is input into a preset activation function, such as a sigmoid activation function, to obtain the violation confidence of the video corresponding to the video segment feature matrix f1 (M, c, h, w).
Alternatively, considering that identifying a video as violating or legal is a binary classification, and that a video feature matrix used to identify action features need not attend to color features, the attention mechanism sub-network may also be an algorithm that obtains a space-time response weight matrix with value range [0,1] and dimension M·h·w. Step B1 may then include the following steps B11 to B13:
B11, transposing and reshaping the transposed video segment feature matrix for dimension reduction, obtaining a dimension-reduced video segment feature matrix;
B12, inputting the dimension-reduced video segment feature matrix into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
B13, restoring the shape of the response weight matrix to obtain the space-time response weight matrix.
For example, steps B11 to B13 may include: the transposed video segment feature matrix f2 (c, M, h, w) is transposed and reshaped to obtain a dimension-reduced video segment feature matrix of shape (M·h·w, c). The output obtained by inputting the dimension-reduced video segment feature matrix into the preset second full-connection function is used as the input of a preset activation function, such as a sigmoid activation function, to obtain a response weight matrix with value range [0,1] and dimension M·h·w. The response weight matrix is then restored in shape to obtain the space-time response weight matrix P1 = (M, h, w).
Of course, the preset first full-connection function and the preset second full-connection function may be full-connection functions with one hidden layer; both are used to reduce the elements of the video segment feature matrix, so as to obtain elements that reflect the violating or legal features of the video segment and to prevent overfitting. The difference is that the first full-connection function computes on the transposed video segment feature matrix, while the second computes on the dimension-reduced video segment feature matrix.
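A sketch of steps B11 to B13 under the shapes above; the single linear layer standing in for the second full-connection function is an assumption:

```python
import torch
import torch.nn as nn

class AttentionSubnetwork(nn.Module):
    """B11-B13: map f2 (c, M, h, w) to a space-time response weight
    matrix P1 (M, h, w) with values in [0, 1]."""
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Linear(c, 1)               # assumed second full-connection function

    def forward(self, f2):
        c, M, h, w = f2.shape
        # B11: transpose and reshape to the dimension-reduced matrix (M*h*w, c)
        reduced = f2.permute(1, 2, 3, 0).reshape(M * h * w, c)
        # B12: full-connection function + sigmoid -> response weights in [0, 1]
        weights = torch.sigmoid(self.fc(reduced)).squeeze(-1)   # (M*h*w,)
        # B13: restore the shape to obtain P1 (M, h, w)
        return weights.reshape(M, h, w)
```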
In specific applications, to reduce missed detections when illegal videos are identified with vision technology, supervisors generally re-audit the violation videos flagged by the recognition model. However, when the violating area of a violation video is small, for example a small region in a corner of the screen, and a supervisor must watch multiple violation videos at the same time, a violation video with a small violating area may be misjudged as legal during review, still causing missed detection.
Therefore, optionally, after the video is determined to violate through steps B11 to B13, the areas with high violation confidence in the violation video can be identified and the coordinates of the violating areas output, so that a supervisor can confirm them. This avoids misjudgment caused by violating areas that are hard to notice and reduces the probability of missed detection. Specifically, the following steps C1 to C3 may be used to label the areas with high violation confidence in the video:
C1, normalizing the space-time response weight matrix to obtain a violation response value for each preset rectangular area composing each image frame corresponding to the space-time response weight matrix;
C2, for each violation response value, judging whether it is greater than a preset violation threshold;
C3, if a violation response value is greater than the preset violation threshold, outputting the coordinate information of the preset rectangular area corresponding to that violation response value.
For example, steps C1 to C3 may specifically be: the space-time response weight matrix P1 (M, h, w) is divided element-wise by the sum of all its elements to achieve normalization, P1Re = P1 / sum(P1), which yields the violation response value P1Re(j, k, l) of the rectangular area with coordinates (k, l) in the j-th image frame. Whether each value of P1Re is greater than the preset violation threshold is then judged; when a value of P1Re is greater than the preset violation threshold, the coordinate information (k, l) of the corresponding rectangular area is output.
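A sketch of steps C1 to C3; the threshold value is an assumption:

```python
import torch

def locate_violating_areas(P1, threshold=0.01):
    """C1-C3: normalize the weight matrix and report high-response areas.

    P1: space-time response weight matrix, shape (M, h, w).
    Returns (frame_index, k, l) coordinates whose normalized response
    exceeds the preset violation threshold.
    """
    P1_re = P1 / P1.sum()                       # C1: normalization
    coords = (P1_re > threshold).nonzero()      # C2: compare with the threshold
    return [tuple(c.tolist()) for c in coords]  # C3: output (j, k, l) coordinates
```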
In specific applications, the parameters of the filters in a convolutional neural network are adjusted according to the filtering results on the input sample data, so different sample data yield convolutional neural networks with different parameters. However, if a single neural network is expected to identify the violating image features of different violation samples as comprehensively as possible, the model may become too complicated or fail to converge due to overfitting. Therefore, a plurality of preset recognition models may be adopted; using several models that recognize different violating image features as far as possible reduces missed violation videos while avoiding overfitting.
For this reason, as shown in fig. 3, in the flow of an illegal video identification method according to another embodiment of the present invention, there are a plurality of preset recognition models, and the method may include:
s301, a plurality of image frames are acquired from a video to be identified.
S301 is the same as A11 in the alternative embodiment of fig. 1 and is not repeated here; see the description of that embodiment for details.
S302, inputting the image frames into each preset recognition model respectively for feature extraction, and obtaining a plurality of image frame feature matrixes of each image frame.
For example, image frames A and B are respectively input into the preset recognition models F1, F2, ..., Fn, obtaining image frame feature matrices a1, a2, ..., an of frame A and image frame feature matrices b1, b2, ..., bn of frame B, where n is the number of preset recognition models.
S303, splicing the image frame feature matrices extracted by the same preset recognition model among the obtained image frame feature matrices to obtain the video segment feature matrix corresponding to that preset recognition model.
For example, the image frame feature matrices a1 and b1 extracted by the preset recognition model F1 are spliced to obtain the video segment feature matrix a1b1 of the video segment AB composed of frames A and B; the matrices a2 and b2 extracted by the preset recognition model F2 are spliced to obtain the video segment feature matrix a2b2 of the same segment; and so on, splicing yields a plurality of video segment feature matrices for the video segment composed of the plurality of image frames.
S304, respectively inputting the obtained video segment feature matrices into the classifier sub-network of the preset recognition model corresponding to each matrix, to obtain a plurality of violation confidences of the video.
For example, the obtained video segment feature matrices a1b1, a2b2, ..., anbn are respectively input into the classifier sub-networks of the preset recognition models F1, F2, ..., Fn, obtaining confidences P1, P2, ..., Pn that the video to be identified belongs to violation videos.
S305, fusing the confidence coefficients by using a preset fusion rule to obtain a target confidence coefficient.
Optionally, S305 may specifically include:
and inputting the confidence degrees into a preset weighted average algorithm to obtain a target confidence degree.
The preset weighted average algorithm may be a linear weighted average algorithm or a non-linear weighted average algorithm.
For example, in the linear weighted average algorithm, the confidence obtained by each preset recognition model has weight 1, and the target confidence can be obtained directly as the average of the confidences.
In the nonlinear weighted average algorithm, different weights may be set for the confidence obtained by each preset recognition model according to its importance or accuracy. For example, confidence P1 may be weighted 0.6, confidence P2 weighted 0.2, ..., and confidence Pn weighted 0.1. Each confidence is weighted accordingly, and the average based on the weighted confidences yields the target confidence.
Or, S305 may specifically further include:
and counting the number of the confidence degrees which are the same in the plurality of confidence degrees.
And determining the most number of the same confidence degrees as the target confidence degrees.
It can be understood that the detection result of a preset recognition model has some fault tolerance, and preset recognition models with different parameters give different recognition results for the same video feature matrix; the more models agree on a result, the closer the video corresponding to the feature matrix is to that result. Therefore, the confidence value shared by the largest number of models may be determined as the target confidence.
For example, if among 10 obtained confidences two are 0.4, three are 0.6, and five are 0.8, the target confidence is determined to be 0.8.
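Both fusion rules of S305 might be sketched as follows; the function names and the weight normalization in the weighted average are assumptions:

```python
from collections import Counter

def fuse_weighted_average(confidences, weights=None):
    """S305, rule 1: (weighted) average of the models' confidences."""
    if weights is None:
        weights = [1.0] * len(confidences)      # linear case: equal weights
    total = sum(w * c for w, c in zip(weights, confidences))
    return total / sum(weights)

def fuse_majority(confidences):
    """S305, rule 2: the confidence value shared by the most models wins."""
    value, _count = Counter(confidences).most_common(1)[0]
    return value

# e.g. ten models: two say 0.4, three say 0.6, five say 0.8 -> target 0.8
assert fuse_majority([0.4] * 2 + [0.6] * 3 + [0.8] * 5) == 0.8
```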
S306, if the target confidence coefficient meets the preset recognition condition, determining that the video is the illegal video.
The preset recognition condition may specifically be that the target confidence lies in a preset confidence interval, or that the target confidence is not less than a preset confidence threshold. The preset confidence interval and the preset confidence threshold are determined when training of the preset classification algorithm finishes.
Corresponding to the above method embodiment, the embodiment of the present invention further provides an apparatus for identifying an illegal video.
As shown in fig. 4, a structure of an apparatus for identifying an illegal video according to an embodiment of the present invention may include:
the recognition module 401 is configured to recognize a video to be recognized based on a preset recognition model to obtain a recognition result; wherein the recognition model comprises a first recognition model and a second recognition model; the recognition result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
a determining module 402, configured to determine that the video violates when at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
According to the illegal video identification apparatus provided by the embodiments of the invention, because the preset recognition model comprises the first recognition model and the second recognition model, the first recognition model memorizes the initial sample data while the second recognition model, trained on the updated data of the initial sample data, identifies the new violation videos corresponding to that updated data. The original violation videos corresponding to the initial sample data are thus not forgotten while new violation videos are still identified. Therefore, illegal videos can be identified stably and reliably without being affected by long-term update evolution, and the probability of missed detection is reduced.
Optionally, the identification module 401 in the embodiment of fig. 4 of the present invention may include:
the image acquisition submodule is used for acquiring a plurality of image frames from a video to be identified;
the characteristic extraction submodule is used for respectively extracting the characteristics of the plurality of image frames to obtain an image frame characteristic matrix of each image frame;
the splicing submodule is used for splicing the image frame characteristic matrixes to obtain a video segment characteristic matrix;
and the identification submodule is used for identifying the video segment characteristic matrix so as to determine whether the video is illegal.
Optionally, the feature extraction sub-module in the above embodiment may be specifically configured to:
and respectively carrying out feature extraction on the plurality of image frames based on a feature extraction sub-network of a preset identification model to obtain an image frame feature matrix of each image frame.
Optionally, the identifier module in the foregoing embodiment may be specifically configured to:
and identifying the video segment feature matrix based on a classifier sub-network of a preset identification model so as to determine whether the video is illegal.
Optionally, the identification sub-module in the above embodiment may include:
the confidence coefficient obtaining submodule is used for inputting the video segment characteristic matrix into the classifier subnetwork to obtain the violation confidence coefficient of the video; and if the violation confidence coefficient meets a preset violation condition, determining the video violation.
Optionally, the confidence obtaining sub-module in the above embodiment may be specifically configured to:
performing transposition processing on the video segment characteristic matrix to obtain a transposed video segment characteristic matrix;
and inputting the transposed video segment feature matrix into the preset first full-connection function to obtain an output, and taking that output as the input of the logistic regression loss function to obtain the violation confidence of the video.
Optionally, the apparatus for identifying an illegal video according to the embodiment of the present invention may further include: a weight matrix obtaining submodule and a feature vector obtaining submodule;
the weight matrix obtaining submodule is used for, after the confidence obtaining submodule performs transposition processing on the video segment feature matrix to obtain the transposed video segment feature matrix, inputting the transposed video segment feature matrix into an attention mechanism subnetwork of a preset identification model to obtain a space-time response weight matrix;
the feature vector obtaining submodule is used for carrying out weighting processing on the feature matrix of the transposed video segment by utilizing the space-time response weight matrix to obtain a video feature vector;
correspondingly, the confidence coefficient obtaining sub-module may be specifically configured to:
and inputting the video feature vector into a preset first full-connection function to obtain an output, and taking the output as the input of a logistic regression loss function to obtain the violation confidence of the video.
Optionally, the weight matrix obtaining sub-module in the above embodiment is specifically configured to:
performing transposition and dimension-reduction deformation on the transposed video segment feature matrix to obtain a dimension-reduced video segment feature matrix;
inputting the dimension-reduced video segment feature matrix into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
and performing deformation recovery on the response weight matrix to obtain a space-time response weight matrix.
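A minimal PyTorch sketch of the attention mechanism sub-network and the subsequent weighting step, assuming a (T, H, W, D) layout for the transposed video segment feature matrix and a sigmoid as the preset activation function; fc2 stands in for the preset second full-connection function:

```python
import torch

def attention_weights_and_vector(transposed: torch.Tensor,
                                 fc2: torch.nn.Linear):
    # transposed: (T, H, W, D); fc2 assumed to be torch.nn.Linear(D, 1)
    t, h, w, d = transposed.shape
    reduced = transposed.reshape(t * h * w, d)   # dimension-reduction deformation
    response = torch.sigmoid(fc2(reduced))       # second full-connection + activation
    weights = response.reshape(t, h, w)          # deformation recovery: space-time weight matrix
    # Weight every spatio-temporal position's features, then pool into a video feature vector.
    weighted = transposed * weights.unsqueeze(-1)
    video_vector = weighted.reshape(-1, d).sum(dim=0)
    return weights, video_vector
```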
Optionally, the apparatus for identifying an illegal video according to the embodiment of the present invention may further include: an illegal region labeling module, configured to label a region with a high violation confidence in the video by using the following steps:
after the identification submodule determines the video violation, normalizing the space-time response weight matrix to obtain a violation response value for each preset rectangular area making up each image frame corresponding to the space-time response weight matrix;
judging, for each violation response value, whether the violation response value is larger than a preset violation threshold;
and if the violation response value is larger than the preset violation threshold, outputting the coordinate information of the preset rectangular area corresponding to the violation response value.
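The region labeling steps above can be sketched as follows; min-max normalization, the 0.7 threshold, and the grid-cell-to-rectangle correspondence are illustrative assumptions only:

```python
import torch

def label_violation_regions(weights: torch.Tensor, threshold: float = 0.7):
    # weights: (T, H, W) space-time response weight matrix; each (row, col)
    # cell stands for one preset rectangular area of the corresponding frame.
    norm = (weights - weights.min()) / (weights.max() - weights.min() + 1e-8)
    regions = []
    for t, y, x in torch.nonzero(norm > threshold).tolist():
        regions.append({"frame": t, "cell_row": y, "cell_col": x,
                        "response": norm[t, y, x].item()})
    return regions
```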
Optionally, the number of the preset recognition models is multiple;
the feature extraction submodule is specifically configured to:
inputting the plurality of image frames into each preset recognition model respectively for feature extraction to obtain a plurality of image frame feature matrixes of each image frame;
the splicing submodule is specifically used for:
splicing image frame feature matrixes extracted by the same preset recognition model in the obtained multiple image frame feature matrixes to obtain video segment feature matrixes corresponding to the same preset recognition model;
the confidence coefficient obtaining sub-module is specifically configured to:
respectively inputting the obtained multiple video segment characteristic matrixes into a classifier sub-network of a preset identification model corresponding to the video segment characteristic matrixes to obtain multiple violation confidence coefficients of the video;
fusing the multiple violation confidences by using a preset fusion rule to obtain a target confidence;
and if the target confidence coefficient meets a preset identification condition, determining the video violation.
Optionally, the confidence coefficient obtaining sub-module is specifically configured to:
and inputting the multiple violation confidences into a preset weighted average algorithm to obtain a target confidence.
Optionally, the confidence coefficient obtaining sub-module is specifically configured to:
counting the number of the same violation confidences in the plurality of violation confidences;
and determining the confidence level of the same violation with the largest number as the target confidence level.
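For illustration, the two preset fusion rules above can be sketched in Python as follows; the equal default weights and the function names are assumptions for this sketch, not part of the disclosure:

```python
from collections import Counter

def fuse_weighted_average(confidences, weights=None):
    # Weighted average fusion: equal weights unless others are supplied.
    weights = weights or [1.0 / len(confidences)] * len(confidences)
    return sum(c * w for c, w in zip(confidences, weights))

def fuse_majority(confidences):
    # Count identical violation confidences; the most frequent one
    # becomes the target confidence.
    value, _count = Counter(confidences).most_common(1)[0]
    return value
```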
The embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with one another through the communication bus 504;
the memory 503 is used for storing computer programs;
the processor 501 is configured to implement all the steps of the above method for identifying an offending video when executing the computer program stored in the memory 503.
In the electronic device provided by the embodiment of the invention, the preset identification model comprises the first identification model and the second identification model, so the first identification model preserves what was learned from the initial sample data while the second identification model, trained on updated data of the initial sample data, identifies the new illegal videos corresponding to that updated data. The original illegal videos corresponding to the initial sample data are therefore not forgotten, and new illegal videos can still be identified. As a result, illegal videos can be identified stably and reliably, unaffected by long-term update and evolution, and the probability of missed detection is reduced.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, all the steps of the method for identifying an illegal video are implemented.
The computer-readable storage medium stores a computer program. When the computer program is executed by a processor, because the preset identification model includes a first identification model and a second identification model, the first identification model preserves what was learned from the initial sample data while the second identification model, trained on updated data of the initial sample data, identifies the new illegal videos corresponding to that updated data. The original illegal videos corresponding to the initial sample data are therefore not forgotten, and new illegal videos can still be identified. As a result, illegal videos can be identified stably and reliably, unaffected by long-term update and evolution, and the probability of missed detection is reduced.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Versatile Disc (DVD)), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device and electronic apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (27)

1. A method for identifying offending videos, the method comprising:
identifying a video to be identified based on a preset identification model to obtain an identification result; wherein the recognition model comprises a first recognition model and a second recognition model; the recognition result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
determining the video violation if at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
2. The method according to claim 1, wherein the identifying the video to be identified based on the preset identification model to obtain an identification result comprises:
acquiring a plurality of image frames from a video to be identified;
respectively extracting the features of the plurality of image frames to obtain an image frame feature matrix of each image frame;
splicing the image frame feature matrixes to obtain a video segment feature matrix;
identifying the video segment feature matrix to determine whether the video is illegal.
3. The method of claim 2, wherein the separately performing feature extraction on the plurality of image frames to obtain an image frame feature matrix of each image frame comprises:
and respectively carrying out feature extraction on the plurality of image frames based on a feature extraction sub-network of a preset identification model to obtain an image frame feature matrix of each image frame.
4. The method of claim 2, wherein the identifying the video segment feature matrix to determine whether the video is illegal comprises:
and identifying the video segment feature matrix based on a classifier sub-network of a preset identification model so as to determine whether the video is illegal.
5. The method of claim 4, wherein the identifying the video segment feature matrix based on the classifier sub-network of the preset identification model to determine whether the video is illegal comprises:
inputting the video segment feature matrix into the classifier sub-network to obtain the violation confidence of the video;
and if the violation confidence coefficient meets a preset violation condition, determining the video violation.
6. The method of claim 5, wherein said inputting the video segment feature matrix into the classifier sub-network to obtain a violation confidence for the video comprises:
performing transposition processing on the video segment characteristic matrix to obtain a transposed video segment characteristic matrix;
and inputting the transposed video segment feature matrix into a preset first full-connection function, and taking the resulting output as the input of a logistic regression loss function to obtain the violation confidence of the video.
7. The method according to claim 6, wherein after the step of transposing the feature matrices of the video segments to obtain transposed feature matrices of the video segments, the method further comprises:
inputting the transposed video segment feature matrix into an attention mechanism sub-network of a preset identification model to obtain a space-time response weight matrix;
weighting the transposed video segment feature matrix by using the space-time response weight matrix to obtain a video feature vector;
the obtaining of the violation confidence of the video by inputting the transposed video segment feature matrix into a preset first full-connection function and taking the resulting output as the input of a logistic regression loss function comprises:
and inputting the video feature vector into a preset first full-connection function to obtain an output, and taking the output as the input of a logistic regression loss function to obtain the violation confidence of the video.
8. The method of claim 7, wherein inputting the feature matrix of the transposed video segment into a predetermined attention mechanism sub-network of the recognition model to obtain a spatio-temporal response weight matrix comprises:
performing transposition and dimension-reduction deformation on the transposed video segment feature matrix to obtain a dimension-reduced video segment feature matrix;
inputting the dimension-reduced video segment feature matrix into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
and performing deformation recovery on the response weight matrix to obtain a space-time response weight matrix.
9. The method of claim 8, further comprising, after determining the video violation, marking a region in the video with a high violation confidence level by:
normalizing the space-time response weight matrix to obtain a violation response value for each preset rectangular area making up each image frame corresponding to the space-time response weight matrix;
judging, for each violation response value, whether the violation response value is larger than a preset violation threshold;
and if the violation response value is larger than the preset violation threshold, outputting the coordinate information of the preset rectangular area corresponding to the violation response value.
10. The method according to claim 5, wherein the number of the preset recognition models is plural;
the respectively extracting the features of the plurality of image frames to obtain an image frame feature matrix of each image frame includes:
inputting the plurality of image frames into each preset recognition model respectively for feature extraction to obtain a plurality of image frame feature matrixes of each image frame;
the splicing the image frame feature matrixes to obtain a video segment feature matrix comprises the following steps:
splicing image frame feature matrixes extracted by the same preset recognition model in the obtained multiple image frame feature matrixes to obtain video segment feature matrixes corresponding to the same preset recognition model;
the inputting the video segment feature matrix into the classifier sub-network of the preset identification model to obtain the violation confidence of the video comprises:
respectively inputting the obtained multiple video segment characteristic matrixes into a classifier sub-network of a preset identification model corresponding to the video segment characteristic matrixes to obtain multiple violation confidence coefficients of the video;
fusing the multiple violation confidences by using a preset fusion rule to obtain a target confidence;
if the violation confidence meets a preset violation condition, determining the video violation, including:
and if the target confidence coefficient meets a preset identification condition, determining the video violation.
11. The method according to claim 10, wherein fusing the violation confidences using a preset fusion rule to obtain a target confidence comprises:
and inputting the multiple violation confidences into a preset weighted average algorithm to obtain a target confidence.
12. The method according to claim 10, wherein fusing the violation confidences using a preset fusion rule to obtain a target confidence comprises:
counting the number of the same violation confidences in the plurality of violation confidences;
and determining the confidence level of the same violation with the largest number as the target confidence level.
13. The method according to claim 1, wherein the preset recognition model is obtained by training through the following steps:
inputting a plurality of collected sample images into an initial convolutional neural network model for training to obtain a predicted violation confidence of a video segment formed by the sample images;
judging, by using a preset error function, whether the convolutional neural network model at the current training stage has converged, according to the obtained predicted violation confidence and pre-labeled class information indicating whether each sample image is a violation;
if the current training stage is converged, determining the convolutional neural network model in the current training stage as a preset identification model;
if not, adjusting the model parameters of the convolutional neural network model in the current training stage by using a preset gradient function and a random gradient descent algorithm;
and inputting the collected multiple sample images into the adjusted convolutional neural network model, and repeating the steps of training and adjusting the model parameters until the adjusted convolutional neural network converges.
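By way of illustration only, the training procedure of this claim can be sketched as a PyTorch loop; SGD with learning rate 0.01, binary cross-entropy as the preset error function, and a loss-difference convergence test are assumptions for the sketch, since the claim does not name concrete functions:

```python
import torch

def train_recognition_model(model, loader, max_epochs=50, tol=1e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent
    criterion = torch.nn.BCELoss()                            # preset error function (assumed)
    prev_loss = float("inf")
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for frames, labels in loader:          # sample images and violation labels
            optimizer.zero_grad()
            confidence = model(frames)         # predicted violation confidence
            loss = criterion(confidence, labels)
            loss.backward()                    # gradients via the preset gradient function
            optimizer.step()                   # adjust the model parameters
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:  # convergence test on the error
            return model                       # converged: preset recognition model
        prev_loss = epoch_loss
    return model
```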
14. An apparatus for identifying offending videos, the apparatus comprising:
the identification module is used for identifying the video to be identified based on a preset identification model to obtain an identification result; wherein the recognition model comprises a first recognition model and a second recognition model; the recognition result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
a determining module, configured to determine the video violation when at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
15. The apparatus of claim 14, wherein the identification module comprises:
the image acquisition submodule is used for acquiring a plurality of image frames from a video to be identified;
the characteristic extraction submodule is used for respectively extracting the characteristics of the plurality of image frames to obtain an image frame characteristic matrix of each image frame;
the splicing submodule is used for splicing the image frame characteristic matrixes to obtain a video segment characteristic matrix;
and the identification submodule is used for identifying the video segment characteristic matrix so as to determine whether the video is illegal.
16. The apparatus of claim 15, wherein the feature extraction sub-module is specifically configured to:
and respectively carrying out feature extraction on the plurality of image frames based on a feature extraction sub-network of a preset identification model to obtain an image frame feature matrix of each image frame.
17. The apparatus according to claim 15, wherein the identification submodule is specifically configured to:
and identifying the video segment feature matrix based on a classifier sub-network of a preset identification model so as to determine whether the video is illegal.
18. The apparatus of claim 17, wherein the identification submodule comprises:
the confidence coefficient obtaining submodule is used for inputting the video segment characteristic matrix into the classifier subnetwork to obtain the violation confidence coefficient of the video; and if the violation confidence coefficient meets a preset violation condition, determining the video violation.
19. The apparatus according to claim 18, wherein the confidence level obtaining sub-module is specifically configured to:
performing transposition processing on the video segment characteristic matrix to obtain a transposed video segment characteristic matrix;
and inputting the transposed video segment feature matrix into a preset first full-connection function, and taking the resulting output as the input of a logistic regression loss function to obtain the violation confidence of the video.
20. The apparatus of claim 19, further comprising a weight matrix acquisition sub-module and a feature vector acquisition sub-module;
the weight matrix obtaining submodule is used for, after the confidence obtaining submodule performs transposition processing on the video segment feature matrix to obtain the transposed video segment feature matrix, inputting the transposed video segment feature matrix into an attention mechanism subnetwork of a preset identification model to obtain a space-time response weight matrix;
the feature vector obtaining submodule is used for carrying out weighting processing on the feature matrix of the transposed video segment by utilizing the space-time response weight matrix to obtain a video feature vector;
the confidence coefficient obtaining submodule is specifically configured to:
and inputting the video feature vector into a preset first full-connection function to obtain an output, and taking the output as the input of a logistic regression loss function to obtain the violation confidence of the video.
21. The apparatus according to claim 20, wherein the weight matrix obtaining sub-module is specifically configured to:
performing transposition and dimension-reduction deformation on the transposed video segment feature matrix to obtain a dimension-reduced video segment feature matrix;
inputting the dimension-reduced video segment feature matrix into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
and performing deformation recovery on the response weight matrix to obtain a space-time response weight matrix.
22. The apparatus of claim 21, further comprising: an illegal region labeling module, configured to identify a region with a high violation confidence in the video by using the following steps:
after the identification submodule determines the video violation, normalizing the space-time response weight matrix to obtain a violation response value for each preset rectangular area making up each image frame corresponding to the space-time response weight matrix;
judging, for each violation response value, whether the violation response value is larger than a preset violation threshold;
and if the violation response value is larger than the preset violation threshold, outputting the coordinate information of the preset rectangular area corresponding to the violation response value.
23. The apparatus according to claim 18, wherein the number of the preset recognition models is plural;
the feature extraction submodule is specifically configured to:
inputting the plurality of image frames into each preset recognition model respectively for feature extraction to obtain a plurality of image frame feature matrixes of each image frame;
the splicing submodule is specifically used for:
splicing image frame feature matrixes extracted by the same preset recognition model in the obtained multiple image frame feature matrixes to obtain video segment feature matrixes corresponding to the same preset recognition model;
the confidence coefficient obtaining sub-module is specifically configured to:
respectively inputting the obtained multiple video segment characteristic matrixes into a classifier sub-network of a preset identification model corresponding to the video segment characteristic matrixes to obtain multiple violation confidence coefficients of the video;
fusing the multiple violation confidences by using a preset fusion rule to obtain a target confidence;
and if the target confidence coefficient meets a preset identification condition, determining the video violation.
24. The apparatus according to claim 23, wherein the confidence level obtaining sub-module is specifically configured to:
and inputting the multiple violation confidences into a preset weighted average algorithm to obtain a target confidence.
25. The apparatus according to claim 23, wherein the confidence level obtaining sub-module is specifically configured to:
counting the number of the same violation confidences in the plurality of violation confidences;
and determining the confidence level of the same violation with the largest number as the target confidence level.
26. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 13 when executing a program stored in the memory.
27. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 13.
CN201811536558.6A 2018-12-14 2018-12-14 Illegal video identification method and device and electronic equipment Active CN111325067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811536558.6A CN111325067B (en) 2018-12-14 2018-12-14 Illegal video identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811536558.6A CN111325067B (en) 2018-12-14 2018-12-14 Illegal video identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111325067A true CN111325067A (en) 2020-06-23
CN111325067B CN111325067B (en) 2023-07-07

Family

ID=71170514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811536558.6A Active CN111325067B (en) 2018-12-14 2018-12-14 Illegal video identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111325067B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6941287B1 (en) * 1999-04-30 2005-09-06 E. I. Du Pont De Nemours And Company Distributed hierarchical evolutionary modeling and visualization of empirical data
WO2016033990A1 (en) * 2014-09-01 2016-03-10 华为技术有限公司 Method and device for generating detection model, and target detection method and device
US20160300156A1 (en) * 2015-04-10 2016-10-13 Facebook, Inc. Machine learning model tracking platform
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN107818313A (en) * 2017-11-20 2018-03-20 腾讯科技(深圳)有限公司 Vivo identification method, device, storage medium and computer equipment
CN107943941A (en) * 2017-11-23 2018-04-20 珠海金山网络游戏科技有限公司 It is a kind of can iteration renewal rubbish text recognition methods and system
CN108230296A (en) * 2017-11-30 2018-06-29 腾讯科技(深圳)有限公司 The recognition methods of characteristics of image and device, storage medium, electronic device
CN108256474A (en) * 2018-01-17 2018-07-06 百度在线网络技术(北京)有限公司 For identifying the method and apparatus of vegetable

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GWANG-HEE KIM et al.: "Comparison of construction cost estimating models based on regression analysis, neural networks, and case-based reasoning", BUILDING AND ENVIRONMENT *
王明军 et al.: "A large-scale terrain perception framework based on conditional random fields", vol. 32, no. 03, pages 326-333 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797752A (en) * 2020-06-29 2020-10-20 广州市百果园信息技术有限公司 Illegal video detection method, device, equipment and storage medium
CN114157829A (en) * 2020-09-08 2022-03-08 顺丰科技有限公司 Model training optimization method and device, computer equipment and storage medium
CN112507912A (en) * 2020-12-15 2021-03-16 网易(杭州)网络有限公司 Method and device for identifying illegal picture
WO2023115968A1 (en) * 2021-12-22 2023-06-29 工赋(青岛)科技有限公司 Method and device for identifying violation data at user end, medium, and program product

Also Published As

Publication number Publication date
CN111325067B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN110969066B (en) Live video identification method and device and electronic equipment
CN108921206B (en) Image classification method and device, electronic equipment and storage medium
CN111325067B (en) Illegal video identification method and device and electronic equipment
CN107276805B (en) Sample prediction method and device based on intrusion detection model and electronic equipment
CN110210617B (en) Confrontation sample generation method and generation device based on feature enhancement
CN109815845B (en) Face recognition method and device and storage medium
CN110321845B (en) Method and device for extracting emotion packets from video and electronic equipment
CN108376129B (en) Error correction method and device
CN110889463A (en) Sample labeling method and device, server and machine-readable storage medium
WO2021164232A1 (en) User identification method and apparatus, and device and storage medium
CN107909038B (en) Social relationship classification model training method and device, electronic equipment and medium
KR20220107120A (en) Method and apparatus of training anti-spoofing model, method and apparatus of performing anti-spoofing using anti-spoofing model, electronic device, storage medium, and computer program
CN109840413B (en) Phishing website detection method and device
CN111178364A (en) Image identification method and device
CN112163480B (en) Behavior identification method and device
CN110929785A (en) Data classification method and device, terminal equipment and readable storage medium
CN112765402A (en) Sensitive information identification method, device, equipment and storage medium
CN115797735A (en) Target detection method, device, equipment and storage medium
CN111738120A (en) Person identification method, person identification device, electronic equipment and storage medium
CN114817933A (en) Method and device for evaluating robustness of business prediction model and computing equipment
CN113221721A (en) Image recognition method, device, equipment and medium
CN111178498A (en) Stock fluctuation prediction method and device
CN116258906A (en) Object recognition method, training method and device of feature extraction model
CN114638304A (en) Training method of image recognition model, image recognition method and device
CN111368838A (en) Method and device for identifying reported screenshot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant