CN111325067B - Illegal video identification method and device and electronic equipment - Google Patents

Illegal video identification method and device and electronic equipment

Info

Publication number
CN111325067B
CN111325067B
Authority
CN
China
Prior art keywords
video
preset
violation
confidence
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811536558.6A
Other languages
Chinese (zh)
Other versions
CN111325067A (en)
Inventor
苏驰
刘弘也
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd and Beijing Kingsoft Cloud Technology Co Ltd
Priority claimed from CN201811536558.6A
Publication of CN111325067A
Application granted
Publication of CN111325067B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a method, an apparatus and an electronic device for identifying illegal videos. A video to be identified is identified based on a preset recognition model to obtain a recognition result. The recognition model comprises a first recognition model and a second recognition model, and the recognition result comprises a first recognition result based on the first recognition model and a second recognition result based on the second recognition model. If at least one of the first recognition result and the second recognition result indicates a violation, the video is determined to be a violating video. The first recognition model is obtained by training an initial convolutional neural network model in advance with initial sample data, and the second recognition model is obtained by further training the first recognition model in advance with updated data of the initial sample data. In this way, illegal videos can be identified stably and reliably, unaffected by long-term updating and evolution of the sample data, and the probability of missed detection is reduced.

Description

Illegal video identification method and device and electronic equipment
Technical Field
The present invention relates to the field of video identification technologies, and in particular, to a method and an apparatus for identifying illegal videos, and an electronic device.
Background
With the rapid development of the video industry, the number of videos facing supervision has grown explosively, and manually watching every video to identify illegal content can no longer meet the demand. Since video recognition is essentially an image recognition process, computer vision technology can be introduced to recognize video frames, realizing automatic video recognition and meeting the supervision requirements of massive video volumes.
In video recognition technology, recognizing a video frame by frame requires a large amount of computation and makes recognition far too slow. Most existing methods therefore perform frame-sampling inspection of the video based on standard image recognition technology. The scheme can be summarized as follows: sample frames from the video, input each sampled image into a pre-trained convolutional neural network to obtain a violation confidence for the image, and mark the image or the video as violating when the confidence is larger than a set threshold.
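To make this baseline concrete, here is a minimal sketch of the conventional frame-sampling check described above. PyTorch is assumed; `cnn` stands for any pre-trained per-frame violation classifier, and all names and the 0.5 threshold are illustrative, not from the patent.

```python
import torch

def frame_is_violating(cnn: torch.nn.Module, frame: torch.Tensor,
                       threshold: float = 0.5) -> bool:
    """frame: (3, H, W) float tensor; returns True when flagged as violating."""
    with torch.no_grad():
        # Assumption: cnn outputs a single violation logit for one frame.
        confidence = torch.sigmoid(cnn(frame.unsqueeze(0))).item()
    return confidence > threshold

def video_is_violating(cnn, sampled_frames, threshold=0.5):
    # The video is marked violating as soon as any sampled frame exceeds the threshold.
    return any(frame_is_violating(cnn, f, threshold) for f in sampled_frames)
```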
However, a previously trained convolutional neural network often fails to identify new offending videos that were absent from its sample data. If the new offending videos are incorporated to update the original sample data and the network is retrained on the updated data, then after long-term updating and evolution the retrained network may in turn fail to identify the original offending videos. For example, if in a certain period the original offending videos rarely appear while new offending videos appear in large numbers, the updated sample data will contain far more new offending videos than original ones, and a network trained on it cannot reliably identify the original offending videos.
Therefore, how to identify offending videos stably and reliably, unaffected by long-term updating and evolution, is a problem that existing video identification technology needs to solve.
Disclosure of Invention
The embodiment of the invention aims to provide a method, an apparatus and an electronic device for identifying illegal videos, so that illegal videos are identified stably and reliably, unaffected by long-term updating and evolution, and the probability of missed detection is reduced. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a method for identifying an offensive video, where the method includes:
identifying the video to be identified based on a preset identification model to obtain an identification result; the recognition model comprises a first recognition model and a second recognition model; the identification result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
if at least one of the first recognition result and the second recognition result indicates a violation, determining that the video is a violating video;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
In a second aspect, an embodiment of the present invention provides an apparatus for identifying an offensive video, including:
the identification module is used for identifying the video to be identified based on a preset identification model to obtain an identification result; wherein the recognition model comprises a first recognition model and a second recognition model; the identification result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
the determining module is used for determining that the video is violating when at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
In a third aspect, an embodiment of the present invention provides an electronic device, where the device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
A memory for storing a computer program;
and the processor is used for realizing all the steps of the illegal video identification method provided in the first aspect when executing the program stored in the memory.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for identifying offensive video as provided in the first aspect above.
According to the method, the apparatus and the electronic device for identifying illegal videos provided by the embodiments of the invention, a video to be identified is identified based on a preset recognition model to obtain a recognition result. The recognition model comprises a first recognition model and a second recognition model, and the recognition result comprises a first recognition result based on the first recognition model and a second recognition result based on the second recognition model. If at least one of the two results indicates a violation, the video is determined to be violating. The first recognition model is obtained by training an initial convolutional neural network model in advance with initial sample data, and the second recognition model is obtained by further training the first recognition model in advance with updated data of the initial sample data. Because the preset recognition model comprises both models, the first recognition model preserves what was learned from the initial sample data, while the second recognition model, trained on the updated data, recognizes the new offending videos that the updated data introduces. New offending videos can therefore be recognized without forgetting the original offending videos corresponding to the initial sample data. As a result, offending videos can be identified stably and reliably, unaffected by long-term updating and evolution, and the probability of missed detection is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a method for identifying offending video according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training method of a preset recognition model according to an embodiment of the invention;
FIG. 3 is a flow chart of a method for identifying offending video according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of a device for identifying offensive video according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following first describes a method for identifying illegal videos provided by the embodiment of the invention.
It should be noted that, the method for identifying the offensive video provided by the embodiment of the present invention may be applied to an electronic device capable of performing data processing, where the device includes a desktop computer, a portable computer, an internet television, an intelligent mobile terminal, a wearable intelligent terminal, a server, etc., which are not limited herein, and any electronic device capable of implementing the embodiment of the present invention belongs to the protection scope of the embodiment of the present invention.
As shown in fig. 1, a flow of a method for identifying offensive video according to an embodiment of the present invention may include:
s101, identifying the video to be identified based on a preset identification model to obtain an identification result. The preset recognition model comprises a first recognition model and a second recognition model; the identification result comprises: a first recognition result based on the first recognition model, and a second recognition result based on the second recognition model.
The first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
Specifically, the initial sample data is a sample data set containing offending videos and legal videos, collected before the first recognition model is trained. The updated data of the initial sample data may be a sample data set obtained by incorporating new offending videos into the initial sample data and updating it once, when new offending videos first appear; alternatively, new offending videos may be incorporated each time they appear, updating the sample data set multiple times. The first and second recognition models can be trained with existing methods for training convolutional neural networks, such as a mini-batch stochastic gradient descent algorithm.
In order to obtain the first recognition result and the second recognition result based on the preset recognition model, the method for recognizing the video to be recognized based on the preset recognition model to obtain the recognition result specifically may include the following steps A1 to A2:
a1, identifying the video to be identified based on a first identification model to obtain a first identification result;
a2, identifying the video to be identified based on the second identification model to obtain a second identification result.
The execution order of steps A1 and A2 is not limited in this embodiment: they may be executed simultaneously, A1 may be executed before A2, or A2 may be executed before A1.
It can be understood that, since the first recognition model is obtained by training the initial convolutional neural network model with the initial sample data in advance, and the second recognition model is obtained by training the first recognition model with the updated data of the initial sample data in advance, the first recognition result indicates whether the video to be identified is an offending video of the kind contained in the initial sample data, and the second recognition result indicates whether it is an offending video of the kind contained in the updated data.
That is, recognition by the first recognition model determines whether the video to be identified matches the feature information of the offending videos contained in the initial sample data, i.e., whether its violation type is one of the types of offending video contained in the initial sample data; recognition by the second recognition model determines whether the video matches the feature information of the offending videos contained in the updated data, i.e., whether its violation type is one of the types contained in the updated data.
In addition, the first recognition result and the second recognition result may specifically be a violation confidence that characterizes a probability that the video to be recognized is a violation video, or may be an identifier that indicates that the video to be recognized is a violation or legal, for example, 1 or 0.
Of course, the first recognition model and the second recognition model may recognize the input video either by the existing frame-sampling inspection based on standard image recognition technology, or by video clip inspection of the video to be identified based on standard image recognition technology.
S102, if at least one of the first recognition result and the second recognition result is a violation, determining that the video is violating.
If one of the first recognition result and the second recognition result is a violation, the video to be identified has been recognized as violating by either the first recognition model or the second recognition model.
Illustratively, suppose the offending videos contained in the initial sample data are characterized by exposed body parts, and the offending videos contained in the updated data are characterized by violating human actions. If the video to be identified shows a naked body part, the first recognition result is a violation and the second is not; if the video shows a violating human action, the second recognition result is a violation and the first is not.
If the first recognition result and the second recognition result are both violations, the video to be identified has been recognized as violating by both the first recognition model and the second recognition model.
Illustratively, under the same characterization, if the video to be identified shows both an exposed body part and a violating action, the first recognition result and the second recognition result are both violations.
It can be seen that if at least one of the first recognition result and the second recognition result is a violation, the video has been recognized as violating by at least one of the two models, and the video can therefore be determined to be violating.
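A minimal sketch of S101-S102, assuming PyTorch and that each model maps the sampled frames of a video to a single violation logit (both assumptions are ours, not the patent's):

```python
import torch

def identify_video(first_model, second_model, video_frames: torch.Tensor,
                   threshold: float = 0.5) -> bool:
    """video_frames: (M, 3, H, W). Returns True when the video is violating."""
    with torch.no_grad():
        first_violates = torch.sigmoid(first_model(video_frames)).item() > threshold
        second_violates = torch.sigmoid(second_model(video_frames)).item() > threshold
    # S102: the video is violating when AT LEAST ONE result is a violation, so
    # violations memorized from the initial sample data are not forgotten.
    return first_violates or second_violates
```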
According to the method for identifying illegal videos provided by this embodiment, the preset recognition model comprises the first recognition model and the second recognition model: the first recognition model preserves what was learned from the initial sample data, while the second recognition model, trained on the updated data, recognizes the new offending videos introduced by that data. New offending videos can thus be recognized without forgetting the original offending videos corresponding to the initial sample data, so offending videos are identified stably and reliably, unaffected by long-term updating and evolution, and the probability of missed detection is reduced.
Optionally, as shown in fig. 2, the process of the training method of the preset recognition model in the embodiment of fig. 1 of the present invention may include:
s201, inputting the collected multiple sample images into an initial convolutional neural network model for training, and obtaining prediction violation confidence of a video segment formed by the multiple sample images.
The prediction violation confidence is the probability, produced by the initial convolutional neural network model after processing the input sample images, that the video segment formed by those sample images is a violating video; it is the model's detection result for the sample images.
S202, judging whether the convolutional neural network model at the current training stage has converged, using a preset error function together with the obtained prediction violation confidences and the preset classification information of whether each sample image is violating. If it has converged, step S203 is performed; if not, steps S204 to S205 are performed.
And S203, determining the convolutional neural network model in the current training stage as a preset recognition model.
Judging whether the current recognition model has converged using the preset error function may specifically mean minimizing the preset error function: when the function reaches its minimum value, the current recognition model has converged; when it has not, the model has not converged.
The preset error function measures the difference between the pre-labeled violation category information of each sample image and the detection result of the convolutional neural network model at the current training stage; the smaller the difference, the more accurate the detection result. When the preset error function reaches its minimum value, the model's detection results for the sample images agree with the pre-labeled category information. Therefore, when the convolutional neural network model at the current training stage converges, it can be determined to be the preset recognition model.
S204, adjusting model parameters of the convolutional neural network model in the current training stage by using a preset gradient function and adopting a random gradient descent algorithm.
S205, inputting the collected multiple sample images into the adjusted convolutional neural network model, and repeating the steps of training and adjusting model parameters until the adjusted convolutional neural network converges.
The stochastic gradient descent algorithm adjusts the model parameters of the convolutional neural network model at the current training stage so that, after adjustment, the detection results improve and their difference from the pre-labeled category information shrinks, eventually reaching convergence.
Accordingly, the above steps of training and adjusting the model parameters are repeated until the model at the current training stage converges. Each round of training is, of course, performed on the convolutional neural network model with the most recently adjusted parameters.
Meanwhile, it can be understood that the first recognition model and the second recognition model may both be obtained by training in the manner of the embodiment of fig. 2, the difference being: when training the first recognition model, the model in step S201 is the initial convolutional neural network and the sample images in step S201 come from the initial sample data; when training the second recognition model, the model in step S201 is the first recognition model and the sample images come from the updated data of the initial sample data.
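A training sketch of S201-S205 under stated assumptions: PyTorch, binary cross-entropy standing in for the preset error function, plain SGD as the gradient algorithm, and a loader yielding (frames, label) pairs with label 1 for violating clips. The convergence test on the epoch loss is a simplification of S202.

```python
import torch
import torch.nn as nn

def train_recognition_model(model, loader, epochs=10, lr=1e-3, tol=1e-4):
    criterion = nn.BCEWithLogitsLoss()                      # preset error function (assumed)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # stochastic gradient descent
    prev_loss = float("inf")
    for _ in range(epochs):
        epoch_loss = 0.0
        for frames, label in loader:          # frames: (B, M, 3, H, W)
            optimizer.zero_grad()
            pred = model(frames).squeeze(-1)  # predicted violation confidence (logit)
            loss = criterion(pred, label.float())
            loss.backward()                   # preset gradient function (S204)
            optimizer.step()                  # adjust model parameters
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol: # crude convergence check (S202)
            break
        prev_loss = epoch_loss
    return model
```

To obtain the first recognition model, `model` is the initial convolutional neural network and `loader` serves the initial sample data; to obtain the second, `model` is the trained first model and `loader` serves the updated data.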
Furthermore, in practice, if a vulgar action occurs in a video, that video is also a violating video. However, a convolutional neural network that identifies individual image frames can only recognize the single frames that make up such an action, not the overall characteristics of the video clip those frames form; vulgar actions are therefore hard to identify, and the violating videos are missed.
For this purpose, at least one of the steps A1 and A2 in the above-described embodiment of fig. 1 of the present invention may specifically include the following steps a11 to a14:
a11, acquiring a plurality of image frames from the video to be identified.
The plurality of image frames may specifically be acquired from the video to be identified at a preset period, yielding equally spaced image frames. Since an action is made up of consecutive frames and adjacent frames differ little, equally spaced frames retain the frames that reflect the action's characteristics while avoiding the huge data volume, and the resulting slow processing, of taking every consecutive frame.
For example, suppose the video to be identified contains the image frames of a person drinking water: consecutive frames 1 to 5 may show the person's hand touching the cup, frames 6 to 15 the person picking up the cup, and frames 16 to 25 the person drinking. When frames are acquired at a preset period, the 5th frame A (hand touching the cup), the 10th and 15th frames B and C (picking up the cup), and the 20th and 25th frames D and E (drinking) are obtained, reflecting the characteristics of the drinking action with relatively few frames.
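A sketch of step A11 with OpenCV, sampling at a preset period; a period of 5 picks frames 5, 10, 15, ... exactly as in the drinking example (the function name and period value are illustrative):

```python
import cv2

def sample_frames(video_path: str, period: int = 5):
    frames, index = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        index += 1
        if index % period == 0:   # keep every period-th frame (5th, 10th, ...)
            frames.append(frame)  # BGR image of shape (H, W, 3)
    cap.release()
    return frames
```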
And A12, respectively extracting the characteristics of the plurality of image frames to obtain an image frame characteristic matrix of each image frame.
The feature extraction method for the image frame can be various. For example, feature extraction may be performed on each image frame by using a preset convolutional neural network, where the preset convolutional neural network is trained by using a plurality of sample images in advance, and the plurality of sample images may constitute a sample video belonging to the offending video. The feature extraction may be performed on a plurality of image frames by using a HOG (Histogram of Oriented Gradient) feature algorithm or a feature extraction algorithm such as an LBP (Local Binary Pattern) feature extraction algorithm.
Any feature extraction algorithm that can be used to extract both offending and non-offending features of an image can be used with the present invention, which is not limited in this embodiment.
For example, the feature extraction is performed on the image frame a, the image frame B, the image frame C, the image frame D, and the image frame E, respectively, to obtain an image feature matrix a of the image frame a, an image feature matrix B of the image frame B, an image feature matrix C of the image frame C, an image feature matrix D of the image frame D, and an image feature matrix E of the image frame E.
Optionally, in a specific application, the feature extraction algorithm for extracting the offending features and the non-offending features of the image may be used as a sub-network of the preset recognition model in the embodiment of fig. 1 of the present invention, and the step a12 may specifically include:
and respectively carrying out feature extraction on the plurality of image frames based on a feature extraction sub-network of a preset identification model to obtain an image frame feature matrix of each image frame.
For example, the image frame A and the image frame B are each input into the first recognition model F1 and the second recognition model F2, obtaining image frame feature matrices a1 and a2 of image frame A, and image frame feature matrices b1 and b2 of image frame B.
And A13, splicing the characteristic matrixes of the plurality of image frames to obtain a characteristic matrix of the video segment.
It will be appreciated that a plurality of image frames may constitute a video clip, while the characteristics of the video clip need to reflect the changing relationship in the time dimension of the individual image frames that constitute the video clip. Therefore, the image frame feature matrices can be spliced to obtain the video segment feature matrix capable of reflecting the video segment features formed by the image frames.
The image frames acquired during illegal video identification are three-channel color images, and the corresponding image frame feature matrix is a three-dimensional feature matrix. Therefore, the image frame feature matrices are spliced, specifically, a plurality of image frame feature matrices may be spliced into a four-dimensional feature matrix, for example, image frame feature matrices (c, h, w) of M image frames are spliced into video segment feature matrices (M, c, h, w) of video segments composed of M image frames. Where h is the length of the matrix, w is the width of the matrix, and c is the number of channels of the matrix.
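A sketch of steps A12-A13, assuming PyTorch; `feature_extractor` stands for any of the extractors named above (a CNN subnetwork, HOG, LBP) that maps one frame to a (c, h, w) matrix:

```python
import torch

def build_segment_features(feature_extractor, frames: torch.Tensor) -> torch.Tensor:
    """frames: (M, 3, H, W) -> video segment feature matrix of shape (M, c, h, w)."""
    per_frame = [feature_extractor(f.unsqueeze(0)).squeeze(0)  # (c, h, w) per frame
                 for f in frames]
    return torch.stack(per_frame, dim=0)  # splice along a new time dimension: (M, c, h, w)
```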
A14, identifying the video segment feature matrix to determine whether the video is violating.
Identifying the video segment feature matrix to determine whether the video is violating means identifying whether the video features reflected by the matrix are violating. The video segment feature matrix can therefore be identified with any classification algorithm that yields a violating-or-legal result.
By way of example, the classification algorithm may be a classifier model such as a SoftMax classifier. Of course, the classification algorithm is trained in advance with a plurality of sample images including violating and non-violating ones. Any classification algorithm that can distinguish violating features from non-violating features can be used with the present invention, which is not limited in this embodiment.
The video segment feature matrix, obtained by splicing the image frame feature matrices of the plurality of image frames, reflects the overall features of the video segment those frames form, including its action features. Compared with identifying only single image frames, this identifies not only nudity within a single frame but also vulgar actions within the clip, reducing the probability that violating videos are missed because such actions cannot be identified.
It should be noted that identifying a plurality of image frames is less efficient than identifying a single image frame. When the efficiency of identifying offending videos matters more than reducing the probability of missed detection, one of the first and second recognition models may identify video segments of the video to be identified using steps A11 to A14 above, while the other uses the existing single-frame method. For the model performing single-frame recognition, one of the plurality of image frames may be selected as its input to obtain its recognition result for that frame.
When the need to reduce the probability of missed detection takes precedence over the need for efficiency of offending video identification, both the first identification model and the second identification model may employ the steps a11 to a14 described above to identify video segments of the video to be identified.
Optionally, in a specific application, any classification algorithm for identifying the feature matrix of the video clip may be used as a sub-network of the preset identification model in the embodiment of fig. 1 of the present invention, and step a14 may specifically include:
and the classifier subnetwork based on a preset recognition model recognizes the video segment feature matrix to determine whether the video is illegal.
Specifically, the classifier subnetwork of the preset recognition model may recognize the video segment feature matrix by taking the matrix as input and outputting either the confidence that the corresponding video segment is violating, or an identifier marking it as violating.
Optionally, identifying the video segment feature matrix with the classifier subnetwork of the preset recognition model to determine whether the video is violating may include:
inputting the video segment feature matrix into the classifier subnetwork of the preset recognition model to obtain the violation confidence of the video;
and if the violation confidence meets a preset violation condition, determining that the video is violating.
The classifier sub-network is used for acquiring the confidence coefficient of the video segment corresponding to the input video feature matrix, and the confidence coefficient is used as the confidence coefficient of the video belonging to the illegal type. The preset violation condition may specifically be that the violation confidence belongs to a preset confidence interval, or the violation confidence is not less than a preset confidence threshold. The preset confidence interval and the preset confidence threshold are determined when the classification algorithm is trained.
For example, M image frames are acquired from the video to be identified at a preset period, where M > 1 and the M image frames are three-channel RGB frames of width W and height H. The M image frames are input into the feature extraction subnetwork of the preset recognition model, which extracts an image frame feature matrix (c, h, w) for each frame; these are spliced into the video segment feature matrix f1 = (M, c, h, w) of the video segment formed by the M frames, where h is the length of the matrix, w is the width of the matrix, and c is the number of channels. The video segment feature matrix f1 = (M, c, h, w) is input into the classifier subnetwork of the preset recognition model, which computes the violation confidence of the video segment corresponding to f1 = (M, c, h, w); this is used as the violation confidence of the video.
Optionally, the step of inputting the feature matrix of the video segment into a classifier subnetwork of a preset recognition model to obtain the violation confidence of the video may specifically include:
performing transposition processing on the video segment feature matrix to obtain a transposed video segment feature matrix;
and taking the output obtained after inputting the transposed video segment feature matrix into the preset first full-connection function as the input of the logistic regression loss function, to obtain the violation confidence of the video.
The video segment feature matrix is transposed to turn the matrix of video segment features into a form convenient for input into the first full-connection function. For example, the video segment feature matrix f1 = (M, c, h, w) is transposed to obtain the transposed video segment feature matrix f2 = (c, M, h, w). The logistic regression loss function may specifically be a sigmoid activation function.
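A sketch of this transpose-then-classify path, assuming PyTorch; flattening f2 before the linear layer is our reading of "inputting the transposed matrix into the first full-connection function", and the sigmoid realizes the logistic regression step:

```python
import torch
import torch.nn as nn

M, c, h, w = 8, 64, 7, 7                # illustrative sizes
f1 = torch.randn(M, c, h, w)            # video segment feature matrix
f2 = f1.permute(1, 0, 2, 3)             # transposed matrix, shape (c, M, h, w)

first_fc = nn.Linear(c * M * h * w, 1)  # preset first full-connection function
confidence = torch.sigmoid(first_fc(f2.reshape(1, -1)))  # violation confidence in [0, 1]
```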
It can be appreciated that in the transposed video feature matrix for obtaining the violation confidence, the correlation degree between the feature represented by the different elements and the violation feature is different. For example, if the feature 1 represented by the element 1 is a wall feature and the feature 2 represented by the element 2 is a human body feature, the correlation degree between the feature represented by the element 1 and the offending feature is low, and the correlation degree between the feature represented by the element 2 and the offending feature is high.
Therefore, in order to better identify the elements with high correlation degree to the offensive feature, optionally, after the step of transposing the video segment feature matrix to obtain the transposed video segment feature matrix, the method for identifying the offensive video according to the embodiment of the present invention may further include steps B1 to B3 as follows:
and B1, inputting the transposed video segment feature matrix into a attention mechanism sub-network of a preset recognition model to obtain a space-time response weight matrix.
The attention mechanism subnetwork may specifically be a function for extracting the correlation degree between each element and the offending feature in the transposed video segment feature matrix, so that the transposed video segment feature matrix is weighted by using the spatio-temporal response weight matrix in step B2.
And B2, weighting the transposed video segment feature matrix with the spatio-temporal response weight matrix to obtain a video feature vector.
Illustratively, a spatio-temporal response weight matrix p1 = (M, h, w) is used to weight the transposed video segment feature matrix f2 = (c, M, h, w), obtaining the video feature vector v with components

v_i = Σ_j Σ_(k,l) p1(j, k, l) · f2(i, j, k, l)

where j denotes the j-th image frame of the video segment formed by the M image frames, (k, l) denotes the rectangular area with coordinates (k, l) in the image frame, and i denotes the i-th dimension of the c-dimensional video feature vector.
Correspondingly, the step of taking the output obtained after inputting the transposed video segment feature matrix into the preset first full-connection function as the input of the logistic regression loss function to obtain the violation confidence of the video comprises:
B3, taking the output obtained after inputting the video feature vector into the preset first full-connection function as the input of the logistic regression loss function, to obtain the violation confidence of the video.
Illustratively, the video feature vector v is input into the preset first full-connection function, and the resulting output is input into a preset activation function such as the sigmoid activation function, obtaining the violation confidence of the video corresponding to the video segment feature matrix f1 = (M, c, h, w).
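A sketch of steps B2-B3 under the same PyTorch assumption; the einsum line is exactly the summation v_i = Σ_j Σ_(k,l) p1(j, k, l) · f2(i, j, k, l) reconstructed above:

```python
import torch
import torch.nn as nn

def weighted_confidence(f2: torch.Tensor, p1: torch.Tensor,
                        first_fc: nn.Linear) -> torch.Tensor:
    # v_i = sum_j sum_{k,l} p1(j, k, l) * f2(i, j, k, l)
    v = torch.einsum('jkl,ijkl->i', p1, f2)  # video feature vector, shape (c,)
    return torch.sigmoid(first_fc(v))        # violation confidence in [0, 1]

c, M, h, w = 64, 8, 7, 7                     # illustrative sizes
f2 = torch.randn(c, M, h, w)                 # transposed video segment feature matrix
p1 = torch.rand(M, h, w)                     # spatio-temporal response weight matrix
conf = weighted_confidence(f2, p1, nn.Linear(c, 1))
```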
Alternatively, considering that recognizing a video as violating or legal is a binary classification, and that a video feature matrix used to recognize action features need not attend to color features, the attention mechanism subnetwork may also be an algorithm that obtains a spatio-temporal response weight matrix with value range [0, 1] and dimension M·h·w, in which case step B1 may include the following steps B11 to B13:
B11, transposing and reshaping the transposed video segment feature matrix to obtain a dimension-reduced video segment feature matrix;
B12, inputting the dimension-reduced video segment feature matrix into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
and B13, reshaping the response weight matrix back to obtain the spatio-temporal response weight matrix.
Illustratively, steps B11 to B13 may include: the transposed video segment feature matrix f2 = (c, M, h, w) is transposed and reshaped to obtain a dimension-reduced video segment feature matrix of dimension M·h·w × c. The output obtained after inputting the dimension-reduced matrix into the preset second full-connection function is used as the input of a preset activation function, such as the sigmoid activation function, to obtain a response weight matrix with value range [0, 1] and dimension M·h·w. The response weight matrix is then reshaped back to obtain the spatio-temporal response weight matrix p1 = (M, h, w).
Of course, the preset first full-connection function and the preset second full-connection function may both be full-connection functions with a single hidden layer, used to reduce the elements of the video segment feature matrix to those that reflect its violating or legal features and to prevent overfitting. The difference is that the first full-connection function operates on the transposed video segment feature matrix, while the second operates on the dimension-reduced video segment feature matrix.
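A sketch of the attention mechanism subnetwork of steps B11-B13, assuming PyTorch; the second full-connection function is modeled as a linear layer mapping each c-dimensional location feature to one scalar, and the sigmoid keeps every weight in [0, 1]:

```python
import torch
import torch.nn as nn

def attention_weights(f2: torch.Tensor, second_fc: nn.Linear) -> torch.Tensor:
    c, M, h, w = f2.shape
    # B11: transpose and reshape (c, M, h, w) -> (M*h*w, c)
    reduced = f2.permute(1, 2, 3, 0).reshape(M * h * w, c)
    # B12: second full-connection function + sigmoid -> weights in [0, 1]
    response = torch.sigmoid(second_fc(reduced))   # shape (M*h*w, 1)
    # B13: restore the shape to the spatio-temporal weight matrix (M, h, w)
    return response.reshape(M, h, w)

c, M, h, w = 64, 8, 7, 7
p1 = attention_weights(torch.randn(c, M, h, w), nn.Linear(c, 1))
```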
In practice, to reduce missed detections when offending videos are identified with visual technology, a supervisor generally performs secondary verification of the videos flagged by the recognition model. However, when the violating area of a video is very small, for example a small region in the corner of the picture, and the supervisor must watch multiple channels of flagged video at the same time, a video with a very small violating area may be misjudged as legal during the audit, still causing missed detections.
Therefore, optionally, after the video is determined to be violating using the spatio-temporal response weight matrix of steps B11 to B13, the embodiment of the present invention may further identify the regions of the violating video with high violation confidence and output the coordinates of the violating region. This makes it easy for the supervisor to confirm the violation, avoids misjudgments caused by violating regions that are hard to spot, and reduces the probability of missed detection. Labeling the regions of high violation confidence may specifically be implemented with steps C1 to C3:
C1, normalizing the spatio-temporal response weight matrix to obtain a violation response value for each preset rectangular area of each image frame corresponding to the spatio-temporal response weight matrix;
C2, judging, for each violation response value, whether it is larger than a preset violation threshold;
and C3, if a violation response value is larger than the preset violation threshold, outputting the coordinate information of the preset rectangular area corresponding to that violation response value.
Illustratively, steps C1 to C3 may specifically be: the spatio-temporal response weight matrix p1 = (M, h, w) is divided point-wise by the sum of all its elements to realize normalization: p1Re = p1 / sum(p1), giving the violation response value p1Re of the rectangular region with coordinates (k, l) in the j-th image frame. Whether p1Re is larger than the preset violation threshold is then judged; when p1Re is larger than the preset violation threshold, the coordinate information (k, l) of the corresponding rectangular area is output.
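A sketch of steps C1-C3 in PyTorch; the threshold argument stands for the preset violation threshold, and the returned (j, k, l) triples are the frame index and rectangular-area coordinates to show a reviewer:

```python
import torch

def violating_regions(p1: torch.Tensor, threshold: float):
    p1_re = p1 / p1.sum()                 # C1: point-wise normalization p1Re = p1 / sum(p1)
    hits = (p1_re > threshold).nonzero()  # C2: compare with the preset violation threshold
    # C3: output (j, k, l) for each region whose response exceeds the threshold
    return [tuple(idx.tolist()) for idx in hits]
```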
In practice, because the filter parameters of a convolutional neural network are adjusted according to its filtering results on the input sample data, different sample data yield convolutional neural networks with different parameters. If a single neural network were expected to identify the violating image features of all the different violating samples as comprehensively as possible, its model might become overly complex or fail to converge due to overfitting. Therefore, a plurality of preset recognition models can be adopted: together they can recognize as many different violating image features as possible, reducing missed detections of offending videos while avoiding the overfitting problem.
For this purpose, as shown in fig. 3, in another embodiment of the present invention, the method for identifying offensive video includes:
s301, acquiring a plurality of image frames from a video to be identified.
S301 is the same as step A11 in the alternative embodiment of fig. 1 and is described in detail there; it is not repeated here.
S302, inputting a plurality of image frames into each preset recognition model to perform feature extraction, and obtaining a plurality of image frame feature matrixes of each image frame.
For example, the image frame A and the image frame B are each input into the preset recognition models F1, F2, … …, Fn, obtaining image frame feature matrices a1, a2, … …, an of image frame A and image frame feature matrices b1, b2, … …, bn of image frame B. Where n is the number of preset recognition models.
S303, splicing the image frame feature matrixes extracted by the same preset recognition model in the obtained multiple image frame feature matrixes to obtain a video segment feature matrix corresponding to the same preset recognition model.
For example, the image frame feature matrix a1 and the image frame feature matrix B1 extracted by the preset recognition model F1 are spliced to obtain a video segment feature matrix a1B1 of the video segment AB formed by the image frame a and the image frame B. And splicing the image frame feature matrix a2 and the image frame feature matrix B2 extracted by the preset recognition model F2 to obtain a video segment feature matrix a2B2 of a video segment AB formed by the image frame A and the image frame B. And by analogy, splicing to obtain a plurality of video segment feature matrixes of the video segments formed by a plurality of image frames.
S304, respectively inputting the obtained multiple video segment feature matrixes into a classifier sub-network of a preset recognition model corresponding to the video segment feature matrixes, and obtaining multiple violation confidences of the video.
For example, the obtained video segment feature matrix a1b1, the video segment feature matrices a2b2 and … … and the video segment feature matrix anbn are respectively input into a classifier sub-network of a preset recognition model F1, a classifier sub-network of a preset recognition model F2, … … and a classifier sub-network of a preset recognition model Fn to obtain the confidence coefficient P1, the confidence coefficient P2, … … and the confidence coefficient Pn of the video to be recognized belonging to the illegal video.
S305, fusing a plurality of confidence coefficients by utilizing a preset fusion rule to obtain a target confidence coefficient.
Optionally, S305 may specifically include:
and inputting the multiple confidence degrees into a preset weighted average algorithm to obtain the target confidence degrees.
The preset weighted average algorithm may be a linear weighted average algorithm or a nonlinear weighted average algorithm.
For example, in the linear weighted average algorithm, the weight of the confidence coefficient obtained by each preset recognition model is 1, and the average value can be directly calculated based on a plurality of confidence coefficients to obtain the target confidence coefficient.
In the nonlinear weighted average algorithm, different weights can be set for the confidence obtained by each preset recognition model according to the importance degree or accuracy of each preset recognition model. For example, the confidence level P1 has a weight of 0.6, the confidence level P2 has a weight of 0.2, … …, and the confidence level Pn has a weight of 0.1. And weighting each confidence coefficient according to the set weight, and calculating an average value based on the weighted confidence coefficient to obtain the target confidence coefficient.
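A sketch of the weighted-average fusion, in plain Python; normalizing by the sum of the weights is one reading of "calculating an average value based on the weighted confidence", and the example weights follow the text above:

```python
def fuse_weighted(confidences, weights=None):
    if weights is None:                      # linear case: every weight is 1
        return sum(confidences) / len(confidences)
    assert len(weights) == len(confidences)
    # Nonlinear case: weight each confidence, then average (normalized by weight sum).
    return sum(w * c for w, c in zip(weights, confidences)) / sum(weights)

target = fuse_weighted([0.9, 0.4, 0.7], weights=[0.6, 0.2, 0.1])
```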
Or, S305 may specifically further include:
and counting the number of the same confidence degrees in the plurality of confidence degrees.
And determining the same confidence coefficient with the largest number as the target confidence coefficient.
It can be understood that the detection results of a preset recognition model have a certain error tolerance, and preset recognition models with different parameters may produce different recognition results for the same video feature matrix; the more models that agree on a result, the closer that result is to the truth for the corresponding video. Thus, the most frequent confidence value may be determined to be the target confidence.
For example, among the 10 obtained confidence degrees, the confidence degree of 0.4 is 2, the confidence degree of 0.6 is 3, the confidence degree of 0.8 is 5, and the target confidence degree is 0.8.
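A sketch of the majority-count alternative, matching the example above where 0.8 occurs five times out of ten:

```python
from collections import Counter

def fuse_majority(confidences):
    value, _count = Counter(confidences).most_common(1)[0]  # most frequent confidence
    return value

assert fuse_majority([0.4, 0.4, 0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8]) == 0.8
```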
S306, if the target confidence meets the preset violation condition, determining that the video is violating.
The preset violation condition may specifically be that the target confidence belongs to a preset confidence interval, or that the target confidence is not less than a preset confidence threshold. The preset confidence interval and the preset confidence threshold are determined when the training of the preset classification algorithm is completed.
Corresponding to the above method embodiment, the embodiment of the invention also provides a device for identifying the illegal video.
As shown in fig. 4, the structure of the device for identifying offensive video according to an embodiment of the present invention may include:
The recognition module 401 is configured to recognize a video to be recognized based on a preset recognition model, so as to obtain a recognition result; wherein the recognition model comprises a first recognition model and a second recognition model; the identification result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
a determining module 402, configured to determine that the video is violating when at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model by utilizing initial sample data in advance, and the second recognition model is obtained by training the first recognition model by utilizing updated data of the initial sample data in advance.
According to the apparatus for identifying illegal videos provided by the embodiment of the invention, the preset recognition model comprises the first recognition model and the second recognition model: the first recognition model preserves what was learned from the initial sample data, while the second recognition model, trained on the updated data, recognizes the new offending videos introduced by that data. New offending videos can thus be recognized without forgetting the original offending videos corresponding to the initial sample data, so offending videos are identified stably and reliably, unaffected by long-term updating and evolution, and the probability of missed detection is reduced.
Optionally, the identification module 401 in the embodiment of fig. 4 of the present invention may include:
the image acquisition sub-module is used for acquiring a plurality of image frames from the video to be identified;
the characteristic extraction sub-module is used for respectively extracting the characteristics of the plurality of image frames to obtain an image frame characteristic matrix of each image frame;
the splicing sub-module is used for splicing the image frame feature matrixes to obtain a video segment feature matrix;
and the identification sub-module is used for identifying the video segment feature matrix to determine whether the video is illegal.
Optionally, the feature extraction submodule in the above embodiment may be specifically configured to:
and respectively carrying out feature extraction on the plurality of image frames based on a feature extraction sub-network of a preset identification model to obtain an image frame feature matrix of each image frame.
Optionally, the identifying sub-module in the foregoing embodiment may be specifically configured to:
and identifying the video segment feature matrix by using a classifier subnetwork based on a preset identification model so as to determine whether the video is illegal.
Optionally, the identifying sub-module in the foregoing embodiment may include:
the confidence coefficient acquisition sub-module is used for inputting the video segment feature matrix into the classifier sub-network to obtain the violation confidence coefficient of the video; and if the violation confidence meets a preset violation condition, determining the video violation.
Optionally, the confidence acquiring sub-module in the foregoing embodiment may be specifically configured to:
performing transposition processing on the video segment feature matrix to obtain a transposed video segment feature matrix;
and taking the output obtained after inputting the transposed video segment feature matrix into the preset first full-connection function as the input of the logistic regression loss function, to obtain the violation confidence of the video.
Optionally, the device for identifying the offensive video provided by the embodiment of the present invention may further include: a weight matrix acquisition sub-module and a feature vector acquisition sub-module;
the weight matrix acquisition sub-module is used for, after the confidence coefficient acquisition sub-module performs transposition processing on the video segment feature matrix to obtain the transposed video segment feature matrix, inputting the transposed video segment feature matrix into an attention mechanism sub-network of a preset recognition model to obtain a space-time response weight matrix;
the feature vector acquisition sub-module is used for weighting the transposed video segment feature matrix by utilizing the space-time response weight matrix to obtain a video feature vector;
correspondingly, the confidence coefficient obtaining sub-module can be specifically used for:
and inputting the video feature vector into the preset first full-connection function to obtain an output, and taking the output as the input of the logistic regression function to obtain the violation confidence of the video.
Optionally, the weight matrix acquiring submodule in the above embodiment is specifically configured to:
transposing the transposed video segment feature matrix, then performing dimension reduction and deformation, to obtain a dimension-reduced video segment feature matrix;
inputting the feature matrix of the dimension-reduced video segment into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
and carrying out deformation recovery on the response weight matrix to obtain a space-time response weight matrix. A short sketch of this attention sub-network follows.
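The sketch below gives one plausible reading of this attention sub-network, assuming a 1x1 convolution for the dimension reduction and a sigmoid as the preset activation function; both choices, and all shapes, are assumptions made for the example.

```python
import torch
import torch.nn as nn

T, D, H, W = 8, 256, 7, 7                # assumed shapes only
feats = torch.randn(T, D, H, W)          # transposed video segment features, per frame

reduce = nn.Conv2d(D, 1, kernel_size=1)  # dimension reduction (assumed 1x1 convolution)
fc2 = nn.Linear(T * H * W, T * H * W)    # the "second full-connection function"

x = reduce(feats).reshape(1, -1)         # dimension-reduced and deformed (flattened)
w = torch.sigmoid(fc2(x))                # preset activation -> response weight matrix
attn = w.reshape(T, 1, H, W)             # deformation recovery: space-time response weights

video_vec = (feats * attn).sum(dim=(0, 2, 3))  # weighting -> video feature vector
```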
Optionally, the device for identifying the offensive video provided by the embodiment of the present invention may further include: an illegal region labeling module, used for labeling regions with high violation confidence in the video by the following steps:
after the recognition sub-module determines that the video is in violation, performing normalization processing on the space-time response weight matrix to obtain a violation response value for each preset rectangular area of each image frame corresponding to the space-time response weight matrix;
judging whether the violation response value is larger than a preset violation threshold or not according to each violation response value;
and if the violation response value is larger than the preset violation threshold, outputting coordinate information of the preset rectangular area corresponding to that violation response value. A short sketch of these labeling steps follows.
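As an illustration of the labeling steps, the sketch below normalizes the space-time response weights and maps each above-threshold cell to a rectangle; the grid-cell size, the min-max normalization, and the threshold are all assumptions.

```python
import torch

def label_regions(attn: torch.Tensor, thresh: float = 0.6, cell: int = 32):
    """attn: (T, H, W) space-time response weights; returns flagged rectangles."""
    # min-max normalization (the patent's normalization scheme is not specified)
    norm = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    boxes = []
    for t, h, w in torch.nonzero(norm > thresh):  # violation response > threshold
        # each (h, w) grid cell is taken to correspond to a preset rectangular area
        boxes.append((t.item(), w.item() * cell, h.item() * cell,
                      (w.item() + 1) * cell, (h.item() + 1) * cell))
    return boxes  # (frame index, x1, y1, x2, y2) for each flagged area
```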
Optionally, there may be a plurality of preset recognition models;
the feature extraction submodule is specifically configured to:
inputting the plurality of image frames into each preset recognition model for feature extraction to obtain a plurality of image frame feature matrixes of each image frame;
the splicing submodule is specifically configured to:
splicing, among the obtained multiple image frame feature matrixes, the image frame feature matrixes extracted by the same preset recognition model, to obtain a video segment feature matrix corresponding to that preset recognition model;
the confidence coefficient obtaining sub-module is specifically configured to:
respectively inputting the obtained multiple video segment feature matrixes into a classifier sub-network of a preset recognition model corresponding to the video segment feature matrixes to obtain multiple violation confidences of the video;
fusing the plurality of violation confidences by utilizing a preset fusion rule to obtain a target confidence;
and if the target confidence meets a preset recognition condition, determining that the video is in violation.
Optionally, the confidence coefficient obtaining sub-module is specifically configured to:
and inputting the plurality of violation confidences into a preset weighted average algorithm to obtain the target confidence.
Optionally, the confidence coefficient obtaining sub-module is specifically configured to:
counting the number of occurrences of each identical violation confidence among the plurality of violation confidences;
and determining the identical violation confidence with the largest count as the target confidence. Short sketches of both fusion rules follow.
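The two fusion rules just described (weighted averaging, and picking the most frequent confidence) could be sketched as follows; the weights and example values are invented.

```python
from collections import Counter
from typing import Sequence

def fuse_weighted(confs: Sequence[float], weights: Sequence[float]) -> float:
    """Preset weighted-average fusion rule (the weights are assumed/tunable)."""
    return sum(c * w for c, w in zip(confs, weights)) / sum(weights)

def fuse_majority(confs: Sequence[float]) -> float:
    """Take the violation confidence that occurs most often across the models."""
    return Counter(confs).most_common(1)[0][0]

# e.g. three preset recognition models scoring one clip:
target = fuse_majority([0.91, 0.91, 0.35])  # -> 0.91
```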
The embodiment of the invention also provides an electronic device, as shown in fig. 5, which comprises a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 communicate with each other through the communication bus 504;
the memory 503 is used for storing a computer program;
the processor 501 is configured to implement all the steps of the method for identifying offensive video when executing the computer program stored in the memory 503.
In the electronic device provided by the embodiment of the present invention, the preset recognition model comprises the first recognition model and the second recognition model. While the first recognition model memorizes the initial sample data, the second recognition model, obtained by training on update data of the initial sample data, can recognize new offending videos corresponding to that update data; new offending videos can thus be recognized without the original offending videos corresponding to the initial sample data being forgotten. Offending video can therefore be identified stably and reliably, unaffected by long-term update and evolution, and the probability of missed detection is reduced.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present invention also provides a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements all the steps of the above-described method for identifying offensive video.
The embodiment of the present invention further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the preset recognition model comprises the first recognition model and the second recognition model, so that the first recognition model memorizes the initial sample data while the second recognition model, obtained by training on update data of the initial sample data, recognizes new offending videos corresponding to that update data; new offending videos can thus be recognized without the original offending videos corresponding to the initial sample data being forgotten. Offending video can therefore be identified stably and reliably, unaffected by long-term update and evolution, and the probability of missed detection is reduced.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by the computer, or a data storage device, such as a server or data center, integrating one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., Digital Versatile Disc (DVD)), or semiconductor media (e.g., Solid State Disk (SSD)).
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the others. In particular, since the apparatus and electronic device embodiments are substantially similar to the method embodiments, their description is relatively simple; for relevant parts, reference may be made to the description of the method embodiments.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (27)

1. A method of identifying offensive video, the method comprising:
identifying the video to be identified based on a preset identification model to obtain an identification result; wherein the recognition model comprises a first recognition model and a second recognition model; the identification result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
if at least one of the first recognition result and the second recognition result indicates a violation, determining that the video to be recognized is in violation;
the first recognition model is obtained by training an initial convolutional neural network model in advance with initial sample data, and the second recognition model is obtained by further training the first recognition model in advance with update data of the initial sample data, the update data of the initial sample data being: a sample data set that, when a new offending video appears, updates the initial sample data to include the new offending video.
2. The method according to claim 1, wherein the identifying the video to be identified based on the preset identification model to obtain the identification result includes:
acquiring a plurality of image frames from a video to be identified;
respectively extracting the characteristics of the plurality of image frames to obtain an image frame characteristic matrix of each image frame;
splicing a plurality of image frame feature matrixes to obtain a video segment feature matrix;
and identifying the video segment feature matrix to determine whether the video to be identified is illegal.
3. The method according to claim 2, wherein the performing feature extraction on the plurality of image frames to obtain an image frame feature matrix of each image frame includes:
and respectively carrying out feature extraction on the plurality of image frames based on a feature extraction sub-network of a preset identification model to obtain an image frame feature matrix of each image frame.
4. The method of claim 2, wherein the identifying the video segment feature matrix to determine whether the video to be identified is in violation comprises:
and identifying the video segment feature matrix by using a classifier subnetwork based on a preset identification model so as to determine whether the video to be identified is illegal.
5. The method of claim 4, wherein the identifying the video segment feature matrix by using a classifier sub-network of a preset recognition model to determine whether the video to be recognized is in violation comprises:
inputting the video segment feature matrix into the classifier sub-network to obtain the illegal confidence of the video to be identified;
and if the violation confidence meets a preset violation condition, determining that the video to be identified is in violation.
6. The method of claim 5, wherein inputting the video segment feature matrix into the classifier subnetwork results in the offending confidence of the video to be identified, comprising:
performing transposition processing on the video segment feature matrix to obtain a transposed video segment feature matrix;
and inputting the transposed video segment feature matrix into a preset first full-connection function to obtain an output, and taking the output as the input of a logistic regression function to obtain the violation confidence of the video to be identified.
7. The method of claim 6, further comprising, after the step of transposing the video segment feature matrix to obtain a transposed video segment feature matrix:
inputting the transposed video segment feature matrix into an attention mechanism sub-network of a preset recognition model to obtain a space-time response weight matrix;
weighting the transposed video segment feature matrix by using the space-time response weight matrix to obtain a video feature vector;
the inputting the transposed video segment feature matrix into the preset first full-connection function to obtain an output, and taking the output as the input of the logistic regression function to obtain the violation confidence of the video to be identified, comprises:
and inputting the video feature vector into the preset first full-connection function to obtain an output, and taking the output as the input of the logistic regression function to obtain the violation confidence of the video to be identified.
8. The method of claim 7, wherein the inputting the transposed video segment feature matrix into an attention mechanism sub-network of a preset recognition model to obtain a space-time response weight matrix comprises:
transposing the transposed video segment feature matrix, then performing dimension reduction and deformation, to obtain a dimension-reduced video segment feature matrix;
inputting the feature matrix of the dimension-reduced video segment into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
And carrying out deformation recovery on the response weight matrix to obtain a space-time response weight matrix.
9. The method of claim 8, further comprising, after it is determined that the video to be identified is in violation, marking regions of high violation confidence in the video to be identified with the following steps:
normalizing the space-time response weight matrix to obtain a violation response value for each preset rectangular area of each image frame corresponding to the space-time response weight matrix;
judging whether the violation response value is larger than a preset violation threshold or not according to each violation response value;
and if the violation response value is larger than the preset violation threshold, outputting coordinate information of the preset rectangular area corresponding to that violation response value.
10. The method according to claim 5, wherein there are a plurality of preset recognition models;
the step of extracting features of the plurality of image frames to obtain an image frame feature matrix of each image frame includes:
inputting the plurality of image frames into each preset recognition model for feature extraction to obtain a plurality of image frame feature matrixes of each image frame;
The step of splicing the image frame feature matrixes to obtain a video segment feature matrix comprises the following steps:
splicing, among the obtained multiple image frame feature matrixes, the image frame feature matrixes extracted by the same preset recognition model, to obtain a video segment feature matrix corresponding to that preset recognition model;
inputting the video segment feature matrix into a classifier sub-network of a preset recognition model to obtain the violation confidence of the video to be recognized, wherein the method comprises the following steps:
respectively inputting the obtained multiple video segment feature matrixes into a classifier sub-network of a preset recognition model corresponding to the video segment feature matrixes to obtain multiple violation confidences of the video to be recognized;
fusing the plurality of violation confidences by utilizing a preset fusion rule to obtain a target confidence coefficient;
and the determining that the video to be identified is in violation if the violation confidence meets a preset violation condition comprises:
and if the target confidence meets the preset recognition condition, determining that the video to be recognized is in violation.
11. The method of claim 10, wherein fusing the plurality of offending confidences using a preset fusion rule to obtain a target confidence comprises:
And inputting the plurality of violation confidences into a preset weighted average algorithm to obtain target confidence.
12. The method of claim 10, wherein fusing the plurality of offending confidences using a preset fusion rule to obtain a target confidence comprises:
counting the number of occurrences of each identical violation confidence among the plurality of violation confidences;
and determining the identical violation confidence with the largest count as the target confidence.
13. The method according to claim 1, wherein the preset recognition model is obtained through the following training steps:
inputting the collected multiple sample images into an initial convolutional neural network model for training to obtain prediction violation confidence of a video segment formed by the multiple sample images;
judging, by using a preset error function, whether the convolutional neural network model at the current training stage has converged, according to the obtained predicted violation confidence and the preset annotated class information on whether each sample image is in violation;
if the model is converged, determining the convolutional neural network model in the current training stage as a preset recognition model;
if it has not converged, adjusting the model parameters of the convolutional neural network model at the current training stage by using a preset gradient function and a stochastic gradient descent algorithm;
And inputting the collected multiple sample images into the adjusted convolutional neural network model, and repeating the steps of training and adjusting the model parameters until the adjusted convolutional neural network converges.
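For illustration only, a hedged sketch of the training loop of claim 13, assuming binary cross-entropy as the preset error function and a simple loss-delta convergence test; the data loader and hyper-parameters are invented for the example.

```python
import torch
import torch.nn as nn

def train_until_converged(model: nn.Module, loader, lr: float = 1e-2,
                          tol: float = 1e-4, max_epochs: int = 100) -> nn.Module:
    criterion = nn.BCEWithLogitsLoss()  # preset error function (BCE is an assumption)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # stochastic gradient descent
    prev_loss = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for clips, labels in loader:     # labels: annotated violation / non-violation
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)  # predicted vs. annotated classes
            loss.backward()              # gradients via the preset gradient function
            optimizer.step()             # adjust the model parameters
            total += loss.item()
        if abs(prev_loss - total) < tol:  # convergence test (criterion is an assumption)
            break
        prev_loss = total
    return model
```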
14. An apparatus for identifying offensive video, said apparatus comprising:
the identification module is used for identifying the video to be identified based on a preset identification model to obtain an identification result; wherein the recognition model comprises a first recognition model and a second recognition model; the identification result comprises: a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model;
the determining module is used for determining that the video to be identified is in violation when at least one of the first recognition result and the second recognition result indicates a violation;
the first recognition model is obtained by training an initial convolutional neural network model in advance with initial sample data, and the second recognition model is obtained by further training the first recognition model in advance with update data of the initial sample data, the update data of the initial sample data being: a sample data set that, when a new offending video appears, updates the initial sample data to include the new offending video.
15. The apparatus of claim 14, wherein the identification module comprises:
the image acquisition sub-module is used for acquiring a plurality of image frames from the video to be identified;
the characteristic extraction sub-module is used for respectively extracting the characteristics of the plurality of image frames to obtain an image frame characteristic matrix of each image frame;
the splicing sub-module is used for splicing the image frame feature matrixes to obtain a video segment feature matrix;
and the identification sub-module is used for identifying the video segment feature matrix to determine whether the video to be identified is illegal or not.
16. The apparatus of claim 15, wherein the feature extraction submodule is specifically configured to:
and respectively carrying out feature extraction on the plurality of image frames based on a feature extraction sub-network of a preset identification model to obtain an image frame feature matrix of each image frame.
17. The apparatus of claim 15, wherein the identification sub-module is specifically configured to:
and identifying the video segment feature matrix by using a classifier subnetwork based on a preset identification model so as to determine whether the video to be identified is illegal.
18. The apparatus of claim 17, wherein the identification sub-module comprises:
the confidence coefficient acquisition sub-module is used for inputting the video segment feature matrix into the classifier sub-network to obtain the violation confidence of the video to be identified; and if the violation confidence meets a preset violation condition, determining that the video to be identified is in violation.
19. The apparatus of claim 18, wherein the confidence acquisition sub-module is specifically configured to:
performing transposition processing on the video segment feature matrix to obtain a transposed video segment feature matrix;
and inputting the transposed video segment feature matrix into a preset first full-connection function to obtain an output, and taking the output as the input of a logistic regression function to obtain the violation confidence of the video to be identified.
20. The apparatus of claim 19, further comprising a weight matrix acquisition sub-module and a feature vector acquisition sub-module;
the weight matrix acquisition sub-module is used for, after the confidence coefficient acquisition sub-module performs transposition processing on the video segment feature matrix to obtain the transposed video segment feature matrix, inputting the transposed video segment feature matrix into an attention mechanism sub-network of a preset recognition model to obtain a space-time response weight matrix;
The feature vector acquisition sub-module is used for weighting the transposed video segment feature matrix by utilizing the space-time response weight matrix to obtain a video feature vector;
the confidence coefficient obtaining sub-module is specifically configured to:
and inputting the video feature vector into the preset first full-connection function to obtain an output, and taking the output as the input of the logistic regression function to obtain the violation confidence of the video to be identified.
21. The apparatus of claim 20, wherein the weight matrix acquisition submodule is specifically configured to:
transposing the transposed video segment feature matrix, then performing dimension reduction and deformation, to obtain a dimension-reduced video segment feature matrix;
inputting the feature matrix of the dimension-reduced video segment into a preset second full-connection function and a preset activation function to obtain a response weight matrix;
and carrying out deformation recovery on the response weight matrix to obtain a space-time response weight matrix.
22. The apparatus of claim 21, wherein the apparatus further comprises: an illegal region labeling module, used for labeling regions with high violation confidence in the video to be identified by the following steps:
after the recognition sub-module determines that the video to be identified is in violation, performing normalization processing on the space-time response weight matrix to obtain a violation response value for each preset rectangular area of each image frame corresponding to the space-time response weight matrix;
judging whether the violation response value is larger than a preset violation threshold or not according to each violation response value;
and if the violation response value is larger than the preset violation threshold, outputting coordinate information of the preset rectangular area corresponding to that violation response value.
23. The apparatus of claim 18, wherein there are a plurality of preset recognition models;
the feature extraction submodule is specifically configured to:
inputting the plurality of image frames into each preset recognition model for feature extraction to obtain a plurality of image frame feature matrixes of each image frame;
the splicing submodule is specifically configured to:
splicing, among the obtained multiple image frame feature matrixes, the image frame feature matrixes extracted by the same preset recognition model, to obtain a video segment feature matrix corresponding to that preset recognition model;
the confidence coefficient obtaining sub-module is specifically configured to:
Respectively inputting the obtained multiple video segment feature matrixes into a classifier sub-network of a preset recognition model corresponding to the video segment feature matrixes to obtain multiple violation confidences of the video to be recognized;
fusing the plurality of violation confidences by utilizing a preset fusion rule to obtain a target confidence coefficient;
and if the target confidence meets the preset recognition condition, determining that the video to be recognized is in violation.
24. The apparatus of claim 23, wherein the confidence acquisition sub-module is specifically configured to:
and inputting the plurality of violation confidences into a preset weighted average algorithm to obtain target confidence.
25. The apparatus of claim 23, wherein the confidence acquisition sub-module is specifically configured to:
counting the number of occurrences of each identical violation confidence among the plurality of violation confidences;
and determining the identical violation confidence with the largest count as the target confidence.
26. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to carry out the method steps of any one of claims 1-13 when executing the program stored in the memory.
27. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-13.
CN201811536558.6A 2018-12-14 2018-12-14 Illegal video identification method and device and electronic equipment Active CN111325067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811536558.6A CN111325067B (en) 2018-12-14 2018-12-14 Illegal video identification method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN111325067A CN111325067A (en) 2020-06-23
CN111325067B true CN111325067B (en) 2023-07-07

Family

ID=71170514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811536558.6A Active CN111325067B (en) 2018-12-14 2018-12-14 Illegal video identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111325067B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797752A (en) * 2020-06-29 2020-10-20 广州市百果园信息技术有限公司 Illegal video detection method, device, equipment and storage medium
CN114157829A (en) * 2020-09-08 2022-03-08 顺丰科技有限公司 Model training optimization method and device, computer equipment and storage medium
CN114266906A (en) * 2021-12-22 2022-04-01 工赋(青岛)科技有限公司 Method, device, medium, and program product for identifying violation data at user side

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN107818313A (en) * 2017-11-20 2018-03-20 腾讯科技(深圳)有限公司 Vivo identification method, device, storage medium and computer equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6941287B1 (en) * 1999-04-30 2005-09-06 E. I. Du Pont De Nemours And Company Distributed hierarchical evolutionary modeling and visualization of empirical data
CN104217216B (en) * 2014-09-01 2017-10-17 华为技术有限公司 Generate method and apparatus, the method and apparatus for detecting target of detection model
US9996804B2 (en) * 2015-04-10 2018-06-12 Facebook, Inc. Machine learning model tracking platform
CN107943941B (en) * 2017-11-23 2021-10-15 珠海金山网络游戏科技有限公司 Junk text recognition method and system capable of being updated iteratively
CN108230296B (en) * 2017-11-30 2023-04-07 腾讯科技(深圳)有限公司 Image feature recognition method and device, storage medium and electronic device
CN108256474A (en) * 2018-01-17 2018-07-06 百度在线网络技术(北京)有限公司 For identifying the method and apparatus of vegetable


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Comparison of construction cost estimating models based on regression analysis, neural networks, and case-based reasoning; Gwang-Hee Kim et al.; Building and Environment; full text *

Also Published As

Publication number Publication date
CN111325067A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN110969066B (en) Live video identification method and device and electronic equipment
CN112990432B (en) Target recognition model training method and device and electronic equipment
CN111325067B (en) Illegal video identification method and device and electronic equipment
CN110321845B (en) Method and device for extracting emotion packets from video and electronic equipment
CN112862093B (en) Graphic neural network training method and device
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
EP3620982B1 (en) Sample processing method and device
CN110909784A (en) Training method and device of image recognition model and electronic equipment
CN111291887A (en) Neural network training method, image recognition method, device and electronic equipment
CN112163480B (en) Behavior identification method and device
CN111178364A (en) Image identification method and device
CN110929785A (en) Data classification method and device, terminal equipment and readable storage medium
CN114549913A (en) Semantic segmentation method and device, computer equipment and storage medium
EP4343616A1 (en) Image classification method, model training method, device, storage medium, and computer program
CN112948612A (en) Human body cover generation method and device, electronic equipment and storage medium
CN115797735A (en) Target detection method, device, equipment and storage medium
CN113902944A (en) Model training and scene recognition method, device, equipment and medium
CN113283388A (en) Training method, device and equipment of living human face detection model and storage medium
CN113221721A (en) Image recognition method, device, equipment and medium
CN110969602B (en) Image definition detection method and device
CN114663731B (en) Training method and system of license plate detection model, and license plate detection method and system
CN108427957B (en) Image classification method and system
CN114596570A (en) Training method of character recognition model, character recognition method and device
CN114330542A (en) Sample mining method and device based on target detection and storage medium
CN112989869B (en) Optimization method, device, equipment and storage medium of face quality detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant