CN113297934B - Multi-mode video behavior analysis method for detecting Internet violence harmful scene - Google Patents

Multi-mode video behavior analysis method for detecting Internet violence harmful scene

Info

Publication number
CN113297934B
CN113297934B
Authority
CN
China
Prior art keywords
emotion
features
video
words
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110512224.0A
Other languages
Chinese (zh)
Other versions
CN113297934A (en)
Inventor
郭承禹
鲍泽民
潘进
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center filed Critical National Computer Network and Information Security Management Center
Priority to CN202110512224.0A priority Critical patent/CN113297934B/en
Publication of CN113297934A publication Critical patent/CN113297934A/en
Application granted granted Critical
Publication of CN113297934B publication Critical patent/CN113297934B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G06V40/176 - Dynamic expression
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-modal video behavior analysis method for detecting Internet violence harmful scenes, which mainly comprises three stages: rapid detection and localization of persons in video scenes, video scene behavior discrimination, and qualitative assessment of the harmfulness of video scenes.

Description

Multi-mode video behavior analysis method for detecting Internet violence harmful scene
Technical Field
The invention belongs to the technical field of information security, and particularly relates to a multi-mode video behavior analysis method for detecting Internet violence harmful scenes.
Background
With the development of multimedia technology, emerging, fast and diverse forms of media appear in people's daily social activities. While these emerging media bring convenience to daily life, large amounts of negative information can also spread rapidly among people by means of fast-developing network technology and widely adopted mobile intelligent terminals. How to find such negative information in time and stop its propagation at the earliest stage is a common concern of new media and network supervision departments; doing so can prevent the public from being harmed by negative information and effectively purify the network ecology.
Among massive user-generated videos, the proportion of harmful violent videos is extremely low, and this unbalanced distribution of sample categories increases the difficulty of identifying them. Current active-discovery methods for harmful videos mainly crawl information such as audio and video scenes, topics, station logos and captions under certain limiting conditions; they return large volumes of data with mostly redundant content, which increases the workload of further manual review. Moreover, research on harmful videos has mainly focused on scenes such as pornography, while research on judging the harmfulness of violent content started relatively late.
Traditional violent-video detection methods mainly target the audio and image features of the video and use visual bag-of-words models and pooling techniques to construct optimized video content representations, but they remain limited to scene-level modal features of the video. Information at the high-level semantic layer is still difficult to capture, so content harmful to the public cannot be distinguished from film-and-television or educational program content. In addition, as a new characteristic and core function of interaction among users in new media, the comment information of a video can effectively assist in screening and judging its content. The invention therefore introduces the emotional features of the persons in the video and of the video comments, establishes a multi-task learning model with multi-modal feature fusion, and integrates all features so as to maximize the benefit of each subtask and of the overall task.
Disclosure of Invention
In view of the above, the invention provides a multi-mode video behavior analysis method for detecting Internet violence harmful scenes, which can quickly and accurately discover videos with harmful scenes from massive user-generated videos.
The technical scheme for realizing the invention is as follows:
a multimode video behavior analysis method for detecting Internet violence harmful scenes comprises the following steps:
step one, detecting person targets by using apparent features and the rotation-invariant features derived from them as feature descriptors;
step two, dividing the whole human body into n regions, sequentially recombining adjacent regions to generate human-body region detection templates of different scales, and training the templates of different scales with a CNN (convolutional neural network), where the inputs of the training process are person videos with different degrees of occlusion;
step three, performing human-body target detection, the detection process being expressed abstractly as follows:
the original video x is mapped to a feature matrix M through a feature mapping function k; a component detector g computes scoring parameters s, which record the probability, derived from the apparent features, that each component is present in the detection area; the layered CNN model f trained in step two computes the visibility parameters v of each human-body component in the scene and corrects the scoring parameters s; finally, a discriminant function in the CNN network judges whether a human-body target exists in the detection area and yields the detection result y;
step four, taking action features, scene features and emotion features as the input of an LSTM (Long Short-Term Memory) recurrent neural network and target behavior words as the output, training the LSTM model to make a preliminary judgment of the target behavior in the video; videos without harmful scenes are discarded, and step five is executed for videos containing harmful scenes;
step five, labeling the basic scores of the words in the basic emotion word library to form a basic emotion word dictionary, extracting the basic emotion words from the bullet-screen (danmaku) comments of the input video, and assigning them by looking up their basic scores in the basic emotion word dictionary;
step six, dividing the emotion categories of the basic emotion word dictionary into 7 dimensions (such as happiness, anger, fear and sadness) and computing the emotion score of each dimension independently; the emotion value of each bullet-screen comment is calculated with the following formula:
S = Σ_j a_j · Q(b_j × c_j, b) + Σ_i α_i + Σ_m β_m + Σ_l ε_l
where j takes values from 1 to J, J being the total number of emotion words; b_j is the basic emotion score of the j-th emotion word, obtained by direct lookup in the basic emotion word dictionary, with value range [0, 1]; c_j ∈ {1, -1} indicates whether emotion word j is negated, i.e. whether its emotion is reversed; b is the emotion score matrix of all emotion words, emoticons, homophone words and consecutive symbols in the bullet screen; the Q function is a cross-correlation function used to calculate the correlation between the current emotion word and the emotional tendency of the other emotion words b in the bullet screen; a_j is the weighting score of the degree adverbs before and after the j-th emotion word, with value range [0, N], where N can be chosen according to actual requirements and generally does not exceed 10; α_i, β_m and ε_l are the emotion parameters of three kinds of special bullet-screen elements, namely emoticons, homophone words, and consecutive punctuation or digit symbols, where i takes values from 1 to I, m from 1 to M and l from 1 to L, and I, M, L are the numbers of occurrences of the three kinds of special elements;
step seven, after the emotion value of each bullet-screen comment is calculated, performing outlier detection with the Isolation Forest method: all bullet-screen emotion values in the same time period are clustered, bullet screens with abnormal emotion values are removed, and the emotion values of the remaining normal bullet screens are summed to obtain the emotion parameter of the whole video; this parameter is a 7-dimensional emotion category vector, whose highest-scoring dimension is the overall emotional tendency of the video and whose value is the final emotion score; when emotions of fear or disgust appear for more than 1/4 of the video duration, the video is flagged and pushed for further review.
Further, in step one, YUV features and HOG features are selected when constructing the apparent features; when constructing the rotation-invariant features, a polar-coordinate representation is used to transform the image features from the Cartesian coordinate system to a polar coordinate system, preserving the spatial invariance of the features.
Further, n=10.
In step four, the action features are optical flow features and the scene features are DeCAF features; among the emotion features, PCA (Principal Component Analysis) features are used for global facial expression recognition, while facial action coding analysis features are used as the local features.
In the fifth step, the words which are not recorded in the basic emotion word dictionary are manually marked and then added into the basic emotion word dictionary.
The beneficial effects are that:
1. At present, detection of negative information on the Internet cannot be solved by traditional scene content detection or recognition methods, because judging whether a piece of Internet information has a negative influence on society involves complex judgment dimensions: most such information cannot be judged from shallow semantic features alone and is highly correlated with the emotions of both the sender and the audience. The method of the invention uses the scene information of the video on the one hand, and on the other hand builds high-level semantic information, such as the emotion conveyed by the video content and the true emotion expressed by the audience, to judge whether a video is a harmful violent video, with accuracy superior to traditional methods.
2. Traditional scene person-detection methods are poorly suited to complex scenes and are difficult to apply to finding harmful information in the massive videos of the Internet, where large amounts of missing information cause traditional methods to miss detections.
3. Aiming at these problems, the method proposes a bullet-screen emotion analysis method oriented to violent scenes for accurately finding the target scenes, which, compared with traditional methods, is better suited to discovering harmful scenes.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the human-body region splitting templates.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention provides a multi-modal video behavior analysis method for finding Internet violence harmful scenes, which mainly comprises three stages: rapid detection and localization of persons in video scenes, video scene behavior discrimination, and qualitative assessment of the harmfulness of video scenes.
For the rapid detection of persons in video scenes, the invention provides a fast two-dimensional human-body detection method based on layered deep learning, addressing the complexity of person behavior in Internet videos.
When selecting feature descriptors, the method performs person target detection using the apparent features (YUV features and HOG features) and their rotation-invariant counterparts as descriptors; when constructing the invariant features, a polar-coordinate representation is used to transform the image features from the Cartesian coordinate system to a polar coordinate system, preserving the spatial invariance of the features.
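As an illustration only (the patent itself contains no code), the following sketch shows one way such descriptors could be computed with OpenCV; the patch sizes, the default HOG parameters and the function names are assumptions, not values taken from the patent.

    # Illustrative sketch: YUV + HOG apparent features and a polar-coordinate
    # ("rotation-invariant") representation. All sizes and parameters are assumed.
    import cv2
    import numpy as np

    def apparent_features(patch_bgr):
        """YUV colour statistics plus a HOG descriptor for one candidate region."""
        patch = cv2.resize(patch_bgr, (64, 128))                    # default HOG window size
        yuv = cv2.cvtColor(patch, cv2.COLOR_BGR2YUV)
        yuv_stats = np.concatenate([yuv.mean(axis=(0, 1)), yuv.std(axis=(0, 1))])
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
        hog_vec = cv2.HOGDescriptor().compute(gray).ravel()         # 3780-dim HOG descriptor
        return np.concatenate([yuv_stats, hog_vec])

    def polar_representation(patch_bgr, size=128):
        """Map the patch from Cartesian to polar coordinates, so that in-plane
        rotations become shifts along the angular axis."""
        patch = cv2.resize(patch_bgr, (size, size))
        center = (size / 2.0, size / 2.0)
        return cv2.warpPolar(patch, (size, size), center, size / 2.0,
                             cv2.WARP_POLAR_LINEAR)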
Based on these features, a convolutional neural network is selected to perform category calibration on the image features of the whole region. To avoid false and missed detections caused by occlusion of human body parts in complex scenes, the method divides the whole human body into 10 regions according to the human body structure and, as shown in Fig. 2, sequentially recombines adjacent regions to generate human-body region detection templates of different scales. By layering the region templates of different scales according to their containment relations, the human-body regions contained in each template are detected and analyzed layer by layer. Finally, context information is propagated through the mutual containment relations between the layer templates to correct misjudgments of the part detectors, which increases the detection rate for partially occluded human bodies: whether each part of the human body is visible in the scene is judged, and the apparent features obtained in the apparent model are corrected according to this visibility.
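Purely to illustrate the idea of recombining adjacent regions into multi-scale templates, a minimal sketch follows; the actual partition used by the patent is the one shown in Fig. 2, whereas a simple top-to-bottom ordering of the n regions is assumed here.

    # Illustrative sketch: recombine n adjacent body regions into detection
    # templates of increasing scale (one layer per scale). The concrete body
    # partition is an assumption; the patent uses the split shown in Fig. 2.
    def region_templates(n=10):
        """Each template is a tuple of adjacent region indices (0 = top ... n-1 = bottom)."""
        templates = []
        for scale in range(1, n + 1):            # number of regions covered by the template
            for start in range(n - scale + 1):
                templates.append(tuple(range(start, start + scale)))
        return templates

    # region_templates(10) yields 55 templates, from single regions up to the
    # whole body; templates of the same scale form one layer of the hierarchy.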
The general human detection process can be expressed abstractly as
M = k(x), s = g(M), y = ŷ(s),
where x is the input image to be detected, k is the feature mapping function, M is the feature map obtained after feature extraction and learning, g is the component detector, and s are the scoring parameters recording the probability, derived from the apparent features, that each component is present in the detection area; the discriminant function ŷ judges from the scoring parameters of the individual body parts whether a whole human-body target exists in the detection area and finally yields the detection result y.
In this process, a scoring parameter s_i ∈ s produced by the component detector represents the probability that a certain region of the apparent feature map M, obtained through the feature mapping function k, is detected as component i. However, directly using the scoring parameters to decide on the target can cause errors due to cluttered background, occlusion and similar factors, so a further parameter v is introduced to measure how likely each region of the human body is occluded in the original image. This parameter is defined as the visibility parameter, and the detection process is modified to
v = f(s), y = ŷ(s, v).
The objective function of the model can then be written with a probability distribution as
p(y, v | s) = p(y | v, s) · p(v | s),
where p(y | v, s) corresponds to the discriminant function ŷ and p(v | s) corresponds to the visibility-coefficient estimation function f. The discriminant function, which originally judged the probability of a detected target directly from the scoring parameters s, is thus corrected to depend on both the visibility parameters v and the scoring parameters s. The main problem in solving for the human-body detection result y therefore lies in computing the visibility parameter v for each layer of human-body region template in the layered model (MLMM) and its expected value; the mapping between s and v is described with a restricted Boltzmann machine and is not detailed further here. Because every template produces its own score, when a person in the scene is occluded or only partially visible, the template matching the visible parts scores higher than the other templates, so target persons in complex scenes can still be detected accurately.
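A minimal numerical sketch of this visibility correction is given below, purely for illustration: the trained RBM-based mapping f between s and v is replaced by an assumed logistic squashing, and the fusion rule and decision threshold are likewise assumptions rather than the patent's trained model.

    # Illustrative sketch of visibility-corrected part scoring. The real model
    # learns f (an RBM-based mapping) and the discriminant function; here both
    # are replaced by simple assumed stand-ins.
    import numpy as np

    def visibility(s):
        """Stand-in for f: estimate how likely each body part is visible
        from its raw part-detector score."""
        return 1.0 / (1.0 + np.exp(-4.0 * (s - 0.5)))       # assumed squashing

    def detect_person(part_scores, threshold=0.5):
        """Correct per-part scores by their estimated visibility and fuse them
        into a single detection decision y for the candidate region."""
        s = np.asarray(part_scores, dtype=float)
        v = visibility(s)
        fused = float(np.sum(v * s) / (np.sum(v) + 1e-8))    # visible parts dominate
        return fused > threshold, fused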
For video scene behavior discrimination, the method introduces the emotional features of the persons in the video and uses an LSTM (Long Short-Term Memory) recurrent neural network for scene recognition over three groups of features: low-level action features, scene features and the emotional features of the persons. The action features use optical flow, taking the trajectories between consecutive frames as motion features; the scene features use DeCAF features, detecting whether the target video contains specific scene objects associated with predefined violent scenes; the emotional features use global PCA (Principal Component Analysis) features of facial expressions together with local facial action coding analysis features. Taking these three groups of features as input and target behavior words as output, an LSTM model is trained to make a preliminary judgment of the target behavior: videos without harmful scenes are discarded, and the degree of harm is then assessed for videos containing harmful scenes.
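The patent does not specify network dimensions; the following PyTorch sketch only illustrates the overall shape of such an LSTM behavior classifier over fused per-frame features, with the feature dimension, hidden size and number of behavior words chosen arbitrarily.

    # Illustrative sketch: LSTM over per-frame fused features (optical flow +
    # DeCAF scene features + PCA / facial-action emotion features).
    # All dimensions below are assumptions, not values from the patent.
    import torch
    import torch.nn as nn

    class BehaviorLSTM(nn.Module):
        def __init__(self, feat_dim=512, hidden_dim=256, num_behavior_words=10):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_behavior_words)

        def forward(self, frame_features):
            # frame_features: (batch, num_frames, feat_dim), one fused vector per frame
            _, (h_n, _) = self.lstm(frame_features)
            return self.classifier(h_n[-1])          # logits over behavior words

    model = BehaviorLSTM()
    logits = model(torch.randn(2, 30, 512))          # e.g. 2 clips of 30 frames each

Videos whose predicted behavior word does not correspond to a harmful scene would be discarded at this point; the remaining videos proceed to the bullet-screen emotion analysis described next.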
For harmfulness assessment, the method provides a bullet-screen (danmaku) comment emotion evaluation method for violent scenes. For an input bullet-screen comment, punctuation is first removed with a word-filtering step, the Chinese words in the comment are extracted with a Chinese word lexicon, and background words that occur at high frequency but are unrelated to opinions about the video are then removed with a latent Dirichlet allocation model used for background removal. The remaining words are the basic emotion words, whose initial scores are assigned by lookup in the basic emotion word dictionary. Words not recorded in the dictionary are manually labeled and then added to the dictionary.
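As a rough illustration of this pre-processing pipeline (not the patent's implementation), the sketch below uses the jieba segmenter; the background-word set and the tiny dictionary are placeholders standing in for the LDA-based background removal and the full basic emotion word dictionary.

    # Illustrative sketch: strip punctuation, segment the Chinese bullet-screen
    # text, drop assumed background words, and look up base scores of the
    # remaining emotion words. Dictionary contents are placeholders.
    import re
    import jieba

    BACKGROUND_WORDS = {"视频", "主播"}              # assumed high-frequency background words
    BASE_EMOTION_DICT = {"害怕": 0.8, "恶心": 0.7}   # assumed base scores in [0, 1]

    def extract_emotion_words(danmaku_text):
        text = re.sub(r"[^\w]+", "", danmaku_text)   # remove punctuation and symbols
        words = [w for w in jieba.lcut(text) if w not in BACKGROUND_WORDS]
        return [(w, BASE_EMOTION_DICT[w]) for w in words if w in BASE_EMOTION_DICT]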
For the selection of emotion words, the emotion categories are divided into 7 dimensions (such as happiness, anger, fear and sadness) according to the HowNet emotion lexicon, and an emotion score is computed independently for each dimension. When computing the emotion value of each bullet-screen comment, the method judges the true emotion of the words with multi-dimensional features designed around the characteristics of network bullet screens: S = Σ_j a_j · Q(b_j × c_j, b) + Σ_i α_i + Σ_m β_m + Σ_l ε_l, where j takes values from 1 to J and J is the total number of emotion words; b_j is the basic emotion score of the j-th emotion word, obtained by direct lookup in the basic emotion word dictionary, with value range [0, 1]; c_j ∈ {1, -1} indicates whether emotion word j is negated, i.e. whether its emotion is reversed; b is the emotion score matrix of all emotion words, emoticons, homophone words and consecutive symbols in the bullet screen; the Q function is a cross-correlation function measuring the correlation between the current emotion word and the emotional tendency of the other emotion words b in the bullet screen, computed with chi-square and t-tests; the smaller the correlation, the more likely the bullet screen carries a reversed emotion, so the test statistic is used as a weight to reduce the emotion score of the word; a_j is the weighting score of the degree adverbs before and after the j-th emotion word, with value range [0, N], where N can be chosen according to actual requirements and generally does not exceed 10; α_i, β_m and ε_l are the emotion parameters of three kinds of special bullet-screen elements, namely emoticons, homophone words, and consecutive punctuation or digit symbols, where i takes values from 1 to I, m from 1 to M and l from 1 to L, and I, M, L are the numbers of occurrences of the three kinds of special elements.
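A direct reading of this scoring formula can be sketched as follows; the cross-correlation Q, which the patent computes from chi-square and t-test statistics, is represented here simply as a per-word weight q in [0, 1] assumed to have been computed beforehand.

    # Illustrative sketch of the per-bullet-screen emotion score
    #   S = Σ_j a_j * Q(b_j * c_j, b) + Σ_i α_i + Σ_m β_m + Σ_l ε_l
    # Each emotion word is a dict with keys: b (base score), c (+1/-1 negation),
    # a (degree-adverb weight), q (precomputed correlation weight standing in for Q).
    def danmaku_score(emotion_words, emoticon_params, homophone_params, symbol_params):
        s = sum(w["a"] * w["q"] * w["b"] * w["c"] for w in emotion_words)
        s += sum(emoticon_params)     # α_i : emoticon contributions
        s += sum(homophone_params)    # β_m : homophone-word contributions
        s += sum(symbol_params)       # ε_l : runs of punctuation / digit symbols
        return s

    # e.g. danmaku_score([{"b": 0.8, "c": -1, "a": 2.0, "q": 0.5}], [0.1], [], [0.05])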
To reduce the influence of individual opinions on the emotion of a video's overall bullet screen, after the emotion parameter of each bullet screen is computed, the method performs outlier detection with the Isolation Forest method: all bullet-screen emotion parameters in the same time period are clustered, abnormal bullet screens are removed from the emotion clusters to reduce their influence on the overall video emotion parameters, and the emotion values of the remaining normal bullet screens are then summed to obtain the emotion parameter of the whole program. This parameter is a 7-dimensional emotion category vector; the dimension with the highest score is the overall emotional tendency of the video and its value is the final emotion score. When emotions of fear or disgust appear for more than 1/4 of the video duration, the video is flagged and pushed for further review.
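A minimal sketch of this aggregation step with scikit-learn's IsolationForest is shown below; the contamination rate and the per-window grouping are assumptions, since the patent only states that abnormal bullet-screen emotion values within the same time period are removed before summation.

    # Illustrative sketch: remove outlier bullet-screen emotion vectors in one
    # time window with Isolation Forest, then sum the remaining ones.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    def window_emotion_vector(danmaku_vectors, contamination=0.1):
        """danmaku_vectors: (num_danmaku, 7) per-comment emotion scores for one
        time window; returns the window's 7-dimensional emotion parameter vector."""
        x = np.asarray(danmaku_vectors, dtype=float)
        labels = IsolationForest(contamination=contamination,
                                 random_state=0).fit_predict(x)   # -1 marks outliers
        return x[labels == 1].sum(axis=0)

    # Summing the window vectors gives the video-level vector; its arg-max
    # dimension is the overall emotional tendency and that value the final score.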
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A multi-modal video behavior analysis method for detecting Internet violence harmful scenes, characterized by comprising the following steps:
step one, detecting human-body targets by using apparent features and the rotation-invariant features derived from them as feature descriptors;
step two, dividing the whole human body into n regions, sequentially recombining adjacent regions to generate human-body region detection templates of different scales, and training the templates of different scales with a CNN (convolutional neural network), where the inputs of the training process are person videos with different degrees of occlusion;
step three, performing human-body target detection: the original video x is mapped to a feature matrix M through a feature mapping function k; a component detector g computes scoring parameters s, which record the probability, derived from the apparent features, that each component is present in the detection area; the layered CNN model f trained in step two computes the visibility parameters v of each human-body component in the scene and corrects the scoring parameters s; finally, a discriminant function in the CNN network judges whether a human-body target exists in the detection area and yields the detection result y;
step four, taking action features, scene features and emotion features as the input of an LSTM recurrent neural network and target behavior words as the output, training the LSTM model to make a preliminary judgment of the target behavior in the video; videos without harmful scenes are discarded, and step five is executed for videos containing harmful scenes; the action features are optical flow features, the scene features are DeCAF features, and among the emotion features, PCA features are used for global facial expression recognition while facial action coding analysis features are used for local features;
step five, labeling the basic scores of the words in the basic emotion word library to form a basic emotion word dictionary, extracting the basic emotion words from the bullet-screen comments of the input video, and assigning them by looking up their basic scores in the basic emotion word dictionary;
step six, dividing the emotion categories of the basic emotion word dictionary into 7 dimensions (such as happiness, anger, fear and sadness) and computing the emotion score of each dimension independently; the emotion value of each bullet-screen comment is calculated with the following formula:
S = Σ_j a_j · Q(b_j × c_j, b) + Σ_i α_i + Σ_m β_m + Σ_l ε_l
where j takes values from 1 to J, J being the total number of emotion words; b_j is the basic emotion score of the j-th emotion word, obtained by direct lookup in the basic emotion word dictionary, with value range [0, 1]; c_j ∈ {1, -1} indicates whether emotion word j is negated, i.e. whether its emotion is reversed; b is the emotion score matrix of all emotion words, emoticons, homophone words and consecutive symbols in the bullet screen; the Q function is a cross-correlation function used to calculate the correlation between the current emotion word and the emotional tendency of the other emotion words b in the bullet screen; a_j is the weighting score of the degree adverbs before and after the j-th emotion word, with value range [0, N], where N can be chosen according to actual requirements and generally does not exceed 10; α_i, β_m and ε_l are the emotion parameters of three kinds of special bullet-screen elements, namely emoticons, homophone words, and consecutive punctuation or digit symbols, where i takes values from 1 to I, m from 1 to M and l from 1 to L, and I, M, L are the numbers of occurrences of the three kinds of special elements;
step seven, after the emotion value of each bullet-screen comment is calculated, performing outlier detection with the Isolation Forest method: all bullet-screen emotion values in the same time period are clustered, bullet screens with abnormal emotion values are removed, and the emotion values of the remaining normal bullet screens are summed to obtain the emotion parameter of the whole video; this parameter is a 7-dimensional emotion category vector, whose highest-scoring dimension is the overall emotional tendency of the video and whose value is the final emotion score; when emotions of fear or disgust appear for more than 1/4 of the video duration, the video is flagged and pushed for further review.
2. The multi-modal video behavior analysis method for detecting Internet violence harmful scenes according to claim 1, wherein in step one, YUV features and HOG features are selected when constructing the apparent features; when constructing the rotation-invariant features, a polar-coordinate representation is used to transform the image features from the Cartesian coordinate system to a polar coordinate system, preserving the spatial invariance of the features.
3. The multi-modal video behavior analysis method for detecting Internet violence harmful scenes according to claim 1, wherein n = 10.
4. The multi-modal video behavior analysis method for detecting Internet violence harmful scenes according to claim 1, wherein in step five, words not recorded in the basic emotion word dictionary are manually labeled and then added to the basic emotion word dictionary.
CN202110512224.0A 2021-05-11 2021-05-11 Multi-mode video behavior analysis method for detecting Internet violence harmful scene Active CN113297934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110512224.0A CN113297934B (en) 2021-05-11 2021-05-11 Multi-mode video behavior analysis method for detecting Internet violence harmful scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110512224.0A CN113297934B (en) 2021-05-11 2021-05-11 Multi-mode video behavior analysis method for detecting Internet violence harmful scene

Publications (2)

Publication Number Publication Date
CN113297934A CN113297934A (en) 2021-08-24
CN113297934B true CN113297934B (en) 2024-03-29

Family

ID=77321405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110512224.0A Active CN113297934B (en) 2021-05-11 2021-05-11 Multi-mode video behavior analysis method for detecting Internet violence harmful scene

Country Status (1)

Country Link
CN (1) CN113297934B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056560B (en) * 2023-10-12 2024-02-06 深圳市发掘科技有限公司 Automatic generation method and device of cloud menu and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015043075A1 (en) * 2013-09-29 2015-04-02 广东工业大学 Microblog-oriented emotional entity search system
CN105068988A (en) * 2015-07-21 2015-11-18 中国科学院自动化研究所 Multi-dimension multi-granularity emotion analysis method
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN110020437A (en) * 2019-04-11 2019-07-16 江南大学 The sentiment analysis and method for visualizing that a kind of video and barrage combine
WO2019184054A1 (en) * 2018-03-29 2019-10-03 网宿科技股份有限公司 Method and system for processing on-screen comment information
CN110851621A (en) * 2019-10-31 2020-02-28 中国科学院自动化研究所 Method, device and storage medium for predicting video wonderful level based on knowledge graph
CN111078944A (en) * 2018-10-18 2020-04-28 中国电信股份有限公司 Video content heat prediction method and device
WO2021004481A1 (en) * 2019-07-08 2021-01-14 华为技术有限公司 Media files recommending method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-frame feature-fusion-based model for violence detection; Mujtaba Asad et al.; The Visual Computer; 2020-06-24; vol. 37; pp. 1415-1431 *

Also Published As

Publication number Publication date
CN113297934A (en) 2021-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant