CN112380999B - Detection system and method for inducible bad behavior in the live broadcast process - Google Patents
Detection system and method for inducible bad behavior in the live broadcast process
- Publication number: CN112380999B (application CN202011279463.8A / CN202011279463A)
- Authority
- CN
- China
- Prior art keywords
- video
- time
- bad behavior
- model
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/41 — Scenes; scene-specific elements in video content: higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/254 — Pattern recognition: fusion techniques of classification results, e.g. of results related to same input data
- G06N3/045 — Neural networks: combinations of networks
- G06N3/08 — Neural networks: learning methods
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/269 — Analysis of motion using gradient-based methods
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/49 — Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
- G06T2207/10016 — Image acquisition modality: video; image sequence
Abstract
The invention discloses a detection system and method for inducible bad behavior in the live broadcast process. The detection system comprises a video set processing module, a spatial feature processing module, a temporal feature processing module and a fusion module.
Description
Technical Field
The invention relates to the fields of computer-vision recognition and convolutional neural networks, and in particular to a system and method for detecting inducible bad behavior in the live broadcast process, i.e. for detecting whether inducible bad behaviors are mixed into an otherwise ordinary continuous action sequence during a live stream.
Background
With the development of information technology and the spread of intelligent hardware, especially mobile smart terminals, smartphones and handheld computers have gradually become the preferred office and entertainment devices. Online live-broadcast platforms have absorbed traditional offline social activities such as games, tea houses and talk-show theatres: a streamer who joins a platform can perform for the audience in real time and earn income from viewers' gifts. According to published figures, Chinese live-broadcast users number 433 million, 50.7% of all netizens, and in 2018 the leading platforms gained more than 2 million new streamers. The live-broadcast industry has formed a complete industrial chain combining software and hardware, and in July 2020 the human-resources authority recognized "live-stream salesperson" as one of the new occupations. The rapid growth in the number of streamers and the rapid development of the platforms have in turn driven a surge in broadcast sessions and in the platforms' total broadcast hours.
A streamer, whether to attract more viewing traffic to the broadcast room or out of personal habit, may perform inducible adverse actions on camera, such as smoking, self-harm or verbal abuse; if adolescent viewers imitate such behavior, their physical and mental health can be seriously harmed. These actions are interleaved with an ordinary continuous action sequence, vary in duration, and are therefore hard to spot. To police live content of mixed quality, traditional small platforms rely on irregular patrols by platform administrators and on viewer reports to identify and review a streamer's inducible bad behavior.
However, as broadcast sessions and durations grow rapidly, this traditional recognition approach depends heavily on patrol staff and aggravates the platform's operating burden. Manual auditing is also weak at recognizing fine details, so its accuracy on violations is low. In terms of efficiency, a reviewer must watch an entire video and repeatedly replay ambiguous segments, which is slow. Furthermore, because the platform offers streamers an appeal channel against manual misjudgments of violating content, streamers and administrators may collude so that violations slip through or penalties are evaded.
A detection method with high recognition accuracy is therefore needed to determine whether inducible bad behavior occurs during a live broadcast.
Disclosure of Invention
The invention aims to solve the low recognition speed and low accuracy of existing detection methods, and provides a detection system and method for inducible bad behavior in the live broadcast process.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
a detection system for induced adverse behavior in a live broadcast process, comprising:
the output end of the video set processing module is connected with the input end of the spatial feature processing module, and the video set processing module is used for processing video set contents, including obtaining illegal induction bad behavior video cases stored in a live broadcast platform database, and capturing real-time live broadcast contents to be stored as videos to be identified; dividing the video cases confirmed to be the induced bad behaviors, and marking each segmented video according to the rule-breaking type labels; dividing the long-duration video to be identified into a plurality of sections of short-duration videos according to an equal-length method, naming the divided short-duration videos according to a unified format, and ensuring the continuity and readability of a plurality of sectional videos;
the output end of the spatial feature processing module is respectively connected with the input ends of the temporal feature processing module and the fusion module, and the spatial feature processing module is used for carrying out video single frame interception RGB single frame images on the processed short-duration video, extracting spatial features from the intercepted RGB single frame images, inputting the spatial features into an induction bad behavior recognition model aiming at the spatial features and outputting a prediction result;
the output end of the time feature processing module is connected with the input end of the fusion module, and the time feature processing module is used for intercepting calculation between two RGB single-frame images with adjacent time sequences, obtaining instantaneous optical flow information through calculation and synthesizing an optical flow graph; extracting time features from the synthesized optical flow diagram, inputting the time features into an induced bad behavior recognition model aiming at the time features, and outputting a prediction result;
and the fusion module is used for fusing the obtained prediction result of the induction bad behavior recognition model aiming at the spatial characteristics with the prediction result of the induction bad behavior recognition model aiming at the time characteristics to obtain data fused with the spatial characteristics and the time characteristics, and classifying the fused data to obtain the prediction result of the segmented video. After the prediction is completed on all the segmented videos, the prediction results of the segmented videos are fused and calculated to obtain a final prediction result, and the final prediction result is a long-duration video identification result obtained from the live broadcast server.
Further, the spatial feature processing module comprises:
an RGB single-frame capture sub-module, whose input is connected to the output of the video set processing module and whose outputs are connected to the inputs of the spatial feature model processing sub-module and of the temporal feature processing module; the RGB single-frame capture sub-module captures RGB single-frame images from the processed short videos;
and a spatial feature model processing sub-module, whose output is connected to the input of the fusion module; it extracts spatial features from the captured RGB frames, feeds them into the inducible-bad-behavior recognition model for spatial features, and outputs a prediction result.
Further, the temporal feature processing module comprises:
an optical-flow synthesis sub-module, whose input is connected to the output of the RGB single-frame capture sub-module and whose output is connected to the input of the temporal feature model processing sub-module; it computes the instantaneous optical flow between two captured RGB frames adjacent in time and synthesizes optical-flow images;
and a temporal feature model processing sub-module, whose output is connected to the input of the fusion module; it extracts temporal features from the synthesized optical-flow images, feeds them into the inducible-bad-behavior recognition model for temporal features, and outputs a prediction result.
Further, the fusion module comprises:
a spatio-temporal feature fusion sub-module, whose inputs are connected to the outputs of the spatial and temporal feature model processing sub-modules; it fuses the spatial-feature model's prediction with the temporal-feature model's prediction into data combining both feature types, and classifies the fused data to obtain the prediction for one segment;
and a prediction-result fusion sub-module, which, once all segments have been predicted, fuses the per-segment predictions to yield the final prediction, i.e. the recognition result for the long video obtained from the live-broadcast server.
Further, the original model of both the spatial-feature and the temporal-feature recognition models is the convolutional neural network ResNet152.
In addition, the invention provides a method for detecting inducible bad behavior in the live broadcast process, using the above detection system and comprising the following steps:
step 1: extract and process the violation video cases stored in the live-broadcast platform's video database, split each selected target video into several short videos containing the inducible violation, and record the violation type label;
step 2: process the videos to obtain their RGB single-frame images and optical-flow images;
step 3: train one model on the spatial features of the RGB frames and one on the temporal features of the optical-flow images, obtaining an inducible-bad-behavior recognition model for spatial features and one for temporal features;
step 4: acquire live video segments: fetch the real-time broadcast cache from the platform server and cut it into segments of 2-3 seconds each;
step 5: repeat step 2 on the segments cut in step 4 to obtain their RGB single-frame images and optical-flow images;
step 6: randomly select an RGB frame obtained in step 5, feed it into the spatial-feature recognition model obtained in step 3, and output a prediction;
step 7: feed the optical-flow images obtained in step 5 into the temporal-feature recognition model obtained in step 3 and output a prediction;
step 8: fuse the data obtained in steps 6 and 7 by averaging the two results, output and evaluate the fused result, and obtain the prediction for that segment;
step 9: fuse the predictions of all segments of the long video; if at least one segment is predicted as "bad behavior", judge that the video under examination contains inducible bad behavior.
Further, step 2 specifically comprises:
step 2.1: obtain the RGB single-frame images of each video segment by frame extraction, extracting every frame according to the video's frame-rate characteristics;
step 2.2: perform optical-flow processing on the RGB frames, synthesizing an optical-flow image from each pair of adjacent frames;
step 2.3: organize the resulting RGB frames and optical-flow images, storing samples of the same type of inducible bad behavior together.
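Steps 2.1-2.3 reduce to plain data handling once frames are decoded; a minimal sketch follows. The actual frame decoding would typically use a library such as OpenCV (not shown here), and the function names are illustrative, not part of the patent:

```python
from collections import defaultdict

def pair_adjacent_frames(frames):
    """Step 2.2 preparation: yield (frame_t, frame_t+1) pairs; each pair
    later produces one optical-flow image.  `frames` is any ordered
    sequence of decoded RGB frames (e.g. read with cv2.VideoCapture)."""
    return list(zip(frames, frames[1:]))

def group_by_label(samples):
    """Step 2.3: store samples of the same violation type together.
    `samples` is a list of (label, sample_id) tuples."""
    groups = defaultdict(list)
    for label, sample_id in samples:
        groups[label].append(sample_id)
    return dict(groups)
```

For a clip of N frames, `pair_adjacent_frames` yields N-1 pairs, matching the one-flow-image-per-adjacent-pair rule of step 2.2.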
Further, step 3 specifically comprises:
step 3.1: load the convolutional neural network ResNet152 pre-trained on the ImageNet training set;
step 3.2: fine-tune the ResNet152 model with the processed RGB frames and the marked video labels, continually adjusting the training parameters and updating the model until it reaches its best recognition accuracy, and save the resulting spatial-feature recognition model;
step 3.3: fine-tune the ResNet152 model with the processed optical-flow images and the video labels, adjusting the training parameters and updating the model until it reaches its best recognition accuracy, and save the resulting temporal-feature recognition model.
Compared with the prior art, the invention divides a long video into several short segments, fuses spatio-temporal features, and fuses the recognition results of the segments, ensuring that inducible bad behaviors mixed into an ordinary continuous action sequence are recognized promptly and effectively and greatly improving recognition accuracy in complex conditions.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a system block diagram according to an embodiment of the present invention;
FIG. 3 is a flowchart of a spatial stream training method according to an embodiment of the present invention;
fig. 4 is a flowchart of a time-stream training method according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the following examples. The specific embodiments described here serve only to illustrate the invention and are not intended to limit it.
To meet the supervision needs of live-broadcast platforms facing ever more broadcasts, a detection service for inducible bad behavior in live video is required that recognizes faster and more accurately: recognition efficiency must be raised with automated tools and dependence on manual auditing reduced. Judging live content only from the appearance features of RGB single frames yields low recognition accuracy. A video is a set of consecutive frames with a temporal dimension: besides the appearance features of each RGB frame, it carries additional temporal information, namely the motion of objects, which can be captured by the optical-flow information stored in optical-flow images. An image's optical flow can be split into an X component, the horizontal part of the displacement vector field, and a Y component, the vertical part. The optical-flow images store the motion in these two directions separately, and both can be computed from each pair of adjacent single frames.
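The X/Y decomposition can be illustrated with a toy Lucas-Kanade estimator that recovers one global displacement vector from two synthetic frames. A production system would instead compute a dense flow field, e.g. with OpenCV's Farneback or TV-L1 implementations; everything below is a simplified sketch:

```python
import numpy as np

def global_flow(prev, nxt):
    """Least-squares Lucas-Kanade for a single global (u, v) displacement:
    minimise sum((Ix*u + Iy*v + It)**2) over all pixels."""
    Iy, Ix = np.gradient(prev.astype(float))     # vertical / horizontal gradients
    It = nxt.astype(float) - prev.astype(float)  # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    u, v = np.linalg.lstsq(A, -It.ravel(), rcond=None)[0]
    return u, v                                  # X and Y flow components

# Two synthetic frames: a Gaussian blob that moves 1 pixel to the right.
ys, xs = np.mgrid[0:64, 0:64]
frame = lambda cx: np.exp(-((xs - cx) ** 2 + (ys - 32) ** 2) / (2 * 3.0 ** 2))
u, v = global_flow(frame(30), frame(31))
# u should come out close to +1 (horizontal motion), v close to 0.
```

A dense method produces one (u, v) pair per pixel; storing the u map and the v map as two separate grayscale images gives exactly the X-direction and Y-direction optical-flow images the description refers to.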
The spatio-temporal features are obtained by feeding the RGB single frames and the optical-flow images into convolutional neural networks for feature extraction, yielding the spatial features of the RGB frames and the temporal features contained in the optical-flow images. Spatio-temporal fusion considers the prediction obtained from the temporal features together with the prediction obtained from the spatial features: with the average-fusion method, the two predictions are summed and averaged, the fused value is compared with a preset threshold, and the final judgment is output.
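The average-fusion rule just described amounts to a few lines; the 0.5 threshold below is only a placeholder for the patent's unspecified preset value:

```python
def fuse_scores(spatial_score, temporal_score, threshold=0.5):
    """Average-fuse the two streams' 'bad behavior' scores (sum, then
    halve) and compare the result with a preset threshold to obtain the
    final judgment for one segment."""
    fused = (spatial_score + temporal_score) / 2.0
    return fused, fused >= threshold

fused, is_bad = fuse_scores(0.9, 0.4)  # averages to 0.65, above the threshold
```

The same function works whether the streams emit calibrated probabilities or raw softmax scores, as long as both are on the same scale.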
A video to be identified may be long and contain many actions; recognizing inducible bad behavior accurately and efficiently within such a continuous action sequence is the key problem addressed by this embodiment.
Specifically, as shown in fig. 2, this embodiment describes the detection system for inducible bad behavior in the live broadcast process in detail; its modules cooperate to carry out the detection work. The detection system of this embodiment comprises:
a video set processing module, whose output is connected to the input of the spatial feature processing module, and which processes the video set content: it obtains confirmed video cases of inducible bad behavior stored in the live-broadcast platform database and captures real-time live content to be stored as videos to be identified; it segments the confirmed violation cases and labels each segment with its violation type; and it divides each long video to be identified into several equal-length short videos, naming the segments in a uniform format so that the sequence remains contiguous and readable;
the spatial feature processing module comprises:
the input end of the RGB single-frame image intercepting sub-module is connected with the output end of the video set processing module, and the output end of the RGB single-frame image intercepting sub-module is respectively connected with the input ends of the spatial feature model processing sub-module and the temporal feature processing module; the RGB single-frame image intercepting submodule is used for intercepting an RGB single-frame image of a video single frame of the processed short-duration video;
the output end of the spatial feature model processing sub-module is connected with the input end of the fusion module, and the spatial feature model processing sub-module is used for extracting spatial features from the intercepted RGB single-frame image, inputting the extracted spatial features into an induction bad behavior recognition model aiming at the spatial features and outputting a prediction result;
the time feature processing module comprises:
the input end of the optical flow diagram synthesis submodule is connected with the output end of the RGB single-frame diagram interception submodule, the output end of the optical flow diagram synthesis submodule is connected with the input end of the time characteristic model processing submodule, and the optical flow diagram synthesis submodule is used for calculating between two frames of RGB single-frame diagrams adjacent in time sequence to obtain an instantaneous optical flow information synthesis optical flow diagram;
the output end of the time feature model processing submodule is connected with the input end of the fusion module, and the time feature model processing submodule is used for extracting time features from the synthesized optical flow graph, inputting the time features into the induction bad behavior recognition model aiming at the time features and outputting a prediction result;
the fusion module comprises:
the input end of the space-time characteristic fusion submodule is respectively connected with the output ends of the space characteristic model processing submodule and the time characteristic model processing submodule, and the space-time characteristic fusion submodule is used for fusing the obtained prediction result of the induction bad behavior recognition model aiming at the space characteristic with the prediction result of the induction bad behavior recognition model aiming at the time characteristic to obtain data fused with the space characteristic and the time characteristic, and the fused data is subjected to classification processing to obtain a prediction result of a segmented video;
the prediction result fusion submodule is used for carrying out fusion calculation on the prediction results of the plurality of segmented videos after the prediction of all the segmented videos is completed, so as to obtain a final prediction result, wherein the final prediction result is the identification result of the long-duration video obtained from the live broadcast server.
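The equal-length division and uniform naming performed by the video set processing module can be sketched as follows; actual cutting of the media would be delegated to a tool such as ffmpeg, and the naming format, default duration and function name are illustrative assumptions:

```python
def plan_segments(total_seconds, seg_seconds=2.5, stream_id="room001"):
    """Plan equal-length segments (the method uses 2-3 s clips) with
    uniform, zero-padded ordered names so the pieces stay contiguous
    and readable.  Returns (name, start, end) tuples."""
    segments, start, index = [], 0.0, 0
    while start < total_seconds:
        end = min(start + seg_seconds, total_seconds)
        segments.append((f"{stream_id}_{index:05d}.mp4", start, end))
        start, index = end, index + 1
    return segments
```

For example, `plan_segments(10.0)` yields four contiguous 2.5-second segments whose names sort in playback order, which is what keeps the segmented sequence reconstructible.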
The method for detecting inducible bad behavior in the live broadcast process using the above detection system is shown in fig. 1 and specifically comprises the following steps:
step 1: extracting and processing the violation video cases stored in the live broadcast platform video database, selecting a target video to be divided into a plurality of short-time long videos containing the violation inducibility behavior, and recording the type label of the violation inducibility behavior;
step 2: processing the video to obtain an RGB single frame image and an optical flow image of the video:
step 2.1: the method comprises the steps of obtaining RGB single-frame images of video segments, extracting frames of the video segments, and extracting all RGB single-frame images contained in the video according to video frame rate characteristics;
step 2.2: performing optical flow information processing on the RGB single-frame images, and synthesizing an optical flow image through calculation between two adjacent single-frame images;
step 2.3: processing the obtained RGB single frame image and optical flow image, and storing the same type of induced bad behaviors together;
step 3: training a model for identifying spatial features and a model for identifying temporal features by using space-time features in an RGB single-frame image and an optical flow image respectively to obtain an induced bad behavior identification model for the spatial features and an induced bad behavior identification model for the temporal features, as shown in fig. 3 and 4:
step 3.1: loading a convolutional neural network model ResNet152 pre-trained by an ImageNet training set;
step 3.2: training the ResNet152 model in a targeted manner by using the processed RGB single frame image and the marked video label, continuously adjusting training parameters, updating the model to achieve the best model identification accuracy, and storing the obtained induced bad behavior identification model aiming at the spatial characteristics, as shown in figure 3;
step 3.3: training the ResNet152 model in a targeted manner by using the processed light flow graph and the processed video tag, adjusting training parameters of the training process, updating the model to achieve the best model identification accuracy, and storing the obtained induced bad behavior identification model aiming at the time characteristic, as shown in fig. 4;
step 4: acquiring live video segments: acquiring the real-time live broadcast cache from the live broadcast platform server, and cutting the cache into multiple live video segments, each 2-3 seconds in duration;
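The 2-3 second segmentation of step 4 amounts to simple frame arithmetic. A minimal sketch follows; the `split_cache` helper and its remainder-merging policy are assumptions, since the patent does not say how a trailing fragment shorter than one segment is handled.

```python
def split_cache(total_frames: int, fps: int, seg_seconds: float = 2.5):
    """Split a live-stream cache of total_frames frames into (start, end)
    frame ranges of roughly seg_seconds each (2-3 s in the patent).

    A trailing remainder shorter than half a segment is merged into the
    last segment so no content is dropped (an assumed policy).
    """
    seg_len = int(fps * seg_seconds)
    starts = list(range(0, total_frames, seg_len))
    segments = [(s, min(s + seg_len, total_frames)) for s in starts]
    if len(segments) > 1 and segments[-1][1] - segments[-1][0] < seg_len // 2:
        last = segments.pop()
        prev = segments.pop()
        segments.append((prev[0], last[1]))   # absorb the short tail
    return segments
```

For a 25-second cache at 30 fps this yields ten 75-frame segments; each segment then goes through step 2's frame extraction and optical-flow synthesis.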
step 5: repeating the content of the step 2 aiming at the live video segment cut in the step 4 to obtain an RGB single-frame image and an optical flow image of the live video segment;
step 6: randomly selecting an RGB single frame image obtained in the step 5, putting the RGB single frame image into the induction bad behavior recognition model aiming at the space characteristics obtained in the step 3, and outputting a prediction result;
step 7: putting the optical flow diagram obtained in the step 5 into the induced bad behavior recognition model aiming at the time characteristics obtained in the step 3, and outputting a prediction result;
step 8: performing data fusion on the results obtained in step 6 and step 7: fusing the two results by the average-value fusion method, outputting the fused result, judging the fused result, and obtaining the prediction result of the video segment;
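The average-value fusion of step 8, applied to the two streams' class-probability vectors, can be sketched as follows; the function name and example label set are illustrative, not from the patent.

```python
def mean_fusion(spatial_probs, temporal_probs, labels):
    """Average-value fusion of the spatial-stream and temporal-stream
    class-probability vectors; returns the fused vector and the label
    with the highest fused score (the segment's prediction)."""
    fused = [(s + t) / 2.0 for s, t in zip(spatial_probs, temporal_probs)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, labels[best]
```

For example, a spatial prediction of [0.7, 0.2, 0.1] and a temporal prediction of [0.4, 0.5, 0.1] fuse to [0.55, 0.35, 0.1], so the first label wins even though the temporal stream alone preferred the second.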
step 9: fusing the prediction results of the multiple video segments divided from the long-duration video; if the prediction result of at least one video segment is "bad behavior", it is judged that the current video to be identified contains inductive bad behavior.
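Step 9's decision rule, flagging the long video if any segment is flagged, is a simple OR over segment predictions; the helper name is illustrative.

```python
def video_verdict(segment_predictions):
    """Fuse per-segment predictions for one long video: the video is
    flagged as containing inductive bad behavior if at least one
    segment is predicted as "bad behavior" (step 9's rule)."""
    if any(p == "bad behavior" for p in segment_predictions):
        return "bad behavior"
    return "normal"
```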
In summary, the invention divides a long video into multiple short segments and, through spatio-temporal feature fusion and fusion of the recognition results of the segmented videos, ensures that inductive bad behavior mixed into conventional continuous action sequences can be identified promptly and effectively, greatly improving recognition accuracy in complex conditions.
The technical solution of the invention is not limited to the specific embodiments described above; all technical modifications made according to the technical solution of the invention fall within the protection scope of the invention.
Claims (6)
1. A detection system for induced adverse behavior in a live broadcast process, comprising:
the output end of the video set processing module is connected with the input end of the spatial feature processing module, and the video set processing module is used for processing video set contents, including obtaining the illegal induced bad behavior video cases stored in the live broadcast platform database, and capturing real-time live broadcast content to be stored as the video to be identified; dividing the video cases confirmed to contain induced bad behaviors into segments, and marking each segmented video with its violation type label; dividing the long-duration video to be identified into multiple short-duration video segments of equal length, and naming the divided short-duration videos in a unified format to ensure the continuity and readability of the multiple segmented videos;
the output end of the spatial feature processing module is respectively connected with the input ends of the temporal feature processing module and the fusion module, and the spatial feature processing module is used for intercepting RGB single-frame images from the processed short-duration video segments, extracting spatial features from the intercepted RGB single-frame images, inputting the spatial features into the induced bad behavior recognition model aiming at the spatial features, and outputting a prediction result;
the output end of the time feature processing module is connected with the input end of the fusion module, and the time feature processing module is used for calculating between two RGB single-frame images adjacent in the intercepted time sequence to obtain instantaneous optical flow information, synthesizing an optical flow graph, extracting time features from the synthesized optical flow graph, inputting the time features into the induced bad behavior recognition model aiming at the time features, and outputting a prediction result;
the fusion module is used for fusing the obtained prediction result of the induction bad behavior recognition model aiming at the spatial features with the prediction result of the induction bad behavior recognition model aiming at the time features to obtain data fused with the spatial features and the time features, classifying the fused data to obtain a prediction result of one segmented video, and carrying out fusion calculation on the prediction results of a plurality of segmented videos after the prediction of all the segmented videos is completed to obtain a final prediction result, wherein the final prediction result is a long-duration video recognition result obtained from a live broadcast server;
the original models of the induced bad behavior recognition model aiming at the spatial characteristics and the induced bad behavior recognition model aiming at the time characteristics are convolutional neural network model ResNet152; the specific processes of the induction bad behavior recognition model aiming at the spatial characteristics and the induction bad behavior recognition model aiming at the time characteristics are as follows:
step 3.1: loading a convolutional neural network model ResNet152 pre-trained by an ImageNet training set;
step 3.2: training a ResNet152 model in a targeted manner by using the processed RGB single frame image and the marked video tag, continuously adjusting training parameters, updating the model to achieve the best model identification accuracy, and storing the obtained induced bad behavior identification model aiming at the spatial characteristics;
step 3.3: the ResNet152 model is trained in a targeted manner by using the processed optical flow graph and the marked video label, the training parameters of the training process are adjusted, the model is updated to achieve the best model identification accuracy, and the obtained induced bad behavior identification model aiming at the time characteristics is stored.
2. The system for detecting induced adverse behavior in a live broadcast procedure according to claim 1, wherein the spatial signature processing module comprises:
the input end of the RGB single-frame image intercepting sub-module is connected with the output end of the video set processing module, and the output end of the RGB single-frame image intercepting sub-module is respectively connected with the input ends of the spatial feature model processing sub-module and the temporal feature processing module; the RGB single-frame image intercepting sub-module is used for intercepting RGB single-frame images from the processed short-duration video;
the output end of the spatial feature model processing sub-module is connected with the input end of the fusion module, and the spatial feature model processing sub-module is used for extracting spatial features from the intercepted RGB single-frame image, and then inputting the spatial features into the induction bad behavior recognition model aiming at the spatial features, and outputting a prediction result.
3. The system for detecting induced adverse behavior in a live broadcast procedure according to claim 2, wherein the temporal feature processing module comprises:
the input end of the optical flow diagram synthesizing sub-module is connected with the output end of the RGB single-frame diagram intercepting sub-module, the output end of the optical flow diagram synthesizing sub-module is connected with the input end of the time characteristic model processing sub-module, and the optical flow diagram synthesizing sub-module is used for calculating between two RGB single-frame diagrams adjacent in the intercepted time sequence, obtaining instantaneous optical flow information, and synthesizing an optical flow diagram;
the output end of the time feature model processing sub-module is connected with the input end of the fusion module, and the time feature model processing sub-module is used for extracting time features from the synthesized optical flow graph, and then inputting the time features into the induction bad behavior recognition model aiming at the time features, and outputting a prediction result.
4. The system for detecting induced adverse behavior in a live procedure according to claim 3, wherein the fusion module comprises:
the input end of the space-time characteristic fusion submodule is respectively connected with the output ends of the space characteristic model processing submodule and the time characteristic model processing submodule, and the space-time characteristic fusion submodule is used for fusing the obtained prediction result of the induction bad behavior recognition model aiming at the space characteristic with the prediction result of the induction bad behavior recognition model aiming at the time characteristic to obtain data fused with the space characteristic and the time characteristic, and the fused data is subjected to classification processing to obtain a prediction result of a segmented video;
the prediction result fusion submodule is used for carrying out fusion calculation on the prediction results of the segmented videos after the prediction of all the segmented videos is completed, so as to obtain a final prediction result, wherein the final prediction result is a long-duration video identification result obtained from the live broadcast server.
5. A method for detecting induced bad behavior in a live broadcast process, characterized in that the method for detecting induced bad behavior in a live broadcast process by using the system for detecting induced bad behavior in a live broadcast process according to claim 4 comprises the following steps:
step 1: extracting and processing the violation video cases stored in the live broadcast platform video database, selecting a target video and dividing it into a plurality of short-duration video segments containing the violating inductive behavior, and recording the type label of each violating inductive behavior;
step 2: processing the video to obtain an RGB single frame image and an optical flow image of the video;
step 3: training a model for identifying spatial features and a model for identifying time features by using space-time features in an RGB single-frame image and an optical flow image respectively to obtain an induced bad behavior identification model for the spatial features and an induced bad behavior identification model for the time features; the original models of the induced bad behavior recognition model aiming at the spatial characteristics and the induced bad behavior recognition model aiming at the time characteristics are convolutional neural network model ResNet152; the specific processes of the induction bad behavior recognition model aiming at the spatial characteristics and the induction bad behavior recognition model aiming at the time characteristics are as follows:
step 3.1: loading a convolutional neural network model ResNet152 pre-trained by an ImageNet training set;
step 3.2: training a ResNet152 model in a targeted manner by using the processed RGB single frame image and the marked video tag, continuously adjusting training parameters, updating the model to achieve the best model identification accuracy, and storing the obtained induced bad behavior identification model aiming at the spatial characteristics;
step 3.3: training the ResNet152 model in a targeted manner by using the processed optical flow graph and the marked video tag, adjusting the training parameters in the training process, updating the model to achieve the best model identification accuracy, and storing the obtained induced bad behavior identification model aiming at the time characteristic;
step 4: acquiring live video segments: acquiring the real-time live broadcast cache from the live broadcast platform server, and cutting the cache into multiple live video segments, each 2-3 seconds in duration;
step 5: repeating the content of the step 2 aiming at the live video segment cut in the step 4 to obtain an RGB single-frame image and an optical flow image of the live video segment;
step 6: randomly selecting an RGB single frame image obtained in the step 5, putting the RGB single frame image into the induction bad behavior recognition model aiming at the space characteristics obtained in the step 3, and outputting a prediction result;
step 7: putting the optical flow diagram obtained in the step 5 into the induced bad behavior recognition model aiming at the time characteristics obtained in the step 3, and outputting a prediction result;
step 8: performing data fusion on the results obtained in step 6 and step 7: fusing the two results by the average-value fusion method, outputting the fused result, performing classification judgment on the fused result, and obtaining the prediction result of the video segment;
step 9: fusing the prediction results of the multiple video segments divided from the long-duration video; if the prediction result of at least one video segment is "bad behavior", it is judged that the current video to be identified contains inductive bad behavior.
6. The method for detecting induced adverse behavior in a live broadcast process according to claim 5, wherein the step 2 specifically comprises:
step 2.1: obtaining the RGB single-frame images of the video segment: performing frame extraction on the video segment, and extracting all RGB single-frame images contained in the video according to the video frame rate;
step 2.2: performing optical flow processing on the RGB single-frame images, and synthesizing an optical flow graph by calculation between every two adjacent RGB single-frame images;
step 2.3: processing the obtained RGB single-frame images and optical flow images, and storing the RGB single-frame images and optical flow images related to the same type of induced bad behavior together.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011279463.8A CN112380999B (en) | 2020-11-16 | 2020-11-16 | Detection system and method for inductivity bad behavior in live broadcast process |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011279463.8A CN112380999B (en) | 2020-11-16 | 2020-11-16 | Detection system and method for inductivity bad behavior in live broadcast process |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112380999A CN112380999A (en) | 2021-02-19 |
CN112380999B true CN112380999B (en) | 2023-08-01 |
Family
ID=74585326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011279463.8A Active CN112380999B (en) | 2020-11-16 | 2020-11-16 | Detection system and method for inductivity bad behavior in live broadcast process |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112380999B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160273A (en) * | 2021-03-25 | 2021-07-23 | 常州工学院 | Intelligent monitoring video segmentation method based on multi-target tracking |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550699A (en) * | 2015-12-08 | 2016-05-04 | 北京工业大学 | CNN-based video identification and classification method through time-space significant information fusion |
CN110909658A (en) * | 2019-11-19 | 2020-03-24 | 北京工商大学 | Method for recognizing human body behaviors in video based on double-current convolutional network |
CN110969066A (en) * | 2018-09-30 | 2020-04-07 | 北京金山云网络技术有限公司 | Live video identification method and device and electronic equipment |
CN111462183A (en) * | 2020-03-31 | 2020-07-28 | 山东大学 | Behavior identification method and system based on attention mechanism double-current network |
CN111709351A (en) * | 2020-06-11 | 2020-09-25 | 江南大学 | Three-branch network behavior identification method based on multipath space-time characteristic reinforcement fusion |
CN111783540A (en) * | 2020-06-01 | 2020-10-16 | 河海大学 | Method and system for recognizing human body behaviors in video |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550699A (en) * | 2015-12-08 | 2016-05-04 | 北京工业大学 | CNN-based video identification and classification method through time-space significant information fusion |
CN110969066A (en) * | 2018-09-30 | 2020-04-07 | 北京金山云网络技术有限公司 | Live video identification method and device and electronic equipment |
CN110909658A (en) * | 2019-11-19 | 2020-03-24 | 北京工商大学 | Method for recognizing human body behaviors in video based on double-current convolutional network |
CN111462183A (en) * | 2020-03-31 | 2020-07-28 | 山东大学 | Behavior identification method and system based on attention mechanism double-current network |
CN111783540A (en) * | 2020-06-01 | 2020-10-16 | 河海大学 | Method and system for recognizing human body behaviors in video |
CN111709351A (en) * | 2020-06-11 | 2020-09-25 | 江南大学 | Three-branch network behavior identification method based on multipath space-time characteristic reinforcement fusion |
Non-Patent Citations (4)
Title |
---|
Deep Optical Flow Feature Fusion Based on 3D Convolutional Networks for Video Action Recognition;Tongwei Lu;《 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation》;1077-1080 * |
Human Activities Recognition from Videos Based on Compound Deep Neural Network;zhijian Liu等;《 The 10th International Conference on Computer Engineering and Networks 》;314-326 * |
Research on Video Behavior Recognition Methods Based on Deep Learning; Yang Bin; 《CNKI China Masters' Theses Full-text Database (Information Science and Technology)》(No. 12); I138-566 *
Video Behavior Recognition Algorithm Based on Deep Learning; Luo Weiyi; 《CNKI China Masters' Theses Full-text Database (Information Science and Technology)》(No. 7); I138-909 *
Also Published As
Publication number | Publication date |
---|---|
CN112380999A (en) | 2021-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2016291690B2 (en) | Prediction of future views of video segments to optimize system resource utilization | |
US20170289624A1 (en) | Multimodal and real-time method for filtering sensitive media | |
CN110796098B (en) | Method, device, equipment and storage medium for training and auditing content auditing model | |
CN101535941B (en) | Method and device for adaptive video presentation | |
Janowski et al. | Quality assessment for a visual and automatic license plate recognition | |
Dou et al. | Edge computing-enabled deep learning for real-time video optimization in IIoT | |
CN113383362B (en) | User identification method and related product | |
CN111918130A (en) | Video cover determining method and device, electronic equipment and storage medium | |
CN109033476B (en) | Intelligent spatio-temporal data event analysis method based on event cue network | |
CN110692251B (en) | Method and system for combining digital video content | |
CN110856039A (en) | Video processing method and device and storage medium | |
CN115186303B (en) | Financial signature safety management method and system based on big data cloud platform | |
KR20160103557A (en) | Facilitating television based interaction with social networking tools | |
GB2550858A (en) | A method, an apparatus and a computer program product for video object segmentation | |
CN111553328A (en) | Video monitoring method, system and readable storage medium based on block chain technology and deep learning | |
CN112380999B (en) | Detection system and method for inductivity bad behavior in live broadcast process | |
CN110418148B (en) | Video generation method, video generation device and readable storage medium | |
CN111914649A (en) | Face recognition method and device, electronic equipment and storage medium | |
CN113920585A (en) | Behavior recognition method and device, equipment and storage medium | |
CN111565303B (en) | Video monitoring method, system and readable storage medium based on fog calculation and deep learning | |
CN114727093B (en) | Data analysis method and device, electronic equipment and computer storage medium | |
CN112560552A (en) | Video classification method and device | |
CN113596354B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN114842411A (en) | Group behavior identification method based on complementary space-time information modeling | |
CN114550300A (en) | Video data analysis method and device, electronic equipment and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||