CN115775363A - Illegal video detection method based on text and video fusion - Google Patents

Illegal video detection method based on text and video fusion

Info

Publication number
CN115775363A
Authority
CN
China
Prior art keywords
text
video
data
fusion
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210453529.3A
Other languages
Chinese (zh)
Inventor
于碧辉
魏靖烜
王克
刘畅
贺少婕
孙华军
卜立平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Computing Technology of CAS
Original Assignee
Shenyang Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Computing Technology of CAS filed Critical Shenyang Institute of Computing Technology of CAS
Priority to CN202210453529.3A
Publication of CN115775363A
Pending legal-status Critical Current


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a violation video detection method based on text and video fusion. The method comprises the following steps: acquiring video from a live-streaming platform together with the text data of the corresponding live-stream topic; processing the acquired video and text data with a two-stream model and feeding the processed data into a feature extraction layer for shallow information extraction; passing the collected feature information to a deep feature extractor that learns the contextual associations of the features extracted by the previous layer, so that the deep semantic relations expressed by consecutive frames are learned effectively; sending the extracted deep semantic relations to a fusion layer for information fusion, so that violation videos are identified better through modal complementarity; and finally feeding the fused features into a fully connected layer and outputting the final classification information. The method can identify violations in internet live streams promptly and accurately, saves labor cost, can be generalized to related fields, and adapts to a variety of text and video classification tasks with good extensibility.

Description

Illegal video detection method based on text and video fusion
Technical Field
The invention relates to image and text detection, in particular to a violation video detection method based on text and video fusion.
Background
In recent years, watching webcasts has become one of the ways internet users entertain and relax. On live-streaming platforms, streams containing sensitive imagery and bullet-screen (danmaku) text that violate laws, regulations and public morality have gradually appeared as broadcasters try to attract viewers with different kinds of content. The core means of such streams is to carry the violating content over the platform's real-time video stream. Experience with live-streaming platforms shows that when violating content appears during a broadcast, the popularity and heat of the live room rise, the volume of bullet-screen comments grows, and comments referring to the violating content are mixed in. Therefore, when auditing webcast video content, the live video can be fused with the corresponding live-topic text data for multi-modal violation detection.
At present, internet live-stream violation detection generally relies on manual screening, or on simple image recognition applied to images as a whole. In practice, this approach has the following problems:
1. The manual review workload is enormous, and effective review of every video is hard to guarantee;
2. Manual review is highly subjective, and its quality is hard to guarantee;
3. Existing image recognition and classification methods only consider parameters such as the skin-color ratio of the image in a video frame, ignore the high-level semantic features of the video, and do not use the information in the corresponding topic text when a sensitive picture appears in a stream.
Disclosure of Invention
The invention aims to overcome the high cost and the poor timeliness and accuracy of the prior art by providing a violation video detection algorithm that fuses text and video: the extracted deep semantic relations between video and text are sent to a fusion layer for information fusion, so that text information and video information are combined and violation videos are identified better through modal complementarity. Studying internet live-stream violations with a text-and-video fusion method has significant research value, saves labor cost, and offers high stability, simple operation and good real-time performance.
To achieve this purpose, the invention designs a violation video detection algorithm fusing text and video that can accurately identify violating content in webcasts, imitations of such actions and behaviors, and dance performances containing such actions.
The technical scheme adopted by the invention to realize this purpose is as follows: a violation video detection method based on text and video fusion, comprising the following steps:
data acquisition and preprocessing step: crawl video stream data from webcasts and apply denoising preprocessing to obtain single-frame pictures and the text data appearing on them; the video stream data contains image picture data and text data;
dataset construction step: classify and label the single-frame pictures and text data as violating or non-violating actions or text, and build a labelled picture-text pair sample set;
network model construction step: build a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly train the network on batches of picture-text samples from the sample set, iteratively tune the model and set a violation judgment probability threshold until an optimized network model is obtained (a minimal training sketch follows these steps); the model judges whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
real-time video detection step: acquire a video stream in real time, preprocess it, input it into the optimized network model, and obtain the violation judgment result.
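As a hedged illustration of the training step above, the following is a minimal PyTorch sketch of the batched training loop with a tuned violation probability threshold; the loss choice, optimizer settings, model interface and the 0.5 threshold are illustrative assumptions, not values given by the patent.

```python
import torch
import torch.nn as nn

def train_model(model: nn.Module, loader, epochs: int = 10) -> None:
    """Train the fused text-video classifier on batched picture-text pairs."""
    criterion = nn.BCELoss()                       # binary: violation vs. non-violation
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _ in range(epochs):
        for frames, text_ids, labels in loader:    # one batch of picture-text samples
            optimizer.zero_grad()
            probs = model(frames, text_ids)        # sigmoid probability in [0, 1]
            loss = criterion(probs.squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()

# A violation judgment probability threshold is set during iterative tuning;
# 0.5 is an assumed starting value, not a value stated in the patent.
VIOLATION_THRESHOLD = 0.5
```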
The image sample dataset and the text sample dataset are divided into a training set, a validation set and a test set;
the denoising preprocessing comprises frame extraction and text cleaning:
noise is removed from the text data;
and the video data is cut into frames, with consecutive frames extracted and stored every 10 s (a preprocessing sketch follows).
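A minimal preprocessing sketch, assuming OpenCV for frame extraction and a simple regular expression for text cleaning; only the 10 s sampling interval comes from the text above, everything else is an assumption.

```python
import re
import cv2  # OpenCV, assumed here as the frame-extraction library

def clean_text(raw: str) -> str:
    """Remove cluttered symbols, keeping CJK characters, letters and digits."""
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9 ]+", "", raw)

def extract_frames(video_path: str, interval_s: float = 10.0) -> list:
    """Extract one frame every `interval_s` seconds, per the 10 s rule above."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if FPS is unknown
    step = max(1, int(fps * interval_s))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                   # stored for the feature extractor
        idx += 1
    cap.release()
    return frames
```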
The shallow feature extraction module comprises: a BERT model for extracting shallow semantic text features of the current sentence, and a ResNet model for extracting shallow semantic features of video frame pictures;
the deep feature extraction module comprises: a TextCNN model and an LSTM model that continue learning from the shallow text and video features, producing deep text semantic features and deep video-frame semantic features that carry the contextual associations of consecutive video frames and text content;
the feature fusion module is a self-attention module that fuses the acquired deep video-frame features and text features;
the fully connected module splices the output features;
and the classifier module uses a Sigmoid function to score the probability of the current batch of multi-frame picture-text inputs and outputs a classification result indicating whether a violation occurs.
The ResNet model for extracting shallow video-frame picture features comprises the following: shallow video-frame feature extraction is performed with a ResNet152 network that has an embedded SE (squeeze-and-excitation) module; the embedded SE module readjusts the features, uses the extracted global information to measure the importance of the extracted features, and computes the relevance of each channel (see the SE-block sketch below).
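The SE module described here matches the standard squeeze-and-excitation design; below is a minimal PyTorch sketch of such a block, with the reduction ratio of 16 assumed as the common default rather than taken from the patent.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: extract global information
        self.fc = nn.Sequential(                   # excitation: per-channel relevance
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # readjust features channel by channel
```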
The BERT model for extracting the shallow text features comprises the following:
BERT is formed by stacking multiple Transformer encoder layers that process its input; each encoder layer consists of a multi-head attention layer and a feed-forward neural network, the model has 12 layers, and each layer has 12 attention heads;
the input of BERT is the sum of the word encodings, position encodings and segment encodings extracted from the input text information (a sketch of this input construction follows).
The LSTM model uses a bidirectional LSTM over the shallow video features, extracting the contextual associations of consecutive video frames to obtain deeper semantic features (a sketch follows).
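A minimal sketch of such a bidirectional LSTM over per-frame features; the 2048-dimensional input matches ResNet152's final pooled feature size, while the hidden size is an assumption.

```python
import torch
import torch.nn as nn

class FrameBiLSTM(nn.Module):
    def __init__(self, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) from SE-ResNet152
        out, _ = self.lstm(frame_feats)            # (batch, num_frames, 2 * hidden)
        return out                                 # context-aware deep frame features
```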
The TextCNN is divided into an encoding layer, a convolutional layer, a max-pooling layer and a fully connected layer, and learns the contextual relations of the text to extract deeper text features (a sketch follows).
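Below is a minimal TextCNN sketch with the four layers named above; the kernel sizes, filter count and output dimension are common defaults assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, in_dim: int = 768, n_filters: int = 128,
                 kernel_sizes=(2, 3, 4), out_dim: int = 256):
        super().__init__()
        # Convolutional layer: one branch per kernel size over the token axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_dim, n_filters, k) for k in kernel_sizes])
        # Fully connected layer producing the deep text feature.
        self.fc = nn.Linear(n_filters * len(kernel_sizes), out_dim)

    def forward(self, token_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (batch, seq_len, in_dim), e.g. the encoding-layer output.
        x = token_feats.transpose(1, 2)            # (batch, in_dim, seq_len)
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x))                    # (batch, n_filters, L')
            pooled.append(h.max(dim=-1).values)    # max pooling over time
        return self.fc(torch.cat(pooled, dim=1))   # deep text feature
```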
Learning the deep video features and the deep text features to obtain deep fusion features specifically comprises:
in the early fusion, the input feature map is learned autonomously based on self-attention and weights are assigned, so the important information points in the feature map are obtained, dependence on external information is reduced, and the deep video features and deep text features are aligned and fused;
the fused result then passes through a fully connected layer followed by a Sigmoid, which classifies the currently given video pictures containing text information and judges whether they violate the rules (see the fusion sketch below).
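A hedged sketch of this early-fusion stage: the deep video and text feature sequences are concatenated and weighted by self-attention, then passed through the fully connected layer and Sigmoid. The dimensions, head count and mean pooling are assumptions, not choices stated in the patent.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, 1)                # fully connected layer

    def forward(self, video_feats: torch.Tensor,
                text_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, Tv, dim); text_feats: (batch, Tt, dim)
        seq = torch.cat([video_feats, text_feats], dim=1)  # early fusion
        fused, _ = self.attn(seq, seq, seq)        # self-attention: Q = K = V
        pooled = fused.mean(dim=1)                 # aggregate the fused sequence
        return torch.sigmoid(self.fc(pooled))      # violation probability
```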
A multi-modal violation video detection system based on text and video fusion, the system comprising:
a data acquisition and preprocessing program module: crawls video stream data from webcasts and applies denoising preprocessing to obtain single-frame pictures and the text data appearing on them; the video stream data contains image picture data and text data;
a dataset construction program module: classifies and labels the single-frame pictures and text data as violating or non-violating actions or text, and builds a labelled picture-text pair sample set;
a network model construction program module: builds a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly trains the network on batches of picture-text samples from the sample set, iteratively tunes the model and sets a violation judgment probability threshold until an optimized network model is obtained; the model judges whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
a real-time video detection program module: acquires a video stream in real time, preprocesses it, inputs it into the optimized network model, and obtains the violation judgment result.
A multi-modal violation video detection device based on text and video fusion, comprising: a server-side processor and a computer-readable storage medium; the processor is configured to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to perform the multi-modal violation video detection method based on text and video fusion described above.
The invention has the following beneficial effects and advantages:
1. The method adopts a violation video detection algorithm based on text-video fusion: live video data and text data such as the live topic are acquired by a crawler and used for training, effectively improving the accuracy of violation detection in live streams.
2. The invention extracts text features with BERT + TextCNN. The BERT model extracts information relevant to the current stream, but the meanings expressed in live streams tend to be vague, so a TextCNN model afterwards extracts deeper semantic features, bringing the extracted content closer to the meaning expressed by the live-stream topic.
3. The ResNet network used by the invention is a ResNet152 with an embedded SE module. The SE module readjusts the features, measures the importance of the extracted features using the extracted global information, and computes the relevance of each channel, aiding feature extraction and the understanding of the content before and after a video frame.
4. The invention uses an LSTM model to extract deep semantic features from the video features: video frames may carry continuous semantic information that has not been learned after SE-ResNet extracts the image features, so an LSTM model follows to extract those deep semantic features.
5. The algorithm has high recognition accuracy and is suitable for wide deployment in today's live-streaming era.
Drawings
FIG. 1 is a model structure diagram of the text-video fusion violation video detection algorithm according to the present invention;
FIG. 2 is a schematic flow diagram of the text-video fusion violation video detection algorithm provided by the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, but rather should be construed as modified in the spirit and scope of the present invention as set forth in the appended claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
A violation video is defined as follows: sensitive pictures and bullet-screen text that violate laws, regulations or public morality, or imitations of such actions and behaviors, including dance performances containing such actions. For such live content, violation detection is performed by combining the live topic with the video semantics.
The violation video detection system is internally provided with: a data acquisition and storage module, which prepares samples of live videos and their corresponding topic texts; a data preprocessing module, which performs frame extraction on the video and text cleaning; a text-feature and video-frame-feature extraction module, which extracts the live video content and the semantics of the corresponding live topic text; a deep feature extraction module, which performs deep association extraction over the video and text semantic information; and a feature fusion module, which fuses the video and text semantic information. The detection model installed in the system is a violation video detection model that fuses text and video multi-modal information.
The text-video multi-modal violation video detection model adopts TextCNN, a bidirectional LSTM, and models pre-trained on large-scale image and text datasets.
The weight parameters of the model installed in the violation video detection system are obtained by training the detection model on normal and abnormal video stream data from an internet live-streaming platform.
The video acquisition program used in the violation video detection system collects video stream data publicly shared by internet platforms. The video parsing program processes video streams into image data frame by frame and compresses the analysis workload through sparse sampling. The video abnormal-behavior detection program takes the key video frames produced by the parsing program and the corresponding live text topic as input, performs recognition and detection with a deep learning neural network model, and finally outputs whether the input video is abnormal.
The text-video fusion violation video detection model of the present invention is described in detail below with reference to fig. 1.
The invention extracts text features with BERT + TextCNN and video-frame features with ResNet152 + bidirectional LSTM, and fuses the deep feature semantics with an early-fusion strategy. The early fusion uses self-attention to fuse the video-frame features and the text features: self-attention learns the input feature map autonomously and assigns weights, so the important information points in the feature map are found, dependence on external information is reduced, and feature alignment and fusion are achieved, with the network focusing more on capturing the internal relevance of the information. After the final model fusion, classification is performed through a fully connected layer, effectively classifying violating texts and videos.
After the input text and video are preprocessed, cluttered symbols, uncommon characters and the like are removed from the text, which can then be treated as clean text and passed to the text feature extractor for text feature extraction. After the input video is preprocessed, consecutive frames are extracted every 10 s as a representation of the video content, and the processed data are stored in corresponding files for the video feature extractor to read.
For text feature extraction, BERT extracts features from information such as the live topic; because the topic contains the main content of the stream and correlates strongly with the rest of the live video content, the live topic is chosen as the text input. After input, the BERT model extracts information relevant to the current stream, but the meanings expressed in live streams tend to be vague, so a TextCNN model afterwards extracts deeper semantic features, bringing the extracted content closer to the meaning expressed by the live-stream topic. BERT is formed by stacking multiple Transformer encoder layers; each encoder layer consists of a multi-head attention layer and a feed-forward neural network, the model has 12 layers, and each layer has 12 attention heads. The input of BERT is the sum of three embeddings: word encoding, position encoding and segment encoding. TextCNN has four layers, namely an encoding layer, a convolutional layer, a max-pooling layer and a fully connected layer, so it learns the contextual relations of the text better and extracts text features at a deeper level.
For video feature extraction, the preprocessed video frames are fed to a ResNet152 network serving as the frame feature extraction network. The ResNet network used by the invention is a ResNet152 with an embedded SE module: the SE module readjusts the features, measures the extracted features against the extracted global information and computes the relevance of each channel, aiding feature extraction and the understanding of the text content before and after a video frame. After SE-ResNet extracts the image features, the continuous semantic information that video frames may contain has not yet been learned, so an LSTM model then extracts the deep semantic features of the video features.
Once the deep text semantic features and the deep image semantic features are extracted, they are fused with an early-fusion strategy, using self-attention to fuse the video-frame features and the text features. After fusion, a fully connected layer followed by a sigmoid performs classification, judging whether the given text information and the corresponding video constitute a violating video. (An end-to-end wiring sketch of this pipeline follows.)
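Below is a hedged end-to-end wiring sketch of this pipeline. To stay brief it substitutes a plain torchvision ResNet152 for the SE-embedded variant and a linear projection for TextCNN; the `bert-base-chinese` checkpoint and all hyper-parameters are assumptions, not choices stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152
from transformers import BertModel

class ViolationDetector(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        cnn = resnet152(weights="IMAGENET1K_V1")   # stand-in for SE-ResNet152
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # 2048-d per frame
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.frame_lstm = nn.LSTM(2048, dim // 2, batch_first=True,
                                  bidirectional=True)          # deep video features
        self.text_proj = nn.Linear(768, dim)       # simplified stand-in for TextCNN
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)              # fully connected layer + sigmoid

    def forward(self, frames, input_ids, attention_mask):
        # frames: (batch, num_frames, 3, 224, 224); ids/mask: tokenized live topic
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        video, _ = self.frame_lstm(f)              # contextual frame features
        text = self.text_proj(
            self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        seq = torch.cat([video, text], dim=1)      # early fusion of both modalities
        fused, _ = self.attn(seq, seq, seq)        # self-attention: Q = K = V
        return torch.sigmoid(self.head(fused.mean(dim=1)))
```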
The detection process of the text-video fusion violation video detection is described in detail below with reference to fig. 2.
1. Data acquisition and storage module: violating and non-violating videos in webcasts are crawled, labelled as training, validation and test sets respectively, and stored. After labelling, the data are manually audited to check whether the labels are reasonable and valid.
2. Data preprocessing module: the input video and text data both contain considerable noise, and the video is a continuous segment, so the noise is removed from the text data and the video data is cut into consecutive frames, finally yielding the data samples the model requires.
3. Text-feature and video-frame-feature extraction module: BERT performs feature extraction on the text and SE-ResNet on the video frames, extracting the semantic features of the current text sentence and of the current video frame.
4. Deep feature extraction module: TextCNN is applied to the text and a bidirectional LSTM to the video frames, extracting the contextual associations of consecutive video frames and text content to obtain deeper semantic features.
5. Feature fusion module: the extracted deep text and image features are fused by the feature fusion module, using self-attention in an early-fusion manner to fuse the video-frame features and the text features.
6. Classification module: the fused features are finally fed into a fully connected layer and classified through a Sigmoid, judging whether the given video and text violate the rules (an inference sketch follows).
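A small inference sketch for this final step, assuming a batch of one and a 0.5 decision threshold (the patent only states that a threshold is set during tuning):

```python
import torch

@torch.no_grad()
def detect(model, frames, input_ids, attention_mask,
           threshold: float = 0.5) -> bool:        # threshold value is assumed
    """Return True if the clip-plus-topic pair is judged a violation."""
    model.eval()
    prob = model(frames, input_ids, attention_mask)  # sigmoid output, shape (1, 1)
    return bool((prob >= threshold).item())
```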
The invention discloses a multi-modal violation video detection device based on text and video fusion, comprising: a server-side processor and a computer-readable storage medium; the processor is configured to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to execute the multi-modal violation video detection method based on text and video fusion.
The logic instructions in the computer-readable storage medium according to the present invention may be implemented as software functional units and stored in a computer-readable storage medium when sold or used as independent products. Based on this understanding, the technical solution of the present invention may be embodied as a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The embodiments described above assist those skilled in the art in further understanding the invention but do not limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.

Claims (10)

1. A violation video detection method based on text and video fusion, characterized by comprising the following steps:
a data acquisition and preprocessing step: crawling video stream data from webcasts and applying denoising preprocessing to obtain single-frame pictures and the text data appearing on them, the video stream data containing image picture data and text data;
a dataset construction step: classifying and labelling the single-frame pictures and text data as violating or non-violating actions or text, and building a labelled picture-text pair sample set;
a network model construction step: building a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly training the network on batches of picture-text samples from the sample set, iteratively tuning the model and setting a violation judgment probability threshold until an optimized network model is obtained, the model judging whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
a real-time video detection step: acquiring a video stream in real time, preprocessing it, inputting it into the optimized network model, and obtaining the violation judgment result.
2. The method according to claim 1, characterized in that the image sample dataset and the text sample dataset are divided into a training set, a validation set and a test set;
the denoising preprocessing comprises frame extraction and text cleaning:
denoising the text data;
and cutting the video data into frames, with consecutive frames extracted and stored every 10 s.
3. The violation video detection method based on text and video fusion according to claim 1, characterized in that:
the shallow feature extraction module comprises a BERT model for extracting shallow semantic text features of the current sentence and a ResNet model for extracting shallow semantic features of video frame pictures;
the deep feature extraction module comprises a TextCNN model and an LSTM model that continue learning from the shallow video and text features to obtain deep text semantic features and deep video-frame semantic features carrying the contextual associations of consecutive video frames and text content;
the feature fusion module is a self-attention module that fuses the acquired deep video-frame features and text features;
the fully connected module splices the output features;
and the classifier module uses a Sigmoid function to score the probability of the current batch of multi-frame picture-text inputs and outputs a classification result indicating whether a violation occurs.
4. The method according to claim 3, characterized in that the ResNet model for extracting shallow video-frame picture features comprises: a ResNet152 network with an embedded SE module performing shallow video-frame feature extraction; the embedded SE module readjusts the features, uses the extracted global information to measure the importance of the extracted features, and computes the relevance of each channel.
5. The violation video detection method based on text-video fusion according to claim 3, characterized in that the BERT model for extracting shallow text features comprises:
BERT formed by stacking multiple Transformer encoder layers that process its input, each encoder layer consisting of a multi-head attention layer and a feed-forward neural network, the model having 12 layers with 12 attention heads per layer;
the input of BERT being the sum of the word encodings, position encodings and segment encodings extracted from the input text information.
6. The violation video detection method based on text and video fusion according to claim 3, characterized in that the LSTM model uses a bidirectional LSTM over the shallow video features, extracting the contextual associations of consecutive video frames to obtain deeper semantic features.
7. The violation video detection method based on text and video fusion according to claim 3, characterized in that the TextCNN comprises an encoding layer, a convolutional layer, a max-pooling layer and a fully connected layer, and learns the contextual relations of the text to extract deeper text features.
8. The violation video detection method based on text-video fusion according to claim 3, characterized in that learning the deep video features and the deep text features to obtain deep fusion features specifically comprises:
in the early fusion, learning the input feature map autonomously based on self-attention and assigning weights, obtaining the important information points in the feature map, reducing dependence on external information, and aligning and fusing the deep video features and deep text features;
and passing the result through a fully connected layer followed by a Sigmoid to classify the currently given video pictures containing text information and judge whether they violate the rules.
9. A multi-modal violation video detection system based on text and video fusion, the system comprising:
a data acquisition and preprocessing program module: crawling video stream data from webcasts and applying denoising preprocessing to obtain single-frame pictures and the text data appearing on them, the video stream data containing image picture data and text data;
a dataset construction program module: classifying and labelling the single-frame pictures and text data as violating or non-violating actions or text, and building a labelled picture-text pair sample set;
a network model construction program module: building a network model comprising a shallow feature extraction module, a deep feature extraction module, a feature fusion module, a fully connected module and a classifier module connected in sequence; repeatedly training the network on batches of picture-text samples from the sample set, iteratively tuning the model and setting a violation judgment probability threshold until an optimized network model is obtained, the model judging whether consecutive multi-frame dynamic picture-text inputs contain violating actions or text information;
a real-time video detection program module: acquiring a video stream in real time, preprocessing it, inputting it into the optimized network model, and obtaining the violation judgment result.
10. A multi-modal violation video detection device based on text and video fusion, comprising: a server-side processor and a computer-readable storage medium; the processor is configured to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to perform the text and video fusion based multi-modal violation video detection method of any of claims 1-8.
CN202210453529.3A 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion Pending CN115775363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210453529.3A CN115775363A (en) 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210453529.3A CN115775363A (en) 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion

Publications (1)

Publication Number Publication Date
CN115775363A true CN115775363A (en) 2023-03-10

Family

ID=85388249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210453529.3A Pending CN115775363A (en) 2022-04-27 2022-04-27 Illegal video detection method based on text and video fusion

Country Status (1)

Country Link
CN (1) CN115775363A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778438A (en) * 2023-08-17 2023-09-19 苏州市吴江区盛泽镇人民政府 Illegal forklift detection method and system based on large language model
CN116778438B (en) * 2023-08-17 2023-11-14 苏州市吴江区盛泽镇人民政府 Illegal forklift detection method and system based on large language model
CN116996470A (en) * 2023-09-27 2023-11-03 创瑞技术有限公司 Rich media information sending system
CN116996470B (en) * 2023-09-27 2024-02-06 创瑞技术有限公司 Rich media information sending system
CN117478838A (en) * 2023-11-01 2024-01-30 珠海经济特区伟思有限公司 Distributed video processing supervision system and method based on information security
CN117478838B (en) * 2023-11-01 2024-05-28 珠海经济特区伟思有限公司 Distributed video processing supervision system and method based on information security

Similar Documents

Publication Publication Date Title
CN110020437B (en) Emotion analysis and visualization method combining video and barrage
CN115775363A (en) Illegal video detection method based on text and video fusion
CN112465008B (en) Voice and visual relevance enhancement method based on self-supervision course learning
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN110223675B (en) Method and system for screening training text data for voice recognition
CN106708949A (en) Identification method of harmful content of video
CN112131347A (en) False news detection method based on multi-mode fusion
CN115994230A (en) Intelligent archive construction method integrating artificial intelligence and knowledge graph technology
CN111488487B (en) Advertisement detection method and detection system for all-media data
CN115982350A (en) False news detection method based on multi-mode Transformer
CN114170411A (en) Picture emotion recognition method integrating multi-scale information
CN113469214A (en) False news detection method and device, electronic equipment and storage medium
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN115775349A (en) False news detection method and device based on multi-mode fusion
CN114548274A (en) Multi-modal interaction-based rumor detection method and system
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN114912026B (en) Network public opinion monitoring analysis processing method, equipment and computer storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN113361615B (en) Text classification method based on semantic relevance
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
CN115690636A (en) Intelligent detection system for internet live broadcast abnormal behaviors
CN113095319A (en) Multidirectional scene character detection method and device based on full convolution angular point correction network
CN111813996A (en) Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN111666437A (en) Image-text retrieval method and device based on local matching
CN113255461B (en) Video event detection and semantic annotation method and device based on dual-mode depth network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination