CN112364743A - Video classification method based on semi-supervised learning and bullet screen analysis - Google Patents

Video classification method based on semi-supervised learning and bullet screen analysis

Info

Publication number
CN112364743A
Authority
CN
China
Prior art keywords
bullet screen
model
video
classification
dataset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011204098.4A
Other languages
Chinese (zh)
Inventor
刘瑞军
张伦
王俊
章博华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202011204098.4A priority Critical patent/CN112364743A/en
Publication of CN112364743A publication Critical patent/CN112364743A/en
Pending legal-status Critical Current


Classifications

    • G06V 20/41 — Scenes; scene-specific elements in video content: higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 16/75 — Information retrieval of video data: clustering; classification
    • G06F 16/7844 — Information retrieval of video data: retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F 40/143 — Handling natural language data: markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • G06F 40/284 — Handling natural language data: lexical analysis, e.g. tokenisation or collocates
    • G06V 20/46 — Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N 21/4788 — Selective content distribution: supplemental services communicating with other users, e.g. chatting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Barrage (bullet screen) video is a product of the online entertainment interaction that has emerged in recent years. Bullet screens can serve as prior knowledge for classifying such videos: the emotion information they carry can be mined to make predictions about the video content. The method comprises the following steps. First, a bullet screen dataset is acquired and preprocessed. Second, a semi-supervised learning approach trains models with a small amount of labeled data and a large amount of unlabeled data to extract emotion and topic information from the bullet screens. Third, the video content is inferred from these results; combined with the time axis, a linear sequence containing emotion labels and topic labels is generated, and the video classification task is completed by comparing the sequence similarity of different videos.

Description

Video classification method based on semi-supervised learning and bullet screen analysis
Technical Field
The application belongs to the technical field of video processing, and in particular relates to a video classification method and apparatus that processes bullet screen data using the deep learning model BERT and semi-supervised learning.
Background
In recent years, watching online videos has become a main form of mass entertainment, and communicating through video bullet screens is increasingly popular. As people take part in this interactive mode, they generate huge amounts of bullet screen data and video data. How to mine valuable information from this massive bullet screen data is the problem currently faced.
Video classification belongs to the technical field of video processing. The traditional video classification task is mainly completed with machine learning methods and comprises training a classification model on annotated data. Specifically, a key-frame sequence is extracted from the video, the image content features are embedded in multiple dimensions to obtain a multi-dimensional content feature vector of the target video, the feature vector is fed into a neural network model, and the model learns the characteristics of the samples from a large amount of training data. The trained model then predicts the final class of actual data.
Barrage video is a product of entertainment interaction that has emerged in recent years: it is video with embedded, time-stamped bullet screen comments, and these comments often reflect how viewers (users) understand the video content. For classifying barrage videos, the bullet screen information can be used: the emotion information it contains can be mined to make predictions about the video content, so the video classification problem can be approached from the viewpoint of bullet screen analysis by applying the theory and methods of Natural Language Processing (NLP).
Disclosure of Invention
To solve the above technical problems, the application provides a video classification method based on semi-supervised learning and bullet screen analysis. The method extracts emotion information from bullet screens to infer the content of the video, combines the results with the time axis to generate a linear sequence containing emotion labels and topic labels, and completes the video classification task by comparing the sequence similarity of different videos, finally obtaining a classification result. The method achieves good accuracy on bullet screen videos, complements traditional video classification methods, and has practical value.
According to one aspect of the present application, a video classification method based on semi-supervised learning and bullet screen analysis is provided, the method comprising:
S1: acquiring bullet screen data, preprocessing the bullet screens, and constructing a bullet screen dataset; segmenting the preprocessed training-set bullet screen data with the jieba tokenizer, marking the sentences with a dictionary encoder, generating input sequences, and representing them as vectors.
S2: feeding the concatenated feature vectors into the pre-trained language model BERT and, by learning on the labeled bullet screen dataset L, updating the task-specific parameters in the deep feature space for the subjective/objective bullet screen classification task, the emotion multi-classification task, and the topic classification task, thereby obtaining the Teacher model.
S3: labeling the unlabeled dataset with the Teacher model to generate a pseudo-label dataset P.
S4: training a larger Student model on L + P, with noise added to the dataset before training; this forces the new model to be insensitive to noisy data.
S5: returning to S4 with the Student model taken as the new Teacher model, continuing to label the unlabeled dataset to generate a new pseudo-label dataset P and obtain a new Student model, until the model converges or computing resources are exhausted.
S6: classifying the test samples with the trained model, fusing the temporal features with the classification results to obtain the corresponding graph structures of different videos, obtaining the corresponding sequences by traversal, and completing video classification by comparing the sequence similarity of the different videos.
Specifically, step S1 includes:
S101: converting special characters such as emoticons into text, and splitting the dataset into a training set and a test set at a ratio of 9:1.
S102: labeling the bullet screen data of the test set, the labels falling into two parts, including: subjective/objective classification labels, which divide bullet screens into subjective bullet screens that contain subjective emotional information and objective bullet screens that do not.
S103: multi-class emotion labels for subjective bullet screens, for example: joy, anger, sorrow, happiness, surprise, pensiveness, and fear; and custom bullet screen topic labels, for example: cooking, fighting, chatting, and kissing. Approximately 1/10 of the training-set bullet screen data is annotated.
S104: building a dictionary D with one word or character per line.
S105: the marking scheme is as follows: a [CLS] token is prepended to the sentence, and an [SEP] token is inserted between the auxiliary sentence and the original sentence, specifically: [CLS] original sentence sequence [SEP] auxiliary sentence sequence [SEP].
S106: the vectorized representation specifically comprises: word vectors, position vectors, and segment vectors.
Specifically, step S2 includes:
S201: the Teacher model uses a BERT pre-trained language model released by Google, with the following parameters: transformer_blocks = 6, embedding_dimension = 384, num_heads = 12, and total parameters ≈ 23M. The pre-training corpus is the Chinese Wikipedia corpus nlp_chinese_corpus.
S202: dynamic padding is added to the model to speed up training.
Specifically, step S4 includes:
S401: the Student model likewise uses the BERT pre-trained language model released by Google, but here the Student model is made larger than the Teacher model in order to maximize what it can learn. The parameters used are: transformer_blocks = 12, embedding_dimension = 768, num_heads = 12, and total parameters ≈ 110M.
S402: adding noise to the dataset L + P before training the Student model is a data-augmentation transformation; for bullet screen text it is implemented as back-translation and TF-IDF word replacement.
Specifically, step S5 includes:
S501: minimizing the consistency loss gradually propagates label information from the labeled data to the unlabeled data.
Specifically, step S6 includes:
S601: a subjective/objective classification model divides the bullet screen data into two classes according to whether subjective emotional information is present; the subjective bullet screens are then fed into the multi-emotion classification model for prediction, and topic prediction is performed on the objective bullet screens.
S602: in actual prediction, because different users' impressions of the same video at the same moment, and the emotions expressed by the bullet screens they post, are not entirely consistent, several emotions or topics may be predicted at the same point on the time axis. A directed graph is therefore used to record the classification results: each predicted label is a node, and the vote count of a prediction is used as the weight of the corresponding directed edge. Finally, a bullet screen emotion-time directed graph and a bullet screen topic-time directed graph are obtained.
S603: the directed graph is traversed; four traversal modes are considered: breadth-first search (BFS), depth-first search (DFS), weight-based breadth-first search (WBFS), and weight-based depth-first search (WDFS). Tests show that weight-based depth-first search (WDFS) gives the best classification results in practice.
S604: for the two sequences, similarity can be determined by comparing the Levenshtein distance between the sequences of different videos, finally completing the video classification task.
The beneficial effects of the invention mainly include the following aspects:
First, the method uses NLP theory to complete a video classification task in the field of video processing; compared with traditional video classification methods, it is an attempt to classify videos that carry bullet screens from a different angle, and it achieves a measurable effect in practical applications.
Second, the BERT model used in this method was pre-trained by Google on large text corpora; compared with models such as CNN, RNN, and LSTM, it reduces the pre-training steps and the associated workload.
Third, the semi-supervised learning approach compensates for the small amount of labeled data and allows more features to be learned from the massive unlabeled dataset; it also mitigates the overfitting that supervised learning is prone to when data are insufficient.
Fourth, the video classification model makes use of users' prior knowledge, so it performs better on the task of classifying videos that carry bullet screens.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart diagram of a video classification method based on semi-supervised learning and barrage analysis according to one embodiment of the present application;
FIG. 2 is a schematic flow chart diagram of model training according to one embodiment of the present application;
FIG. 3 is a diagram of a model architecture according to one embodiment of the present application;
Detailed Description
The implementation process mainly comprises two steps: training a network by using the preprocessed bullet screen data; and then classifying the actual bullet screen video by using the trained model.
S1: acquiring bullet screen data, preprocessing the bullet screens, and constructing a bullet screen dataset; segmenting the preprocessed training-set bullet screen data with the jieba tokenizer, marking the sentences with a dictionary encoder, generating input sequences, and representing them as vectors.
S2: feeding the concatenated feature vectors into the pre-trained language model BERT and, by learning on the labeled bullet screen dataset L, updating the task-specific parameters in the deep feature space for the subjective/objective bullet screen classification task, the emotion multi-classification task, and the topic classification task, thereby obtaining the Teacher model.
S3: labeling the unlabeled dataset with the Teacher model to generate a pseudo-label dataset P.
S4: training a larger Student model on L + P, with noise added to the dataset before training; this forces the new model to be insensitive to noisy data.
S5: returning to S4 with the Student model taken as the new Teacher model, continuing to label the unlabeled dataset to generate a new pseudo-label dataset P and obtain a new Student model, until the model converges or computing resources are exhausted.
S6: classifying the test samples with the trained model, fusing the temporal features with the classification results to obtain the corresponding graph structures of different videos, obtaining the corresponding sequences by traversal, and completing video classification by comparing the sequence similarity of the different videos.
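By way of illustration and not limitation, the iterative Teacher/Student procedure of steps S2-S5 can be summarized in Python roughly as in the following sketch; train_fn and augment_fn are placeholders introduced here for the training and noising operations described below, not functions of any particular library.

def noisy_student_loop(labeled_L, unlabeled_U, train_fn, augment_fn, max_rounds=3):
    """Sketch of S2-S5. train_fn(dataset, larger) returns a model with .predict(x);
    augment_fn(dataset) returns a noised copy of the dataset (back-translation, TF-IDF replacement)."""
    teacher = train_fn(labeled_L, larger=False)                      # S2: fine-tune BERT on L
    for _ in range(max_rounds):                                      # S5: repeat until budget is spent
        pseudo_P = [(x, teacher.predict(x)) for x in unlabeled_U]    # S3: pseudo-label dataset P
        student = train_fn(augment_fn(labeled_L + pseudo_P), larger=True)   # S4: larger, noised Student
        teacher = student                                            # the Student becomes the new Teacher
    return teacher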
Specifically, step S1 includes:
S101: converting special characters such as emoticons into text, and splitting the dataset into a training set and a test set at a ratio of 9:1.
S102: labeling the bullet screen data of the test set, the labels falling into two parts, including: subjective/objective classification labels, which divide bullet screens into subjective bullet screens that contain subjective emotional information and objective bullet screens that do not.
S103: multi-class emotion labels for subjective bullet screens, for example: joy, anger, sorrow, happiness, surprise, pensiveness, and fear; and custom bullet screen topic labels, for example: cooking, fighting, chatting, and kissing. Approximately 1/10 of the training-set bullet screen data is annotated.
S104: building a dictionary D with one word or character per line.
S105: the marking scheme is as follows: a [CLS] token is prepended to the sentence, and an [SEP] token is inserted between the auxiliary sentence and the original sentence, specifically: [CLS] original sentence sequence [SEP] auxiliary sentence sequence [SEP].
S106: the vectorized representation specifically comprises: word vectors, position vectors, and segment vectors.
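By way of illustration and not limitation, the following Python sketch shows one possible realization of steps S104-S106: jieba segmentation, construction of the [CLS]/[SEP]-marked input sequence, and generation of the word, position, and segment index vectors from a line-per-token dictionary D. The file name dict.txt and the helper names are assumptions introduced for this example.

import jieba

def load_dictionary(path="dict.txt"):
    """Dictionary D (S104): one word or character per line; the line number is the token id."""
    with open(path, encoding="utf-8") as f:
        tokens = [line.strip() for line in f if line.strip()]
    return {tok: i for i, tok in enumerate(tokens)}

def encode_pair(original, auxiliary, vocab, unk="[UNK]"):
    """Build [CLS] original [SEP] auxiliary [SEP] (S105) and its three index vectors (S106)."""
    tokens = (["[CLS]"] + list(jieba.cut(original)) + ["[SEP]"]
              + list(jieba.cut(auxiliary)) + ["[SEP]"])
    input_ids = [vocab.get(t, vocab.get(unk, 0)) for t in tokens]    # word vector indices
    position_ids = list(range(len(tokens)))                          # position vector indices
    first_sep = tokens.index("[SEP]")
    segment_ids = [0] * (first_sep + 1) + [1] * (len(tokens) - first_sep - 1)  # segment indices
    return input_ids, position_ids, segment_ids

# Example (the sentences are made up):
# ids, pos, seg = encode_pair("这段视频太好笑了", "这条弹幕的情感是什么", load_dictionary())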
Specifically, step S2 includes:
S201: the Teacher model uses a BERT pre-trained language model released by Google, with the following parameters: transformer_blocks = 6, embedding_dimension = 384, num_heads = 12, and total parameters ≈ 23M. The pre-training corpus is the Chinese Wikipedia corpus nlp_chinese_corpus.
S202: dynamic padding is added to the model to speed up training.
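By way of illustration and not limitation, the configuration in S201-S202 could be expressed with the Hugging Face transformers library as in the sketch below; the patent does not name a particular library, and in the described method the weights would come from pre-training on nlp_chinese_corpus rather than from random initialization.

from transformers import BertConfig, BertForSequenceClassification
from transformers import BertTokenizerFast, DataCollatorWithPadding

# Teacher hyper-parameters from S201: 6 transformer blocks, embedding dimension 384,
# 12 attention heads (roughly 23M parameters in total).
teacher_config = BertConfig(
    num_hidden_layers=6,
    hidden_size=384,
    num_attention_heads=12,
    intermediate_size=4 * 384,
    num_labels=7,                      # e.g. the seven emotion classes; task-dependent
)
teacher = BertForSequenceClassification(teacher_config)   # weights then loaded from pre-training

# Dynamic padding (S202): pad each batch only to its own longest sequence
# instead of a fixed global maximum, which shortens training time.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
collator = DataCollatorWithPadding(tokenizer=tokenizer)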
Specifically, step S4 includes:
S401: the Student model likewise uses the BERT pre-trained language model released by Google, but here the Student model is made larger than the Teacher model in order to maximize what it can learn. The parameters used are: transformer_blocks = 12, embedding_dimension = 768, num_heads = 12, and total parameters ≈ 110M.
S402: adding noise to the dataset L + P before training the Student model is a data-augmentation transformation; for bullet screen text it is implemented as back-translation and TF-IDF word replacement.
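By way of illustration and not limitation, the TF-IDF word-replacement part of S402 could be realized as in the following sketch; the IDF threshold and replacement probability are assumptions chosen for the example, and back-translation (the other augmentation) would additionally require a Chinese-to-English-to-Chinese translation model and is not shown.

import random
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_word_replacement(sentences, idf_threshold=2.0, replace_prob=0.3, seed=0):
    """Replace some low-IDF (uninformative) words with random vocabulary words."""
    rng = random.Random(seed)
    tokenized = [list(jieba.cut(s)) for s in sentences]
    vectorizer = TfidfVectorizer(analyzer=lambda toks: toks)   # operate on pre-tokenized lists
    vectorizer.fit(tokenized)
    idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
    vocabulary = list(idf)
    augmented = []
    for toks in tokenized:
        new_tokens = []
        for t in toks:
            # a low IDF value means the word appears in many bullet screens and carries
            # little class information, so it is a safe candidate for replacement
            if idf.get(t, 0.0) < idf_threshold and rng.random() < replace_prob:
                new_tokens.append(rng.choice(vocabulary))
            else:
                new_tokens.append(t)
        augmented.append("".join(new_tokens))
    return augmented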
S403: consistency training is adopted throughout the training process. The unlabeled data is first augmented; the data before and after augmentation are then both fed into the network to obtain predictions, the KL divergence between the two predictions is computed as the unsupervised consistency loss, and this loss is added to the supervised cross-entropy loss for back-propagation.
Further, the objective function used in step S403 is as follows:
$$\mathcal{J}(\theta)=\mathbb{E}_{(x,y^{*})\in L}\big[-\log p_{\theta}(y^{*}\mid x)\big]+\lambda\;\mathbb{E}_{x\in U}\big[D_{\mathrm{KL}}\big(p_{\theta}(y\mid x)\,\|\,p_{\theta}(y\mid \hat{x})\big)\big]$$
where the first term is the supervised loss on the labeled data L, the second term is the consistency loss on the unlabeled data U with x̂ denoting the augmented version of x, and λ controls the relative weight of the two terms.
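By way of illustration and not limitation, the combined objective described above could be computed as follows (PyTorch notation; the framework and function names are assumptions made for this sketch, not part of the disclosed method).

import torch
import torch.nn.functional as F

def combined_loss(logits_labeled, labels, logits_unlabeled, logits_augmented, lam=1.0):
    """Supervised cross-entropy on labeled data plus lambda times the KL consistency loss."""
    supervised = F.cross_entropy(logits_labeled, labels)
    with torch.no_grad():                                   # prediction on the un-augmented data
        p_orig = F.softmax(logits_unlabeled, dim=-1)
    log_p_aug = F.log_softmax(logits_augmented, dim=-1)
    # KL( p(y|x) || p(y|x_augmented) ), averaged over the unlabeled batch
    consistency = F.kl_div(log_p_aug, p_orig, reduction="batchmean")
    return supervised + lam * consistency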
Specifically, step S6 includes:
S601: a subjective/objective classification model divides the bullet screen data into two classes according to whether subjective emotional information is present; the subjective bullet screens are then fed into the multi-emotion classification model for prediction, and topic prediction is performed on the objective bullet screens.
S602: in actual prediction, because different users' impressions of the same video at the same moment, and the emotions expressed by the bullet screens they post, are not entirely consistent, several emotions or topics may be predicted at the same point on the time axis. A directed graph is therefore used to record the classification results: each predicted label is a node, and the vote count of a prediction is used as the weight of the corresponding directed edge. Finally, a bullet screen emotion-time directed graph and a bullet screen topic-time directed graph are obtained.
S603: the directed graph is traversed; four traversal modes are considered: breadth-first search (BFS), depth-first search (DFS), weight-based breadth-first search (WBFS), and weight-based depth-first search (WDFS). Tests show that weight-based depth-first search (WDFS) gives the best classification results in practice.
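By way of illustration and not limitation, the directed-graph construction of S602 and the weight-based depth-first traversal (WDFS) of S603 could be sketched as follows; the exact graph layout used here (one node per predicted label per time slot, edges into the next slot weighted by that label's vote count) is one plausible reading of the description above, not a definitive implementation.

from collections import Counter, defaultdict

def build_label_time_graph(predictions):
    """predictions: (time_slot, label) pairs, one per bullet screen."""
    votes = defaultdict(Counter)                       # time_slot -> Counter of predicted labels
    for t, label in predictions:
        votes[t][label] += 1
    slots = sorted(votes)
    graph = defaultdict(list)                          # node (t, label) -> [(next node, edge weight)]
    for a, b in zip(slots, slots[1:]):
        for la in votes[a]:
            for lb, weight in votes[b].items():        # edge weight = vote count of the target label
                graph[(a, la)].append(((b, lb), weight))
    return graph, slots, votes

def wdfs(graph, start):
    """Depth-first traversal that always expands the heaviest outgoing edge first (WDFS)."""
    sequence, stack, seen = [], [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        sequence.append(node[1])                       # keep only the label for the output sequence
        # push lighter edges first so that the heaviest edge is popped (visited) next
        for nxt, _ in sorted(graph.get(node, []), key=lambda edge: edge[1]):
            stack.append(nxt)
    return sequence

# Example: graph, slots, votes = build_label_time_graph(preds)
#          start = (slots[0], votes[slots[0]].most_common(1)[0][0])
#          label_sequence = wdfs(graph, start)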
S604: the similarity of different videos is determined by comparing the Levenshtein distance between their two sequences, and the video classification task is completed according to the similarity scores of the different videos.
Further, the Levenshtein distance of the two sequences in step S604 is calculated as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if }\min(i,j)=0,\\[4pt]\min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_{i}\neq b_{j})}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b are the two sequences (strings), i and j index positions in a and b, and 1_{(a_i ≠ b_j)} equals 1 when the corresponding elements differ and 0 otherwise.
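By way of illustration and not limitation, a direct implementation of the recurrence above is sketched below; it can be applied to the label sequences produced by the traversal in S603.

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning sequence a into b."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n]

# e.g. levenshtein(["joy", "anger", "fear"], ["joy", "fear"]) == 1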
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (6)

1. A video classification method based on semi-supervised learning and barrage analysis comprises the following steps:
acquiring bullet screen data, preprocessing the bullet screens, and constructing a bullet screen dataset; segmenting the preprocessed training-set bullet screen data with the jieba tokenizer, marking sentences with a dictionary encoder, generating input sequences, and representing them as vectors.
2. The method according to claim 1, wherein the sequence feature vectors are fed into the pre-trained language model BERT, and the task-specific parameters in the deep feature space for the subjective/objective bullet screen classification task, the emotion multi-classification task, and the topic classification task are updated by learning on the labeled bullet screen dataset L, thereby obtaining the Teacher model.
3. The method of claim 2, wherein the Teacher model labels the unlabeled dataset to produce a pseudo-labeled dataset P.
4. The method of claim 3, wherein a larger model Student is trained on the L + P dataset and noise data is added to the dataset before training.
5. The method of claims 3 to 4, wherein the Student model, taken as the new Teacher model, continues to label the unlabeled dataset to generate a new pseudo-label dataset P and obtain a new Student model, until the model converges or computing resources are exhausted.
6. The method as claimed in claims 1 to 5, wherein the trained classifier classifies the test samples, fuses the temporal features with the classification results to obtain the corresponding graph structures of different videos, obtains the corresponding sequences by traversal, and completes video classification by comparing the sequence similarity of different videos.
CN202011204098.4A 2020-11-02 2020-11-02 Video classification method based on semi-supervised learning and bullet screen analysis Pending CN112364743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011204098.4A CN112364743A (en) 2020-11-02 2020-11-02 Video classification method based on semi-supervised learning and bullet screen analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011204098.4A CN112364743A (en) 2020-11-02 2020-11-02 Video classification method based on semi-supervised learning and bullet screen analysis

Publications (1)

Publication Number Publication Date
CN112364743A true CN112364743A (en) 2021-02-12

Family

ID=74513347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011204098.4A Pending CN112364743A (en) 2020-11-02 2020-11-02 Video classification method based on semi-supervised learning and bullet screen analysis

Country Status (1)

Country Link
CN (1) CN112364743A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177138A (en) * 2021-04-30 2021-07-27 南开大学 Supervised video classification method based on bullet screen and title analysis
CN114880478A (en) * 2022-06-07 2022-08-09 昆明理工大学 Weak supervision aspect category detection method based on theme information enhancement
CN116128768A (en) * 2023-04-17 2023-05-16 中国石油大学(华东) Unsupervised image low-illumination enhancement method with denoising module

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018014610A1 (en) * 2016-07-20 2018-01-25 武汉斗鱼网络科技有限公司 C4.5 decision tree algorithm-based specific user mining system and method therefor
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110569354A (en) * 2019-07-22 2019-12-13 中国农业大学 Barrage emotion analysis method and device
CN111860237A (en) * 2020-07-07 2020-10-30 中国科学技术大学 Video emotion fragment identification method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018014610A1 (en) * 2016-07-20 2018-01-25 武汉斗鱼网络科技有限公司 C4.5 decision tree algorithm-based specific user mining system and method therefor
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110569354A (en) * 2019-07-22 2019-12-13 中国农业大学 Barrage emotion analysis method and device
CN111860237A (en) * 2020-07-07 2020-10-30 中国科学技术大学 Video emotion fragment identification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
洪庆; 王思尧; 赵钦佩; 李江峰; 饶卫雄: "Video user group classification based on bullet-screen sentiment analysis and clustering algorithms" (基于弹幕情感分析和聚类算法的视频用户群体分类), 计算机工程与科学 (Computer Engineering and Science), no. 06, 15 June 2018 (2018-06-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177138A (en) * 2021-04-30 2021-07-27 南开大学 Supervised video classification method based on bullet screen and title analysis
CN114880478A (en) * 2022-06-07 2022-08-09 昆明理工大学 Weak supervision aspect category detection method based on theme information enhancement
CN114880478B (en) * 2022-06-07 2024-04-23 昆明理工大学 Weak supervision aspect category detection method based on theme information enhancement
CN116128768A (en) * 2023-04-17 2023-05-16 中国石油大学(华东) Unsupervised image low-illumination enhancement method with denoising module

Similar Documents

Publication Publication Date Title
CN113254599B (en) Multi-label microblog text classification method based on semi-supervised learning
Qiu et al. DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain
US20220245365A1 (en) Translation method and apparatus based on multimodal machine learning, device, and storage medium
Chang et al. Research on detection methods based on Doc2vec abnormal comments
CN112364743A (en) Video classification method based on semi-supervised learning and bullet screen analysis
CN111061861B (en) Text abstract automatic generation method based on XLNet
CN108108468A (en) A kind of short text sentiment analysis method and apparatus based on concept and text emotion
CN110263165A (en) A kind of user comment sentiment analysis method based on semi-supervised learning
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN114595327A (en) Data enhancement method and device, electronic equipment and storage medium
CN112131430A (en) Video clustering method and device, storage medium and electronic equipment
CN111897954A (en) User comment aspect mining system, method and storage medium
CN111382231A (en) Intention recognition system and method
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN114722805A (en) Little sample emotion classification method based on size instructor knowledge distillation
He et al. Deep learning in natural language generation from images
Jishan et al. Natural language description of images using hybrid recurrent neural network
Nassiri et al. Arabic L2 readability assessment: Dimensionality reduction study
CN115062174A (en) End-to-end image subtitle generating method based on semantic prototype tree
CN110765241A (en) Super-outline detection method and device for recommendation questions, electronic equipment and storage medium
Tüselmann et al. Recognition-free question answering on handwritten document collections
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
Tashu et al. Deep learning architecture for automatic essay scoring
CN112749566A (en) English writing auxiliary oriented semantic matching method and device
CN114997175A (en) Emotion analysis method based on field confrontation training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination