CN111031330A - Live webcast content analysis method based on multi-modal fusion
- Publication number: CN111031330A
- Application number: CN201911039049.7A
- Authority: CN (China)
- Prior art keywords: video, feature, audio, live, representation
- Prior art date: 2019-10-29
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- H04N21/2187 — Live feed
- G06N20/00 — Machine learning
- H04N21/233 — Processing of audio elementary streams
- H04N21/23418 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N21/4394 — Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
- H04N21/44008 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
- H04N21/8456 — Structuring of content by decomposing the content in the time domain, e.g. in time segments
Abstract
The invention relates to the technical field of artificial intelligence, and in particular to a live webcast content analysis method based on multi-modal fusion. Tailored to the characteristics of live video content analysis, the method provides a powerful auxiliary tool for monitoring live webcast content and reduces human-resource consumption. The method comprises: S1, preprocessing the video stream; S2, extracting elements (key frames, audio slices and bullet-screen text); S3, training a video feature learning network, a bullet-screen feature learning network and an audio feature learning network; S4, respectively feeding the video feature representation and the audio feature representation of each video slice into a long short-term memory network to obtain global feature vector representations of the video slice and the audio slice; S5, feeding these two feature vector representations together with the bullet-screen feature representation into a multi-expert decision model, which learns adaptive weights to fuse the multi-modal features into the final fused feature representation of the live content; and S6, inputting the fused feature representation into a classifier to obtain the final analysis result of the live content.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a live webcast content analysis method based on multi-modal fusion.
Background
The great success of deep learning in the vision field has made automatic analysis and supervision of live webcast content feasible. Analyzing live webcast content differs from single offline image content analysis (such as segmentation, detection and recognition): because live video expresses itself through multiple modalities (video images, sound and text bullet screens) and must be processed in real time, common single-modality content analysis techniques cannot meet the requirements. The invention therefore exploits the characteristics of live video and applies multi-modal techniques to the live-content analysis problem. On one hand, the existing multi-expert (mixture-of-experts) decision model [1] can fully utilize multi-modal information; on the other hand, video differs from still images in that there is temporal information and continuity between frames, which the long short-term memory network LSTM [2] can capture during feature extraction.
[1] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. International Conference on Learning Representations, 2017.
[2] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a practical and effective live webcast content analysis method based on multi-modal fusion, which addresses the characteristics of live video content analysis, provides a powerful auxiliary tool for live webcast content supervision, and reduces human-resource consumption.
The live webcast content analysis method based on multi-modal fusion disclosed by the invention comprises the following steps:
S1, for the input original video stream $V_s$, first cut it into 10-second video slices $V_{t_1}, V_{t_2}, \dots, V_{t_m}$ (a preprocessing sketch follows this step list);
S2, for each video slice $V_{t_i}$, first extract its audio stream $S_{t_i}$, then extract its bullet-screen sequence $T_{t_i}$, and finally extract its key frames $K_{t_i}$, giving the sequences $S_{t_1}, \dots, S_{t_m}$, $T_{t_1}, \dots, T_{t_m}$ and $K_{t_1}, \dots, K_{t_m}$;
S3, train a video feature learning network, a bullet-screen feature learning network and an audio feature learning network, and extract the features of the video slice key frames, bullet screens and audio streams;
S4, feed the audio features and key-frame features extracted in step S3 into the LSTM network to obtain their global representations $\hat{F}^S$ and $\hat{F}^V$;
S5, input $\hat{F}^V$, $\hat{F}^S$ and the bullet-screen feature $F^T$ into the multi-expert decision model MoE to obtain the final fused feature $F_{final}$:
$$F_{final} = \mathrm{MoE}(\hat{F}^V, \hat{F}^S, F^T)$$
S6, input $F_{final}$ into a classifier to obtain the final analysis result.
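As a rough illustration of the preprocessing in steps S1 and S2, the slicing and extraction could be driven by the ffmpeg command-line tool. This is a minimal sketch under stated assumptions: ffmpeg is assumed to be installed, the one-frame-per-second key-frame sampling rate and the 16 kHz mono audio format are illustrative choices not specified by the invention, and the bullet-screen sequence $T_{t_i}$ is not sketched because it travels in the platform's comment feed rather than in the audio/video stream.

```python
import subprocess
from pathlib import Path

def slice_stream(src: str, out_dir: str = "slices") -> list[Path]:
    """S1: cut the input video V_s into 10-second slices V_t1 ... V_tm."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", src, "-f", "segment", "-segment_time", "10",
         "-reset_timestamps", "1", "-c", "copy", f"{out_dir}/slice_%04d.mp4"],
        check=True)
    return sorted(Path(out_dir).glob("slice_*.mp4"))

def extract_audio(video_slice: Path) -> Path:
    """S2: extract the audio stream S_ti of one slice as 16 kHz mono WAV."""
    wav = video_slice.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-i", str(video_slice), "-vn", "-ac", "1", "-ar", "16000",
         str(wav)],
        check=True)
    return wav

def extract_keyframes(video_slice: Path, fps: float = 1.0) -> None:
    """S2: sample key frames K_ti from one slice (here one frame per second)."""
    pattern = video_slice.with_name(video_slice.stem + "_kf_%03d.jpg")
    subprocess.run(
        ["ffmpeg", "-i", str(video_slice), "-vf", f"fps={fps}", str(pattern)],
        check=True)
```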
In the disclosed live webcast content analysis method based on multi-modal fusion, the specific operation flow of step S3 is as follows:
(1) input $S_{t_1}, \dots, S_{t_m}$ into the audio feature learning network and extract the audio features $F^S_{t_1}, \dots, F^S_{t_m}$;
(2) input $T_{t_1}, \dots, T_{t_m}$ into the bullet-screen feature learning network and extract the bullet-screen feature $F^T$;
(3) input the key frames $K_{t_1}, \dots, K_{t_m}$ into the video feature learning network and extract the key-frame features $F^V_{t_1}, \dots, F^V_{t_m}$ (a sketch of these feature networks follows).
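The invention does not fix the architectures of the three feature learning networks; claim 2 names only a deep residual learning network for key frames, an audio feature deep learning network, and a text feature extraction model for bullet screens. Below is a minimal PyTorch sketch under those assumptions; the ResNet-50 backbone, the averaged-embedding text encoder, the vocabulary size and all dimensions are illustrative choices, and an audio encoder would be analogous (for example a CNN over log-mel spectrograms).

```python
import torch
import torch.nn as nn
import torchvision.models as models

class KeyframeEncoder(nn.Module):
    """S3(3) sketch: a deep residual network maps each key frame to a feature vector."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)   # pretrained weights omitted in the sketch
        # drop the final fc layer, keep convolutional trunk + global average pooling
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, frames):                   # frames: (batch, 3, 224, 224)
        return self.backbone(frames).flatten(1)  # (batch, 2048) key-frame features

class BulletScreenEncoder(nn.Module):
    """S3(2) sketch: averaged word embeddings stand in for the text feature model."""
    def __init__(self, vocab_size: int = 30000, dim: int = 300):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # default mode='mean'

    def forward(self, token_ids, offsets):
        # token_ids: flat 1-D tensor of token ids; offsets: start of each sample
        return self.embed(token_ids, offsets)          # (batch, 300) bullet-screen feature
```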
Further, in step S5, if $F_{final}$ is linearly related to $\hat{F}^V$, $\hat{F}^S$ and $F^T$, the fusion can be simplified as:
$$F_{final} = \lambda_1 \hat{F}^V + \lambda_2 \hat{F}^S + \lambda_3 F^T$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weights obtained by adaptive learning of the multi-expert decision model.
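To make the adaptive weighting concrete, here is a minimal PyTorch sketch of the linear S5 fusion together with the S6 judgment head. It assumes 1024-dimensional global video and audio features (as in claim 1), a bullet-screen feature projected to the same dimension, and a single softmax gating network producing the weights $\lambda$; the gating design, layer sizes and binary class count are illustrative assumptions, not the invention's prescribed implementation.

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Linear mixture-of-experts fusion:
    F_final = λ1·F̂_V + λ2·F̂_S + λ3·F_T, with λ from a learned gating network."""

    def __init__(self, dim: int = 1024, num_classes: int = 2):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)              # one logit per modality expert
        self.classifier = nn.Linear(dim, num_classes)  # S6: judgment head

    def forward(self, f_v, f_s, f_t):
        # f_v, f_s, f_t: (batch, dim) global video, audio and bullet-screen features
        lam = torch.softmax(self.gate(torch.cat([f_v, f_s, f_t], dim=-1)), dim=-1)
        f_final = lam[:, 0:1] * f_v + lam[:, 1:2] * f_s + lam[:, 2:3] * f_t
        return self.classifier(f_final)  # logits; a softmax gives the analysis result
```

Keeping the judgment in a separate softmax head matches the design advantage noted in the beneficial-effects discussion below: when the analysis targets of the live content change, only `classifier` needs retraining while the fusion network stays untouched.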
Compared with the prior art, the invention has the following beneficial effects. The analysis method fully utilizes the temporal information and inter-frame relations of live video and takes its multi-modal nature into account: each modality of the live stream is extracted, its features are learned by a dedicated deep network, global representations of the modal features are obtained by learning in the LSTM network [2], and these are fed into the multi-expert decision model [1], where a multi-expert decision algorithm learns each feature separately and assigns a different weight to each modality to fuse them into a final representation; finally, the final representation is sent into a softmax layer to obtain the judgment result. Because the judgment result is not produced directly by the multi-expert decision model network but by the concrete softmax judgment layer, a new analysis result can easily be obtained after the analysis targets of the live content change, without modifying the multi-expert decision model network. This design greatly improves the applicability of the whole algorithm.
Drawings
FIG. 1 is a flow chart of an analytical method of the present invention.
Detailed Description
The following describes the present invention in further detail with reference to fig. 1 and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Examples
S1, for the input original video stream $V_s$, first cut it into 10-second video slices $V_{t_1}, V_{t_2}, \dots, V_{t_m}$;
S2, for each video slice $V_{t_i}$, first extract its audio stream $S_{t_i}$, then extract its bullet-screen sequence $T_{t_i}$, and finally extract its key frames $K_{t_i}$;
S3, train the video feature learning network, the bullet-screen feature learning network and the audio feature learning network, and extract the features of the video slice key frames, bullet screens and audio streams:
(1) input $S_{t_1}, \dots, S_{t_m}$ into the audio feature learning network and extract the audio features $F^S_{t_1}, \dots, F^S_{t_m}$;
(2) input $T_{t_1}, \dots, T_{t_m}$ into the bullet-screen feature learning network and extract the bullet-screen feature $F^T$;
(3) input the key frames $K_{t_1}, \dots, K_{t_m}$ into the video feature learning network and extract the key-frame features $F^V_{t_1}, \dots, F^V_{t_m}$;
S4, feed the audio features and key-frame features extracted in step S3 into the LSTM network to obtain the global representations $\hat{F}^S$ and $\hat{F}^V$ (see the LSTM sketch after this step list);
S5, input $\hat{F}^V$, $\hat{F}^S$ and $F^T$ into the multi-expert decision model MoE to obtain the final fused feature:
$$F_{final} = \lambda_1 \hat{F}^V + \lambda_2 \hat{F}^S + \lambda_3 F^T$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weights obtained by adaptive learning of the multi-expert decision model;
S6, input $F_{final}$ into the classifier to obtain the final analysis result.
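For step S4, the following is a minimal sketch of how the per-slice feature sequences could be turned into global representations with an LSTM. It assumes PyTorch, 2048-dimensional per-frame features (matching a ResNet-50 backbone) and the 1024-dimensional global size named in claim 1; using the final hidden state as the global vector is one common choice, not the invention's mandated one.

```python
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    """S4 sketch: turn a sequence of per-frame (or per-audio-window) features
    into one 1024-dimensional global representation via an LSTM."""

    def __init__(self, feat_dim: int = 2048, hidden: int = 1024):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, seq):           # seq: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(seq)  # h_n: (num_layers, batch, hidden)
        return h_n[-1]                # (batch, 1024) global feature F̂

# illustrative use: 10 key-frame features per slice, batch of 4 slices
enc = GlobalEncoder()
frames = torch.randn(4, 10, 2048)
global_video_feat = enc(frames)       # shape: (4, 1024)
```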
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (2)
1. A live webcast content analysis method based on multi-modal fusion is characterized by comprising the following steps:
S1, video stream preprocessing: preprocessing an input original live webcast video stream and cutting it into ten-second video slices;
S2, element extraction: extracting key frames, audio slices and bullet-screen text from the video slices;
S3, training a video feature learning network, a bullet-screen feature learning network and an audio feature learning network, and extracting the features of the video slice key frames, bullet screens and audio streams;
S4, respectively sending the video feature representation and the audio feature representation of each video slice into a long short-term memory network to obtain 1024-dimensional feature vector representations of the video slice and the audio slice;
S5, simultaneously sending the two 1024-dimensional video and audio feature vector representations and the bullet-screen feature representation into a multi-expert decision model, and learning adaptive weights to fuse the multi-modal features into the final fused feature representation of the live content;
S6, inputting the fused feature representation into a classifier to obtain the final analysis result of the live content.
2. The live webcast content analysis method based on multi-modal fusion as claimed in claim 1, wherein the specific operation flow of step S3 is as follows:
(1) sending the extracted video key frames into a deep residual learning network and mapping them to a feature space to obtain the video feature representation;
(2) sending the extracted audio slices into an audio feature deep learning network and mapping them to the feature space to obtain the audio feature representation;
(3) sending the bullet-screen text into a text feature extraction model and mapping it to the feature space to obtain the text feature representation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911039049.7A | 2019-10-29 | 2019-10-29 | Live webcast content analysis method based on multi-modal fusion
Publications (1)
Publication Number | Publication Date |
---|---|
CN111031330A (en) | 2020-04-17
Family
ID=70204657
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911039049.7A Pending CN111031330A (en) | 2019-10-29 | 2019-10-29 | Live webcast content analysis method based on multi-mode fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111031330A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104199933A (en) * | 2014-09-04 | 2014-12-10 | 华中科技大学 | Multi-modal information fusion football video event detection and semantic annotation method |
CN107197384A (en) * | 2017-05-27 | 2017-09-22 | 北京光年无限科技有限公司 | The multi-modal exchange method of virtual robot and system applied to net cast platform |
CN108710918A (en) * | 2018-05-23 | 2018-10-26 | 北京奇艺世纪科技有限公司 | A kind of fusion method and device of the multi-modal information of live video |
CN109376603A (en) * | 2018-09-25 | 2019-02-22 | 北京周同科技有限公司 | A kind of video frequency identifying method, device, computer equipment and storage medium |
CN110222231A (en) * | 2019-06-11 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of temperature prediction technique of video clip |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723239A (en) * | 2020-05-11 | 2020-09-29 | 华中科技大学 | Multi-mode-based video annotation method |
CN111723239B (en) * | 2020-05-11 | 2023-06-16 | 华中科技大学 | Video annotation method based on multiple modes |
CN113315983A (en) * | 2021-05-17 | 2021-08-27 | 唐晓晖 | Live frame transmission system for 5G and 4G network aggregation |
CN114786035A (en) * | 2022-05-25 | 2022-07-22 | 上海氪信信息技术有限公司 | Compliance quality inspection and interactive question-answering system and method for live scene |
CN117218453A (en) * | 2023-11-06 | 2023-12-12 | 中国科学院大学 | Incomplete multi-mode medical image learning method |
CN117218453B (en) * | 2023-11-06 | 2024-01-16 | 中国科学院大学 | Incomplete multi-mode medical image learning method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200417 |