CN111031330A - Live webcast content analysis method based on multi-mode fusion - Google Patents

Live webcast content analysis method based on multi-mode fusion

Info

Publication number
CN111031330A
CN111031330A (application CN201911039049.7A)
Authority
CN
China
Prior art keywords: video, feature, audio, live, representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911039049.7A
Other languages
Chinese (zh)
Inventor
黄庆明
苏荔
周志达
杨士杰
吴益灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences
Priority to CN201911039049.7A
Publication of CN111031330A
Legal status (current): Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 Server components or server architectures
    • H04N 21/218 Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187 Live feed
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a live webcast content analysis method based on multi-modal fusion. Tailored to the characteristics of live webcast video content analysis, the method provides a powerful auxiliary tool for supervising live webcast content and reduces human-resource consumption. S1, preprocess the video stream; S2, extract the elements; S3, train a video feature learning network, a bullet-screen feature learning network and an audio feature learning network; S4, feed the video feature representation and the audio feature representation of each video slice into a long short-term memory network to obtain global feature vector representations of the video slice and the audio slice; S5, feed these two feature vector representations together with the bullet-screen feature representation into a multi-expert decision model, which learns adaptive weights to fuse the multi-modal features and obtain the final fused feature representation of the live content; and S6, input the fused feature representation into a classifier to obtain the final analysis result of the live content.

Description

Live webcast content analysis method based on multi-mode fusion
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a live webcast content analysis method based on multi-mode fusion.
Background
The great success of deep learning in the vision field has made the automatic analysis and supervision of live webcast content possible. Analyzing live webcast content differs from single offline image content analysis (such as segmentation, detection and recognition): because live video expresses itself through several modalities (video images, sound and text bullet screens) and must be processed in real time, common single-modal content analysis techniques cannot meet its requirements. Based on these characteristics of live video, multi-modal techniques are used to solve the live content analysis problem. On the one hand, the existing multi-expert model (mixture-of-experts decision model) [1] (N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations, 2017.) can make full use of multi-modal information; on the other hand, unlike still images, video carries temporal information and continuity between frames, so the long short-term memory network LSTM [2] (S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 1997, 9(8): 1735-1780.) can be used to extract temporal features.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a practical and effective live webcast content analysis method based on multi-modal fusion which, tailored to the characteristics of live video content analysis, provides a powerful auxiliary tool for the supervision of live webcast content and reduces human-resource consumption.
The live webcast content analysis method based on multi-modal fusion disclosed by the invention comprises the following steps:
S1, for the input original video stream $V_s$, first cut it into 10-second video slices $V_{t1}, V_{t2}, \dots, V_{tm}$;
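By way of illustration (not part of the original disclosure), step S1 can be realized with ffmpeg's segment muxer; the file names below are hypothetical:

```python
# Illustrative sketch of step S1: cut an input stream into 10-second slices
# V_t1 ... V_tm. Assumes ffmpeg is installed and on PATH; paths are hypothetical.
import subprocess

def slice_video(src: str, out_pattern: str = "slice_%03d.mp4") -> None:
    """Cut `src` into consecutive 10-second segments with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-i", src,
         "-c", "copy", "-map", "0",               # copy streams, keep all tracks
         "-f", "segment", "-segment_time", "10",  # 10-second slices
         "-reset_timestamps", "1", out_pattern],
        check=True,
    )

slice_video("live_stream.mp4")
```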
S2, for the video slices $V_{t1}, V_{t2}, \dots, V_{tm}$, first extract their audio streams $S_{t1}, S_{t2}, \dots, S_{tm}$, then extract their bullet-screen sequences $T_{t1}, T_{t2}, \dots, T_{tm}$, and finally extract the slice key frames $K_{t1}, K_{t2}, \dots, K_{tm}$;
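By way of illustration (again not part of the original disclosure), step S2 for one slice might look as follows; sampling one frame per second as the key frames is an assumption, and bullet-screen retrieval, which is platform-specific, is omitted:

```python
# Illustrative sketch of step S2 for a single slice: strip the audio track
# with ffmpeg and sample frames with OpenCV. One frame per second as the
# "key frames" K_ti is an assumption; the patent does not fix a strategy.
import subprocess
import cv2

def extract_audio(slice_path: str, wav_path: str) -> None:
    """Extract the audio stream S_ti from a video slice as 16 kHz mono WAV."""
    subprocess.run(["ffmpeg", "-i", slice_path, "-vn", "-ac", "1",
                    "-ar", "16000", wav_path], check=True)

def extract_key_frames(slice_path: str, every_n_sec: float = 1.0) -> list:
    """Return one frame per `every_n_sec` seconds as the slice key frames."""
    cap = cv2.VideoCapture(slice_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_sec))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```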
S3, train a video feature learning network, a bullet-screen feature learning network and an audio feature learning network, and extract the features of the video slice key frames, bullet screens and audio streams;
S4, send the audio features and the key-frame features extracted in step S3 into LSTM networks to obtain their global representations $F^g_a$ and $F^g_v$;
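A minimal sketch of step S4 in PyTorch (not the patent's code): the LSTM's final hidden state serves as the 1024-dimensional global representation named in claim 1; the 2048-d frame-feature and 128-d audio-feature sizes are assumptions:

```python
# Illustrative sketch of step S4: aggregate each per-slice feature sequence
# with an LSTM and take the last hidden state as the global representation.
import torch
import torch.nn as nn

video_lstm = nn.LSTM(input_size=2048, hidden_size=1024, batch_first=True)
audio_lstm = nn.LSTM(input_size=128, hidden_size=1024, batch_first=True)

def global_repr(lstm: nn.LSTM, seq: torch.Tensor) -> torch.Tensor:
    """(batch, time, feat) feature sequence -> (batch, 1024) global feature."""
    _, (h_n, _) = lstm(seq)
    return h_n[-1]                     # last layer's final hidden state

F_v_g = global_repr(video_lstm, torch.randn(4, 10, 2048))  # key-frame sequence
F_a_g = global_repr(audio_lstm, torch.randn(4, 10, 128))   # audio sequence
```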
S5, input the global representations $F^g_v$ and $F^g_a$ together with $F_t$ into the multi-expert decision model MoE to obtain the final fused feature $F_{final}$, according to the formula $F_{final} = \mathrm{MoE}(F^g_v, F^g_a, F_t)$;
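A minimal sketch of the fusion in step S5, assuming softmax gating and a shared 1024-dimensional feature space (the bullet-screen feature dimension is an assumption):

```python
# Illustrative sketch of step S5: a gating network learns adaptive weights
# over the three modality features and fuses them by weighted sum.
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)   # one adaptive weight per modality

    def forward(self, f_v: torch.Tensor, f_a: torch.Tensor, f_t: torch.Tensor):
        cat = torch.cat([f_v, f_a, f_t], dim=-1)
        lam = torch.softmax(self.gate(cat), dim=-1)   # weights sum to one
        return lam[:, 0:1] * f_v + lam[:, 1:2] * f_a + lam[:, 2:3] * f_t

moe = ModalityMoE()
F_final = moe(torch.randn(4, 1024), torch.randn(4, 1024), torch.randn(4, 1024))
```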
S6, input $F_{final}$ into a classifier to obtain the final analysis result.
In the live webcast content analysis method disclosed by the invention, the specific operation flow of step S3 is as follows:
(1) input $S_{t1}, S_{t2}, \dots, S_{tm}$ into the audio feature learning network and extract the features $F^a_{t1}, F^a_{t2}, \dots, F^a_{tm}$;
(2) input $T_{t1}, T_{t2}, \dots, T_{tm}$ into the bullet-screen feature learning network and extract the feature $F_t$;
(3) input the key frames $K_{t1}, K_{t2}, \dots, K_{tm}$ into the video feature learning network to obtain $F^v_{t1}, F^v_{t2}, \dots, F^v_{tm}$.
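As an illustration of step S3, the three extractors could be sketched as below; only the residual network for key frames is named by the patent (in claim 2), while the spectrogram CNN and bag-of-embeddings text encoder are assumed stand-ins:

```python
# Illustrative stand-ins for the three feature learning networks of step S3.
import torch
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 for the weights API

# (1) video: ResNet feature extractor with the classification head removed
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
video_encoder = nn.Sequential(*list(resnet.children())[:-1])   # -> (B, 2048, 1, 1)

# (2) audio: toy CNN over log-mel spectrograms (assumed input (B, 1, 64, T))
audio_encoder = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
)

# (3) bullet screen: averaged word embeddings over token ids (vocab assumed)
text_encoder = nn.EmbeddingBag(num_embeddings=50000, embedding_dim=1024)

frame_feat = video_encoder(torch.randn(1, 3, 224, 224)).flatten(1)  # (1, 2048)
```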
Further, in step S5, if $F_{final}$ is in a linear relationship with $F^g_v$, $F^g_a$ and $F_t$, the fusion can be simplified as

$F_{final} = \lambda_1 F^g_v + \lambda_2 F^g_a + (1 - \lambda_1 - \lambda_2) F_t$,

where $\lambda_1$ and $\lambda_2$ are weights obtained through the adaptive learning of the multi-expert decision model.
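As a worked restatement (our reconstruction, since the original formula images are not preserved in the text), this simplification is the standard reduction of a softmax-gated mixture to a convex combination, which is why only two free weights remain:

```latex
% Reconstruction: softmax gates are non-negative and sum to one, so the
% mixture reduces to a convex combination with two free weights.
F_{\mathrm{final}} = \lambda_1 F^{g}_{v} + \lambda_2 F^{g}_{a} + \lambda_3 F_t,
\qquad \lambda_i \ge 0,\quad \lambda_1 + \lambda_2 + \lambda_3 = 1 .
```

Substituting $\lambda_3 = 1 - \lambda_1 - \lambda_2$ gives the simplified form above, in which only $\lambda_1$ and $\lambda_2$ are independently learned.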
Compared with the prior art, the beneficial effects of the invention are as follows: the analysis method makes full use of the temporal information and inter-frame relationships of live video while taking its multi-modal character into account. It extracts the modal information of the live video, extracts features with a dedicated deep network for each modality, sends them into the LSTM network [2] to learn global representations of the modal features, and then sends these into the multi-expert decision model network [1], where the multi-expert decision algorithm learns each feature separately and assigns a different weight to each modality, fusing them into a final representation that is finally sent into a softmax layer to obtain the judgment result. The advantage of not obtaining the discrimination result directly from the multi-expert decision model network is that, after the analysis indexes of the live content are changed, a new analysis result can easily be obtained without modifying the multi-expert decision model network, because the specific judgment layer is the softmax. This design greatly improves the applicability of the whole algorithm.
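A minimal sketch of this design point (class counts are hypothetical, not from the patent): because the judgment layer is a plain softmax on top of the fused feature, changing the analysis indexes only means swapping that layer while the fusion network stays untouched.

```python
# Hypothetical head-swap sketch: the fusion network's 1024-d output is fixed,
# so switching from one set of analysis indexes to another only replaces the
# softmax judgment layer.
import torch
import torch.nn as nn

fusion_dim = 1024
old_head = nn.Linear(fusion_dim, 2)    # e.g. compliant / violating
new_head = nn.Linear(fusion_dim, 5)    # a new, finer-grained analysis index

F_final = torch.randn(4, fusion_dim)   # output of the unchanged fusion network
probs = torch.softmax(new_head(F_final), dim=-1)   # new analysis result, (4, 5)
```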
Drawings
FIG. 1 is a flow chart of the analysis method of the present invention.
Detailed Description
The following describes the present invention in further detail with reference to FIG. 1 and the examples. The following examples are intended to illustrate the invention but not to limit its scope.
Examples
S1, for the input original video stream $V_s$, first cut it into 10-second video slices $V_{t1}, V_{t2}, \dots, V_{tm}$;
S2, for the video slices $V_{t1}, V_{t2}, \dots, V_{tm}$, first extract their audio streams $S_{t1}, S_{t2}, \dots, S_{tm}$, then extract their bullet-screen sequences $T_{t1}, T_{t2}, \dots, T_{tm}$, and finally extract the slice key frames $K_{t1}, K_{t2}, \dots, K_{tm}$;
S3, train a video feature learning network, a bullet-screen feature learning network and an audio feature learning network, and extract the features of the video slice key frames, bullet screens and audio streams:
(1) input $S_{t1}, S_{t2}, \dots, S_{tm}$ into the audio feature learning network and extract the features $F^a_{t1}, F^a_{t2}, \dots, F^a_{tm}$;
(2) input $T_{t1}, T_{t2}, \dots, T_{tm}$ into the bullet-screen feature learning network and extract the feature $F_t$;
(3) input the key frames $K_{t1}, K_{t2}, \dots, K_{tm}$ into the video feature learning network to obtain $F^v_{t1}, F^v_{t2}, \dots, F^v_{tm}$;
S4, send the audio features and the key-frame features extracted in step S3 into LSTM networks to obtain their global representations $F^g_a$ and $F^g_v$;
S5, input the global representations $F^g_v$ and $F^g_a$ together with $F_t$ into the multi-expert decision model MoE to obtain the final fused feature $F_{final}$, according to the formula $F_{final} = \mathrm{MoE}(F^g_v, F^g_a, F_t)$.
If $F_{final}$ is in a linear relationship with $F^g_v$, $F^g_a$ and $F_t$, this can be simplified as $F_{final} = \lambda_1 F^g_v + \lambda_2 F^g_a + (1 - \lambda_1 - \lambda_2) F_t$, where $\lambda_1$ and $\lambda_2$ are weights obtained through the adaptive learning of the multi-expert decision model;
S6, input $F_{final}$ into a classifier to obtain the final analysis result.
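To make the embodiment concrete, the following self-contained sketch (not part of the original disclosure) assembles steps S4 to S6 into one PyTorch module; the 2048-d key-frame features, 128-d audio features, 1024-d bullet-screen feature and binary class count are all illustrative assumptions:

```python
# Illustrative end-to-end sketch of steps S4-S6: LSTM global representations,
# adaptive-weight fusion, and a softmax classifier on the fused feature.
import torch
import torch.nn as nn

class LiveContentAnalyser(nn.Module):
    def __init__(self, n_classes: int = 2, dim: int = 1024):
        super().__init__()
        self.video_lstm = nn.LSTM(2048, dim, batch_first=True)  # key-frame features
        self.audio_lstm = nn.LSTM(128, dim, batch_first=True)   # audio features
        self.gate = nn.Linear(3 * dim, 3)                       # adaptive weights
        self.head = nn.Linear(dim, n_classes)                   # judgment layer

    def forward(self, frame_feats, audio_feats, danmaku_feat):
        # S4: last hidden states as global representations
        _, (h_v, _) = self.video_lstm(frame_feats)
        _, (h_a, _) = self.audio_lstm(audio_feats)
        f_v, f_a = h_v[-1], h_a[-1]
        # S5: adaptive-weight fusion of the three modalities
        lam = torch.softmax(self.gate(torch.cat([f_v, f_a, danmaku_feat], dim=-1)), dim=-1)
        f_final = lam[:, 0:1] * f_v + lam[:, 1:2] * f_a + lam[:, 2:3] * danmaku_feat
        # S6: softmax classifier on the fused feature
        return torch.softmax(self.head(f_final), dim=-1)

model = LiveContentAnalyser()
probs = model(torch.randn(2, 10, 2048), torch.randn(2, 10, 128), torch.randn(2, 1024))
print(probs.shape)  # torch.Size([2, 2])
```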
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and such modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A live webcast content analysis method based on multi-modal fusion, characterized by comprising the following steps:
S1, video stream preprocessing: preprocessing the input original live webcast video stream and cutting it into ten-second video slices;
S2, element extraction: extracting key frames, audio slices and bullet-screen characters from the video slices;
S3, training a video feature learning network, a bullet-screen feature learning network and an audio feature learning network, and extracting the features of the video slice key frames, bullet screens and audio streams;
S4, respectively sending the video feature representation and the audio feature representation of each video slice into a long short-term memory network to obtain 1024-dimensional feature vector representations of the video slice and the audio slice;
S5, simultaneously sending the two 1024-dimensional video and audio feature vector representations together with the bullet-screen feature representation into a multi-expert decision model, and learning adaptive weights to fuse the multi-modal features into the final fused feature representation of the live content;
and S6, inputting the fused feature representation into a classifier to obtain the final analysis result of the live content.
2. The live webcast content analysis method based on multi-modal fusion as claimed in claim 1, wherein the specific operation flow of step S3 is as follows:
(1) sending the extracted video key frames into a deep residual learning network and mapping them to a feature space to obtain the video feature representation;
(2) sending the extracted audio slices into an audio feature deep learning network and mapping them to the feature space to obtain the audio feature representation;
(3) sending the bullet-screen characters into a text feature extraction model and mapping them to the feature space to obtain the text feature representation.
CN201911039049.7A 2019-10-29 2019-10-29 Live webcast content analysis method based on multi-mode fusion Pending CN111031330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911039049.7A CN111031330A (en) 2019-10-29 2019-10-29 Live webcast content analysis method based on multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911039049.7A CN111031330A (en) 2019-10-29 2019-10-29 Live webcast content analysis method based on multi-mode fusion

Publications (1)

Publication Number Publication Date
CN111031330A (en) 2020-04-17

Family

ID: 70204657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911039049.7A Pending CN111031330A (en) 2019-10-29 2019-10-29 Live webcast content analysis method based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN111031330A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199933A (en) * 2014-09-04 2014-12-10 华中科技大学 Multi-modal information fusion football video event detection and semantic annotation method
CN107197384A (en) * 2017-05-27 2017-09-22 北京光年无限科技有限公司 The multi-modal exchange method of virtual robot and system applied to net cast platform
CN108710918A (en) * 2018-05-23 2018-10-26 北京奇艺世纪科技有限公司 A kind of fusion method and device of the multi-modal information of live video
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium
CN110222231A (en) * 2019-06-11 2019-09-10 成都澳海川科技有限公司 A kind of temperature prediction technique of video clip

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723239A (en) * 2020-05-11 2020-09-29 华中科技大学 Multi-mode-based video annotation method
CN111723239B (en) * 2020-05-11 2023-06-16 华中科技大学 Video annotation method based on multiple modes
CN113315983A (en) * 2021-05-17 2021-08-27 唐晓晖 Live frame transmission system for 5G and 4G network aggregation
CN114786035A (en) * 2022-05-25 2022-07-22 上海氪信信息技术有限公司 Compliance quality inspection and interactive question-answering system and method for live scene
CN117218453A (en) * 2023-11-06 2023-12-12 中国科学院大学 Incomplete multi-mode medical image learning method
CN117218453B (en) * 2023-11-06 2024-01-16 中国科学院大学 Incomplete multi-mode medical image learning method

Similar Documents

Publication Publication Date Title
CN111031330A (en) Live webcast content analysis method based on multi-mode fusion
US11412023B2 (en) Video description generation method and apparatus, video playing method and apparatus, and storage medium
CN109977921B (en) Method for detecting hidden danger of power transmission line
CN109495766A (en) A kind of method, apparatus, equipment and the storage medium of video audit
CN110796098B (en) Method, device, equipment and storage medium for training and auditing content auditing model
CN105138953B (en) A method of action recognition in the video based on continuous more case-based learnings
CN111382623A (en) Live broadcast auditing method, device, server and storage medium
US11868738B2 (en) Method and apparatus for generating natural language description information
CN111582122B (en) System and method for intelligently analyzing behaviors of multi-dimensional pedestrians in surveillance video
CN111008337B (en) Deep attention rumor identification method and device based on ternary characteristics
CN113766299B (en) Video data playing method, device, equipment and medium
CN112488071B (en) Method, device, electronic equipment and storage medium for extracting pedestrian features
CN111783712A (en) Video processing method, device, equipment and medium
CN111259720A (en) Unsupervised pedestrian re-identification method based on self-supervision agent feature learning
CN112149642A (en) Text image recognition method and device
CN109614896A (en) A method of the video content semantic understanding based on recursive convolution neural network
CN112183672A (en) Image classification method, and training method and device of feature extraction network
CN112434178A (en) Image classification method and device, electronic equipment and storage medium
CN111914649A (en) Face recognition method and device, electronic equipment and storage medium
CN111597361B (en) Multimedia data processing method, device, storage medium and equipment
CN114302157B (en) Attribute tag identification and substitution event detection methods, device, equipment and medium thereof
CN112235517B (en) Method for adding white-matter, device for adding white-matter, and storage medium
CN114581994A (en) Class attendance management method and system
CN114363664A (en) Method and device for generating video collection title
CN114357301A (en) Data processing method, device and readable storage medium

Legal Events

Code / Title
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2020-04-17)