CN111031330A - Live webcast content analysis method based on multi-modal fusion
- Publication number: CN111031330A
- Application number: CN201911039049.7A
- Authority: CN (China)
- Prior art keywords: video, feature, audio, live, representation
- Prior art date: 2019-10-29
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- H04N21/2187 — Live feed
- G06N20/00 — Machine learning
- H04N21/233 — Processing of audio elementary streams
- H04N21/23418 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N21/4394 — Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
- H04N21/44008 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
- H04N21/8456 — Structuring of content by decomposing the content in the time domain, e.g. in time segments
Abstract
The invention relates to the technical field of artificial intelligence, and in particular to a live webcast content analysis method based on multi-modal fusion. Tailored to the characteristics of live video content analysis, the method provides a powerful auxiliary tool for monitoring live webcast content and reduces human-resource consumption. The method comprises: S1, preprocessing the video stream; S2, extracting elements (key frames, audio slices and bullet-screen text); S3, training a video feature learning network, a bullet-screen feature learning network and an audio feature learning network; S4, respectively feeding the video feature representation and the audio feature representation of each video slice into a long short-term memory network to obtain global feature vector representations of the video slice and the audio slice; S5, feeding these two feature vector representations together with the bullet-screen feature representation into a multi-expert decision model, which learns adaptive weights to fuse the multi-modal features into the final fused feature representation of the live content; and S6, inputting the fused feature representation into a classifier to obtain the final analysis result of the live content.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a live webcast content analysis method based on multi-modal fusion.
Background
The great success of deep learning in the vision field has made automatic analysis and supervision of live webcast content feasible. Analyzing live webcast content differs from single offline image content analysis (such as segmentation, detection and recognition): because live video expresses itself through multiple modalities (video images, sound and text bullet screens) and must be processed in real time, common single-modality content analysis techniques cannot meet the requirements. The invention therefore exploits the characteristics of live video and applies multi-modal techniques to the live-content analysis problem. On one hand, the existing multi-expert (mixture-of-experts) decision model [1] can fully utilize multi-modal information; on the other hand, video differs from still images in that there is temporal information and continuity between frames, which the long short-term memory network LSTM [2] can capture during feature extraction.
[1] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. International Conference on Learning Representations, 2017.
[2] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a practical and effective live webcast content analysis method based on multi-modal fusion, which addresses the characteristics of live video content analysis, provides a powerful auxiliary tool for live webcast content supervision, and reduces human-resource consumption.
The live webcast content analysis method based on multi-modal fusion disclosed by the invention comprises the following steps:
S1, for the input original video stream $V_s$, first cut it into 10-second video slices $V_{t_1}, V_{t_2}, \dots, V_{t_m}$ (a preprocessing sketch follows this step list);
S2, for each video slice $V_{t_i}$, first extract its audio stream $S_{t_i}$, then extract its bullet-screen sequence $T_{t_i}$, and finally extract its key frames $K_{t_i}$, giving the sequences $S_{t_1}, \dots, S_{t_m}$, $T_{t_1}, \dots, T_{t_m}$ and $K_{t_1}, \dots, K_{t_m}$;
S3, train a video feature learning network, a bullet-screen feature learning network and an audio feature learning network, and extract the features of the video slice key frames, bullet screens and audio streams;
S4, feed the audio features and key-frame features extracted in step S3 into the LSTM network to obtain their global representations $\hat{F}^S$ and $\hat{F}^V$;
S5, input $\hat{F}^V$, $\hat{F}^S$ and the bullet-screen feature $F^T$ into the multi-expert decision model MoE to obtain the final fused feature $F_{final}$:
$$F_{final} = \mathrm{MoE}(\hat{F}^V, \hat{F}^S, F^T)$$
S6, input $F_{final}$ into a classifier to obtain the final analysis result.
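As a rough illustration of the preprocessing in steps S1 and S2, the slicing and extraction could be driven by the ffmpeg command-line tool. This is a minimal sketch under stated assumptions: ffmpeg is assumed to be installed, the one-frame-per-second key-frame sampling rate and the 16 kHz mono audio format are illustrative choices not specified by the invention, and the bullet-screen sequence $T_{t_i}$ is not sketched because it travels in the platform's comment feed rather than in the audio/video stream.

```python
import subprocess
from pathlib import Path

def slice_stream(src: str, out_dir: str = "slices") -> list[Path]:
    """S1: cut the input video V_s into 10-second slices V_t1 ... V_tm."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", src, "-f", "segment", "-segment_time", "10",
         "-reset_timestamps", "1", "-c", "copy", f"{out_dir}/slice_%04d.mp4"],
        check=True)
    return sorted(Path(out_dir).glob("slice_*.mp4"))

def extract_audio(video_slice: Path) -> Path:
    """S2: extract the audio stream S_ti of one slice as 16 kHz mono WAV."""
    wav = video_slice.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-i", str(video_slice), "-vn", "-ac", "1", "-ar", "16000",
         str(wav)],
        check=True)
    return wav

def extract_keyframes(video_slice: Path, fps: float = 1.0) -> None:
    """S2: sample key frames K_ti from one slice (here one frame per second)."""
    pattern = video_slice.with_name(video_slice.stem + "_kf_%03d.jpg")
    subprocess.run(
        ["ffmpeg", "-i", str(video_slice), "-vf", f"fps={fps}", str(pattern)],
        check=True)
```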
In the disclosed live webcast content analysis method based on multi-modal fusion, the specific operation flow of step S3 is as follows:
(1) input $S_{t_1}, \dots, S_{t_m}$ into the audio feature learning network and extract the audio features $F^S_{t_1}, \dots, F^S_{t_m}$;
(2) input $T_{t_1}, \dots, T_{t_m}$ into the bullet-screen feature learning network and extract the bullet-screen feature $F^T$;
(3) input the key frames $K_{t_1}, \dots, K_{t_m}$ into the video feature learning network and extract the key-frame features $F^V_{t_1}, \dots, F^V_{t_m}$ (a sketch of these feature networks follows).
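The invention does not fix the architectures of the three feature learning networks; claim 2 names only a deep residual learning network for key frames, an audio feature deep learning network, and a text feature extraction model for bullet screens. Below is a minimal PyTorch sketch under those assumptions; the ResNet-50 backbone, the averaged-embedding text encoder, the vocabulary size and all dimensions are illustrative choices, and an audio encoder would be analogous (for example a CNN over log-mel spectrograms).

```python
import torch
import torch.nn as nn
import torchvision.models as models

class KeyframeEncoder(nn.Module):
    """S3(3) sketch: a deep residual network maps each key frame to a feature vector."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)   # pretrained weights omitted in the sketch
        # drop the final fc layer, keep convolutional trunk + global average pooling
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, frames):                   # frames: (batch, 3, 224, 224)
        return self.backbone(frames).flatten(1)  # (batch, 2048) key-frame features

class BulletScreenEncoder(nn.Module):
    """S3(2) sketch: averaged word embeddings stand in for the text feature model."""
    def __init__(self, vocab_size: int = 30000, dim: int = 300):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # default mode='mean'

    def forward(self, token_ids, offsets):
        # token_ids: flat 1-D tensor of token ids; offsets: start of each sample
        return self.embed(token_ids, offsets)          # (batch, 300) bullet-screen feature
```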
Further, in step S5, if $F_{final}$ is linearly related to $\hat{F}^V$, $\hat{F}^S$ and $F^T$, the fusion can be simplified as:
$$F_{final} = \lambda_1 \hat{F}^V + \lambda_2 \hat{F}^S + \lambda_3 F^T$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weights obtained by adaptive learning of the multi-expert decision model.
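To make the adaptive weighting concrete, here is a minimal PyTorch sketch of the linear S5 fusion together with the S6 judgment head. It assumes 1024-dimensional global video and audio features (as in claim 1), a bullet-screen feature projected to the same dimension, and a single softmax gating network producing the weights $\lambda$; the gating design, layer sizes and binary class count are illustrative assumptions, not the invention's prescribed implementation.

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Linear mixture-of-experts fusion:
    F_final = λ1·F̂_V + λ2·F̂_S + λ3·F_T, with λ from a learned gating network."""

    def __init__(self, dim: int = 1024, num_classes: int = 2):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)              # one logit per modality expert
        self.classifier = nn.Linear(dim, num_classes)  # S6: judgment head

    def forward(self, f_v, f_s, f_t):
        # f_v, f_s, f_t: (batch, dim) global video, audio and bullet-screen features
        lam = torch.softmax(self.gate(torch.cat([f_v, f_s, f_t], dim=-1)), dim=-1)
        f_final = lam[:, 0:1] * f_v + lam[:, 1:2] * f_s + lam[:, 2:3] * f_t
        return self.classifier(f_final)  # logits; a softmax gives the analysis result
```

Keeping the judgment in a separate softmax head matches the design advantage noted in the beneficial-effects discussion below: when the analysis targets of the live content change, only `classifier` needs retraining while the fusion network stays untouched.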
Compared with the prior art, the invention has the following beneficial effects. The analysis method fully utilizes the temporal information and inter-frame relations of live video and takes its multi-modal nature into account: each modality of the live stream is extracted, its features are learned by a dedicated deep network, global representations of the modal features are obtained by learning in the LSTM network [2], and these are fed into the multi-expert decision model [1], where a multi-expert decision algorithm learns each feature separately and assigns a different weight to each modality to fuse them into a final representation; finally, the final representation is sent into a softmax layer to obtain the judgment result. Because the judgment result is not produced directly by the multi-expert decision model network but by the concrete softmax judgment layer, a new analysis result can easily be obtained after the analysis targets of the live content change, without modifying the multi-expert decision model network. This design greatly improves the applicability of the whole algorithm.
Drawings
FIG. 1 is a flow chart of an analytical method of the present invention.
Detailed Description
The following describes the present invention in further detail with reference to fig. 1 and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Examples
S1, for the input original video stream $V_s$, first cut it into 10-second video slices $V_{t_1}, V_{t_2}, \dots, V_{t_m}$;
S2, for each video slice $V_{t_i}$, first extract its audio stream $S_{t_i}$, then extract its bullet-screen sequence $T_{t_i}$, and finally extract its key frames $K_{t_i}$;
S3, train the video feature learning network, the bullet-screen feature learning network and the audio feature learning network, and extract the features of the video slice key frames, bullet screens and audio streams:
(1) input $S_{t_1}, \dots, S_{t_m}$ into the audio feature learning network and extract the audio features $F^S_{t_1}, \dots, F^S_{t_m}$;
(2) input $T_{t_1}, \dots, T_{t_m}$ into the bullet-screen feature learning network and extract the bullet-screen feature $F^T$;
(3) input the key frames $K_{t_1}, \dots, K_{t_m}$ into the video feature learning network and extract the key-frame features $F^V_{t_1}, \dots, F^V_{t_m}$;
S4, feed the audio features and key-frame features extracted in step S3 into the LSTM network to obtain the global representations $\hat{F}^S$ and $\hat{F}^V$ (see the LSTM sketch after this step list);
S5, input $\hat{F}^V$, $\hat{F}^S$ and $F^T$ into the multi-expert decision model MoE to obtain the final fused feature:
$$F_{final} = \lambda_1 \hat{F}^V + \lambda_2 \hat{F}^S + \lambda_3 F^T$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weights obtained by adaptive learning of the multi-expert decision model;
S6, input $F_{final}$ into the classifier to obtain the final analysis result.
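For step S4, the following is a minimal sketch of how the per-slice feature sequences could be turned into global representations with an LSTM. It assumes PyTorch, 2048-dimensional per-frame features (matching a ResNet-50 backbone) and the 1024-dimensional global size named in claim 1; using the final hidden state as the global vector is one common choice, not the invention's mandated one.

```python
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    """S4 sketch: turn a sequence of per-frame (or per-audio-window) features
    into one 1024-dimensional global representation via an LSTM."""

    def __init__(self, feat_dim: int = 2048, hidden: int = 1024):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, seq):           # seq: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(seq)  # h_n: (num_layers, batch, hidden)
        return h_n[-1]                # (batch, 1024) global feature F̂

# illustrative use: 10 key-frame features per slice, batch of 4 slices
enc = GlobalEncoder()
frames = torch.randn(4, 10, 2048)
global_video_feat = enc(frames)       # shape: (4, 1024)
```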
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (2)
1. A live webcast content analysis method based on multi-modal fusion is characterized by comprising the following steps:
S1, video stream preprocessing: preprocessing an input original live webcast video stream and cutting it into ten-second video slices;
S2, element extraction: extracting key frames, audio slices and bullet-screen text from the video slices;
S3, training a video feature learning network, a bullet-screen feature learning network and an audio feature learning network, and extracting the features of the video slice key frames, bullet screens and audio streams;
S4, respectively sending the video feature representation and the audio feature representation of each video slice into a long short-term memory network to obtain 1024-dimensional feature vector representations of the video slice and the audio slice;
S5, simultaneously sending the two 1024-dimensional video and audio feature vector representations and the bullet-screen feature representation into a multi-expert decision model, and learning adaptive weights to fuse the multi-modal features into the final fused feature representation of the live content;
S6, inputting the fused feature representation into a classifier to obtain the final analysis result of the live content.
2. The live webcast content analysis method based on multi-modal fusion as claimed in claim 1, wherein the specific operation flow of step S3 is as follows:
(1) sending the extracted video key frames into a deep residual learning network and mapping them to a feature space to obtain the video feature representation;
(2) sending the extracted audio slices into an audio feature deep learning network and mapping them to the feature space to obtain the audio feature representation;
(3) sending the bullet-screen text into a text feature extraction model and mapping it to the feature space to obtain the text feature representation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911039049.7A | 2019-10-29 | 2019-10-29 | Live webcast content analysis method based on multi-modal fusion
Publications (1)
Publication Number | Publication Date |
---|---|
CN111031330A (en) | 2020-04-17
Family
ID=70204657
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911039049.7A Pending CN111031330A (en) | 2019-10-29 | 2019-10-29 | Live webcast content analysis method based on multi-mode fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111031330A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104199933A (en) * | 2014-09-04 | 2014-12-10 | 华中科技大学 | Multi-modal information fusion football video event detection and semantic annotation method |
CN107197384A (en) * | 2017-05-27 | 2017-09-22 | 北京光年无限科技有限公司 | The multi-modal exchange method of virtual robot and system applied to net cast platform |
CN108710918A (en) * | 2018-05-23 | 2018-10-26 | 北京奇艺世纪科技有限公司 | A kind of fusion method and device of the multi-modal information of live video |
CN109376603A (en) * | 2018-09-25 | 2019-02-22 | 北京周同科技有限公司 | A kind of video frequency identifying method, device, computer equipment and storage medium |
CN110222231A (en) * | 2019-06-11 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of temperature prediction technique of video clip |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723239A (en) * | 2020-05-11 | 2020-09-29 | 华中科技大学 | Multi-mode-based video annotation method |
CN111723239B (en) * | 2020-05-11 | 2023-06-16 | 华中科技大学 | Video annotation method based on multiple modes |
CN113315983A (en) * | 2021-05-17 | 2021-08-27 | 唐晓晖 | Live frame transmission system for 5G and 4G network aggregation |
CN114786035A (en) * | 2022-05-25 | 2022-07-22 | 上海氪信信息技术有限公司 | Compliance quality inspection and interactive question-answering system and method for live scene |
CN117218453A (en) * | 2023-11-06 | 2023-12-12 | 中国科学院大学 | Incomplete multi-mode medical image learning method |
CN117218453B (en) * | 2023-11-06 | 2024-01-16 | 中国科学院大学 | Incomplete multi-mode medical image learning method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200417 |