CN115831119B - Speaker detection and subtitle generation method based on cross-attention mechanism - Google Patents

Speaker detection and subtitle generation method based on cross-attention mechanism

Info

Publication number
CN115831119B
CN115831119B (application number CN202211561326.2A)
Authority
CN
China
Prior art keywords
audio
visual
features
speaker detection
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211561326.2A
Other languages
Chinese (zh)
Other versions
CN115831119A (en)
Inventor
肖业伟
刘烜铭
滕连伟
朱澳苏
田丕承
黄健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202211561326.2A priority Critical patent/CN115831119B/en
Publication of CN115831119A publication Critical patent/CN115831119A/en
Application granted granted Critical
Publication of CN115831119B publication Critical patent/CN115831119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a speaker detection and caption generation method based on a cross-attention mechanism, relating to the technical field of active speaker detection and caption generation and comprising the following steps: (1) acquire the datasets; (2) design the algorithm model to obtain the active speaker detection and caption generation model; (3) preprocess the data; (4) train the designed active speaker detection and caption generation model on the preprocessed data to obtain the trained model; (5) demonstrate active speaker detection and caption generation, displaying the generated captions below the video. By designing the visual temporal encoder and audio temporal encoder of the model and applying a cross-attention mechanism so that the audio and visual information learn from each other while spatio-temporal audio-visual features are extracted, the caption generator can produce captions for the corresponding speaker in multi-speaker scenes and remain accurate in complex acoustic environments.

Description

Speaker detection and subtitle generation method based on cross-attention mechanism
Technical Field
The invention relates to the technical field of active speaker detection and caption generation, in particular to a speaker detection and caption generation method based on a cross-attention mechanism.
Background
Speech is the most important, most efficient, most commonly used and most convenient way for humans to exchange information. Hearing-impaired people, however, face obstacles in both producing and receiving speech, and therefore cannot efficiently convey or acquire spoken information. As an important medium of information, video enriches daily life and brings great convenience, yet much of the important information in video is carried by speech, which makes video hard for the hearing impaired to understand. How to help hearing-impaired people obtain the speech information in video has therefore become a popular research topic in recent years.
In recent years, with the rapid development of deep learning, speech recognition has made great breakthroughs, and caption generators based on speech recognition have emerged one after another. Many commercial-grade caption generators achieve recognition accuracies above 95%. Such caption generators give the hearing impaired a channel for acquiring the speech information in video and, to a certain extent, make their lives easier.
Complex video scenes, however, expose the limits of existing caption generators. When several speakers appear in a video, existing caption generators cannot attribute captions to the corresponding speaker, which degrades the viewing experience; more importantly, video quality on the internet is uneven, and existing caption generators cannot guarantee caption accuracy in noisy environments. In summary, conventional caption generation technology faces the following technical difficulties:
(1) How to detect the active speaker in a video so that captions can be generated for the corresponding speaker.
(2) How to use visual information to improve caption accuracy in complex acoustic environments.
(3) How to make audio and visual information learn from each other in a meaningful way rather than being simply concatenated.
Disclosure of Invention
The invention aims to provide a speaker detection and caption generation method based on a cross-attention mechanism, which can not only generate captions for the corresponding speaker in multi-speaker scenes but also maintain caption accuracy in noisy acoustic environments.
In order to achieve the above object, the present invention provides a speaker detection and subtitle generation method based on a cross-attention mechanism, comprising the steps of:
S1, acquiring the AVA-ActiveSpeaker and CMLR datasets. The AVA-ActiveSpeaker dataset is collected from Hollywood movies and contains about 3.65 million labelled frames and about 38.5 hours of face tracks with the corresponding audio, where each face instance is labelled as speaking or not, and whether the speech is audible. The CMLR dataset was collected by the Visual Intelligence and Pattern Analysis (VIPA) group of Zhejiang University and consists of 102,072 sentences spoken by 11 presenters of the Chinese national news program "News Broadcast", recorded between June 2009 and June 2018.
S2, designing the algorithm model: 1. Front-end module: two temporal encoders are used to extract spatio-temporal features from the audio-visual information. The visual temporal encoder consists of a visual front-end module (3D-ResNet18) and a visual temporal module (depthwise-separable convolution + ReLU & BN) aimed at learning a long-term representation of facial dynamics. The audio temporal encoder consists of a ResNet-34 network and an SE (Squeeze-and-Excitation) module; ResNet-34 extracts deeper audio features, while the SE module lets the network focus on the more heavily weighted feature channels, so the audio information is modelled effectively. 2. Back-end module: a cross-attention mechanism is designed to dynamically describe the audio-visual interaction. The visual and audio features produced by the front-end module learn from each other through cross-attention, generating two new features: an audio feature that has learned from the visual feature, and a visual feature that has learned from the audio feature. The two features are multiplied element-wise and passed through a self-attention layer to obtain an enhanced self-attention feature; finally a fully connected layer is applied, and a softmax operation projects the self-attention feature onto the active speaker detection label sequence. At the same time, the self-attention feature is passed through a Transformer decoder module to generate the caption for the corresponding speaker. 3. Designing the loss function and optimizer. 4. Designing the training strategy and constructing the active speaker detection and caption generation model.
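As an illustration of the channel re-weighting performed by the SE module in the audio temporal encoder, a minimal PyTorch sketch is given below. The class name and the reduction ratio of 16 are illustrative assumptions, not values fixed by the method.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: re-weights feature channels so that the
    network focuses on the more informative audio feature channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                             # channel-wise re-weighting
```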
S3, preprocessing the data. To test the robustness of active speaker detection and caption generation in noisy environments, noise audio at -5 to 20 dB from the Noise-92 noise database is added to all audio in the data. To help the model adapt to different noise environments, audio tracks are randomly selected from other videos in the same batch and used as noise for speech augmentation. Finally, the preprocessed video data, audio data and text information are encoded.
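A sketch of how noise from an external database might be mixed into a clean track at a signal-to-noise ratio drawn from the -5 to 20 dB range; the function name and scaling details are assumptions for illustration, not the exact procedure used in the method.

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clean` so that the result has roughly the requested SNR in dB."""
    if len(noise) < len(clean):                       # tile the noise to cover the clip
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example usage:
# snr = np.random.uniform(-5.0, 20.0)
# noisy = add_noise_at_snr(clean_wave, noise_wave, snr)
```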
S4, training the designed algorithm model on the preprocessed data to obtain the trained model.
S5, demonstrating active speaker detection and caption generation, and displaying the generated captions below the video.
Therefore, by designing a cross-attention mechanism together with the visual temporal encoder and audio temporal encoder of the model, the audio and visual information learn from each other while spatio-temporal audio-visual features are extracted, so that the caption generator can produce captions for the corresponding speaker in multi-speaker scenes and remain accurate in complex acoustic environments.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of a method for speaker detection and subtitle generation based on a cross-attention mechanism according to the present invention;
FIG. 2 is a complete network diagram of a method for speaker detection and caption generation based on a cross-attention mechanism according to the present invention, wherein a represents speaker detection and b represents caption generation;
FIG. 3 is a detailed block diagram of the cross-attention mechanism in the speaker detection and caption generation method based on the cross-attention mechanism of the present invention;
FIG. 4 is a detailed block diagram of the Transformer decoder in the speaker detection and subtitle generation method based on a cross-attention mechanism of the present invention.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
The following detailed description of the embodiments of the invention, provided in the accompanying drawings, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
Referring to fig. 1, fig. 1 is a flowchart of a method for speaker detection and subtitle generation based on a cross-attention mechanism according to an embodiment of the present application, where the method specifically includes the following steps:
step S1: AVA-Active Speaker and CMLR datasets were acquired.
Step S2: FIG. 2 shows the complete network of the cross-attention-based speaker detection and caption generation method proposed in the present application, where a denotes speaker detection and b denotes caption generation. The network front end consists of a visual temporal encoder and an audio temporal encoder: the visual temporal encoder performs feature extraction and temporal modelling on the input image sequence, and the audio temporal encoder performs feature extraction on the audio waveform. The network back end consists of a cross-attention layer, a self-attention layer and a Transformer decoder. The cross-attention layer lets the audio and visual features learn from each other so that they are dynamically aligned. The self-attention layer enhances the output of the cross-attention layer, and a softmax operation projects the enhanced features onto the active speaker detection label sequence. The decoder then decodes the enhanced features to generate captions. A sketch of how these components could be wired together is given below.
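The sketch below shows, under assumed tensor shapes and module names, how the components described above could be wired together in PyTorch. The real front ends (3D-ResNet18 + visual temporal module, ResNet-34 + SE) are replaced by linear placeholders so the example stays self-contained; it is a structural illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class SpeakerCaptionNet(nn.Module):
    """Skeleton of the described pipeline: front-end encoders, cross-attention,
    self-attention enhancement, ASD classification head and a Transformer
    decoder for captions."""
    def __init__(self, feat_dim=512, d_model=128, vocab_size=4000):
        super().__init__()
        self.visual_encoder = nn.Linear(feat_dim, d_model)   # placeholder for 3D-ResNet18 + temporal module
        self.audio_encoder = nn.Linear(feat_dim, d_model)    # placeholder for ResNet-34 + SE
        self.cross_av = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.cross_va = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.asd_head = nn.Linear(d_model, 2)                # speaking / not speaking
        self.caption_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.vocab_proj = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, audio_feats, caption_embeds):
        f_v = self.visual_encoder(visual_feats)      # (B, T, d_model)
        f_a = self.audio_encoder(audio_feats)        # (B, T, d_model)
        f_a2v, _ = self.cross_av(f_v, f_a, f_a)      # query from video, key/value from audio
        f_v2a, _ = self.cross_va(f_a, f_v, f_v)      # query from audio, key/value from video
        fused = self.self_attn(f_a2v * f_v2a)        # element-wise product, then self-attention
        asd_logits = self.asd_head(fused)            # softmax applied at loss / inference time
        captions = self.vocab_proj(self.caption_decoder(caption_embeds, fused))
        return asd_logits, captions
```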
First, to model the frame-to-frame temporal relationship while extracting fine-grained facial features of the speaker, we add a visual front-end module to the visual temporal encoder, consisting of a 3D-ResNet18 whose 3D convolution layer has a kernel size of 5×7. To model the temporal relationship over the entire video sequence, we introduce a visual temporal module consisting of a visual temporal convolution block and a one-dimensional convolution layer. The visual temporal convolution block has 5 layers; each layer comprises a depthwise-separable convolution layer (DS-Conv1D), a rectified linear unit (ReLU) and a batch normalization layer (BN), and the layers are connected in a residual manner. The one-dimensional convolution layer reduces the feature dimension. Second, for the audio information we use an audio temporal encoder for feature extraction, consisting of a ResNet-34 network and an SE (Squeeze-and-Excitation) module: ResNet-34 extracts deeper audio features, and the SE module lets the network focus on the more heavily weighted feature channels, so the audio information is modelled effectively. For the back-end module, we design two cross-attention networks along the time dimension to dynamically describe the audio-visual interaction. As shown in FIG. 3, to learn a new interaction-aware audio feature $F_{a\to v}$, the attention layer uses $F_v$ as the target sequence to generate the query vector $Q_v$, and uses $F_a$ as the source sequence to generate the key and value $K_a$, $V_a$. To learn a new video feature $F_{v\to a}$, the attention layer uses $F_a$ as the target sequence to generate the query vector $Q_a$, and uses $F_v$ as the source sequence to generate the key and value $K_v$, $V_v$. To make the weight distribution of the attention layer more uniform and ease gradient updates, the dot product of the query vector and the key is divided by $\sqrt{d_k}$, as shown in Equations 1 and 2:

$F_{a\to v} = \mathrm{softmax}\!\left(\dfrac{Q_v K_a^{\top}}{\sqrt{d_k}}\right) V_a \quad (1)$

$F_{v\to a} = \mathrm{softmax}\!\left(\dfrac{Q_a K_v^{\top}}{\sqrt{d_k}}\right) V_v \quad (2)$

where $d_k$ denotes the dimension of the output and $K^{\top}$ denotes the transpose of the key matrix.
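A literal transcription of Equations (1) and (2) as a single-head cross-attention module in PyTorch; the linear projections and dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: the target sequence produces the query,
    the source sequence produces the key and value (Eq. 1 / Eq. 2)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(target)                              # e.g. Q_v from F_v
        k = self.k_proj(source)                              # e.g. K_a from F_a
        v = self.v_proj(source)                              # e.g. V_a from F_a
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v             # softmax(QK^T / sqrt(d_k)) V

# F_a_to_v = cross_attn(target=F_v, source=F_a)   # Equation (1)
# F_v_to_a = cross_attn(target=F_a, source=F_v)   # Equation (2)
```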
We multiply $F_{a\to v}$ and $F_{v\to a}$ element-wise and feed the product into a self-attention layer to obtain an enhanced self-attention feature. A fully connected layer is then applied, and a softmax operation projects the feature onto the active speaker detection label sequence. At the same time, the feature is fed into a Transformer decoder to generate the caption for the corresponding speaker; the Transformer decoder structure is shown in FIG. 4, and a possible decoding sketch is given below.
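The patent does not fix a particular decoding strategy for the Transformer decoder; as one possible illustration, the greedy loop below generates a caption token by token from the enhanced feature, reusing the attribute names assumed in the skeleton sketch above.

```python
import torch

@torch.no_grad()
def greedy_decode(model, fused, embed, bos_id, eos_id, max_len=60):
    """Greedy caption generation from the enhanced self-attention features."""
    tokens = torch.full((fused.size(0), 1), bos_id, dtype=torch.long, device=fused.device)
    for _ in range(max_len):
        dec = model.caption_decoder(embed(tokens), fused)    # (B, L, d_model)
        logits = model.vocab_proj(dec[:, -1])                # distribution over the next token
        next_tok = logits.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():                       # stop once every sequence emits <eos>
            break
    return tokens
```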
Step S3: data preprocessing. 1. For the video data, face sequences are resized to 112×112 and visually augmented by randomly flipping, rotating and cropping the original images. 2. For the audio data, to make the model perform well in noisy environments, noise audio at -5 to 20 dB from the Noise-92 noise database is added to all audio in the data. 3. We propose a negative-sampling method to increase the number of noise samples (a sketch follows this step): during training, one video is used as the input data, another video from the same batch is drawn at random, and its audio track is selected as noise and superimposed on the audio track of the input video. This greatly increases the richness of the noise, so the model is trained more thoroughly.
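A sketch of the negative-sampling augmentation under an assumed data layout (a batch given as a list of waveform arrays); the simple additive overlay is an illustration and could be combined with the SNR-based scaling sketched earlier.

```python
import random
import numpy as np

def negative_sample_noise(batch_audio, index):
    """Overlay the audio track of a different, randomly chosen video from the
    same batch onto the track at batch_audio[index]."""
    other = random.choice([i for i in range(len(batch_audio)) if i != index])
    clean, noise = batch_audio[index], batch_audio[other]
    n = min(len(clean), len(noise))
    mixed = clean.copy()
    mixed[:n] = clean[:n] + noise[:n]        # simple overlay of the two tracks
    return mixed
```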
Step S4: training on the preprocessed data. The training phase is divided into two steps. First, the Transformer decoder module of the model is frozen and the active speaker detection part is trained on the AVA-ActiveSpeaker dataset; after that training is complete, the caption generation part is trained on the CMLR dataset. 1. With the Transformer decoder frozen, the output of the remaining network is fed to a softmax layer and the whole network is trained end to end, yielding the pre-trained weights of the active speaker detection part. We use an Adam optimizer with an initial learning rate η = 3e-4 and a weight decay of 1e-4. Training uses a single graphics card with the batch size set to 32. A cosine scheduler is used over 80 training epochs; its advantage is that the learning rate decreases from the start of training while remaining relatively large, which is potentially beneficial for training. 2. After this training is complete, the Transformer decoder is unfrozen, the dataset is switched to CMLR, and the caption generation part is trained end to end with the same settings, yielding the pre-trained weights of the caption generation part. A sketch of this two-stage setup follows.
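A sketch of the two-stage setup under the assumption that the decoder is exposed as `model.caption_decoder`; the optimizer and scheduler settings follow the values stated above (Adam, lr 3e-4, weight decay 1e-4, cosine schedule over 80 epochs).

```python
import torch

def configure_stage(model, train_captioner: bool):
    """Stage 1 (train_captioner=False): freeze the Transformer decoder and train
    the active speaker detection branch. Stage 2 (True): unfreeze it and train
    the caption generation part."""
    for p in model.caption_decoder.parameters():
        p.requires_grad = train_captioner
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=3e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)
    return optimizer, scheduler

# optimizer, scheduler = configure_stage(model, train_captioner=False)  # stage 1: AVA-ActiveSpeaker
# ... train for 80 epochs, calling scheduler.step() once per epoch ...
# optimizer, scheduler = configure_stage(model, train_captioner=True)   # stage 2: CMLR
```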
Step S5: demonstrating active speaker detection and caption generation. 1. The UI of the system is designed with the PySide2 toolkit and consists mainly of a video display area; to improve usability, modules for play, pause, fast-forward, rewind and volume adjustment, a progress bar, a playlist, etc. are provided (a minimal UI sketch is given after this step). 2. The training weights and the model are loaded, and the buttons of the UI are bound to the corresponding functions in the code. 3. Clicking the "Load video" button imports a video into the system. 4. Clicking the "Add noise" button adds noise audio at -5 to 20 dB from the Noise-92 noise database to the video to be processed. 5. The network is loaded and, using the trained weights, active speaker detection and caption generation are performed on the noise-added video. 6. The generated captions are displayed below the video.
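A minimal sketch of how such a demo window could be assembled, assuming PySide2 is the intended toolkit; the widget layout and the inference hook are placeholders rather than the actual demo program.

```python
import sys
from PySide2.QtCore import QUrl
from PySide2.QtMultimedia import QMediaContent, QMediaPlayer
from PySide2.QtMultimediaWidgets import QVideoWidget
from PySide2.QtWidgets import (QApplication, QFileDialog, QMainWindow,
                               QPushButton, QVBoxLayout, QWidget)

class DemoWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.player = QMediaPlayer(self)
        video_area = QVideoWidget()                      # video display area
        self.player.setVideoOutput(video_area)
        load_btn = QPushButton("Load video")
        load_btn.clicked.connect(self.load_video)
        layout = QVBoxLayout()
        layout.addWidget(video_area)
        layout.addWidget(load_btn)
        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

    def load_video(self):
        path, _ = QFileDialog.getOpenFileName(self, "Select video")
        if path:
            # Hook point: run speaker detection + caption generation here and
            # render the generated captions below the video before playback.
            self.player.setMedia(QMediaContent(QUrl.fromLocalFile(path)))
            self.player.play()

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = DemoWindow()
    window.show()
    sys.exit(app.exec_())
```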
The beneficial effects of this application are as follows. First, to model the frame-to-frame temporal relationship while extracting fine-grained facial features of the speaker, a visual front-end module consisting of a 3D-ResNet18 (with a 3D convolution kernel size of 5×7) is added to the visual temporal encoder. To model the temporal relationship over the entire video sequence, a visual temporal module consisting of a visual temporal convolution block and a one-dimensional convolution layer is introduced; the visual temporal convolution block has 5 layers, each comprising a depthwise-separable convolution layer (DS-Conv1D), a rectified linear unit (ReLU) and a batch normalization layer (BN), connected in a residual manner, and the one-dimensional convolution layer reduces the feature dimension. Second, for the audio information, an audio temporal encoder consisting of a ResNet-34 network and an SE (Squeeze-and-Excitation) module performs feature extraction: ResNet-34 extracts deeper audio features, and the SE module lets the network focus on the more heavily weighted feature channels, so the audio information is modelled effectively. In the back-end module, a cross-attention mechanism lets the visual and audio features produced by the front end learn from each other, generating two new features: an audio feature that has learned from the visual feature and a visual feature that has learned from the audio feature. This module dynamically aligns the video and audio information, improving the active speaker detection and caption generation performance of the model in complex environments. A sketch of one layer of the visual temporal convolution block follows.
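As an illustration of one layer of the visual temporal convolution block described above, a sketch is given below; the channel count and kernel size are assumptions, and only the residual DS-Conv1D + ReLU + BN pattern is taken from the text.

```python
import torch
import torch.nn as nn

class DSConv1dBlock(nn.Module):
    """One layer of the visual temporal convolution block: depthwise-separable
    1-D convolution + ReLU + BatchNorm with a residual connection."""
    def __init__(self, channels: int = 512, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, T)
        out = self.bn(self.relu(self.pointwise(self.depthwise(x))))
        return x + out                                       # residual connection

# The full visual temporal module stacks 5 such layers and then applies a
# one-dimensional convolution to reduce the feature dimension.
```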
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the invention may still be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the invention.

Claims (4)

1. A speaker detection and caption generation method based on a cross-attention mechanism is characterized by comprising the following steps:
s1, acquiring a data set;
s2, designing an algorithm model and a training strategy, and constructing an active speaker detection and subtitle generation model:
s2-1, constructing a front-end module;
s2-2, the visual features and the audio features generated by the front-end module learn from each other through the cross-attention mechanism of the back-end module, obtaining an audio feature that has learned the visual features and a visual feature that has learned the audio features; the two features are multiplied element-wise and passed through a self-attention layer to obtain a self-attention feature; the self-attention feature is projected onto the active speaker detection label sequence, and at the same time the self-attention feature is passed through a Transformer decoder module to generate captions according to the speaker;
s2-3, designing a loss function and an optimizer;
s2-4, designing a training strategy, and constructing an active speaker detection and subtitle generation model;
s3, preprocessing and encoding the data: for the video data, performing visual augmentation by randomly flipping, rotating and cropping the original images; for the audio data, adding noise audio at -5 to 20 dB from the Noise-92 noise database; increasing the number of noise samples by a negative-sampling method: selecting one video as the input data, randomly extracting another video in the same batch, selecting its audio track as noise, and superimposing it on the audio track of the input data;
s4, training the preprocessed data by using a designed algorithm model to obtain a training model;
and S5, demonstrating the active speaker detection and the caption generation, and displaying the caption generation result below the video.
2. The method for speaker detection and caption generation based on a cross-attention mechanism of claim 1, wherein: the front-end module in step S2-1 consists of a visual temporal encoder and an audio temporal encoder.
3. The method for speaker detection and caption generation based on a cross-attention mechanism of claim 2, wherein: the visual temporal encoder consists of a visual front-end module and a visual temporal module, the visual front-end module consists of a 3D-ResNet18, and the visual temporal module consists of a visual temporal convolution block with 5 layers in total and a one-dimensional convolution layer.
4. The method for speaker detection and caption generation based on a cross-attention mechanism according to claim 3, wherein: the audio temporal encoder consists of a ResNet-34 network and an SE (Squeeze-and-Excitation) module.
CN202211561326.2A 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism Active CN115831119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211561326.2A CN115831119B (en) 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211561326.2A CN115831119B (en) 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism

Publications (2)

Publication Number Publication Date
CN115831119A CN115831119A (en) 2023-03-21
CN115831119B true CN115831119B (en) 2023-07-21

Family

ID=85544378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211561326.2A Active CN115831119B (en) 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism

Country Status (1)

Country Link
CN (1) CN115831119B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data
CN114519880B (en) * 2022-02-09 2024-04-05 复旦大学 Active speaker recognition method based on cross-modal self-supervision learning

Also Published As

Publication number Publication date
CN115831119A (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN107818306A (en) A kind of video answering method based on attention model
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN111046757B (en) Training method and device for face portrait generation model and related equipment
CN109670453B (en) Method for extracting short video theme
CN108805036A (en) A kind of new non-supervisory video semanteme extracting method
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN115129934A (en) Multi-mode video understanding method
CN112287175A (en) Method and system for predicting highlight segments of video
Li et al. [Retracted] Multimedia Data Processing Technology and Application Based on Deep Learning
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
CN115831119B (en) Speaker detection and subtitle generation method based on cross-attention mechanism
Chen [Retracted] Semantic Analysis of Multimodal Sports Video Based on the Support Vector Machine and Mobile Edge Computing
CN113099374B (en) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
KR102526263B1 (en) Method and System for Auto Multiple Image Captioning
CN111327943B (en) Information management method, device, system, computer equipment and storage medium
Feng et al. Neural Network‐Based Ultra‐High‐Definition Video Live Streaming Optimization Algorithm
CN110933519A (en) Multi-path feature-based memory network video abstraction method
CN114780867B (en) Recommendation method, medium, device and computing equipment
Di Principles of AIGC technology and its application in new media micro-video creation
Costa et al. Deep Learning Approach for Seamless Navigation in Multi-View Streaming Applications
Sanjeeva et al. TEXT2AV–Automated Text to Audio and Video Conversion
CN111757173B (en) Commentary generation method and device, intelligent sound box and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant