CN115831119B - Speaker detection and subtitle generation method based on cross-attention mechanism - Google Patents

Speaker detection and subtitle generation method based on cross-attention mechanism

Info

Publication number
CN115831119B
CN115831119B (application number CN202211561326.2A)
Authority
CN
China
Prior art keywords
audio
visual
features
speaker detection
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211561326.2A
Other languages
Chinese (zh)
Other versions
CN115831119A (en)
Inventor
肖业伟
刘烜铭
滕连伟
朱澳苏
田丕承
黄健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202211561326.2A priority Critical patent/CN115831119B/en
Publication of CN115831119A publication Critical patent/CN115831119A/en
Application granted granted Critical
Publication of CN115831119B publication Critical patent/CN115831119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a speaker detection and caption generation method based on a cross-attention mechanism, relating to the technical field of active speaker detection and caption generation and comprising the following steps: (1) acquire the datasets; (2) design the algorithm model to obtain the active speaker detection and caption generation model; (3) preprocess the data; (4) train the designed active speaker detection and caption generation model on the preprocessed data to obtain the trained model; (5) demonstrate active speaker detection and caption generation, displaying the generated captions below the video. By designing the visual temporal encoder and audio temporal encoder of the model and applying a cross-attention mechanism so that the audio and visual information learn from each other while spatio-temporal audio-visual features are extracted, the caption generator can produce captions for the corresponding speaker in multi-speaker scenes and remain accurate in complex acoustic environments.

Description

Speaker detection and subtitle generation method based on cross-attention mechanism
Technical Field
The invention relates to the technical field of active speaker detection and caption generation, in particular to a speaker detection and caption generation method based on a cross-attention mechanism.
Background
Speech is the most important, most efficient, most commonly used and most convenient way for humans to exchange information. Hearing-impaired people, however, face obstacles in both producing and receiving speech, and therefore cannot efficiently convey or acquire spoken information. As an important medium of information, video enriches daily life and brings great convenience, yet much of the important information in video is carried by speech, which makes video hard for the hearing impaired to understand. How to help hearing-impaired people obtain the speech information in video has therefore become a popular research topic in recent years.
In recent years, with the rapid development of deep learning, speech recognition has made great breakthroughs, and caption generators based on speech recognition have emerged one after another. Many commercial-grade caption generators achieve recognition accuracies above 95%. Such caption generators give the hearing impaired a channel for acquiring the speech information in video and, to a certain extent, make their lives easier.
Complex video scenes, however, expose the limits of existing caption generators. When several speakers appear in a video, existing caption generators cannot attribute captions to the corresponding speaker, which degrades the viewing experience; more importantly, video quality on the internet is uneven, and existing caption generators cannot guarantee caption accuracy in noisy environments. In summary, conventional caption generation technology faces the following technical difficulties:
(1) How to detect the active speaker in a video so that captions can be generated for the corresponding speaker.
(2) How to use visual information to improve caption accuracy in complex acoustic environments.
(3) How to make audio and visual information learn from each other in a meaningful way rather than being simply concatenated.
Disclosure of Invention
The invention aims to provide a speaker detection and caption generation method based on a cross-attention mechanism, which can not only generate captions for the corresponding speaker in multi-speaker scenes but also maintain caption accuracy in noisy acoustic environments.
In order to achieve the above object, the present invention provides a speaker detection and subtitle generation method based on a cross-attention mechanism, comprising the steps of:
S1, acquiring the AVA-ActiveSpeaker and CMLR datasets. The AVA-ActiveSpeaker dataset is collected from Hollywood movies and contains about 3.65 million labelled frames and about 38.5 hours of face tracks with the corresponding audio, where each face instance is labelled as speaking or not, and whether the speech is audible. The CMLR dataset was collected by the Visual Intelligence and Pattern Analysis (VIPA) group of Zhejiang University and consists of 102,072 sentences spoken by 11 presenters of the Chinese national news program "News Broadcast", recorded between June 2009 and June 2018.
S2, designing the algorithm model: 1. Front-end module: two temporal encoders are used to extract spatio-temporal features from the audio-visual information. The visual temporal encoder consists of a visual front-end module (3D-ResNet18) and a visual temporal module (depthwise-separable convolution + ReLU & BN) aimed at learning a long-term representation of facial dynamics. The audio temporal encoder consists of a ResNet-34 network and an SE (Squeeze-and-Excitation) module; ResNet-34 extracts deeper audio features, while the SE module lets the network focus on the more heavily weighted feature channels, so the audio information is modelled effectively. 2. Back-end module: a cross-attention mechanism is designed to dynamically describe the audio-visual interaction. The visual and audio features produced by the front-end module learn from each other through cross-attention, generating two new features: an audio feature that has learned from the visual feature, and a visual feature that has learned from the audio feature. The two features are multiplied element-wise and passed through a self-attention layer to obtain an enhanced self-attention feature; finally a fully connected layer is applied, and a softmax operation projects the self-attention feature onto the active speaker detection label sequence. At the same time, the self-attention feature is passed through a Transformer decoder module to generate the caption for the corresponding speaker. 3. Designing the loss function and optimizer. 4. Designing the training strategy and constructing the active speaker detection and caption generation model.
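As an illustration of the channel re-weighting performed by the SE module in the audio temporal encoder, a minimal PyTorch sketch is given below. The class name and the reduction ratio of 16 are illustrative assumptions, not values fixed by the method.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: re-weights feature channels so that the
    network focuses on the more informative audio feature channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                             # channel-wise re-weighting
```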
S3, preprocessing the data. To test the robustness of active speaker detection and caption generation in noisy environments, noise audio at -5 to 20 dB from the Noise-92 noise database is added to all audio in the data. To help the model adapt to different noise environments, audio tracks are randomly selected from other videos in the same batch and used as noise for speech augmentation. Finally, the preprocessed video data, audio data and text information are encoded.
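A sketch of how noise from an external database might be mixed into a clean track at a signal-to-noise ratio drawn from the -5 to 20 dB range; the function name and scaling details are assumptions for illustration, not the exact procedure used in the method.

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clean` so that the result has roughly the requested SNR in dB."""
    if len(noise) < len(clean):                       # tile the noise to cover the clip
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example usage:
# snr = np.random.uniform(-5.0, 20.0)
# noisy = add_noise_at_snr(clean_wave, noise_wave, snr)
```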
S4, training the designed algorithm model on the preprocessed data to obtain the trained model.
S5, demonstrating active speaker detection and caption generation, and displaying the generated captions below the video.
Therefore, by designing a cross-attention mechanism together with the visual temporal encoder and audio temporal encoder of the model, the audio and visual information learn from each other while spatio-temporal audio-visual features are extracted, so that the caption generator can produce captions for the corresponding speaker in multi-speaker scenes and remain accurate in complex acoustic environments.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of a method for speaker detection and subtitle generation based on a cross-attention mechanism according to the present invention;
FIG. 2 is a complete network diagram of a method for speaker detection and caption generation based on a cross-attention mechanism according to the present invention, wherein a represents speaker detection and b represents caption generation;
FIG. 3 is a detailed block diagram of the cross-attention mechanism in the speaker detection and caption generation method based on the cross-attention mechanism of the present invention;
FIG. 4 is a detailed block diagram of the Transformer decoder in the speaker detection and subtitle generation method based on a cross-attention mechanism of the present invention.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
The following detailed description of the embodiments of the invention, provided in the accompanying drawings, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
Referring to fig. 1, fig. 1 is a flowchart of a method for speaker detection and subtitle generation based on a cross-attention mechanism according to an embodiment of the present application, where the method specifically includes the following steps:
step S1: AVA-Active Speaker and CMLR datasets were acquired.
Step S2: FIG. 2 shows the complete network of the cross-attention-based speaker detection and caption generation method proposed in the present application, where a denotes speaker detection and b denotes caption generation. The network front end consists of a visual temporal encoder and an audio temporal encoder: the visual temporal encoder performs feature extraction and temporal modelling on the input image sequence, and the audio temporal encoder performs feature extraction on the audio waveform. The network back end consists of a cross-attention layer, a self-attention layer and a Transformer decoder. The cross-attention layer lets the audio and visual features learn from each other so that they are dynamically aligned. The self-attention layer enhances the output of the cross-attention layer, and a softmax operation projects the enhanced features onto the active speaker detection label sequence. The decoder then decodes the enhanced features to generate captions. A sketch of how these components could be wired together is given below.
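The sketch below shows, under assumed tensor shapes and module names, how the components described above could be wired together in PyTorch. The real front ends (3D-ResNet18 + visual temporal module, ResNet-34 + SE) are replaced by linear placeholders so the example stays self-contained; it is a structural illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class SpeakerCaptionNet(nn.Module):
    """Skeleton of the described pipeline: front-end encoders, cross-attention,
    self-attention enhancement, ASD classification head and a Transformer
    decoder for captions."""
    def __init__(self, feat_dim=512, d_model=128, vocab_size=4000):
        super().__init__()
        self.visual_encoder = nn.Linear(feat_dim, d_model)   # placeholder for 3D-ResNet18 + temporal module
        self.audio_encoder = nn.Linear(feat_dim, d_model)    # placeholder for ResNet-34 + SE
        self.cross_av = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.cross_va = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.asd_head = nn.Linear(d_model, 2)                # speaking / not speaking
        self.caption_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.vocab_proj = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, audio_feats, caption_embeds):
        f_v = self.visual_encoder(visual_feats)      # (B, T, d_model)
        f_a = self.audio_encoder(audio_feats)        # (B, T, d_model)
        f_a2v, _ = self.cross_av(f_v, f_a, f_a)      # query from video, key/value from audio
        f_v2a, _ = self.cross_va(f_a, f_v, f_v)      # query from audio, key/value from video
        fused = self.self_attn(f_a2v * f_v2a)        # element-wise product, then self-attention
        asd_logits = self.asd_head(fused)            # softmax applied at loss / inference time
        captions = self.vocab_proj(self.caption_decoder(caption_embeds, fused))
        return asd_logits, captions
```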
First, to model the frame-to-frame temporal relationship while extracting fine-grained facial features of the speaker, we add a visual front-end module to the visual temporal encoder, consisting of a 3D-ResNet18 whose 3D convolution layer has a kernel size of 5×7. To model the temporal relationship over the entire video sequence, we introduce a visual temporal module consisting of a visual temporal convolution block and a one-dimensional convolution layer. The visual temporal convolution block has 5 layers; each layer comprises a depthwise-separable convolution layer (DS-Conv1D), a rectified linear unit (ReLU) and a batch normalization layer (BN), and the layers are connected in a residual manner. The one-dimensional convolution layer reduces the feature dimension. Second, for the audio information we use an audio temporal encoder for feature extraction, consisting of a ResNet-34 network and an SE (Squeeze-and-Excitation) module: ResNet-34 extracts deeper audio features, and the SE module lets the network focus on the more heavily weighted feature channels, so the audio information is modelled effectively. For the back-end module, we design two cross-attention networks along the time dimension to dynamically describe the audio-visual interaction. As shown in FIG. 3, to learn a new interaction-aware audio feature $F_{a\to v}$, the attention layer uses $F_v$ as the target sequence to generate the query vector $Q_v$, and uses $F_a$ as the source sequence to generate the key and value $K_a$, $V_a$. To learn a new video feature $F_{v\to a}$, the attention layer uses $F_a$ as the target sequence to generate the query vector $Q_a$, and uses $F_v$ as the source sequence to generate the key and value $K_v$, $V_v$. To make the weight distribution of the attention layer more uniform and ease gradient updates, the dot product of the query vector and the key is divided by $\sqrt{d_k}$, as shown in Equations 1 and 2:

$F_{a\to v} = \mathrm{softmax}\!\left(\dfrac{Q_v K_a^{\top}}{\sqrt{d_k}}\right) V_a \quad (1)$

$F_{v\to a} = \mathrm{softmax}\!\left(\dfrac{Q_a K_v^{\top}}{\sqrt{d_k}}\right) V_v \quad (2)$

where $d_k$ denotes the dimension of the output and $K^{\top}$ denotes the transpose of the key matrix.
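A literal transcription of Equations (1) and (2) as a single-head cross-attention module in PyTorch; the linear projections and dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: the target sequence produces the query,
    the source sequence produces the key and value (Eq. 1 / Eq. 2)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(target)                              # e.g. Q_v from F_v
        k = self.k_proj(source)                              # e.g. K_a from F_a
        v = self.v_proj(source)                              # e.g. V_a from F_a
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v             # softmax(QK^T / sqrt(d_k)) V

# F_a_to_v = cross_attn(target=F_v, source=F_a)   # Equation (1)
# F_v_to_a = cross_attn(target=F_a, source=F_v)   # Equation (2)
```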
We multiply $F_{a\to v}$ and $F_{v\to a}$ element-wise and feed the product into a self-attention layer to obtain an enhanced self-attention feature. A fully connected layer is then applied, and a softmax operation projects the feature onto the active speaker detection label sequence. At the same time, the feature is fed into a Transformer decoder to generate the caption for the corresponding speaker; the Transformer decoder structure is shown in FIG. 4, and a possible decoding sketch is given below.
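The patent does not fix a particular decoding strategy for the Transformer decoder; as one possible illustration, the greedy loop below generates a caption token by token from the enhanced feature, reusing the attribute names assumed in the skeleton sketch above.

```python
import torch

@torch.no_grad()
def greedy_decode(model, fused, embed, bos_id, eos_id, max_len=60):
    """Greedy caption generation from the enhanced self-attention features."""
    tokens = torch.full((fused.size(0), 1), bos_id, dtype=torch.long, device=fused.device)
    for _ in range(max_len):
        dec = model.caption_decoder(embed(tokens), fused)    # (B, L, d_model)
        logits = model.vocab_proj(dec[:, -1])                # distribution over the next token
        next_tok = logits.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():                       # stop once every sequence emits <eos>
            break
    return tokens
```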
Step S3: data preprocessing. 1. For the video data, face sequences are resized to 112×112 and visually augmented by randomly flipping, rotating and cropping the original images. 2. For the audio data, to make the model perform well in noisy environments, noise audio at -5 to 20 dB from the Noise-92 noise database is added to all audio in the data. 3. We propose a negative-sampling method to increase the number of noise samples (a sketch follows this step): during training, one video is used as the input data, another video from the same batch is drawn at random, and its audio track is selected as noise and superimposed on the audio track of the input video. This greatly increases the richness of the noise, so the model is trained more thoroughly.
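A sketch of the negative-sampling augmentation under an assumed data layout (a batch given as a list of waveform arrays); the simple additive overlay is an illustration and could be combined with the SNR-based scaling sketched earlier.

```python
import random
import numpy as np

def negative_sample_noise(batch_audio, index):
    """Overlay the audio track of a different, randomly chosen video from the
    same batch onto the track at batch_audio[index]."""
    other = random.choice([i for i in range(len(batch_audio)) if i != index])
    clean, noise = batch_audio[index], batch_audio[other]
    n = min(len(clean), len(noise))
    mixed = clean.copy()
    mixed[:n] = clean[:n] + noise[:n]        # simple overlay of the two tracks
    return mixed
```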
Step S4: training on the preprocessed data. The training phase is divided into two steps. First, the Transformer decoder module of the model is frozen and the active speaker detection part is trained on the AVA-ActiveSpeaker dataset; after that training is complete, the caption generation part is trained on the CMLR dataset. 1. With the Transformer decoder frozen, the output of the remaining network is fed to a softmax layer and the whole network is trained end to end, yielding the pre-trained weights of the active speaker detection part. We use an Adam optimizer with an initial learning rate η = 3e-4 and a weight decay of 1e-4. Training uses a single graphics card with the batch size set to 32. A cosine scheduler is used over 80 training epochs; its advantage is that the learning rate decreases from the start of training while remaining relatively large, which is potentially beneficial for training. 2. After this training is complete, the Transformer decoder is unfrozen, the dataset is switched to CMLR, and the caption generation part is trained end to end with the same settings, yielding the pre-trained weights of the caption generation part. A sketch of this two-stage setup follows.
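A sketch of the two-stage setup under the assumption that the decoder is exposed as `model.caption_decoder`; the optimizer and scheduler settings follow the values stated above (Adam, lr 3e-4, weight decay 1e-4, cosine schedule over 80 epochs).

```python
import torch

def configure_stage(model, train_captioner: bool):
    """Stage 1 (train_captioner=False): freeze the Transformer decoder and train
    the active speaker detection branch. Stage 2 (True): unfreeze it and train
    the caption generation part."""
    for p in model.caption_decoder.parameters():
        p.requires_grad = train_captioner
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=3e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)
    return optimizer, scheduler

# optimizer, scheduler = configure_stage(model, train_captioner=False)  # stage 1: AVA-ActiveSpeaker
# ... train for 80 epochs, calling scheduler.step() once per epoch ...
# optimizer, scheduler = configure_stage(model, train_captioner=True)   # stage 2: CMLR
```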
Step S5: demonstrating active speaker detection and caption generation. 1. The UI of the system is designed with the PySide2 toolkit and consists mainly of a video display area; to improve usability, modules for play, pause, fast-forward, rewind and volume adjustment, a progress bar, a playlist, etc. are provided (a minimal UI sketch is given after this step). 2. The training weights and the model are loaded, and the buttons of the UI are bound to the corresponding functions in the code. 3. Clicking the "Load video" button imports a video into the system. 4. Clicking the "Add noise" button adds noise audio at -5 to 20 dB from the Noise-92 noise database to the video to be processed. 5. The network is loaded and, using the trained weights, active speaker detection and caption generation are performed on the noise-added video. 6. The generated captions are displayed below the video.
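A minimal sketch of how such a demo window could be assembled, assuming PySide2 is the intended toolkit; the widget layout and the inference hook are placeholders rather than the actual demo program.

```python
import sys
from PySide2.QtCore import QUrl
from PySide2.QtMultimedia import QMediaContent, QMediaPlayer
from PySide2.QtMultimediaWidgets import QVideoWidget
from PySide2.QtWidgets import (QApplication, QFileDialog, QMainWindow,
                               QPushButton, QVBoxLayout, QWidget)

class DemoWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.player = QMediaPlayer(self)
        video_area = QVideoWidget()                      # video display area
        self.player.setVideoOutput(video_area)
        load_btn = QPushButton("Load video")
        load_btn.clicked.connect(self.load_video)
        layout = QVBoxLayout()
        layout.addWidget(video_area)
        layout.addWidget(load_btn)
        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

    def load_video(self):
        path, _ = QFileDialog.getOpenFileName(self, "Select video")
        if path:
            # Hook point: run speaker detection + caption generation here and
            # render the generated captions below the video before playback.
            self.player.setMedia(QMediaContent(QUrl.fromLocalFile(path)))
            self.player.play()

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = DemoWindow()
    window.show()
    sys.exit(app.exec_())
```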
The beneficial effects of this application are as follows. First, to model the frame-to-frame temporal relationship while extracting fine-grained facial features of the speaker, a visual front-end module consisting of a 3D-ResNet18 (with a 3D convolution kernel size of 5×7) is added to the visual temporal encoder. To model the temporal relationship over the entire video sequence, a visual temporal module consisting of a visual temporal convolution block and a one-dimensional convolution layer is introduced; the visual temporal convolution block has 5 layers, each comprising a depthwise-separable convolution layer (DS-Conv1D), a rectified linear unit (ReLU) and a batch normalization layer (BN), connected in a residual manner, and the one-dimensional convolution layer reduces the feature dimension. Second, for the audio information, an audio temporal encoder consisting of a ResNet-34 network and an SE (Squeeze-and-Excitation) module performs feature extraction: ResNet-34 extracts deeper audio features, and the SE module lets the network focus on the more heavily weighted feature channels, so the audio information is modelled effectively. In the back-end module, a cross-attention mechanism lets the visual and audio features produced by the front end learn from each other, generating two new features: an audio feature that has learned from the visual feature and a visual feature that has learned from the audio feature. This module dynamically aligns the video and audio information, improving the active speaker detection and caption generation performance of the model in complex environments. A sketch of one layer of the visual temporal convolution block follows.
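As an illustration of one layer of the visual temporal convolution block described above, a sketch is given below; the channel count and kernel size are assumptions, and only the residual DS-Conv1D + ReLU + BN pattern is taken from the text.

```python
import torch
import torch.nn as nn

class DSConv1dBlock(nn.Module):
    """One layer of the visual temporal convolution block: depthwise-separable
    1-D convolution + ReLU + BatchNorm with a residual connection."""
    def __init__(self, channels: int = 512, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, T)
        out = self.bn(self.relu(self.pointwise(self.depthwise(x))))
        return x + out                                       # residual connection

# The full visual temporal module stacks 5 such layers and then applies a
# one-dimensional convolution to reduce the feature dimension.
```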
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the invention may still be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the invention.

Claims (4)

1. A speaker detection and caption generation method based on a cross-attention mechanism is characterized by comprising the following steps:
s1, acquiring a data set;
s2, designing an algorithm model and a training strategy, and constructing an active speaker detection and subtitle generation model:
s2-1, constructing a front-end module;
s2-2, the visual features and the audio features generated by the front-end module learn from each other through the cross-attention mechanism of the back-end module, obtaining an audio feature that has learned the visual features and a visual feature that has learned the audio features; the two features are multiplied element-wise and passed through a self-attention layer to obtain a self-attention feature; the self-attention feature is projected onto the active speaker detection label sequence, and at the same time the self-attention feature is passed through a Transformer decoder module to generate captions according to the speaker;
s2-3, designing a loss function and an optimizer;
s2-4, designing a training strategy, and constructing an active speaker detection and subtitle generation model;
s3, preprocessing and encoding the data: for the video data, performing visual augmentation by randomly flipping, rotating and cropping the original images; for the audio data, adding noise audio at -5 to 20 dB from the Noise-92 noise database; increasing the number of noise samples by a negative-sampling method: selecting one video as the input data, randomly extracting another video in the same batch, selecting its audio track as noise, and superimposing it on the audio track of the input data;
s4, training the preprocessed data by using a designed algorithm model to obtain a training model;
and S5, demonstrating the active speaker detection and the caption generation, and displaying the caption generation result below the video.
2. The method for speaker detection and caption generation based on a cross-attention mechanism of claim 1, wherein: the front-end module in step S2-1 consists of a visual temporal encoder and an audio temporal encoder.
3. The method for speaker detection and caption generation based on a cross-attention mechanism of claim 2, wherein: the visual temporal encoder consists of a visual front-end module and a visual temporal module, the visual front-end module consists of a 3D-ResNet18, and the visual temporal module consists of a visual temporal convolution block with 5 layers in total and a one-dimensional convolution layer.
4. The method for speaker detection and caption generation based on a cross-attention mechanism according to claim 3, wherein: the audio temporal encoder consists of a ResNet-34 network and an SE (Squeeze-and-Excitation) module.
CN202211561326.2A 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism Active CN115831119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211561326.2A CN115831119B (en) 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211561326.2A CN115831119B (en) 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism

Publications (2)

Publication Number Publication Date
CN115831119A CN115831119A (en) 2023-03-21
CN115831119B true CN115831119B (en) 2023-07-21

Family

ID=85544378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211561326.2A Active CN115831119B (en) 2022-12-07 2022-12-07 Speaker detection and subtitle generation method based on cross-attention mechanism

Country Status (1)

Country Link
CN (1) CN115831119B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data
CN114519880B (en) * 2022-02-09 2024-04-05 复旦大学 Active speaker recognition method based on cross-modal self-supervision learning

Also Published As

Publication number Publication date
CN115831119A (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN107818306A (en) A kind of video answering method based on attention model
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN111046757B (en) Training method and device for face portrait generation model and related equipment
CN109670453B (en) Method for extracting short video theme
CN108805036A (en) A kind of new non-supervisory video semanteme extracting method
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN115129934A (en) Multi-mode video understanding method
CN112287175A (en) Method and system for predicting highlight segments of video
Li et al. [Retracted] Multimedia Data Processing Technology and Application Based on Deep Learning
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
CN115831119B (en) Speaker detection and subtitle generation method based on cross-attention mechanism
Chen [Retracted] Semantic Analysis of Multimodal Sports Video Based on the Support Vector Machine and Mobile Edge Computing
CN113099374B (en) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
KR102526263B1 (en) Method and System for Auto Multiple Image Captioning
CN111327943B (en) Information management method, device, system, computer equipment and storage medium
Feng et al. Neural Network‐Based Ultra‐High‐Definition Video Live Streaming Optimization Algorithm
CN110933519A (en) Multi-path feature-based memory network video abstraction method
CN114780867B (en) Recommendation method, medium, device and computing equipment
Di Principles of AIGC technology and its application in new media micro-video creation
Costa et al. Deep Learning Approach for Seamless Navigation in Multi-View Streaming Applications
Sanjeeva et al. TEXT2AV–Automated Text to Audio and Video Conversion
CN111757173B (en) Commentary generation method and device, intelligent sound box and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant