CN116456262B - Dual-channel audio generation method based on multi-modal sensing - Google Patents

Dual-channel audio generation method based on multi-modal sensing

Info

Publication number
CN116456262B
Authority
CN
China
Prior art keywords
audio
visual
model
spectrum
video
Prior art date
Legal status
Active
Application number
CN202310329306.0A
Other languages
Chinese (zh)
Other versions
CN116456262A (en)
Inventor
任玲
董波
王洋
王聪聪
李晓慧
孙蕴甜
Current Assignee
Qingdao Urban Rail Transit Technology Co ltd
Original Assignee
Qingdao Urban Rail Transit Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Urban Rail Transit Technology Co ltd filed Critical Qingdao Urban Rail Transit Technology Co ltd
Priority to CN202310329306.0A priority Critical patent/CN116456262B/en
Publication of CN116456262A publication Critical patent/CN116456262A/en
Application granted granted Critical
Publication of CN116456262B publication Critical patent/CN116456262B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H04S1/00 Two-channel systems (stereophonic systems)
    • G06F18/253 Pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/0455 Neural network architectures; auto-encoder networks, encoder-decoder networks
    • G06N3/0464 Neural network architectures; convolutional networks [CNN, ConvNet]
    • G06N3/0895 Neural network learning methods; weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06V10/764 Image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
    • G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V20/46 Scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49 Scene-specific elements in video content; segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • H04N21/233 Processing of audio elementary streams (server side)
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics (server side)
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (client side)
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream (client side)
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

A binaural audio generation method based on multi-modal perception comprises the steps of visual feature extraction and analysis, audio feature extraction and analysis, binaural audio generation, and visual and audio feature fusion.

Description

Dual-channel audio generation method based on multi-modal sensing
Technical Field
The invention relates to the field of rail transit, and in particular to audio processing for rail transit power supply systems; it is applied to intelligent power supply operation and maintenance systems and specifically concerns a binaural audio generation method based on multi-modal perception.
Background
Restoring the spatial distribution of the real scene in binaural audio is an important data characteristic in the intelligent inspection of rail transit power supply systems. Binaural audio carries more spatial information than mono audio: the human auditory system can locate the position and distance of a sound source from the level difference between the two ears and the time difference with which the sound reaches them, and thus perceive the spatial distribution of the environment. However, in current rail transit power supply system monitoring, most videos used for intelligent inspection still carry mono audio, which cannot reproduce real human auditory perception through differences between the left and right channels; the spatial positions in the video are not reflected in the audio, and the spatial distribution of the real environment cannot be reproduced. Most current methods also rely on high-quality binaural audio, which is difficult to obtain, makes the methods complex, and prevents direct transfer between videos of different forms. As a result, the real scene information that operators can capture is greatly reduced, and a truly spatialized auditory experience cannot be restored. In addition, acquiring high-quality binaural audio requires professional recording equipment, which is difficult to apply to power supply system monitoring video scenes.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a binaural audio generation method based on multi-modal perception. Through fusion feature analysis of the audio and video modalities in the power supply system monitoring process, visual information is fused into the audio features at multiple scales, solving the problem that visual features are difficult to use effectively in binaural audio generation and thereby improving the quality of the generated binaural audio.
The invention provides a binaural audio generation method based on multi-modal sensing, which comprises the following steps in sequence:
(1) In the monitoring process of the power supply system, acquiring a real video, and extracting and analyzing visual characteristics of the real video based on a convolutional neural network;
(2) Acquiring an audio signal in a video, and performing time-frequency analysis on the audio signal by utilizing short-time Fourier transform to obtain the characteristics of the audio signal in a frequency domain and a time domain;
(3) The left and right channel audio contained in the time-frequency spectrum is used as the prediction target of the model, the prediction of the audio is realized through a deep convolutional neural network, and a self-supervised binaural audio separation method is adopted to generate the binaural audio;
(4) Fusing the audio and video through a fusion analysis network: the encoder takes the spectrum of the mixed mono audio as input, and a two-dimensional convolutional network completes the down-sampling, realizing the extraction of high-level features from the audio spectrum; the decoder up-samples the high-level features, enhanced visual feature fusion is introduced, the introduced visual features are compressed, and they are merged with the audio features by concatenation.
In a preferred manner, the extraction and analysis of visual features in step (1) is completed based on the convolutional neural network; specifically, visual features are extracted with a pre-trained image classification deep learning model to complete the visual analysis.
In a preferred approach, the pre-trained image classification deep learning model is modified and adapted on the data set to accommodate binaural audio generation tasks.
In a preferred manner, the step (1) specifically includes:
(1.1) acquiring a real video, and dividing the real video into a plurality of continuous video segments with the length of t seconds, wherein t <1.0;
(1.2) for each video clip, extracting the picture frame at the intermediate position as the key frame and using it as the visual input for that video clip;
(1.3) initializing the convolutional neural network with pre-trained weights and then fine-tuning the model with a small learning rate;
(1.4) retaining the feature extraction part of the original network in the convolutional neural network model, removing the classifier part at the end of the network, and keeping only the visual features extracted by the hidden layers of the model.
In a preferred manner, the step (2) specifically includes:
the method comprises the steps of carrying out time-frequency analysis on an audio signal by utilizing short-time Fourier transform, and specifically realizing the time-frequency analysis by utilizing the following formula:
where x [ n ] represents the input signal at time n and w [ n ] is the Hanning window function.
In a preferred manner, the step (3) specifically includes:
the method comprises the steps that an original time sequence signal of audio is taken as a model input, prediction is carried out through a time-frequency mask, and a time-frequency spectrum of the audio is taken as a prediction target of the model when the model is output;
using spectral masking as a prediction object for a binaural audio generation model, wherein the spectral masking is a matrix of the same size as the input spectrum by combining the original spectrum S 0 Performing product operation with the mask M to obtain a target frequency spectrum S t
S t =M·S 0
Left-right channel audio difference S in real data D (t) is expressed as:
S D (t)=S L (t)–S R (t)
mixed audio S of known input M (t) is:
S M (t)=S L (t)+S R (t)
then by mixing the audio S M (t) and predicted Audio DifferenceRestoring to obtain audio of left and right channels:
the prediction target of the model is the audio differenceAnd as far as possible make +.>And S is equal to D The error of (t) is small.
In a preferred manner, the spectral masking employs masking operations directly in the complex domain. Complex-valued masking requires the product of the original spectrum and the mask in the complex domain; the complex-valued masking operation for the target audio is expressed as:
R(S_t) = R(M) · R(S_0) − I(M) · I(S_0)
I(S_t) = R(M) · I(S_0) + I(M) · R(S_0)
where R(·) denotes the real part of a complex-valued spectrum and I(·) denotes the imaginary part.
In a preferred mode, the step (4) specifically includes:
(4.1) the encoder takes the spectrum of the mixed mono audio as input and completes the down-sampling process with a two-dimensional convolutional network; the convolution kernel size is 4 × 4, LeakyReLU is added as the activation function, a Sigmoid activation function limits the output range of the audio spectrum mask to [0, 1], which is then mapped to [−1, 1], and features are extracted from the audio spectrum;
(4.2) enhanced visual feature fusion is introduced into the decoder part: at each network layer of the up-sampling stage, the original visual features pass through a visual fusion module, which retains the 1 × 1 convolution dimension-reduction scheme used in MONO2BINAURAL and compresses the input visual features; the compressed visual features are merged with the audio features by concatenation;
(4.3) each up-sampling stage is provided with a separate visual fusion layer, screening different visual features for different up-sampling stages; the up-sampling operation is accomplished by transposed convolution, the fused audio-video features are up-sampled through 5 layers, and the predicted audio spectrum mask is finally output.
Compared with the prior art, the binaural audio generation method based on multi-modal perception has the following advantages:
(1) The fusion of the visual features is enhanced, and the visual features are multiplexed in the audio generation network so as to ensure that the audio information and the visual information are fully fused.
(2) The method as a whole is an end-to-end model that takes video frames and mono audio as input, which improves its practical feasibility.
Drawings
FIG. 1 is a schematic diagram of a binaural audio generation model based on multi-modal awareness;
FIG. 2 is a schematic diagram of the role of video analysis in binaural audio generation;
FIG. 3 is a schematic diagram of STFT transformation of an audio signal;
FIG. 4 is a schematic diagram of an Audio-Visual U-Net network;
fig. 5 is a schematic diagram of a comparison of a power supply system monitoring scene binaural audio generation result and original binaural audio.
Detailed Description
The following detailed description of the invention is provided for the purpose of further illustrating the invention and should not be construed as limiting the scope of the invention, as numerous insubstantial modifications and adaptations of the invention as described above will be apparent to those skilled in the art and are intended to be within the scope of the invention.
The invention provides a binaural audio generation method based on multi-modal sensing; an implementation is shown in figures 1-5, wherein figure 1 is a schematic diagram of the binaural audio generation model based on multi-modal perception; FIG. 2 is a schematic diagram of the role of video analysis in binaural audio generation; FIG. 3 is a schematic diagram of the STFT transformation of an audio signal; FIG. 4 is a schematic diagram of the Audio-Visual U-Net network; and fig. 5 is a schematic diagram comparing the binaural audio generated for a power supply system monitoring scene with the original binaural audio. The binaural audio generation method based on multi-modal perception is described in detail below.
With reference to the drawings, the application provides a binaural audio generation method based on multi-modal perception. Combined with the actual engineering application scene, an improved audio-video fusion analysis network specific to the power supply system monitoring process is adopted: the mono video is first encoded with an encoder-decoder structure, the video features and audio features are then fused at multiple scales, and the video and audio information is analysed collaboratively, so that the generated audio carries spatial information that the original mono audio lacks; the binaural audio corresponding to the video is finally generated. The method is introduced in detail below.
The binaural audio generation model provided by the invention and its working principle are shown in figure 1 and mainly comprise four parts: (1) visual feature extraction and analysis; (2) audio feature extraction and analysis; (3) binaural audio generation; and (4) visual and audio feature fusion.
The visual analysis module extracts features from the input video pictures, and the audio analysis module extracts features from the input mono spectrum. The visual and audio features are fused in the Audio-Visual U-Net network, and the spectral mask corresponding to the binaural audio is predicted from the composite features. The output complex-valued mask is applied to the mono spectrum to obtain the spectrum of the left-right channel audio difference, the inverse STFT of this spectrum gives the left-right channel audio difference, and from it the audio signals of the left and right channels are restored. Each part is described in detail below.
With reference to fig. 1-5, the invention provides a binaural audio generation method based on multi-modal sensing, which comprises the following steps in sequence:
First, during power supply system monitoring, a real video is acquired, and the extraction and analysis of its visual features is completed based on a convolutional neural network.
Visual features can assist deep learning tasks concerning audio content. To realize binaural audio generation, the spatial information in the real video pictures acquired during power supply system monitoring needs to be analysed: the specific positions of the different sound source objects in the scene are identified from the picture, and the direction and distance from which the sound travels to the observer are judged, so that the spatial layout of the current scene is known and the reverberation or echo conditions of the sound in the environment are determined. As shown in fig. 1, the sound source (a violin) and the listener can be located by analysing the video frames, which is very useful for binaural audio generation.
To extract the visual features, the invention uses a convolutional neural network to complete the visual analysis task in the binaural audio generation model; that is, the visual features are extracted by an image classification deep learning model. The invention adopts a pre-trained visual image network for the visual analysis and fine-tunes the pre-trained model on a specific data set to adapt it to the binaural audio generation task. This effectively reduces the cost of model training while giving the migrated model better generalization ability. There are many image analysis models based on deep convolutional neural networks, such as ResNet, DenseNet and GoogLeNet. However, these models cannot be used directly to process real video in industrial scenes. On the one hand, they each take a single image as the processing object, whereas a video is a sequence of continuous frames. On the other hand, they are designed mainly for image classification and are not suited to binaural audio generation, so they need to be modified and adapted to the visual analysis task of binaural audio generation.
Specifically, in order to solve the problem of processing continuous picture frames in video, the invention uses key picture frames to represent the power supply system monitoring video pictures within short segments. A video is divided into a number of consecutive segments of length t (t < 1.0) seconds, and for each segment the picture frame at the middle position is extracted as the key frame and used as the visual input for that segment. Since the video content within such short segments does not change significantly, the key frame can essentially reflect the overall visual situation of the segment. To adapt the model to the visual analysis task of binaural audio generation, the invention keeps the feature extraction part of the original network, removes the classifier part at the end of the network, and uses only the visual features extracted by the hidden layers of the model. The hidden layers capture high-dimensional visual features of the input image; these features are fed into the subsequent audio analysis module, and the audio-video fusion network integrates the visual features into the audio content. In addition, the invention uses transfer learning to further improve the generalization ability of the model: the network is first initialized with pre-trained weights and then fine-tuned with a small learning rate so that it adapts to the audio-video data set of the current power supply system monitoring scene. This avoids training the model from scratch with re-initialized weights, speeds up training, and improves the generalization ability of the network.
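As an illustration of this key-frame and feature-extraction step, the following sketch uses OpenCV for frame extraction and a pre-trained torchvision ResNet-18 with its classifier removed; the clip length t = 0.5 s, the ResNet-18 backbone and the preprocessing values are assumptions for illustration, not the patented configuration.

```python
# Sketch of step (1): split a monitoring video into short clips, take the middle
# frame of each clip as the key frame, and extract visual features with a
# pre-trained CNN whose classifier head has been removed.
import cv2
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

CLIP_SECONDS = 0.5  # assumed clip length t < 1.0 s

def extract_key_frames(video_path, clip_seconds=CLIP_SECONDS):
    """Return the middle frame (as an RGB array) of each clip_seconds-long segment."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames_per_clip = max(1, int(round(fps * clip_seconds)))
    key_frames = []
    for start in range(0, total, frames_per_clip):
        mid = start + frames_per_clip // 2
        if mid >= total:
            break
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid)
        ok, frame = cap.read()
        if ok:
            key_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return key_frames

# Pre-trained backbone with the final classifier removed: only hidden-layer
# features are kept. (Older torchvision versions use pretrained=True instead.)
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
feature_extractor.eval()
# Fine-tuning on the monitoring data set (not shown) would use a small learning
# rate, e.g. torch.optim.Adam(feature_extractor.parameters(), lr=1e-4).

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_features(key_frame):
    """Map one key frame to a (512,) visual feature vector."""
    x = preprocess(key_frame).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        f = feature_extractor(x)                    # (1, 512, 1, 1)
    return f.flatten(1).squeeze(0)                  # (512,)
```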
Second, during power supply system monitoring, the audio signal in the video is acquired, and time-frequency analysis is performed on it with the short-time Fourier transform to obtain the characteristics of the audio signal in the frequency domain and the time domain.
The audio signal in the original power supply system monitoring video is a time-series sampling sequence containing the signal value at each discrete sampling point. Its data format is a one-dimensional array whose length is the audio duration (T) multiplied by the audio sampling rate (S), where the sampling rate is the number of samples the recording device takes of the sound signal per second. A higher sampling rate means the original audio signal can be restored more accurately, but the amount of data per unit time increases. Common mono audio contains only one audio sequence, whereas binaural audio contains two different sequences, left and right.
The original audio signal contains only the waveform information in the time domain, and the distribution characteristics in the frequency domain cannot be obtained from it directly. To analyse the features in the frequency domain, a Fourier transform of the audio is required. However, the audio signal varies over time, and so does its frequency-domain distribution; such non-stationary signals are not well suited to ordinary Fourier analysis.
In order to acquire the characteristics of the audio in both the frequency domain and the time domain during power supply system monitoring, time-frequency analysis of the audio signal is required; the invention uses the Short-Time Fourier Transform (STFT) for this purpose. The STFT captures the signal distribution in both the time and frequency dimensions and presents the audio signal characteristics more clearly. It is expressed as:
X(m, ω) = Σ_n x[n] w[n − m] e^(−jωn)
where x[n] represents the input signal at time n and w[n] is the corresponding window function. The STFT is an extension of the traditional Fourier transform: a short segment of the time-series data is intercepted by a window function in the time dimension, and a discrete Fourier transform is then applied to the signal within the window to obtain the spectral state of that sampling frame. Stacking the spectra of successive frames along the time dimension yields the spectral change over time. The boxed area in fig. 3 shows the original audio waveform within one time frame and its corresponding distribution in the time-frequency spectrum.
The audio and video are associated through time. Along the time dimension, the spectra of adjacent frames typically have overlapping regions to avoid boundary errors between frames caused by the cutting pattern. To reduce the spectral leakage caused by truncating the signal, a window function is applied during interception: the original signal is multiplied point-wise with the window function. The window function should reach its maximum at the centre and decrease monotonically to zero on both sides, thereby reducing interference between truncated frames. Based on this, the invention selects the Hanning window function:
w[n] = 0.5 · (1 − cos(2πn / (N − 1))),  0 ≤ n ≤ N − 1
where N is the window length.
the video analysis spectrum is used as the input of the audio analysis, so that the analysis processing of the data is facilitated, the audio spectrum obtained after the STFT conversion is a two-dimensional complex value matrix, the amplitude obtained by taking the absolute value of the complex value spectrum is the corresponding spectrum amplitude, and the complex argument is the phase of the corresponding spectrum, so that the complex value spectrum simultaneously contains the information of the spectrum amplitude and the phase, and the information content is more abundant. And meanwhile, the spectrum can be subjected to further feature extraction work by using a traditional two-dimensional convolutional neural network. Because the time-frequency analysis spectrum simultaneously contains the information of the audio signal in the frequency domain and the time dimension, the characteristics of the audio component can be extracted more effectively compared with the original waveform signal with the time dimension only. The short-time Fourier transform is reversible, so that the time-frequency spectrum of the audio can be directly used as a prediction target when the audio is generated, and the original audio can be restored by performing the inverse short-time Fourier transform on the frequency spectrum.
Then, the left and right channel audio contained in the time-frequency spectrum is taken as the prediction target of the model, the audio is predicted with a deep convolutional neural network, and a self-supervised binaural audio separation method is adopted to generate the binaural audio.
The objective of the binaural audio generation task is to obtain the left and right audio channels as accurately as possible; the task can be regarded as a special audio separation task, i.e. separating the audio information corresponding to the left and right channels from the original mixed audio. Traditional solutions to audio separation include supervised and unsupervised methods. Supervised audio separation uses a deep neural network to learn the mapping between the original audio and the target audio, discovering associations between audio features through hidden layers combined with nonlinear activation functions, thereby achieving end-to-end audio generation. However, supervised methods require a large amount of labelled sample data, and the labelling is time-consuming, labour-intensive and costly. Unsupervised audio separation takes the acoustic characteristics of the original audio as the basis for separation and uses techniques such as non-negative matrix factorization. However, unsupervised methods are difficult to generalize to complex audio environments; for example, when the environment contains a large amount of unknown noise, they are difficult to use in realistic settings.
The invention adopts a self-supervised binaural audio separation method to realize binaural audio generation. The method takes the left and right channel audio contained in the time-frequency spectrum as the prediction target of the model and predicts the audio with a deep convolutional neural network. The left and right channel audio used as the training target is naturally present in video data, so any video containing two-channel audio can be used as training data for the binaural audio separation task, and no additional manual labelling is needed.
For audio separation tasks, spectral masking is a common means of achieving spectral separation. If the original time-series signal of the audio were taken directly as the model input and output, the model could not fully analyse the audio content; it would be difficult to converge and could not produce an accurate audio output. Therefore the time-frequency spectrum of the audio is taken as the prediction target at the output stage, and the prediction is performed through a time-frequency mask. The spectral mask is a matrix of the same size as the input spectrum; the target spectrum S_t is obtained by the product of the original spectrum S_0 and the mask M:
S_t = M · S_0    (3)
Using the spectral mask as the prediction object of the binaural audio generation model reduces the amount of information the model has to learn, and a mask with a stable value distribution is also easier for the model to learn and converge on.
The spectral masking here adopts ideal complex-valued masking, in which the masking operation is performed directly in the complex domain, reducing the amount of computation of the model. Complex-valued masking requires the product of the original spectrum and the mask in the complex domain; the complex-valued masking operation for the target audio is expressed as:
R(S_t) = R(M) · R(S_0) − I(M) · I(S_0)    (4)
I(S_t) = R(M) · I(S_0) + I(M) · R(S_0)    (5)
where R(·) denotes the real part of a complex-valued spectrum and I(·) denotes the imaginary part.
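A minimal sketch of applying such a complex-valued mask follows, assuming the predicted mask is produced as two real channels (real and imaginary parts); the tensor shapes and the use of PyTorch are assumptions for illustration.

```python
# Complex-valued spectral masking: the predicted mask M is multiplied with the
# original complex spectrum S0 in the complex domain, i.e.
# R(St) = R(M)R(S0) - I(M)I(S0)  and  I(St) = R(M)I(S0) + I(M)R(S0).
import torch

def apply_complex_mask(mask_ri: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
    """mask_ri: (2, F, T) real/imaginary mask channels; spec: (F, T) complex spectrum."""
    mask = torch.complex(mask_ri[0], mask_ri[1])
    return mask * spec        # element-wise complex product = the two equations above

def apply_complex_mask_explicit(mask_ri: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
    """Equivalent explicit form, matching the equations term by term."""
    r0, i0 = spec.real, spec.imag
    rm, im = mask_ri[0], mask_ri[1]
    real = rm * r0 - im * i0
    imag = rm * i0 + im * r0
    return torch.complex(real, imag)
```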
A general audio separation task may have multiple separation targets, but the binaural audio generation task has only two, the left and right channel audio, and the scope of the separation targets is clear, namely the audio corresponding to the left and right parts of the picture. Binaural audio generation can therefore exploit this constraint to improve model performance.
Left-right channel audio difference S in real data D (t) can be expressed as:
S D (t)=S L (t)–S R (t) (6)
mixed audio S of known input M (t) is:
S M (t)=S L (t)+S R (t) (7)
can be combined with the audio S M (t) and predicted Audio DifferenceRestoring to obtain audio of left and right channels:
thus, the prediction target of the model is the audio differenceAnd as far as possible make +.>And S is equal to D The error of (t) is small.
Restoring the left and right channels by predicting the audio difference better matches the characteristics of binaural audio. Binaural audio achieves its sense of space precisely through the difference between the left and right channels; taking this difference as the prediction target forces the model to focus on it and yields more realistic left and right channel audio. Moreover, compared with predicting the complete audio, a model that predicts the audio difference has less to learn, so it converges faster and achieves a better prediction result.
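A minimal sketch of this channel restoration from the mixed audio and the predicted difference follows; tensor shapes and the dummy signals are assumptions for illustration.

```python
# Recover the two channels from the mixed (mono) audio and the predicted
# left-right difference, following S_L = (S_M + S_D)/2 and S_R = (S_M - S_D)/2.
# The same arithmetic applies whether the inputs are waveforms or spectra.
import torch

def recover_channels(mixed: torch.Tensor, predicted_diff: torch.Tensor):
    """mixed = S_L + S_R, predicted_diff ≈ S_L - S_R  ->  (left, right)."""
    left = 0.5 * (mixed + predicted_diff)
    right = 0.5 * (mixed - predicted_diff)
    return left, right

# Example with dummy waveforms:
s_l = torch.randn(16000)
s_r = torch.randn(16000)
mix, diff = s_l + s_r, s_l - s_r
left, right = recover_channels(mix, diff)   # equals (s_l, s_r) up to rounding
```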
Finally, during power supply system monitoring, the audio and video are fused by a fusion analysis network: the encoder takes the spectrum of the mixed mono audio as input and completes down-sampling with a two-dimensional convolutional network, realizing the extraction of high-level features from the audio spectrum; the decoder up-samples the high-level features, enhanced visual feature fusion is introduced, the introduced visual features are compressed, and they are merged with the audio features by concatenation.
U-Net consists of two symmetrical networks, an encoder and a decoder: the encoder is a multi-layer convolutional neural network that down-samples the input data to extract high-level features, and the decoder up-samples the high-level features through transposed convolution operations, restoring the highly compressed data features to an output of the original size. MONO2BINAURAL is a U-Net-based deep learning model for binaural audio generation; it retains most of the structure of the original U-Net but introduces visual features at the input stage of the audio decoder. The invention designs a U-Net-like network architecture to realize audio analysis and binaural audio generation, strengthening the fusion of visual features and multiplexing them in the audio generation network so that audio information and visual information are fully fused.
The invention further improves the network structure of MONO2BINAURAL and proposes a new binaural audio generation model, Audio-Visual U-Net, that fuses audio and video: it strengthens the fusion of visual features and multiplexes them in the audio generation network, ensuring that audio information and visual information are fully fused.
As shown in fig. 4, the Audio-Visual U-Net model also contains two modules, an encoder and a decoder, each comprising a 5-layer convolutional neural network. The encoder is similar to the traditional U-Net: it takes the spectrum of the mixed mono audio as input and completes the down-sampling process with a two-dimensional convolutional network using a 4 × 4 kernel, LeakyReLU as the activation function and Batch Normalization; a Sigmoid activation function limits the output range to [0, 1], which is then mapped to [−1, 1], realizing the extraction of features from the audio spectrum. Enhanced visual feature fusion is introduced in the decoder: at each network layer of the up-sampling stage, the original visual features pass through a Visual Fusion module, which retains the 1 × 1 convolution dimension-reduction scheme used in MONO2BINAURAL and compresses the input visual features; the compressed visual features are merged with the audio features by concatenation. At the same time, the model strengthens visual feature fusion by extending the visual fusion module to every network layer of the up-sampling stage, realizing multi-scale, multi-layer fusion of visual features. Each up-sampling stage has its own visual fusion layer, which can screen different visual features for different stages. Different network layers in the up-sampling stage tend to focus on data features of different scales, and a single dimension-reduced visual feature input carries less information and cannot fully meet the information requirements at different scales; selecting suitable visual features for each up-sampling stage therefore lets different network layers attend to different visual features and makes more efficient use of the input visual features. The up-sampling stage also retains skip connections, so the input of each up-sampling layer is the fusion of the same-stage down-sampling features, the up-sampling output of the previous stage, and the dimension-reduced visual features. The up-sampling operation is accomplished by transposed convolution; the fused audio-video features are up-sampled through 5 layers, and the predicted audio spectrum mask is finally output.
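The following hedged sketch illustrates an encoder-decoder of this kind with per-layer visual fusion; the channel counts, decoder activation, fusion width and input spectrum size are assumptions for illustration, not the patented configuration.

```python
# Audio-Visual U-Net-style generator sketch: a 5-layer convolutional encoder over
# the mono spectrum (4x4 kernels, stride 2, LeakyReLU + BatchNorm), a 5-layer
# transposed-convolution decoder with skip connections, and a visual fusion step
# at every decoder layer (1x1 convolution to compress the visual feature,
# spatial tiling, channel-wise concatenation).
import torch
import torch.nn as nn

class VisualFusion(nn.Module):
    """Compress a (B, Dv) visual vector with a 1x1 conv and tile it to (B, C, H, W)."""
    def __init__(self, visual_dim: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv2d(visual_dim, out_ch, kernel_size=1)

    def forward(self, visual: torch.Tensor, h: int, w: int) -> torch.Tensor:
        v = visual.view(visual.size(0), -1, 1, 1)        # (B, Dv, 1, 1)
        v = self.reduce(v)                               # (B, out_ch, 1, 1)
        return v.expand(-1, -1, h, w)                    # tile over the spectrum grid

class AudioVisualUNet(nn.Module):
    def __init__(self, visual_dim: int = 512, fuse_ch: int = 16):
        super().__init__()
        enc_chs = [2, 32, 64, 128, 256, 512]             # input = real/imag spectrum channels
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Conv2d(enc_chs[i], enc_chs[i + 1], 4, stride=2, padding=1),
                          nn.BatchNorm2d(enc_chs[i + 1]),
                          nn.LeakyReLU(0.2, inplace=True))
            for i in range(5)
        ])
        # One independent visual fusion layer per decoder stage.
        self.fusions = nn.ModuleList([VisualFusion(visual_dim, fuse_ch) for _ in range(5)])
        dec_in = [512 + fuse_ch, 256 + 256 + fuse_ch, 128 + 128 + fuse_ch,
                  64 + 64 + fuse_ch, 32 + 32 + fuse_ch]
        dec_out = [256, 128, 64, 32, 2]                  # final 2 channels = real/imag mask
        self.decoders = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(dec_in[i], dec_out[i], 4, stride=2, padding=1),
                          nn.ReLU(inplace=True) if i < 4 else nn.Identity())
            for i in range(5)
        ])

    def forward(self, spec_ri: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Encoder: down-sample and keep skip features.
        skips = []
        x = spec_ri
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        # Decoder: at each stage fuse (previous output, same-stage skip, tiled visual feature).
        x = None
        for i, dec in enumerate(self.decoders):
            skip = skips[-1 - i]
            parts = [skip] if x is None else [x, skip]
            parts.append(self.fusions[i](visual, skip.size(2), skip.size(3)))
            x = dec(torch.cat(parts, dim=1))
        # Sigmoid limits the mask to [0, 1]; the affine map stretches it to [-1, 1].
        return 2.0 * torch.sigmoid(x) - 1.0

# Dummy forward pass: 2-channel (real/imag) mono spectrum and a 512-d visual feature.
model = AudioVisualUNet()
mask = model(torch.randn(1, 2, 256, 64), torch.randn(1, 512))   # -> (1, 2, 256, 64)
```

With a 256 × 64 input spectrum, each of the five stride-2 stages halves both dimensions, so the decoder mirrors the encoder exactly and the skip connections line up stage by stage.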
The training target of the Audio-Visual U-Net model is the STFT spectrum corresponding to the difference between the left and right channels, and the mean square error between the target spectrum and the predicted spectrum is used as the loss function.
Since the real and imaginary parts of the complex-valued spectrum are separated into two input channels, the loss function is computed in the real domain.
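A minimal sketch of such a loss, computed on real/imaginary parts stacked as two real channels, follows; the tensor shapes are assumptions for illustration.

```python
# Training objective sketch: mean squared error between the predicted and target
# difference spectra, with the complex product of mask and mono spectrum written
# entirely with real tensors (channel 0 = real part, channel 1 = imaginary part).
import torch
import torch.nn.functional as F

def spectrum_mse_loss(pred_mask_ri: torch.Tensor,
                      mono_spec_ri: torch.Tensor,
                      target_diff_ri: torch.Tensor) -> torch.Tensor:
    """All tensors: (B, 2, F, T)."""
    pr, pi = pred_mask_ri[:, 0], pred_mask_ri[:, 1]
    sr, si = mono_spec_ri[:, 0], mono_spec_ri[:, 1]
    pred_diff = torch.stack([pr * sr - pi * si, pr * si + pi * sr], dim=1)
    return F.mse_loss(pred_diff, target_diff_ri)
```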
Table 1 shows the binaural audio generation performance of the model of the invention and the comparison with baseline models; the mono audio baseline is computed directly from the audio signal without training and has no validation loss. The experimental results show that the proposed model outperforms the other methods on the STFT distance and ENV distance metrics. (1) Comparing the mono audio baseline with the model without vision, the model that uses the audio data performs better than the raw mono audio baseline, indicating that the audio analysis module can exploit features in the audio content to assist binaural audio generation; audio features alone, however, are not sufficient for good binaural generation. (2) Comparing the model without vision with the model of the invention shows that, because visual features are provided, the model using visual analysis spatializes the audio better and generates more realistic binaural audio. (3) The performance of the method is also superior to the MONO2BINAURAL model, showing that the Audio-Visual U-Net fuses visual and audio features better, makes better use of the mixed features, and more fully mines the connection between them.
Table 1: Binaural audio generation performance comparison
Fig. 5 compares the binaural audio waveforms generated by the model with the original binaural audio in the data set, showing a segment of the left and right channel waveforms corresponding to a power supply system monitoring video: the blue waveform is the ground truth and the orange waveform is the audio predicted by the model. The comparison shows that the predicted waveform is basically consistent with the actual waveform, and for audio segments with obvious differences between the left and right channels, the prediction reflects the result of predicting the left-right difference. In the red framed area of fig. 5, the waveforms differ significantly between the two channels, and some features appear only in the left channel audio and are absent in the right channel; the predicted audio essentially restores this difference. This difference between the left and right channels gives the audio a sense of space, improving the quality of the binaural audio generated during power supply system monitoring.
The multi-mode perception-based binaural audio generation method provided by the invention can be processed in a computer device, and the processing device can be the computer device for executing the method, wherein the computer device can comprise one or more processors, such as one or more Central Processing Units (CPU), and each processing unit can realize one or more hardware threads. The computer device may also include any memory for storing any kind of information such as code, settings, data, etc. For example, and without limitation, the memory may include any one or more of the following combinations: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may store information using any technique. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of a computer device. In one case, the computer device may perform any of the operations of the associated instructions when the processor executes the associated instructions stored in any memory or combination of memories. The computer device also includes one or more drive mechanisms for interacting with any memory, such as a hard disk drive mechanism, optical disk drive mechanism, and the like.
The computer device may also include an input/output module (I/O) for receiving various inputs (via an input device) and providing various outputs (via an output device). One particular output mechanism may include a presentation device and an associated graphical user interface (GUI). In other embodiments, the input/output module (I/O), input device and output device may be omitted, with the apparatus implemented as a single computer device in a network. The computer device may also include one or more network interfaces for exchanging data with other devices via one or more communication links. One or more communication buses couple the above-described components together.
The communication link may be implemented in any manner, for example, through a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. The communication link may comprise any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
Although exemplary embodiments of the present invention have been described for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions in form and detail can be made without departing from the scope and spirit of the invention as disclosed in the accompanying claims; all such modifications are intended to fall within that scope, and the various features of the claimed product and the various steps of the claimed method may be combined in any combination. Therefore, the description of the disclosed embodiments is not intended to limit the scope of the invention but to describe it. Accordingly, the scope of the invention is not limited by the above embodiments but is defined by the claims or their equivalents.

Claims (7)

1. A binaural audio generation method based on multi-modal sensing, characterized by comprising the following steps in sequence:
(1) In the monitoring process of the power supply system, acquiring a real video, and completing the extraction and analysis of visual characteristics based on a convolutional neural network;
(2) Acquiring an audio signal in a video, and performing time-frequency analysis on the audio signal by utilizing short-time Fourier transform to obtain the characteristics of the audio signal in a frequency domain and a time domain;
(3) The left and right channel audio contained in the time-frequency spectrum is used as the prediction target of the model, the prediction of the audio is realized through a deep convolutional neural network, and a self-supervised binaural audio separation method is adopted to generate the binaural audio;
(4) Fusing the audio and video through a fusion analysis network: the encoder takes the spectrum of the mixed mono audio as input, and a two-dimensional convolutional network completes the down-sampling, realizing the extraction of high-level features from the audio spectrum; the decoder up-samples the high-level features, enhanced visual feature fusion is introduced, the introduced visual features are compressed, and they are merged with the audio features by concatenation;
wherein, the step (1) specifically comprises:
(1.1) acquiring a real video, and dividing the real video into a plurality of continuous video segments with the length of t seconds, wherein t <1.0;
(1.2) for each video clip, extracting the picture frame at the intermediate position as the key frame and using it as the visual input for that video clip;
(1.3) initializing the convolutional neural network with pre-trained weights and then fine-tuning the model with a learning rate;
(1.4) retaining the feature extraction part of the original network in the convolutional neural network model, removing the classifier part at the end of the network, and keeping only the visual features extracted by the hidden layers of the model.
2. The method of claim 1, wherein: in step (1), the extraction and analysis of visual features based on the convolutional neural network specifically comprises: extracting visual features with a pre-trained image classification deep learning model to complete the visual analysis.
3. The method of claim 2, wherein: the pre-trained image classification deep learning model is modified and adjusted on the data set to adapt to the binaural audio generation task.
4. The method of claim 1, wherein: the step (2) specifically comprises:
carrying out time-frequency analysis on the audio signal by short-time Fourier transform, specifically realized by the following formula:
X(m, ω) = Σ_n x[n] w[n − m] e^(−jωn)
where x[n] represents the input signal at time n and w[n] is the Hanning window function.
5. The method of claim 1 or 3 or 4, wherein: the step (3) specifically comprises:
instead of taking the original time-series audio signal directly as the model input and output, prediction is carried out through a time-frequency mask, and the time-frequency spectrum of the audio is taken as the prediction target at the model output;
spectral masking is used as the prediction object of the binaural audio generation model, wherein the spectral mask M is a matrix of the same size as the input spectrum, and the target spectrum S_t is obtained by the product of the original spectrum S_0 and the mask M:
S_t = M · S_0
the left-right channel audio difference S_D(t) in the real data is expressed as:
S_D(t) = S_L(t) − S_R(t)
the mixed audio S_M(t) of the known input is:
S_M(t) = S_L(t) + S_R(t)
the audio of the left and right channels is then restored from the mixed audio S_M(t) and the predicted audio difference Ŝ_D(t):
S_L(t) = (S_M(t) + Ŝ_D(t)) / 2
S_R(t) = (S_M(t) − Ŝ_D(t)) / 2
the prediction target of the model is the audio difference Ŝ_D(t), and the error between Ŝ_D(t) and S_D(t) is made as small as possible.
6. The method of claim 5, wherein: the spectral masking adopts masking operations directly in the complex domain; complex-valued masking requires the product of the original spectrum and the mask in the complex domain, and the complex-valued masking operation for the target audio is expressed as:
R(S_t) = R(M) · R(S_0) − I(M) · I(S_0)
I(S_t) = R(M) · I(S_0) + I(M) · R(S_0)
where R(·) denotes the real part of a complex-valued spectrum and I(·) denotes the imaginary part.
7. The method of claim 6, wherein: the step (4) specifically comprises:
(4.1) the encoder takes the spectrum of the mixed mono audio as input and completes the down-sampling process with a two-dimensional convolutional network; the convolution kernel size is 4 × 4, LeakyReLU is added as the activation function, a Sigmoid activation function limits the output range of the audio spectrum mask to [0, 1], which is then mapped to [−1, 1], and features are extracted from the audio spectrum;
(4.2) enhanced visual feature fusion is introduced into the decoder part: at each network layer of the up-sampling stage, the original visual features pass through a visual fusion module, which retains the 1 × 1 convolution dimension-reduction scheme used in MONO2BINAURAL and compresses the input visual features; the compressed visual features are merged with the audio features by concatenation;
(4.3) each up-sampling stage is provided with a separate visual fusion layer, screening different visual features for different up-sampling stages; the up-sampling operation is accomplished by transposed convolution, the fused audio-video features are up-sampled through 5 layers, and the predicted audio spectrum mask is finally output.
CN202310329306.0A 2023-03-30 2023-03-30 Dual-channel audio generation method based on multi-modal sensing Active CN116456262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310329306.0A CN116456262B (en) 2023-03-30 2023-03-30 Dual-channel audio generation method based on multi-modal sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310329306.0A CN116456262B (en) 2023-03-30 2023-03-30 Dual-channel audio generation method based on multi-modal sensing

Publications (2)

Publication Number Publication Date
CN116456262A CN116456262A (en) 2023-07-18
CN116456262B true CN116456262B (en) 2024-01-23

Family

ID=87132998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310329306.0A Active CN116456262B (en) 2023-03-30 2023-03-30 Dual-channel audio generation method based on multi-modal sensing

Country Status (1)

Country Link
CN (1) CN116456262B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113099374A (en) * 2021-03-30 2021-07-09 四川省人工智能研究院(宜宾) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN113221900A (en) * 2021-04-29 2021-08-06 上海海事大学 Multimode video Chinese subtitle recognition method based on densely connected convolutional network
CN113254713A (en) * 2021-05-17 2021-08-13 北京航空航天大学 Multi-source emotion calculation system and method for generating emotion curve based on video content

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989845B (en) * 2015-02-25 2020-12-08 杜比实验室特许公司 Video content assisted audio object extraction

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113099374A (en) * 2021-03-30 2021-07-09 四川省人工智能研究院(宜宾) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN113221900A (en) * 2021-04-29 2021-08-06 上海海事大学 Multimode video Chinese subtitle recognition method based on densely connected convolutional network
CN113254713A (en) * 2021-05-17 2021-08-13 北京航空航天大学 Multi-source emotion calculation system and method for generating emotion curve based on video content

Also Published As

Publication number Publication date
CN116456262A (en) 2023-07-18

Similar Documents

Publication Publication Date Title
Morgado et al. Self-supervised generation of spatial audio for 360 video
Joze et al. MMTM: Multimodal transfer module for CNN fusion
Gao et al. 2.5 d visual sound
Ephrat et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation
Afouras et al. The conversation: Deep audio-visual speech enhancement
Xu et al. Visually informed binaural audio generation without binaural audios
Takahashi et al. Densely connected multi-dilated convolutional networks for dense prediction tasks
Vasudevan et al. Semantic object prediction and spatial sound super-resolution with binaural sounds
US11663823B2 (en) Dual-modality relation networks for audio-visual event localization
US10701303B2 (en) Generating spatial audio using a predictive model
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Zhu et al. Visually guided sound source separation using cascaded opponent filter network
Cobos et al. An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction
Montesinos et al. Vovit: Low latency graph-based audio-visual voice separation transformer
Rachavarapu et al. Localize to binauralize: Audio spatialization from visual sound source localization
Zhu et al. Visually guided sound source separation and localization using self-supervised motion representations
Wu et al. Binaural audio-visual localization
Rahman et al. Weakly-supervised audio-visual sound source detection and separation
CN116456262B (en) Dual-channel audio generation method based on multi-modal sensing
US11308329B2 (en) Representation learning from video with spatial audio
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
Luo et al. Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments.
US20230162725A1 (en) High fidelity audio super resolution
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
Cheng et al. Improving multimodal speech enhancement by incorporating self-supervised and curriculum learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant