CN116312607A - Training method for audio-visual voice separation model, electronic device and storage medium - Google Patents

Training method for audio-visual voice separation model, electronic device and storage medium Download PDF

Info

Publication number
CN116312607A
CN116312607A (Application CN202211573033.6A)
Authority
CN
China
Prior art keywords
audio
visual
separation model
training
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211573033.6A
Other languages
Chinese (zh)
Inventor
钱彦旻
吴逸飞
李晨达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202211573033.6A priority Critical patent/CN116312607A/en
Publication of CN116312607A publication Critical patent/CN116312607A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the invention provides a training method for an audio-visual speech separation model, an electronic device and a storage medium. The method comprises the following steps: inputting mixed training audio of a plurality of speakers into an audio-visual speech separation model to obtain predicted spectrograms of the plurality of speakers; determining predicted speaker audio-visual features of the predicted spectrograms and reference speaker audio-visual features of reference spectrograms of the mixed training audio; and, based on a cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, training the audio-visual speech separation model under a mixed-precision quantization condition with the cross-modal loss by the cross-direction multiplier method (ADMM), so as to obtain a lightweight audio-visual speech separation model. According to the embodiment of the invention, the lightweight audio-visual speech separation model is trained by quantization tuning of the model based on the cross-direction multiplier method; because the multi-modal model can fully exploit the different quantization sensitivities of the different modalities, a balance between the computation amount and the performance of the lightweight audio-visual speech separation model is ensured.

Description

Training method for audio-visual voice separation model, electronic device and storage medium
Technical Field
The present invention relates to the field of intelligent speech, and in particular, to a training method for an audio-visual speech separation model, an electronic device, and a storage medium.
Background
With the development of speech technology, audio-visual speech separation systems that use multiple modalities exhibit separation performance superior to that of audio-only speech processing systems. However, multi-modal audio-visual speech separation systems require a significant amount of computational resources. On the one hand, audio-visual speech separation systems need a large number of parameters to model the modalities and their dependencies, which increases memory consumption. On the other hand, fusing information from both modalities requires computing larger feature maps and therefore more floating-point operations.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the related art:
Because multi-modal audio-visual speech separation systems require a large amount of computational resources, they cannot readily be used on low-resource devices. Typical approaches for applying a multi-modal audio-visual speech separation system to a low-resource device are: 1. The STE (straight-through estimator) applies the quantizer in the forward computation of the model but ignores it when back-propagating gradients, so the optimization objective is not exactly the one desired. 2. Differentiable quantizers approximate the quantization step function with a more complex differentiable function so that gradients can back-propagate normally, but training then very easily fails to converge to a good solution.
Disclosure of Invention
In order to at least solve the problem that multi-modal audio-visual speech separation systems in the prior art are difficult to apply to low-resource devices, in a first aspect, an embodiment of the present invention provides a training method for an audio-visual speech separation model, including:
inputting mixed training audio of a plurality of speakers to an audio-visual voice separation model to obtain a prediction spectrogram of the plurality of speakers;
determining predicted speaker audiovisual features of the predicted spectrogram and reference speaker audiovisual features of the reference spectrogram of the mixed training audio;
based on the predicted speaker audio-visual characteristics and the cross-modal loss determined by the reference speaker audio-visual characteristics, training the audio-visual voice separation model under the mixed precision quantization condition by utilizing the cross-modal loss through a cross direction multiplier method to obtain a lightweight audio-visual voice separation model.
In a second aspect, an embodiment of the present invention provides a training system for an audio-visual speech separation model, including:
the prediction program module is used for inputting the mixed training audio of a plurality of speakers to the audio-visual voice separation model to obtain the prediction spectrograms of the plurality of speakers;
the audio-visual characteristic determining program module is used for determining the predicted speaker audio-visual characteristics of the predicted spectrogram and the reference speaker audio-visual characteristics of the reference spectrogram of the mixed training audio;
and the lightweight training program module is used for training the audio-visual voice separation model under the mixed precision quantization condition by utilizing the cross-modal loss through a cross direction multiplier method based on the cross-modal loss determined by the predicted speaker audio-visual characteristics and the reference speaker audio-visual characteristics to obtain a lightweight audio-visual voice separation model.
In a third aspect, there is provided an electronic device, comprising: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method of the audio-visual speech separation model of any one of the embodiments of the invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the training method of the audio-visual speech separation model of any of the embodiments of the present invention.
The embodiment of the invention has the beneficial effects that: the model is quantitatively optimized based on the cross direction multiplier method to train a lightweight audio-visual voice separation model, the model can be applied to low-resource equipment with weak computing capacity, the characteristics of different modes with different quantization sensitivities can be fully utilized through the multi-mode model, the balance of the computing capacity and the performance of the lightweight audio-visual voice separation model is ensured, and the model can be applied to the low-resource equipment to obtain better voice separation performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings may be obtained according to these drawings without the need for inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a training method for an audio-visual speech separation model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a network structure of visual voice of a training method of an audio-visual voice separation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fixed quantization model and fine adjustment results of different parts thereof according to a training method of an audio-visual speech separation model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a hybrid accuracy fine tuning result of a training method of an audio-visual speech separation model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the precise combination of layers based on KL divergence of a training method of an audio-visual speech separation model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a training system for audio-visual speech separation model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device for training an audio-visual speech separation model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a training method of an audio-visual speech separation model according to an embodiment of the present invention, including the following steps:
s11: inputting mixed training audio of a plurality of speakers to an audio-visual voice separation model to obtain a prediction spectrogram of the plurality of speakers;
s12: determining predicted speaker audiovisual features of the predicted spectrogram and reference speaker audiovisual features of the reference spectrogram of the mixed training audio;
s13: based on the predicted speaker audio-visual characteristics and the cross-modal loss determined by the reference speaker audio-visual characteristics, training the audio-visual voice separation model under the mixed precision quantization condition by utilizing the cross-modal loss through a cross direction multiplier method to obtain a lightweight audio-visual voice separation model.
In this embodiment, the audio-visual speech separation model of the method is the multi-task learning framework VisualVoice. During audio-visual speech separation, it receives the mixed speech of K speakers together with the lip videos of the K speakers while they speak, and additionally takes K facial images as input for computing speaker identity embeddings and the cross-modal loss. For the lips, the video containing the lip region is processed by a lip-reading network consisting of a 3D convolutional network and ShuffleNet V2, and a temporal convolutional network then extracts the lip feature sequence. For the face, ResNet-18 extracts a speaker identity embedding from the facial input. For the audio, a U-Net convolutional network processes the complex spectrogram of the input audio mixture. The hidden audio features in the middle of the U-Net are concatenated with the lip feature sequence and the face features along the channel dimension to obtain fused lip-face-audio audio-visual features, which are applied to the input spectrogram to obtain the predicted spectrograms of the K speakers.
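The following is a minimal sketch (in PyTorch-style Python) of the channel-dimension audio-visual fusion step described above; the module name, channel sizes, and the assumption that the lip feature sequence has already been resampled to the spectrogram frame rate are illustrative choices, not taken from the patent.

import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_ch=512, lip_ch=512, face_ch=128, out_ch=512):
        super().__init__()
        # 1x1 convolution that mixes the concatenated modalities back to out_ch channels
        self.mix = nn.Conv2d(audio_ch + lip_ch + face_ch, out_ch, kernel_size=1)

    def forward(self, audio_feat, lip_feat, face_emb):
        # audio_feat: (B, audio_ch, F, T)  hidden feature map from the middle of the U-Net
        # lip_feat:   (B, lip_ch, T)       lip feature sequence (assumed resampled to T frames)
        # face_emb:   (B, face_ch)         speaker identity embedding from the face network
        _, _, n_freq, n_frames = audio_feat.shape
        lip = lip_feat.unsqueeze(2).expand(-1, -1, n_freq, -1)              # tile over frequency
        face = face_emb[:, :, None, None].expand(-1, -1, n_freq, n_frames)  # tile over time/freq
        fused = torch.cat([audio_feat, lip, face], dim=1)                   # concatenate on channel dim
        return self.mix(fused)                                              # fused audio-visual features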
For step S11, at the time of training, mixed training audio of a plurality of speakers and a reference spectrogram of the mixed training audio are prepared. For training of the audio-visual speech separation model, mixed training audio of a plurality of speakers is input to the audio-visual speech separation model, and prediction spectrograms of the plurality of speakers are predicted by using the audio-visual speech separation model.
For step S12, a voice attribute analysis network module is introduced to determine reference speaker audiovisual features and predicted speaker audiovisual features from the reference spectrogram and the predicted spectrogram, the features including speaker identity embedding to learn correlations between audio and video embeddings.
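As one possible illustration of the cross-modal loss used later in step S13, the sketch below compares speaker embeddings extracted from the predicted spectrograms with those extracted from the reference spectrograms; the triplet/margin formulation and all names are assumptions made for illustration, not necessarily the exact loss of this embodiment.

import torch
import torch.nn.functional as F

def cross_modal_loss(pred_emb, ref_emb, margin=0.5):
    # pred_emb, ref_emb: (B, K, D) speaker embeddings for the K speakers in each mixture
    pred = F.normalize(pred_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    pos = (pred * ref).sum(-1)                         # similarity with the matching speaker
    neg = (pred * ref.roll(shifts=1, dims=1)).sum(-1)  # similarity with another speaker
    return F.relu(margin + neg - pos).mean()           # the matching pair should win by the margin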
For step S13, although VisualVoice achieves leading performance on many audio-visual datasets, including VoxCeleb2 and LRS2, its computational cost and model size are generally unacceptable for small smart devices with limited compute. The method therefore uses quantization to reduce the computation and storage cost of VisualVoice and obtain a lightweight version of it. The target deployment environment is a small smart device such as a voice recorder: a low-compute device that must run without a network connection and, due to hardware limitations, cannot execute a complex neural network for speech processing. To give such low-compute smart devices speech processing capabilities, they are typically equipped with lightweight models.
The neural network quantization of the audio-visual speech separation model is defined as follows: given a neural network N with parameters W = {W_1, W_2, …, W_L}, where W_i denotes the parameters of the i-th layer, the goal of training is to find, for each W_i, a scale factor α_i and a quantized integer matrix Q_i whose entries lie on the integer grid determined by b_i bits (e.g. {−2^{b_i−1}, …, 2^{b_i−1}−1}), such that α_i·Q_i approximates W_i; here b_i denotes the bit precision of the i-th layer and b_i ∈ B, the effective set of precisions.
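A minimal sketch of the per-layer quantization defined above follows: for the weights W_i of one layer, a scale factor α_i and an integer matrix Q_i on the b_i-bit grid are fitted alternately so that α_i·Q_i approximates W_i. The alternating least-squares routine is an illustrative choice, not the exact formulation hidden in the original figures.

import torch

def quantize_layer(w, bits, iters=10):
    # Fit alpha (scale) and q (integer matrix on the b-bit grid) so that alpha * q ~ w.
    qmax = 2 ** (bits - 1) - 1
    alpha = w.abs().max() / qmax + 1e-12                  # initial scale from the weight range
    q = torch.round(w / alpha).clamp(-qmax - 1, qmax)
    for _ in range(iters):                                # alternate: refit scale, re-project
        alpha = ((w * q).sum() / (q * q).sum().clamp_min(1e-12)).clamp_min(1e-12)
        q = torch.round(w / alpha).clamp(-qmax - 1, qmax)
    return alpha, q

# Example: quantize one weight matrix to 4 bits and measure the squared quantization error.
w = torch.randn(256, 512)
alpha, q = quantize_layer(w, bits=4)
error = (w - alpha * q).pow(2).sum()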
As an embodiment, the training of the audio-visual speech separation model by using the cross-modal loss through the cross-direction multiplier method is as follows:
L_ρ(W, G, λ) = f(W) + (ρ/2)·‖W − G − λ‖² − (ρ/2)·‖λ‖²
wherein W denotes the layer-1 to layer-L parameters of the audio-visual speech separation model, G = {α_i·Q_i}_{i=1}^{L}, α_i is the scale factor and Q_i the quantized integer matrix of the i-th layer of the audio-visual speech separation model, λ is the Lagrangian multiplier, and ρ is a hyper-parameter of the audio-visual speech separation model. In this embodiment, the method uses an ADMM-based (Alternating Direction Method of Multipliers, i.e. cross-direction multiplier method) optimization algorithm for the first time to quantize an audio-visual multi-modal speech separation system. Neural network quantization is treated as a non-convex optimization problem with discrete constraints, which allows the training process to extend naturally from fixed-precision fine-tuning to mixed-precision quantization.
The specific algorithm (one training step) is as follows; the exact update formulas appear in the original figures and are summarized here:

procedure TRAINONESTEP
    Update the full-precision parameters W by the extra-gradient method on L_ρ(W, G, λ) (a prediction step followed by a correction step, with learning rates η1 and η2).
    Iteratively update {α_i} and {Q_i}, initialized from their current values:
        while {α_i} and {Q_i} are not converged do
            alternately refit the scale factors {α_i} and re-project the integer matrices {Q_i}
        end while
    The final {α_i} and {Q_i} are denoted {α_i*} and {Q_i*}, giving G = {α_i*·Q_i*}, after which the Lagrangian multiplier λ is updated.
end procedure
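A minimal sketch of one such training step is given below, reusing quantize_layer from the earlier sketch. A plain gradient step stands in for the extra-gradient update, the update formulas hidden in the original figures are not reproduced, and names such as loss_fn and state are illustrative assumptions. The multiplier update direction follows the sign convention of the penalty term ‖W − G − λ‖² above.

import torch

def admm_train_one_step(model, loss_fn, batch, state, bits_per_layer, rho=100.0, lr=5e-6):
    params = [p for p in model.parameters() if p.requires_grad]
    # state["G"]: list of alpha_i * Q_i tensors; state["lambda"]: list of multipliers,
    # both initialized beforehand (e.g. G via quantize_layer, lambda as zeros).
    task_loss = loss_fn(model, batch)                        # f(W), e.g. separation + cross-modal loss
    penalty = sum(rho / 2 * (p - g - lam).pow(2).sum()
                  for p, g, lam in zip(params, state["G"], state["lambda"]))
    (task_loss + penalty).backward()
    with torch.no_grad():
        for p in params:                                     # 1) gradient step on W
            p -= lr * p.grad
            p.grad = None
        for i, p in enumerate(params):                       # 2) update G = {alpha_i * Q_i}
            alpha, q = quantize_layer(p - state["lambda"][i], bits_per_layer[i])
            state["G"][i] = alpha * q
        for i, p in enumerate(params):                       # 3) Lagrangian multiplier update
            state["lambda"][i] += state["G"][i] - p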
Through the optimization and quantization of this algorithm, the memory footprint and the computation cost of the neural network of the audio-visual speech separation model are greatly reduced.
Although quantization-aware tuning can reduce the memory footprint and computation cost of the neural network of the audio-visual speech separation model, extreme quantization also degrades the model's performance. This application finds that different layers of the model have different sensitivities to quantization error, so a mixed-precision quantization strategy can effectively improve the performance of the audio-visual speech separation model while keeping the model lightweight.
In one embodiment, the audio-visual speech separation model is trained by the cross-direction multiplier method using the cross-modal loss, based on a mixed-precision quantization condition determined from a first search space sensitivity, a second search space sensitivity, and a training performance sensitivity.
Wherein the mixed precision quantization condition includes: the first search space sensitivity of each layer of parameters in the audio-visual speech separation model is searched based on the Hessian trace.
In this embodiment, the best precision combination can be searched based on the Hessian trace. The sensitivity of a network layer to quantization is determined by multiplying the trace of that layer's parameter Hessian matrix by the squared quantization error, and the impact of quantizing each layer of the audio-visual speech separation model is assumed to be independent. The best combination can therefore be found by searching for the combination in which the sum of the sensitivity scores of all layers is smallest. The second-order gradients of the parameters are computed, and the final sensitivity score is:

Ω_i = (Tr(H_i)/n_i) · ‖W_i − α_i·Q_i‖²

where H_i is the Hessian matrix of W_i and Tr(H_i)/n_i is the Hessian trace averaged over the n_i parameters of W_i. The Hessian trace can be calculated using an open-source toolkit. To reduce the computation cost, layers with a larger average Hessian trace are kept at higher precision in the audio-visual speech separation model, which further limits the search space; experiments show that better training results can be obtained for the lightweight speech separation model on this basis.
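A minimal sketch of this sensitivity-based selection is shown below. quantize_layer is assumed from the earlier sketch, the average Hessian traces are assumed to come from an open-source toolkit (e.g. PyHessian), and the exhaustive search is only practical for a small number of network parts; as described above, the patent further restricts the space by keeping high-trace layers at high precision.

import itertools
import torch

def sensitivity(weight, avg_trace, bits):
    alpha, q = quantize_layer(weight, bits)                        # b-bit quantization of this layer
    return avg_trace * (weight - alpha * q).pow(2).sum().item()    # average trace x squared error

def search_precision(weights, avg_traces, candidate_bits=(3, 4, 6, 8), budget_bits=None):
    # Exhaustively try bit-width combinations and keep the one with the smallest summed score.
    n_params = [w.numel() for w in weights]
    best, best_score = None, float("inf")
    for combo in itertools.product(candidate_bits, repeat=len(weights)):
        size = sum(b * n for b, n in zip(combo, n_params))          # model size in bits
        if budget_bits is not None and size > budget_bits:
            continue
        score = sum(sensitivity(w, t, b) for w, t, b in zip(weights, avg_traces, combo))
        if score < best_score:
            best, best_score = combo, score
    return best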
As another embodiment, the mixing precision quantization condition includes: and searching a second search space sensitivity of each layer of parameters in the audio-visual voice separation model based on the relative entropy.
In this embodiment, the relative entropy, also referred to as the KL (Kullback-Leibler) divergence, measures the divergence between the outputs of the quantized model and of the full-precision model. When the neural network of the audio-visual speech separation model has many layers, the search space becomes too large and the time consumption unacceptable; to further reduce the search space of the model, the following greedy-search-based algorithm is used (a code sketch follows the algorithm description):

procedure GREEDYSEARCH
    for i in {1, …, L} do
        for each candidate bit width b of layer i, evaluate the KL divergence between g_W(X) and the output of the network with layer i quantized to b bits, plus φ times the resulting model size
        keep for layer i the bit width b_i that minimizes this objective
    end for
end procedure

where X is calibration data determined using the reference speaker audio-visual features of the reference spectrogram, g_W(X) denotes the mask computation of the network with parameters W, the quantized network uses the quantization parameters (α_i, Q_i) of the i-th layer at b_i bits, and the adjustable parameter φ penalizes the model size in order to constrain the search process.
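A minimal sketch of this greedy search is given below. quantized_copy (which returns a copy of the model with a given per-part bit assignment applied, e.g. via quantize_layer), model_size_mb and forward_mask are assumed helper functions, and treating the flattened masks as distributions for the KL term is a simplification; the exact objective in the patent figures is not reproduced.

import torch

def kl_divergence(mask_fp, mask_q, eps=1e-8):
    # Treat the flattened, normalized masks as distributions -- a simplifying assumption.
    p = mask_fp.flatten().clamp_min(eps)
    p = p / p.sum()
    q = mask_q.flatten().clamp_min(eps)
    q = q / q.sum()
    return torch.sum(p * (p.log() - q.log())).item()

def greedy_search(model, forward_mask, quantized_copy, model_size_mb, calib_batch,
                  part_names, candidate_bits=(8, 6, 4, 3, 2), phi=5e2):
    mask_fp = forward_mask(model, calib_batch)            # full-precision reference mask on X
    chosen = {}
    for name in part_names:                               # one part at a time (lip, face, U-Net)
        best_bits, best_obj = None, float("inf")
        for b in candidate_bits:
            assignment = dict(chosen, **{name: b})        # keep earlier greedy choices fixed
            m = quantized_copy(model, assignment)
            obj = kl_divergence(mask_fp, forward_mask(m, calib_batch)) \
                  + phi * model_size_mb(m)                # penalize larger models with phi
            if obj < best_obj:
                best_bits, best_obj = b, obj
        chosen[name] = best_bits
    return chosen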
As another embodiment, the mixing precision quantization condition includes: the training performance sensitivity of each layer of parameters in the audio-visual voice separation model is selected based on a priori knowledge.
In this embodiment, partially quantized models are relied upon to ensure the performance of the audio-visual speech separation model. More precisely, the entire network is divided into several parts using a priori knowledge; each part of the audio-visual speech separation model's neural network is quantized to lower precision in turn, and experiments are carried out. From the experimental results on the validation set, the sensitivity of each part to quantization can be determined, and the precision of each part is then selected manually based on these results. On this basis, the performance of the trained lightweight audio-visual speech separation model is ensured.
According to the embodiment, the model is quantitatively optimized based on the cross direction multiplier method to train a lightweight audio-visual voice separation model, the model can be applied to low-resource equipment with weak computing capacity, the characteristics of different modes with different quantization sensitivities can be fully utilized through the multi-mode model, the balance of the computing amount and the performance of the lightweight audio-visual voice separation model is ensured, and the model can be applied to the low-resource equipment to obtain better voice separation performance.
Experiments were conducted with this approach: a VisualVoice model of comparable performance was reproduced using the ESPnet-SE (end-to-end speech enhancement and separation) toolkit. The mixtures of multiple speakers' voices in the training set were pre-generated, and the corresponding validation set was prepared in advance.
For the fixed-precision quantization experiments, 1000 samples were randomly selected from the original training set to form a small set for fine-tuning. For the mixed-precision combination search, only 4 samples from the validation set were used. The model is always fine-tuned for 30 epochs, and the model with the best validation performance among the resulting models is selected as the final result. In the ADMM-based QAT (quantization-aware training), the learning rates η1 and η2 are set to 5.0×10^-6 and 5.0×10^-7, and ρ is set to 100. Since the sound attribute analysis network does not participate in the inference phase, it is not quantized during training. Using φ = 4×10^0, 6×10^1 and 5×10^2, models whose sizes equal those of the fixed 6-, 4- and 3-bit models are obtained in the KL-divergence-based search, respectively. For the greedy search, the search proceeds in the order lip network, face network and finally U-Net. In all experiments, the activations of all layers are quantized to 8 bits using a min-max strategy.
In addition to applying fixed-precision QAT to the entire VisualVoice network, the network is also divided into different parts that are quantized separately, as shown in fig. 2. LipNet and FaceNet denote the lip motion analysis network and the facial attribute analysis network, respectively, while the U-Net is split according to two different schemes. One is a side-to-side scheme, where the left side contains the first half of the U-Net and the right side contains the rest. The other is an inside-outside scheme, where the inside contains the 8 smaller layers in the middle of the U-Net and the outside contains the 8 larger layers at the U-Net input and output. Both schemes are designed based on a priori knowledge of the U-Net structure. The results for the different parts of the neural network under fixed quantization are shown in fig. 3, where "Q-Part" denotes the quantized part.
From the results, it is clear that FaceNet is the least sensitive part of the overall network, since quantizing it to 3 bits hardly affects performance. LipNet shows the same tolerance at 4 bits; however, when quantized to 3 bits, it exhibits serious degradation. As for the U-Net, the left and right parts do not show significant differences until quantized to 2 bits. Nevertheless, when partitioned according to the inside-outside scheme, the inner part is more robust to quantization. This is considered a common property of U-Net structures: the U-Net relies on skip connections between symmetric outer layers to propagate low-level information, whereas the inner layers carry highly compressed information and do not need high-precision computation. Another interesting observation is that, for some parts, the 2-bit quantization results are better than some higher-precision quantization results. This is probably because, in extreme quantization, higher precision does not necessarily mean a smaller quantization error; since uniform quantization is used in the system, this phenomenon suggests that simply keeping the signs (and zeros) of the weights may be better than quantizing them to some discrete set.
To obtain better performance, 3 mixed-precision fine-tuning strategies are further applied to the quantization systems trained above. The results are shown in fig. 4, which uses 3 metrics to evaluate the final performance: SDR, compression ratio, and bit operations (BOPs); the "eq. bits" column gives the equivalent fixed bit-width setting of the current row, i.e. the setting of approximately the same size. For the manual strategy, as high a precision as possible is chosen for each part while keeping the model size equal to that of the fixed-precision model; the selection is based on the validation SDR of the systems listed in Table 1, whose trend is consistent with the results shown here. Precisely, for the 6-bit-equivalent manual selection, {6, 4, 6, 8} is chosen for LipNet, FaceNet, UNet Inner and UNet Outer, respectively; for the 4-bit selection it is {4, 3, 3, 8}, and for the 3-bit selection it is {3, 2, 3, 6}. It is evident from the figure that the three strategies give comparable results in the 6-bit-equivalent setting, and the mixed-precision settings are superior to the 8-bit fixed-precision setting on all three metrics, demonstrating the effectiveness of mixed-precision quantization. In the 4-bit-equivalent setting, the manual strategy is slightly better in SDR, by about 1 dB, than the Hessian-trace-based selection and the KL-divergence-based greedy search strategy, and is also slightly better in compression ratio; this shows that the a priori knowledge used to partition the network corresponds well to the characteristics of each layer. However, the first two strategies fail to maintain an acceptable SDR in the 3-bit-equivalent setting, whereas the proposed KL-divergence-based greedy search still obtains a better result of 7.2 dB SDR while maintaining a competitive model size and BOPs. This is possible because the KL divergence is computed directly on the output masks, so the search focuses on the final accuracy rather than on minimizing quantization errors.
For a detailed analysis of the quantized network, fig. 5 shows the per-layer precision combination in the KL-divergence-based 3-bit-equivalent system. The trend shown in the figure is highly consistent with the observations in fig. 3. UNet Inner follows FaceNet and is likewise assigned low precision. It can be observed that LipNet exhibits relatively severe degradation, exceeding 4 dB, when quantized to 3 bits, while the other two parts are affected by less than 2 dB; this explains why some layers in LipNet require high precision while other layers can remain low. Finally, UNet Outer is the most sensitive part and is assigned the highest precision. Based on the above observations, designing precision combinations through automatic search together with a priori knowledge may be a promising strategy in more specific practical applications: a near-optimal combination can first be found using some straightforward automatic method (e.g. the KL-divergence-based search), and a more suitable combination can then be defined manually based on the search results, with a priori knowledge fused in. By applying the above procedure, one may hope to squeeze the last bit of savings out of the quantized network while maintaining performance as high as possible.
In general, the present approach aims to reduce the size and computational cost of the VisualVoice audio-visual speech separation system. Its quantized versions are trained using the ADMM-based quantization-aware training method. To further optimize the trade-off between space, speed and size, three strategies are employed to generate mixed-precision quantization networks. Experimental results show that the manual selection strategy provides the best prediction accuracy under relatively relaxed constraints, comparable to or even better than the higher-precision results, while the KL-divergence-based greedy search strategy shows excellent performance under extreme 3-bit quantization, being about 8 dB higher than the other two strategies and about 13 dB higher than the fixed-precision quantization result.
Fig. 6 is a schematic structural diagram of a training system for an audio-visual speech separation model according to an embodiment of the present invention, where the training system may execute the audio-visual speech separation model training method according to any of the above embodiments and is configured in a terminal.
The training system 10 for an audio-visual speech separation model provided in this embodiment includes: a predictive program module 11, an audiovisual feature determination program module 12 and a lightweight training program module 13.
The prediction program module 11 is configured to input mixed training audio of a plurality of speakers to an audio-visual speech separation model, and obtain prediction spectrograms of the plurality of speakers; the audiovisual feature determining program module 12 is configured to determine a predicted speaker audiovisual feature of the predicted spectrogram and a reference speaker audiovisual feature of a reference spectrogram of the mixed training audio; the lightweight training program module 13 is configured to perform training on the audio-visual speech separation model under a mixed precision quantization condition by using the cross-modal loss through a cross-direction multiplier method based on the cross-modal loss determined by the predicted speaker audio-visual feature and the reference speaker audio-visual feature, so as to obtain a lightweight audio-visual speech separation model.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the training method of the audio-visual voice separation model in any method embodiment;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
inputting mixed training audio of a plurality of speakers to an audio-visual voice separation model to obtain a prediction spectrogram of the plurality of speakers;
determining predicted speaker audiovisual features of the predicted spectrogram and reference speaker audiovisual features of the reference spectrogram of the mixed training audio;
based on the predicted speaker audio-visual characteristics and the cross-modal loss determined by the reference speaker audio-visual characteristics, training the audio-visual voice separation model under the mixed precision quantization condition by utilizing the cross-modal loss through a cross direction multiplier method to obtain a lightweight audio-visual voice separation model.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the training method of the audio-visual speech separation model in any of the method embodiments described above.
Fig. 7 is a schematic diagram of the hardware structure of an electronic device for the training method of an audio-visual speech separation model according to another embodiment of the present application. As shown in fig. 7, the device includes:
one or more processors 710, and a memory 720, one processor 710 being illustrated in fig. 7. The apparatus of the training method of the audio-visual speech separation model may further include: an input device 730 and an output device 740.
Processor 710, memory 720, input device 730, and output device 740 may be connected by a bus or other means, for example in fig. 7.
The memory 720 is used as a non-volatile computer readable storage medium, and can be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the training method of the audio-visual speech separation model in the embodiments of the present application. The processor 710 executes various functional applications of the server and data processing, i.e., implements the training method of the audiovisual speech separation model of the method embodiment described above, by running non-volatile software programs, instructions and modules stored in the memory 720.
Memory 720 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data, etc. In addition, memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 720 may optionally include memory located remotely from processor 710, which may be connected to the mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may receive input numerical or character information. The output device 740 may include a display device such as a display screen.
The one or more modules are stored in the memory 720 that, when executed by the one or more processors 710, perform the training method of the audiovisual speech separation model in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides electronic equipment, which comprises: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method of the audio-visual speech separation model of any one of the embodiments of the invention.
The electronic device of the embodiments of the present application exist in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc., such as tablet computers.
(3) Portable entertainment devices such devices can display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Other electronic devices with data processing functions.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of training an audio-visual speech separation model, comprising:
inputting mixed training audio of a plurality of speakers to an audio-visual voice separation model to obtain a prediction spectrogram of the plurality of speakers;
determining predicted speaker audiovisual features of the predicted spectrogram and reference speaker audiovisual features of the reference spectrogram of the mixed training audio;
based on the predicted speaker audio-visual characteristics and the cross-modal loss determined by the reference speaker audio-visual characteristics, training the audio-visual voice separation model under the mixed precision quantization condition by utilizing the cross-modal loss through a cross direction multiplier method to obtain a lightweight audio-visual voice separation model.
2. The method of claim 1, wherein the training of the audio-visual speech separation model under the mixed-precision quantization condition with the cross-modal loss by the cross-direction multiplier method uses the function:
L_ρ(W, G, λ) = f(W) + (ρ/2)·‖W − G − λ‖² − (ρ/2)·‖λ‖²
wherein W denotes the layer-1 to layer-L parameters of the audio-visual speech separation model, G = {α_i·Q_i}_{i=1}^{L}, α_i is the scale factor and Q_i the quantized integer matrix of the i-th layer of the audio-visual speech separation model, λ is the Lagrangian multiplier, and ρ is a hyper-parameter of the audio-visual speech separation model.
3. The method of claim 2, wherein the mixed precision quantization condition comprises: the first search space sensitivity of each layer of parameters in the audio-visual speech separation model is searched based on the Hessian trace.
4. The method of claim 2, wherein the mixed precision quantization condition comprises: and searching a second search space sensitivity of each layer of parameters in the audio-visual voice separation model based on the relative entropy.
5. The method of claim 2, wherein the mixed precision quantization condition comprises: the training performance sensitivity of each layer of parameters in the audio-visual voice separation model is selected based on a priori knowledge.
6. The method of any of claims 3-5, wherein the training of the audio-visual speech separation model under the mixed-precision quantization condition with the cross-modal loss by the cross-direction multiplier method comprises:
and training the audio-visual voice separation model by utilizing the cross-modal loss through a cross-direction multiplier method based on the mixed precision quantization condition determined by the first search space sensitivity, the second search space sensitivity and the training performance sensitivity.
7. A training system for an audio-visual speech separation model, comprising:
the prediction program module is used for inputting the mixed training audio of a plurality of speakers to the audio-visual voice separation model to obtain the prediction spectrograms of the plurality of speakers;
the audio-visual characteristic determining program module is used for determining the predicted speaker audio-visual characteristics of the predicted spectrogram and the reference speaker audio-visual characteristics of the reference spectrogram of the mixed training audio;
and the lightweight training program module is used for training the audio-visual voice separation model under the mixed precision quantization condition by utilizing the cross-modal loss through a cross direction multiplier method based on the cross-modal loss determined by the predicted speaker audio-visual characteristics and the reference speaker audio-visual characteristics to obtain a lightweight audio-visual voice separation model.
8. The system of claim 7, wherein the training of the audio-visual speech separation model under the mixed-precision quantization condition with the cross-modal loss by the cross-direction multiplier method uses the function:
L_ρ(W, G, λ) = f(W) + (ρ/2)·‖W − G − λ‖² − (ρ/2)·‖λ‖²
wherein W denotes the layer-1 to layer-L parameters of the audio-visual speech separation model, G = {α_i·Q_i}_{i=1}^{L}, α_i is the scale factor and Q_i the quantized integer matrix of the i-th layer of the audio-visual speech separation model, λ is the Lagrangian multiplier, and ρ is a hyper-parameter of the audio-visual speech separation model.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-6.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1-6.
CN202211573033.6A 2022-12-08 2022-12-08 Training method for audio-visual voice separation model, electronic device and storage medium Pending CN116312607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211573033.6A CN116312607A (en) 2022-12-08 2022-12-08 Training method for audio-visual voice separation model, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211573033.6A CN116312607A (en) 2022-12-08 2022-12-08 Training method for audio-visual voice separation model, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN116312607A true CN116312607A (en) 2023-06-23

Family

ID=86776788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211573033.6A Pending CN116312607A (en) 2022-12-08 2022-12-08 Training method for audio-visual voice separation model, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN116312607A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118016093A (en) * 2024-02-26 2024-05-10 山东大学 Target voice separation method and system based on cross-modal loss

Similar Documents

Publication Publication Date Title
CN109977212B (en) Reply content generation method of conversation robot and terminal equipment
Chen et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110110337B (en) Translation model training method, medium, device and computing equipment
US11195093B2 (en) Apparatus and method for student-teacher transfer learning network using knowledge bridge
CN110598224B (en) Training method of translation model, text processing method, device and storage medium
CN108960407B (en) Recurrent neural network language model training method, device, equipment and medium
Alvarez et al. On the efficient representation and execution of deep acoustic models
US20100138010A1 (en) Automatic gathering strategy for unsupervised source separation algorithms
CN112733964A (en) Convolutional neural network quantification method for reinforcement learning automatic perception weight distribution
CN111814448B (en) Pre-training language model quantization method and device
CN112861521B (en) Speech recognition result error correction method, electronic device and storage medium
KR20210141115A (en) Method and apparatus for estimating utterance time
CN116312607A (en) Training method for audio-visual voice separation model, electronic device and storage medium
Cord-Landwehr et al. Frame-wise and overlap-robust speaker embeddings for meeting diarization
Kim et al. WaveNODE: A continuous normalizing flow for speech synthesis
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
Wu et al. Light-weight visualvoice: Neural network quantization on audio visual speech separation
CN114861907A (en) Data calculation method, device, storage medium and equipment
US20240096332A1 (en) Audio signal processing method, audio signal processing apparatus, computer device and storage medium
Liu et al. Multi-head monotonic chunkwise attention for online speech recognition
CN116644797A (en) Neural network model quantization compression method, electronic device and storage medium
CN117370890A (en) Knowledge question-answering method, system, device and storage medium
Peter et al. Resource-efficient dnns for keyword spotting using neural architecture search and quantization
CN113160801B (en) Speech recognition method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination