CN116312607A - Training method for audio-visual voice separation model, electronic device and storage medium - Google Patents
- Publication number: CN116312607A
- Application number: CN202211573033.6A
- Authority: CN
- Country: China
- Prior art keywords: audio, visual, separation model, training, cross
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/0272 — Speech enhancement; voice signal separating
- G10L21/0308 — Voice signal separating characterised by the type of parameter measurement
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/82 — Image or video recognition using pattern recognition or machine learning; neural networks
- G06V20/40 — Scenes; scene-specific elements in video content
- G06V40/171 — Human faces; local features and components; facial parts
- G10L25/30 — Speech or voice analysis characterised by the analysis technique; using neural networks
- G10L25/57 — Speech or voice analysis specially adapted for processing of video signals
- Y02T10/40 — Engine management systems (automatically assigned cross-sectional tag)
Abstract
The embodiment of the invention provides a training method for an audio-visual speech separation model, an electronic device, and a storage medium. The method comprises the following steps: inputting the mixed training audio of a plurality of speakers into an audio-visual speech separation model to obtain predicted spectrograms of the plurality of speakers; determining the predicted speaker audio-visual features of the predicted spectrograms and the reference speaker audio-visual features of the reference spectrogram of the mixed training audio; and, based on the cross-modal loss determined from the predicted and reference speaker audio-visual features, training the audio-visual speech separation model under a mixed-precision quantization condition with the cross-modal loss via the alternating direction method of multipliers (ADMM), obtaining a lightweight audio-visual speech separation model. According to the embodiment of the invention, the lightweight audio-visual speech separation model is trained by quantization tuning of the model based on ADMM; by fully exploiting the different quantization sensitivities of the different modalities in the multi-modal model, the balance between computation and performance of the lightweight audio-visual speech separation model can be ensured.
Description
Technical Field
The present invention relates to the field of intelligent speech, and in particular, to a training method for an audio-visual speech separation model, an electronic device, and a storage medium.
Background
With the development of speech technology, audio-visual speech separation systems that exploit multiple modalities exhibit separation performance superior to that of audio-only speech processing systems. However, multi-modal audio-visual speech separation systems require a large amount of computational resources. On the one hand, audio-visual speech separation systems need a large number of parameters to model the modalities and their dependencies, which increases memory consumption. On the other hand, fusing information from the two modalities requires computing larger feature maps and performing more floating-point operations.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the related art:
Because multi-modal audio-visual speech separation systems require a large amount of computational resources, they cannot be used on low-resource devices. Typical approaches for applying a multi-modal audio-visual speech separation system to a low-resource device are: 1. The straight-through estimator (STE), which uses a quantizer in the forward computation of the model but ignores it during gradient back-propagation, so the optimization objective is not exactly the desired one. 2. Differentiable quantizers, which approximate the quantization step function with a smoother differentiable function so that gradients can propagate normally, but which in training easily fail to converge to a good solution.
Disclosure of Invention
In order to at least solve the problem in the prior art that a multi-modal audio-visual speech separation system is difficult to apply to low-resource devices, in a first aspect an embodiment of the present invention provides a training method for an audio-visual speech separation model, including:
inputting mixed training audio of a plurality of speakers to an audio-visual voice separation model to obtain a prediction spectrogram of the plurality of speakers;
determining predicted speaker audiovisual features of the predicted spectrogram and reference speaker audiovisual features of the reference spectrogram of the mixed training audio;
based on the cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, training the audio-visual speech separation model under a mixed-precision quantization condition with the cross-modal loss via the alternating direction method of multipliers (ADMM), to obtain a lightweight audio-visual speech separation model.
In a second aspect, an embodiment of the present invention provides a training system for an audio-visual speech separation model, including:
the prediction program module is used for inputting the mixed training audio of a plurality of speakers to the audio-visual voice separation model to obtain the prediction spectrograms of the plurality of speakers;
the audio-visual characteristic determining program module is used for determining the predicted speaker audio-visual characteristics of the predicted spectrogram and the reference speaker audio-visual characteristics of the reference spectrogram of the mixed training audio;
and a lightweight training program module, used for training the audio-visual speech separation model under a mixed-precision quantization condition with the cross-modal loss via ADMM, based on the cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, to obtain a lightweight audio-visual speech separation model.
In a third aspect, there is provided an electronic device, comprising: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method of the audio-visual speech separation model of any one of the embodiments of the invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the training method of the audio-visual speech separation model of any of the embodiments of the present invention.
The embodiment of the invention has the following beneficial effects: the model is quantized and tuned based on ADMM to train a lightweight audio-visual speech separation model that can be deployed on low-resource devices with weak computing power; by fully exploiting the different quantization sensitivities of the different modalities in the multi-modal model, the balance between computation and performance of the lightweight audio-visual speech separation model is ensured, so that better speech separation performance can be obtained on low-resource devices.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings may be obtained according to these drawings without the need for inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a training method for an audio-visual speech separation model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of the VisualVoice network structure in a training method of an audio-visual speech separation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of fixed-quantization models and the fine-tuning results of their different parts, for a training method of an audio-visual speech separation model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the mixed-precision fine-tuning results of a training method of an audio-visual speech separation model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the layer precision combinations based on KL divergence of a training method of an audio-visual speech separation model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a training system for audio-visual speech separation model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device for training an audio-visual speech separation model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a training method of an audio-visual speech separation model according to an embodiment of the present invention, including the following steps:
s11: inputting mixed training audio of a plurality of speakers to an audio-visual voice separation model to obtain a prediction spectrogram of the plurality of speakers;
s12: determining predicted speaker audiovisual features of the predicted spectrogram and reference speaker audiovisual features of the reference spectrogram of the mixed training audio;
s13: based on the cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, training the audio-visual speech separation model under a mixed-precision quantization condition with the cross-modal loss via ADMM, to obtain a lightweight audio-visual speech separation model.
In this embodiment, the audio-visual speech separation model used by the method is the VisualVoice model, a multi-task learning framework. In the audio-visual speech separation process, it receives the mixed speech of K speakers together with the lip videos of the K speakers while speaking, and K facial images are also provided as additional inputs for computing speaker identity embeddings and the cross-modal loss. For the lip stream, a video containing the lip region is processed by a lip-reading network, which consists of a 3D convolutional network and a ShuffleNet V2; a temporal convolutional network finally extracts the lip feature sequence. For the face stream, a ResNet-18 extracts the speaker identity embedding from the facial input. For the audio stream, a U-Net convolutional network processes the complex spectrogram of the input audio mixture. The hidden audio features in the middle of the U-Net are concatenated with the lip feature sequence and the face features along the channel dimension to obtain fused audio-visual features of lips, face, and voice, and these are applied to the input spectrogram to obtain the predicted spectrograms of the K speakers.
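The channel-dimension fusion described above can be sketched as follows. This is a minimal illustration only: the feature shapes and the `fuse_audio_visual` helper are assumptions for this sketch, not the patent's actual implementation.

```python
import numpy as np

def fuse_audio_visual(audio_feat, lip_feat, face_emb):
    """Concatenate audio, lip, and face features along the channel axis.

    audio_feat: (C_a, T) hidden audio features from the middle of the U-Net
    lip_feat:   (C_l, T) lip feature sequence, aligned to the T audio frames
    face_emb:   (C_f,)   speaker identity embedding, broadcast over time
    """
    T = audio_feat.shape[1]
    face_tiled = np.repeat(face_emb[:, None], T, axis=1)  # (C_f, T)
    return np.concatenate([audio_feat, lip_feat, face_tiled], axis=0)

# illustrative shapes: 256 audio channels, 64 lip channels, 128-dim face embedding
fused = fuse_audio_visual(np.zeros((256, 50)), np.zeros((64, 50)), np.zeros(128))
```

The fused (C_a + C_l + C_f, T) feature map is what a mask-prediction head would then consume.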
For step S11, at the time of training, mixed training audio of a plurality of speakers and a reference spectrogram of the mixed training audio are prepared. For training of the audio-visual speech separation model, mixed training audio of a plurality of speakers is input to the audio-visual speech separation model, and prediction spectrograms of the plurality of speakers are predicted by using the audio-visual speech separation model.
For step S12, a voice attribute analysis network module is introduced to determine the reference speaker audio-visual features and the predicted speaker audio-visual features from the reference spectrogram and the predicted spectrogram respectively; these features include speaker identity embeddings, so that correlations between audio and video embeddings can be learned.
For step S13: although VisualVoice achieves leading performance on many audio-visual datasets, including VoxCeleb2 and LRS2, its computational cost and model size are generally unacceptable for small smart devices with limited compute. The method therefore uses quantization to reduce the computation and storage cost of VisualVoice and obtain a lightweight version. The target scenario is a small smart device, such as a voice recorder, whose software must run in a constrained environment: under hardware limitations, and possibly without network connectivity, such a low-compute device cannot execute a complex neural network for speech processing. To give such devices speech processing capabilities, they are therefore typically equipped with lightweight models.
The neural network quantization of the audio-visual speech separation model is defined as follows: given a neural network N with parameters W = {W_1, W_2, ..., W_L}, where W_i denotes the parameters of the i-th layer, the goal of training is to find, for each W_i, a scale factor alpha_i and a quantized integer matrix Q_i such that W_i ≈ alpha_i * Q_i, with the entries of Q_i drawn from the signed integer range {-2^(b_i - 1), ..., 2^(b_i - 1) - 1}, where b_i denotes the bit precision of the i-th layer and belongs to an effective set of precisions.
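A minimal sketch of the per-layer factorization W_i ≈ alpha_i * Q_i with a uniform symmetric quantizer. Choosing the scale from the maximum absolute weight is an illustrative assumption, not necessarily the scheme claimed here.

```python
import numpy as np

def quantize_layer(W, bits):
    """Factor W into a scale factor alpha and an integer matrix Q, W ≈ alpha * Q.

    Uniform symmetric scheme: alpha is chosen so the largest weight maps to
    the top of the signed integer range for the given bit width.
    """
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4 bits
    alpha = np.abs(W).max() / qmax
    Q = np.clip(np.round(W / alpha), -qmax - 1, qmax).astype(np.int32)
    return alpha, Q

W = np.array([0.5, -1.0, 0.25, 0.75])
alpha, Q = quantize_layer(W, bits=4)
W_hat = alpha * Q                               # dequantized approximation of W
```

Only the integers Q and one scale per layer need to be stored, which is where the memory saving comes from.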
As an embodiment, training the audio-visual speech separation model with the cross-modal loss via the alternating direction method of multipliers (ADMM) minimizes the following augmented Lagrangian:

L_rho(W, G, lambda) = f(W) + (rho/2) * ||W - G - lambda||^2 - (rho/2) * ||lambda||^2

where W = {W_1, ..., W_L} are the parameters of layers 1 to L of the audio-visual speech separation model, G = {alpha_i * Q_i} for i = 1, ..., L with alpha_i the scale factor and Q_i the quantized integer matrix of the i-th layer, lambda is the Lagrange multiplier, and rho is a hyper-parameter of the audio-visual speech separation model. In this embodiment, the method for the first time applies an ADMM-based optimization algorithm to quantize an audio-visual multi-modal speech separation system. Neural network quantization is treated as a non-convex optimization problem with discrete constraints, which allows the training process to extend naturally from fixed-precision fine-tuning to mixed-precision quantization.
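The alternating updates implied by this augmented Lagrangian can be sketched on a single weight vector as below. The projection rule, step sizes, and the `grad_f` stand-in for the task-loss gradient are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

def project_to_quantized(v, bits):
    """G-step: project v onto {alpha * Q : Q integer} via max-abs scale and rounding."""
    qmax = 2 ** (bits - 1) - 1
    m = np.abs(v).max()
    alpha = m / qmax if m > 0 else 1.0
    return alpha * np.clip(np.round(v / alpha), -qmax - 1, qmax)

def admm_quantize(W, grad_f, bits=4, rho=100.0, lr=1e-3, steps=50):
    """Sketch of ADMM quantization training on one weight vector.

    Alternates a gradient W-step on the augmented Lagrangian, a projection
    G-step onto the quantized set, and a dual update of the multiplier.
    grad_f(W) is the gradient of the task loss f.
    """
    G = project_to_quantized(W, bits)
    lam = np.zeros_like(W)
    for _ in range(steps):
        # W-step: descend f(W) + rho/2 * ||W - G - lam||^2
        W = W - lr * (grad_f(W) + rho * (W - G - lam))
        # G-step: closed-form projection of W - lam
        G = project_to_quantized(W - lam, bits)
        # dual update (consistent with the -lambda convention above)
        lam = lam + G - W
    return W, G

# toy task: f(W) = ||W - target||^2 / 2, whose minimizer happens to be quantizable
target = np.array([0.3, -0.7])
W, G = admm_quantize(target.copy(), lambda w: w - target)
```

At convergence the gap W - G vanishes, so the trained weights coincide with their quantized counterparts while f is still being descended.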
The specific algorithm (an extra-gradient method) is as follows:

procedure ADMM-QUANTIZATION
  while {alpha_i} and {Q_i} have not converged do
    update W by a gradient step on L_rho(W, G, lambda)              // W-step
    update {alpha_i} and {Q_i} by projecting W - lambda onto the quantized set   // G-step
    update the Lagrange multiplier: lambda <- lambda + G - W        // dual step
  end while
  the final {alpha_i} and {Q_i} are denoted {alpha_i*} and {Q_i*}
end procedure
Through the optimization and quantization of this algorithm, the memory footprint and computational cost of the neural network of the audio-visual speech separation model are greatly reduced.
Although quantization tuning reduces the memory footprint and computational cost of the neural network of the audio-visual speech separation model, extreme quantization also degrades its performance. The application finds that different layers of the model have different sensitivities to quantization error, so a mixed-precision quantization strategy can effectively improve the performance of the audio-visual speech separation model while keeping the model lightweight.
In one embodiment, the audio-visual speech separation model is trained with the cross-modal loss via ADMM under a mixed-precision quantization condition determined from a first search-space sensitivity, a second search-space sensitivity, and a training performance sensitivity.
Wherein the mixed precision quantization condition includes: the first search space sensitivity of each layer of parameters in the audio-visual speech separation model is searched based on the Hessian trace.
In this embodiment, the best precision combination can be searched based on the Hessian trace. The sensitivity of a network layer to quantization is measured by multiplying the trace of that layer's parameter Hessian matrix by the squared quantization error. The impact of quantizing each layer of the audio-visual speech separation model is assumed to be independent, so the best combination can be found by searching for the combination that minimizes the sum of the sensitivity scores over all layers. Computing the second-order information of the parameters, the sensitivity score of the i-th layer is:

Omega_i = (Tr(H_i) / n_i) * ||alpha_i * Q_i - W_i||^2

where H_i is the Hessian matrix of W_i, Tr(H_i)/n_i is the trace of H_i averaged over the number n_i of parameters in W_i, and alpha_i * Q_i is the quantized approximation of W_i. The Hessian trace can be computed with an open-source toolkit. To reduce the computational cost, layers with a higher average Hessian trace are kept at higher precision in the audio-visual speech separation model, which further restricts the search space; according to experiments, better training results can be obtained for the lightweight speech separation model on this basis.
As another embodiment, the mixing precision quantization condition includes: and searching a second search space sensitivity of each layer of parameters in the audio-visual voice separation model based on the relative entropy.
In this embodiment, the relative entropy, also known as the Kullback-Leibler (KL) divergence, is computed between the outputs of the quantized model and the full-precision model. When the neural network of the audio-visual speech separation model has many layers, the search space becomes too large and the time cost unacceptable. To further reduce the search space of the model, the following greedy-search-based algorithm is used:

procedure GREEDY-SEARCH
  for i in {1, ..., L} do
    b_i <- argmin_b  D_KL( g_W(X) || g_W~(i,b)(X) ) + phi * size(b)
  end for
end procedure

where X is calibration data determined using the reference speaker audio-visual features of the reference spectrogram, g_W(X) denotes the mask computation of the network with parameters W, W~(i,b) denotes the network parameters with the i-th layer quantized to b bits, and the adjustable parameter phi penalizes the model size in order to constrain the search process.
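A sketch of the greedy loop: each layer in turn picks the bit width minimizing the KL divergence of the output mask plus a phi-weighted size penalty. `run_masked` is a hypothetical hook that re-runs the model with the current bit assignment, and the per-bit size term is a simplification of the model-size penalty.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two non-negative vectors, normalized to distributions."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def greedy_bit_search(run_masked, full_mask, num_layers,
                      choices=(2, 3, 4, 6, 8), phi=1e-3):
    """Greedily assign a bit width to each layer in order.

    run_masked(bits) is a hypothetical hook returning the model's output mask
    with the layers quantized to the current per-layer assignment `bits`;
    full_mask is the full-precision output on the calibration data X.
    Per-layer objective: KL(full || quantized) + phi * bits  (size penalty).
    """
    bits = [max(choices)] * num_layers
    for i in range(num_layers):
        best_b, best_cost = bits[i], np.inf
        for b in choices:
            bits[i] = b
            cost = kl_divergence(full_mask, run_masked(bits)) + phi * b
            if cost < best_cost:
                best_b, best_cost = b, cost
        bits[i] = best_b
    return bits

# toy stand-in: the output drifts from the full-precision mask as bits are removed
full = np.array([0.5, 0.3, 0.2])
def run_masked(bits):
    drift = 0.01 * sum(8 - b for b in bits)
    return full + drift * np.array([1.0, -0.5, -0.5])

no_penalty = greedy_bit_search(run_masked, full, 3, phi=0.0)    # accuracy only
big_penalty = greedy_bit_search(run_masked, full, 3, phi=10.0)  # size dominates
```

Sweeping phi between these extremes trades output fidelity against model size, which is how the equivalent 6-, 4-, and 3-bit model sizes are targeted.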
As another embodiment, the mixed-precision quantization condition includes: selecting the training performance sensitivity of each layer's parameters in the audio-visual speech separation model based on prior knowledge.
In this embodiment, partial-quantization models are relied upon to assess the performance of the audio-visual speech separation model. More precisely, the entire network is divided into several parts according to prior knowledge. Each part of the neural network is quantized to lower precision separately, and an experiment is run. From the experimental results on the validation set, the sensitivity of each part to quantization can be determined, and the precision assignment is then selected manually based on these results. On this basis, the performance of the trained lightweight audio-visual speech separation model is ensured.
According to this embodiment, the model is quantized and tuned based on ADMM to train a lightweight audio-visual speech separation model that can be deployed on low-resource devices with weak computing power; by fully exploiting the different quantization sensitivities of the different modalities in the multi-modal model, the balance between computation and performance of the lightweight audio-visual speech separation model is ensured, so that better speech separation performance can be obtained on low-resource devices.
Experiments were conducted with this method, reproducing a VisualVoice model of similar performance using the ESPnet-SE (end-to-end speech enhancement and separation) toolkit. The mixed speech of multiple speakers in the training set was pre-generated, and the corresponding validation set was prepared in advance.
For the fixed-precision quantization experiments, 1000 samples were randomly selected from the original training set to form a small set for fine-tuning. For the mixed-precision combination search, only 4 samples from the validation set were used. The model is always fine-tuned for 30 epochs, and the model with the best validation performance among the resulting models is selected as the final result. In the ADMM-based QAT (quantization-aware training), the learning rates eta_1 and eta_2 are set to 5.0x10^-6 and 5.0x10^-7, and rho is set to 100. Since the sound attribute analysis network does not participate in the inference phase, it is not quantized during training. Using phi = 4x10^0, 6x10^1, and 5x10^2, the KL-divergence-based search yields models equal in size to the fixed 6-, 4-, and 3-bit models, respectively. The greedy search proceeds in the order of the lip network, the face network, and finally the U-Net. In all experiments, the activations of all layers are quantized to 8 bits using a min-max strategy.
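The min-max activation quantization mentioned last can be sketched as an asymmetric uniform quantizer over the observed activation range. This is a generic sketch of the strategy, not the toolkit's exact code.

```python
import numpy as np

def quantize_activation_minmax(x, bits=8):
    """Asymmetric uniform quantization over the observed [min, max] range,
    returning the dequantized activation (quantize-dequantize round trip)."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1                      # 255 for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale)              # integer codes in [0, levels]
    return q * scale + lo

x = np.array([0.0, 0.1, 0.5, 1.0])
x_hat = quantize_activation_minmax(x)           # error bounded by scale / 2
```

Unlike the symmetric weight quantizer, the min-max scheme spends no levels outside the observed range, which suits non-negative activations.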
In addition to applying fixed-precision QAT to the entire VisualVoice network, the network is also divided into different parts that are quantized separately, as shown in fig. 2. While LipNet and FaceNet denote the lip motion analysis network and the facial attribute analysis network respectively, the U-Net is divided according to two different schemes. One is a left-right scheme, in which the left part contains the first half of the U-Net and the right part contains the rest. The other is an inside-outside scheme, in which the inner part contains the 8 smaller layers in the middle of the U-Net and the outer part contains the 8 larger layers at the U-Net input and output. Both schemes are designed based on prior knowledge of the U-Net structure. The results for the different parts of the neural network under fixed quantization are shown in fig. 3, where "Q-Part" denotes the quantized part.
From the results, it is clear that FaceNet is the least sensitive part of the overall network, since quantizing it to 3 bits hardly affects performance. LipNet shows the same tolerance at 4 bits; however, when quantized to 3 bits it exhibits serious degradation. As for the U-Net, the left and right parts show no significant difference until quantized to 2 bits. Nevertheless, when partitioned according to the inside-outside scheme, the inner part is more robust to quantization. This appears to be a common phenomenon of U-Net structures: the U-Net relies on skip connections between symmetric outer layers to propagate low-level information, while the inner layers need no high-precision computation because their information stream is highly compressed. Another interesting observation is that for some parts the 2-bit quantization results are better than some higher-precision quantization results. This is probably because, in extreme quantization, higher precision does not necessarily mean smaller quantization error. Since uniform quantization is applied to the system, this phenomenon suggests that simply keeping the signs (and zeros) of the weights may be better than quantizing them to some discrete set.
To obtain better performance, 3 mixed-precision fine-tuning strategies are further applied to the quantized system trained above. The results are shown in fig. 4, which uses 3 metrics to evaluate the final performance: SDR, compression ratio and bit operations (BOPs). The "Eq. Bits" column gives the equivalent fixed bit-width setting of the current row, i.e., the settings are of approximately the same model size. For the manual strategy, we choose as high a precision as possible for each part while keeping the model size equal to that of the fixed-precision model; the selection is based on the achieved SDR of the systems listed in Table 1, and the trend is the same as the results in that table. Precisely, for the 6-bit-equivalent manual selection, {6, 4, 6, 8} is chosen for LipNet, FaceNet, UNet Inner and UNet Outer, respectively; for the 4-bit selection it is {4, 3, 3, 8}, and for the 3-bit selection it is {3, 2, 3, 6}. It is evident from the figure that the three strategies give comparable results at the 6-bit-equivalent setting. Meanwhile, they outperform the 8-bit fixed-precision setting on all three metrics, which demonstrates the effectiveness of mixed-precision quantization. At the 4-bit-equivalent setting, the manual strategy is slightly better in SDR, by about 1 dB, than the Hessian-trace-based selection and the KL-divergence-based greedy search strategy, and is also slightly better in compression ratio. This shows that the a priori knowledge used to partition the network corresponds well to the characteristics of each layer. However, the first two strategies fail to maintain an acceptable SDR at the 3-bit-equivalent setting, whereas the proposed KL-divergence-based greedy search achieves a better SDR of 7.2 dB while maintaining a competitive model size and BOP count.
This is likely because the KL divergence is computed directly from the output mask, and therefore tends to focus on optimizing the final accuracy rather than on minimizing the quantization error.
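The KL-divergence-based greedy search can be sketched as follows (a simplified illustration: the `sensitivity(part, bits)` callback is hypothetical, standing in for actually quantizing the given part and measuring the KL divergence of the resulting output mask against the full-precision mask; the bit-width choices and toy sensitivity numbers are illustrative, shaped like the trends in fig. 3):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL divergence between a full-precision and a quantized output mask,
    treated as discrete distributions after normalisation."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def greedy_bit_allocation(sensitivity, budget, parts, choices=(2, 3, 4, 6, 8)):
    """Greedy search: start every part at the lowest precision, then repeatedly
    raise the precision of the part whose upgrade most reduces the measured
    divergence per extra bit, until the total bit budget is spent."""
    alloc = {p: choices[0] for p in parts}
    while sum(alloc.values()) < budget:
        best, best_gain = None, 0.0
        for p in parts:
            i = choices.index(alloc[p])
            if i + 1 >= len(choices):
                continue  # already at the highest precision
            nxt = choices[i + 1]
            if sum(alloc.values()) - alloc[p] + nxt > budget:
                continue  # upgrade would exceed the bit budget
            gain = (sensitivity(p, alloc[p]) - sensitivity(p, nxt)) / (nxt - alloc[p])
            if gain > best_gain:
                best, best_gain = p, gain
        if best is None:
            break  # no affordable upgrade improves the divergence
        alloc[best] = choices[choices.index(alloc[best]) + 1]
    return alloc

# Toy per-part sensitivity table (hypothetical numbers: UNet Outer degrades
# most under quantization, FaceNet least).
table = {
    "FaceNet":   {2: 0.30, 3: 0.05, 4: 0.04, 6: 0.03, 8: 0.03},
    "LipNet":    {2: 0.90, 3: 0.60, 4: 0.10, 6: 0.08, 8: 0.07},
    "UNetInner": {2: 0.40, 3: 0.10, 4: 0.08, 6: 0.07, 8: 0.07},
    "UNetOuter": {2: 1.50, 3: 0.90, 4: 0.50, 6: 0.10, 8: 0.05},
}
alloc = greedy_bit_allocation(lambda p, b: table[p][b], budget=16, parts=list(table))
```

Under this toy table the search assigns the most bits to the most sensitive part (UNet Outer) while holding the total bit count within the budget.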
For a detailed analysis of the quantized network, fig. 5 shows the precision combination of the layers in the KL-divergence-based 3-bit-equivalent system. The trend shown in the figure is highly consistent with the observations in fig. 3. FaceNet is assigned low precision, and UNet Inner follows it and is likewise assigned low precision. It can be observed that LipNet exhibits relatively severe degradation, exceeding 4 dB, when quantized to 3 bits, while those two parts are affected by less than 2 dB; this explains why some layers in LipNet require high precision while others can remain low. Finally, UNet Outer is the most sensitive part and is assigned the highest precision. Based on the observations above, designing precision combinations through a mix of automatic search and a priori knowledge may be a promising strategy in practical applications: a near-optimal combination can first be found by some straightforward automatic method (e.g., the KL-divergence-based search), and a more suitable combination can then be defined manually based on the search results, into which the a priori knowledge is fused. By applying this procedure, one may hope to squeeze the last bit out of the quantized network while maintaining performance as high as possible.
In general, the present approach is directed to reducing the size and computational cost of the VisualVoice audio-visual speech separation system. A quantized version of the system is trained using the ADMM-based quantization-aware training method. To further optimize the trade-off between performance, speed and size, three strategies are employed to generate a mixed-precision quantized network. Experimental results show that the manual selection strategy provides the best prediction accuracy under relatively relaxed constraints, comparable to or even better than the higher-precision results, while the KL-divergence-based greedy search strategy shows excellent performance under extreme 3-bit quantization: about 8 dB higher than the other two strategies and about 13 dB higher than the fixed-precision quantization result.
Fig. 6 is a schematic structural diagram of a training system for an audio-visual speech separation model according to an embodiment of the present invention; the training system can execute the training method of the audio-visual speech separation model according to any of the above embodiments and is configured in a terminal.
The training system 10 for an audio-visual speech separation model provided in this embodiment includes: a prediction program module 11, an audio-visual feature determining program module 12 and a lightweight training program module 13.
The prediction program module 11 is configured to input mixed training audio of a plurality of speakers into the audio-visual speech separation model to obtain predicted spectrograms of the plurality of speakers; the audio-visual feature determining program module 12 is configured to determine predicted speaker audio-visual features of the predicted spectrograms and reference speaker audio-visual features of reference spectrograms of the mixed training audio; and the lightweight training program module 13 is configured to, based on the cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, train the audio-visual speech separation model under mixed-precision quantization conditions using the cross-modal loss via the alternating direction method of multipliers (ADMM), to obtain a lightweight audio-visual speech separation model.
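The ADMM-based quantization-aware training step can be sketched on a toy objective as follows (a minimal illustration, assuming an augmented Lagrangian of the form L_ρ(W, G, λ) = f(W) + (ρ/2)‖W − G − λ‖² − (ρ/2)‖λ‖² with G constrained to uniformly quantized tensors; the quadratic stand-in loss, the projection routine, the learning rate and ρ are all illustrative, not the patent's actual training configuration):

```python
import numpy as np

def project_quantized(v, bits):
    """Project v onto the set {alpha * Q} of uniformly quantized tensors,
    alternating between the integer assignment Q and the scale alpha."""
    levels = 2 ** (bits - 1) - 1
    alpha = np.abs(v).max() / levels
    q = np.zeros_like(v)
    for _ in range(3):
        q = np.clip(np.round(v / alpha), -levels, levels)
        if np.any(q):
            alpha = float(v @ q) / float(q @ q)  # least-squares scale for this q
    return alpha * q

def admm_qat(w_target, bits=3, rho=1.0, steps=200, lr=0.1):
    """ADMM-style quantization-aware training on the toy loss
    f(W) = ||W - w_target||^2, following the augmented Lagrangian above."""
    rng = np.random.default_rng(1)
    W = rng.normal(size=w_target.shape)
    G = project_quantized(W, bits)
    lam = np.zeros_like(W)
    for _ in range(steps):
        # W-update: gradient step on f(W) + (rho/2) ||W - G - lam||^2
        grad = 2.0 * (W - w_target) + rho * (W - G - lam)
        W = W - lr * grad
        # G-update: projection of W - lam onto the quantized set
        G = project_quantized(W - lam, bits)
        # dual update on the Lagrangian multiplier
        lam += G - W
    return W, G

rng = np.random.default_rng(0)
target = rng.normal(size=64)
W, G = admm_qat(target, bits=3)
```

Each iteration takes a gradient step on W, projects W − λ onto the quantized set to obtain G, and updates the dual variable λ; over training, W is pulled onto the quantized grid so that the quantized copy G can serve as the lightweight model.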
The embodiment of the invention also provides a non-volatile computer storage medium, wherein the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the training method of the audio-visual speech separation model in any of the above method embodiments.
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
inputting mixed training audio of a plurality of speakers into an audio-visual speech separation model to obtain predicted spectrograms of the plurality of speakers;
determining predicted speaker audio-visual features of the predicted spectrograms and reference speaker audio-visual features of reference spectrograms of the mixed training audio;
based on the cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, training the audio-visual speech separation model under mixed-precision quantization conditions using the cross-modal loss via the alternating direction method of multipliers (ADMM), to obtain a lightweight audio-visual speech separation model.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the training method of the audio-visual speech separation model in any of the method embodiments described above.
Fig. 7 is a schematic hardware structure diagram of an electronic device for performing the training method of the audio-visual speech separation model according to another embodiment of the present application; as shown in fig. 7, the device includes:
one or more processors 710 and a memory 720, with one processor 710 illustrated in fig. 7. The apparatus for the training method of the audio-visual speech separation model may further include: an input device 730 and an output device 740.
The memory 720 is used as a non-volatile computer readable storage medium, and can be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the training method of the audio-visual speech separation model in the embodiments of the present application. The processor 710 executes various functional applications of the server and data processing, i.e., implements the training method of the audiovisual speech separation model of the method embodiment described above, by running non-volatile software programs, instructions and modules stored in the memory 720.
The input device 730 may receive input numerical or character information. The output device 740 may include a display device such as a display screen.
The one or more modules are stored in the memory 720 that, when executed by the one or more processors 710, perform the training method of the audiovisual speech separation model in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides electronic equipment, which comprises: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method of the audio-visual speech separation model of any one of the embodiments of the invention.
The electronic device of the embodiments of the present application exists in a variety of forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc., such as tablet computers.
(3) Portable entertainment devices such devices can display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Other electronic devices with data processing functions.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, or the part of it contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method of training an audio-visual speech separation model, comprising:
inputting mixed training audio of a plurality of speakers into an audio-visual speech separation model to obtain predicted spectrograms of the plurality of speakers;
determining predicted speaker audio-visual features of the predicted spectrograms and reference speaker audio-visual features of reference spectrograms of the mixed training audio;
based on the cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, training the audio-visual speech separation model under mixed-precision quantization conditions using the cross-modal loss via the alternating direction method of multipliers (ADMM), to obtain a lightweight audio-visual speech separation model.
2. The method of claim 1, wherein the training of the audio-visual speech separation model under mixed-precision quantization conditions using the cross-modal loss via the alternating direction method of multipliers uses the function:
L_ρ(W, G, λ) = f(W) + (ρ/2)‖W − G − λ‖² − (ρ/2)‖λ‖²
wherein W denotes the parameters of layers 1 to L of the audio-visual speech separation model, G = {α_i Q_i}_{i=1}^{L}, α_i is a scale factor, Q_i is a quantized integer matrix, i denotes the i-th layer of the audio-visual speech separation model, λ is a Lagrangian multiplier, and ρ is a hyperparameter of the audio-visual speech separation model.
3. The method of claim 2, wherein the mixed-precision quantization condition comprises: a first search-space sensitivity of each layer of parameters in the audio-visual speech separation model, searched based on the Hessian trace.
4. The method of claim 2, wherein the mixed-precision quantization condition comprises: a second search-space sensitivity of each layer of parameters in the audio-visual speech separation model, searched based on relative entropy.
5. The method of claim 2, wherein the mixed-precision quantization condition comprises: a training performance sensitivity of each layer of parameters in the audio-visual speech separation model, selected based on a priori knowledge.
6. The method of any of claims 3-5, wherein the training of the audio-visual speech separation model under mixed-precision quantization conditions using the cross-modal loss via the alternating direction method of multipliers comprises:
training the audio-visual speech separation model using the cross-modal loss via the alternating direction method of multipliers, based on the mixed-precision quantization condition determined by the first search-space sensitivity, the second search-space sensitivity and the training performance sensitivity.
7. A training system for an audio-visual speech separation model, comprising:
the prediction program module is used for inputting the mixed training audio of a plurality of speakers into the audio-visual speech separation model to obtain predicted spectrograms of the plurality of speakers;
the audio-visual feature determining program module is used for determining predicted speaker audio-visual features of the predicted spectrograms and reference speaker audio-visual features of reference spectrograms of the mixed training audio;
and the lightweight training program module is used for, based on the cross-modal loss determined from the predicted speaker audio-visual features and the reference speaker audio-visual features, training the audio-visual speech separation model under mixed-precision quantization conditions using the cross-modal loss via the alternating direction method of multipliers, to obtain a lightweight audio-visual speech separation model.
8. The system of claim 7, wherein the training of the audio-visual speech separation model under mixed-precision quantization conditions using the cross-modal loss via the alternating direction method of multipliers uses the function:
L_ρ(W, G, λ) = f(W) + (ρ/2)‖W − G − λ‖² − (ρ/2)‖λ‖²
wherein W denotes the parameters of layers 1 to L of the audio-visual speech separation model, G = {α_i Q_i}_{i=1}^{L}, α_i is a scale factor, Q_i is a quantized integer matrix, i denotes the i-th layer of the audio-visual speech separation model, λ is a Lagrangian multiplier, and ρ is a hyperparameter of the audio-visual speech separation model.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-6.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211573033.6A CN116312607A (en) | 2022-12-08 | 2022-12-08 | Training method for audio-visual voice separation model, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116312607A true CN116312607A (en) | 2023-06-23 |
Family
ID=86776788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211573033.6A Pending CN116312607A (en) | 2022-12-08 | 2022-12-08 | Training method for audio-visual voice separation model, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116312607A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118016093A (en) * | 2024-02-26 | 2024-05-10 | 山东大学 | Target voice separation method and system based on cross-modal loss |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||