WO2023216760A1 - Speech processing method, apparatus, storage medium, computer device and program product - Google Patents
Speech processing method, apparatus, storage medium, computer device and program product
- Publication number
- WO2023216760A1 (PCT/CN2023/085321)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- loss function
- training
- target
- noise
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Definitions
- the present application relates to the field of speech recognition technology, and more specifically, to a speech processing method, device, storage medium, computer equipment and program product.
- Speech enhancement is essentially speech noise reduction.
- The speech collected by a microphone is usually "contaminated" speech carrying various kinds of noise.
- The main purpose of speech enhancement is to recover the desired clean speech from this "contaminated" noisy speech, thereby effectively suppressing various interference signals and enhancing the target speech signal. This not only improves speech quality but also helps improve the performance of speech recognition.
- the application fields of speech enhancement include video conferencing and speech recognition. It is a preprocessing module for many speech coding and recognition systems. It can usually be divided into near-field speech enhancement and far-field speech enhancement.
- Existing speech enhancement uses a noise reduction and dereverberation solution based on a two-level network. However, the large amount of calculation required by the two-level network makes it impossible for speech enhancement to meet the performance requirements of practical applications.
- Embodiments of the present application provide a speech processing method, device, storage medium, computer equipment and program product, aiming to improve the performance of speech enhancement.
- the embodiment of the present application provides a speech processing method.
- the method includes: obtaining the initial speech features of the call speech; inputting the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model.
- The speech enhancement model is obtained by step-by-step training based on the deep clustering loss function and the mask inference loss function; based on the target speech features, the target speech with noise and reverberation removed is calculated.
- Embodiments of the present application also provide a speech processing device.
- The device includes: an acquisition module, used to obtain the initial speech features of the call speech; an enhancement module, used to input the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model, where the speech enhancement model is trained step by step based on the deep clustering loss function and the mask inference loss function; and a calculation module, used to calculate the target speech with noise and reverberation removed based on the target speech features.
- An embodiment of the present application also provides a computer device.
- the computer device includes a processor and a memory.
- the memory stores computer program instructions. When the computer program instructions are called by the processor, the above speech processing method is executed.
- Embodiments of the present application also provide a computer-readable storage medium that stores program code, wherein the above-mentioned speech processing method is executed when the program code is run by a processor.
- Embodiments of the present application also provide a computer program product or computer program.
- the computer program product or computer program includes computer instructions, and the computer instructions are stored in a storage medium.
- the processor of the computer device reads the computer instructions from the storage medium, and the processor executes the computer instructions, so that the computer performs the steps in the above speech processing method.
- The embodiments of this application use two different loss functions to train a preset speech enhancement model step by step, guiding the model to efficiently remove noise and reverberation from speech features. This allows the noise reduction task and the dereverberation task to each achieve the optimal training effect in a separate training process, which helps improve the model's ability to perform noise reduction and dereverberation and improves the performance of speech enhancement while reducing the computing resources required by the model.
- Figure 1 shows a schematic diagram of a common noise reduction and dereverberation method provided by an embodiment of the present application.
- Figure 2 shows a schematic architectural diagram of a speech processing system provided by an embodiment of the present application.
- Figure 3 shows a schematic flowchart of a speech processing method provided by an embodiment of the present application.
- Figure 4 shows a schematic diagram of an application scenario of a voice processing method provided by an embodiment of the present application.
- Figure 5 shows a schematic architectural diagram of a speech enhancement model provided by an embodiment of the present application.
- FIG. 6 shows a schematic flowchart of another speech processing method provided by an embodiment of the present application.
- Figure 7 shows a schematic flowchart of speech feature extraction provided by an embodiment of the present application.
- Figure 8 shows a schematic architectural diagram of a preset enhanced network provided by an embodiment of the present application.
- Figure 9 shows a module block diagram of a speech processing device provided by an embodiment of the present application.
- Figure 10 is a module block diagram of a computer device provided by an embodiment of the present application.
- Figure 11 is a module block diagram of a computer-readable storage medium provided by an embodiment of the present application.
- the client's near-end call is only suitable for single-person or short-distance calls with a small number of people, and the audio and video experience is average.
- the microphone array is divided into different subsets.
- Each subset passes through the first-level speech enhancement network to obtain the enhanced speech of each microphone.
- The enhanced speech is integrated together and then passes through the second-level speech enhancement network to obtain the final output.
- This speech enhancement solution based on a two-level network requires a large amount of calculation during the training process, which does not suit the performance requirements of actual product applications. If the number of network parameters is reduced to lower the amount of calculation, the network degrades and performs worse at speech enhancement.
- This method can obtain the initial speech features of the call speech and input the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the model.
- The speech enhancement model is obtained by step-by-step training based on the deep clustering loss function and the mask inference loss function, thereby fusing the two models (the two-level networks) into a single model and reducing the computational cost of the training process.
- the target speech with noise and reverberation removed is calculated.
- the preset speech enhancement model is trained through different loss functions to guide the model to efficiently remove noise and reverberation from the initial speech features, which improves the performance of speech enhancement while reducing model computing resources.
- FIG. 2 shows a schematic architectural diagram of a speech processing system.
- the voice processing system 300 is applied in a remote video conferencing scenario.
- the voice processing system 300 may include a near-end client 310 , a far-end client 330 and a server 350 .
- the near-end client 310, the remote client 330 and the server 350 communicate through the network.
- the near-end client 310 and the remote client 330 can be large-screen terminals used for video.
- the server side 350 can be a cloud server.
- The remote client 330 can collect the initial speech with noise and reverberation from the participants and transmit the initial speech to the server 350. After receiving the initial speech, the server 350 can use the pre-trained speech enhancement model to perform noise reduction and dereverberation on the initial speech to obtain enhanced clean speech (the target speech), and transmit the clean speech to the near-end client 310.
- the speech enhancement model can also be configured on the near-end client 310 or the remote client 330 according to the needs of the actual application scenario.
- voice processing system 300 is only an example.
- The architecture and application scenarios of the voice processing system described in the embodiments of the present application are intended to illustrate the technical solutions of the embodiments more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application.
- Those of ordinary skill in the art will know that with the evolution of speech processing system architecture and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
- Figure 3 shows a schematic flow chart of a speech processing method provided by an embodiment of the present application.
- the voice processing method is applied to the voice processing device 500 shown in Figure 9 and the computer device 600 ( Figure 10) configured with the voice processing device 500.
- the following will take computer equipment as an example to illustrate the specific process of the embodiment of the present application.
- the computer equipment applied in the embodiment of the present application can be a server or a terminal, etc.
- The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, blockchain, big data and artificial intelligence platforms.
- the terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smart watch, etc., but is not limited to this.
- FIG. 4 shows a schematic diagram of an application scenario of a voice processing method provided by an embodiment of the present application.
- The voice processing method can be applied to a specific speech enhancement system.
- the speech enhancement model 411 of the speech enhancement system can be deployed in the cloud server 410.
- The cloud server 410 can be communicatively connected to the conference terminals of the two venues (the first conference terminal 430 and the second conference terminal 450). The first conference terminal 430 and the second conference terminal 450 can collect the voices of the participants at their respective venues (i.e., the original call voices) and upload the collected voices to the cloud server 410.
- the server 410 completes the voice enhancement of the voice to obtain clean voice.
- the cloud server 410 transmits the clean voice to the corresponding conference terminal for playback.
- the speech processing method may specifically include the following steps:
- Step S110 Obtain the initial voice characteristics of the call voice.
- the computer device can obtain the initial voice characteristics of the call voice that requires voice enhancement.
- The initial speech features are acoustic features obtained by converting the call speech, such as the logarithmic power spectrum (Logarithmic Power Spectrum, LPS) and Mel-Frequency Cepstral Coefficients (MFCC), which are not limited here.
- Speech data often cannot be directly input into a model for training the way image data can: it does not exhibit obvious feature changes over long time spans, so it is difficult to learn the characteristics of speech data directly.
- The time-domain data of speech usually has a 16K sampling rate, that is, 16,000 sampling points per second. Directly inputting time-domain sampling points would lead to an excessive amount of training data and make it difficult to train a practically useful model. Therefore, in speech processing tasks, speech data is usually converted into acoustic features that serve as the input or output of the model.
- the call voice can be framed and windowed to obtain initial voice features.
- Specifically, the call speech collected by all microphones is framed and windowed in sequence to obtain the speech signal frames of the call speech, each speech signal frame is subjected to a Fast Fourier Transform (FFT) to obtain the corresponding discrete power spectrum, and the logarithm of the discrete power spectrum is then taken to obtain the logarithmic power spectrum as the initial speech feature.
- the call speech can be converted from a non-stationary time-varying signal in the time domain space into a stationary signal in the frequency domain space, which facilitates model training.
- The purpose of framing the speech signal is to group a number of speech sampling points into one frame, within which the characteristics of the speech signal can be regarded as stable. The length of a frame should be short enough that the signal within the frame is stationary, so a frame should be shorter than a phoneme, whose duration at normal speaking speed is about 50 ms. In addition, to perform Fourier analysis, one frame must contain enough vibration periods: the fundamental frequency of a male voice is around 100 Hz and that of a female voice is around 200 Hz, corresponding to periods of 10 ms and 5 ms. Therefore, the length of a speech frame is generally 10-40 ms.
- The purpose of windowing is to make each frame exhibit the characteristics of a periodic function.
- the window functions that can be used are: rectangular window, Hamming window, Hanning window, etc.
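- As an illustration of the framing, windowing, FFT and logarithm steps described above, the following Python sketch computes a logarithmic power spectrum from a time-domain signal. The 16 kHz sample rate matches the description above; the 25 ms frame length, 10 ms hop, 512-point FFT and the choice of a Hamming window are illustrative assumptions rather than values fixed by this application.

```python
import numpy as np

def log_power_spectrum(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Frame, window, FFT, then take the log power spectrum of a speech signal."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples at 16 kHz
    window = np.hamming(frame_len)                   # Hamming window (Hanning, etc. also possible)

    n_frames = max(0, (len(signal) - frame_len) // hop_len + 1)
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop_len: i * hop_len + frame_len] * window
        spectrum = np.fft.rfft(frame, n=n_fft)       # FFT of the windowed frame
        power = np.abs(spectrum) ** 2                # discrete power spectrum
        feats.append(np.log(power + 1e-10))          # logarithm -> log power spectrum (LPS)
    return np.stack(feats) if feats else np.empty((0, n_fft // 2 + 1))
```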
- the speech processing method provided by the embodiment of the present application can be used to perform speech enhancement processing on the speech of the participants, so as to remove noise and reverberation in the sound.
- The second conference terminal 450 collects the voice of the participant 420 in the conference venue through the microphone, that is, the call voice, and sends the call voice to the cloud server 410 through the network; the cloud server 410 then receives the call voice and performs framing, windowing and Fourier transform on it to obtain the initial speech features.
- Step S120 Input the initial speech features into the pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model.
- the call speech collected by the microphone array will contain both noise and reverberation.
- When a two-level network is used to denoise and dereverberate the call speech, the parameter count of the two networks during training is large and requires substantial computing resources, while reducing the number of parameters of each network also reduces the model's noise reduction and dereverberation performance.
- Therefore, the two-level networks can be fused into the same network. Compared with the combined parameter counts of the two networks, the fused model has fewer parameters, which greatly reduces the amount of calculation in the training process and also improves the model's speech enhancement performance.
- the speech enhancement model can generate target speech features corresponding to the call speech based on the input initial speech features, that is, clean speech features with noise and reverberation removed after speech enhancement.
- Figure 5 shows a schematic architectural diagram of a speech enhancement model.
- the speech enhancement model may include multiple hidden layers, deep clustering layers, speech mask inference layers, and noise mask inference layers.
- the deep clustering layer, speech mask inference layer and noise mask inference layer can be linear layers, and the inputs of the three are uniformly from the output of the hidden layer.
- the hidden layer can calculate intermediate features based on the input initial speech features, which are the intermediate values of the speech enhancement process.
- The deep clustering layer can be implemented through normalization (Normalization) and a tangent function (denoted tanh).
- The output of the hidden layer is first normalized to limit it to a certain range, such as [0, 1] or [-1, 1], to facilitate subsequent processing, and the tangent function value of the normalized result is then calculated as the output of the deep clustering layer.
- both the speech mask inference layer and the noise mask inference layer can be implemented through the softmax function.
- the speech mask inference layer can perform mask inference (MI) based on the intermediate features to obtain the target speech features that remove noise and reverberation.
- the noise mask inference layer can perform mask inference based on the intermediate features to obtain the noisy speech features.
- the deep clustering layer can assist the speech mask inference layer and the noise mask inference layer in noise reduction and dereverberation by performing deep clustering (DC) on the acquired intermediate features.
- The hidden layer can be a long short-term memory network (Long Short-Term Memory, LSTM) or a variant such as a bi-directional long short-term memory network (Bi-directional Long-Short Term Memory, Bi-LSTM), because speech features exhibit short-term stationarity as a time series, which matches the long- and short-term memory capabilities of LSTM.
- the hidden layer can also be other networks with memory properties, such as Gated Recurrent Unit (GRU).
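- For concreteness, the following is a minimal sketch of the architecture described above, written with PyTorch as an assumed framework. The layer sizes, the embedding dimension, the exact placement of the normalization and tanh operations, and the way the softmax is applied across the two mask heads are assumptions for illustration, not values specified by this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEnhancementNet(nn.Module):
    """Shared LSTM hidden layers feeding a deep clustering head and two mask inference heads."""
    def __init__(self, feat_dim=257, hidden_dim=600, num_layers=3, emb_dim=20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.dc_head = nn.Linear(hidden_dim, feat_dim * emb_dim)   # deep clustering layer
        self.clean_head = nn.Linear(hidden_dim, feat_dim)          # speech mask inference layer
        self.noise_head = nn.Linear(hidden_dim, feat_dim)          # noise mask inference layer

    def forward(self, x):                      # x: (batch, frames, feat_dim) initial speech features
        h, _ = self.lstm(x)                    # intermediate features shared by the three heads
        # deep clustering branch: linear projection, then normalization followed by tanh
        emb = torch.tanh(F.normalize(self.dc_head(h), dim=-1))
        # softmax across the two heads so the speech and noise masks sum to one per T-F bin
        logits = torch.stack([self.clean_head(h), self.noise_head(h)], dim=-1)
        masks = torch.softmax(logits, dim=-1)
        return emb, masks[..., 0], masks[..., 1]   # embedding, speech mask, noise mask
```

- In this sketch, only the speech mask head needs to be evaluated at inference time, consistent with the description below that the speech is restored solely from the output of the speech mask inference layer.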
- the model can be trained through the deep clustering loss function corresponding to the deep clustering layer, and the mask inference loss function corresponding to the speech mask inference layer and the noise mask inference layer.
- Step-by-step training works as follows. In the first step, the denoising model can be trained based on the deep clustering loss function and the mask inference loss function, and the training is stopped when the denoising model converges; here the mask inference loss function corresponding to the speech mask inference layer uses clean speech labels without noise but with reverberation.
- In the second step, the dereverberation model is trained: the denoising model trained in the first step is used as the dereverberation model, which is then trained based on the deep clustering loss function and the mask inference loss function.
- The dereverberation model is trained for several iterations, and the training is stopped when the dereverberation model converges.
- In this step, the mask inference loss function corresponding to the speech mask inference layer uses clean speech labels without noise and without reverberation.
- Therefore, the final dereverberation model, that is, the speech enhancement model, has the ability to perform noise reduction and dereverberation at the same time.
- The deep clustering loss of the speech enhancement model is a binary loss based on time-frequency point clustering. Because of its regularization characteristics, during training in the related art it is difficult for the deep clustering loss to guide the speech mask inference layer and the noise mask inference layer to effectively remove the noise and reverberation in the speech, which makes it difficult to effectively improve the speech enhancement performance of the model.
- In contrast, the step-by-step training scheme of the embodiments of the present application can achieve the optimal training effect for the noise reduction task and the dereverberation task in separate training processes, thereby helping to improve the speech enhancement model's ability to perform noise reduction and dereverberation.
- the speech enhancement model obtained through the above training can obtain intermediate features through multi-layer LSTM.
- the speech mask inference layer can perform mask inference based on the intermediate features and calculate the speech mask, that is, the target speech feature.
- the initial voice features can be input to the voice enhancement model 411.
- The voice mask inference layer of the voice enhancement model 411 can perform mask inference based on the intermediate features to calculate the speech mask, that is, the target speech features.
- the intermediate features are obtained through multi-layer LSTM.
- the calculation amount of the speech enhancement process can be effectively reduced.
- Step S130 Calculate the target speech without noise and reverberation based on the characteristics of the target speech.
- inverse feature transformation can be performed on the acquired target speech features to calculate the target speech with noise and reverberation removed.
- The cloud server 410 can convert the target speech features, that is, the clean speech features, into the target speech through an inverse Fourier transform, thereby obtaining clean speech with noise and reverberation removed.
- The cloud server 410 can then send the clean speech to the first conference terminal 430, and the speaker of the first conference terminal 430 plays the speech of the participant 420 with noise and reverberation removed.
- the initial voice features of the call voice can be obtained, and the initial voice features can be input into a pre-trained voice enhancement model to obtain the target voice features output by the voice enhancement model.
- The voice enhancement model is obtained through step-by-step training based on the deep clustering loss function and the mask inference loss function. Based on the target speech features, the target speech with noise and reverberation removed is calculated.
- the pre-set speech enhancement model is trained through different loss functions to guide the model to efficiently remove noise and reverberation from the initial speech features, thereby improving the performance of speech enhancement while reducing model computing resources.
- the speech processing device will be specifically integrated in a computer device as an example for description.
- Figure 6 shows another voice processing method provided by an embodiment of the present application.
- The voice processing method is applied to the preset enhancement network shown in Figure 8.
- The process shown in Figure 6 will be described in detail below.
- Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
- In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a way similar to human intelligence.
- Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology.
- Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
- Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- the solutions provided by the embodiments of this application involve artificial intelligence speech technology (Speech Technology) and other technologies.
- Speech technology includes automatic speech recognition technology (Automatic Speech Recognition, ASR), speech synthesis technology (Text To Speech, TTS) and voiceprint recognition technology (Voiceprint Recognition, VPR).
- the speech processing method may specifically include the following steps:
- Step S210 The computer device obtains a training sample set.
- The speech processing method provided in the embodiments of the present application includes the training of a preset enhancement network. It is worth mentioning that the training of the preset enhancement network can be performed in advance based on the acquired training sample set; thereafter, each time speech enhancement is needed, the trained speech enhancement model can be used to calculate the target speech features with noise and reverberation removed, without having to train the preset enhancement network again.
- the wsj0-2mix (Wall Street Journal) data set can be used to determine the training sample set.
- The wsj0-2mix data set contains a 30-hour speech training set and a 10-hour speech validation set.
- The speech of different speakers is randomly selected from the corresponding set and mixed at a random relative signal-to-noise ratio (Signal to Noise Ratio, SNR) between 0 dB and 10 dB to generate the noisy, mixed speech signals used for network training.
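- A minimal sketch of this mixing step is shown below, assuming NumPy and that the interfering signal is at least as long as the target utterance; how utterances are paired, padded or truncated is not specified here and is an assumption.

```python
import numpy as np

def mix_at_random_snr(speech, interference, low_db=0.0, high_db=10.0):
    """Mix two signals at a random relative SNR drawn uniformly from [low_db, high_db] dB."""
    snr_db = np.random.uniform(low_db, high_db)
    speech_power = np.mean(speech ** 2)
    interference = interference[: len(speech)]                 # assume interference is long enough
    interference_power = np.mean(interference ** 2) + 1e-10
    # scale the interference so that 10 * log10(speech_power / scaled_power) == snr_db
    scale = np.sqrt(speech_power / (interference_power * 10 ** (snr_db / 10)))
    return speech + scale * interference
```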
- the step of obtaining a training sample set by the computer device may include:
- the computer device acquires the first sample speech.
- the computer device extracts speech features from the first sample speech to obtain noise speech features.
- the computer device acquires the second sample speech.
- the computer device performs speech feature extraction on the second sample speech to obtain the first clean speech label and the second clean speech label.
- the computer device determines the deep cluster annotation based on the first sample speech and the second sample speech.
- the first sample speech is speech containing noise and reverberation collected based on the microphone.
- The second sample speech includes clean speech that is free of noise but still contains reverberation, and clean speech that is free of both noise and reverberation.
- Deep clustering annotation is the ratio of the features of the first sample speech and the second sample speech at each time-frequency point.
- the computer device can directly collect the call speech containing noise and reverberation through the microphone.
- the speech of the participants collected through the microphone of the large-screen conference terminal are used as the first sample speech.
- technicians can directly obtain the first sample speech from the already constructed noise reduction training corpus.
- the computer device can perform voice feature extraction on the acquired first sample voice.
- Figure 7 shows a schematic flow chart of voice feature extraction.
- For example, the microphone collects the call speech containing noise and reverberation, that is, the first sample speech, and multi-frame time-domain speech signals are obtained through framing and windowing, where the frame index i satisfies 0 < i ≤ n and i ∈ N*, n is the total number of frames, t denotes the time-domain space, and N* denotes the set of positive integers. The computer device can then perform an FFT on each frame of the speech signal to convert it from the time-domain space to the frequency-domain space and obtain the corresponding discrete power spectrum, and calculate the logarithm of the discrete power spectrum to obtain the logarithmic power spectrum.
- The noisy speech label can then be marked according to the noisy speech features, that is, the FFT results of the speech signals from frame 1 to frame n respectively.
- Computer equipment can also obtain clean speech as a reference from the noise reduction training corpus, and use the clean speech as the second sample speech.
- Specifically, clean speech that is noise-free but contains reverberation and clean speech that is free of both noise and reverberation can be obtained; speech feature extraction is then performed on the noise-free, reverberant clean speech to obtain the first clean speech label, and on the noise-free, reverberation-free clean speech to obtain the second clean speech label.
- The mathematical expressions of the noisy speech label, the first clean speech label and the second clean speech label are feature vectors (embeddings), also called embedding vectors, where the length of the feature vector is the dimension of the feature.
- The computer device can determine the deep clustering annotation by comparing the speech energy of the first sample speech and the second sample speech at each time-frequency point. Since the speech signal changes with time, its energy also changes with time; therefore, when calculating the energy of the digitized speech signal, the overall energy is not computed, but rather the energy at each time-frequency point is computed frame by frame.
- The computer device can use the energy ratio of the speech without noise but with reverberation to the noisy speech as the deep clustering annotation.
- The energy ratio of the speech without noise and without reverberation to the noisy speech can also be used as the deep clustering annotation.
- The deep clustering annotation is used for the calculation of the deep clustering loss function.
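- The sketch below illustrates one way to compute such an annotation from log power spectra, assuming NumPy; the clipping to [0, 1] and the optional binarized (dominance) version are illustrative assumptions rather than the exact labeling scheme of this application.

```python
import numpy as np

def deep_clustering_annotation(clean_lps, noisy_lps, threshold=0.5):
    """Energy ratio of the clean reference to the noisy speech at each time-frequency point.

    clean_lps, noisy_lps: log power spectra with shape (frames, bins).
    Returns the per-bin ratio and an optional binary (dominance) annotation.
    """
    ratio = np.exp(clean_lps - noisy_lps)          # energy ratio per T-F point
    ratio = np.clip(ratio, 0.0, 1.0)
    binary = (ratio > threshold).astype(np.float32)
    return ratio, binary
```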
- Step S220 The computer device obtains a preset enhanced network.
- the preset enhanced network includes hidden layers, deep clustering (Deep Clustering) layers and mask inference layers.
- The preset enhancement network is a network with bottom-layer weight sharing and multi-head output.
- The deep clustering layer can assist the speech mask inference layer and the noise mask inference layer in performing mask inference, so that the speech mask inference layer and the noise mask inference layer can effectively distinguish noise and reverberation in speech during the network training process.
- the hidden layer can use LSTM or Bi-LSTM.
- the hidden layer shown in Figure 8 is LSTM.
- The mask inference layer includes a speech mask inference layer (Clean-MI) and a noise mask inference layer (Noise-MI).
- the speech mask inference layer can calculate the mask of speech, that is, the clean speech label, and the noise mask inference layer can calculate the mask of noise and reverberation, that is, the noisy speech label. It should be noted that during the application process, only the mask output by the speech mask inference layer is used to restore the speech. Therefore, the calculation amount of the speech enhancement process is not increased, thereby improving the speech enhancement efficiency.
- Step S230 The computer device performs noise removal training and reverberation removal training step by step on the preset enhancement network through the training sample set until the preset enhancement network meets the preset conditions, and obtains the trained target enhancement network as a speech enhancement model.
- The target enhancement network obtained after training, that is, the speech enhancement model, needs to perform the two enhancement tasks of noise reduction and dereverberation at the same time. If these two enhancement tasks were trained simultaneously, the training of the preset enhancement network could not reach the optimal training effect. For this purpose, step-by-step training can be adopted, and the training processes of the two tasks can be carried out separately.
- embodiments of the present application provide two step-by-step training methods. For example, noise removal training can be performed first, and then reverberation removal training can be performed, or reverberation removal training can be performed first, and then noise removal training can be performed.
- The purpose of noise removal training is to equip the network with the ability to reduce noise, and the purpose of reverberation removal training is to equip the network with the ability to remove reverberation, so that both enhancement tasks can achieve the optimal training effect in separate training processes, thereby improving the speech enhancement performance of the speech enhancement model.
- the computer device performs noise removal training and reverberation removal training on the preset enhancement network step by step through the training sample set until the preset enhancement network meets the preset conditions.
- the steps may include:
- The computer device inputs the noisy speech features into the hidden layer, and generates intermediate training features through the hidden layer.
- the computer device inputs the intermediate training features into the deep clustering layer, and generates cluster training annotations through the deep clustering layer.
- the computer device inputs the intermediate training features into the speech mask inference layer, and generates clean speech training features through the speech mask inference layer.
- the computer device inputs the intermediate training features into the noise mask inference layer, and generates the noise speech training features through the noise mask inference layer.
- The computer device constructs a target loss function based on the clean speech labels, the noisy speech labels, the deep clustering annotations, the clean speech training features, the noisy speech training features and the cluster training annotations, and performs noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions.
- The intermediate training features are the intermediate values generated by the hidden layer of the preset enhancement network, and they can be input as a shared value to the deep clustering layer, the speech mask inference layer and the noise mask inference layer respectively, so as to achieve bottom-layer weight sharing and reduce the number of network parameters.
- the speech mask inference layer and the noise mask inference layer can respectively generate clean speech training features y clean and noisy speech training features y noise based on the intermediate training features.
- the deep clustering layer can generate cluster training annotations y dc based on the intermediate training features.
- The step in which the computer device constructs a target loss function based on the clean speech labels, noisy speech labels, deep clustering annotations, clean speech training features, noisy speech training features and cluster training annotations, and performs noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions, may include:
- the computer device determines the first loss function based on the cluster training annotation and the deep cluster annotation.
- the first loss function is the deep clustering loss function.
- In the first loss function, y dc denotes the cluster training annotation, and the deep clustering annotation serves as the corresponding reference.
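- The formula itself is not reproduced above. For reference, a commonly used form of the deep clustering loss, given here as an assumption rather than the exact loss of this application, compares the affinity matrix of the network embeddings with that of the annotations; the sketch below uses the standard expansion of ||VVᵀ − BBᵀ||²_F that avoids forming the full (T·F)×(T·F) matrices.

```python
import torch

def deep_clustering_loss(emb, annot):
    """|| V V^T - B B^T ||_F^2 computed via its expansion, per utterance.

    emb:   (num_tf_points, emb_dim) embeddings from the deep clustering layer
    annot: (num_tf_points, num_classes) one-hot deep clustering annotation
    """
    vtv = emb.t() @ emb          # (emb_dim, emb_dim)
    btb = annot.t() @ annot      # (num_classes, num_classes)
    vtb = emb.t() @ annot        # (emb_dim, num_classes)
    return (vtv ** 2).sum() - 2 * (vtb ** 2).sum() + (btb ** 2).sum()
```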
- the computer device determines the second loss function based on the clean speech training features and clean speech labels.
- two different second loss functions can be determined based on different clean speech labels.
- The computer device may determine the noise removal loss function based on the clean speech training feature y clean and the first clean speech label, and use the noise removal loss function as the second loss function.
- Alternatively, the computer device may determine the reverberation removal loss function based on the clean speech training feature y clean and the second clean speech label, and use the reverberation removal loss function as the second loss function.
- the computer device determines the third loss function based on the noise speech training characteristics and the noise speech label.
- Here, y noise is the noisy speech training feature and the noisy speech label is the corresponding reference.
- The second loss function Loss clean and the third loss function Loss noise are the mask inference loss functions.
- The computer device constructs the target loss function of the preset enhancement network based on the first loss function, the second loss function and the third loss function, and performs noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions.
- the computer device can construct the target loss function Loss of the preset enhancement network based on the first loss function Loss dc , the second loss function Loss clean and the third loss function Loss noise .
- The above three loss functions can be weighted by their corresponding weight parameters and summed to obtain the target loss function, as follows: Loss = α · Loss dc + β · Loss clean + γ · Loss noise, where α, β and γ are the weight parameters.
- Noise removal training and reverberation removal training are then performed on the preset enhancement network step by step according to the target loss function Loss until the preset enhancement network meets the preset conditions.
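- Putting the pieces together, the following sketch shows the weighted sum and a single training step. The weight values, the use of mean-squared error for the mask inference losses, the batch dictionary keys and the simple placeholder for the deep clustering term are all illustrative assumptions. Step-by-step training then amounts to running this step first with the reverberant clean label (noise removal training) and then, after convergence, with the reverberation-free clean label (reverberation removal training).

```python
import torch

def target_loss(loss_dc, loss_clean, loss_noise, alpha=0.1, beta=1.0, gamma=1.0):
    """Loss = alpha * Loss_dc + beta * Loss_clean + gamma * Loss_noise (weights are assumptions)."""
    return alpha * loss_dc + beta * loss_clean + gamma * loss_noise

def train_step(model, batch, optimizer, clean_label_key):
    """One optimization step; clean_label_key selects the reverberant or the dry clean label."""
    mse = torch.nn.MSELoss()
    emb, clean_mask, noise_mask = model(batch["noisy_feats"])
    loss_dc = ((emb - batch["dc_annot"]) ** 2).mean()        # simple stand-in for the DC loss term
    loss_clean = mse(clean_mask, batch[clean_label_key])      # second loss function (Loss clean)
    loss_noise = mse(noise_mask, batch["noise_label"])        # third loss function (Loss noise)
    loss = target_loss(loss_dc, loss_clean, loss_noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```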
- The preset enhancement network can be trained based on multi-task learning (Multi-Task Learning), so that the deep clustering loss function and the mask inference loss function are combined to learn the two enhancement tasks of noise reduction and dereverberation simultaneously. The two tasks can share the information they learn through shared parameters during the learning process, which enables the trained target enhancement network to achieve a better generalization effect.
- noise refers to "unwanted sounds" in certain situations, such as human noise and various sudden sounds.
- Reverberation refers to the phenomenon of sound continuation that still exists after the indoor sound source stops emitting sound.
- For example, in a conference scenario the noise in the sound collected by the conference terminal mainly needs to be removed, while in a professional recording venue the reverberation in the sound collected by the recording equipment mainly needs to be removed. Therefore, different step-by-step training methods can be used according to the actual scenario in which the final speech enhancement model is applied.
- Specifically, the application scenario attributes can be obtained based on the actual scenario in which the final speech enhancement model is used, and the corresponding step-by-step training strategy can be determined based on the application scenario attributes.
- The target loss function of the preset enhancement network is then constructed, and based on the target loss function, the preset enhancement network undergoes noise removal training and reverberation removal training step by step until the preset enhancement network meets the preset conditions.
- the application scenario attributes are used to characterize the actual scenarios in which the speech enhancement model is applied, for example, focusing on noise reduction scene attributes or focusing on dereverberation scene attributes.
- The step-by-step training strategies include a first step-by-step training strategy and a second step-by-step training strategy.
- The first step-by-step training strategy is used for scenarios that focus on noise reduction: noise removal training is performed first, followed by reverberation removal training.
- The second step-by-step training strategy is used for scenarios that focus on dereverberation: reverberation removal training is performed first, followed by noise removal training.
- the conference terminal collects not only the voice of the speaker, but also the voices of other speakers. It is necessary to perform noise reduction processing on the voice collected by the conference terminal.
- In this case, noise removal training can be performed first, followed by reverberation removal training.
- The computer device can determine a target loss function of the preset enhancement network based on the first step-by-step training strategy, the first loss function, the second loss function and the third loss function, where the second loss function is determined by the noise removal loss function, and then iteratively perform noise removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, obtaining a noise removal network that only performs noise reduction.
- Then, the computer device can determine the target loss function of the noise removal network based on the first loss function, the second loss function and the third loss function, where the second loss function is determined by the reverberation removal loss function, and iteratively perform reverberation removal training on the noise removal network according to the target loss function until the noise removal network meets the preset conditions. In this way, performing separate noise removal training first prevents the training process from being interfered with by reverberation factors, so that the generated target enhancement network has better noise reduction performance.
- Alternatively, reverberation removal training can be performed first, followed by noise removal training.
- The computer device may determine a target loss function of the preset enhancement network based on the second step-by-step training strategy, the first loss function, the second loss function and the third loss function, where the second loss function is determined by the reverberation removal loss function, and then iteratively perform reverberation removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, obtaining a reverberation removal network.
- the reverberation removal network only plays a role in dereverberation.
- Then, the computer device can determine the target loss function of the reverberation removal network based on the first loss function, the second loss function and the third loss function, where the second loss function is determined by the noise removal loss function, and iteratively perform noise removal training on the reverberation removal network according to the target loss function until the reverberation removal network meets the preset conditions. In this way, performing separate reverberation removal training first prevents the training process from being interfered with by noise factors, so that the generated target enhancement network has better dereverberation performance.
- In one example, the preset enhancement network can first be trained for noise removal and then for reverberation removal, so that the ability to remove reverberation is learned on top of a well-performing noise reduction network. In this way, the optimal training effect can be achieved in both training processes, thereby improving the speech enhancement performance of the speech enhancement model.
- the preset conditions may be: the total loss value of the target loss function is less than the preset value, the total loss value of the target loss function no longer changes, or the number of training times reaches the preset number, etc.
- an optimizer can be used to optimize the target loss function, and the learning rate, batch size during training, and training period (epoch) can be set based on experimental experience.
- Each training period includes multiple training iterations; as the parameters of the network to be trained are continuously optimized, the above total loss value becomes smaller and smaller and finally falls below a fixed value, or below the above preset value.
- At that point it can be determined that the network to be trained has converged; of course, it can also be determined that the preset enhancement network / noise removal network / reverberation removal network has converged once the number of training iterations reaches the preset number.
- The mask inference loss is only used during the validation process of the target enhancement network, that is, during speech enhancement model selection.
- the output of the mask inference branch is used as the mask after speech enhancement, that is, the target speech feature.
- Step S240 The computer device obtains the initial voice characteristics of the call voice.
- Step S250 The computer device inputs the initial speech features into the hidden layer, and generates intermediate features through the hidden layer.
- Step S260 The computer device inputs the intermediate features into the speech mask inference layer, generates clean speech features through the speech mask inference layer, and uses the clean speech features as target speech features.
- the computer device can perform voice feature extraction on the call voice, including frame processing, windowing processing and Fourier transform on the call voice to obtain the initial voice features.
- The computer device can input the initial speech features into the hidden layer of the speech enhancement model and generate the intermediate features through the hidden layer.
- The computer device can then input the intermediate features into the speech mask inference layer, generate the clean speech features through the speech mask inference layer, and use the clean speech features as the target speech features.
- Step S270 The computer device performs feature inverse transformation on the target speech features, and calculates the target speech with noise and reverberation removed.
- the computer device can perform feature inverse transformation on the target speech features, converting the target speech features (mask) in the frequency domain space into the target speech in the time domain space.
- the inverse feature transform may be an inverse Fourier transform.
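- A simplified sketch of this inverse transformation is given below, assuming NumPy, a complex STFT of the noisy speech, and the same frame and hop sizes as in the feature-extraction sketch above; a production implementation would also normalize for the analysis and synthesis windows in the overlap-add.

```python
import numpy as np

def masked_istft(noisy_stft, speech_mask, frame_len=400, hop_len=160, n_fft=512):
    """Apply the predicted speech mask to the noisy spectrum and resynthesize by inverse FFT + overlap-add."""
    enhanced = noisy_stft * speech_mask                         # (frames, bins), complex after masking
    out = np.zeros(hop_len * (enhanced.shape[0] - 1) + frame_len)
    window = np.hamming(frame_len)
    for i, frame_spec in enumerate(enhanced):
        frame = np.fft.irfft(frame_spec, n=n_fft)[:frame_len]   # inverse Fourier transform per frame
        out[i * hop_len: i * hop_len + frame_len] += frame * window
    return out
```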
- a training sample set and a preset enhancement network can be obtained, and the preset enhancement network can be step-by-step noise removal training and reverberation removal training through the training sample set until the preset enhancement network meets the preset conditions.
- the trained target enhancement network is obtained as a speech enhancement model.
- the initial speech features are input into the hidden layer, intermediate features are generated through the hidden layer, and the intermediate features are input into the speech mask inference layer.
- Clean speech features are generated through the speech mask inference layer and used as the target speech features; the target speech features are then inversely transformed to calculate the target speech with noise and reverberation removed. Therefore, only the target speech features output by the speech mask inference layer of the speech enhancement model are needed to restore the speech, which avoids increasing the amount of calculation in the speech enhancement process and thereby improves the efficiency of speech enhancement.
- FIG. 9 shows a structural block diagram of a speech processing device 500 provided by an embodiment of the present application.
- The speech processing device 500 includes: an acquisition module 510, configured to obtain the initial speech features of the call speech; an enhancement module 520, configured to input the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model, where the speech enhancement model is obtained by step-by-step training based on the deep clustering loss function and the mask inference loss function; and a calculation module 530, configured to calculate the target speech with noise and reverberation removed according to the target speech features.
- the speech processing device 500 may also include: a sample acquisition module, a network acquisition module, and a model training module.
- the sample acquisition module is configured to obtain a training sample set, which includes noisy speech features, clean speech labels, noisy speech labels, and deep clustering annotations;
- the network acquisition module is configured to obtain a preset enhancement network, and the preset enhancement network includes hidden layer, deep clustering layer and mask inference layer;
- the network training module is configured to perform noise removal training and reverberation removal training on the preset enhancement network step by step through the training sample set until the preset enhancement network meets the preset conditions, and to obtain the trained target enhancement network as the speech enhancement model.
- the mask inference layer includes a speech mask inference layer and a noise mask inference layer
- the network training module may include: a hidden unit configured to input noise speech features into the hidden layer and generate intermediate training features through the hidden layer;
- the deep clustering unit is configured to input the intermediate training features into the deep clustering layer, and generate cluster training annotations through the deep clustering layer;
- the speech inference unit is configured to input the intermediate training features into the speech mask inference layer, and generate clean speech training features through the speech mask inference layer;
- the noise inference unit is configured to input the intermediate training features into the noise mask inference layer, and generate the noisy speech training features through the noise mask inference layer;
- the network training unit is configured to construct a target loss function based on the clean speech label, the noise speech label, the deep clustering annotation, the clean speech training features, the noise speech training features, and the cluster training annotations, and to perform noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions.
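- as one possible, non-limiting sketch of such a preset enhancement network, the PyTorch module below wires a shared hidden layer (here a bidirectional LSTM) to a deep clustering head and to speech and noise mask inference heads; the layer sizes and the sigmoid mask activation are assumptions made for illustration and are not taken from the application.

```python
# Illustrative network sketch (layer sizes and activations are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PresetEnhancementNet(nn.Module):
    def __init__(self, n_freq=257, hidden=300, emb_dim=20):
        super().__init__()
        self.n_freq = n_freq
        self.emb_dim = emb_dim
        # Shared hidden layer producing the intermediate (training) features.
        self.hidden_layer = nn.LSTM(n_freq, hidden, num_layers=2,
                                    batch_first=True, bidirectional=True)
        self.dc_head = nn.Linear(2 * hidden, n_freq * emb_dim)   # deep clustering layer
        self.speech_mask_head = nn.Linear(2 * hidden, n_freq)    # speech mask inference layer
        self.noise_mask_head = nn.Linear(2 * hidden, n_freq)     # noise mask inference layer

    def forward(self, noisy_feats):                 # noisy_feats: (batch, frames, n_freq)
        mid, _ = self.hidden_layer(noisy_feats)     # intermediate features
        b, t, _ = mid.shape
        # Cluster training annotations come from unit-norm embeddings per TF bin.
        emb = F.normalize(
            self.dc_head(mid).view(b, t * self.n_freq, self.emb_dim), dim=-1)
        speech_mask = torch.sigmoid(self.speech_mask_head(mid))  # clean speech training features
        noise_mask = torch.sigmoid(self.noise_mask_head(mid))    # noise speech training features
        return emb, speech_mask, noise_mask
```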
- the network training unit includes: a first subunit, configured to determine the first loss function based on the cluster training annotations and the deep clustering annotations; a second subunit, configured to determine the second loss function based on the clean speech training features and the clean speech label; a third subunit, configured to determine the third loss function based on the noise speech training features and the noise speech label; and a training subunit, configured to construct the target loss function of the preset enhancement network based on the first loss function, the second loss function, and the third loss function, and to perform noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions.
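- a hedged sketch of how the first, second, and third loss functions might be combined into the target loss is given below. The deep clustering term follows the standard affinity formulation, the two mask-inference terms are simple regressions against the clean and noise labels, and the weight values are illustrative assumptions (a weighted summation with per-loss weight parameters is one of the options described in this application).

```python
# Illustrative loss construction (weight values are assumptions).
import torch

def deep_clustering_loss(emb, dc_label):
    # emb: (batch, TF, D) unit embeddings; dc_label: (batch, TF, C) one-hot annotations.
    # Affinity form of ||VV^T - YY^T||_F^2, expanded to avoid building TFxTF matrices.
    vtv = torch.matmul(emb.transpose(1, 2), emb)             # (batch, D, D)
    vty = torch.matmul(emb.transpose(1, 2), dc_label)        # (batch, D, C)
    yty = torch.matmul(dc_label.transpose(1, 2), dc_label)   # (batch, C, C)
    return (vtv.pow(2).sum() - 2 * vty.pow(2).sum() + yty.pow(2).sum()) / emb.shape[0]

def target_loss(emb, speech_mask, noise_mask, noisy_mag,
                dc_label, clean_label, noise_label, w=(0.2, 1.0, 1.0)):
    l1 = deep_clustering_loss(emb, dc_label)                        # first loss function
    l2 = torch.mean((speech_mask * noisy_mag - clean_label) ** 2)   # second loss function
    l3 = torch.mean((noise_mask * noisy_mag - noise_label) ** 2)    # third loss function
    return w[0] * l1 + w[1] * l2 + w[2] * l3                        # weighted summation
```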
- the second subunit may be specifically configured to: determine the noise removal loss function based on the clean speech training features and the first clean speech label, and use the noise removal loss function as the second loss function, where the first clean speech label is a speech label obtained based on speech that contains no noise but retains reverberation.
- the second subunit may also be specifically configured to: determine the reverberation removal loss function based on the clean speech training features and the second clean speech label, and use the reverberation removal loss function as the second loss function, where the second clean speech label is a speech label obtained based on speech without noise and without reverberation.
- the training subunit may be specifically configured to: determine the target loss function of the preset enhancement network based on the first loss function, the second loss function, and the third loss function, and iteratively perform noise removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, obtaining a noise removal network, where the second loss function is determined by the noise removal loss function; and then determine the target loss function of the noise removal network according to the first loss function, the reverberation removal loss function, and the third loss function, and iteratively perform reverberation removal training on the noise removal network according to the target loss function until the noise removal network meets the preset conditions, where the second loss function is determined by the reverberation removal loss function.
- the training subunit may alternatively be specifically configured to: determine the target loss function of the preset enhancement network based on the first loss function, the second loss function, and the third loss function, and iteratively perform reverberation removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, obtaining a reverberation removal network, where the second loss function is determined by the reverberation removal loss function; and then determine the target loss function of the reverberation removal network according to the first loss function, the second loss function, and the third loss function, and iteratively perform noise removal training on the reverberation removal network according to the target loss function until the reverberation removal network meets the preset conditions, where the second loss function is determined by the noise removal loss function.
- the sample acquisition module may be specifically configured to: acquire first sample speech, the first sample speech being speech containing noise and reverberation collected by a microphone; perform speech feature extraction on the first sample speech to obtain noise speech features; acquire second sample speech, the second sample speech including clean speech without noise but with reverberation and clean speech without noise and without reverberation; perform speech feature extraction on the second sample speech to obtain the first clean speech label and the second clean speech label; and determine the deep clustering annotation based on the first sample speech and the second sample speech.
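- a hedged sketch of assembling one training sample as described above follows. It assumes the STFT magnitude is used as the speech feature, approximates the noise speech label from the difference between the noisy and the reverberant clean spectra, and approximates the deep clustering annotation as a per-time-frequency-bin one-hot label marking whether speech or interference dominates; these are illustrative assumptions rather than the application's exact labeling rule.

```python
# Illustrative sample construction; the dominance rule and noise label are assumptions.
import numpy as np
from scipy.signal import stft

def make_training_sample(noisy_wave, clean_reverb_wave, clean_dry_wave, fs=16000, nperseg=512):
    def magnitude(x):
        return np.abs(stft(x, fs=fs, nperseg=nperseg)[2]).T        # (frames, freq)

    noisy_feat = magnitude(noisy_wave)            # noise speech features (first sample speech)
    clean_label_1 = magnitude(clean_reverb_wave)  # first clean speech label: no noise, with reverberation
    clean_label_2 = magnitude(clean_dry_wave)     # second clean speech label: no noise, no reverberation
    noise_label = np.maximum(noisy_feat - clean_label_1, 0.0)      # rough noise speech label (approximation)

    # Deep clustering annotation: one-hot over {speech-dominant, interference-dominant}
    # per time-frequency bin, flattened to (frames * freq, 2) for the clustering loss.
    speech_dominant = clean_label_1 > noise_label
    dc_label = np.stack([speech_dominant, ~speech_dominant], axis=-1)
    dc_label = dc_label.reshape(-1, 2).astype(np.float32)

    return noisy_feat, clean_label_1, clean_label_2, noise_label, dc_label
```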
- the speech enhancement model includes a hidden layer, a deep clustering layer, a speech mask inference layer and a noise mask inference layer.
- the enhancement module 520 may be specifically configured to: input the initial speech features into the hidden layer, and generate intermediate features through the hidden layer; input the intermediate features into the speech mask inference layer, generate clean speech features through the speech mask inference layer, and use the clean speech features as the target speech features;
- the calculation module 530 may be specifically configured to perform a feature inverse transformation on the target speech features to calculate the target speech with noise and reverberation removed.
- the coupling between modules may be electrical, mechanical or other forms of coupling.
- each functional module in each embodiment of the present application can be integrated into one processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.
- the above integrated modules can be implemented in the form of hardware or software function modules.
- the solution provided by this application can obtain the initial speech features of the call speech, input the initial speech features into the pre-trained speech enhancement model, and obtain the target speech features output by the speech enhancement model.
- the speech enhancement model is obtained by performing step-by-step training based on the deep clustering loss function and the mask inference loss function, and the target speech with noise and reverberation removed is calculated based on the target speech features.
- the preset speech enhancement model is trained with different loss functions that guide the model to remove noise and reverberation in speech efficiently, which improves speech enhancement performance while reducing the model's computational cost.
- the embodiment of the present application also provides a computer device 600.
- the computer device 600 includes a processor 610, a memory 620, a power supply 630, and an input unit 640.
- the memory 620 stores computer program instructions, and when the computer program instructions are called by the processor 610, the method steps provided by the above embodiments can be executed.
- the structure of the computer device shown in the figures does not constitute a limitation on the computer device, which may include more or fewer components than shown in the figures, combine certain components, or arrange the components differently. Specifically:
- Processor 610 may include one or more processing cores.
- the processor 610 uses various interfaces and lines to connect various parts of the entire computer device, and performs the various functions of the computer device and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 620 and calling the data stored in the memory 620, thereby controlling the computer device as a whole.
- the processor 610 may be implemented in hardware in the form of at least one of Digital Signal Processing (DSP), a Field-Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
- the processor 610 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, and application programs; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may not be integrated into the processor 610 and may instead be implemented by a separate communication chip.
- the memory 620 may include Random Access Memory (RAM) or Read-Only Memory (ROM). The memory 620 may be used to store instructions, programs, code, code sets, or instruction sets.
- the memory 620 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the various method embodiments, and the like.
- the data storage area can also store data created during use of the computer device (such as phone books and audio and video data). Accordingly, the memory 620 may also include a memory controller to provide the processor 610 with access to the memory 620.
- the power supply 630 can be logically connected to the processor 610 through a power management system, thereby implementing functions such as charging, discharging, and power consumption management through the power management system.
- the power supply 630 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and any other related components.
- the input unit 640 can be used to receive input numeric or character information, and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
- the computer device 600 may also include a display unit and the like, which will not be described again here.
- the processor 610 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 620 according to the following instructions, and the processor 610 runs the application programs stored in the memory 620, thereby implementing the method steps provided by the foregoing embodiments.
- the embodiment of the present application also provides a computer-readable storage medium 700.
- the computer-readable storage medium 700 stores computer program instructions 710.
- the computer program instructions 710 can be called by a processor to execute the methods described in the above embodiments.
- the computer-readable storage medium may be electronic memory such as flash memory, electrically erasable programmable read-only memory (EEPROM), EPROM, hard disk, or ROM.
- the computer-readable storage medium includes non-transitory computer-readable storage medium (Non-Transitory Computer-Readable Storage Medium).
- the computer-readable storage medium 700 has storage space for program codes that perform any method steps in the above methods. These program codes can be read from or written into one or more computer program products. The program code may, for example, be compressed in a suitable form.
- a computer program product or computer program includes computer instructions stored in a computer-readable storage medium.
- the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various optional implementations provided by the above embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Claims (17)
- A speech processing method, the method comprising: obtaining initial speech features of call speech; inputting the initial speech features into a speech enhancement model to obtain target speech features output by the speech enhancement model, wherein the speech enhancement model is obtained by performing step-by-step training based on a deep clustering loss function and a mask inference loss function; and calculating, according to the target speech features, target speech with noise and reverberation removed.
- The method according to claim 1, wherein the method further comprises pre-training the speech enhancement model in the following manner: obtaining a training sample set, wherein the training sample set comprises noise speech features, clean speech labels, noise speech labels, and deep clustering annotations; obtaining a preset enhancement network; and performing noise removal training and reverberation removal training on the preset enhancement network step by step using the training sample set until the preset enhancement network meets a preset condition, and using the trained target enhancement network as the speech enhancement model.
- The method according to claim 2, wherein the preset enhancement network comprises a hidden layer, a deep clustering layer, and a mask inference layer, the mask inference layer comprises a speech mask inference layer and a noise mask inference layer, and the performing noise removal training and reverberation removal training on the preset enhancement network step by step using the training sample set until the preset enhancement network meets the preset condition comprises: inputting the noise speech features into the hidden layer, and generating intermediate training features through the hidden layer; inputting the intermediate training features into the deep clustering layer, and generating cluster training annotations through the deep clustering layer; inputting the intermediate training features into the speech mask inference layer, and generating clean speech training features through the speech mask inference layer; inputting the intermediate training features into the noise mask inference layer, and generating noise speech training features through the noise mask inference layer; and constructing a target loss function according to the clean speech labels, the noise speech labels, the deep clustering annotations, the clean speech training features, the noise speech training features, and the cluster training annotations, and performing noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset condition.
- The method according to claim 3, wherein the constructing a target loss function according to the clean speech labels, the noise speech labels, the deep clustering annotations, the clean speech training features, the noise speech training features, and the cluster training annotations, and performing noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset condition comprises: determining a first loss function according to the cluster training annotations and the deep clustering annotations; determining a second loss function according to the clean speech training features and the clean speech labels; determining a third loss function according to the noise speech training features and the noise speech labels; and constructing the target loss function of the preset enhancement network according to the first loss function, the second loss function, and the third loss function, and performing noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset condition.
- The method according to claim 4, wherein the constructing the target loss function of the preset enhancement network according to the first loss function, the second loss function, and the third loss function comprises: performing a weighted summation of the first loss function, the second loss function, and the third loss function based on weight parameters respectively corresponding to the first loss function, the second loss function, and the third loss function, to obtain the target loss function of the preset enhancement network.
- The method according to claim 4, wherein the clean speech labels comprise a first clean speech label, and the determining a second loss function according to the clean speech training features and the clean speech labels comprises: determining a noise removal loss function according to the clean speech training features and the first clean speech label; and using the noise removal loss function as the second loss function, the first clean speech label being a speech label obtained based on speech without noise but with reverberation.
- The method according to claim 4, wherein the clean speech labels comprise a second clean speech label, and the determining a second loss function according to the clean speech training features and the clean speech labels comprises: determining a reverberation removal loss function according to the clean speech training features and the second clean speech label; and using the reverberation removal loss function as the second loss function, the second clean speech label being a speech label obtained based on speech without noise and without reverberation.
- The method according to claim 5, 6, or 7, wherein the constructing the target loss function of the preset enhancement network according to the first loss function, the second loss function, and the third loss function, and performing noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset condition comprises: obtaining an application scenario attribute; determining a corresponding distribution training strategy according to the application scenario attribute; and based on the distribution training strategy, constructing the target loss function of the preset enhancement network according to the first loss function, the second loss function, and the third loss function, and performing noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset condition.
- The method according to claim 8, wherein the distribution training strategy comprises a first distribution training strategy, and the constructing and step-by-step training based on the distribution training strategy comprises: when the distribution training strategy is the first distribution training strategy, determining the target loss function of the preset enhancement network according to the first loss function, the second loss function, and the third loss function, and iteratively performing noise removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset condition, to obtain a noise removal network, wherein the second loss function is determined by the noise removal loss function; and determining the target loss function of the noise removal network according to the first loss function, the second loss function, and the third loss function, and iteratively performing reverberation removal training on the noise removal network according to the target loss function until the noise removal network meets the preset condition, wherein the second loss function is determined by the reverberation removal loss function.
- The method according to claim 8, wherein the distribution training strategy comprises a second distribution training strategy, and the constructing and step-by-step training based on the distribution training strategy comprises: when the distribution training strategy is the second distribution training strategy, determining the target loss function of the preset enhancement network according to the first loss function, the second loss function, and the third loss function, and iteratively performing reverberation removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset condition, to obtain a reverberation removal network, wherein the second loss function is determined by the reverberation removal loss function; and determining the target loss function of the reverberation removal network according to the first loss function, the second loss function, and the third loss function, and iteratively performing noise removal training on the reverberation removal network according to the target loss function until the reverberation removal network meets the preset condition, wherein the second loss function is determined by the noise removal loss function.
- The method according to claim 2, wherein the obtaining a training sample set comprises: obtaining first sample speech, the first sample speech being speech containing noise and reverberation collected based on a microphone; performing speech feature extraction on the first sample speech to obtain the noise speech features; obtaining second sample speech, the second sample speech comprising clean speech without noise but with reverberation and clean speech without noise and without reverberation; performing speech feature extraction on the second sample speech to obtain a first clean speech label and a second clean speech label; and determining the deep clustering annotations according to the first sample speech and the second sample speech.
- The method according to any one of claims 3 to 11, wherein the preset condition comprises one of the following: a total loss value of the target loss function is less than a preset value; the total loss value of the target loss function no longer changes; or the number of training iterations reaches a preset number.
- The method according to claim 1, wherein the speech enhancement model comprises a hidden layer, a deep clustering layer, a speech mask inference layer, and a noise mask inference layer; the inputting the initial speech features into a pre-trained speech enhancement model to obtain target speech features output by the speech enhancement model comprises: inputting the initial speech features into the hidden layer, and generating intermediate features through the hidden layer; and inputting the intermediate features into the speech mask inference layer, generating clean speech features through the speech mask inference layer, and using the clean speech features as the target speech features; and the calculating, according to the target speech features, target speech with noise and reverberation removed comprises: performing a feature inverse transformation on the target speech features to calculate the target speech with noise and reverberation removed.
- A speech processing device, the device comprising: an acquisition module, configured to obtain initial speech features of call speech; an enhancement module, configured to input the initial speech features into a pre-trained speech enhancement model to obtain target speech features output by the speech enhancement model, the speech enhancement model being obtained by step-by-step training performed based on a deep clustering loss function and a mask inference loss function; and a calculation module, configured to calculate, according to the target speech features, target speech with noise and reverberation removed.
- A computer-readable storage medium, wherein program code is stored in the computer-readable storage medium, and the program code can be called by a processor to execute the method according to any one of claims 1 to 13.
- A computer device, comprising: a memory; one or more processors coupled to the memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to execute the method according to any one of claims 1 to 13.
- A computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a storage medium, wherein a processor of a computer device reads the computer instructions from the storage medium, and the processor executes the computer instructions so that the computer executes the method according to any one of claims 1 to 13.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP23802536.5A EP4404186A1 (en) | 2022-05-07 | 2023-03-31 | Speech processing method and apparatus, and storage medium, computer device and program product |
US18/658,964 US20240290338A1 (en) | 2022-05-07 | 2024-05-08 | Speech processing |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210495197.5 | 2022-05-07 | ||
CN202210495197.5A CN117059068A (zh) | 2022-05-07 | Speech processing method, apparatus, storage medium, and computer device
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/658,964 Continuation US20240290338A1 (en) | 2022-05-07 | 2024-05-08 | Speech processing |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023216760A1 (zh) | 2023-11-16 |
Family
ID=88667966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/085321 WO2023216760A1 (zh) | 2022-05-07 | 2023-03-31 | 语音处理方法、装置、存储介质、计算机设备及程序产品 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240290338A1 (zh) |
EP (1) | EP4404186A1 (zh) |
CN (1) | CN117059068A (zh) |
WO (1) | WO2023216760A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117594056A (zh) * | 2024-01-18 | 2024-02-23 | 深圳市龙芯威半导体科技有限公司 | SIFT-based RNN speech noise reduction and dereverberation method and system |
- 2022-05-07: CN application CN202210495197.5A filed (publication CN117059068A, active, Pending)
- 2023-03-31: EP application EP23802536.5A filed (publication EP4404186A1, active, Pending)
- 2023-03-31: WO application PCT/CN2023/085321 filed (publication WO2023216760A1, active, Application Filing)
- 2024-05-08: US application US18/658,964 filed (publication US20240290338A1, active, Pending)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110176243A (zh) * | 2018-08-10 | 2019-08-27 | 腾讯科技(深圳)有限公司 | Speech enhancement method, model training method, apparatus, and computer device |
US20200066296A1 (en) * | 2018-08-21 | 2020-02-27 | 2Hz, Inc | Speech Enhancement And Noise Suppression Systems And Methods |
US20210074282A1 (en) * | 2019-09-11 | 2021-03-11 | Massachusetts Institute Of Technology | Systems and methods for improving model-based speech enhancement with neural networks |
CN110600017A (zh) * | 2019-09-12 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Training method for speech processing model, speech recognition method, system, and apparatus |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117894319A (zh) * | 2024-03-14 | 2024-04-16 | 南京土星信息科技有限公司 | Few-shot voiceprint recognition model training method based on machine learning data generation |
CN117894319B (zh) * | 2024-03-14 | 2024-05-17 | 南京土星信息科技有限公司 | Few-shot voiceprint recognition model training method based on machine learning data generation |
Also Published As
Publication number | Publication date |
---|---|
CN117059068A (zh) | 2023-11-14 |
EP4404186A1 (en) | 2024-07-24 |
US20240290338A1 (en) | 2024-08-29 |
Similar Documents
Publication | Title |
---|---|
CN110379412B (zh) | Method and apparatus for speech processing, electronic device, and computer-readable storage medium |
US11894014B2 (en) | Audio-visual speech separation |
Zhao et al. | Monaural speech dereverberation using temporal convolutional networks with self attention |
WO2023216760A1 (zh) | Speech processing method, apparatus, storage medium, computer device, and program product |
US20220230651A1 (en) | Voice signal dereverberation processing method and apparatus, computer device and storage medium |
WO2022178942A1 (zh) | Emotion recognition method and apparatus, computer device, and storage medium |
CN108922525B (zh) | Speech processing method, apparatus, storage medium, and electronic device |
Xiang et al. | A parallel-data-free speech enhancement method using multi-objective learning cycle-consistent generative adversarial network |
CN111951823B (zh) | Audio processing method, apparatus, device, and medium |
CN114338623B (zh) | Audio processing method, apparatus, device, and medium |
WO2024027295A1 (zh) | Training and enhancement method and apparatus for speech enhancement model, electronic device, storage medium, and program product |
US20230186943A1 (en) | Voice activity detection method and apparatus, and storage medium |
JP7548482B2 (ja) | Voice call control method, apparatus, computer program, and electronic device |
CN114333874B (zh) | Method for processing audio signal |
US11996114B2 (en) | End-to-end time-domain multitask learning for ML-based speech enhancement |
CN116741193B (zh) | Training method and apparatus for speech enhancement network, storage medium, and computer device |
WO2024114303A1 (zh) | Phoneme recognition method and apparatus, electronic device, and storage medium |
CN113763978B (zh) | Speech signal processing method and apparatus, electronic device, and storage medium |
CN115083440A (zh) | Audio signal noise reduction method, electronic device, and storage medium |
WO2024055751A1 (zh) | Audio data processing method, apparatus, device, storage medium, and program product |
CN113571075B (zh) | Audio processing method and apparatus, electronic device, and storage medium |
Li et al. | An improved fully convolutional network based on post-processing with global variance equalization and noise-aware training for speech enhancement |
CN115798501A (zh) | Speech noise reduction method, apparatus, and electronic device |
CN117894318A (zh) | Training method and apparatus for audio processing model, storage medium, and electronic device |
CN117649848A (zh) | Speech signal processing device and method |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23802536; Country of ref document: EP; Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 2023802536; Country of ref document: EP |
ENP | Entry into the national phase | Ref document number: 2023802536; Country of ref document: EP; Effective date: 20240416 |
ENP | Entry into the national phase | Ref document number: 2024532312; Country of ref document: JP; Kind code of ref document: A |