WO2023216760A1 - Speech processing method and apparatus, storage medium, computer device and program product - Google Patents

Speech processing method and apparatus, storage medium, computer device and program product

Info

Publication number
WO2023216760A1
WO2023216760A1 (PCT/CN2023/085321; CN2023085321W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
loss function
training
target
noise
Prior art date
Application number
PCT/CN2023/085321
Other languages
English (en)
French (fr)
Inventor
黄俊
王燕南
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP23802536.5A priority Critical patent/EP4404186A1/en
Publication of WO2023216760A1 publication Critical patent/WO2023216760A1/zh
Priority to US18/658,964 priority patent/US20240290338A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present application relates to the field of speech recognition technology, and more specifically, to a speech processing method, device, storage medium, computer equipment and program product.
  • speech enhancement is speech noise reduction.
  • the speech collected by the microphone is usually "contaminated” speech with different noises.
  • the main purpose of speech enhancement is to restore the clean speech we want from this "polluted" noisy speech, thereby effectively suppressing various interference signals and enhancing the target speech signal. This not only improves speech quality, but also helps improve the performance of speech recognition.
  • the application fields of speech enhancement include video conferencing and speech recognition. It is a preprocessing module for many speech coding and recognition systems. It can usually be divided into near-field speech enhancement and far-field speech enhancement.
  • the existing speech enhancement uses a noise reduction and dereverberation solution based on a two-level network. However, the large amount of calculation of the two-level network makes it impossible for speech enhancement to meet the performance requirements of practical applications.
  • Embodiments of the present application provide a speech processing method, device, storage medium, computer equipment and program product, aiming to improve the performance of speech enhancement.
  • the embodiment of the present application provides a speech processing method.
  • the method includes: obtaining the initial speech features of the call speech; inputting the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model.
  • the speech enhancement model is obtained by step-by-step training based on the deep clustering loss function and the mask inference loss function; based on the target speech features, the target speech with noise and reverberation removed is calculated.
  • Embodiments of the present application also provide a speech processing device.
  • the device includes: an acquisition module, used to obtain the initial speech features of the call speech; an enhancement module, used to input the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model, where the speech enhancement model is trained step by step based on the deep clustering loss function and the mask inference loss function; and a calculation module, used to calculate the target speech with noise and reverberation removed based on the target speech features.
  • An embodiment of the present application also provides a computer device.
  • the computer device includes a processor and a memory.
  • the memory stores computer program instructions. When the computer program instructions are called by the processor, the above speech processing method is executed.
  • Embodiments of the present application also provide a computer-readable storage medium that stores program code, wherein the above-mentioned speech processing method is executed when the program code is run by a processor.
  • Embodiments of the present application also provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a storage medium.
  • the processor of the computer device reads the computer instructions from the storage medium, and the processor executes the computer instructions, so that the computer performs the steps in the above speech processing method.
  • the embodiment of this application uses two different loss functions to train a preset speech enhancement model step by step, guiding the model to efficiently remove noise and reverberation in the speech features, so that the noise reduction task and the de-reverberation task can each achieve the optimal training effect in a separate training process. This helps to improve the speech enhancement model's ability to perform noise reduction and de-reverberation, and improves the performance of speech enhancement while reducing the model's computing resources.
  • Figure 1 shows a schematic diagram of a common noise reduction and dereverberation method provided by an embodiment of the present application.
  • Figure 2 shows a schematic architectural diagram of a speech processing system provided by an embodiment of the present application.
  • Figure 3 shows a schematic flowchart of a speech processing method provided by an embodiment of the present application.
  • Figure 4 shows a schematic diagram of an application scenario of a voice processing method provided by an embodiment of the present application.
  • Figure 5 shows a schematic architectural diagram of a speech enhancement model provided by an embodiment of the present application.
  • FIG. 6 shows a schematic flowchart of another speech processing method provided by an embodiment of the present application.
  • Figure 7 shows a schematic flowchart of speech feature extraction provided by an embodiment of the present application.
  • Figure 8 shows a schematic architectural diagram of a preset enhanced network provided by an embodiment of the present application.
  • Figure 9 shows a module block diagram of a speech processing device provided by an embodiment of the present application.
  • Figure 10 is a module block diagram of a computer device provided by an embodiment of the present application.
  • Figure 11 is a module block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the client's near-end call is only suitable for single-person or short-distance calls with a small number of people, and the audio and video experience is average.
  • the microphone array is divided into different subsets.
  • Each subset passes through the first-level speech enhancement network to obtain the enhanced speech of each microphone.
  • the enhanced speech is integrated together and then passes through the second-level speech enhancement network to obtain the final output.
  • this speech enhancement solution based on a two-level network requires a large amount of calculation during the training process, which does not suit the performance requirements of actual product applications, while reducing the number of network parameters to lower the amount of calculation degrades the network and worsens the speech enhancement effect.
  • This method can obtain the initial speech features of the call speech and input the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model, where the speech enhancement model is obtained by step-by-step training based on the deep clustering loss function and the mask inference loss function, thereby fusing the two models (two-level networks) into the same model to reduce the computational cost of the model training process.
  • then, based on the target speech features, the target speech with noise and reverberation removed is calculated.
  • the preset speech enhancement model is trained through different loss functions to guide the model to efficiently remove noise and reverberation from the initial speech features, which improves the performance of speech enhancement while reducing model computing resources.
  • FIG. 2 shows a schematic architectural diagram of a speech processing system.
  • the voice processing system 300 is applied in a remote video conferencing scenario.
  • the voice processing system 300 may include a near-end client 310 , a far-end client 330 and a server 350 .
  • the near-end client 310, the remote client 330 and the server 350 communicate through the network.
  • the near-end client 310 and the remote client 330 can be large-screen terminals used for video.
  • the server side 350 can be a cloud server.
  • the remote client 330 can collect the initial speech with noise and reverberation from the participants, and transmit the initial speech to the server 350. After receiving the initial speech, the server 350 can use the pre-trained speech enhancement model to perform noise reduction and de-reverberation on the initial speech to obtain enhanced clean speech (target speech), and transmit the clean speech to the near-end client 310.
  • the speech enhancement model can also be configured on the near-end client 310 or the remote client 330 according to the needs of the actual application scenario.
  • voice processing system 300 is only an example.
  • the architecture and application scenarios of the voice processing system described in the embodiments of the present application are intended to illustrate the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application.
  • Those of ordinary skill in the art will know that with the evolution of speech processing system architecture and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
  • Figure 3 shows a schematic flow chart of a speech processing method provided by an embodiment of the present application.
  • the voice processing method is applied to the voice processing device 500 shown in Figure 9 and the computer device 600 ( Figure 10) configured with the voice processing device 500.
  • the following will take computer equipment as an example to illustrate the specific process of the embodiment of the present application.
  • the computer equipment applied in the embodiment of the present application can be a server or a terminal, etc.
  • the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, blockchain, big data and artificial intelligence platforms.
  • the terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smart watch, etc., but is not limited to this.
  • FIG. 4 shows a schematic diagram of an application scenario of a voice processing method provided by an embodiment of the present application.
  • voice processing The method can be applied to a specific speech enhancement system.
  • the speech enhancement model 411 of the speech enhancement system can be deployed in the cloud server 410.
  • the cloud server 410 can be communicatively connected to the conference terminals of the two venues (the first conference terminal 430 and the second conference terminal 450), where the first conference terminal 430 and the second conference terminal 450 can collect the voices of the participants at their respective venues (i.e., the original call voices) and upload the collected voices to the cloud server 410. The cloud server 410 completes the voice enhancement of the voice to obtain clean voice.
  • the cloud server 410 transmits the clean voice to the corresponding conference terminal for playback.
  • the speech processing method may specifically include the following steps:
  • Step S110 Obtain the initial voice characteristics of the call voice.
  • the computer device can obtain the initial voice characteristics of the call voice that requires voice enhancement.
  • the initial speech features are acoustic features obtained by converting the call speech, such as the logarithmic power spectrum (Logarithmic Power Spectrum, LPS) and Mel-Frequency Cepstral Coefficients (MFCC), which are not limited here.
  • speech data often cannot be directly input into the model for training in the way image data can; since it does not show obvious feature changes over long time spans, it is difficult to learn the characteristics of speech data.
  • the time domain data of speech is usually sampled at a 16K sampling rate, that is, 16,000 sampling points per second. Directly inputting time domain sampling points would lead to an excessive amount of training data and make it difficult to train a model with practical effect. Therefore, in speech processing related tasks, speech data is usually converted into acoustic features as the input or output of the model.
  • the call voice can be framed and windowed to obtain initial voice features.
  • the call speech collected by all microphones is framed and windowed in sequence to obtain the speech signal frames of the call speech, and each speech signal frame is subjected to a Fast Fourier Transform (FFT) to obtain the discrete power spectrum; the logarithm of the discrete power spectrum is then calculated, and the resulting logarithmic power spectrum is used as the initial speech feature.
  • the call speech can be converted from a non-stationary time-varying signal in the time domain space into a stationary signal in the frequency domain space, which facilitates model training.
  • the purpose of framing the speech signal is to divide several speech sampling points into one frame; within this frame, the characteristics of the speech signal can be regarded as stable. Generally, the length of a frame should be short enough to ensure that the signal within the frame is stationary, so the length of a frame should be less than the length of a phoneme, and the duration of a phoneme at normal speaking speed is about 50 ms. In addition, to perform Fourier analysis, one frame must contain enough vibration periods: the male voice is around 100 Hz and the female voice is around 200 Hz, corresponding to periods of 10 ms and 5 ms. Therefore, the length of a speech frame is generally 10 to 40 ms.
  • after windowing, each frame will exhibit the characteristics of a periodic function.
  • the window functions that can be used are: rectangular window, Hamming window, Hanning window, etc.
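  • As an illustrative sketch (not from the patent text), the framing, windowing and FFT steps described above can be written as follows in Python with NumPy; the 16 kHz sampling rate matches the text, while the 25 ms frame length, 10 ms hop and the Hamming window are assumed values for illustration.

```python
import numpy as np

def log_power_spectrum(signal, sample_rate=16000, frame_ms=25, hop_ms=10, eps=1e-10):
    """Frame the waveform, apply a Hamming window, FFT each frame, and
    return the logarithmic power spectrum (num_frames x num_bins)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    window = np.hamming(frame_len)

    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len]
                       for i in range(num_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)   # frequency-domain representation
    power = np.abs(spectrum) ** 2                     # discrete power spectrum
    return np.log(power + eps)                        # logarithmic power spectrum (LPS)

# Example: one second of speech at a 16 kHz sampling rate (16,000 samples).
lps = log_power_spectrum(np.random.randn(16000))
print(lps.shape)  # (98, 201)
```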
  • the speech processing method provided by the embodiment of the present application can be used to perform speech enhancement processing on the speech of the participants, so as to remove noise and reverberation in the sound.
  • the second conference terminal 450 collects the voice of the participant 420 in the conference venue through the microphone, that is, the call voice, and sends the call voice to the cloud server 410 through the network; after receiving the call voice, the cloud server 410 performs framing, windowing and Fourier transform on the call speech to obtain the initial speech features.
  • Step S120 Input the initial speech features into the pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model.
  • the call speech collected by the microphone array will contain both noise and reverberation.
  • for the two-level network used to denoise and de-reverberate the call speech, the parameter amount of the two networks during training is large and requires a large amount of computing resources, while reducing the number of parameters of each network will also reduce the performance of the model in noise reduction and dereverberation.
  • the two-level networks can therefore be fused into the same network. Compared with the parameter amounts of the two networks, the number of parameters of the fused model is reduced, which can greatly reduce the calculation amount of the training process and also improve the model's speech enhancement performance.
  • the speech enhancement model can generate target speech features corresponding to the call speech based on the input initial speech features, that is, clean speech features with noise and reverberation removed after speech enhancement.
  • Figure 5 shows a schematic architectural diagram of a speech enhancement model.
  • the speech enhancement model may include multiple hidden layers, deep clustering layers, speech mask inference layers, and noise mask inference layers.
  • the deep clustering layer, speech mask inference layer and noise mask inference layer can be linear layers, and the inputs of the three are uniformly from the output of the hidden layer.
  • the hidden layer can calculate intermediate features based on the input initial speech features, which are the intermediate values of the speech enhancement process.
  • the deep clustering layer can be implemented through normalization (Normalization) and a tangent function (denoted as tanh). The output of the hidden layer is first normalized to limit it to a certain range, such as [0, 1] or [-1, 1], to facilitate subsequent processing, and the tangent function value of the normalized result is then calculated as the output of the deep clustering layer.
  • both the speech mask inference layer and the noise mask inference layer can be implemented through the softmax function.
  • the speech mask inference layer can perform mask inference (MI) based on the intermediate features to obtain the target speech features that remove noise and reverberation.
  • the noise mask inference layer can perform mask inference based on the intermediate features to obtain the noisy speech features.
  • the deep clustering layer can assist the speech mask inference layer and the noise mask inference layer in noise reduction and dereverberation by performing deep clustering (DC) on the acquired intermediate features.
  • the hidden layer can be a long short-term memory network (Long Short-Term Memory, LSTM) or a variant, such as a bi-directional long short-term memory network (Bi-directional Long-Short Term Memory, Bi-LSTM), because speech features have the short-term stationarity of a time series, which matches the long- and short-term memory capabilities of LSTM.
  • the hidden layer can also be another network with memory properties, such as a Gated Recurrent Unit (GRU).
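  • The bottom-shared, multi-head structure described above (multi-layer LSTM hidden layers feeding a deep clustering head and two mask inference heads) could look roughly like the following PyTorch sketch. The layer sizes, the embedding dimension and the choice of a softmax over the two mask heads (so the speech and noise masks sum to one per time-frequency point) are illustrative assumptions rather than values specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEnhanceNet(nn.Module):
    """Shared LSTM trunk with three heads: deep clustering (DC),
    speech mask inference (Clean-MI) and noise mask inference (Noise-MI)."""
    def __init__(self, num_bins=257, hidden=600, num_layers=3, emb_dim=20):
        super().__init__()
        self.lstm = nn.LSTM(num_bins, hidden, num_layers, batch_first=True)
        self.dc_head = nn.Linear(hidden, num_bins * emb_dim)   # deep clustering layer
        self.clean_head = nn.Linear(hidden, num_bins)          # speech mask inference layer
        self.noise_head = nn.Linear(hidden, num_bins)          # noise mask inference layer
        self.emb_dim = emb_dim

    def forward(self, lps):                  # lps: (batch, frames, num_bins)
        mid, _ = self.lstm(lps)              # shared intermediate features
        # DC head: normalize the linear output, then take tanh (the exact
        # placement of the normalization is an assumption).
        emb = torch.tanh(F.normalize(self.dc_head(mid), dim=-1))
        emb = emb.view(lps.size(0), lps.size(1), -1, self.emb_dim)
        # MI heads: softmax across the two heads, one reading of the
        # softmax-based mask inference layers.
        masks = torch.softmax(torch.stack([self.clean_head(mid),
                                           self.noise_head(mid)], dim=-1), dim=-1)
        return emb, masks[..., 0], masks[..., 1]   # embedding, speech mask, noise mask
```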
  • the model can be trained through the deep clustering loss function corresponding to the deep clustering layer, and the mask inference loss function corresponding to the speech mask inference layer and the noise mask inference layer.
  • Step-by-step training works as follows. For example, in the first step, the denoising model can be trained based on the deep clustering loss function and the mask inference loss function; when the denoising model converges, the training is stopped, where the mask inference loss function corresponding to the speech mask inference layer uses clean speech labels without noise but with reverberation.
  • In the second step, the dereverberation model is trained: the denoising model trained in the first step is used as the dereverberation model, and the dereverberation model is trained for several iterations based on the deep clustering loss function and the mask inference loss function; when the dereverberation model converges, the training is stopped, where the mask inference loss function corresponding to the speech mask inference layer uses clean speech labels without noise and without reverberation. Therefore, the final dereverberation model, that is, the speech enhancement model, has the ability to perform noise reduction and dereverberation at the same time.
  • the deep clustering layer of the speech enhancement model uses a binary loss based on time-frequency point clustering. Due to the regularization characteristics of the deep clustering loss, it is difficult in the training process of related technologies to guide the speech mask inference layer and the noise mask inference layer to effectively remove the noise and reverberation in the speech, which makes it difficult to effectively improve the model's speech enhancement performance.
  • the step-by-step training scheme of the embodiment of the present application allows the noise reduction task and the de-reverberation task to each achieve the optimal training effect in a separate training process, thereby helping to improve the speech enhancement model's ability to perform noise reduction and de-reverberation.
  • the speech enhancement model obtained through the above training can obtain intermediate features through multi-layer LSTM.
  • the speech mask inference layer can perform mask inference based on the intermediate features and calculate the speech mask, that is, the target speech feature.
  • the initial voice features can be input into the voice enhancement model 411; the voice mask inference layer of the voice enhancement model 411 can perform mask inference based on the intermediate features obtained through the multi-layer LSTM, and thus calculate the speech mask, that is, the target speech features. In this way, the calculation amount of the speech enhancement process can be effectively reduced.
  • Step S130 Calculate the target speech without noise and reverberation based on the characteristics of the target speech.
  • inverse feature transformation can be performed on the acquired target speech features to calculate the target speech with noise and reverberation removed.
  • the cloud server 410 can convert the target speech features, that is, the clean speech features, into the target speech through an inverse Fourier transform (IFT), thereby obtaining clean speech with noise and reverberation removed.
  • the cloud server 410 can send the clean speech to the first conference terminal 430, and the speaker of the first conference terminal 430 plays the voice of the participant 420 with noise and reverberation removed.
  • the initial voice features of the call voice can be obtained, and the initial voice features can be input into a pre-trained voice enhancement model to obtain the target voice features output by the voice enhancement model.
  • the voice enhancement model is obtained through step-by-step training based on the deep clustering loss function and the mask inference loss function; according to the target speech features, the target speech with noise and reverberation removed is calculated.
  • the pre-set speech enhancement model is trained through different loss functions to guide the model to efficiently remove noise and reverberation from the initial speech features, thereby improving the performance of speech enhancement while reducing model computing resources.
  • the speech processing device will be specifically integrated in a computer device as an example for description.
  • Figure 6 shows another voice processing method provided by an embodiment of the present application.
  • the voice processing method is applied to the preset enhancement network shown in Figure 8.
  • the process shown in Figure 5 will be described in detail below.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a manner similar to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the solutions provided by the embodiments of this application involve artificial intelligence speech technology (Speech Technology) and other technologies.
  • speech technology includes automatic speech recognition technology (Automatic Speech Recognition, ASR), speech synthesis technology (Text To Speech, TTS) and voiceprint recognition technology (Voiceprint Recognition, VPR).
  • the speech processing method may specifically include the following steps:
  • Step S210 The computer device obtains a training sample set.
  • the speech processing method provided in the embodiment of the present application includes the training of a preset enhancement network. It is worth mentioning that the training of the preset enhancement network can be performed in advance based on the acquired training sample set; afterwards, every time speech enhancement is needed, the trained speech enhancement model can be used to calculate the target speech features with noise and reverberation removed, without having to train the preset enhancement network again each time speech enhancement is performed.
  • the wsj0-2mix (Wall Street Journal) data set can be used to determine the training sample set.
  • the wsj0-2mix data set contains a 30-hour speech training set and a 10-hour speech validation set.
  • the speech of different speakers is randomly selected from the corresponding set and mixed with a random relative signal-to-noise ratio (Signal to Noise Ratio, SNR) between 0 dB and 10 dB to generate the mixed noisy speech used for network training.
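  • A minimal sketch of how such a training mixture could be generated, assuming time-aligned clean speech and noise waveforms of equal length; the scaling formula follows the standard SNR definition and is not quoted from the patent.

```python
import numpy as np

def mix_at_random_snr(speech, noise, low_db=0.0, high_db=10.0, rng=None):
    """Scale the noise so that the speech-to-noise ratio is a random value in
    [low_db, high_db] dB, then add it to the speech to build a noisy mixture."""
    rng = np.random.default_rng() if rng is None else rng
    snr_db = rng.uniform(low_db, high_db)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, snr_db
```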
  • the step of obtaining a training sample set by the computer device may include:
  • the computer device acquires the first sample speech.
  • the computer device extracts speech features from the first sample speech to obtain noise speech features.
  • the computer device acquires the second sample speech.
  • the computer device performs speech feature extraction on the second sample speech to obtain the first clean speech label and the second clean speech label.
  • the computer device determines the deep cluster annotation based on the first sample speech and the second sample speech.
  • the first sample speech is speech containing noise and reverberation collected based on the microphone.
  • the second sample speech includes a clean speech without noise but with reverberation, and a clean speech without noise and without reverberation.
  • Deep clustering annotation is the ratio of the features of the first sample speech and the second sample speech at each time-frequency point.
  • the computer device can directly collect the call speech containing noise and reverberation through the microphone.
  • the speech of the participants collected through the microphone of the large-screen conference terminal is used as the first sample speech.
  • technicians can directly obtain the first sample speech from the already constructed noise reduction training corpus.
  • the computer device can perform voice feature extraction on the acquired first sample voice.
  • Figure 7 shows a schematic flow chart of voice feature extraction.
  • the microphone collects the call voice containing noise and reverberation, that is, the first sample voice, and multi-frame speech signals are obtained through framing and windowing, where the frame index i satisfies 0 < i ≤ n and i ∈ N*, n is the total number of frames, t represents the time domain space, and N* represents the set of positive integers. The computer device can then perform an FFT on each frame of the speech signal, converting each frame from the time domain space to the frequency domain space to obtain the corresponding discrete power spectrum, and calculate the logarithm of the discrete power spectrum to obtain the logarithmic power spectrum.
  • the noisy speech label can be marked according to the noisy speech features, that is, according to the FFT results of the speech signal from frame 1 to frame n.
  • Computer equipment can also obtain clean speech as a reference from the noise reduction training corpus, and use the clean speech as the second sample speech.
  • clean speech without noise but with reverberation and clean speech without noise and without reverberation can be obtained; speech feature extraction is then performed on the clean speech without noise but with reverberation to obtain the first clean speech label, and on the clean speech without noise and without reverberation to obtain the second clean speech label.
  • the mathematical expression of the noisy speech label, the first clean speech label and the second clean speech label is a feature vector (Embedding), also called an embedding vector, where the length of the feature vector is the dimension of the feature.
  • the computer device can determine the deep clustering annotation by comparing the speech energy of the first sample speech and the second sample speech at each time-frequency point. Since the speech signal changes with time, its energy also changes with time; therefore, when calculating the energy of the digitized speech signal, the overall energy is not calculated, but the energy at each time-frequency point is calculated frame by frame.
  • the computer device can use the energy ratio of the speech without noise but with reverberation to the noisy speech as the deep clustering annotation; the energy ratio of the speech without noise and without reverberation to the noisy speech can also be used as the deep clustering annotation. The deep clustering annotation is used for the calculation of the deep clustering loss function.
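  • A minimal sketch of the energy-ratio annotation described above, assuming the reference (clean) speech and the noisy speech have already been converted to complex spectrograms of the same shape; the per-bin ratio form is one plausible reading of the patent text.

```python
import numpy as np

def deep_cluster_annotation(clean_spec, noisy_spec, eps=1e-10):
    """Energy ratio of the reference speech to the noisy speech at each
    time-frequency point, used as the deep clustering annotation.
    Both inputs are complex STFTs of shape (frames, bins)."""
    clean_energy = np.abs(clean_spec) ** 2
    noisy_energy = np.abs(noisy_spec) ** 2
    return clean_energy / (noisy_energy + eps)
```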
  • Step S220 The computer device obtains a preset enhanced network.
  • the preset enhanced network includes hidden layers, deep clustering (Deep Clustering) layers and mask inference layers.
  • the preset enhancement network is a network with bottom-layer weight sharing and multi-head output.
  • the deep clustering layer can assist the speech mask inference layer and the noise mask inference layer in performing mask inference, so that the speech mask inference layer and the noise mask inference layer can effectively distinguish noise and reverberation in speech during the network training process.
  • the hidden layer can use LSTM or Bi-LSTM.
  • the hidden layer shown in Figure 8 is LSTM.
  • the mask inference layer includes a speech mask inference layer (Clean-MI) and a noise mask inference layer (Noise-MI).
  • the speech mask inference layer can calculate the mask of speech, that is, the clean speech label, and the noise mask inference layer can calculate the mask of noise and reverberation, that is, the noisy speech label. It should be noted that during the application process, only the mask output by the speech mask inference layer is used to restore the speech. Therefore, the calculation amount of the speech enhancement process is not increased, thereby improving the speech enhancement efficiency.
  • Step S230 The computer device performs noise removal training and reverberation removal training step by step on the preset enhancement network through the training sample set until the preset enhancement network meets the preset conditions, and obtains the trained target enhancement network as a speech enhancement model.
  • the target enhancement network obtained after completing the training, that is, the speech enhancement model, needs to perform the two enhancement tasks of noise reduction and dereverberation at the same time. If these two enhancement tasks are trained at the same time, the training of the preset enhancement network cannot achieve the optimal training effect. For this purpose, step-by-step training can be adopted, and the training processes of the two tasks can be carried out separately.
  • embodiments of the present application provide two step-by-step training methods. For example, noise removal training can be performed first, and then reverberation removal training can be performed, or reverberation removal training can be performed first, and then noise removal training can be performed.
  • the purpose of noise removal training is to equip the network with the ability to reduce noise, and the purpose of reverberation removal training is to equip the network with the ability to dereverberate, so that both enhancement tasks can achieve the optimal training effect in separate training processes, thereby improving the speech enhancement performance of the speech enhancement model.
  • the computer device performs noise removal training and reverberation removal training on the preset enhancement network step by step through the training sample set until the preset enhancement network meets the preset conditions.
  • the steps may include:
  • the computer equipment inputs the noisy speech features into the hidden layer, and generates intermediate training features through the hidden layer.
  • the computer device inputs the intermediate training features into the deep clustering layer, and generates cluster training annotations through the deep clustering layer.
  • the computer device inputs the intermediate training features into the speech mask inference layer, and generates clean speech training features through the speech mask inference layer.
  • the computer device inputs the intermediate training features into the noise mask inference layer, and generates the noise speech training features through the noise mask inference layer.
  • the computer equipment constructs a target loss function based on the clean speech labels, noisy speech labels, deep clustering annotations, clean speech training features, noisy speech training features and cluster training annotations, and performs noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions.
  • the intermediate training features are the intermediate values generated by the hidden layer of the preset enhancement network, which can be input as shared values to the deep clustering layer, the speech mask inference layer and the noise mask inference layer respectively, so as to achieve underlying weight sharing and reduce the number of network parameters.
  • the speech mask inference layer and the noise mask inference layer can respectively generate the clean speech training features y_clean and the noisy speech training features y_noise based on the intermediate training features.
  • the deep clustering layer can generate the cluster training annotations y_dc based on the intermediate training features.
  • the step in which the computer device constructs a target loss function based on the clean speech labels, noisy speech labels, deep clustering annotations, clean speech training features, noisy speech training features and cluster training annotations, and performs noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions, may include:
  • the computer device determines the first loss function based on the cluster training annotation and the deep cluster annotation.
  • the first loss function is the deep clustering loss function.
  • in the first loss function Loss_dc, y_dc is the cluster training annotation and the deep clustering annotation serves as its label.
  • the computer device determines the second loss function based on the clean speech training features and clean speech labels.
  • two different second loss functions can be determined based on different clean speech labels.
  • the computer device may determine the noise removal loss function based on the clean speech training features y_clean and the first clean speech label, and use the noise removal loss function as the second loss function.
  • alternatively, the computer device may determine the reverberation removal loss function based on the clean speech training features y_clean and the second clean speech label, and use the reverberation removal loss function as the second loss function.
  • the computer device determines the third loss function based on the noise speech training characteristics and the noise speech label.
  • in the third loss function, y_noise is the noisy speech training feature and the noisy speech label serves as its label. The second loss function Loss_clean and the third loss function Loss_noise are the mask inference loss functions.
  • the computer equipment constructs the target loss function of the preset enhancement network based on the first loss function, the second loss function and the third loss function, and performs noise removal training and reverberation removal training on the preset enhancement network step by step based on the target loss function until the preset enhancement network meets the preset conditions.
  • the computer device can construct the target loss function Loss of the preset enhancement network based on the first loss function Loss_dc, the second loss function Loss_clean and the third loss function Loss_noise.
  • specifically, each of the above three loss functions can be given a corresponding weight parameter, and the target loss function is the weighted sum of the three loss functions, as follows: Loss = α * Loss_dc + β * Loss_clean + γ * Loss_noise, where α, β and γ are the weight parameters.
  • noise removal training and reverberation removal training are then performed on the preset enhancement network step by step according to the target loss function Loss until the preset enhancement network meets the preset conditions.
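  • A minimal sketch of the weighted target loss, reusing the outputs of the hypothetical model sketch above and assuming mean-squared-error terms for Loss_dc, Loss_clean and Loss_noise; the MSE form and the default weight values are illustrative assumptions, since the patent only specifies the weighted-sum structure here.

```python
import torch.nn.functional as F

def target_loss(emb, speech_mask, noise_mask,
                dc_label, clean_label, noise_label,
                alpha=0.25, beta=0.5, gamma=0.25):
    """Loss = alpha * Loss_dc + beta * Loss_clean + gamma * Loss_noise."""
    loss_dc = F.mse_loss(emb, dc_label)                 # first loss (deep clustering)
    loss_clean = F.mse_loss(speech_mask, clean_label)   # second loss (speech MI)
    loss_noise = F.mse_loss(noise_mask, noise_label)    # third loss (noise MI)
    return alpha * loss_dc + beta * loss_clean + gamma * loss_noise
```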
  • the preset enhancement network can thus be trained based on multi-task learning (Multi-Task Learning), so that the deep clustering loss function and the mask inference loss function are combined to learn the two enhancement tasks of noise reduction and dereverberation simultaneously; the two tasks can share the information they learn through shared parameters during the learning process, which enables the trained target enhancement network to achieve better generalization effects.
  • noise refers to "unwanted sounds" in certain situations, such as human noise and various sudden sounds.
  • Reverberation refers to the phenomenon of sound continuation that still exists after the indoor sound source stops emitting sound.
  • for example, the noise in the sound collected by the conference terminal is mainly removed, while in a professional recording venue the reverberation in the sound collected by the recording equipment is mainly removed. Therefore, different step-by-step training methods can be adopted according to the actual scenario in which the final speech enhancement model is used.
  • the application scenario attributes can be obtained based on the actual scenario in which the final speech enhancement model is used, and the corresponding step-by-step training strategy can be determined based on the application scenario attributes. The target loss function of the preset enhancement network is then constructed, and noise removal training and reverberation removal training are performed on the preset enhancement network step by step based on the target loss function until the preset enhancement network meets the preset conditions.
  • the application scenario attributes are used to characterize the actual scenarios in which the speech enhancement model is applied, for example, focusing on noise reduction scene attributes or focusing on dereverberation scene attributes.
  • the step-by-step training strategy includes a first step-by-step training strategy and a second step-by-step training strategy.
  • the first step-by-step training strategy is used for scenarios focusing on noise reduction: noise removal training is performed first, and then reverberation removal training is performed.
  • the second step-by-step training strategy is used for scenarios focusing on dereverberation: reverberation removal training is performed first, and then noise removal training is performed.
  • the conference terminal collects not only the voice of the speaker, but also the voices of other speakers, so it is necessary to perform noise reduction processing on the voice collected by the conference terminal. In this case, noise removal training can be performed first, and then reverberation removal training can be performed.
  • the computer device can, based on the first step-by-step training strategy, determine a target loss function of the preset enhancement network from the first loss function, the second loss function and the third loss function, where the second loss function is determined by the noise removal loss function, and then iteratively perform noise removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, thereby obtaining a noise removal network that only plays a role in noise reduction.
  • the computer device can then determine the target loss function of the noise removal network based on the first loss function, the second loss function and the third loss function, where the second loss function is determined by the reverberation removal loss function, and iteratively perform reverberation removal training on the noise removal network according to the target loss function until the noise removal network meets the preset conditions. In this way, performing separate noise removal training first can prevent the training process from being interfered with by reverberation factors, so that the generated target enhancement network has better noise reduction performance.
  • alternatively, reverberation removal training can be performed first, and then noise removal training can be performed.
  • the computer device may, based on the second step-by-step training strategy, determine a target loss function of the preset enhancement network from the first loss function, the second loss function and the third loss function, where the second loss function is determined by the reverberation removal loss function, and then iteratively perform reverberation removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, thereby obtaining a reverberation removal network that only plays a role in dereverberation.
  • the computer device can then determine the target loss function of the reverberation removal network based on the first loss function, the second loss function and the third loss function, where the second loss function is determined by the noise removal loss function, and iteratively perform noise removal training on the reverberation removal network according to the target loss function until the reverberation removal network meets the preset conditions. In this way, performing separate reverberation removal training first can prevent the training process from being interfered with by noise factors, so that the generated target enhancement network has better dereverberation performance.
  • for example, the preset enhancement network can first be trained for noise removal and then for reverberation removal, so as to learn the ability to remove reverberation on the basis of an excellent noise reduction network. In this way, the optimal training effect can be achieved in both training processes, thereby improving the speech enhancement performance of the speech enhancement model.
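  • A minimal sketch of this two-stage strategy, reusing the hypothetical SpeechEnhanceNet and target_loss above; the optimizer, learning rate, epoch counts and loader names (denoise_loader supervised with the first clean speech label, dereverb_loader supervised with the second clean speech label) are assumptions for illustration.

```python
import torch

def train_stage(model, loader, loss_fn, epochs, lr=1e-3):
    """Run one training stage; here the stage simply runs a fixed number of
    epochs, though a convergence check on the loss could be used instead."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for lps, dc_label, clean_label, noise_label in loader:
            emb, speech_mask, noise_mask = model(lps)
            loss = loss_fn(emb, speech_mask, noise_mask,
                           dc_label, clean_label, noise_label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Step 1: noise removal training, labels built from speech without noise
# but with reverberation (first clean speech label).
# model = train_stage(model, denoise_loader, target_loss, epochs=50)
# Step 2: reverberation removal training on the converged denoising model,
# labels built from speech without noise and without reverberation
# (second clean speech label).
# model = train_stage(model, dereverb_loader, target_loss, epochs=50)
```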
  • the preset conditions may be: the total loss value of the target loss function is less than the preset value, the total loss value of the target loss function no longer changes, or the number of training times reaches the preset number, etc.
  • an optimizer can be used to optimize the target loss function, and the learning rate, batch size during training, and training period (epoch) can be set based on experimental experience.
  • each training period includes multiple training iterations; as the parameters of the network to be trained are continuously optimized, the above total loss value becomes smaller and smaller, and finally settles around a fixed value or falls below the above preset value. At this point, it can be determined that the network to be trained has converged; of course, it can also be determined that the preset enhancement network / noise removal network / reverberation removal network has converged after the number of training iterations reaches the preset number.
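  • The preset conditions listed above could be checked with a helper along these lines; the thresholds and patience window are assumed values, not figures from the patent.

```python
def has_converged(loss_history, preset_value=0.01, patience=5, max_iters=100000):
    """Preset conditions: total loss below a preset value, total loss no longer
    changing, or the number of training iterations reaching a preset count."""
    if len(loss_history) >= max_iters:
        return True
    if loss_history and loss_history[-1] < preset_value:
        return True
    if len(loss_history) > patience:
        recent = loss_history[-patience:]
        return max(recent) - min(recent) < 1e-6   # loss has stopped changing
    return False
```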
  • the mask inference loss is only used during the validation process of the target enhancement network, that is, during speech enhancement model selection.
  • the output of the mask inference branch is used as the mask after speech enhancement, that is, the target speech feature.
  • Step S240 The computer device obtains the initial voice characteristics of the call voice.
  • Step S250 The computer device inputs the initial speech features into the hidden layer, and generates intermediate features through the hidden layer.
  • Step S260 The computer device inputs the intermediate features into the speech mask inference layer, generates clean speech features through the speech mask inference layer, and uses the clean speech features as target speech features.
  • the computer device can perform voice feature extraction on the call voice, including frame processing, windowing processing and Fourier transform on the call voice to obtain the initial voice features.
  • the computer device can input the initial speech features into the hidden layer of the speech enhancement model and generate the intermediate features through the hidden layer. The computer device can then input the intermediate features into the speech mask inference layer, generate the clean speech features through the speech mask inference layer, and use the clean speech features as the target speech features.
  • Step S270 The computer device performs feature inverse transformation on the target speech features, and calculates the target speech with noise and reverberation removed.
  • the computer device can perform feature inverse transformation on the target speech features, converting the target speech features (mask) in the frequency domain space into the target speech in the time domain space.
  • the inverse feature transform may be an inverse Fourier transform.
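  • A minimal sketch of this reconstruction step, assuming the target speech features are a time-frequency mask applied to the noisy complex spectrum produced by the framing/FFT sketch above; the inverse FFT plus overlap-add shown here omits window compensation for brevity.

```python
import numpy as np

def reconstruct_speech(noisy_spec, speech_mask, frame_len=400, hop_len=160):
    """Apply the predicted speech mask to the noisy complex spectrum, inverse-FFT
    each frame, and overlap-add the frames back into a time-domain waveform."""
    enhanced_spec = noisy_spec * speech_mask            # masked target speech features
    frames = np.fft.irfft(enhanced_spec, n=frame_len, axis=1)
    out = np.zeros(hop_len * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop_len: i * hop_len + frame_len] += frame   # overlap-add
    return out
```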
  • a training sample set and a preset enhancement network can be obtained, and the preset enhancement network can be step-by-step noise removal training and reverberation removal training through the training sample set until the preset enhancement network meets the preset conditions.
  • the trained target enhancement network is obtained as a speech enhancement model.
  • the initial speech features are input into the hidden layer, intermediate features are generated through the hidden layer, and the intermediate features are input into the speech mask inference layer.
  • clean speech features are generated through the speech mask inference layer and used as the target speech features; the target speech features are then inversely transformed to calculate the target speech with noise and reverberation removed. Therefore, only the target speech features output by the speech mask inference layer of the speech enhancement model need to be used to restore the speech, which avoids increasing the calculation amount of the speech enhancement process, thereby improving the efficiency of speech enhancement.
  • FIG. 9 shows a structural block diagram of a speech processing device 500 provided by an embodiment of the present application.
  • the speech processing device 500 includes: an acquisition module 510, configured to obtain the initial speech features of the call speech; an enhancement module 520, configured to input the initial speech features into a pre-trained speech enhancement model to obtain the target speech features output by the speech enhancement model, where the speech enhancement model is obtained by step-by-step training based on the deep clustering loss function and the mask inference loss function; and a calculation module 530, configured to calculate the target speech with noise and reverberation removed according to the target speech features.
  • the speech processing device 500 may also include: a sample acquisition module, a network acquisition module and a network training module.
  • the sample acquisition module is configured to obtain a training sample set, which includes noisy speech features, clean speech labels, noisy speech labels, and deep clustering annotations;
  • the network acquisition module is configured to obtain a preset enhancement network, and the preset enhancement network includes hidden layer, deep clustering layer and mask inference layer;
  • the network training module is configured to perform noise removal training and reverberation removal training on the preset enhancement network step by step through the training sample set until the preset enhancement network meets the preset conditions, so as to obtain the trained target enhancement network as the speech enhancement model.
  • the mask inference layer includes a speech mask inference layer and a noise mask inference layer.
  • the network training module may include: a hidden unit configured to input noise speech features into the hidden layer and generate intermediate training features through the hidden layer;
  • the deep clustering unit is configured to input the intermediate training features into the deep clustering layer, and generate cluster training annotations through the deep clustering layer;
  • the speech inference unit is configured to input the intermediate training features into the speech mask inference layer, and infer through the speech mask The layer generates clean speech training features;
  • the noise inference unit is configured to input the intermediate training features into the noise mask inference layer, and generate the noisy speech training features through the noise mask inference layer;
  • the network training unit is configured to construct a target loss function based on the clean speech labels, noisy speech labels, deep clustering annotations, clean speech training features, noisy speech training features and cluster training annotations, and to perform noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions.
  • the network training unit includes: a first subunit, configured to determine the first loss function based on the cluster training annotations and the deep clustering annotations; a second subunit, configured to determine the second loss function based on the clean speech training features and the clean speech labels; a third subunit, configured to determine the third loss function based on the noisy speech training features and the noisy speech labels; and a training subunit, configured to construct the target loss function of the preset enhancement network based on the first loss function, the second loss function and the third loss function, and to perform noise removal training and reverberation removal training on the preset enhancement network step by step according to the target loss function until the preset enhancement network meets the preset conditions.
  • the second subunit may be specifically configured to: determine the noise removal loss function based on the clean speech training features and the first clean speech label, and use the noise removal loss function as the second loss function, where the first clean speech label is a speech label obtained based on speech without noise but with reverberation.
  • the second subunit can also be specifically configured to: determine the reverberation removal loss function based on the clean speech training features and the second clean speech label, and use the reverberation removal loss function as the second loss function, where the second clean speech label is a speech label obtained based on speech without noise and without reverberation.
  • the training subunit may be specifically configured to: determine the target loss function of the preset enhancement network based on the first loss function, the second loss function, and the third loss function, and iteratively perform noise removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, obtaining a noise removal network, where the second loss function is determined by the noise removal loss function; then determine the target loss function of the noise removal network based on the first loss function, the reverberation removal loss function, and the third loss function, and iteratively perform reverberation removal training on the noise removal network according to the target loss function until the noise removal network meets the preset conditions, where the second loss function is determined by the reverberation removal loss function.
  • in other embodiments, the training subunit may be specifically configured to: determine the target loss function of the preset enhancement network based on the first loss function, the second loss function, and the third loss function, and iteratively perform reverberation removal training on the preset enhancement network according to the target loss function until the preset enhancement network meets the preset conditions, obtaining a reverberation removal network, where the second loss function is determined by the reverberation removal loss function; then determine the target loss function of the reverberation removal network based on the first loss function, the second loss function, and the third loss function, and iteratively perform noise removal training on the reverberation removal network according to the target loss function until the reverberation removal network meets the preset conditions, where the second loss function is determined by the noise removal loss function.
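A minimal sketch of the two step-by-step schedules described above is given below. It reuses the target_loss function from the previous sketch and assumes a PyTorch model whose forward pass returns the deep clustering embeddings and both mask-inference outputs; the fixed epoch count and the batch dictionary keys ("clean_reverb" for the first clean speech label, "clean_dry" for the second) are illustrative stand-ins for the preset conditions and the actual training sample set.

```python
import torch

def train_stage(model, loader, clean_label_key, epochs=10, lr=1e-3):
    # One training stage; clean_label_key selects which clean speech label drives the
    # second (speech-mask) loss: "clean_reverb" during noise removal training,
    # "clean_dry" during reverberation removal training.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):  # stand-in for "until the preset conditions are met"
        for batch in loader:
            dc_emb, clean_pred, noise_pred = model(batch["noisy_feat"])
            loss = target_loss(dc_emb, batch["dc_label"],
                               clean_pred, batch[clean_label_key],
                               noise_pred, batch["noise_label"])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

def stepwise_training(model, loader, order="noise_first"):
    if order == "noise_first":
        model = train_stage(model, loader, "clean_reverb")  # noise removal training first
        return train_stage(model, loader, "clean_dry")      # then reverberation removal training
    model = train_stage(model, loader, "clean_dry")          # reverberation removal training first
    return train_stage(model, loader, "clean_reverb")        # then noise removal training
```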
  • the sample acquisition module may be specifically configured to: acquire a first sample speech, which is a noisy speech collected by a microphone; perform speech feature extraction on the first sample speech to obtain noisy speech features; acquire a second sample speech, which includes clean speech without noise but with reverberation and clean speech without noise and without reverberation; perform speech feature extraction on the second sample speech to obtain the first clean speech label and the second clean speech label; and determine the deep clustering annotation based on the first sample speech and the second sample speech.
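By way of example only, the sketch below (NumPy assumed) shows one way such a sample acquisition module could derive log-power-spectrum features by framing, windowing, and an FFT, and derive a binary deep clustering annotation by comparing the clean reference and the noisy mixture per time-frequency bin. The frame length, hop size, Hamming window, and the 0.5 dominance threshold are illustrative assumptions rather than values fixed by this embodiment.

```python
import numpy as np

def log_power_spectrum(wave, frame_len=512, hop=256, eps=1e-8):
    # Frame the waveform, apply a Hamming window, FFT each frame,
    # and take the logarithm of the discrete power spectrum.
    window = np.hamming(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)
    return np.log(np.abs(spec) ** 2 + eps)          # shape: (n_frames, n_bins)

def deep_clustering_annotation(clean_wave, noisy_wave, frame_len=512, hop=256):
    # Mark each time-frequency bin as speech-dominant when the clean reference
    # energy exceeds half of the noisy-mixture energy in that bin.
    clean = np.exp(log_power_spectrum(clean_wave, frame_len, hop))
    noisy = np.exp(log_power_spectrum(noisy_wave, frame_len, hop))
    return (clean / (noisy + 1e-8) > 0.5).astype(np.float32)
```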
  • the speech enhancement model includes a hidden layer, a deep clustering layer, a speech mask inference layer and a noise mask inference layer.
  • the enhancement module 520 may be specifically configured to: input the initial speech features into the hidden layer, and generate intermediate features through the hidden layer; input the intermediate features into the speech mask inference layer, generate clean speech features through the speech mask inference layer, and use the clean speech features as the target speech features;
  • the calculation model 530 may be specifically configured to perform feature inverse transformation on the target speech features, and calculate the target speech with noise and reverberation removed.
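To make the inference path of the enhancement module 520 and the calculation model 530 concrete, the following PyTorch sketch stacks LSTM hidden layers under a deep clustering head and two mask-inference heads, and at inference keeps only the speech-mask branch before an inverse transform. The layer sizes, the joint softmax tying the clean and noise masks together, and the per-frame inverse FFT without overlap-add are simplifying assumptions made for illustration, not a definitive implementation of this embodiment.

```python
import torch
import torch.nn as nn

class SpeechEnhancementNet(nn.Module):
    # Sketch of the shared-bottom, multi-head layout: LSTM hidden layers feeding
    # a deep clustering head and two mask-inference heads (clean and noise).
    def __init__(self, n_bins=257, hidden=256, emb_dim=20):
        super().__init__()
        self.hidden = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.dc_head = nn.Linear(hidden, n_bins * emb_dim)   # deep clustering layer
        self.clean_head = nn.Linear(hidden, n_bins)          # speech mask inference layer
        self.noise_head = nn.Linear(hidden, n_bins)          # noise mask inference layer
        self.emb_dim = emb_dim

    def forward(self, feats):                  # feats: (batch, frames, n_bins)
        mid, _ = self.hidden(feats)            # intermediate features
        b, t, _ = mid.shape
        emb = torch.tanh(self.dc_head(mid)).view(b, t, -1, self.emb_dim)
        emb = nn.functional.normalize(emb, dim=-1).view(b, t * feats.shape[-1], self.emb_dim)
        # joint softmax over the two heads so the clean and noise masks sum to one per bin
        masks = torch.softmax(torch.stack([self.clean_head(mid), self.noise_head(mid)]), dim=0)
        return emb, masks[0], masks[1]         # embeddings, clean mask, noise mask

def enhance(model, mixture_mag, mixture_phase):
    # Inference path used by modules 520/530: only the clean-mask branch is needed.
    # mixture_mag / mixture_phase: (frames, n_bins) magnitude and phase of the noisy STFT.
    with torch.no_grad():
        _, clean_mask, _ = model(torch.log(mixture_mag.pow(2) + 1e-8).unsqueeze(0))
    target_spec = clean_mask.squeeze(0) * mixture_mag * torch.exp(1j * mixture_phase)
    return torch.fft.irfft(target_spec, dim=-1)  # per-frame inverse transform; overlap-add omitted
```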
  • the coupling between modules may be electrical, mechanical or other forms of coupling.
  • each functional module in each embodiment of the present application can be integrated into one processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software function modules.
  • the solution provided by this application can obtain the initial speech features of the call speech, input the initial speech features into the pre-trained speech enhancement model, and obtain the target speech features output by the speech enhancement model.
  • the speech enhancement model is obtained through step-by-step training based on the deep clustering loss function and the mask inference loss function, and the target speech with noise and reverberation removed is calculated based on the target speech features.
  • the pre-set speech enhancement model is trained through different loss functions to guide the model to efficiently remove noise and reverberation in speech, which improves the performance of speech enhancement while reducing model computing resources.
  • the embodiment of the present application also provides a computer device 600.
  • the computer device 600 includes a processor 610, a memory 620, a power supply 630, and an input unit 640.
  • the memory 620 stores computer program instructions; when the computer program instructions are called by the processor 610, the various method steps provided by the above embodiments may be executed.
  • the structure of the computer device shown in the figures does not constitute a limitation on the computer device; the computer device may include more or fewer components than shown in the figures, combine certain components, or use a different arrangement of components, in which:
  • Processor 610 may include one or more processing cores.
  • the processor 610 uses various interfaces and lines to connect the various parts of the computer device, runs or executes the instructions, programs, code sets, or instruction sets stored in the memory 620, and calls the data stored in the memory 620 to perform the various functions of the computer device and process data, thereby controlling the computer device as a whole.
  • the processor 610 may be implemented in at least one of the following hardware forms: digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), or programmable logic array (Programmable Logic Array, PLA).
  • the processor 610 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like;
  • the CPU mainly handles the operating system, user interface, and applications;
  • the GPU is responsible for rendering and drawing the display content;
  • the modem is used to handle wireless communications. It can be understood that the above-mentioned modem may also not be integrated into the processor 610 and may instead be implemented separately through a communication chip.
  • the memory 620 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM).
  • the memory 620 may be used to store instructions, programs, code, code sets, or sets of instructions.
  • the memory 620 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system and instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.) , instructions for implementing various method embodiments described below, etc.
  • the storage data area can also store data created during use of the computer device (such as phone books and audio and video data). Accordingly, the memory 620 may also include a memory controller to provide the processor 610 with access to the memory 620 .
  • the power supply 630 can be logically connected to the processor 610 through a power management system, thereby implementing functions such as charging, discharging, and power consumption management through the power management system.
  • Power supply 630 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
  • the input unit 640 can be used to receive input numeric or character information, and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
  • the computer device 600 may also include a display unit and the like, which will not be described again here.
  • the processor 610 in the computer device will load the executable files corresponding to the processes of one or more application programs into the memory 620 according to the following instructions, and the processor 610 will run the application programs stored in the memory 620, thereby implementing the various method steps provided by the foregoing embodiments.
  • the embodiment of the present application also provides a computer-readable storage medium 700.
  • the computer-readable storage medium 700 stores computer program instructions 710.
  • the computer program instructions 710 can be called by the processor to execute the methods described in the above embodiments.
  • the computer-readable storage medium may be electronic memory such as flash memory, electrically erasable programmable read-only memory (EEPROM), EPROM, hard disk, or ROM.
  • the computer-readable storage medium includes non-transitory computer-readable storage medium (Non-Transitory Computer-Readable Storage Medium).
  • the computer-readable storage medium 700 has storage space for program codes that perform any method steps in the above methods. These program codes can be read from or written into one or more computer program products. The program code may, for example, be compressed in a suitable form.
  • a computer program product or computer program includes computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various optional implementations provided by the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application discloses a speech processing method, apparatus, storage medium, computer device, and program product. The method includes: obtaining initial speech features of call speech; inputting the initial speech features into a pre-trained speech enhancement model to obtain target speech features output by the speech enhancement model, where the speech enhancement model is obtained through step-by-step training based on a deep clustering loss function and a mask inference loss function; and calculating, based on the target speech features, a target speech with noise and reverberation removed.

Description

语音处理方法、装置、存储介质、计算机设备及程序产品
相关申请的交叉引用
本申请基于申请号为202210495197.5、申请日为2022年5月7日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及语音识别技术领域,更具体地,涉及一种语音处理方法、装置、存储介质、计算机设备及程序产品。
背景技术
语音增强(Speech Enhancement)其本质就是语音降噪,日常生活中,麦克风采集的语音通常是带有不同噪声的“污染”语音,语音增强的主要目的就是从这些被“污染”的带噪语音中恢复出我们想要的干净语音,从而有效抑制各种干扰信号,增强目标语音信号,这样不仅可以提高语音话音质量,还有助于提高语音识别的性能。
语音增强的应用领域包括视频会议和语音识别等,是许多语音编码和识别系统的预处理模块,通常可以分为近场语音增强和远场语音增强。在复杂的语音采集环境下,由于噪声和混响会同时存在,现有的语音增强采用基于两级网络的降噪去混响方案,然而,该两级网络较大的计算量使得语音增强无法满足实际应用的性能需求。
发明内容
本申请实施例提供一种语音处理方法、装置、存储介质、计算机设备及程序产品,旨在提升语音增强的性能。
本申请实施例提供一种语音处理方法,该方法包括:获取通话语音的初始语音特征;将初始语音特征输入至预先训练的语音增强模型,得到语音增强模型输出的目标语音特征,语音增强模型为基于深度聚类损失函数和掩码推断损失函数进行的分步训练得到的;根据目标语音特征,计算出去除噪声和混响的目标语音。
本申请实施例还提供一种语音处理装置,该装置包括:获取模块,用于获取通话语音的初始语音特征;增强模块,用于将初始语音特征输入至预先训练的语音增强模型,得到语音增强模型输出的目标语音特征,语音增强模型为基于深度聚类损失函数和掩码推断损失函数进行的分步训练得 到的;计算模型,用于根据目标语音特征,计算出去除噪声和混响的目标语音。
本申请实施例还提供一种计算机设备,该计算机设备包括处理器以及存储器,存储器存储有计算机程序指令,计算机程序指令被处理器调用时执行上述的语音处理方法。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质存储有程序代码,其中,在所述程序代码被处理器运行时执行上述的语音处理方法。
本申请实施例还提供一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在存储介质中。计算机设备的处理器从存储介质读取所述计算机指令,处理器执行所述计算机指令,使得所述计算机执行上述语音处理方法中的步骤。
本申请实施例通过两种不同的损失函数对预先设置的语音增强模型分步进行模型训练,引导模型高效地对语音特征中的噪声和混响进行去除,可以让降噪任务和去混响任务,在独自的训练过程都能达到最优的训练效果,从而有助于提高语音增强模型进行降噪和去混响的能力,在降低模型计算资源的同时,提高语音增强的性能。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1示出了本申请实施例提供的一种常用降噪和去混响的方法示意图。
图2示出了本申请实施例提供的一种语音处理系统的架构示意图。
图3示出了本申请实施例提供的一种语音处理方法的流程示意图。
图4示出了本申请实施例提供的一种语音处理方法的应用场景示意图。
图5示出了本申请实施例提供的一种语音增强模型的架构示意图。
图6示出了本申请实施例提供的另一种语音处理方法的流程示意图。
图7示出了本申请实施例提供的一种语音特征提取的流程示意图。
图8示出了本申请实施例提供的一种预设增强网络的架构示意图。
图9示出了本申请实施例提供的一种语音处理装置的模块框图。
图10是本申请实施例提供的一种计算机设备的模块框图。
图11是本申请实施例提供的一种计算机可读存储介质的模块框图。
具体实施方式
下面详细描述本申请的实施方式,实施方式的示例在附图中示出,其 中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施方式是示例性地,仅用于解释本申请,而不能理解为对本申请的限制。
为了使本技术领域的人员更好地理解本申请的方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整的描述。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
日常生活中,经常会遇到在噪声干扰下进行语音通信的问题。例如在汽车、火车上使用移动电话,环境的喧闹声,以及多人视频会议时麦克风采集的带有噪声的远端语音等,因此需要借助语音增强技术从带噪语音信号中提取尽可能纯净的原始语音。根据通话场景的不同,用户利用客户端进行的通话类型可以包括近端通话和远端通话,就通话的参与者而言,近端是参与者所在的位置,远端是远程会议中其它参与者所在位置。每个位置都至少有一个麦克风和一个扬声器。但客户端的近端通话只适合单人或者人数较少的近距离通话,且音视频体验一般。
为了提升用户体验,工业上侧重于研究大屏通信设备下的远端通话。然而,远端通话由于通话距离更远,信噪比更低,且通话语音通常伴有噪声和混响,所以需要利用性能更好的远场语音增强来对通话语音进行降噪去混响。相关技术的语音增强方案通常采用两个模型分别进行降噪和去混响,针对带噪带混响语音,请参阅图1,图1示出了常用的降噪和去混响的两种方案,包括先降噪后去混响和先去混响后降噪。
例如,将麦克风阵列分为不同子集,每个子集通过第一级的语音增强网络,得到每个麦克风增强后的语音,将增强后的语音整合到一起再通过第二级的语音增强网络,得到最终输出。然而,这种基于两级网络的语音增强方案,训练过程需要消耗较大的计算量,并不适合产品实际应用的性能需求,若通过减小网络参数的数量以降低计算量,则会导致网络进行语音增强时的效果变差。
为了解决上述问题,申请人经过研究,提出了本申请实施例提供的语音处理方法,该方法可以获取通话语音的初始语音特征,并将初始语音特征输入至预先训练的语音增强模型,得到语音增强模型输出的目标语音特征,该语音增强模型为基于深度聚类损失函数和掩码推断损失函数进行的分步训练得到,从而将两个模型(两级网络)融合为同一个模型,以减少模型训练过程的计算成本。根据目标语音特征,计算出去除噪声和混响的目标语音。如此,通过不同损失函数对预先设置的语音增强模型进行模型训练,引导模型高效地对初始语音特征中的噪声和混响进行去除,在降低模型计算资源的同时,提高语音增强的性能。
下面先对本申请所涉及到的语音处理方法的一种应用场景进行介绍。 图2示出了一种语音处理系统的架构示意图。在一些实施例中,语音处理系统300应用于远程视频会议的场景中,该语音处理系统300可以包括近端客户端310、远端客户端330以及服务器端350。其中,近端客户端310、远端客户端330以及服务器端350通过网络进行通信连接,作为一种实施方式,近端客户端310和远端客户端330可以是用于视频的大屏终端,服务器端350可以为云服务器。
示例性地,远端客户端330可以采集参会人员发出的带有噪声和混响的初始语音,并将初始语音传送至服务器端350,服务器端350接受到初始语音后,可以利用预先训练好的语音增强模型对该初始语音进行降噪和去混响,得到增强后的干净语音(目标语音),并将干净语音传送至近端客户端310。在一些实施例中,语音增强模型也可以根据实际应用场景的需要,配置于近端客户端310或远端客户端330。
需要说明的是,上述语音处理系统300仅仅是一个示例,本申请实施例描述的语音处理系统的架构以及应用场景是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域普通技术人员可知,随着语音处理系统架构的演变和新的应用场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
请参阅图3,图3示出了本申请一个实施例提供的语音处理方法的流程示意图。在具体的实施例中,所述语音处理方法应用于如图9所示的语音处理装置500以及配置有语音处理装置500的计算机设备600(图10)。
下面将以计算机设备为例,说明本申请实施例的具体流程,当然,可以理解的是,本申请实施例所应用的计算机设备可以为服务器或者终端等,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、区块链以及大数据和人工智能平台等基础云计算服务的云服务器。终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。
下面将针对图3所示的流程结合图4所示的应用场景进行详细的阐述,图4示出了本申请实施例提供的一种语音处理方法的应用场景示意图,该应用场景中,语音处理方法可以应用在具体的语音增强系统,该语音增强系统的语音增强模型411可以部署在云端服务器410中,云端服务器410可以分别于两个会场的会议终端(第一会议终端430和第二会议终端450)通信连接,其中,第一会议终端430和第二会议终端450可以采集各自所在会场的参会人员的语音(即原始的通话语音),并将采集的语音上传至云端服务器410,由云端服务器410完成对语音的语音增强得到干净语音,最后,云端服务器410将干净语音传至对应的会议终端进行播放。所述语音处理方法具体可以包括以下步骤:
步骤S110:获取通话语音的初始语音特征。
在本申请实施例中,计算机设备可以获取需要进行语音增强的通话语音的初始语音特征。其中,初始语音特征为基于通话语音转化得到的声学特征,例如,对数功率谱(Logarithmic Power Spectrum,LPS)和梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCC)等,在此不做限定。
由于语音数据往往不能像图像数据那样直接输入到模型中训练,其在长时域上没有明显的特征变化,所以很难学习到语音数据的特征,加之语音的时域数据通常由16K采样率构成,即1秒16000个采样点,直接输入时域采样点会导致训练数据量过大且很难训练出具有实际意义的效果。因此,在语音处理相关任务中,通常是将语音数据转化为声学特征作为模型的输入或者输出。
作为一种实施方式,在获取通话语音后,可以对通话语音进行分帧处理和加窗处理,得到初始语音特征。例如,依次对所有麦克风采集的通话语音进行分帧处理和加窗处理,得到通话语音的语音信号帧,并对语音信号帧进行快速傅里叶变换(Fast Fourier Transformation,FFT)并求取FFT之后的离散功率谱,进而对获得的离散功率谱进行对数计算,得到对数功率谱作为初始语音特征。通过对通话语音进行分帧处理和加窗处理,可将通话语音由时域空间的非平稳时变信号转化为频域空间的平稳信号,便于模型的训练。
语音信号分帧的目的是把若干个语音采样点分为一帧,在这一帧内,语音信号的特性可视为是稳定的。通常,一帧的长度应该足够短来保证帧内信号是平稳的,因此一帧的长度应该小于一个音素的长度,正常语速下一个音素持续时间大约为50ms。此外,要进行傅里叶分析,一帧必须包含足够多的振动周期,男声在100赫兹左右,女声在200赫兹左右,换算成周期就是10ms和5ms。因此,一般语音分帧的长度取10~40ms。
分帧后每一帧的开始和结束都会出现间断,因此分割的帧越多,与原始信号的误差就越大,加窗就是为了解决这个问题,使成帧后的信号变得连续,并且每一帧都会表现出周期函数的特性。例如,可以使用的窗函数有:矩形窗、汉明窗、汉宁窗等。
在图4所示的视频会议场景中,由于参会人员与会议终端之间具有一定的距离,进而导致会议终端采集的参会人员语音中带有噪声和混响。为此可以利用本申请实施例提供的语音处理方法对参会人员的语音进行语音增强处理,以便去除声音中的噪声和混响。
示例性地,第二会议终端450通过麦克风在会场中采集到参会人员420的语音,也即通话语音,并将该通话语音通过网络发送至云端服务器410,进而云端服务器410接收到通话语音后,对通话语音进行分帧处理、加窗处理以及傅里叶变换,得到初始语音特征。
步骤S120:将初始语音特征输入至预先训练的语音增强模型,得到语 音增强模型输出的目标语音特征。
实际的应用场景中,麦克风阵列采集的通话语音会同时包含有噪声和混响,考虑到用于对通话语音进行降噪和去混响的两级网络,由于在训练时两个网络的参数量较大,因此需要消耗大量的计算资源,并且,若将每个网络的参数量减小也会降低模型进行降噪和去混响的性能。为此,可以将两级网络融合为同一个网络,相对于两个网络的参数量,融合后的模型的参数量会减少,可以大大减少训练过程的计算量,也能提高模型进行语音增强的性能。
在本申请实施例中,语音增强模型可以基于输入的初始语音特征生成通话语音对应的目标语音特征,也即经过语音增强后,除去噪声和混响的干净的语音特征。请参阅图5,图5示出了一种语音增强模型的架构示意图。该语音增强模型可以包括多个隐藏层、深度聚类层、语音掩码推断层以及噪声掩码推断层。
其中,深度聚类层、语音掩码推断层以及噪声掩码推断层可以为线性层,三者的输入统一来自隐藏层的输出。隐藏层可以基于输入的初始语音特征计算得到中间特征,该中间特征为语音增强过程的中间值。
例如,深度聚类层可以通过归一化(Normalization)和正切函数(记作tanh)实现,隐藏层的输出首先进行归一化处理,将隐藏层的输出限制到一定的范围内,以方便后续处理,例如[0,1]或者[-1,1],然后对归一化结果计算正切函数值,作为深度聚类层的输出。
例如,语音掩码推断层和噪声掩码推断层均可以通过softmax函数实现。
语音掩码推断层可以基于中间特征进行掩码推断(Mask Inference,MI),得到去除噪声和混响的目标语音特征,噪声掩码推断层可以基于中间特征进行掩码推断,得到带有噪声的语音特征,深度聚类层通过对获取的中间特征进行深度聚类(Deep Clustering,DC)可以辅助语音掩码推断层和噪声掩码推断层降噪和去混响。例如,隐藏层可以为长短期记忆网络(Long Short-Term Memory,LSTM)或变体,例如双向长短期记忆网络(Bi-directional Long-Short Term Memory,Bi-LSTM),这是因为语音特征具有短时平稳性的时序序列,这与LSTM的长短期记忆能力相吻合。隐藏层也可以是其他具有记性特性的网络,例如门控循环单元(Gated Recurrent Unit,GRU)。
作为一种实施方式,在模型训练过程中,可以通过深度聚类层对应的深度聚类损失函数,以及语音掩码推断层和噪声掩码推断层各自对应的掩码推断损失函数,对模型进行分步训练。示例性地,第一步,可以基于深度聚类损失函数和掩码推断损失函数训练降噪模型,当降噪模型收敛之后,停止训练,其中,语音掩码推断层对应的掩码推断损失函数使用的是不带噪声且带混响的干净语音标签。第二步,训练去混响模型,将第一步训练好的降噪模型作为去混响模型,基于深度聚类损失函数和掩码推断损失函 数训练去混响模型,当去混响模型收敛后,停止训练,其中,语音掩码推断层对应的掩码推断损失函数使用的是不带噪声且不带混响的干净语音标签,从而,最终得到的去混响模型,也即语音增强模型具备同时进行降噪和去混响的能力。
需要说明的是,语音增强模型的深度聚类层为一个基于时频点聚类的二值损失,由于深度聚类损失的正则化特性,在相关技术的训练过程中,难以引导语音掩码推断层以及噪声掩码推断层对语音中的噪声和混响进行有效去除,进而难以有效提升模型进行语音增强的性能。而本申请实施例的分步训练方案,可以让降噪任务和去混响任务,在独自的训练过程都能达到最优的训练效果,从而有助于提高语音增强模型进行降噪和去混响的能力。
以此,通过上述训练得到的语音增强模型,可以通过多层LSTM得到中间特征,语音掩码推断层可以基于中间特征进行掩码推断,计算出语音的掩码,也即目标语音特征。示例性地,如图4所示的视频会议场景中,云端服务器410得到初始语音特征之后,可以将该初始语音特征输入至语音增强模型411,该语音增强模型411的语音掩码推断层可以基于中间特征进行掩码推断计算出语音的掩码,也即目标语音特征,其中,中间特征是通过多层LSTM得到的。在语音增强的应用场景中,由于仅仅需要利用语音掩码推断层输出的目标语音特征恢复语音,从而可以有效减少语音增强过程的计算量。
步骤S130:根据目标语音特征,计算出去除噪声和混响的目标语音。
作为一种实施方式,可以对获取的目标语音特征进行特征逆变换,计算出去除噪声和混响的目标语音。例如,可以对目标语音特征进行傅里叶逆变换(Inverse Fourier Transform,IFT)将目标语音特征从频域转换到时域,从而获得语音增强后的时域语音,也即目标语音。示例性地,如图4所示的视频会议场景中,云端服务器410在获取语音增强模型411输出的目标语音特征之后,可以通过傅里叶逆变换将目标语音特征,也即将干净的语音特征转换成目标语音,从而得到去除噪声和混响的干净语音,云端服务器410可以将干净语音发送至第一会议终端430,由第一会议终端430的扬声器播放出参会人员420的不带噪声和混响的语音。
本申请实施例中,可以获取通话语音的初始语音特征,并将初始语音特征输入至预先训练的语音增强模型,得到语音增强模型输出的目标语音特征,该语音增强模型为基于深度聚类损失函数和掩码推断损失函数进行的分步训练得到,根据目标语音特征,计算出去除噪声和混响的目标语音。由此,通过不同损失函数对预先设置的语音增强模型进行模型训练,引导模型高效地对初始语音特征中的噪声和混响进行去除,从而在降低模型计算资源的同时,提高语音增强的性能。
结合上述实施例所描述的方法,以下将举例作进一步详细说明。
在本申请实施例中,将以该语音处理装置具体集成在计算机设备中为例进行说明。
请参阅图6,图6示出了本申请实施例提供的另一种语音处理方法,在具体的实施例中,该视频处理语音处理方法运用到如图8所示的预设增强网络。下面将针对图5所示的流程进行详细的阐述。
本申请实施例结合人工智能(Artificial Intelligence,AI)技术,人工智能技术是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
本申请实施例提供的方案涉及人工智能的语音技术(Speech Technology)等技术,语音技术的关键技术有自动语音识别技术(Automatic Speech Recognition,ASR)和语音合成技术(Text To Speech,TTS)以及声纹识别技术(Voiceprint Recognition,VPR)。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。
下面将结合图6所示的流程和图8所示的网络架构图进行详细的阐述,该语音处理方法具体可以包括以下步骤:
步骤S210:计算机设备获取训练样本集合。
本申请实施例中提供的语音处理方法包括对预设增强网络的训练,值得说明的是,对预设增强网络的训练可以是根据获取的训练样本数据集合预先进行的,后续在每次需要对通话语音的初始语音特征进行语音增强时,可以利用训练得到的语音增强模型计算出去除噪声和混响的目标语音特征,而无需每次进行语音增强时,再次对预设增强网络进行训练。
在一些实施例中,可以利用wsj0-2mix(Wall Street Journal)数据集来确定训练样本集合,该wsj0-2mix数据集包含有30个小时的语音训练集和10个小时的语音训练集,通过从相应的集合中随机选择不同的说话者的语音,并以0dB和10dB之间的随机相对信噪比(Signal to Noise Ratio,SNR)进行混合,可以生成用于网络训练所使用的带噪声和混响的语音。
作为一种实施方式,该计算机设备获取训练样本集合的步骤可以包括:
(1)计算机设备获取第一样本语音。
(2)计算机设备对第一样本语音进行语音特征提取,得到噪声语音特征。
(3)计算机设备获取第二样本语音。
(4)计算机设备对第二样本语音进行语音特征提取,得到第一干净语音标签以及第二干净语音标签。
(5)计算机设备根据第一样本语音以及第二样本语音,确定深度聚类标注。
其中,第一样本语音为基于麦克风采集的含有噪声和混响的语音。第二样本语音为不带噪声带混响的干净语音以及不带噪声且不带混响的干净语音。深度聚类标注为第一样本语音以及第二样本语音在每一个时频点上的特征的比值。
示例性地,计算机设备可以直接通过麦克风采集含有噪声和混响的通话语音,例如,视频会议中,通过大屏会议终端的麦克风采集的参会人员的发言作为第一样本语音,实际的训练过程,技术人员可以直接从已经构建好的降噪训练语料中获取第一样本语音。
计算机设备可以对获取的第一样本语音进行语音特征提取,请参阅图7,图7示出了一种语音特征提取的流程示意图,将麦克风采集含有噪声和混响的通话语音,也即第一样本语音分别通过分帧处理和加窗处理得到多帧语音信号0<i<n&i∈N*,n为总帧数,t表征时域空间,N*表示正整数集合,进而计算机设备可以对每一帧语音信号进行FFT,将每一帧语音信号由时域空间转换到频域空间,得到对应的离散功率谱,并对获得的离散功率谱求对数,得到对数功率谱f表征频域空间,将所有麦克风的特征拼接在一起即可得到最终的噪声语音特征xnoisy,在一些实施例中,可以根据噪声语音特征标记出噪声语音标签,也即其中分别为第1帧到第n帧的语音信号的FFT变换结果。
计算机设备也可以从降噪训练语料中获取作为参考的干净语音,并将干净语音作为第二样本语音,为了便于对预设增强网络进行分步训练,可以获取不带噪声带混响的干净语音以及不带噪声不带混响的干净语音,进而对不带噪声带混响的干净语音进行语音特征提取,得到第一干净语音标签,对不带噪声不带混响的干净语音进行语音特征提取,得到第二干净语音标签。在计算过程中,噪声语音标签第一干净语音标签以及第二干净语音标签的数学表达为特征向量(Embedding),也称为嵌入向量,其中,特征向量的长度为特征的维度。
作为一种实施方式,计算机设备可以通过比较第一样本语音以及第二样本语音在每个时频点上的语音能量来确定出深度聚类标注由于语音信号是随时间变化的,所以其能量也是随时间变化的,所以在计算数字化的语音信号的能量时,并不是计算整体的能量,而是按帧来计算每个时频点上的能量。示例性地,计算机设备可以将不带噪声且带混响的语音和噪 声语音的能量比作为深度聚类标注,也可以将不带噪声且不带混响的语音和噪声语音的能量比作为深度聚类标注,该深度聚类标注用于深度聚类损失函数的计算。
步骤S220:计算机设备获取预设增强网络。
考虑到语音增强技术的相关产品在工业化落地时,对延时也即实时性要求非常严格,因此需要将语音增强模型的参数量尽可能的减小,但是这样会导致模型进行语音增强的效果大幅下降。为此,在本申请实施例中,提出将两级网络融合为同一个网络,使得语音增强模型可以同时进行降噪和去混响,从而在未减少模型的参数量的情况下,仍然能够提高语音增强的效果。
请参阅图8,图8示出了一种预设增强网络的架构示意图。该预设增强网络包括隐藏层、深度聚类(Deep Clustering)层以及掩码推断层。预设增强网络是一个底层权重共享、多头输出的网络,其中,深度聚类层中可以可以辅助语音掩码推断层以及噪声掩码推断层进行掩码推断,使得语音掩码推断层以及噪声掩码推断层在网络训练的过程中可以有效地区分语音中的噪声和混响,隐藏层可以利用LSTM或者Bi-LSTM,图8所示的隐藏层为LSTM,掩码推断层包括语音掩码层(Clean-MI)和噪声掩码层(Noise-MI)。
语音掩码推断层可以计算出语音的掩码,也即干净语音标签,噪声掩码推断层可以计算出噪声和混响的掩码,也即噪声语音标签。需要说明的是,在应用过程中,仅仅需要利用语音掩码推断层输出的掩码来恢复语音,因此,没有增加语音增强过程的计算量,从而提高了语音增强效率。
步骤S230:计算机设备通过训练样本集合对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件,得到训练后的目标增强网络作为语音增强模型。
完成训练后得到的目标增强网络,也即语音增强模型需要同时进行降噪和去混响的两个增强任务,如果同时对这两个增强任务进行训练,预设增强网络的训练无法达到最优的训练效果。为此,可以采取分步训练的方式,将两个任务的训练过程单独进行。
具体地,本申请实施例提供两种分步训练的方式,例如,可以先进行噪声去除训练,再进行混响去除训练,也可以先进行混响去除训练,再进行噪声去除训练。其中,噪声去除训练的目的是让网络具备降噪的能力,混响去除训练的目的是让网络具备去混响的能力,可以让两个增强任务,在独自的训练过程都能达到最优的训练效果,从而提高语音增强模型进行语音增强的性能。
在一些实施例中,该计算机设备通过训练样本集合对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件的步骤可以包括:
(1)计算机设备将噪声语音特征输入隐藏层,通过隐藏层生成中间训 练特征。
(2)计算机设备将中间训练特征输入深度聚类层,通过深度聚类层生成聚类训练标注。
(3)计算机设备将中间训练特征输入语音掩码推断层,通过语音掩码推断层生成干净语音训练特征。
(4)计算机设备将中间训练特征输入噪声掩码推断层,通过噪声掩码推断层生成噪声语音训练特征。
(5)计算机设备根据干净语音标签、噪声语音标签、深度聚类标注、干净语音训练特征、噪声语音训练特征以及聚类训练标注构建目标损失函数,并根据目标损失函数对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件。
其中,中间训练特征为预设增强网络的隐藏层生成的中间值,可以作为一个共享值分别输入至深度聚类层、语音掩码推断层以及噪声掩码推断层,从而达到底层权重共享以减小网络的参数量。语音掩码推断层和噪声掩码推断层可以基于中间训练特征分别对应生成干净语音训练特征yclean和噪声语音训练特征ynoise。深度聚类层可以基于中间训练特征生成聚类训练标注ydc
作为一种实施方式,该计算机设备根据干净语音标签、噪声语音标签、深度聚类标注、干净语音训练特征、噪声语音训练特征以及聚类训练标注构建目标损失函数,并根据目标损失函数对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件的步骤可以包括:
(5.1)计算机设备根据聚类训练标注和深度聚类标注,确定第一损失函数。
其中,第一损失函数即为深度聚类损失函数,示例性地,第一损失函数ydc为聚类训练标注,为深度聚类标注。
(5.2)计算机设备根据干净语音训练特征和干净语音标签,确定第二损失函数。
针对两种分步训练的方式,可以根据不同的干净语音标签确定两种不同的第二损失函数。
在一些实施例中,计算机设备可以根据干净语音训练特征yclean和第一干净语音标签确定噪声去除损失函数并将噪声去除损失函数作为第二损失函数
在一些实施例中,计算机设备可以根据干净语音训练特征yclean和第二干净语音标签确定噪声去除损失函数并将噪声去除损失函数作为第二损失函数
(5.3)计算机设备根据噪声语音训练特征和噪声语音标签,确定第三损失函数。
示例性地,第三损失函数其中,ynoise为噪声语音训练特征,为噪声语音标签。
其中,第二损失函数Lossclean和第三损失函数即为掩码推断损失函数。
(5.4)计算机设备根据第一损失函数,第二损失函数和第三损失函数,构建预设增强网络的目标损失函数,并根据目标损失函数对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件。
示例性地，计算机设备可以根据第一损失函数Lossdc，第二损失函数Lossclean和第三损失函数Lossnoise，构建预设增强网络的目标损失函数Loss，可以基于上述3个损失函数分别对应的权重参数，对上述3个损失函数进行加权求和，如下述公式：Loss = α·Lossdc + β·Lossclean + γ·Lossnoise
其中,α,β和γ为权重参数。目标损失函数Loss对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件,在一些实施例中,可以基于多任务学习(Multi-Task Learning)对预设增强网络进行训练,从而联合深度聚类损失函数和掩码推断损失函数同时进行降噪和去混响的两个增强任务的学习,两个任务之间通过共享参数可以在学习过程中可以共享它们所学到的信息,使得训练得到的目标增强网络取得更好的泛化(Generalization)效果。
通常噪声指的是在某些场合“不需要的声音”,例如,人的嘈杂声及各种突发的声响等。混响指的是室内声源停止发声后仍然存在的声延续现象。考虑到不同应用场景对语音增强的需求方向有所不同,例如,在多人会场中主要除去会议终端采集的声音中的噪声,在专业录音场地主要除去录音设备采集的声音中的混响,为此,可以根据最终语音增强模型所使用的实际场景进行不同方式的分步训练。
在一些实施例中,可以根据最终语音增强模型所使用的实际场景,获取应用场景属性,并根据应用场景属性确定对应的分布训练策略。基于分布训练策略,根据第一损失函数,第二损失函数和第三损失函数,构建预设增强网络的目标损失函数,并根据该目标损失函数对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件。
其中,应用场景属性用于表征语音增强模型所应用的实际场景,例如,侧重降噪场景属性或者侧重去混响场景属性。分布训练策略包括第一分布训练策略和第二分布训练策略,第一分布训练策略用于针对侧重降噪场景,先进行噪声去除训练,再进行混响去除训练。第二分布训练策略用于针对侧重去混响场景,先进行混响去除训练,再进行噪声去除训练。
作为一种实施方式,在以去除噪声为目的的应用场景中,例如,多人 参与的视频会议中,会议终端采集的除了发言人发出的声音,还包括其他说话人的声音,需要针对会议终端采集的通货语音进行降噪处理,为此可以先进行噪声去除训练,再进行混响去除训练。计算机设备可以基于第一分布训练策略,根据第一损失函数、第二损失函数以及第三损失函数,确定预设增强网络的目标损失函数,该第二损失函数为噪声去除损失函数确定的,进而根据目标损失函数对预设增强网络迭代进行噪声去除训练,直至预设增强网络满足预设条件,得到噪声去除网络,该噪声去除网络仅起到降噪作用。
在一些实施例中,计算机设备可以根据第一损失函数、第二损失函数以及第三损失函数,确定噪声去除网络的目标损失函数,该第二损失函数为混响去除损失函数确定的,进而根据目标损失函数对噪声去除网络迭代进行混响去除训练,直至噪声去除网络满足预设条件。如此,先进行单独的噪声去除训练,可以避免训练过程受到混响因素的干扰,从而使得生成的目标增强网络具备更好的降噪性能。
作为另一种实施方式,在以去除混响为目的的应用场景中,例如,录音棚中对音质要求相对较高,去除不必要的混响尤其重要,为此可以先进行混响去除训练,再进行噪声去除训练。计算机设备可以基于第二分布训练策略,根据第一损失函数、第二损失函数以及第三损失函数,确定预设增强网络的目标损失函数,该第二损失函数为混响去除损失函数确定的,进而根据目标损失函数对预设增强网络迭代进行混响去除训练,直至预设增强网络满足预设条件,得到混响去除网络,该混响去除网络仅起到去混响作用。
在一些实施例中,计算机设备可以根据第一损失函数、第二损失函数以及第三损失函数,确定混响去除网络的目标损失函数,该第二损失函数为噪声去除损失函数确定的,进而根据目标损失函数对混响去除网络迭代进行噪声去除训练,直至混响去除网络满足预设条件。如此,先进行单独的混响去除训练,可以避免训练过程受到噪声因素的干扰,从而使得生成的目标增强网络具备更好的去混响性能。
例如,在对噪声进行精确定义的情况下,噪声概念中实质包含混响,为此,在对语音增强模型的应用场景没有特殊需求时,可以对预设增强网络先进行噪声去除训练,再进行混响去除训练,从而在一个优秀的降噪网络的基础上再学习去混响的能力,如此,在两个训练过程都能达到最优的训练效果,从而提高语音增强模型进行语音增强的性能。
需要说明的是,预设条件可以为:目标损失函数的总损失值小于预设值、目标损失函数的总损失值不再变化、或者训练次数达到预设次数等。例如,可以采用优化器去优化目标损失函数,基于实验经验设置学习率、训练时的批量尺寸(batch size)训练的周期(epoch)。
可以理解的是,在根据训练样本数据集合对待训练网络(预设增强网 络/噪声去除网络/混响去除网络)进行多个周期的迭代训练后,其中,每个周期包括多次的迭代训练,不断对待训练网络的参数进行优化,则以上总损失值越来越小,最后变小为一个固定值,或者小于以上预设值,此时,则表示待训练网络已收敛;当然也可以是在训练次数达到预设次数后,确定预设增强网络/噪声去除网络/混响去除网络已经收敛。
通过多任务学习对预设增强网络进行的训练,虽然使用深度聚类损失和掩码推断损失的组合进行训练,但只在目标增强网络也即语音增强模型选择的验证过程中使用掩码推断损失,在语音增强模型运行时,使用掩码推断分支的输出作为语音增强后的掩码,也即目标语音特征。
步骤S240:计算机设备获取通话语音的初始语音特征。
步骤S250:计算机设备将初始语音特征输入隐藏层,通过隐藏层生成中间特征。
步骤S260:计算机设备将中间特征输入语音掩码推断层,通过语音掩码推断层生成干净语音特征,并将干净语音特征作为目标语音特征。
作为一种实施方式,计算机设备在采集到通话语音后,可以对该通话语音进行语音特征提取,包括对通话语音进行分帧处理、加窗处理以及傅里叶变换,得到初始语音特征,计算机设备可以将初始语音特征输入到语音增强网络的隐藏层,通过隐藏层生成中间特征,计算机设备可以将中间特征输入语音掩码推断层,通过语音掩码推断层生成干净语音特征,并将干净语音特征作为目标语音特征。
步骤S270:计算机设备对目标语音特征进行特征逆变换,计算出去除噪声和混响的目标语音。
作为一种实施方式,计算机设备在获取目标语音特征之后,可以对目标语音特征进行特征逆变换,将频域空间的目标语音特征(掩码)转换到时域空间的目标语音。在一些实施例中,特征逆变换可以为傅里叶逆变换。本申请实施例中,可以获取训练样本集合以及获取预设增强网络,并通过训练样本集合对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件,得到训练后的目标增强网络作为语音增强模型,将初始语音特征输入隐藏层,通过隐藏层生成中间特征,并将中间特征输入语音掩码推断层,通过语音掩码推断层生成干净语音特征,并将干净语音特征作为目标语音特征,进而对目标语音特征进行特征逆变换,计算出去除噪声和混响的目标语音。由此,仅需要利用语音增强模型的语音掩码推断层输出的目标语音特征来恢复语音,避免增加语音增强过程的计算量,从而提高了语音增强效率。
请参阅图9,其示出了本申请实施例提供的一种语音处理装置500的结构框图。该语音处理装置500包括:获取模块510,配置为获取通话语音的初始语音特征;增强模块520,配置为将初始语音特征输入至预先训练的语音增强模型,得到语音增强模型输出的目标语音特征,语音增强模型为基 于深度聚类损失函数和掩码推断损失函数进行的分步训练得到;计算模型530,配置为根据目标语音特征,计算出去除噪声和混响的目标语音。
在一些实施例中,语音处理装置500还可以包括:样本获取模块、网络获取模块以及模型训练模块。样本获取模块,配置为获取训练样本集合,训练样本集合包括噪声语音特征、干净语音标签、噪声语音标签以及深度聚类标注;网络获取模块,配置为获取预设增强网络,预设增强网络包括隐藏层、深度聚类层以及掩码推断层;网络训练模块,配置为通过训练样本集合对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件,得到训练后的目标增强网络作为语音增强模型。
在一些实施例中,掩码推断层包括语音掩码推断层以及噪声掩码推断层,网络训练模块可以包括:隐藏单元,配置为将噪声语音特征输入隐藏层,通过隐藏层生成中间训练特征;深度聚类单元,配置为将中间训练特征输入深度聚类层,通过深度聚类层生成聚类训练标注;语音推断单元,配置为将中间训练特征输入语音掩码推断层,通过语音掩码推断层生成干净语音训练特征;噪声推断单元,配置为将中间训练特征输入噪声掩码推断层,通过噪声掩码推断层生成噪声语音训练特征;网络训练单元,配置为根据干净语音标签、噪声语音标签、深度聚类标注、干净语音训练特征、噪声语音训练特征以及聚类训练标注构建目标损失函数,并根据目标损失函数对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件。
在一些实施例中,网络训练单元包括:第一子单元,配置为根据聚类训练标注和深度聚类标注,确定第一损失函数;第二子单元,配置为根据干净语音训练特征和干净语音标签,确定第二损失函数;第三子单元,配置为根据噪声语音训练特征和噪声语音标签,确定第三损失函数;训练子单元,配置为根据第一损失函数,第二损失函数和第三损失函数,构建预设增强网络的目标损失函数,并根据目标损失函数对预设增强网络分步进行噪声去除训练以及混响去除训练,直至预设增强网络满足预设条件。
在一些实施例中,第二子单元可以具体配置为:根据干净语音训练特征和第一干净语音标签,确定噪声去除损失函数;将噪声去除损失函数作为第二损失函数,第一干净语音标签为基于不带噪声带混响的语音获取的语音标签。
在一些实施例中,第二子单元还可以具体配置为:根据干净语音训练特征和第二干净语音标签,确定混响去除损失函数;将混响去除损失函数作为第二损失函数,第二干净语音标签为基于不带噪声不带混响的语音获取的语音标签。
在一些实施例中,训练子单元可以具体配置为:根据第一损失函数、第二损失函数以及第三损失函数,确定预设增强网络的目标损失函数,并根据目标损失函数对预设增强网络迭代进行噪声去除训练,直至预设增强 网络满足预设条件,得到噪声去除网络,其中,第二损失函数为噪声去除损失函数确定的;根据第一损失函数、混响去除损失函数以及第三损失函数,确定噪声去除网络的目标损失函数,并根据目标损失函数对噪声去除网络迭代进行混响去除训练,直至噪声去除网络满足预设条件,其中,第二损失函数为混响去除损失函数确定的。
在一些实施例中,训练子单元可以具体配置为:根据第一损失函数、第二损失函数以及第三损失函数,确定预设增强网络的目标损失函数,并根据目标损失函数对预设增强网络迭代进行混响去除训练,直至预设增强网络满足预设条件,得到混响去除网络,其中,第二损失函数为混响去除损失函数确定的;根据第一损失函数、第二损失函数以及第三损失函数,确定混响去除网络的目标损失函数,并根据目标损失函数对混响去除网络迭代进行噪声去除训练,直至混响去除网络满足预设条件,其中,第二损失函数为噪声去除损失函数确定的。
在一些实施例中,样本获取模块可以具体配置为:获取第一样本语音,第一样本语音为基于麦克风采集的含有噪声的语音;对第一样本语音进行语音特征提取,得到噪声语音特征;获取第二样本语音,第二样本语音包括不带噪声带混响的干净语音以及不带噪声不带混响的干净语音;对第二样本语音进行语音特征提取,得到第一干净语音标签以及第二干净语音标签;根据第一样本语音以及第二样本语音,确定深度聚类标注。
在一些实施例中,语音增强模型包括隐藏层、深度聚类层、语音掩码推断层以及噪声掩码推断层,增强模块520可以具体配置为:将初始语音特征输入隐藏层,通过隐藏层生成中间特征;将中间特征输入语音掩码推断层,通过语音掩码推断层生成干净语音特征,并将干净语音特征作为目标语音特征;
计算模型530可以具体配置为对目标语音特征进行特征逆变换,计算出去除噪声和混响的目标语音。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述装置和模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,模块相互之间的耦合可以是电性,机械或其它形式的耦合。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
本申请提供的方案,可以获取通话语音的初始语音特征,并将初始语音特征输入至预先训练的语音增强模型,得到语音增强模型输出的目标语音特征,该语音增强模型为基于深度聚类损失函数和掩码推断损失函数进 行的分步训练得到,根据目标语音特征,计算出去除噪声和混响的目标语音。由此,通过不同损失函数对预先设置的语音增强模型进行模型训练,引导模型高效地对语音中的噪声和混响进行去除,在降低模型计算资源的同时,提高语音增强的性能。
如图10所示,本申请实施例还提供一种计算机设备600,该计算机设备600包括处理器610、存储器620、电源630和输入单元640,存储器620存储有计算机程序指令,计算机程序指令被处理器610调用时,可实执行上述的实施例提供的各种方法步骤。本领域技术人员可以理解,图中示出的计算机设备的结构并不构成对计算机设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
处理器610可以包括一个或多个处理核。处理器610利用各种接口和线路连接整个电池管理系统内的各种部分,通过运行或执行存储在存储器620内的指令、程序、代码集或指令集,调用存储在存储器620内的数据,执行电池管理系统的各种功能和处理数据,以及执行计算机设备的各种功能和处理数据,从而对计算机设备进行整体控制。在一些实施例中,处理器610可以采用数字信号处理(Digital Signal Processing,DSP)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、可编程逻辑阵列(Programmable Logic Array,PLA)中的至少一种硬件形式来实现。处理器610可集成中央处理器610(Central Processing Unit,CPU)、图像处理器610(Graphics Processing Unit,GPU)和调制解调器等中的一种或几种的组合。其中,CPU主要处理操作系统、用户界面和应用程序等;GPU用于负责显示内容的渲染和绘制;调制解调器用于处理无线通信。可以理解的是,上述调制解调器也可以不集成到处理器610中,单独通过一块通信芯片进行实现。
存储器620可以包括随机存储器620(Random Access Memory,RAM),也可以包括只读存储器620(Read-Only Memory)。存储器620图可用于存储指令、程序、代码、代码集或指令集。存储器620可包括存储程序区和存储数据区,其中,存储程序区可存储用于实现操作系统的指令、用于实现至少一个功能的指令(比如触控功能、声音播放功能、图像播放功能等)、用于实现下述各种方法实施例的指令等。存储数据区还可以存储计算机设备在使用中所创建的数据(比如电话本和音视频数据)等。相应地,存储器620还可以包括存储器控制器,以提供处理器610对存储器620的访问。
电源630可以通过电源管理系统与处理器610逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源630还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
输入单元640,该输入单元640可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨 迹球信号输入。
尽管未示出,计算机设备600还可以包括显示单元等,在此不再赘述。具体在本申请实施例中,计算机设备中的处理器610会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器620中,并由处理器610来运行存储在存储器620中的应用程序,从而实现前述实施例提供的各种方法步骤。
如图11所示,本申请实施例还提供一种计算机可读存储介质700,该计算机可读存储介质700中存储有计算机程序指令710,计算机程序指令710可被处理器调用以执行上述实施例中所描述的方法。
计算机可读存储介质可以是诸如闪存、电可擦除可编程只读存储器(EEPROM)、EPROM、硬盘或者ROM之类的电子存储器。在一些实施例中,计算机可读存储介质包括非易失性计算机可读存储介质(Non-Transitory Computer-Readable Storage Medium)。计算机可读存储介质700具有执行上述方法中的任何方法步骤的程序代码的存储空间。这些程序代码可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。程序代码可以例如以适当形式进行压缩。
根据本申请的一个方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实施例提供的各种可选实现方式中提供的方法。
以上,仅是本申请的较佳实施例而已,并非对本申请作任何形式上的限制,虽然本申请已以较佳实施例揭示如上,然而并非用以限定本申请,任何本领域技术人员,在不脱离本申请技术方案范围内,当可利用上述揭示的技术内容做出些许更动或修饰为等同变化的等效实施例,但凡是未脱离本申请技术方案内容,依据本申请的技术实质对以上实施例所作的任何简介修改、等同变化与修饰,均仍属于本申请技术方案的范围内。

Claims (17)

  1. 一种语音处理方法,所述方法包括:
    获取通话语音的初始语音特征;
    将所述初始语音特征输入至语音增强模型,得到所述语音增强模型输出的目标语音特征,其中,所述语音增强模型为基于深度聚类损失函数和掩码推断损失函数进行分步训练得到的;
    根据所述目标语音特征,计算出去除噪声和混响的目标语音。
  2. 根据权利要求1所述的方法,其中,所述方法还包括:
    通过如下方式预训练所述语音增强模型:
    获取训练样本集合,其中,所述训练样本集合包括噪声语音特征、干净语音标签、噪声语音标签以及深度聚类标注;
    获取预设增强网络;
    通过所述训练样本集合对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件,将得到训练后的目标增强网络作为所述语音增强模型。
  3. 根据权利要求2所述的方法,其中,
    所述预设增强网络包括隐藏层、深度聚类层以及掩码推断层,所述掩码推断层包括语音掩码推断层以及噪声掩码推断层,所述通过所述训练样本集合对预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件,包括:
    将所述噪声语音特征输入所述隐藏层,通过所述隐藏层生成中间训练特征;
    将所述中间训练特征输入所述深度聚类层,通过所述深度聚类层生成聚类训练标注;
    将所述中间训练特征输入所述语音掩码推断层,通过所述语音掩码推断层生成干净语音训练特征;
    将所述中间训练特征输入所述噪声掩码推断层,通过所述噪声掩码推断层生成噪声语音训练特征;
    根据所述干净语音标签、所述噪声语音标签、所述深度聚类标注、所述干净语音训练特征、所述噪声语音训练特征以及所述聚类训练标注构建目标损失函数,并根据所述目标损失函数对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件。
  4. 根据权利要求3所述的方法,其中,所述根据所述干净语音标签、所述噪声语音标签、所述深度聚类标注、所述干净语音训练特征、所述噪声语音训练特征以及所述聚类训练标注构建目标损失函数,并根据所述目标损失函数对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件,包括:
    根据所述聚类训练标注和所述深度聚类标注,确定第一损失函数;
    根据所述干净语音训练特征和所述干净语音标签,确定第二损失函数;
    根据所述噪声语音训练特征和所述噪声语音标签,确定第三损失函数;
    根据所述第一损失函数,所述第二损失函数和所述第三损失函数,构建所述预设增强网络的目标损失函数,并根据所述目标损失函数对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件。
  5. 根据权利要求4所述的方法,其中,所述根据所述第一损失函数,所述第二损失函数和所述第三损失函数,构建所述预设增强网络的目标损失函数,包括:
    基于所述第一损失函数,所述第二损失函数和所述第三损失函数分别对应的权重参数,对所述第一损失函数,所述第二损失函数和所述第三损失函数进行加权求和,得到所述预设增强网络的目标损失函数。
  6. 根据权利要求4所述的方法,其中,所述干净语音标签包括第一干净语音标签,所述根据所述干净语音训练特征和所述干净语音标签,确定第二损失函数,包括:
    根据所述干净语音训练特征和所述第一干净语音标签,确定噪声去除损失函数;
    将所述噪声去除损失函数作为第二损失函数,所述第一干净语音标签为基于不带噪声带混响的语音获取的语音标签。
  7. 根据权利要求4所述的方法,其中,所述干净语音标签包括第二干净语音标签,所述根据所述干净语音训练特征和所述干净语音标签,确定第二损失函数,包括:
    根据所述干净语音训练特征和所述第二干净语音标签,确定混响去除损失函数;
    将所述混响去除损失函数作为第二损失函数,所述第二干净语音标签为基于不带噪声不带混响的语音获取的语音标签。
  8. 根据权利要求5、6或7所述的方法,其中,所述根据所述第一损失函数,所述第二损失函数和所述第三损失函数,构建所述预设增强网络的目标损失函数,并根据所述目标损失函数对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件,包括:
    获取应用场景属性;
    根据所述应用场景属性确定对应的分布训练策略;
    基于所述分布训练策略,根据所述第一损失函数,所述第二损失函数和所述第三损失函数,构建所述预设增强网络的目标损失函数,并根据所述目标损失函数对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件。
  9. 根据权利要求8所述的方法,其中,所述分布训练策略包括第一分布训练策略,所述基于所述分布训练策略,根据所述第一损失函数,所述第二损失函数和所述第三损失函数,构建所述预设增强网络的目标损失函数,并根据所述目标损失函数对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件,包括:
    在所述分布训练策略为第一分布训练策略时,根据所述第一损失函数、所述第二损失函数以及所述第三损失函数,确定所述预设增强网络的目标损失函数,并根据所述目标损失函数对所述预设增强网络迭代进行噪声去除训练,直至所述预设增强网络满足预设条件,得到噪声去除网络,其中,所述第二损失函数为噪声去除损失函数确定的;
    根据所述第一损失函数、所述第二损失函数以及所述第三损失函数,确定所述噪声去除网络的目标损失函数,并根据所述目标损失函数对所述噪声去除网络迭代进行混响去除训练,直至所述噪声去除网络满足预设条件,其中,所述第二损失函数为混响去除损失函数确定的。
  10. 根据权利要求8所述的方法,其中,所述分布训练策略包括第二分布训练策略,所述基于所述分布训练策略,根据所述第一损失函数,所述第二损失函数和所述第三损失函数,构建所述预设增强网络的目标损失函数,并根据所述目标损失函数对所述预设增强网络分步进行噪声去除训练以及混响去除训练,直至所述预设增强网络满足预设条件,包括:
    在所述分布训练策略为第二分布训练策略时,根据所述第一损失函数、所述第二损失函数以及所述第三损失函数,确定所述预设增强网络的目标损失函数,并根据所述目标损失函数对所述预设增强网络迭代进行混响去除训练,直至所述预设增强网络满足预设条件,得到混响去除网络,其中,所述第二损失函数为混响去除损失函数确定的;
    根据所述第一损失函数、所述第二损失函数以及所述第三损失函数,确定所述混响去除网络的目标损失函数,并根据所述目标损失函数对所述混响去除网络迭代进行噪声去除训练,直至所述混响去除网络满足预设条件,其中,所述第二损失函数为噪声去除损失函数确定的。
  11. 根据权利要求2所述的方法,其中,所述获取训练样本集合,包括:
    获取第一样本语音,所述第一样本语音为基于麦克风采集的含有噪声和混响的语音;
    对所述第一样本语音进行语音特征提取,得到噪声语音特征;
    获取第二样本语音,所述第二样本语音包括不带噪声带混响的干净语音以及不带噪声不带混响的干净语音;
    对所述第二样本语音进行语音特征提取,得到第一干净语音标签以及第二干净语音标签;
    根据所述第一样本语音以及所述第二样本语音,确定深度聚类标注。
  12. 根据权利要求3至11任一项所述的方法,其中,
    所述预设条件包括以下之一:
    所述目标损失函数的总损失值小于预设值,所述目标损失函数的总损失值不再变化,训练次数达到预设次数。
  13. 根据权利要求1所述的方法,其中,所述语音增强模型包括隐藏层、深度聚类层、语音掩码推断层以及噪声掩码推断层,所述将所述初始语音特征输入至预先训练的语音增强模型,得到所述语音增强模型输出的目标语音特征,包括:
    将所述初始语音特征输入所述隐藏层,通过所述隐藏层生成中间特征;
    将所述中间特征输入所述语音掩码推断层,通过所述语音掩码推断层生成干净语音特征,并将所述干净语音特征作为目标语音特征;
    所述根据所述目标语音特征,计算出去除噪声和混响的目标语音,包括:
    对所述目标语音特征进行特征逆变换,计算出去除噪声和混响的目标语音。
  14. 一种语音处理装置,所述装置包括:
    获取模块,配置为获取通话语音的初始语音特征;
    增强模块,配置为将所述初始语音特征输入至预先训练的语音增强模型,得到所述语音增强模型输出的目标语音特征,所述语音增强模型为基于深度聚类损失函数和掩码推断损失函数进行的分步训练得到;
    计算模型,配置为根据所述目标语音特征,计算出去除噪声和混响的目标语音。
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储有程序代码,所述程序代码可被处理器调用执行如权利要求1~13任一项所述的方法。
  16. 一种计算机设备,包括:
    存储器;
    一个或多个处理器,与所述存储器耦接;
    一个或多个应用程序,其中所述一个或多个应用程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个应用程序配置用于执行如权利要求1~13任一项所述的方法。
  17. 一种计算机程序产品或计算机程序,所述计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在存储介质中,计算机设备的处理器从存储介质读取所述计算机指令,处理器执行所述计算机指令,使得所述计算机执行如权利要求1~13任一项所述的方法。
PCT/CN2023/085321 2022-05-07 2023-03-31 语音处理方法、装置、存储介质、计算机设备及程序产品 WO2023216760A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP23802536.5A EP4404186A1 (en) 2022-05-07 2023-03-31 Speech processing method and apparatus, and storage medium, computer device and program product
US18/658,964 US20240290338A1 (en) 2022-05-07 2024-05-08 Speech processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210495197.5 2022-05-07
CN202210495197.5A CN117059068A (zh) 2022-05-07 2022-05-07 语音处理方法、装置、存储介质及计算机设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/658,964 Continuation US20240290338A1 (en) 2022-05-07 2024-05-08 Speech processing

Publications (1)

Publication Number Publication Date
WO2023216760A1 true WO2023216760A1 (zh) 2023-11-16

Family

ID=88667966

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085321 WO2023216760A1 (zh) 2022-05-07 2023-03-31 语音处理方法、装置、存储介质、计算机设备及程序产品

Country Status (4)

Country Link
US (1) US20240290338A1 (zh)
EP (1) EP4404186A1 (zh)
CN (1) CN117059068A (zh)
WO (1) WO2023216760A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117894319A (zh) * 2024-03-14 2024-04-16 南京土星信息科技有限公司 基于机器学习数据生成的小样本声纹识别模型训练方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117594056A (zh) * 2024-01-18 2024-02-23 深圳市龙芯威半导体科技有限公司 一种基于sift的rnn语音降噪与去混响方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176243A (zh) * 2018-08-10 2019-08-27 腾讯科技(深圳)有限公司 语音增强方法、模型训练方法、装置和计算机设备
CN110600017A (zh) * 2019-09-12 2019-12-20 腾讯科技(深圳)有限公司 语音处理模型的训练方法、语音识别方法、系统及装置
US20200066296A1 (en) * 2018-08-21 2020-02-27 2Hz, Inc Speech Enhancement And Noise Suppression Systems And Methods
US20210074282A1 (en) * 2019-09-11 2021-03-11 Massachusetts Institute Of Technology Systems and methods for improving model-based speech enhancement with neural networks


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117894319A (zh) * 2024-03-14 2024-04-16 南京土星信息科技有限公司 基于机器学习数据生成的小样本声纹识别模型训练方法
CN117894319B (zh) * 2024-03-14 2024-05-17 南京土星信息科技有限公司 基于机器学习数据生成的小样本声纹识别模型训练方法

Also Published As

Publication number Publication date
CN117059068A (zh) 2023-11-14
EP4404186A1 (en) 2024-07-24
US20240290338A1 (en) 2024-08-29

Similar Documents

Publication Publication Date Title
CN110379412B (zh) 语音处理的方法、装置、电子设备及计算机可读存储介质
US11894014B2 (en) Audio-visual speech separation
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
WO2023216760A1 (zh) 语音处理方法、装置、存储介质、计算机设备及程序产品
US20220230651A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
WO2022178942A1 (zh) 情绪识别方法、装置、计算机设备和存储介质
CN108922525B (zh) 语音处理方法、装置、存储介质及电子设备
Xiang et al. A parallel-data-free speech enhancement method using multi-objective learning cycle-consistent generative adversarial network
CN111951823B (zh) 一种音频处理方法、装置、设备及介质
CN114338623B (zh) 音频的处理方法、装置、设备及介质
WO2024027295A1 (zh) 语音增强模型的训练、增强方法、装置、电子设备、存储介质及程序产品
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
JP7548482B2 (ja) 音声通話の制御方法、装置、コンピュータプログラム及び電子機器
CN114333874B (zh) 处理音频信号的方法
US11996114B2 (en) End-to-end time-domain multitask learning for ML-based speech enhancement
CN116741193B (zh) 语音增强网络的训练方法、装置、存储介质及计算机设备
WO2024114303A1 (zh) 音素识别方法、装置、电子设备及存储介质
CN113763978B (zh) 语音信号处理方法、装置、电子设备以及存储介质
CN115083440A (zh) 音频信号降噪方法、电子设备和存储介质
WO2024055751A1 (zh) 音频数据处理方法、装置、设备、存储介质及程序产品
CN113571075B (zh) 音频处理的方法、装置、电子设备和存储介质
Li et al. An improved fully convolutional network based on post-processing with global variance equalization and noise-aware training for speech enhancement
CN115798501A (zh) 一种语音降噪方法、装置及电子设备
CN117894318A (zh) 音频处理模型的训练方法及装置、存储介质、电子设备
CN117649848A (zh) 语音信号的处理设备及方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802536

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023802536

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2023802536

Country of ref document: EP

Effective date: 20240416

ENP Entry into the national phase

Ref document number: 2024532312

Country of ref document: JP

Kind code of ref document: A