WO2021143327A1 - Speech recognition method, apparatus, and computer-readable storage medium - Google Patents

Speech recognition method, apparatus, and computer-readable storage medium

Info

Publication number
WO2021143327A1
WO2021143327A1 · PCT/CN2020/128392 · CN2020128392W
Authority
WO
WIPO (PCT)
Prior art keywords
model
speech
loss function
neural network
target
Prior art date
Application number
PCT/CN2020/128392
Other languages
English (en)
French (fr)
Inventor
王珺
林永业
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to JP2022520112A (JP7282442B2)
Priority to EP20913796.7A (EP4006898A4)
Publication of WO2021143327A1
Priority to US17/583,512 (US20220148571A1)

Classifications

    • G10L21/0308 — Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G06N3/084 — Backpropagation, e.g. using gradient descent
    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 — Voice signal separating
    • G10L25/18 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/60 — Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
    • G06N3/0409 — Adaptive resonance theory [ART] networks
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Combinations of networks
    • G10L2015/025 — Phonemes, fenemes or fenones being the recognition units

Definitions

  • This application relates to the field of speech processing technology, and in particular to a speech recognition method, device, and computer-readable storage medium.
  • Speech recognition technology has made it possible for humans and machines to interact through natural language: a speech signal can be converted into a text sequence.
  • In a typical pipeline, front-end processing such as Speech Separation (SS) and Speech Enhancement (SE) is followed by back-end Automatic Speech Recognition (ASR); that is, the speech signal can first be separated and enhanced by a speech separation enhancement model, and then recognized by a speech recognition model.
  • A speech recognition method, device, and computer-readable storage medium are provided.
  • A speech recognition method, executed by a computer device, includes: obtaining a first loss function of a speech separation enhancement model and a second loss function of a speech recognition model; performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the speech recognition model, so as to obtain a robust representation model; fusing the first loss function and the second loss function to obtain a target loss function; and performing joint training on the speech separation enhancement model, the robust representation model, and the speech recognition model based on the target loss function, the training ending when a preset convergence condition is satisfied.
  • A speech recognition device includes: an intermediate representation learning module, configured to obtain a first loss function of a speech separation enhancement model and a second loss function of a speech recognition model, and to perform back propagation based on the second loss function so as to train an intermediate model bridged between the speech separation enhancement model and the speech recognition model and obtain a robust representation model; a loss fusion module, configured to fuse the first loss function and the second loss function to obtain a target loss function; and a joint training module, configured to jointly train the speech separation enhancement model, the robust representation model, and the speech recognition model based on the target loss function, the training ending when a preset convergence condition is met.
  • A speech recognition method, executed by a computer device, includes: acquiring a target speech stream; extracting an enhanced spectrum of each audio frame in the target speech stream based on a speech separation enhancement model; performing auditory matching on the enhanced spectrum based on a robust representation model to obtain robust features; and recognizing the robust features based on a speech recognition model to obtain the phoneme corresponding to each audio frame; wherein the speech separation enhancement model, the robust representation model, and the speech recognition model are jointly trained.
  • A speech recognition device includes: a speech separation enhancement module, configured to obtain a target speech stream and extract an enhanced spectrum of each audio frame in the target speech stream based on a speech separation enhancement model; an intermediate representation transition module, configured to perform auditory matching on the enhanced spectrum based on a robust representation model to obtain robust features; and a speech recognition module, configured to recognize the robust features based on a speech recognition model to obtain the phoneme corresponding to each audio frame; wherein the speech separation enhancement model, the robust representation model, and the speech recognition model are jointly trained.
  • One or more non-volatile storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the processors execute the steps of the speech recognition method.
  • A computer device includes a memory and a processor; the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor executes the steps of the speech recognition method.
  • Figure 1 is an application environment diagram of a speech recognition method in an embodiment
  • Figure 2 is a schematic flowchart of a speech recognition method in an embodiment
  • Figure 3 is a schematic diagram of a model architecture for bridging a speech separation enhancement model and a speech recognition model based on a robust representation model in an embodiment
  • Figure 4 is a schematic flowchart of the steps of pre-training a speech processing model in an embodiment
  • Figure 5 is a schematic flowchart of the steps of constructing an intermediate model in an embodiment
  • Figure 6 is a schematic flowchart of the steps of pre-training a speech recognition model in an embodiment
  • Figure 7 is a schematic flowchart of a speech recognition method in a specific embodiment
  • Figure 8 is a schematic flowchart of a speech recognition method in an embodiment
  • Figure 9a is a schematic diagram comparing word error rates of different speech recognition methods on speech from two acoustic environments under five signal-to-noise ratio (SNR) conditions in an embodiment
  • Figure 9b is a schematic diagram comparing the performance of different speech recognition systems under different SNR conditions in an embodiment
  • Figure 10 is a schematic flowchart of a speech recognition method in a specific embodiment
  • Figure 11 is a structural block diagram of a speech recognition device in an embodiment
  • Figure 12 is a structural block diagram of a speech recognition device in another embodiment
  • Figure 13 is a structural block diagram of a speech recognition device in an embodiment.
  • Figure 14 is a structural block diagram of a computer device in an embodiment.
  • Fig. 1 is an application environment diagram of a training method of a speech recognition model in an embodiment.
  • the speech recognition method is applied to a model training system.
  • the speech recognition model training system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, and a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers. Both the terminal 110 and the server 120 can be used independently to execute the voice recognition method provided in the embodiment of the present application.
  • the terminal 110 and the server 120 can also be used in cooperation to execute the voice recognition method provided in the embodiment of the present application.
  • the solutions provided in the embodiments of the present application involve technologies such as artificial intelligence speech recognition.
  • Key speech technologies include speech separation (SS), speech enhancement (SE), and automatic speech recognition (ASR). Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice has become one of the most promising human-computer interaction methods.
  • the embodiments of the present application involve a joint model for speech processing.
  • the joint model includes three models for different aspects of speech processing, including a front-end speech separation enhancement model and a back-end speech recognition model, as well as a robust representation model bridging between the speech separation enhancement model and the speech recognition model.
  • Each of the three models can be a machine learning model.
  • a machine learning model is a model that has a certain ability after learning from samples. Specifically, it can be a neural network model, such as a CNN (Convolutional Neural Networks, convolutional neural network) model, a RNN (Recurrent Neural Networks, recurrent neural network) model, and so on.
  • the machine learning model can also adopt other types of models.
  • each link can adopt the optimal configuration without compromising the performance of any link.
  • each of the three models involved in this application can freely choose a dedicated model that is good at the corresponding field.
  • The speech separation enhancement model and the speech recognition model can be pre-trained separately, so this application can train a joint model that includes the robust representation model on the basis of the pre-trained speech separation enhancement model and speech recognition model; a convergent joint model can thus be obtained with fewer training iterations.
  • For the pre-training process of the speech separation enhancement model and the speech recognition model, and the process of jointly training them with the robust representation model, please refer to the detailed description in the subsequent embodiments.
  • a voice recognition method is provided.
  • the method is mainly applied to a computer device as an example.
  • the computer device may be the terminal 110 or the server 120 in the above figure.
  • the voice recognition method specifically includes the following steps:
  • S202 Acquire a first loss function of the speech separation enhancement model and a second loss function of the speech recognition model.
  • The speech separation enhancement model is a model that, after training, has the ability to separate and/or enhance speech; it can be obtained by using sample voice streams as training data and learning, from that training data, to separate the target voice in a sample voice stream from background interference.
  • The speech separation enhancement model may also have at least one of the preprocessing capabilities of Voice Activity Detection (VAD), echo cancellation, reverberation cancellation, or sound source localization, which is not limited here.
  • the speech separation enhancement model can be divided into a mono (single microphone) separation enhancement model and an array (multiple microphone) separation enhancement model.
  • The main methods of monaural separation include speech enhancement and Computational Auditory Scene Analysis (CASA).
  • Speech enhancement estimates clean speech by analyzing the target speech signal and the interference signal in the mono mixed signal and performing noise estimation on the noisy speech; mainstream speech enhancement methods include spectral subtraction and so on.
  • Computational auditory scene analysis is based on the perception theory of auditory scene analysis and uses grouping cues for speech separation.
  • The main methods of array separation include beamforming and spatial filtering. Beamforming enhances the speech signal arriving from a specific direction through an appropriate array structure, thereby reducing interference from speech signals arriving from other directions; an example is the delay-and-sum technique. Speech separation and enhancement are human-oriented speech processing tasks, and their mainstream performance metrics include Short-Time Objective Intelligibility (STOI) and so on.
  • the speech separation enhancement model and the speech recognition model may be pre-trained separately.
  • the pre-trained speech separation enhancement model and speech recognition model each have a fixed model structure and model parameters.
  • Speech recognition is a machine-oriented speech processing task. In automatic speech recognition applications such as smart speakers, virtual digital assistants, and machine translation, characterization parameters that are more efficient for machine understanding are often used, such as Mel filter banks (Fbanks) and Mel-Frequency Cepstral Coefficients (MFCC).
  • the mainstream performance metrics of speech recognition models include word error rate (Word Error Rate, WER), character error rate (Character Error Rate, CER), sentence error rate (Sentence Error Rate, SER), and so on.
  • Specifically, the computer device obtains the pre-trained speech separation enhancement model and speech recognition model, the first loss function used when pre-training the speech separation enhancement model, and the second loss function used when pre-training the speech recognition model.
  • A loss function is usually associated with an optimization problem as a learning criterion, that is, the model is solved and evaluated by minimizing the loss function; for example, loss functions are used for parameter estimation of models in statistics and machine learning.
  • The first loss function used by the pre-trained speech separation enhancement model and the second loss function used by the pre-trained speech recognition model may each be, for example, mean square error, mean absolute error, Log-Cosh loss, or quantile loss.
  • The first loss function and the second loss function may each also be a combination of multiple loss functions.
  • S204 Perform back propagation based on the second loss function to train the intermediate model bridged between the speech separation enhancement model and the speech recognition model to obtain a robust representation model.
  • The characterization parameters and performance metrics used in the front-end speech separation task are human-oriented, that is, they target the subjective auditory intelligibility of a listener; whereas the characterization parameters and performance metrics used by the back-end speech recognition task are machine-oriented, that is, they target the accuracy of machine recognition.
  • Bridging means that an object is between at least two objects and connects the at least two objects. That is, for an object B, if the object is bridged between A and C, it means that the object B is located between A and C, one end of B is connected to A, and the other end is connected to C.
  • The intermediate model is bridged between the speech separation enhancement model and the speech recognition model, meaning that the output of the speech separation enhancement model is the input of the intermediate model, and the data output by the intermediate model after processing is the input of the speech recognition model.
  • FIG. 3 shows a schematic diagram of a model architecture for bridging a speech separation enhancement model and a speech recognition model based on a robust representation model in an embodiment.
  • the embodiment of the present application bridges the intermediate model to be trained between the speech separation enhancement model and the speech recognition model.
  • the trained intermediate model is robust and can be called a robust representation model.
  • The intermediate model to be trained, the pre-trained speech separation enhancement model, and the speech recognition model may all be models composed of artificial neural networks. Artificial Neural Networks (ANNs) are also referred to as Neural Networks (NNs) or connectionist models.
  • An artificial neural network abstracts the neuron network of the human brain from the perspective of information processing to build a model, and forms different networks according to different connection methods; in engineering and academia it is often simply called a neural network or neural-like network. Examples of neural network models include the Convolutional Neural Network (CNN) model, the Deep Neural Network (DNN) model, and the Recurrent Neural Network (RNN) model.
  • the speech separation enhancement model can also be a combination of multiple neural network models.
  • A convolutional neural network includes convolutional layers and pooling layers.
  • A deep neural network includes an input layer, hidden layers, and an output layer, with full connections between adjacent layers.
  • A recurrent neural network is a neural network that models sequence data, that is, the current output of a sequence is also related to the previous outputs.
  • Concretely, the network memorizes previous information and applies it to the calculation of the current output: the nodes between hidden layers are connected rather than unconnected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • Recurrent neural network models include the LSTM (Long Short-Term Memory) model and the BiLSTM (Bi-directional Long Short-Term Memory) model.
  • The speech separation enhancement model used for speech separation and enhancement may also be called the Extractor, the robust representation model used for intermediate transition representation learning may also be called the Adapter, and the speech recognition model used for phoneme recognition may also be called the Recognizer.
  • The speech processing system composed of the Extractor, the Adapter, and the Recognizer is called the EAR system.
  • the computer device determines the local descent gradient generated by the second loss function in each iteration process according to a preset deep learning optimization algorithm.
  • The deep learning optimization algorithm may specifically be Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent (MBGD), AdaGrad (adaptive gradient), RMSProp (Root Mean Square Propagation), Adam (Adaptive Moment Estimation), or the like.
  • Taking the stochastic gradient descent method as an example, suppose L_1 and L_2 are the first loss function and the second loss function, respectively, f(x, θ_adapt) denotes the intermediate model with input x and model parameters θ_adapt, and y is the output target value of the speech recognition model when the intermediate model input is x. The sample speech stream contains n audio frames {x^(1), ..., x^(n)}, where the target corresponding to x^(i) is y^(i). The local descent gradient for each iteration is then g = (1/n) ∇_{θ_adapt} Σ_i L_2(f(x^(i), θ_adapt), y^(i)). Assuming the learning rate of the stochastic gradient descent algorithm is ε, the model parameters are updated to θ_adapt - εg, and the updated parameters are used as the current model parameters of the intermediate model to continue iterating until the preset training stop condition is reached.
  • The training stop condition may be that the loss value of the second loss function reaches a preset minimum value, or that the model performance of the intermediate model does not improve significantly over a preset number of consecutive iterations.
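  • As an illustration only (not part of the original disclosure), the following PyTorch-style sketch shows one way such an update could look; `extractor`, `adapter`, and `recognizer` are hypothetical module names, and the optimizer is built over the adapter's parameters only so the pre-trained recognizer stays unchanged.

```python
import torch

def train_adapter_step(extractor, adapter, recognizer, batch, optimizer, ce_loss):
    """One SGD step on the adapter only, driven by the recognizer's loss (L2)."""
    mixed_spec, phoneme_labels = batch               # spectra and frame-level labels
    with torch.no_grad():                            # the front-end extractor stays fixed
        enhanced = extractor(mixed_spec)             # enhanced spectrum from the front end
    robust_feat = adapter(enhanced)                  # intermediate (robust) representation
    logits = recognizer(robust_feat)                 # phoneme posteriors, shape (N, T, C)
    loss = ce_loss(logits.transpose(1, 2), phoneme_labels)  # L2: recognition loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # updates adapter parameters only
    return loss.item()

# Example setup (illustrative):
# optimizer = torch.optim.SGD(adapter.parameters(), lr=1e-4)
# ce_loss = torch.nn.CrossEntropyLoss()
```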
  • Although the training data passes through the speech recognition model, there is no need to adjust or update the model parameters of the pre-trained speech recognition model.
  • users can choose the specific intermediate model, speech separation enhancement model and speech recognition model to be used flexibly and independently according to model preference or accuracy requirements, which allows users to flexibly introduce new advanced models according to their own wishes. Speech separation/enhancement and speech recognition technology. In other words, each of the three models involved in this application can freely choose a dedicated model that is good at the corresponding field.
  • For example, assume the models that are good at speech separation include A1, ..., Ai, the models that are good at robust representation learning include B1, ..., Bj, and the models that are good at speech recognition include C1, ..., Ck, where i, j, and k are all positive integers; the joint model to be trained can then be any combination Ai + Bj + Ck.
  • the local descent gradient here is relative to the global descent gradient involved in the following joint training, and cannot be considered as a partial value of the descent gradient value determined according to the second loss function.
  • S206 Fuse the first loss function and the second loss function to obtain a target loss function.
  • The target loss function is a comprehensive loss function formed by combining the first loss function and the second loss function.
  • Function fusion is the process of converting multiple functions into one function through one or more preset logical operations.
  • The preset logical operations include, but are not limited to, the four basic arithmetic operations, weighted summation, or machine learning algorithms.
  • the weighting factor can be a value set based on experience or experiment, such as 0.1. It is easy to find that by adjusting the weighting factor, the importance of the speech separation enhancement model can be adjusted in the joint training of multiple models.
  • the computer device presets one or more fusion calculation formulas, and sets the input format of each parameter factor in the fusion calculation formula.
  • the first loss function and the second loss function are respectively used as a parameter factor and input different fusion calculation formulas to obtain different target loss functions.
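  • As a minimal sketch (not taken from the original text) of one possible weighted-summation fusion, assuming a single weighting factor gamma applied to the separation loss:

```python
def fuse_losses(l1_separation, l2_recognition, gamma=0.1):
    """Weighted-sum fusion of the two losses into one target loss.

    gamma weights the speech separation enhancement loss; raising or lowering it
    adjusts the 'importance' of the separation task during joint training
    (0.1 is the example weighting factor mentioned in the text).
    """
    return l2_recognition + gamma * l1_separation
```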
  • S208 Perform joint training on the speech separation enhancement model, the robust representation model, and the speech recognition model based on the target loss function, and end the training when the preset convergence condition is met.
  • the speech separation enhancement model, robust representation model, and speech recognition model can all be models composed of artificial neural networks.
  • In this way, the model architecture for speech processing provided in this application is entirely based on neural networks, so end-to-end joint training can be realized.
  • The entire end-to-end joint training process does not artificially divide the task into stages; instead, the whole speech processing task is handed over to the neural network models, which directly learn the mapping from the original speech signal to the desired output.
  • the computer device determines the global descent gradient generated by the target loss function according to a preset deep learning optimization algorithm, for example, the loss value is calculated based on the target loss function, and the global descent gradient is determined based on the loss value.
  • the deep learning optimization algorithm used to determine the local descent gradient and the deep learning optimization algorithm used to determine the global descent gradient may be the same or different.
  • The global descent gradient generated by the target loss function is backpropagated sequentially from the speech recognition model to the robust representation model and then to the speech separation enhancement model.
  • In this process, the model parameters corresponding to the speech separation enhancement model, the robust representation model, and the speech recognition model are each updated iteratively, and the training ends when the preset training stop condition is met.
  • In one embodiment, the joint training of the speech separation enhancement model, the robust representation model, and the speech recognition model based on the target loss function includes: determining the global descent gradient generated by the target loss function; and iteratively updating the model parameters corresponding to the speech separation enhancement model, the robust representation model, and the speech recognition model until the minimum loss value of the target loss function is obtained.
  • Specifically, denoting the joint model parameters by θ = {θ_extract, θ_adapt, θ_recog} and the global descent gradient generated by the target loss function by g, the gradient is backpropagated all the way to the speech separation enhancement model, the model parameters are updated to θ - εg (with learning rate ε), and the updated parameters are used as the current model parameters of the joint model to continue iterating until the preset training stop condition is reached.
  • The training stop condition may be that the loss value of the target loss function reaches a preset minimum value, or that the model performance does not improve significantly over a preset number of consecutive iterations.
  • In a specific embodiment, the batch size of the sample speech stream may be 24, the initial learning rate ε may be 10^-4, and the decay coefficient of the learning rate may be 0.8; when the loss of the target loss function does not improve for 3 consecutive iterations, the joint model is considered to have converged and the joint training ends.
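  • Purely as an illustrative sketch (module and variable names are hypothetical, not from the patent), an end-to-end joint training step driven by the fused target loss might look like this in PyTorch:

```python
import torch

def joint_train_step(extractor, adapter, recognizer, batch,
                     optimizer, mse_loss, ce_loss, gamma=0.1):
    """One joint update of all three models using the fused target loss."""
    mixed_spec, target_spec, phoneme_labels = batch
    est_spec = extractor(mixed_spec)                       # front-end estimated/enhanced spectrum
    robust_feat = adapter(est_spec)                        # intermediate robust representation
    logits = recognizer(robust_feat)                       # phoneme posteriors, shape (N, T, C)

    l1 = mse_loss(est_spec, target_spec)                   # first loss: separation/enhancement
    l2 = ce_loss(logits.transpose(1, 2), phoneme_labels)   # second loss: recognition
    target_loss = l2 + gamma * l1                          # fused target loss

    optimizer.zero_grad()
    target_loss.backward()                                 # gradient flows through all three models
    optimizer.step()
    return target_loss.item()

# Example setup (illustrative):
# params = (list(extractor.parameters()) + list(adapter.parameters())
#           + list(recognizer.parameters()))
# optimizer = torch.optim.Adam(params, lr=1e-4)
```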
  • The above speech recognition method proposes a new end-to-end network architecture that introduces a robust representation model as an intermediate transition between the front-end speech separation enhancement model and the back-end speech recognition model. This architecture introduces an appropriate intermediate transitional representation learning technique to bridge the difference between the human-oriented speech separation task and the machine-oriented speech recognition task.
  • In this network architecture, the intermediate model is trained by back propagation of the second loss function of the back-end speech recognition model, and the speech separation enhancement model and the speech recognition model can be pre-trained in advance, so convergence can be achieved after a small number of training iterations.
  • The end-to-end network model is jointly trained based on the fusion of the loss functions of the front-end and back-end models, so that each individual model in the network architecture can comprehensively learn the interference features in speech signals from complex acoustic environments, thereby ensuring the performance of the global speech processing task and improving the accuracy of speech recognition. In addition, since each model in the network architecture supports flexible and independent selection, each model can be freely chosen or upgraded as needed.
  • the above-mentioned speech recognition method further includes a step of pre-training the speech separation enhancement model, which is specifically as follows:
  • S402 Extract the estimated frequency spectrum and embedded feature matrix of the sample voice stream based on the first neural network model.
  • the first neural network model, the second neural network model and the third neural network model mentioned below may be any one of the above-mentioned artificial neural network models.
  • For example, the first neural network model may be a Deep Extractor Network (DENet), obtained by simplifying a Deep Attractor Network (DANet) on the basis of an Ideal Ratio Mask (IRM).
  • The DENet network includes one or more neural networks; for example, a BiLSTM network may be adopted.
  • the BiLSTM network is used to map the speech signal from a low-dimensional space to a high-dimensional space.
  • Sample voice streams can be audio data streams collected by voice applications in devices such as vehicle-mounted systems, teleconference equipment, speaker equipment, or online broadcasting equipment in different complex acoustic environments. Voice applications can be system phone applications, instant messaging applications, virtual voice assistants, or machine translation applications. Each sample audio stream may include multiple audio frames.
  • the sampling frequency of collecting audio frames in the sample audio stream and the frame length and frame shift of each audio frame can be set freely according to requirements. In a specific embodiment, a sampling frequency of 16 kHz, a frame length of 25 ms, and a frame shift of 10 ms can be used for audio frame collection.
  • the computer device may perform short-time Fourier transform on multiple sample voice streams in batches to obtain the voice features and voice frequency spectrum in each sample voice stream.
  • the batch size of the sample audio stream can be freely set according to requirements, such as 24.
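  • For concreteness (an illustration using the framing parameters mentioned above, not code from the patent), the short-time Fourier transform of a 16 kHz stream with 25 ms frames and a 10 ms frame shift could be computed as follows:

```python
import numpy as np
from scipy.signal import stft

def speech_spectrum(waveform, sr=16000, frame_ms=25, shift_ms=10):
    """STFT magnitude spectrum with 25 ms frames and a 10 ms frame shift."""
    nperseg = int(sr * frame_ms / 1000)          # 400 samples per frame at 16 kHz
    hop = int(sr * shift_ms / 1000)              # 160-sample frame shift
    _, _, Z = stft(waveform, fs=sr, nperseg=nperseg, noverlap=nperseg - hop)
    return np.abs(Z).T                           # shape: (num_frames T, num_freq_bins F)
```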
  • the computer device maps the voice features of the batch sample voice stream to a higher-dimensional embedding space, and converts to obtain an embedded feature matrix.
  • the computer equipment separates and enhances the speech frequency spectrum based on the first neural network model to obtain the estimated frequency spectrum.
  • the estimated frequency spectrum is the frequency spectrum of the sample voice stream output by the first neural network model.
  • S404 Determine an attractor corresponding to the sample voice stream according to the embedded feature matrix and the preset ideal masking matrix.
  • the ideal masking matrix is an adaptive perceptual masking matrix established to constrain the noise energy and speech distortion energy in the speech signal, and it records the masking thresholds corresponding to different speech frequencies.
  • the ideal masking matrix can be predicted based on the low-dimensional speech features of the speech signal and the high-dimensional embedded feature matrix.
  • the attractor is a feature vector that can characterize the universal features of each sample voice stream in the embedding space.
  • The speech separation enhancement model based on the DANet network calculates the weighted average of the embedding-space vectors of the target speech training samples and stores it as the "attractor" of the target speech; therefore, only one attractor needs to be calculated in the embedding space.
  • the computer device predicts the ideal masking matrix corresponding to the batch of sample voice streams based on the voice signal and the voice spectrum extracted from the voice signal through the short-time Fourier transform.
  • the ideal masking matrix and the embedded feature matrix are in the same dimension of the embedding space.
  • the computer device calculates the product of the embedded feature matrix and the ideal masking matrix, and determines the attractor of the embedded space based on the product result.
  • S406 Obtain the target masking matrix of the sample voice stream by calculating the similarity between each matrix element in the embedded feature matrix and the attractor.
  • the computer device combines the similarity between the voice feature and the attractor to modify the masking threshold to reconstruct the ideal masking matrix to obtain the target masking matrix.
  • the method for measuring the similarity between each matrix element in the embedded feature matrix and the attractor can specifically adopt Euclidean distance, Manhattan distance, Chebyshev distance, Mahalanobis distance, cosine distance, or Hamming distance.
  • S408 Determine an enhanced frequency spectrum corresponding to the sample voice stream according to the target masking matrix.
  • the voice signal collected in the real acoustic scene is usually a mixed signal in which noise is mixed into the target voice.
  • the enhanced spectrum corresponding to the sample voice stream may be the enhanced spectrum of the target voice in the voice signal.
  • the high-dimensional embedded feature matrix is subjected to dimensionality reduction processing, and converted back to a low-dimensional enhanced spectrum.
  • The computer device calculates the mean-square error (MSE) loss between the estimated spectrum of the batch of sample voice streams and the enhanced spectrum of the target speech, and pre-trains the first neural network model through this MSE loss: L_1 = (1/M) Σ_{i=1}^{M} ‖ Ŝ_s^(i) - S_s^(i) ‖_2^2, where M is the batch size of the mixed-signal sample voice streams used for training, i is the index of a training sample voice stream, ‖·‖_2 is the 2-norm of a vector, Ŝ_s is the estimated spectrum output directly by the first neural network model, and S_s is the enhanced spectrum of the target speech.
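  • A minimal numpy sketch of the batch MSE pre-training loss just defined (illustrative only; array names are assumptions):

```python
import numpy as np

def extractor_pretrain_loss(est_specs, target_specs):
    """Batch MSE loss L1 between estimated spectra and target enhanced spectra.

    est_specs, target_specs: arrays of shape (M, T, F) -- batch, frames, freq bins.
    """
    diff = est_specs - target_specs
    # squared 2-norm per sample, averaged over the batch of size M
    return np.mean(np.sum(diff ** 2, axis=(1, 2)))
```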
  • the ideal ratio mask IRM is an effective method for speech separation and enhancement.
  • the ideal masking matrix based on IRM can constrain the noise energy and speech distortion energy in the speech signal, combined with the high-dimensional embedded feature matrix corresponding to the speech signal and the representative Its universal characteristic attractor reconstructs the ideal masking matrix, and performs spectrum extraction based on the reconstructed target masking matrix, which can make the extracted estimated spectrum closer to the enhanced spectrum of the sample speech stream and improve the effectiveness of spectrum extraction.
  • extracting the estimated spectrum and embedded feature matrix of the sample voice stream based on the first neural network model includes: performing Fourier transform on the sample voice stream to obtain the voice spectrum and voice features of each audio frame; The neural network model performs speech separation and enhancement on the speech frequency spectrum to obtain an estimated frequency spectrum; based on the first neural network model, the speech feature is mapped to the embedding space to obtain an embedded feature matrix.
  • the voice signal collected in the real acoustic scene is usually a mixed signal mixed with noise.
  • the short-time Fourier transform STFT calculation is performed on the mixed signal and the reference target speech signal, and the speech frequency spectrum and speech characteristics corresponding to the mixed signal can be obtained.
  • The voice feature may be a feature matrix in the low-dimensional mixed-signal space R^(T×F).
  • The feature dimension of the speech feature extracted by the Fourier transform is T×F, where T is the number of frames and F is the number of mel filter bands in the mel filter bank.
  • DENet maps the speech features from the mixed-signal space R^(T×F) to the higher-dimensional embedding space R^(T×F×K) through the BiLSTM network, so that the output becomes the embedded feature matrix V ∈ R^(T×F×K).
  • The embedding vector dimension K used for the high-dimensional mapping can be set freely, for example, to 40.
  • Specifically, the first neural network model may be obtained by cascading a preset number of peephole-connected BiLSTM models with a fully connected layer.
  • A peephole connection is a connection method that differs from the conventional cascade and allows more contextual information to be obtained: in a standard LSTM, the gates of the forward LSTM and the backward LSTM are controlled only by the current input x(t) and the short-term state h(t-1) at the previous moment, whereas peephole connections also expose the long-term (cell) state to the gate controllers.
  • the first neural network model may adopt a four-layer BiLSTM connected by peepholes, each layer has 600 hidden nodes, and a fully connected layer is connected after the last BiLSTM layer. The fully connected layer is used to map the 600-dimensional speech feature vector to a high-dimensional embedded feature matrix.
  • the 600-dimensional speech feature vector can be mapped to the 24,000-dimensional embedded feature vector.
  • the low-dimensional voice features of the voice signal are mapped into a high-dimensional embedded feature matrix, which can ensure the effect of the first neural network model for voice separation and enhancement.
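  • A minimal PyTorch sketch of such an embedding network (dimensions follow the example figures above; class and variable names are illustrative, and standard nn.LSTM has no peephole connections, so this is a simplification):

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """4-layer BiLSTM followed by a fully connected layer that maps each frame's
    600-dimensional feature to a high-dimensional embedding (e.g. 24,000 = 600 * K
    with K = 40), reshaped into the embedded feature matrix V."""
    def __init__(self, feat_dim=600, hidden=600, k=40):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=4,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, feat_dim * k)   # per-frame high-dim embedding
        self.k = k

    def forward(self, x):                 # x: (batch, T, feat_dim)
        h, _ = self.blstm(x)              # (batch, T, 2 * hidden)
        v = self.fc(h)                    # (batch, T, feat_dim * K)
        b, t, _ = v.shape
        return v.view(b, t, -1, self.k)   # embedded feature matrix: (batch, T, feat_dim, K)
```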
  • In one embodiment, determining the attractor of the sample voice stream according to the embedded feature matrix and the preset ideal masking matrix includes: determining the ideal masking matrix according to the voice frequency spectrum and voice features; filtering noise elements from the ideal masking matrix based on a preset binary threshold matrix; and determining the attractor corresponding to the sample voice stream according to the embedded feature matrix and the noise-filtered ideal masking matrix.
  • The attractor in the embedding space can be calculated, for example, as a_s = Σ_{t,f} [ V ⊙ (M_s ⊙ w) ]_{tf} / Σ_{t,f} (M_s ⊙ w)_{tf}, where a_s ∈ R^K, ⊙ denotes element-wise (matrix element) multiplication, M_s is the ideal masking matrix, and w ∈ R^(T×F) is the binary threshold matrix.
  • The binary threshold matrix can be calculated, for example, as w_{tf} = 1 if the energy of the spectrum at time-frequency bin (t, f) exceeds a preset threshold, and w_{tf} = 0 otherwise; the binary threshold matrix w is thus used to eliminate matrix elements with too small an energy in the ideal masking matrix, so as to reduce noise interference.
  • Then, by calculating the similarity between the attractor and each matrix element in the embedded feature matrix, the masking matrix of the target speech, referred to as the target masking matrix, can be estimated, for example as M̂_{tf} = Sigmoid(a_s · V_{tf}).
  • The enhanced spectrum of the target speech can then be extracted by applying the target masking matrix to the mixed spectrum, for example as Ŝ_s = M̂ ⊙ X, where X is the speech spectrum of the mixed signal.
  • In one embodiment, the attractors calculated during the training stage of the first neural network model are stored and their average value is calculated; this average is then used as the global attractor in the test (production) stage to extract the enhanced spectrum of the target voice stream under test.
  • the attractor calculation is performed after filtering out the noise elements in the ideal masking matrix, which can improve the accuracy of attractor calculation, and make the calculated attractor better reflect the voice characteristics of the voice data.
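  • The following numpy sketch (illustrative only; the variable names and the sigmoid choice for the similarity-to-mask mapping are assumptions, not taken from the patent) shows the attractor, target-mask, and enhanced-spectrum computations described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_enhanced_spectrum(V, M_s, w, X):
    """DANet-style extraction.

    V   : (T, F, K) embedded feature matrix
    M_s : (T, F)    ideal masking matrix
    w   : (T, F)    binary threshold matrix (0/1) filtering low-energy bins
    X   : (T, F)    magnitude spectrum of the mixed signal
    """
    weights = M_s * w                                        # noise-filtered ideal mask
    a_s = np.tensordot(weights, V, axes=([0, 1], [0, 1])) / (weights.sum() + 1e-8)  # (K,) attractor
    similarity = np.tensordot(V, a_s, axes=([2], [0]))       # (T, F) similarity of each bin to the attractor
    target_mask = sigmoid(similarity)                        # estimated target masking matrix
    return target_mask * X                                   # enhanced spectrum of the target speech
```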
  • the above-mentioned speech recognition method further includes a step of constructing an intermediate model, which is specifically as follows:
  • the second neural network model is a model bridged between the front-end speech separation enhancement model and the back-end speech recognition model.
  • The acoustic environment faced by this application is very complex, and it is necessary to minimize the impact on speech recognition of errors coming from the front end when the input spectrogram is a defective spectrum that includes spectrum estimation errors and temporal distortions.
  • the context difference between frame-level spectrogram extraction and phoneme-level speech recognition tasks also increases the time dynamic complexity of the fusion of front-end and back-end speech processing tasks.
  • the joint model obtained by bridge training based on the second neural network model provided in this application can adapt to more complex acoustic environments.
  • the second neural network model uses a more complex Recurrent model architecture.
  • a typical Recurrent model architecture includes a model structure that can use the context of the input spectrogram to predict points in the output acoustic feature space, such as deep convolutional neural network CNN or BiLSTM.
  • The BiLSTM model is usually regarded as a universal function approximator, which can learn intermediate representations by effectively estimating the conditional posterior probability of the complete sequence without making any explicit assumptions about its distribution.
  • In the embodiment of the present application, the second neural network model adopting the BiLSTM model structure f_BiLSTM(·) is taken as an example for description.
  • the second neural network model may be obtained by peephole connecting a preset number of BiLSTM models.
  • the second neural network model may adopt a two-layer BiLSTM connected by a peephole, and each layer has 600 hidden nodes.
  • S504 Perform non-negative constraint processing on the second neural network model to obtain a non-negative neural network model.
  • the non-negative constraint processing is a processing step that can ensure that the second neural network model is non-negative.
  • The filter-bank features (Fbanks) output by the mel filter bank are non-negative, while the output of a standard BiLSTM has no non-negativity constraint.
  • the embodiment of the present application performs non-negative constraint processing on the second neural network model.
  • In one embodiment, performing non-negative constraint processing on the second neural network model includes performing a squaring operation on the output of the second neural network model; the second neural network model includes a bidirectional long short-term memory network model.
  • Specifically, the computer device adds a squaring operation to the output of the second neural network model to match the non-negativity of Fbanks. Evaluation shows that the squaring operation not only requires little computation, but also provides a better nonlinear transformation for the second neural network model than activation functions such as the Rectified Linear Unit (ReLU).
  • S506 Obtain a differential model for auditory adaptation of the acoustic features output by the non-negative neural network model; cascade the differential model and the non-negative neural network model to obtain an intermediate model.
  • auditory adaptation refers to simulating the operation of the human ear to make the acoustic characteristics conform to the hearing habits of the human ear.
  • the differential model is a calculation formula that simulates the operation of the human ear. Research has found that for high-amplitude speech signals and low-amplitude speech signals with very large spectral amplitude differences, the difference that human ears can perceive may not be as obvious as the amplitude difference. For example, for two speech signals with amplitudes of 1000 and 10, the difference that the human ear can perceive may only be the difference such as 3 and 1. In addition, the human ear is more sensitive to changes in the voice signal.
  • the computer device obtains the pre-built differential model, uses the differential model as a process step of performing auditory matching optimization on the acoustic features output by the non-negative neural network model, and cascades it after the non-negative neural network model to obtain an intermediate model.
  • intermediate models include non-negative neural network models and differential models.
  • the logic of simulating human ear operation is embodied in the form of a differential model.
  • the second neural network model does not need to learn the logic of simulating human ear operation, which reduces the learning complexity of the second neural network model and helps improve Training efficiency of the intermediate model.
  • the second neural network model can be directly used as an intermediate model, without the need for non-negative constraint processing on the second neural network model, and no need for splicing of differential models.
  • the second neural network model needs to learn and simulate the operation logic of the human ear by itself.
  • In this case, self-learning by the second neural network model can capture a more comprehensive simulation of human-ear operation logic and achieve a better auditory matching effect.
  • The second neural network model (i.e., the robust representation model) obtained in this way can, in the test/production phase, adapt to more numerous and more complex acoustic environments.
  • In this embodiment, non-negative constraint processing is performed on the second neural network model and the differential model used to simulate the operation of the human ear is spliced on, so that the acoustic features output by the model better match the actual hearing habits of the human ear, which helps improve the speech recognition performance of the entire EAR system.
  • In the above speech recognition method, obtaining the differential model for auditory adaptation of the acoustic features output by the non-negative neural network model includes: obtaining a logarithmic model used to perform a logarithmic operation on the feature vector corresponding to the acoustic features; obtaining a difference model used to perform a difference operation on the feature vector corresponding to the acoustic features; and constructing the differential model based on the logarithmic model and the difference model.
  • the logarithmic model is used to perform logarithmic operations on the feature vector elements of the acoustic features output by the non-negative neural network model.
  • The logarithmic model can be any model that implements an element-wise logarithmic operation, such as lg x or ln x, where x is an acoustic feature vector element.
  • Performing a logarithmic operation on the feature vector elements of the acoustic features compresses the differences between amplitude values, so that the differences between different vector elements of the acoustic features better reflect the actual signal differences that the human ear can perceive.
  • For example, with a base-10 logarithm the vector element 1000 is converted to 3 and the vector element 10 is converted to 1, which better reflects the signal difference the human ear can actually perceive.
  • The difference model is used to perform a difference operation on the feature vector elements of the acoustic features output by the non-negative neural network model.
  • the difference model can be any model that can implement element difference operations, such as first-order difference operations and second-order difference operations.
  • the human ear is more sensitive to changes in the speech signal.
  • the difference operation is performed on the feature vector elements of the acoustic features based on the difference model, and the difference result reflects the changes between the different vector elements of the acoustic features.
  • the computer device may use the logarithmic model and the difference model as two parallel models to construct a differential model, or cascade the logarithmic model and the difference model to construct a differential model.
  • the cascading sequence of the logarithmic model and the differential model can be that the logarithmic model is cascaded after the differential model, or the differential model is cascaded after the logarithmic model.
  • the differential model may also include other models for auditory adaptation, which is not limited.
  • Denoting by θ_adapt the model parameters of the intermediate model, the intermediate model obtained by performing non-negative constraint processing on the second neural network model and concatenating the differential model can be written, for example, as f_adapt(·) = Diff( log( f_BiLSTM(·; θ_adapt)^2 ) ), where f_BiLSTM(·; θ_adapt) is the second neural network model itself.
  • the computer device may also perform global mean variance normalization processing on the vector elements of the acoustic features.
  • The method used for the normalization process may specifically be 0-1 (min-max) normalization, Z-score standardization, or sigmoid function normalization.
  • the computer device may also splice the acoustic characteristics of each audio frame in a context window of 2W+1 frames centered on the current audio frame in the sample audio stream.
  • W represents the size of the one-sided context window, and the specific size can be freely set according to requirements, such as 5.
  • In this embodiment, the logarithmic operation applied to the output of the non-negative neural network model makes the differences between different vector elements of the acoustic features of the speech signal better reflect the signal differences the human ear can actually perceive, while the difference operation applied to that output reflects the changes between different vector elements of the acoustic features, thereby adapting to the human ear's greater sensitivity to changes in the voice signal.
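  • A small numpy sketch of the auditory-adaptation post-processing chain described above (squaring for non-negativity, logarithm, first-order difference, global mean-variance normalization); all function and variable names here are illustrative assumptions:

```python
import numpy as np

def auditory_adapt(blstm_out, eps=1e-8):
    """blstm_out: (T, D) raw output of the second neural network model."""
    nonneg = blstm_out ** 2                                   # squaring: non-negative constraint
    log_feat = np.log(nonneg + eps)                           # logarithm: compress amplitude differences
    delta = np.diff(log_feat, axis=0, prepend=log_feat[:1])   # first-order difference over time
    feat = np.concatenate([log_feat, delta], axis=1)
    # global mean-variance normalization (Z-score) over the whole utterance
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + eps)
```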
  • the above-mentioned speech recognition method further includes a step of pre-training a speech recognition model, which is specifically as follows:
  • S602 Obtain a sample voice stream and a correspondingly labeled phoneme category.
  • each audio frame in the sample voice stream has corresponding annotation data.
  • the annotation data includes the phoneme category corresponding to the audio frame determined according to the output user or the voice content of the target voice in the audio frame.
  • S604 Extract the depth feature of each audio frame in the sample voice stream through the third neural network model.
  • The third neural network model in this embodiment may be an acoustic model implemented based on CLDNN (Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks — a network obtained by combining CNN, LSTM, and DNN layers).
  • the output of the CNN layer and the LSTM layer can be batch normalized, which has achieved faster convergence and better generalization.
  • the computer device extracts the depth feature of each audio frame in the sample voice stream through the third neural network model.
  • the third neural network model includes the Softmax layer.
  • the computer device can determine the probability that the robust feature vector element belongs to each phoneme category based on the Softmax layer.
  • Optionally, the depth features of the audio frames in a context window of 2W+1 frames centered on the current audio frame in the sample audio stream may be spliced, and the splicing result used as the depth feature of the current audio frame; see the sketch after this item. In this way, depth features reflecting context information are obtained, which helps improve the accuracy of the third neural network model.
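  • For illustration (not the patent's code), splicing features over a 2W+1-frame context window could look like this:

```python
import numpy as np

def splice_context(features, w=5):
    """Stack each frame with its W left and W right neighbours (2W+1 frames total).

    features: (T, D) frame-level features; returns (T, (2W+1)*D).
    Edge frames are padded by repeating the first/last frame.
    """
    padded = np.pad(features, ((w, w), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + len(features)] for i in range(2 * w + 1)], axis=1)
```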
  • S606 Determine the center vector of the sample voice stream according to the depth features corresponding to the audio frames of all phoneme categories.
  • S608 Determine the fusion loss between the inter-class confusion measurement index and the intra-class distance penalty index of each audio frame based on the depth feature and the center vector.
  • the center vector is used to describe the center of all depth features in the target category.
  • the inter-class confusion measurement index of an audio frame refers to a parameter used to characterize the possibility of a sample voice stream belonging to a target category, and can reflect the distinction between different target categories. The smaller the inter-class confusion index, the stronger the distinction between classes.
  • the inter-class confusion index can be calculated by Euclidean distance, or by other distance type algorithms, such as angular distance.
  • the intra-class distance penalty index refers to a parameter used to characterize the compactness of the intra-class distribution of the sample speech stream.
  • the classification performance of the third neural network model can be enhanced by penalizing the intra-class distance, that is, the intra-class discrimination performance can be satisfied by compact distribution within the class.
  • the intra-class distance penalty index can be realized by the center loss function, but it is not limited to this; for example, it can also be realized by angular-distance-based losses such as the Contrastive loss function, Triplet loss function, SphereFace loss function, and CosFace loss function.
  • the computer equipment fuses the inter-class confusion measurement index and the intra-class distance penalty index by weighting them according to a preset weighting factor: L CL = L ce + λ CL ·L ct, where L CL is the fusion loss, L ce is the inter-class confusion measurement index, L ct is the intra-class distance penalty index, and λ CL is the weighting factor.
  • the computer device determines the global descent gradient generated by the target loss function according to a preset deep learning optimization algorithm.
  • the global descent gradient generated by the target loss function is backpropagated in sequence from the speech recognition model to the network layers of the robust representation model and the speech separation enhancement model.
  • in the EAR system, the DENet-based extractor generates a high-dimensional embedded feature matrix V through the BiLSTM network to predict a target floating-value mask suited to the target speech; with this mask, the mean square error (MSE) between the estimated spectrum output by the extractor and the enhanced spectrum of the target speech can be calculated, and robust features for the target speech can be generated.
  • the robust features then continue through the adapter and the recognizer to predict the speech units.
  • this application updates the parameters of the DENet network by means of multi-task joint training, where the multi-task joint loss function (i.e., the target loss function) is the weighted combination of the first loss function of the speech separation task and the second loss function of speech recognition. Since the forward pass of the DENet network can compute the cross-entropy plus weighted center loss and the spectral mean square error at the same time, the gradient of each loss function with respect to the model parameters can be obtained by back propagation. After adding the weighting factor, the "importance" of the speech separation task during multi-task training can be adjusted, as sketched below.
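  • The following toy snippet illustrates the weighted multi-task objective described above; the tensor shapes, the number of classes and the value of the weighting factor are placeholders, and only the weighted combination of the two losses mirrors the target loss function of this application.

```python
import torch
import torch.nn.functional as F

# Stand-ins for real model outputs: an estimated spectrum from the extractor and
# phoneme logits from the recognizer (shapes and class count are placeholders).
pred_spec = torch.randn(4, 129, requires_grad=True)    # estimated spectrum
target_spec = torch.randn(4, 129)                      # enhanced spectrum of the target speech
logits = torch.randn(4, 10, requires_grad=True)        # recognizer outputs
labels = torch.randint(0, 10, (4,))                    # labeled phoneme classes

l1 = F.mse_loss(pred_spec, target_spec)    # first loss function (speech separation task)
l2 = F.cross_entropy(logits, labels)       # second loss function (speech recognition task)
lambda_ss = 0.1                            # weighting factor (placeholder value)
target_loss = l2 + lambda_ss * l1
target_loss.backward()                     # one backward pass yields gradients for both tasks
```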
  • the center point of each category in the depth feature space can be learned and updated based on the center loss; by penalizing the intra-class distance between a depth feature and the center point of its target category, the error rate of speech recognition in unseen acoustic environments can be significantly reduced, and the generalization of speech recognition to noise variability is effectively improved, so that lower error rates can be obtained for clean speech, for acoustic environments seen in training, and for unseen acoustic environments; this also makes recognition of the speech stream more robust in new acoustic environments, so that even when encountering different users with new accents and background noise, stable and reliable speech recognition can still be achieved.
  • determining the fusion loss between the inter-class confusion measurement index and the intra-class distance penalty index of each audio frame based on the depth features and the center vector includes: inputting the depth features into the cross-entropy function to calculate the inter-class confusion measurement index of each audio frame; inputting the depth features and the center vector into the center loss function to calculate the intra-class distance penalty index of each audio frame; and fusing the inter-class confusion measurement index and the intra-class distance penalty index to obtain the fusion loss.
  • the cross-entropy function is used to ensure the discriminability of depth features between classes.
  • the calculation formula of the cross-entropy function can be as follows: L ce = -(1/M) Σ_{i=1..M} Σ_{t=1..T} log y^t_{Kt}, where L ce is the inter-class confusion measurement index, M is the batch size of the sample voice streams used for training, T is the number of audio frames in the sample voice stream, and y^t_k is the output of the k-th node of the third neural network model after the softmax operation at the t-th audio frame (the model has K output nodes representing K output categories), with y^t = softmax(W·a_t + B); a_t is the output of the layer preceding the softmax layer at the t-th audio frame, a^j_t is the output of the j-th node of that layer, W is the weight matrix of the softmax layer, and B is the bias vector of the softmax layer.
  • the center loss function can be calculated as L ct = (1/2) Σ_{i=1..M} Σ_{t=1..T} ||u_t - c_{Kt}||^2, where L ct is the intra-class distance penalty index, u_t is the depth feature of the t-th audio frame (the output of the penultimate layer of the third neural network model at the t-th audio frame), c_{Kt} is the center vector of the Kt-th class of depth features, and i is the index of the sample voice stream.
  • the goal is to reduce the distance between the depth feature of an audio frame and its center vector as much as possible, that is, the smaller the intra-class distance u_t - c_{Kt}, the better.
  • the computer device fuses the cross-entropy loss function and the center loss function to obtain the second loss function corresponding to the speech recognition model. The fusion may be a weighted calculation of the two according to a preset weighting factor: L CL = L ce + λ CL ·L ct, where L CL is the second loss function and λ CL is a hyperparameter controlling the weight of the center loss function within the second loss function. Correspondingly, the inter-class confusion measurement index and the intra-class distance penalty index are fused by weighting them according to the preset weighting factor λ CL.
  • the center loss function can thus be used to learn and update the center point of each category in the depth feature space, and the distance between a depth feature and the center point of its corresponding category is penalized, thereby improving the discriminative ability of the depth features.
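  • As an illustration, a minimal PyTorch sketch of the fused loss L CL = L ce + λ CL ·L ct is given below; the value of λ CL and the strategy for updating the class centers are placeholders assumed for this sketch and are not prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def fused_loss(depth_feats, logits, labels, centers, lambda_cl=0.01):
    """Cross-entropy (inter-class confusion index) plus weighted center loss.

    depth_feats: (N, D) depth features u_t for the frames in the batch.
    logits:      (N, K) pre-softmax outputs over K phoneme classes.
    labels:      (N,)   ground-truth phoneme class indices.
    centers:     (K, D) per-class center vectors c_k (kept as learnable parameters).
    """
    l_ce = F.cross_entropy(logits, labels)                                   # L ce
    l_ct = 0.5 * ((depth_feats - centers[labels]) ** 2).sum(dim=1).mean()    # L ct
    return l_ce + lambda_cl * l_ct

# In practice the centers are updated together with the network (or with a moving
# average) so that they track the center of each class in the depth feature space.
```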
  • the voice recognition method specifically includes the following steps:
  • S702 Perform Fourier transform on the sample voice stream to obtain the voice frequency spectrum and voice characteristics of each audio frame.
  • S704 Perform speech separation and enhancement on the speech frequency spectrum based on the first neural network model to obtain an estimated frequency spectrum.
  • S706 Map the voice features to the embedding space based on the first neural network model to obtain an embedding feature matrix.
  • S708 Determine an ideal masking matrix according to the speech frequency spectrum and speech characteristics.
  • S710 Filter noise elements in the ideal masking matrix based on a preset binary threshold matrix.
  • S712 Determine an attractor corresponding to the sample voice stream according to the embedded feature matrix and the ideal masking matrix with noise elements filtered.
  • S714 Obtain the target masking matrix of the sample voice stream by calculating the similarity between each matrix element in the embedded feature matrix and the attractor.
  • S716 Determine the enhanced frequency spectrum corresponding to the sample voice stream according to the target masking matrix.
  • S718 Calculate the mean square error loss between the estimated spectrum and the enhanced spectrum corresponding to the sample voice stream based on the first loss function.
  • S722 Obtain a sample voice stream and the correspondingly labeled phoneme category.
  • S724 Extract the depth feature of each audio frame in the sample voice stream through the third neural network model.
  • S726 Determine the center vector of the sample voice stream according to the depth features corresponding to the audio frames of all phoneme categories.
  • S728 Input the depth feature into the cross entropy function, and calculate the inter-class confusion measurement index of each audio frame.
  • S730 Input the depth feature and the center vector into the center loss function, and calculate the intra-class distance penalty index of each audio frame.
  • S732 Perform a fusion operation on the inter-class confusion measurement index and the intra-class distance penalty index to obtain a fusion loss based on the second loss function.
  • S736 Obtain a first loss function of the speech separation enhancement model and a second loss function of the speech recognition model.
  • S740 Perform non-negative constraint processing on the second neural network model to obtain a non-negative neural network model.
  • S746 Perform back propagation based on the second loss function to train the intermediate model bridged between the speech separation enhancement model and the speech recognition model to obtain a robust representation model.
  • S750 Determine the global descent gradient generated by the target loss function.
  • the robust representation module ψ BiLSTM connects the front-end speech separation enhancement model and the back-end speech recognition model, making the entire EAR system a network capable of end-to-end back propagation. Owing to the modular architecture, the entire EAR system network can use a "curriculum" training method (curriculum learning): the robust representation model is first trained independently based on back propagation of the loss function of the back-end speech recognition model, and then the entire EAR system network is jointly trained end to end. Since training can be carried out on the basis of the pre-trained speech separation enhancement model and speech recognition model, the curriculum training method can reach convergence quickly; a two-stage training sketch is given below.
  • in the above speech recognition method, the powerful network structure and the curriculum training method give the joint model trained with the speech recognition method provided in this application a very strong learning ability: it can extract robust and effective speech enhancement and speech separation representations, improve the performance of automatic speech recognition, and adapt to challenging and complex interfering acoustic environments.
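  • The following PyTorch sketch illustrates the two-stage curriculum schedule described above; the linear layers are stand-ins for the actual extractor, adapter and recognizer (the DENet-based separation model, the BiLSTM adapter and the acoustic model), and the dimensions, learning rate and weighting factor are placeholders.

```python
import torch
import torch.nn as nn

extractor = nn.Linear(129, 129)    # stand-in for the pretrained separation enhancement model
adapter = nn.Linear(129, 40)       # stand-in for the intermediate (robust representation) model
recognizer = nn.Linear(40, 10)     # stand-in for the pretrained speech recognition model

spec = torch.randn(8, 129)              # batch of input spectra
target_spec = torch.randn(8, 129)       # enhanced spectra of the target speech
labels = torch.randint(0, 10, (8,))     # labeled phoneme classes
lambda_ss = 0.1                         # weighting factor (placeholder)

# Stage 1: train only the adapter by back-propagating the recognition loss L2;
# gradients flow through the recognizer and extractor but their weights stay fixed.
opt1 = torch.optim.SGD(adapter.parameters(), lr=1e-4)
for _ in range(3):
    l2 = nn.functional.cross_entropy(recognizer(adapter(extractor(spec))), labels)
    opt1.zero_grad()
    l2.backward()
    opt1.step()

# Stage 2: jointly fine-tune all three modules with the fused target loss L2 + lambda_ss * L1.
opt2 = torch.optim.SGD(list(extractor.parameters()) + list(adapter.parameters())
                       + list(recognizer.parameters()), lr=1e-4)
for _ in range(3):
    est = extractor(spec)
    l1 = nn.functional.mse_loss(est, target_spec)
    l2 = nn.functional.cross_entropy(recognizer(adapter(est)), labels)
    loss = l2 + lambda_ss * l1
    opt2.zero_grad()
    loss.backward()
    opt2.step()
```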
  • a voice recognition method is provided.
  • the method is mainly applied to a computer device as an example.
  • the computer device may be the terminal 110 or the server 120 in the above figure. Both the terminal 110 and the server 120 can be used independently to execute the voice recognition method provided in the embodiment of the present application.
  • the terminal 110 and the server 120 can also be used in cooperation to execute the voice recognition method provided in the embodiment of the present application.
  • the voice recognition method specifically includes the following steps:
  • the target voice stream can be an audio data stream collected in any actual acoustic environment.
  • the target voice stream can be pre-collected and stored in the computer device, or it can be dynamically collected by the computer device.
  • the target voice stream may be an audio data stream generated by a user during a game voice call, collected by a game application; such a stream may contain echo interference, including the game background music and far-end voices.
  • the computer device obtains the target voice stream, and collects audio frames in the target voice stream according to a preset sampling frequency.
  • the frame length of each audio frame and the frame shift between adjacent audio frames can be freely set according to requirements.
  • the computer device collects audio frames based on a sampling frequency of 16 kHz, a frame length of 25 ms, and a frame shift of 10 ms.
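  • As an illustration, a minimal numpy sketch of the framing described above (16 kHz sampling, 25 ms frames, 10 ms shift) is given below; dropping the incomplete tail frame and using rectangular (unwindowed) frames are assumptions of this sketch.

```python
import numpy as np

def frame_audio(samples, sr=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping frames."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
    return np.stack([samples[i * shift:i * shift + frame_len] for i in range(n_frames)])

frames = frame_audio(np.random.randn(16000))   # one second of audio
print(frames.shape)                            # (98, 400)
```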
  • the speech separation enhancement model is a neural network model, which may be a model obtained by simplifying a deep attractor network (DANet) and a deep extractor network (Deep Extractor Net, DENet) based on an ideal ratio mask (IRM).
  • the speech separation enhancement model may adopt a peephole-connected four-layer BiLSTM, each layer has 600 hidden nodes, and a fully connected layer is connected after the last BiLSTM layer.
  • the computer device can perform short-time Fourier transform on multiple target voice streams in batches to obtain the voice features and voice frequency spectrum of each target voice stream.
  • the computer device maps the voice features of the batch target voice stream to a higher-dimensional embedding space based on the voice separation enhancement model, and performs voice separation and enhancement on the voice spectrum in the embedding space to obtain an embedded feature matrix.
  • the computer device obtains the pre-stored global attractor.
  • the computer device stores the attractors calculated from each batch of sample speech streams, calculates the average value of these attractors, and uses the average value as the global attractor in the test production phase.
  • the computer device obtains the target masking matrix of the target voice stream by calculating the similarity between the global attractor and each matrix element in the embedded feature matrix corresponding to the target voice stream. Based on the target masking matrix and the embedded feature matrix, the enhanced spectrum of the target speech stream can be extracted.
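  • As an illustration, the numpy sketch below estimates a target mask from a stored global attractor and applies it to the mixture spectrum; the dot-product-plus-sigmoid similarity, the dimensions and the application of the mask to the magnitude spectrogram are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def enhance_with_global_attractor(embeddings, mixture_spec, attractor):
    """embeddings:   (T*F, K) embedded feature matrix produced by the separation network.
    mixture_spec: (T, F) magnitude spectrogram of the mixed target voice stream.
    attractor:    (K,) global attractor (mean of the attractors stored during training)."""
    t_frames, f_bins = mixture_spec.shape
    similarity = embeddings @ attractor              # similarity of each T-F bin to the attractor
    mask = 1.0 / (1.0 + np.exp(-similarity))         # squash similarities into a (0, 1) mask
    return mask.reshape(t_frames, f_bins) * mixture_spec   # enhanced spectrum of the target speech

enhanced = enhance_with_global_attractor(np.random.randn(100 * 129, 40),
                                         np.abs(np.random.randn(100, 129)),
                                         np.random.randn(40))
print(enhanced.shape)   # (100, 129)
```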
  • S806 Perform auditory matching on the enhanced spectrum based on the robust representation model to obtain robust features.
  • the robust representation model is a neural network model that bridges between the front-end speech separation enhancement model and the back-end speech recognition model.
  • it can be a CNN, a BiLSTM, or another architecture from the recurrent model family, and has the ability to adapt to both bottom-up and top-down temporal dynamic influences.
  • the robust representation model may be a peephole-connected two-layer BiLSTM with 600 hidden nodes in each layer.
  • the robust feature is an intermediate transition feature obtained by converting the enhanced spectrum output by the front-end speech separation enhancement model, and the intermediate transition feature is used as the input of the back-end speech recognition model.
  • the computer equipment extracts acoustic features from the enhanced spectrum based on the robust representation model, and the robust representation model performs auditory matching on these acoustic features.
  • the computer equipment performs non-negative constraint processing on the acoustic features based on the robust representation model, and performs differential operations such as taking the logarithm and computing differences on the acoustic features after the non-negative constraint processing to obtain robust features. For example, for a high-amplitude speech signal and a low-amplitude speech signal with a very large difference in spectral amplitude, the difference that the human ear can perceive may not be as pronounced as the amplitude difference.
  • taking the logarithm of the feature vector elements of the acoustic features weakens the difference between the amplitudes, so that the differences between different vector elements of the acoustic features better reflect the signal differences that the human ear can actually perceive.
  • the human ear is also more sensitive to changes in the voice signal; performing a difference operation on the feature vector elements of the acoustic features based on the difference model produces results that reflect the changes between different vector elements of the acoustic features.
  • S808 Recognize the robust features based on the speech recognition model to obtain the phoneme corresponding to each audio frame; among them, the speech separation enhancement model, the robust representation model and the speech recognition model are jointly trained.
  • the speech recognition model, the aforementioned speech separation enhancement model, and the robust representation model may be obtained through joint training in advance.
  • the front-end speech separation enhancement model and the back-end speech recognition model can be pre-trained.
  • the computer equipment obtains the first loss function of the speech separation enhancement model and the second loss function of the speech recognition model, calculates the loss value based on the second loss function, and performs back propagation according to the loss value to train the intermediate model bridged between the speech separation enhancement model and the speech recognition model, obtaining a robust representation model.
  • the computer device then fuses the first loss function and the second loss function, performs joint training on the speech separation enhancement model, the robust representation model and the speech recognition model based on the target loss function obtained by the fusion, and ends the training when the preset convergence condition is met.
  • the computer device inputs the robust features into the speech recognition model to obtain the phonemes corresponding to the target speech stream.
  • the speech recognition model can recognize about 20,000 phoneme categories.
  • the speech recognition model processes the robust features of the input batch of target speech streams, and outputs a phoneme vector of about 20,000 dimensions. There is a correspondence between robust feature vector elements and phoneme vector elements.
  • the phoneme vector records the probability that each robust feature vector element belongs to each phoneme category, so that the phoneme category with the maximum probability can be determined for each robust feature vector element and the corresponding phoneme string obtained, realizing speech recognition of the target speech stream at the phoneme level.
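  • As a small illustration, the sketch below picks the maximum-probability phoneme category for each frame from such a posterior matrix; the shapes are placeholders.

```python
import numpy as np

def pick_phonemes(posteriors):
    """posteriors: (T, K) per-frame probabilities over K phoneme categories.
    Returns the index of the most probable phoneme category for each frame."""
    return posteriors.argmax(axis=1)

frame_phonemes = pick_phonemes(np.random.rand(100, 20000))
print(frame_phonemes[:10])
```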
  • the above speech recognition method proposes a new end-to-end network architecture that introduces a robust representation model between the front-end speech separation enhancement model and the back-end speech recognition model.
  • this architecture introduces an appropriate intermediate transition representation learning technique that bridges the difference between human-oriented speech separation tasks and machine-oriented speech recognition tasks; joint training of the end-to-end network model allows each individual model in the architecture to comprehensively learn the interference characteristics of speech signals in complex acoustic environments, which ensures the performance of the global speech processing task and improves the accuracy of speech recognition. In addition, since each model in the architecture supports flexible and independent selection, each model can be configured optimally without compromising any single model, so that the performance of each local speech processing task is taken into account at the same time and the objective intelligibility of speech is improved.
  • the speech separation enhancement model includes a first neural network model; extracting the enhanced spectrum of each audio frame in the target speech stream based on the speech separation enhancement model includes: extracting the embedded feature matrix of each audio frame in the target speech stream based on the first neural network model; determining the attractor corresponding to the target speech stream according to the embedded feature matrix and the preset ideal masking matrix; obtaining the target masking matrix of the target speech stream by calculating the similarity between each matrix element in the embedded feature matrix and the attractor; and determining the enhanced frequency spectrum corresponding to each audio frame in the target voice stream according to the target masking matrix.
  • the speech separation enhancement model may be obtained by training based on the first neural network model.
  • the process of extracting the enhanced spectrum of each audio frame in the target voice stream based on the voice separation enhancement model can refer to the description of the above steps S402-S410, which will not be repeated here.
  • the robust representation model includes a second neural network model and a differential model; performing auditory matching on the enhanced spectrum based on the robust representation model to obtain robust features includes: extracting acoustics from the enhanced spectrum based on the second neural network model Features; non-negative constraint processing is performed on the acoustic features to obtain non-negative acoustic features; the non-negative acoustic features are differentiated through the differential model to obtain robust features that match the human hearing habits.
  • the intermediate model may be obtained by splicing the second neural network model and the differential model, and training the intermediate model to obtain a robust representation model. For extracting robust features based on the robust representation model, reference may be made to the description of the above steps S502-S506, which will not be repeated here.
  • in terms of word error rate (WER), the test results can be referred to in Fig. 9a.
  • the word error rate of the EAR system is consistently better than that of other speech recognition systems, such as a speech recognition model ASR trained on clean or interfered speech alone, and the cascade of a speech separation enhancement model SS with a speech recognition model ASR.
  • Fig. 9b shows a schematic diagram of the performance comparison of different speech recognition systems under different SNR conditions in a single-channel multi-speaker speech recognition task in an embodiment.
  • compared with directly using the speech separation enhancement model as a cascaded preprocessing step of the speech recognition model, the EAR system proposed in this application is superior both in short-time objective intelligibility (STOI) and in word error rate (WER, %); that is, the EAR system can significantly improve machine-oriented recognition accuracy (WER) while maintaining the speech intelligibility (STOI) that reflects human hearing.
  • the examples of this application reveal for the first time the importance of introducing appropriate intermediate transition representation learning to bridge the difference between human-oriented and machine-oriented speech processing tasks, which makes it possible to simultaneously guarantee the optimal performance of the front-end speech separation local task (human subjective auditory intelligibility) and of the back-end speech recognition global task (the machine's recognition accuracy index).
  • the EAR system proposed in this application is based on the robust representation model, and its noise suppression can resolve the echo interference from the game background sound and the far-end human voice during a game voice call.
  • the EAR system framework proposed in this application is highly flexible: it allows any advanced speech separation enhancement model and speech recognition model to be flexibly integrated into the corresponding modules of the EAR framework, and the proposed end-to-end trainable framework does not come at the cost of impairing the performance of any single module.
  • the voice recognition method specifically includes the following steps:
  • S1004 Extract the embedded feature matrix of each audio frame in the target voice stream based on the voice separation enhancement model.
  • S1006 Determine the attractor corresponding to the target voice stream according to the embedded feature matrix and the preset ideal masking matrix.
  • S1008 Obtain the target masking matrix of the target speech stream by calculating the similarity between each matrix element in the embedded feature matrix and the attractor.
  • S1010 Determine the enhanced frequency spectrum corresponding to each audio frame in the target voice stream according to the target masking matrix.
  • S1012 Obtain a robust representation model; the robust representation model includes a second neural network model and a differential model.
  • S1014 Extract acoustic features from the enhanced frequency spectrum based on the second neural network model.
  • S1016 Perform non-negative constraint processing on the acoustic features to obtain non-negative acoustic features.
  • S1018 Perform a differential operation on non-negative acoustic features through a differential model to obtain robust features that match the human hearing habits.
  • S1020 Recognize the robust features based on the speech recognition model to obtain the phoneme corresponding to each audio frame; among them, the speech separation enhancement model, the robust representation model and the speech recognition model are jointly trained.
  • the above speech recognition method proposes a new end-to-end network architecture that introduces a robust representation model between the front-end speech separation enhancement model and the back-end speech recognition model.
  • this architecture introduces an appropriate intermediate transition representation learning technique that bridges the difference between human-oriented speech separation tasks and machine-oriented speech recognition tasks; joint training of the end-to-end network model allows each individual model in the architecture to comprehensively learn the interference characteristics of speech signals in complex acoustic environments, which ensures the performance of the global speech processing task and improves the accuracy of speech recognition. In addition, since each model in the architecture supports flexible and independent selection, each model can be configured optimally without compromising any single model, so that the performance of each local speech processing task is taken into account at the same time and the objective intelligibility of speech is improved.
  • the steps in the above flowcharts are displayed in sequence as indicated by the arrows, but they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages; these sub-steps or stages are not necessarily executed at the same time and can be executed at different times, and their execution order is not necessarily sequential: they may be performed in turn or alternately with at least a part of other steps or of the sub-steps or stages of other steps.
  • a speech recognition device 1100 which includes an intermediate representation learning module 1102, a loss fusion module 1104, and a joint training module 1106, where:
  • the intermediate representation learning module 1102 is used to obtain the first loss function of the speech separation enhancement model and the second loss function of the speech recognition model, and to perform back propagation based on the second loss function so as to train the intermediate model bridged between the speech separation enhancement model and the speech recognition model to obtain a robust representation model.
  • the loss fusion module 1104 is used to fuse the first loss function and the second loss function to obtain the target loss function.
  • the joint training module 1106 is used for joint training of the speech separation enhancement model, the robust representation model and the speech recognition model based on the target loss function, and the training ends when the preset convergence condition is met.
  • the above-mentioned speech recognition device 1100 further includes a speech separation enhancement model pre-training module 1108, which is used to: extract the estimated frequency spectrum and the embedded feature matrix of the sample speech stream based on the first neural network model; determine the attractor corresponding to the sample voice stream according to the embedded feature matrix and the preset ideal masking matrix; obtain the target masking matrix of the sample voice stream by calculating the similarity between each matrix element in the embedded feature matrix and the attractor; determine the enhanced spectrum corresponding to the sample voice stream according to the target masking matrix; and train the first neural network model based on the mean square error loss between the estimated spectrum corresponding to the sample voice stream and the enhanced spectrum to obtain the speech separation enhancement model.
  • the speech separation enhancement model pre-training module 1108 is also used to perform Fourier transform on the sample speech stream to obtain the speech spectrum and speech features of each audio frame, perform speech separation and enhancement on the speech spectrum based on the first neural network model to obtain the estimated spectrum, and map the speech features to the embedding space based on the first neural network model to obtain the embedded feature matrix.
  • the speech separation enhancement model pre-training module 1108 is also used to determine the ideal masking matrix according to the speech spectrum and speech features, filter the noise elements in the ideal masking matrix based on the preset binary threshold matrix, and determine the attractor corresponding to the sample speech stream according to the embedded feature matrix and the ideal masking matrix with noise elements filtered out.
  • the above-mentioned speech recognition device 1100 further includes an intermediate model construction module 1110, which is used to obtain a second neural network model, perform non-negative constraint processing on the second neural network model to obtain a non-negative neural network model, obtain a differential model used for auditory adaptation of the acoustic features output by the non-negative neural network model, and cascade the differential model with the non-negative neural network model to obtain an intermediate model.
  • the intermediate model construction module 1110 is also used to obtain a logarithmic model for performing a logarithmic operation on the feature vector corresponding to the acoustic features, obtain a difference model for performing a difference operation on the feature vector corresponding to the acoustic features, and construct the differential model from the logarithmic model and the difference model.
  • the above-mentioned speech recognition device 1100 further includes a speech recognition model pre-training module 1112, which is used to: obtain a sample speech stream and the correspondingly labeled phoneme categories; extract the depth features of each audio frame in the sample speech stream through the third neural network model; determine the center vector of the sample voice stream based on the depth features corresponding to the audio frames of all phoneme categories; determine the fusion loss between the inter-class confusion measurement index and the intra-class distance penalty index of each audio frame based on the depth features and the center vector; and train the third neural network model based on the fusion loss to obtain the speech recognition model.
  • the speech recognition model pre-training module 1112 is also used to input the depth feature into the cross-entropy function to calculate the inter-class confusion measurement index of each audio frame; input the depth feature and the center vector into the center loss function, and calculate each The intra-class distance penalty index of each audio frame; the inter-class confusion measurement index and the intra-class distance penalty index are fused to obtain the fusion loss.
  • the joint training module 1106 is also used to determine the global descent gradient generated by the target loss function and, according to the global descent gradient, iteratively update the model parameters corresponding to the speech separation enhancement model, the robust representation model, and the speech recognition model until the minimized loss value of the target loss function is obtained.
  • a speech recognition device 1300 is provided, including a speech separation and enhancement module 1302, an intermediate characterization transition module 1304, and a speech recognition module 1306.
  • the speech separation and enhancement module 1302 is used to obtain a target speech stream and extract the enhanced spectrum of each audio frame in the target voice stream based on the speech separation enhancement model.
  • the intermediate characterization transition module 1304 is used to perform auditory matching on the enhanced spectrum based on the robust characterization model to obtain robust features.
  • the speech recognition module 1306 is used to recognize robust features based on the speech recognition model to obtain the phoneme corresponding to each audio frame; among them, the speech separation enhancement model, the robust representation model and the speech recognition model are jointly trained.
  • the speech separation enhancement model includes a first neural network model; the speech separation enhancement module 1302 is also used to extract the embedded feature matrix of each audio frame in the target speech stream based on the first neural network model, determine the attractor corresponding to the target speech stream according to the embedded feature matrix and the preset ideal masking matrix, obtain the target masking matrix of the target speech stream by calculating the similarity between each matrix element in the embedded feature matrix and the attractor, and determine the enhanced spectrum corresponding to each audio frame in the target speech stream according to the target masking matrix.
  • the robust representation model includes a second neural network model and a differential model; the intermediate characterization transition module 1304 is also used to extract acoustic features from the enhanced spectrum based on the second neural network model, perform non-negative constraint processing on the acoustic features to obtain non-negative acoustic features, and perform differential operations on the non-negative acoustic features through the differential model to obtain robust features that match human hearing habits.
  • Fig. 14 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may specifically be the terminal 110 or the server 120 in FIG. 1.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store computer-readable instructions.
  • when those computer-readable instructions are executed by the processor, the processor can carry out the voice recognition method.
  • the internal memory may also store computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor can execute the voice recognition method.
  • FIG. 14 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the speech recognition apparatus provided in the present application may be implemented in a form of computer-readable instructions, and the computer-readable instructions may run on the computer device as shown in FIG. 14.
  • the memory of the computer device can store various program modules that make up the speech recognition device, such as the speech separation enhancement module, the intermediate representation transition module, and the speech recognition module shown in FIG. 13.
  • the computer-readable instructions formed by each program module cause the processor to execute the steps in the voice recognition method of each embodiment of the present application described in this specification.
  • a computer device including a memory and a processor, the memory stores computer readable instructions, and when the computer readable instructions are executed by the processor, the processor executes the steps of the above voice recognition method.
  • the steps of the voice recognition method here may be the steps in the voice recognition method of each of the foregoing embodiments.
  • a computer-readable storage medium is provided, which stores computer-readable instructions.
  • when the computer-readable instructions are executed by a processor, the processor executes the steps of the above-mentioned speech recognition method.
  • the steps of the voice recognition method here may be the steps in the voice recognition method of each of the foregoing embodiments.
  • a computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.


Abstract

A speech recognition method and apparatus, and a computer-readable storage medium. The method comprises: acquiring a first loss function of a speech separation enhancement model and a second loss function of a speech recognition model (S202); performing back propagation based on the second loss function so as to train an intermediate model bridged between the speech separation enhancement model and the speech recognition model, obtaining a robust representation model (S204); fusing the first loss function and the second loss function to obtain a target loss function (S206); and jointly training the speech separation enhancement model, the robust representation model and the speech recognition model based on the target loss function, ending the training when a preset convergence condition is met (S208).

Description

语音识别方法、装置和计算机可读存储介质
本申请要求于2020年01月16日提交中国专利局,申请号为202010048780.2,申请名称为“语音识别及模型训练方法、装置和计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音处理技术领域,特别是涉及一种语音识别方法、装置和计算机可读存储介质。
背景技术
语音识别技术的发展,使人与机器通过自然语言交互成为可能。基于语音识别技术可以将语音信号转换为文本序列。实现这种转换需要对拾取的语音信号进行语音分离(Speech Separation,SS)和语音增强(Speech Enhancement,SE)等前端处理,再对前端处理得到的声学特征进行自动语音识别(Automatic Speech Recognition,ASR)后端处理。
传统技术中,可以通过语音分离增强模型对语音信号进行语音分离和增强,再利用语音识别模型进行语音识别。然而,经常存在语音识别准确性较低的问题。
发明内容
根据本申请提供的各种实施例,提供一种语音识别方法、装置和计算机可读存储介质。
一种语音识别方法,由计算机设备执行,所述方法包括:获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数;基于所述第二损失函数进行反向传播,以对桥接在所述语音分离增强模型和语音识别模型之间的中间模型进行训练,得到鲁棒表征模型;对所述第一损失函数和第二损失函数进行融合,得到目标损失函数;基于所述目标损失函数对所述语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练,在满足预设收敛条件时结束训练。
一种语音识别装置,所述装置包括:中间表征学习模块,用于获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数;基于所述第二损失函数进行反向传播,以对桥接在所述语音分离增强模型和语音识别模型之间的中间模型进行训练,得到鲁棒表征模型;损失融合模块,用于对所述第一损失函数和第二损失函数进行融合,得到目标损失函数;联合训练模块,用于基于所述目标损失函数对所述语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练,在满足预设收敛条件时结束训练。
一种语音识别方法,由计算机设备执行,包括:获取目标语音流;基于语音分离增强模型提取所述目标语音流中每个音频帧的增强频谱;基于鲁棒表征模型对所述增强频谱进行听觉匹配,得到鲁棒特征;基于语音识别模型对所述鲁棒特征进行识别,得到每个音频帧对应的音素;其中,所述语音分离增强模型、鲁棒表征模型及语音识别模型联合训练得到。
一种语音识别装置,所述装置包括:语音分离增强模块,用于获取目标语音流;基于语 音分离增强模型提取所述目标语音流中每个音频帧的增强频谱;中间表征过渡模块,用于基于鲁棒表征模型对所述增强频谱进行听觉匹配,得到鲁棒特征;语音识别模块,用于基于语音识别模型对所述鲁棒特征进行识别,得到每个音频帧对应的音素;其中,所述语音分离增强模型、鲁棒表征模型及语音识别模型联合训练得到。
一个或多个存储有计算机可读指令的非易失性存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述处理器执行所述语音识别方法的步骤。
一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行所述语音识别方法的步骤。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为一个实施例中语音识别方法的应用环境图;
图2为一个实施例中语音识别方法的流程示意图;
图3为一个实施例中基于鲁棒表征模型对语音分离增强模型和语音识别模型进行桥接的模型架构示意图;
图4为一个实施例中语音处理模型预训练的步骤的流程示意图;
图5为一个实施例中中间模型的构建步骤的流程示意图;
图6为一个实施例中语音识别模型预训练的步骤的流程示意图;
图7为一个具体实施例中语音识别方法的流程示意图;
图8为一个实施例中语音识别方法的流程示意图;
图9a为一个实施例中在五种SNR信噪比条件下基于不同语音识别方法对来自两种声学环境的语音进行识别的字错误率的对比示意图;
图9b为一个实施例中在不同SNR信噪比条件下不同语音识别系统的性能比较示意图;
图10为一个具体实施例中语音识别方法的流程示意图;
图11为一个实施例中语音识别装置的结构框图;
图12为另一个实施例中语音识别装置的结构框图;
图13为一个实施例中语音识别装置的结构框图;及
图14为一个实施例中计算机设备的结构框图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
图1为一个实施例中语音识别模型的训练方法的应用环境图。参照图1,该语音识别方法应用于模型训练系统。该语音识别模型训练系统包括终端110和服务器120。终端110和服务器120通过网络连接。终端110具体可以是台式终端或移动终端,移动终端具体可以手机、平板 电脑、笔记本电脑等中的至少一种。服务器120可以用独立的服务器或者是多个服务器组成的服务器集群来实现。终端110和服务器120均可单独用于执行本申请实施例中提供的语音识别方法。终端110和服务器120也可协同用于执行本申请实施例中提供的语音识别方法。
本申请实施例提供的方案涉及人工智能的语音识别等技术。语音技术(Speech Technology)的关键技术有语音分离(SS)和语音增强(SE)及自动语音识别技术(ASR)。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。
需要说明的是,本申请实施例中涉及用于语音处理的联合模型。联合模型包括用于不同环节语音处理的三个模型,具体包括前端的语音分离增强模型和后端的语音识别模型,以及桥接在语音分离增强模型和语音识别模型之间的鲁棒表征模型。三个模型分别可以是一种机器学习模型。机器学习模型是通过样本学习后具备某种能力的模型,具体可以是神经网络模型,比如CNN(Convolutional Neural Networks,卷积神经网络)模型、RNN(Recurrent Neural Networks,循环神经网络)模型等。当然,机器学习模型也可以采用其他类型的模型。可以理解,在模型训练前可以根据精准度要求等灵活选择每个环节所采用的模型,如此,每个环节均可采用最优配置,而不需要妥协任意一个环节的性能。换言之,本申请所涉及的三个模型分别可以自由选择擅长相应领域的专用模型。其中,语音分离增强模型与语义识别模型分别可以是预训练好的,如此本申请可以在预训练的语音分离增强模型与语义识别模型基础上训练包含鲁棒表征模型的联合模型,如此可以在较少的迭代训练次数即可得到收敛的联合模型。语音分离增强模型与语义识别模型的预训练过程以及结合鲁棒表征模型进行联合训练的过程可参考后续实施例中的详细描述。
如图2所示,在一个实施例中,提供了一种语音识别方法。本实施例主要以该方法应用于计算机设备来举例说明,该计算机设备具体可以是上图中的终端110或者服务器120。参照图2,该语音识别方法具体包括如下步骤:
S202,获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数。
其中,语音分离增强模型是用于经过训练后具有语音分离和/或增强能力的模型,具体可以是以样本语音流作为训练数据,进行学习训练得到的用于将目标语音从样本语音流中的背景干扰中分离出来的模型。可以理解,语音分离增强模型还可以具有对语音信号进行语音活动检测(Voice Activity Detection,VAD)、回声消除、混响消除或声源定位等预处理的能力的至少一种,对此不作限制。根据传感器或麦克风的数量,语音分离增强模型可分为单声道(单个麦克风)分离增强模型和阵列(多个麦克风)分离增强模型。单声道分离的主要方法包括语音增强和计算听觉场景分析(Computational Auditory Scene Analysis,CASA)。语音增强可以通过分析单声道混合信号中目标语音信号和干扰信号的全部数据,经过带噪语音的噪声估计,对清晰语音进行估计,主流的语音增强方法包括频谱相减法(spectral subtraction)等。计算听觉场景分析是建立在听觉场景分析的感知理论基础上,利用聚类约束(grouping cue)进行语音分离。阵列分离的主要方法包括波束成形或者空间滤波器等。波束成形是通过恰当的阵列结构增强从特定的方向到达的语音信号,进而削减来自其它方向语音信号的干扰,如延迟-叠加技术。语音分离以及增强是以人为导向的语音处理任务。在语音分离以及增强领域,常采用人为理解更为有效的表征参数,如短时傅立叶变换(Short Time Fourier Transform,STFT)频谱图或者修正离散余弦变换(Modified Discrete Cosine Transform,MDCT)等。语音分离以及增强主流的性能衡量指标包括语音质量的感知评估(Perceptual Evaluation of  Speech Quality,PESQ)、信号失真比(Signal Distortion Rate,SDR)以及短时客观可懂度(Short Time Objective Intelligibility,STOI)等中的至少一个。其中STOI与主观听觉可懂度具有高度相关性。语音识别模型是经过训练后具有语音识别能力的声学模型,具体可以是以样本语音流作为训练数据,进行学习训练得到的用于对样本语音流进行音素识别的模型。语音分离增强模型与语音识别模型可以是分别预先训练好的。预训练的语音分离增强模型与语音识别模型各自具有固定的模型结构和模型参数。语音识别是以机器为导向的语音处理任务。在自动语音识别等领域,例如智能音箱、虚拟数字人助手、机器翻译等,常采用机器理解更为高效的表征参数,如梅尔滤波器组(Mel Fbanks)、梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCC)等。语音识别模型主流的性能衡量指标包括字错误率(Word Error Rate,WER),字符错误率(Character Error Rate,CER)或句子错误率(Sentence Error Rate,SER)等。
具体地,当需要进行联合模型训练时,计算机设备获取预训练的语音分离增强模型和语音识别模型、预训练语音分离增强模型时所采用的第一损失函数以及预训练语音识别模型时所采用的第二损失函数。损失函数(loss function)通常作为学习准则与优化问题相联系,即通过最小化损失函数求解和评估模型。例如在统计学和机器学习中被用于模型的参数估计(parameteric estimation)。预训练语音分离增强模型所采用的第一损失函数及预训练语音识别模型所采用的第二损失函数分别具体可以是均方误差、平均绝对值误差、Log-Cosh损失、分位数损失或者理想分位数损失等。第一损失函数与第二损失函数分别也可以是多种损失函数的组合。
S204,基于第二损失函数进行反向传播,以对桥接在语音分离增强模型和语音识别模型之间的中间模型进行训练,得到鲁棒表征模型。
如上文所述,在语音处理过程中,前端的语音分离任务所采用的表征参数和性能衡量指标是以人为导向的,即以人的主观听觉可懂度为目标;而后端的语音识别任务所采用的表征参数和性能衡量指标则是以机器为导向的,即以机器识别准确率为目标。如此,在进行前后端语音处理任务的融合时,需要克服两种表征范畴之间的差异。桥接是指一个对象在至少两个对象之间,连接该至少两个对象。即对于一个对象B,如果该对象桥接在A与C之间,则表示对象B位于A与C之间,B的一端与A连接,另外一端与C连接。对于模型而言,中间模型桥接在语音分离增强模型以及语音识别模型之间,表示语音分离增强模型的输出,为中间模型的输入,输入的数据经过中间模型进行处理所输出的数据,为语音识别模型的输入。
参考图3,图3示出了一个实施例中基于鲁棒表征模型对语音分离增强模型和语音识别模型进行桥接的模型架构示意图。如图3所示,为了克服两种表征范畴之间的差异,本申请的实施例在语音分离增强模型和语音识别模型之间桥接了待训练的中间模型。训练后的中间模型具有鲁棒性,可以称作鲁棒表征模型。其中,待训练的中间模型以及预训练的语音分离增强模型和语音识别模型均可以是由人工神经网络构成的模型。人工神经网络(Artificial Neural Networks,简写为ANNs),也简称为神经网络(NNs)或称作连接模型(Connection Model)。人工神经网络可从信息处理角度对人脑神经元网络进行抽象,以建立某种模型,按不同的连接方式组成不同的网络。在工程与学术界也常直接简称为神经网络或类神经网络。神经网络模型比如CNN(Convolutional Neural Network,卷积神经网络)模型、DNN(Deep Neural Network,深度神经网络)模型和RNN(Recurrent Neural Network,循环神经网络)模型等。语音分离增强模型也可以是多种神经网络模型的组合。其中,卷积神经网络包括卷 积层(Convolutional Layer)和池化层(Pooling Layer)。深度神经网络包括输入层、隐含层和输出层,层与层之间是全连接的关系。循环神经网络是一种对序列数据建模的神经网络,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐藏层之间的节点不再无连接而是有连接的,并且隐藏层的输入不仅包括输入层的输出还包括上一时刻隐藏层的输出。循环神经网络模型,比如LSTM(Long Short-Term Memory Neural Network,长短时记忆神经网络)模型和BiLSTM(Bi-directional Long Short-Term Memory,双向长短时记忆神经网络)等。
在一个实施例中,用于语音分离和增强的语音分离增强模型也可称作提取器Extract,用于中间过渡表征学习的鲁棒表征模型也可称作适配器Adapt,用于音素识别的语音识别模型也可称作识别器Recongnize。下文将提取器、适配器和识别器构成的语音处理系统称作EAR系统。
具体地,计算机设备按照预设的深度学习优化算法确定第二损失函数在每次迭代过程产生的局部下降梯度。深度学习优化算法具体可以是批量梯度下降(Batch Gradient Descent,BGD)、随机梯度下降(Stochastic Gradient Descent,SGD)、小批量梯度下降(Mini-Batch Gradient Descent,MBGD)、AdaGrad(自适应算法)或者RMSProp(Root Mean Square Prop)或Adam(Adaptive Moment Estimation)等。计算机设备将局部下降梯度反向传播至中间模型,以对中间模型对应的模型参数进行更新,直至符合预设的训练停止条件时结束训练。以随机梯度下降法为例,假设L 1和L 2分别为第一损失行数和第二损失函数,f(x,Θ adapt)表示输入为x和模型参数为Θ adapt的中间模型,y为中间模型输入x时语音识别模型对应的输出目标值,样本语音流中包含n个音频帧{x (1),…,x (n)},其中x (i)所对应的目标为y (i),则每次迭代所对应的局部下降梯度为
Figure PCTCN2020128392-appb-000001
假设随机梯度下降算法的学习率为η,则可以将模型参数变更为Θ adapt-ηg,并将变更后的模型参数作为中间模型当前的模型参数继续进行迭代,直至达到预设的训练停止条件。训练停止条件可以是第二损失函数的损失值达到预设最小值,或连续预设次数迭代中间模型的模型性能无明显改善等。
在一个实施例中,在基于第二损失函数反向传播对中间模型进行训练过程中,训练数据虽然经过了语音识别模型,但无需对预训练的语音识别模型的模型参数进行调整更新。值得强调的是,用户根据模型偏好或者精准度要求等可以对具体所采用的中间模型、语音分离增强模型以及语音识别模型分别进行灵活独立选择,即允许用户按照自己意愿灵活地引进新的先进的语音分离/增强和语音识别技术。换言之,本申请所涉及的三个模型分别可以自由选择擅长相应领域的专用模型。比如,擅长语音分离领域的模型包括Ai,擅长鲁棒表征学习领域的模型包括Bj,擅长语音识别领域的模型包括Ck,其中i,j,k均为正整数,则待训练的联合模型可以是Ai+Bj+Ck中的任意一种。如此,每个模型均可采用最优配置,而不需要妥协其他模型的性能。此外,这里的局部下降梯度是相对下文联合训练时所涉及的全局下降梯度 而言的,不可认为是根据第二损失函数确定的下降梯度值的部分取值。
S206,对第一损失函数和第二损失函数进行融合,得到目标损失函数。
其中,目标损失函数是由第一损失函数和第二损失函数组合而成的综合损失函数。函数融合是通过一种或多种预设逻辑运算将多个函数转换为一个函数的过程。预设逻辑运算包括但不限于四则混合运算、加权求和或者机器学习算法等。
具体地,计算机设备通过对第一损失函数与第二损失函数分进行预设逻辑运算,得到目标损失函数。以加权求和为例,假设加权因子为λ SS,则目标损失函数L=L 2SSL 1。加权因子可以是根据经验或实验设定的数值,如0.1。容易发现,通过调整加权因子可以调整在多模型联合训练时语音分离增强模型的重要性。
在一个实施例中,计算机设备预置了一种或多种融合计算公式,并设定了融合计算公式中每种参数因子的输入格式。第一损失函数与第二损失函数分别作为一种参数因子输入不同的融合计算公式,即可得到不同的目标损失函数。
S208,基于目标损失函数对语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练,在满足预设收敛条件时结束训练。
如上文,语音分离增强模型、鲁棒表征模型和语音识别模型均可以是由人工神经网络构成的模型,如此,本申请提供的用于语音处理的模型架构是完全基于神经网络的,可以是实现端到端的联合训练。整个端到端的联合训练过程并不会人为进行任务划分,而是将整个语音处理任务完全交给神经网络模型直接学习从原始语音信号到期望输出的映射。具体地,计算机设备按照预设的深度学习优化算法确定目标损失函数产生的全局下降梯度,例如基于目标损失函数计算得到损失值,基于损失值确定全局下降梯度。用于确定局部下降梯度的深度学习优化算法与用于确定全局下降梯度的深度学习优化算法可以相同,也可以不同。目标损失函数产生的全局下降梯度从语音识别模型依次反向传播至鲁棒表征模型和语音分离增强模型的网络各层,在此过程中对语音分离增强模型、鲁棒表征模型及语音识别模型对应的模型参数分别进行迭代更新,直至满足预设的训练停止条件时结束训练。
在一个实施例中,基于目标损失函数对语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练包括:确定目标损失函数产生的全局下降梯度;根据全局下降梯度对语音分离增强模型、鲁棒表征模型及语音识别模型分别对应的模型参数进行迭代更新,直至获得目标损失函数的最小化损失值。
以小批量随机梯度下降法为例,假设L 1和L 2分别为第一损失行数和第二损失函数,L为目标损失函数,Θ adapt为鲁棒表征模型的模型参数,Θ extract为语音分离增强模型的模型参数,Θ recog为语音识别模型的模型参数,Θ EAR为整个联合模型的模型参数,α为小批量随机梯度下降算法的学习率,则将目标损失函数产生的全局下降梯度一直反向传播至语音分离增强模型
Figure PCTCN2020128392-appb-000002
将模型参数变更为
Figure PCTCN2020128392-appb-000003
并将变更后的模型参数作为联合模型当前的模型参数继续进行迭代,直至达到预设的训练停止条件。训练停止条件可以是目标损失函数的损失值达到预设最小值,或连续预设次数迭代中间模型的模型性能无明显改善等。
在一个具体的实施例中,样本语音流的批量大小可以是24,最初的学习率α可以是10 -4,学习率的衰退系数可以是0.8,并在连续3次迭代目标损失函数的损失至均无改善时认为联合模型已经收敛,联合训练结束。
上述语音识别方法,提出了一种新型的在前端语音分离增强模型和后端语音识别模型之间引入用于中间过渡的鲁棒表征模型的端到端网络架构,这种架构通过引入适当的中间过渡表征学习技术,很好的弥合了以人为导向的语音分离任务和以机器为导向的语音识别任务之间的差异;在这种网络架构中,中间模型借助后端语音识别模型的第二损失函数反向传播完成训练,而语音分离增强模型和语音识别模型可以是预选训练好的,如此可以在较少的迭代训练次数后即可达到收敛;基于前后端模型分别对应损失函数的组合对端到端的网络模型进行联合训练,使得网络架构中每个单独的模型均能够综合学习来自复杂声学环境语音信号中的干扰特征,从而可以保证全局的语音处理任务的性能,提高语音识别准确性;此外,由于网路架构中的每个模型支持灵活独立选择,单独每个模型均可实现最优配置,而无需妥协单独任一模型,从而可以同时兼顾局部的每个语音处理任务的性能,提高语音客观可懂度。
在一个实施例中,如图4所示,上述语音识别方法还包括语音分离增强模型预训练的步骤,具体如下:
S402,基于第一神经网络模型提取样本语音流的估计频谱和嵌入特征矩阵。
其中,第一神经网络网络模型以及下文提及的第二神经网络模型、第三神经网络模型分别可以是上述人工神经网络模型中的任意一种。在本实施例中,第一神经网络网络模型可以是由基于理想比率掩模(IdealRatioMask,IRM)的深度吸引子网络(DeepAttractorNet,DANet)和深度提取网络(DeepExtractorNet,DENet)简化得到的模型。DENet网络中包括一个或多个卷积神经网络。在本实施例中,卷积神经网络可以采用BiLSTM网络。BiLSTM网络用于将语音信号从低维空间映射到高维空间。DANet网络用于在高维空间中嵌入吸引子(Attractor)以结合语音信号中时频信息一起参与训练。在基于SGD的反向传播对DENet和DANet网络进行联合训练过程中,DENet网络和DANet网络并未引入任何时间上的损失。样本语音流可以是在不同复杂声学环境,基于车载系统、电话会议设备、音箱设备或在线广播设备等设备中的语音应用采集到的音频数据流。语音应用可以是系统电话应用、即时通讯应用、虚拟语音助手或者机器翻译应用等。每段样本音频流可以包括多个音频帧。在样本音频流中采集音频帧的采样频率以及每个音频帧的帧长和帧移均可以根据需求自由设定。在一个具体的实施例中,可以采用16kHZ的采样频率,25ms的帧长以及10ms的帧移进行音频帧采集。
具体地,计算机设备可以批量对多个样本语音流进行短时傅里叶变换,得到每个样本语音流中的语音特征和语音频谱。样本音频流的批量大小可以根据需求自由设定,如24等。由于用于语音分离和增强的第一神经网络模型在高维的嵌入空间能较好的完成语音分离和增强。因而,计算机设备将批量样本语音流的语音特征映射至更高维的嵌入空间,转换得到嵌入特征矩阵。计算机设备在嵌入空间,基于第一神经网络模型对语音频谱进行语音分离和增 强,得到估计频谱。估计频谱为第一神经网络模型输出的样本语音流的频谱。
S404,根据嵌入特征矩阵及预设理想掩蔽矩阵,确定样本语音流对应的吸引子。
其中,理想掩蔽矩阵是为了约束语音信号中噪声能量和语音失真能量而建立的自适应感知掩蔽矩阵,记录了不同语音频率对应的掩蔽阈值。理想掩蔽矩阵可以是根据语音信号低维的语音特征和高维的嵌入特征矩阵预测得到的。吸引子是能够表征各样本语音流在嵌入空间所存在的普遍特征的特征向量。基于DANet网络的语音分离增强模型是计算目标语音训练样本在嵌入空间中的向量的加权平均值并存储起来作为目标语音的“吸引子”。因此,在嵌入空间中可以只需要计算一个吸引子。具体地,计算机设备根据语音信号以及通过短时傅里叶变换从语音信号中提取得到的语音频谱,预测批量样本语音流对应的理想掩蔽矩阵。理想掩蔽矩阵与嵌入特征矩阵处于同一维度的嵌入空间。计算机设备计算嵌入特征矩阵与理想掩蔽矩阵的乘积,基于该乘积结果确定嵌入空间的吸引子。
S406,通过计算嵌入特征矩阵中每个矩阵元素与吸引子的相似性,得到样本语音流的目标掩蔽矩阵。
具体地,计算机设备结合语音特征与吸引子的相似性进行掩蔽阈值修正,以对理想掩蔽矩阵进行重构,得到目标掩蔽矩阵。嵌入特征矩阵中每个矩阵元素与吸引子之间相似性的度量方法具体可以采用欧氏距离、曼哈顿距离、切比雪夫距离、马氏距离、余弦距离或汉明距离等。
S408,根据目标掩蔽矩阵确定样本语音流所对应的增强频谱。
其中,在现实声学场景中采集的语音信号通常为目标语音中混入了噪声的混合信号。样本语音流对应的增强频谱可以是语音信号中目标语音的增强频谱。
具体地,为了减少桥接在语音分离增强模型之后的鲁棒表征模型的计算量,将高维的嵌入特征矩阵进行降维处理,转换回低维的增强频谱。
S410,基于样本语音流对应的估计频谱与增强频谱之间的均方误差损失对第一神经网络模型进行训练,得到语音分离增强模型。
具体地,计算机设备计算批量样本语音流的增强频谱与目标语音的增强频谱之间的均方误差损失MSE(mean-square error),通过该均方误差损失MSE来预训练第一神经网络模型:
Figure PCTCN2020128392-appb-000004
其中,M为用于训练的混合信号样本语音流的批量大小,i表示训练样本语音流的索引,||.|| 2表示向量的2-范数,S S表示直接第一神经网络模型输出的样本语音流的估计频谱,
Figure PCTCN2020128392-appb-000005
表示样本语音流的增强频谱。计算机设备将第一损失函数L 1=L MSE产生的梯度反向传播至第一神经网络模型的各个网络层,通过小批量随机梯度下降法更新第一神经网络模型的模型参数Θ extract,当达到预设收敛条件时停止训练,得到语音分离增强模型。
本实施例中,理想比率掩模IRM是一种有效的语音分离增强方法,基于IRM的理想掩蔽 矩阵可以约束语音信号中噪声能量和语音失真能量,结合语音信号对应的高维嵌入特征矩阵以及代表其普遍特征的吸引子对理想掩蔽矩阵进行重构,基于重构的目标掩蔽矩阵进行频谱提取,可以使所提取估计频谱更加接近样本语音流的增强频谱,提高频谱提取有效性。
在一个实施例中,基于第一神经网络模型提取样本语音流的估计频谱和嵌入特征矩阵包括:对样本语音流进行傅里叶变换,得到每个音频帧的语音频谱和语音特征;基于第一神经网络模型对语音频谱进行语音分离和增强,得到估计频谱;基于第一神经网络模型将语音特征映射至嵌入空间,得到嵌入特征矩阵。
其中,在现实声学场景中采集的语音信号通常为混入了噪声的混合信号。可以认为,混合信号x(n)是目标语音信号S S(n)和干扰信号S I(n)的线性叠加:x(n)=S S(n)+S I(n),其中n为样本语音流的数量。对于混合信号和参考目标语音信号进行短时傅里叶变换STFT计算,可以得到混合信号对应的语音频谱和语音特征。语音特征可以是为处于低维的混合信号空间R TF的特征矩阵。通过傅里叶变换提取得到的语音特征的特征维度为TxF维。其中,T为帧数,F为梅尔滤波器组MF中梅尔滤波频带的数量。
DENet通过BiLSTM网络将语音特征从混合信号空间R TF映射到更高维的嵌入空间R TF*K,使得输出变更为嵌入特征矩阵:
Figure PCTCN2020128392-appb-000006
用于高维映射的嵌入向量维度K可以根据自由设定,如40等。
在一个实施例中,第一神经网络模型可以是将窥孔连接(peephole connection)的预设数量BiLSTM模型与一个全连接层级联得到。窥孔连接是区别与常规级联的一种模型连接方式,可以获取到更多的上下文信息。基本形式的BiLSTM单元中,前向LSTM及后向LSTM中门的控制均仅有当前的输入x(t)和前一时刻的短时状态h(t-1)。将不同BiLSTM窥孔连接,可以把前一时刻的长时状态c(t-1)加入遗忘门和输入门控制的输入,当前时刻的长时状态加入输出门的控制输入,可以让各个控制门窥视一下长时状态,从而能够获取更多的上下文信息。比如,在一个具体的实施例中,第一神经网络模型可以采用窥孔连接的四层BiLSTM,每层具有600个隐节点,最后一个BiLSTM层之后连接一个全连接层。全连接层用于将600维的语音特征向量映射为高维的嵌入特征矩阵。假设嵌入特征矩阵的维度K为40,则可以将600维的语音特征向量映射为24000维的嵌入特征向量。本实施例中,将语音信号低纬的语音特征映射为高维的嵌入特征矩阵,可以保证第一神经网络模型进行语音分离及增强的效果。
在一个实施例中,根据嵌入特征矩阵及预设理想掩蔽矩阵,确定样本语音流的吸引子包括:根据语音频谱和语音特征确定理想掩蔽矩阵;基于预设的二元阈值矩阵对理想掩蔽矩阵中噪声元素进行过滤;根据嵌入特征矩阵及过滤了噪声元素的理想掩蔽矩阵,确定样本语音流对应的吸引子。
其中,嵌入空间中吸引子的计算公式可以是:
Figure PCTCN2020128392-appb-000007
其中,a s∈R K,⊙表示矩阵元素乘法,M s=|S s|/|x|为理想掩蔽矩阵,w∈R TF是二元阈值矩阵,二元阈值矩阵计算公式如下:
Figure PCTCN2020128392-appb-000008
二元阈值矩阵w用于排除掉理想掩蔽矩阵中能量太小的矩阵元素,以减小噪声干扰。然后,通过计算吸引子与嵌入特征矩阵中每个矩阵元素之间的相似性,可以估计目标语音的掩蔽矩阵,简称目标掩蔽矩阵:
Figure PCTCN2020128392-appb-000009
最后,目标语音的增强频谱可以通过下面的计算方式提取出来:
Figure PCTCN2020128392-appb-000010
在一个实施例中,在第一神经网络模型训练阶段计算出来的吸引子被存储下来,并计算这些吸引子的均值,将该均值作为测试生产阶段的全局吸引子来提取测试的目标语音流的增强频谱。
本实施例中,过滤掉理想掩蔽矩阵中的噪声元素之后进行吸引子计算,可以提高吸引子计算准确性,使所计算吸引子更好的反映语音数据的语音特征。
在一个实施例中,如图5所示,上述语音识别方法还包括中间模型的构建步骤,具体如下:
S502,获取第二神经网络模型。
其中,第二神经网络模型是桥接在前端语音分离增强模型和后端语音识别模型之间的模型。本申请所面临的声学环境是非常复杂的,需要在输入的频谱图是包含了谱估计误差和时态失真的有缺陷频谱的情况下,最小化来自前端的语音识别误差影响。此外,帧级别的频谱图提取和音素级别的语音识别任务之间的上下文差异也增加了前后端语音处理任务融合的时间动态复杂性。换言之,本申请提供基于第二神经网络模型桥接训练得到的联合模型能够适应更多复杂的声学环境。为了有能力适应来自自下而上和自上而下的时间动态影响,本申请的实施例中,第二神经网络模型使用更复杂的Recurrent模型架构。典型地Recurrent模型架构包括能够使用输入频谱图的上下文来预测输出声学特征空间中的点的模型结构,如深层卷积神经网络CNN或者BiLSTM等。其中,BiLSTM模型通常称为通用程序近似器,能够通过有效估计完整序列的条件后验概率来学习中间表征,而不需要对其分布做出任何明确的假设。下文以第二神经网络模型采用BiLSTM模型结构ψ BiLSTM(·)为例进行描述。
在一个实施例中,第二神经网络模型可以是将预设数量BiLSTM模型窥孔连接得到。比如,在一个具体的实施例中,第二神经网络模型可以采用窥孔连接的两层BiLSTM,每层具有600个隐节点。
S504,对第二神经网络模型进行非负约束处理,得到非负神经网络模型。
其中,非负约束处理是能够保证第二神经网络模型非负的处理步骤。基于梅尔滤波器桥接前后端模型时,梅尔滤波器输出的滤波器组Fbanks是非负的,而标准BiLSTM的输出是没有非负限制的。为了贴合专家定义的声学特征,本申请的实施例对第二神经网络模型进行非负约束处理。
在一个实施例中,对第二神经网络模型进行非负约束处理包括:对第二神经网络模型进行平方运算;第二神经网络模型包括双向长短期记忆网络模型。
具体地,计算机设备在第二神经网络模型的输出上加上一个平方处理,以匹配Fbanks的非负性。经过评测,发现平方处理不但计算逻辑简短,且相比线性整流函数(Rectified Linear Unit,ReLU)等激活函数对第二神经网络模型进行非线性变换的效果更优。
S506,获取用于对非负神经网络模型输出的声学特征进行听觉适配的微分模型;将微分模型与非负神经网络模型级联,得到中间模型。
其中,听觉适配是指通过模拟人耳运算,使声学特征符合人耳听觉习惯。微分模型是模拟人耳运算的运算公式。经研究发现,对于频谱幅度差值非常大的高幅值语音信号和低幅值语音信号,人耳所能感受到的差异可能并不如幅度差值这么明显。比如,对于幅值1000和10的两个语音信号,人耳能够感知到的差异可能只是诸如3和1的差异。此外,人耳对语音信号中的变化比较敏感。
具体地,计算机设备获取预先构建的微分模型,将微分模型作为对非负神经网络模型输出的声学特征进行听觉匹配优化处理步骤,级联在非负神经网络模型之后,得到中间模型。也就是说,中间模型包括非负神经网络模型和微分模型。如此,将模拟人耳运算的逻辑以微分模型的方式体现,在训练阶段,第二神经网络模型无需进行模拟人耳运算逻辑方面的学习,降低第二神经网络模型学习复杂度,有助于提高中间模型训练效率。值得强调的是,在另一个实施例中,可以直接基于第二神经网络模型作为中间模型,而无需对第二神经网络模型的非负约束处理,也无需进行微分模型的拼接。此时,在训练阶段,第二神经网络模型需要自行学习模拟人耳运算逻辑。经测试发现,相比根据专家经验确定的非负约束处理逻辑以及微分模型,基于第二神经网络模型自行学习,反而能够学习到更加全面的模拟人耳运算逻辑,实现更好的听觉匹配效果。在测试生产阶段训练完毕的第二神经网络模型(即鲁棒表征模型)能够适应更多更复杂的声学环境。本实施例中,对第二神经网络模型进行非负约束处理,并拼接用于模拟人耳运算的微分模型,可以使模型输出的声学特征更加贴合实际人耳听觉习惯,进而有助于提高整个EAR系统的语音识别性能。
在一个实施例中,上述语音识别方法还包括:获取用于对非负神经网络模型输出的声学特征进行听觉适配的微分模型包括:获取用于对声学特征对应特征向量进行对数运算的对数模型;获取用于对声学特征对应特征向量进行差分运算的差分模型;根据对数模型与差分模型构建微分模型。其中,对数模型是用于对非负神经网络模型输出的声学特征的特征向量元素进行求对数运算。对数模型可以是任意能够实现元素对数运算的模型,如lg x,ln x等,其中x为声学特征向量元素。如上文,对于频谱幅度差值非常大的高幅值语音信号和低幅值语音信号,人耳所能感受到的差异可能并不如幅度差值这么明显。基于对数模型对声学特征的特征向量元素进行求对数运算,能够弱化赋值之间的差异,使其声学特征不同向量元素之间的差异更好的反应人耳实际所能感受出的信号差异。比如,在上述举例中,对于幅值1000 和10的两个语音信号,经过lg x对数运算后,向量元素1000转换为3,向量元素10转换为1,很好的反应了人耳实际所能感受出的信号差异。差分模型用于对非负神经网络模型输出的声学特征的特征向量元素记性差分运算。差分模型可以是任意能够实现元素差分运算的模型,如一阶差分运算和二阶差分运算等。如上文,人耳对语音信号中的变化比较敏感。基于差分模型对声学特征的特征向量元素进行差分运算,差分的结果反映了声学特征不同向量元素之间的变化。
具体地,计算机设备可以将对数模型和差分模型作为并列的两个模型构建微分模型,也可以将对数模型和差分模型进行级联构建微分模型。对数模型与差分模型的级联顺序可以是对数模型级联在差分模型之后,也可以是差分模型级联在对数模型之后。可以理解,微分模型还可以包括其他用于听觉适配的模型,对此不作限制。计算机设备在预训练好语音识别模型后,固定语音识别模型的模型参数,继续使用干净语音的频谱作为训练数据,通过直接反向传播识别第二损失函数L 2来训练中间模型。
Figure PCTCN2020128392-appb-000011
其中,Θ adapt为中间模型的模型参数,
Figure PCTCN2020128392-appb-000012
为对第二神经网络模型进行非负约束处理并拼接微分模型得到的中间模型;
Figure PCTCN2020128392-appb-000013
为第二神经网络模型本身。
在一个实施例中,为了实现更快的收敛和更好的泛化,计算机设备还可以对声学特征的向量元素执行全局均值方差归一化处理。归一化处理所采用的方法具体可以是01标准化、Z-score标准化或者sigmoid函数标准化等。
在一个实施例中,为了实现更好的语音平滑效果,计算机设备还可以拼接以样本音频流中当前音频帧为中心的2W+1帧的上下文窗口中每个音频帧的声学特征。其中,W表示单侧上下文窗口的大小,具体大小可以根据需求自由设定,如5。
本实施例中,对非负神经网络模型进行求对数运算,可以使语音信号声学特征不同向量元素之间的差异更好的反应人耳实际所能感受出的信号差异;对非负神经网络模型进行差分运算,可以反映声学特征不同向量元素之间的变化,进而适配人耳对语音信号中的变化比较敏感的听觉特征。
在一个实施例中,如图6所示,上述语音识别方法还包括语音识别模型预训练的步骤,具体如下:
S602,获取样本语音流及对应标注的音素类别。
其中,样本语音流中每个音频帧具有对应的标注数据。标注数据包括根据音频帧中目标语音的输出用户或者语音内容而确定的音频帧对应的音素类别。
S604,通过第三神经网络模型提取样本语音流中每个音频帧的深度特征。
其中，在本实施例中，第三神经网络模型可以是基于CLDNN（CONVOLUTIONAL, LONG SHORT-TERM MEMORY, FULLY CONNECTED DEEP NEURAL NETWORKS，即将CNN、LSTM和DNN融合得到的网络）实现的声学模型。其中CNN层和LSTM层的输出均可以进行批量归一化，以达到更快的收敛和更好的泛化。
具体地,计算机设备通过第三神经网络模型提取样本语音流中每个音频帧的深度特征。第三神经网络模型包括Softmax层。计算机设备可以基于Softmax层确定鲁棒特征向量元素属于每种音素类别的概率。
在一个实施例中,可以拼接以样本音频流中当前音频帧为中心的2W+1帧的上下文窗口中每个音频帧的深度特征,将拼接结果作为当前音频帧的深度特征。如此,能够获得反映上下文信息的深度特征,有助于提高第三神经网络模型的精确性。
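作为参考，下面给出一个CLDNN式声学模型骨架的示意代码（PyTorch草图；卷积核大小、层数、隐节点数、批量归一化位置与输出类别数均为示例性假设，并非本申请的确切配置），其倒数第二层的输出即可作为深度特征：

```python
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    """CNN -> LSTM -> DNN 的声学模型骨架, 倒数第二层输出作为深度特征 u_t."""
    def __init__(self, in_dim=120, n_classes=20000, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256), nn.ReLU())          # CNN层输出做批量归一化
        self.lstm = nn.LSTM(256, hidden, num_layers=2, batch_first=True)
        self.bn = nn.BatchNorm1d(hidden)             # LSTM层输出做批量归一化
        self.embed = nn.Linear(hidden, 512)          # 倒数第二层: 深度特征
        self.cls = nn.Linear(512, n_classes)         # softmax 前的输出层, 约2万个音素类别

    def forward(self, x):                            # x: (B, T, in_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        h = self.bn(h.transpose(1, 2)).transpose(1, 2)
        u = torch.relu(self.embed(h))                # 深度特征
        logits = self.cls(u)
        return u, logits

model = CLDNN()
u, logits = model(torch.rand(2, 50, 120))
print(u.shape, logits.shape)                         # (2, 50, 512) (2, 50, 20000)
```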
S606,根据所有音素类别的音频帧对应的深度特征,确定样本语音流的中心向量。
S608,基于深度特征和中心向量确定每个音频帧的类间混淆衡量指数与类内距离惩罚指数之间的融合损失。
其中，中心向量用于描述目标类别中所有深度特征的中心。音频帧的类间混淆衡量指数是指用于表征样本语音流归属于目标类别的可能性的参数，能够反映不同目标类别之间的区分性。类间混淆衡量指数越小，表明类间区分性越强。类间混淆衡量指数可以通过欧几里得距离计算得到，也可以采用其他距离类型算法计算得到，比如角度距离等。类内距离惩罚指数是指用于表征样本语音流的类内分布紧凑性的参数。通过类内距离的惩罚可以增强第三神经网络模型的分类性能，即通过类内分布紧凑来满足类内鉴别性能。类内距离惩罚指数越小，表明类内分布的紧凑性越强，进而可以获得类内鉴别性能的增强。类内距离惩罚指数可以通过中心损失函数实现，但也不局限于此，比如也可通过采用角度距离的Contrastive损失函数、Triplet损失函数、SphereFace损失函数和CosFace损失函数等实现。
具体地,计算机设备将类间混淆衡量指数与类内距离惩罚指数融合的方式是按照预设的权重因子,对类间混淆衡量指数与类内距离惩罚指数进行加权计算:
L_CL = L_ce + λ_CL·L_ct

其中，L_CL为融合损失，L_ce为类间混淆衡量指数，L_ct为类内距离惩罚指数，λ_CL为权重因子。
S610,基于融合损失对第三神经网络模型进行训练,得到语音识别模型。
具体地,计算机设备按照预设的深度学习优化算法确定目标损失函数产生的全局下降梯度。目标损失函数产生的全局下降梯度从语音识别模型依次反向传播至鲁棒表征模型和语音分离增强模型的网络各层:
L_EAR = λ_SS·L_1 + L_2，∂L_EAR/∂Θ = λ_SS·∂L_1/∂Θ + ∂L_2/∂Θ

EAR系统中，基于DENet网络的提取器会通过BiLSTM网络产生高维的嵌入特征矩阵V来预测适合目标语音的目标浮值掩蔽M̂。利用第一损失函数L_1可以计算提取器输出的估计频谱和目标语音的增强频谱之间的均方误差MSE，并产生针对目标语音的鲁棒特征，鲁棒特征能继续经过适配器和识别器来预测语音单元。为了让提取器的模型参数尽可能在准确估计目标语音频谱的同时降低语音识别的错误率，本申请以多任务联合训练的方式更新DENet网络的参数，其中多任务联合损失函数（即目标损失函数）是语音分离任务的第一损失函数和语音识别的第二损失函数的加权组合。由于DENet网络的前向过程同时能计算交叉熵和中心损失的加权以及频谱均方误差，使得能够以反向传播得到各损失函数对模型参数的梯度。在加入加权因子后，能够调整在多任务训练时语音分离任务的“重要性”。本实施例中，基于中心损失能够学习和更新每个类别在深度特征空间的中心点，通过惩罚深度特征与其对应目标类别的中心点之间的类内距离，可以显著降低语音识别在未见声学环境下的错误率，有效提高了语音识别对噪声可变性的泛化能力，进而在干净语音条件下、训练已见声学环境以及未见声学环境下均可获得较低的错误率；使语音识别在新的声学环境下具有较好的鲁棒性，即便在新的声学环境中遇到不同用户的新口音和背景噪声，也能够稳定可靠地完成语音识别。
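下面以单个训练步为例示意多任务联合损失的构成与全局反向传播（PyTorch草图；extractor、adapter、recognizer分别代表语音分离增强模型、鲁棒表征模型与语音识别模型，其输入输出接口、损失权重λ_SS与λ_CL均为示例性假设，需结合前文各模型的具体实现使用）：

```python
import torch

def joint_train_step(extractor, adapter, recognizer, optimizer,
                     mix_spec, enhanced_target, labels,
                     center_loss_fn, ce_loss_fn, lambda_ss=0.1, lambda_cl=0.01):
    """目标损失 = lambda_ss * 频谱MSE(第一损失) + 交叉熵 + lambda_cl * 中心损失(第二损失)."""
    est_spec = extractor(mix_spec)                         # 估计频谱
    l1 = torch.nn.functional.mse_loss(est_spec, enhanced_target)   # 第一损失函数: 均方误差
    robust_feat = adapter(est_spec)                        # 鲁棒特征
    deep_feat, logits = recognizer(robust_feat)            # 深度特征与逐帧 logits
    l_ce = ce_loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    l_ct = center_loss_fn(deep_feat.reshape(-1, deep_feat.size(-1)), labels.reshape(-1))
    loss = lambda_ss * l1 + l_ce + lambda_cl * l_ct        # 目标损失函数(多任务加权组合)
    optimizer.zero_grad()
    loss.backward()                                        # 全局下降梯度依次回传至三个子模型
    optimizer.step()
    return loss.item()
```

其中λ_SS控制语音分离任务在联合训练中的相对“重要性”，取0即退化为仅优化识别损失的单任务训练。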
在一个实施例中,基于深度特征和中心向量确定每个音频帧的类间混淆衡量指数与类内距离惩罚指数的融合损失包括:将深度特征输入交叉熵函数,计算得到各音频帧的类间混淆衡量指数;将深度特征和中心向量输入中心损失函数,计算得到每个音频帧的类内距离惩罚指数;将类间混淆衡量指数与类内距离惩罚指数进行融合运算,得到融合损失。
其中,交叉熵函数用于保证深度特征的类间区分性。交叉熵函数的计算公式可以如下:
L_ce = −Σ_{m=1}^{M} Σ_{t=1}^{T} log ŷ_{t,K_t}

其中，L_ce为类间混淆衡量指数，M为用于训练的样本语音流的批量大小，T为样本语音流中音频帧的帧数，K_t为第t帧音频帧标注的音素类别；ŷ_{t,i}为第三神经网络模型输出层进行softmax操作之后第i个节点的输出，第三神经网络模型中有K个输出节点，代表K个输出类别：

ŷ_{t,i} = exp((W·a_t + B)_i) / Σ_{k=1}^{K} exp((W·a_t + B)_k)

其中，a_t为第三神经网络模型softmax层的前一层在第t音频帧时刻的输出；a_{t,j}为softmax层的前一层第j个结点在第t音频帧时刻的输出，W为softmax层的权重矩阵，B为softmax层的偏置向量。
中心损失函数的计算公式可以如下:
L_ct = (1/2)·Σ_{i=1}^{M} Σ_{t=1}^{T} ||u_t − c_{K_t}||²

其中，L_ct为类内距离惩罚指数；u_t为第t帧音频帧的深度特征，即第三神经网络模型中倒数第二层在第t个音频帧时刻的输出；c_{K_t}表示第K_t类深度特征的中心向量，i为样本语音流的索引。在所进行的中心损失计算过程中，其目标是尽可能减小音频帧的深度特征相对其中心向量的距离，即类内距离||u_t − c_{K_t}||越小越好。
具体地，计算机设备将交叉熵损失函数和中心损失函数进行融合，得到语音识别模型对应的第二损失函数。在一个实施例中，将交叉熵损失函数和中心损失函数融合的方式可以是按照预设的权重因子，对交叉熵损失函数和中心损失函数进行加权计算：

L_CL = L_ce + λ_CL·L_ct

其中，L_CL为第二损失函数，λ_CL为控制中心损失函数在第二损失函数中所占权重的超参数。对应的，将类间混淆衡量指数与类内距离惩罚指数融合的方式是按照预设的权重因子λ_CL，对类间混淆衡量指数与类内距离惩罚指数进行加权计算。
本实施例中,采用中心损失函数能够学习和更新每个类别在深度特征空间的中心点,通过惩罚深度特征与其对应类的中心点之间的距离,从而提高深度特征的区分能力。
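下面给出交叉熵损失与中心损失融合的示意实现（PyTorch草图；中心向量以可学习参数维护并随反向传播更新，类别数、特征维度与权重λ_CL均为示例取值）：

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """维护每个音素类别在深度特征空间的中心向量, 惩罚深度特征到其类中心的距离."""
    def __init__(self, num_classes=20000, feat_dim=512):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, deep_feat, labels):                  # deep_feat: (N, D), labels: (N,)
        c = self.centers[labels]                           # 每帧对应类别的中心向量 c_{K_t}
        return 0.5 * ((deep_feat - c) ** 2).sum(dim=1).mean()

def fused_loss(logits, deep_feat, labels, center_loss, lambda_cl=0.01):
    l_ce = nn.functional.cross_entropy(logits, labels)     # 类间混淆衡量指数
    l_ct = center_loss(deep_feat, labels)                  # 类内距离惩罚指数
    return l_ce + lambda_cl * l_ct                         # L_CL = L_ce + λ_CL·L_ct

# 用法示例
N, D, K = 32, 512, 100
cl = CenterLoss(num_classes=K, feat_dim=D)
logits, feat = torch.randn(N, K), torch.randn(N, D)
labels = torch.randint(0, K, (N,))
print(float(fused_loss(logits, feat, labels, cl)))
```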
在一个具体的实施例中,如图7所示,该语音识别方法具体包括以下步骤:
S702,对样本语音流进行傅里叶变换,得到每个音频帧的语音频谱和语音特征。
S704,基于第一神经网络模型对语音频谱进行语音分离和增强,得到估计频谱。
S706,基于第一神经网络模型将语音特征映射至嵌入空间,得到嵌入特征矩阵。
S708,根据语音频谱和语音特征确定理想掩蔽矩阵。
S710,基于预设的二元阈值矩阵对理想掩蔽矩阵中噪声元素进行过滤。
S712,根据嵌入特征矩阵及过滤了噪声元素的理想掩蔽矩阵,确定样本语音流对应的吸引子。
S714,通过计算嵌入特征矩阵中每个矩阵元素与吸引子的相似性,得到样本语音流的目标掩蔽矩阵。
S716,根据目标掩蔽矩阵确定样本语音流所对应的增强频谱。
S718,基于第一损失函数计算样本语音流对应的估计频谱与增强频谱之间的均方误差损失。
S720,根据均方误差损失对第一神经网络模型进行训练,得到语音分离增强模型。
S722,获取样本语音流及对应标注的音素类别。
S724,通过第三神经网络模型提取样本语音流中每个音频帧的深度特征。
S726,根据所有音素类别的音频帧对应的深度特征,确定样本语音流的中心向量。
S728,将深度特征输入交叉熵函数,计算得到各音频帧的类间混淆衡量指数。
S730,将深度特征和中心向量输入中心损失函数,计算得到每个音频帧的类内距离惩罚指数。
S732,将类间混淆衡量指数与类内距离惩罚指数进行融合运算,得到基于第二损失函数的融合损失。
S734,基于融合损失对第三神经网络模型进行训练,得到语音识别模型。
S736,获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数。
S738,获取第二神经网络模型。
S740,对第二神经网络模型进行非负约束处理,得到非负神经网络模型。
S742,获取用于对非负神经网络模型输出的声学特征进行听觉适配的微分模型。
S744,将微分模型与非负神经网络模型级联,得到中间模型。
S746，基于第二损失函数进行反向传播，以对桥接在语音分离增强模型和语音识别模型之间的中间模型进行训练，得到鲁棒表征模型。
S748,对第一损失函数和第二损失函数进行融合,得到目标损失函数。
S750,确定目标损失函数产生的全局下降梯度。
S752,根据全局下降梯度对语音分离增强模型、鲁棒表征模型及语音识别模型分别对应的模型参数进行迭代更新,直至获得目标损失函数的最小化损失值。
经鲁棒表征模块ψ_BiLSTM(·)连接前端的语音分离增强模型和后端的语音识别模型，使整个EAR系统成为一个可以实现端到端反向传播的网络，并且由于模块化架构，整个EAR系统的网络可以采用“课程表”训练方法（Curriculum learning），即基于后端语音识别模型的损失函数反向传播对鲁棒表征模型进行单独训练，然后端到端地对整个EAR系统网络进行联合训练。由于可以在预训练的语音分离增强模型和语音识别模型基础上进行训练，采用“课程表”训练方法可以快速实现收敛。
上述语音识别方法,强大的网络结构以及“课程表”训练方式,使得基于本申请提供的语音识别方法训练得到的联合模型,学习能力极强,通过提取鲁棒有效的语音增强和语音分离表征来提高自动语音识别的性能,能够适应任何具有挑战性的复杂干扰声学环境。
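“课程表”训练的两阶段流程可用如下示意代码概括（PyTorch草图；三个子模型、数据迭代器loader与两个损失计算函数均假设已按前文定义，轮数与学习率为示例取值，仅示意“先单独训练适配器、再端到端联合训练”的参数冻结与切换逻辑）：

```python
import itertools
import torch

def curriculum_train(extractor, adapter, recognizer, loader,
                     loss_stage1, loss_joint, epochs1=5, epochs2=10, lr=1e-4):
    # 阶段一: 固定预训练好的语音分离增强模型与语音识别模型, 仅用识别损失训练适配器
    for p in itertools.chain(extractor.parameters(), recognizer.parameters()):
        p.requires_grad = False
    opt1 = torch.optim.Adam(adapter.parameters(), lr=lr)
    for _ in range(epochs1):
        for batch in loader:
            opt1.zero_grad()
            loss_stage1(extractor, adapter, recognizer, batch).backward()
            opt1.step()
    # 阶段二: 解冻全部参数, 基于目标损失函数端到端联合训练整个EAR网络
    all_params = list(itertools.chain(extractor.parameters(),
                                      adapter.parameters(),
                                      recognizer.parameters()))
    for p in all_params:
        p.requires_grad = True
    opt2 = torch.optim.Adam(all_params, lr=lr)
    for _ in range(epochs2):
        for batch in loader:
            opt2.zero_grad()
            loss_joint(extractor, adapter, recognizer, batch).backward()
            opt2.step()
```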
如图8所示,在一个实施例中,提供了一种语音识别方法。本实施例主要以该方法应用于计算机设备来举例说明,该计算机设备具体可以是上图中的终端110或者服务器120。终端110和服务器120均可单独用于执行本申请实施例中提供的语音识别方法。终端110和服务器120也可协同用于执行本申请实施例中提供的语音识别方法。参照图8,该语音识别方法具体包括如下步骤:
S802,获取目标语音流。
其中，目标语音流可以是在任一实际声学环境采集到的音频数据流。目标语音流可以是预先采集并存储在计算机设备中的，也可以是计算机设备动态采集得到的。比如，目标语音流可以是基于游戏应用采集的用户在游戏语音通话过程中产生的音频数据流。此时，目标语音流中可能包括游戏背景音乐和远端人声回声等干扰。具体地，计算机设备获取目标语音流，并按照预设的采样频率在目标语音流中采集音频帧。每个音频帧的帧长以及相邻音频帧之间的帧移均可以根据需求自由设定。在一个具体的实施例中，计算机设备基于16kHz的采样频率，25ms的帧长以及10ms的帧移进行音频帧采集。
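按16kHz采样、25ms帧长与10ms帧移进行分帧并做短时傅里叶变换得到语音频谱的过程可示意如下（numpy草图；加窗方式与幅度谱的取法为示例性假设）：

```python
import numpy as np

def stft_frames(wave, sr=16000, frame_ms=25, hop_ms=10):
    """按帧长25ms、帧移10ms对波形分帧并做FFT, 返回幅度谱 (T, F)."""
    frame_len = int(sr * frame_ms / 1000)                  # 400 点
    hop = int(sr * hop_ms / 1000)                          # 160 点
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(wave) - frame_len) // hop)
    spec = np.stack([np.abs(np.fft.rfft(wave[t * hop:t * hop + frame_len] * window))
                     for t in range(n_frames)], axis=0)
    return spec                                            # F = frame_len // 2 + 1 = 201

wave = np.random.randn(16000).astype(np.float32)           # 1 秒音频
print(stft_frames(wave).shape)                             # (98, 201)
```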
S804,基于语音分离增强模型提取目标语音流中每个音频帧的增强频谱。
其中，语音分离增强模型是一种神经网络模型，具体可以是基于理想比率掩模(Ideal Ratio Mask，IRM)的深度吸引子网络(Deep Attractor Net，DANet)和深度提取网络(Deep Extractor Net，DENet)简化得到的模型。在一个具体的实施例中，语音分离增强模型可以采用窥孔连接的四层BiLSTM，每层具有600个隐节点，最后一个BiLSTM层之后连接一个全连接层。具体地，计算机设备可以批量对多个目标语音流进行短时傅里叶变换，得到每个目标语音流中的语音特征和语音频谱。计算机设备基于语音分离增强模型将批量目标语音流的语音特征映射至更高维的嵌入空间，在嵌入空间对语音频谱进行语音分离和增强，得到嵌入特征矩阵。计算机设备获取预存储的全局吸引子。在语音分离增强模型训练阶段，计算机设备将根据每次批量样本语音流计算出来的吸引子存储下来，并计算这些吸引子的均值，将该均值作为测试生产阶段的全局吸引子。计算机设备通过计算全局吸引子与目标语音流对应的嵌入特征矩阵中每个矩阵元素之间的相似性，得到目标语音流的目标掩蔽矩阵。基于目标掩蔽矩阵以及嵌入特征矩阵，可以提取得到目标语音流的增强频谱。
S806,基于鲁棒表征模型对增强频谱进行听觉匹配,得到鲁棒特征。
其中,鲁棒表征模型是桥接在前端语音分离增强模型和后端语音识别模型之间的一种神经网络模型,具体可以是基于Recurrent模型架构的CNN、BiLSTM等,有能力适应来自自下而上和自上而下的时间动态影响。在一个具体实施例中,鲁棒表征模型可以是窥孔连接的两层BiLSTM,每层具有600个隐节点。鲁棒特征是用于对前端的语音分离增强模型输出的增强频谱进行转换,得到的一种中间过渡特征,该中间过渡特征作为后端语音识别模型的输入。
具体地，计算机设备基于鲁棒表征模型在增强频谱中提取声学特征。为了贴合人耳听觉习惯，鲁棒表征模型对增强频谱的声学特征进行听觉匹配。计算机设备基于鲁棒表征模型对声学特征进行非负约束处理，对非负约束处理后的声学特征进行求对数和差分等微分运算，得到鲁棒特征。比如，对于频谱幅度差值非常大的高幅值语音信号和低幅值语音信号，人耳所能感受到的差异可能并不如幅度差值这么明显。基于对数模型对声学特征的特征向量元素进行求对数运算，能够弱化幅值之间的差异，使声学特征不同向量元素之间的差异更好地反映人耳实际所能感受到的信号差异。人耳对语音信号中的变化比较敏感。基于差分模型对声学特征的特征向量元素进行差分运算，差分的结果反映了声学特征不同向量元素之间的变化。
S808,基于语音识别模型对鲁棒特征进行识别,得到每个音频帧对应的音素;其中,语音分离增强模型、鲁棒表征模型及语音识别模型联合训练得到。
其中，语音识别模型以及上文提及的语音分离增强模型、鲁棒表征模型可以是预先联合训练得到的。前端语音分离增强模型和后端语音识别模型可以是预先训练好的。计算机设备获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数，基于第二损失函数计算损失值，根据损失值进行反向传播，以对桥接在语音分离增强模型和语音识别模型之间的中间模型进行训练，得到鲁棒表征模型。计算机设备进一步对第一损失函数和第二损失函数进行融合，基于融合得到的目标损失函数对语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练，在满足预设收敛条件时结束训练。具体地，计算机设备将鲁棒特征输入语音识别模型，得到目标语音流对应的音素。在本申请的实施例中，语音识别模型能够识别的音素类别约2万种。语音识别模型对输入的批量目标语音流的鲁棒特征进行处理，输出一个约2万维的音素向量。鲁棒特征向量元素和音素向量元素之间存在对应关系。音素向量记录了鲁棒特征向量元素属于每种音素类别的概率，如此可以确定每个鲁棒特征向量元素对应的最大概率音素类别所对应的音素串，从而实现从音素级别对目标语音流进行语音识别。
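识别阶段由鲁棒特征得到逐帧音素类别的过程可示意如下（PyTorch草图；假设识别器按前文骨架返回深度特征与逐帧logits，音素类别数约2万为正文所述）：

```python
import torch

@torch.no_grad()
def decode_phonemes(recognizer, robust_feat):
    """robust_feat: (B, T, D) -> 每帧取后验概率最大的音素类别索引."""
    _, logits = recognizer(robust_feat)                    # (B, T, num_phones), 约2万类
    posteriors = torch.softmax(logits, dim=-1)             # 逐帧音素后验概率向量
    return posteriors.argmax(dim=-1)                       # (B, T) 每帧对应的音素类别
```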
上述语音识别方法，提出了一种新型的在前端语音分离增强模型和后端语音识别模型之间引入鲁棒表征模型的端到端网络架构，这种架构通过引入适当的中间过渡表征学习技术，很好地弥合了以人为导向的语音分离任务和以机器为导向的语音识别任务之间的差异；对端到端的网络模型进行联合训练，使得网络架构中每个单独的模型均能够综合学习来自复杂声学环境语音信号中的干扰特征，从而可以保证全局的语音处理任务的性能，提高语音识别准确性；此外，由于网络架构中的每个模型支持灵活独立选择，单独每个模型均可实现最优配置，而无需妥协单独任一模型，从而可以同时兼顾局部的每个语音处理任务的性能，提高语音客观可懂度。
在一个实施例中，语音分离增强模型包括第一神经网络模型；基于语音分离增强模型提取目标语音流中每个音频帧的增强频谱包括：基于第一神经网络模型提取目标语音流中每个音频帧的嵌入特征矩阵；根据嵌入特征矩阵及预设理想掩蔽矩阵，确定目标语音流对应的吸引子；通过计算嵌入特征矩阵中每个矩阵元素与吸引子的相似性，得到目标语音流的目标掩蔽矩阵；根据目标掩蔽矩阵确定目标语音流中每个音频帧所对应的增强频谱。语音分离增强模型可以是基于第一神经网络模型训练得到的。基于语音分离增强模型提取目标语音流中每个音频帧的增强频谱的过程可参阅上述步骤S402-S410的描述，在此不再赘述。
在一个实施例中,鲁棒表征模型包括第二神经网络模型和微分模型;基于鲁棒表征模型对增强频谱进行听觉匹配,得到鲁棒特征包括:基于第二神经网络模型在增强频谱中提取声学特征;对声学特征进行非负约束处理,得到非负的声学特征;通过微分模型对非负的声学特征进行微分运算,得到与人耳听觉习惯相匹配的鲁棒特征。中间模型可以是第二神经网络模型和微分模型拼接得到的,对中间模型训练得到鲁棒表征模型。基于鲁棒表征模型提取鲁棒特征可以参考上述步骤S502-S506的描述,在此不再赘述。
在一个实施例中,对来自“受背景音乐干扰”和“受其他说话人干扰”两种声学环境的语音,在五种不同SNR信噪比条件下(0dB,5dB,10dB,15dB和20dB),对基于不同语音识别方法的字错误率(WER)进行测试对比。测试结果可参考图9a所示,基于本申请提出的EAR系统进行语音识别,无论是在单任务λ SS=0状态下,还是在多任务λ SS≠0(如λ SS=0.1)下,其字错误率一致地优于其他语音识别系统,如基于干净语音或者有干扰语音训练的语音识别模型ASR,语音分离增强模型SS及语音识别模型ASR的级联系统。
参考图9b，图9b示出了一个实施例中在单通道多说话人语音识别任务中在不同SNR信噪比条件下不同语音识别系统的性能比较示意图。如图9b所示，在不同的多任务训练权重下，本申请提出的EAR系统相比直接将语音分离增强模型作为语音识别模型预处理步骤进行级联的方式，无论短时客观可懂度STOI还是字错误率WER(%)均表现优良，即EAR系统在显著提高以机器为导向的语音清晰度(WER)的同时，还能保持反映人类听觉方面的语音可懂度(STOI)，可以达到和DENet作为专用SS模型的性能相当或甚至更好。
本申请实施例首次揭示了引入适当的中间过渡表征学习在弥合以人为导向和以机器为导向的语音处理任务之间差异过程中的重要性,可以同时保证前端语音分离局部任务和后端语音识别局部任务的最优性能(人的主观听觉可懂度)和全局任务的最优性能(机器的识别准确率指标方面的性能)。比如,在游戏实时语音的应用场景,在组队语音通话时,既有近端讲话的人声也有游戏过程中手机播放的背景音,本文提出的EAR系统由于基于鲁棒表征模型更好的进行了噪声约束,可以解决用户在游戏语音通话过程中游戏背景音和远端人声的回声干扰。背景音回声消除保证了用户之间语音通话的质量。除了性能的显著提升外,本申请所提出的EAR系统框架具有高灵活性:允许灵活地集成任何先进的语音分离增强模型和语音识别模型替换到EAR系统框架中的相应模块,并且我们提出的端到端可训练的框架不会以任何单个模块性能受损作为代价。
在一个具体的实施例中,如图10所示,该语音识别方法具体包括以下步骤:
S1002,获取目标语音流。
S1004,基于语音分离增强模型提取目标语音流中每个音频帧的嵌入特征矩阵。
S1006,根据嵌入特征矩阵及预设理想掩蔽矩阵,确定目标语音流对应的吸引子。
S1008,通过计算嵌入特征矩阵中每个矩阵元素与吸引子的相似性,得到目标语音流的目标掩蔽矩阵。
S1010,根据目标掩蔽矩阵确定目标语音流中每个音频帧所对应的增强频谱。
S1012,获取鲁棒表征模型;鲁棒表征模型包括第二神经网络模型和微分模型。
S1014,基于第二神经网络模型在增强频谱中提取声学特征。
S1016,对声学特征进行非负约束处理,得到非负的声学特征。
S1018,通过微分模型对非负的声学特征进行微分运算,得到与人耳听觉习惯相匹配的鲁棒特征。
S1020,基于语音识别模型对鲁棒特征进行识别,得到每个音频帧对应的音素;其中,语音分离增强模型、鲁棒表征模型及语音识别模型联合训练得到。
上述语音识别方法，提出了一种新型的在前端语音分离增强模型和后端语音识别模型之间引入鲁棒表征模型的端到端网络架构，这种架构通过引入适当的中间过渡表征学习技术，很好地弥合了以人为导向的语音分离任务和以机器为导向的语音识别任务之间的差异；对端到端的网络模型进行联合训练，使得网络架构中每个单独的模型均能够综合学习来自复杂声学环境语音信号中的干扰特征，从而可以保证全局的语音处理任务的性能，提高语音识别准确性；此外，由于网络架构中的每个模型支持灵活独立选择，单独每个模型均可实现最优配置，而无需妥协单独任一模型，从而可以同时兼顾局部的每个语音处理任务的性能，提高语音客观可懂度。
上述流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,上述流程图的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
如图11所示,在一个实施例中,提供了语音识别装置1100,包括中间表征学习模块1102、损失融合模块1104和联合训练模块1106,其中,
中间表征学习模块1102,用于获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数;基于第二损失函数进行反向传播,以对桥接在语音分离增强模型和语音识别模型之间的中间模型进行训练,得到鲁棒表征模型。
损失融合模块1104,用于对第一损失函数和第二损失函数进行融合,得到目标损失函数。
联合训练模块1106,用于基于目标损失函数对语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练,在满足预设收敛条件时结束训练。
在一个实施例中,如图12所示,上述语音识别装置1100还包括语音分离增强模型预训练模块1108,用于基于第一神经网络模型提取样本语音流的估计频谱和嵌入特征矩阵;根据嵌入特征矩阵及预设理想掩蔽矩阵,确定样本语音流对应的吸引子;通过计算嵌入特征矩阵中每个矩阵元素与吸引子的相似性,得到样本语音流的目标掩蔽矩阵;根据目标掩蔽矩阵确定样本语音流所对应的增强频谱;基于样本语音流对应的估计频谱与增强频谱之间的均方误差损失对第一神经网络模型进行训练,得到语音分离增强模型。
在一个实施例中,语音分离增强模型预训练模块1108还用于对样本语音流进行傅里叶变换,得到每个音频帧的语音频谱和语音特征;基于第一神经网络模型对语音频谱进行语音分离和增强,得到估计频谱;基于第一神经网络模型将语音特征映射至嵌入空间,得到嵌入特征矩阵。
在一个实施例中，语音分离增强模型预训练模块1108还用于根据语音频谱和语音特征确定理想掩蔽矩阵；基于预设的二元阈值矩阵对理想掩蔽矩阵中噪声元素进行过滤；根据嵌入特征矩阵及过滤了噪声元素的理想掩蔽矩阵，确定样本语音流对应的吸引子。
在一个实施例中,如图12所示,上述语音识别装置1100还包括中间模型构建模块1110,用于获取第二神经网络模型;对第二神经网络模型进行非负约束处理,得到非负神经网络模型;获取用于对非负神经网络模型输出的声学特征进行听觉适配的微分模型;将微分模型与非负神经网络模型级联,得到中间模型。
在一个实施例中,中间模型构建模块1110还用于获取用于对声学特征对应特征向量进行对数运算的对数模型;获取用于对声学特征对应特征向量进行差分运算的差分模型;根据对数模型与差分模型构建微分模型。
在一个实施例中,如图12所示,上述语音识别装置1100还包括语音识别模型预训练模块1112,用于获取样本语音流及对应标注的音素类别;通过第三神经网络模型提取样本语音流中每个音频帧的深度特征;根据所有音素类别的音频帧对应的深度特征,确定样本语音流的中心向量;基于深度特征和中心向量确定每个音频帧的类间混淆衡量指数与类内距离惩罚指数之间的融合损失;基于融合损失对第三神经网络模型进行训练,得到语音识别模型。
在一个实施例中,语音识别模型预训练模块1112还用于将深度特征输入交叉熵函数,计算得到各音频帧的类间混淆衡量指数;将深度特征和中心向量输入中心损失函数,计算得到每个音频帧的类内距离惩罚指数;将类间混淆衡量指数与类内距离惩罚指数进行融合运算,得到融合损失。
在一个实施例中,联合训练模块1106还用于确定目标损失函数产生的全局下降梯度;根据全局下降梯度对语音分离增强模型、鲁棒表征模型及语音识别模型分别对应的模型参数进行迭代更新,直至获得目标损失函数的最小化损失值。
如图13所示,在一个实施例中,提供了语音识别装置1300,包括语音分离增强模块1302、中间表征过渡模块1304和语音识别模块1306,其中,语音分离增强模块1302,用于获取目标语音流;基于语音分离增强模型提取目标语音流中每个音频帧的增强频谱。中间表征过渡模块1304,用于基于鲁棒表征模型对增强频谱进行听觉匹配,得到鲁棒特征。语音识别模块1306,用于基于语音识别模型对鲁棒特征进行识别,得到每个音频帧对应的音素;其中,语音分离增强模型、鲁棒表征模型及语音识别模型联合训练得到。
在一个实施例中,语音分离增强模型包括第一神经网络模型;语音分离增强模块1302还用于基于第一神经网络模型提取目标语音流中每个音频帧的嵌入特征矩阵;根据嵌入特征矩阵及预设理想掩蔽矩阵,确定目标语音流对应的吸引子;通过计算嵌入特征矩阵中每个矩阵元素与吸引子的相似性,得到目标语音流的目标掩蔽矩阵;根据目标掩蔽矩阵确定目标语音流中每个音频帧所对应的增强频谱。
在一个实施例中,鲁棒表征模型包括第二神经网络模型和微分模型;语音识别模块1306还用于基于第二神经网络模型在增强频谱中提取声学特征;对声学特征进行非负约束处理,得到非负的声学特征;通过微分模型对非负的声学特征进行微分运算,得到与人耳听觉习惯相匹配的鲁棒特征。
图14示出了一个实施例中计算机设备的内部结构图。该计算机设备具体可以是图1中的终端110或服务器120。如图14所示，该计算机设备包括通过系统总线连接的处理器、存储器和网络接口。其中，存储器包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统，还可存储有计算机可读指令，该计算机可读指令被处理器执行时，可使得处理器实现语音识别方法。该内存储器中也可储存有计算机可读指令，该计算机可读指令被处理器执行时，可使得处理器执行语音识别方法。本领域技术人员可以理解，图14中示出的结构，是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。
在一个实施例中,本申请提供的语音识别装置可以实现为一种计算机可读指令的形式,计算机可读指令可在如图14所示的计算机设备上运行。计算机设备的存储器中可存储组成该语音识别装置的各个程序模块,比如,图13所示的语音分离增强模块、中间表征过渡模块和语音识别模块。各个程序模块构成的计算机可读指令使得处理器执行本说明书中描述的本申请各个实施例的语音识别方法中的步骤。
在一个实施例中,提供了一种计算机设备,包括存储器和处理器,存储器存储有计算机可读指令,计算机可读指令被处理器执行时,使得处理器执行上述语音识别方法的步骤。此处语音识别方法的步骤可以是上述各个实施例的语音识别方法中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机可读指令,计算机可读指令被处理器执行时,使得处理器执行上述语音识别方法的步骤。此处语音识别方法的步骤可以是上述各个实施例的语音识别方法中的步骤。
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述各方法实施例中的步骤。
本领域普通技术人员可以理解，实现上述实施例方法中的全部或部分流程，可以通过计算机可读指令指令相关的硬件来完成，该程序可存储于一非易失性计算机可读取存储介质中，该程序在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用，均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限，RAM以多种形式可得，诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。以上实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (16)

  1. 一种语音识别方法,由计算机设备执行,所述方法包括:
    获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数;
    基于所述第二损失函数进行反向传播,以对桥接在所述语音分离增强模型和语音识别模型之间的中间模型进行训练,得到鲁棒表征模型;
    对所述第一损失函数和第二损失函数进行融合,得到目标损失函数;及
    基于所述目标损失函数对所述语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练,在满足预设收敛条件时结束训练。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    基于第一神经网络模型提取样本语音流的估计频谱和嵌入特征矩阵;
    根据嵌入特征矩阵及预设理想掩蔽矩阵,确定样本语音流对应的吸引子;
    通过计算所述嵌入特征矩阵中每个矩阵元素与所述吸引子的相似性,得到所述样本语音流的目标掩蔽矩阵;
    根据所述目标掩蔽矩阵确定样本语音流所对应的增强频谱;及
    基于所述样本语音流对应的估计频谱与所述增强频谱之间的均方误差损失对所述第一神经网络模型进行训练,得到语音分离增强模型。
  3. 根据权利要求2所述的方法,其特征在于,所述基于第一神经网络模型提取样本语音流的估计频谱和嵌入特征矩阵包括:
    对样本语音流进行傅里叶变换,得到每个音频帧的语音频谱和语音特征;
    基于第一神经网络模型对语音频谱进行语音分离和增强,得到估计频谱;及
    基于第一神经网络模型将语音特征映射至嵌入空间,得到嵌入特征矩阵。
  4. 根据权利要求3所述的方法,其特征在于,所述根据嵌入特征矩阵及预设理想掩蔽矩阵,确定样本语音流的吸引子包括:
    根据所述语音频谱和语音特征确定理想掩蔽矩阵;
    基于预设的二元阈值矩阵对所述理想掩蔽矩阵中噪声元素进行过滤;及
    根据嵌入特征矩阵及过滤了噪声元素的理想掩蔽矩阵,确定样本语音流对应的吸引子。
  5. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取第二神经网络模型;
    对所述第二神经网络模型进行非负约束处理,得到非负神经网络模型;
    获取用于对非负神经网络模型输出的声学特征进行听觉适配的微分模型;及
    将所述微分模型与所述非负神经网络模型级联,得到中间模型。
  6. 根据权利要求5所述的方法,其特征在于,所述获取用于对非负神经网络模型输出的声学特征进行听觉适配的微分模型包括:
    获取用于对声学特征对应特征向量进行对数运算的对数模型;
    获取用于对声学特征对应特征向量进行差分运算的差分模型;及
    根据所述对数模型与所述差分模型构建微分模型。
  7. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取样本语音流及对应标注的音素类别;
    通过第三神经网络模型提取样本语音流中每个音频帧的深度特征;
    根据所有音素类别的音频帧对应的深度特征,确定样本语音流的中心向量;
    基于所述深度特征和所述中心向量确定每个音频帧的类间混淆衡量指数与类内距离惩罚指数之间的融合损失;及
    基于所述融合损失对所述第三神经网络模型进行训练,得到语音识别模型。
  8. 根据权利要求7所述的方法,其特征在于,所述基于深度特征和中心向量确定每个音频帧的类间混淆衡量指数与类内距离惩罚指数的融合损失包括:
    将所述深度特征输入交叉熵函数,计算得到各音频帧的类间混淆衡量指数;
    将所述深度特征和所述中心向量输入中心损失函数,计算得到每个音频帧的类内距离惩罚指数;及
    将类间混淆衡量指数与类内距离惩罚指数进行融合运算,得到融合损失。
  9. 根据权利要求1所述的方法,其特征在于,所述基于所述目标损失函数对所述语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练包括:
    确定所述目标损失函数产生的全局下降梯度;及
    根据所述全局下降梯度对所述语音分离增强模型、鲁棒表征模型及语音识别模型分别对应的模型参数进行迭代更新,直至获得所述目标损失函数的最小化损失值。
  10. 一种语音识别方法,由计算机设备执行,包括:
    获取目标语音流;
    基于语音分离增强模型提取所述目标语音流中每个音频帧的增强频谱;
    基于鲁棒表征模型对所述增强频谱进行听觉匹配,得到鲁棒特征;及
    基于语音识别模型对所述鲁棒特征进行识别,得到每个音频帧对应的音素;
    其中,所述语音分离增强模型、鲁棒表征模型及语音识别模型联合训练得到。
  11. 根据权利要求10所述的方法,其特征在于,所述语音分离增强模型包括第一神经网络模型;所述基于语音分离增强模型提取所述目标语音流中每个音频帧的增强频谱包括:
    基于第一神经网络模型提取目标语音流中每个音频帧的嵌入特征矩阵;
    根据嵌入特征矩阵及预设理想掩蔽矩阵,确定目标语音流对应的吸引子;
    通过计算所述嵌入特征矩阵中每个矩阵元素与所述吸引子的相似性,得到所述目标语音流的目标掩蔽矩阵;及
    根据所述目标掩蔽矩阵确定目标语音流中每个音频帧所对应的增强频谱。
  12. 根据权利要求10所述的方法,其特征在于,所述鲁棒表征模型包括第二神经网络模型和微分模型;所述基于鲁棒表征模型对所述增强频谱进行听觉匹配,得到鲁棒特征包括:
    基于所述第二神经网络模型在所述增强频谱中提取声学特征;
    对所述声学特征进行非负约束处理,得到非负的声学特征;及
    通过所述微分模型对所述非负的声学特征进行微分运算,得到与人耳听觉习惯相匹配的鲁棒特征。
  13. 一种语音识别装置,所述装置包括:
    中间表征学习模块，用于获取语音分离增强模型的第一损失函数及语音识别模型的第二损失函数；基于所述第二损失函数进行反向传播，以对桥接在所述语音分离增强模型和语音识别模型之间的中间模型进行训练，得到鲁棒表征模型；
    损失融合模块,用于对所述第一损失函数和第二损失函数进行融合,得到目标损失函数;及
    联合训练模块,用于基于所述目标损失函数对所述语音分离增强模型、鲁棒表征模型及语音识别模型进行联合训练,在满足预设收敛条件时结束训练。
  14. 一种语音识别装置,所述装置包括:
    语音分离增强模块,用于获取目标语音流;基于语音分离增强模型提取所述目标语音流中每个音频帧的增强频谱;
    中间表征过渡模块,用于基于鲁棒表征模型对所述增强频谱进行听觉匹配,得到鲁棒特征;及
    语音识别模块,用于基于语音识别模型对所述鲁棒特征进行识别,得到每个音频帧对应的音素;其中,所述语音分离增强模型、鲁棒表征模型及语音识别模型联合训练得到。
  15. 一种计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被处理器执行时,使得所述处理器执行如权利要求1至12中任一项所述方法的步骤。
  16. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如权利要求1至12中任一项所述方法的步骤。
PCT/CN2020/128392 2020-01-16 2020-11-12 语音识别方法、装置和计算机可读存储介质 WO2021143327A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022520112A JP7282442B2 (ja) 2020-01-16 2020-11-12 音声認識方法、装置及びコンピュータプログラム
EP20913796.7A EP4006898A4 (en) 2020-01-16 2020-11-12 VOICE RECOGNITION METHOD, DEVICE AND COMPUTER READABLE STORAGE MEDIUM
US17/583,512 US20220148571A1 (en) 2020-01-16 2022-01-25 Speech Recognition Method and Apparatus, and Computer-Readable Storage Medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010048780.2A CN111261146B (zh) 2020-01-16 2020-01-16 语音识别及模型训练方法、装置和计算机可读存储介质
CN202010048780.2 2020-01-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/583,512 Continuation US20220148571A1 (en) 2020-01-16 2022-01-25 Speech Recognition Method and Apparatus, and Computer-Readable Storage Medium

Publications (1)

Publication Number Publication Date
WO2021143327A1 true WO2021143327A1 (zh) 2021-07-22

Family

ID=70950716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128392 WO2021143327A1 (zh) 2020-01-16 2020-11-12 语音识别方法、装置和计算机可读存储介质

Country Status (5)

Country Link
US (1) US20220148571A1 (zh)
EP (1) EP4006898A4 (zh)
JP (1) JP7282442B2 (zh)
CN (1) CN111261146B (zh)
WO (1) WO2021143327A1 (zh)


Families Citing this family (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111261146B (zh) * 2020-01-16 2022-09-09 腾讯科技(深圳)有限公司 语音识别及模型训练方法、装置和计算机可读存储介质
CN111798866A (zh) * 2020-07-13 2020-10-20 商汤集团有限公司 音频处理网络的训练及立体声重构方法和装置
CN111896808B (zh) * 2020-07-31 2023-02-03 中国电子科技集团公司第四十一研究所 将频谱轨迹处理和自适应门限生成进行一体化设计的方法
CN111933172A (zh) * 2020-08-10 2020-11-13 广州九四智能科技有限公司 人声分离提取方法方法、装置、计算机设备及存储介质
CN112102816A (zh) * 2020-08-17 2020-12-18 北京百度网讯科技有限公司 语音识别方法、装置、系统、电子设备和存储介质
CN111816171B (zh) * 2020-08-31 2020-12-11 北京世纪好未来教育科技有限公司 语音识别模型的训练方法、语音识别方法及装置
CN112185374A (zh) * 2020-09-07 2021-01-05 北京如影智能科技有限公司 一种确定语音意图的方法及装置
CN112309398A (zh) * 2020-09-30 2021-02-02 音数汇元(上海)智能科技有限公司 工作时长监控方法、装置、电子设备和存储介质
CN112312540A (zh) * 2020-09-30 2021-02-02 音数汇元(上海)智能科技有限公司 服务人员定位方法、装置、电子设备和存储介质
CN112331208A (zh) * 2020-09-30 2021-02-05 音数汇元(上海)智能科技有限公司 人身安全监控方法、装置、电子设备和存储介质
CN112331207A (zh) * 2020-09-30 2021-02-05 音数汇元(上海)智能科技有限公司 服务内容监控方法、装置、电子设备和存储介质
CN112309374A (zh) * 2020-09-30 2021-02-02 音数汇元(上海)智能科技有限公司 服务报告生成方法、装置和计算机设备
CN111933114B (zh) * 2020-10-09 2021-02-02 深圳市友杰智新科技有限公司 语音唤醒混合模型的训练方法、使用方法和相关设备
US11810588B2 (en) * 2021-02-19 2023-11-07 Apple Inc. Audio source separation for audio devices
CN112949711B (zh) * 2021-02-26 2023-10-27 中国科学院软件研究所 面向软件定义卫星的神经网络模型可复用训练方法、装置
CN113178192B (zh) * 2021-04-30 2024-05-24 平安科技(深圳)有限公司 语音识别模型的训练方法、装置、设备及存储介质
US11922963B2 (en) * 2021-05-26 2024-03-05 Microsoft Technology Licensing, Llc Systems and methods for human listening and live captioning
CN113327586B (zh) * 2021-06-01 2023-11-28 深圳市北科瑞声科技股份有限公司 一种语音识别方法、装置、电子设备以及存储介质
CN113256592B (zh) * 2021-06-07 2021-10-08 中国人民解放军总医院 图像特征提取模型的训练方法、系统及装置
CN113327596B (zh) * 2021-06-17 2023-01-24 北京百度网讯科技有限公司 语音识别模型的训练方法、语音识别方法和装置
CN113436643B (zh) * 2021-06-25 2024-05-24 平安科技(深圳)有限公司 语音增强模型的训练及应用方法、装置、设备及存储介质
CN113284508B (zh) * 2021-07-21 2021-11-09 中国科学院自动化研究所 基于层级区分的生成音频检测系统
US20230038982A1 (en) * 2021-08-09 2023-02-09 Google Llc Joint Acoustic Echo Cancelation, Speech Enhancement, and Voice Separation for Automatic Speech Recognition
CN113593594B (zh) * 2021-09-01 2024-03-08 北京达佳互联信息技术有限公司 语音增强模型的训练方法和设备及语音增强方法和设备
CN113724727A (zh) * 2021-09-02 2021-11-30 哈尔滨理工大学 基于波束形成的长短时记忆网络语音分离算法
CN113724713A (zh) * 2021-09-07 2021-11-30 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质
CN113870888A (zh) * 2021-09-24 2021-12-31 武汉大学 一种基于语音信号时域和频域的特征提取方法、装置、回声消除方法及装置
CN113936647B (zh) * 2021-12-17 2022-04-01 中国科学院自动化研究所 语音识别模型的训练方法、语音识别方法和系统
WO2023132018A1 (ja) * 2022-01-05 2023-07-13 日本電信電話株式会社 学習装置、信号処理装置、学習方法及び学習プログラム
CN114512136B (zh) * 2022-03-18 2023-09-26 北京百度网讯科技有限公司 模型训练、音频处理方法、装置、设备、存储介质及程序
CN114663965B (zh) * 2022-05-24 2022-10-21 之江实验室 一种基于双阶段交替学习的人证比对方法和装置
CN114722884B (zh) * 2022-06-08 2022-09-30 深圳市润东来科技有限公司 基于环境音的音频控制方法、装置、设备及存储介质
CN115116446A (zh) * 2022-06-21 2022-09-27 成都理工大学 一种噪声环境下说话人识别模型构建方法
CN115261963A (zh) * 2022-09-27 2022-11-01 南通如东依航电子研发有限公司 一种用于pcb板深镀能力提高的方法
CN115600084A (zh) * 2022-10-18 2023-01-13 浙江大学(Cn) 声非视距信号识别方法及装置、电子设备、存储介质
CN116013256B (zh) * 2022-12-19 2024-01-30 镁佳(北京)科技有限公司 一种语音识别模型构建及语音识别方法、装置及存储介质
JP7489502B1 (ja) 2023-02-09 2024-05-23 エヌ・ティ・ティ・コミュニケーションズ株式会社 予測装置、予測方法、および予測プログラム
CN116051859B (zh) * 2023-02-21 2023-09-08 阿里巴巴(中国)有限公司 服务提供方法、设备和存储介质
CN117235665A (zh) * 2023-09-18 2023-12-15 北京大学 自适应隐私数据合成方法、装置、计算机设备和存储介质
CN117708601B (zh) * 2024-02-06 2024-04-26 智慧眼科技股份有限公司 一种相似度计算模型训练方法、装置、设备及存储介质
CN117727298B (zh) * 2024-02-09 2024-04-19 广州紫麦科技有限公司 基于深度学习的手提电脑语音识别方法及系统
CN117746871A (zh) * 2024-02-21 2024-03-22 南方科技大学 一种基于云端检测鸟类鸣声的方法及系统


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10657437B2 (en) 2016-08-18 2020-05-19 International Business Machines Corporation Training of front-end and back-end neural networks
JP2019078857A (ja) 2017-10-24 2019-05-23 国立研究開発法人情報通信研究機構 音響モデルの学習方法及びコンピュータプログラム
US10811000B2 (en) 2018-04-13 2020-10-20 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for recognizing simultaneous speech by multiple speakers
US10726858B2 (en) 2018-06-22 2020-07-28 Intel Corporation Neural network for speech denoising trained with deep feature losses
CN109637526A (zh) * 2019-01-08 2019-04-16 西安电子科技大学 基于个人身份特征的dnn声学模型的自适应方法
CN109859743B (zh) * 2019-01-29 2023-12-08 腾讯科技(深圳)有限公司 音频识别方法、系统和机器设备
CN110120227B (zh) * 2019-04-26 2021-03-19 天津大学 一种深度堆叠残差网络的语音分离方法
CN110570845B (zh) * 2019-08-15 2021-10-22 武汉理工大学 一种基于域不变特征的语音识别方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10147442B1 (en) * 2015-09-29 2018-12-04 Amazon Technologies, Inc. Robust neural network acoustic model with side task prediction of reference signals
CN110070855A (zh) * 2018-01-23 2019-07-30 中国科学院声学研究所 一种基于迁移神经网络声学模型的语音识别系统及方法
CN109378010A (zh) * 2018-10-29 2019-02-22 珠海格力电器股份有限公司 神经网络模型的训练方法、语音去噪方法及装置
CN110600017A (zh) * 2019-09-12 2019-12-20 腾讯科技(深圳)有限公司 语音处理模型的训练方法、语音识别方法、系统及装置
CN110648659A (zh) * 2019-09-24 2020-01-03 上海依图信息技术有限公司 基于多任务模型的语音识别与关键词检测装置和方法
CN111261146A (zh) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 语音识别及模型训练方法、装置和计算机可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4006898A4

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539293A (zh) * 2021-08-10 2021-10-22 南京邮电大学 基于卷积神经网络和联合优化的单通道语音分离方法
CN113539293B (zh) * 2021-08-10 2023-12-26 南京邮电大学 基于卷积神经网络和联合优化的单通道语音分离方法
CN113707134A (zh) * 2021-08-17 2021-11-26 北京搜狗科技发展有限公司 一种模型训练方法、装置和用于模型训练的装置
CN113707134B (zh) * 2021-08-17 2024-05-17 北京搜狗科技发展有限公司 一种模型训练方法、装置和用于模型训练的装置
CN114446316A (zh) * 2022-01-27 2022-05-06 腾讯科技(深圳)有限公司 音频分离方法、音频分离模型的训练方法、装置及设备
CN114446316B (zh) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 音频分离方法、音频分离模型的训练方法、装置及设备

Also Published As

Publication number Publication date
EP4006898A4 (en) 2022-11-09
US20220148571A1 (en) 2022-05-12
CN111261146B (zh) 2022-09-09
EP4006898A1 (en) 2022-06-01
JP7282442B2 (ja) 2023-05-29
JP2022551068A (ja) 2022-12-07
CN111261146A (zh) 2020-06-09

Similar Documents

Publication Publication Date Title
WO2021143327A1 (zh) 语音识别方法、装置和计算机可读存储介质
WO2021143326A1 (zh) 语音识别方法、装置、设备和存储介质
JP7337953B2 (ja) 音声認識方法及び装置、ニューラルネットワークの訓練方法及び装置、並びにコンピュータープログラム
US11908455B2 (en) Speech separation model training method and apparatus, storage medium and computer device
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
Xu et al. Convolutional gated recurrent neural network incorporating spatial features for audio tagging
CN111583954B (zh) 一种说话人无关单通道语音分离方法
Feng et al. Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition
CN105139864B (zh) 语音识别方法和装置
CN111899757B (zh) 针对目标说话人提取的单通道语音分离方法及系统
KR102026226B1 (ko) 딥러닝 기반 Variational Inference 모델을 이용한 신호 단위 특징 추출 방법 및 시스템
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
Samui et al. Tensor-train long short-term memory for monaural speech enhancement
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
Jati et al. An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks.
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
Nayem et al. Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement.
Li et al. Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network
CN112951270B (zh) 语音流利度检测的方法、装置和电子设备
Daneshvar et al. Persian phoneme recognition using long short-term memory neural network
Zhang et al. Audio-visual speech separation with visual features enhanced by adversarial training
Fang et al. Uncertainty-Driven Hybrid Fusion for Audio-Visual Phoneme Recognition
Yechuri et al. A U-net with Gated Recurrent Unit and Efficient Channel Attention Mechanism for Real-time Speech Enhancement
US20240079022A1 (en) General speech enhancement method and apparatus using multi-source auxiliary information
Kothapally Deep Learning Strategies for Monaural Speech Enhancement in Reverberant Environments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20913796

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020913796

Country of ref document: EP

Effective date: 20220224

ENP Entry into the national phase

Ref document number: 2022520112

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE