CN113823311B - Voice recognition method and device based on audio enhancement - Google Patents
- Publication number
- CN113823311B (publication); application CN202110955519A
- Authority
- CN
- China
- Prior art keywords
- data
- signal
- voice recognition
- channel audio
- sound source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a voice recognition method and device based on audio enhancement. The method comprises: computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function; computing second data from the first data using a second filter function; processing the second data with a beamforming algorithm to obtain a single-channel audio signal; processing the single-channel audio signal with a noise reduction algorithm based on the first environmental noise to obtain third data; and recognizing the third data with a voice recognition model. In this way, the multichannel voice data picked up by the microphone array is first cleared of the multipath-reflection mixed voice data, with its various delays, produced when the sound-source sound is reflected and absorbed by different obstacles; non-target sound-source data is then removed from the second data; and finally the environmental noise is removed. This enhances the sound-source audio data and improves the accuracy of voice recognition.
Description
Technical Field
The invention relates to the technical field of voice recognition, and in particular to a voice recognition method and device based on audio enhancement.
Background
Compared with a single microphone, a microphone array offers higher gain, flexible beam steering, strong interference resistance, and other advantages. For example, when picking up a distant voice signal, the high gain of the array makes it easier to acquire the weak voice signal in a far-field environment. The microphone array also has spatial filtering characteristics and can flexibly suppress interference from different directions, so it is widely applied in fields such as blind source separation and sound source localization.
Audio data contains various noises that degrade, to different degrees, the quality of voice communication and human-computer interaction. Because application scenarios are complex and noise is diverse, existing algorithms still fall short of the ideal in certain specific scenes, so developing a robust microphone-array speech enhancement algorithm is particularly important.
Disclosure of Invention
In view of the foregoing problems with the prior art, the present invention provides a voice recognition method based on audio enhancement, comprising:
computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model.
Preferably, the first environmental-noise reduction algorithm comprises:
inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal;
obtaining the clean speech data in the single-channel audio signal based on the single-channel audio signal and the audio features.
Preferably, the first deep learning network model comprises a plurality of LSTM network models, with the a-th-layer outputs of the 1st to n-th LSTM network models commonly connected to the (a+1)-th-layer input of the n-th LSTM network model.
Preferably, the method for obtaining the second filter function comprises:
performing a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time;
obtaining the coefficient matrix of the linear combination with a weighted least squares algorithm, so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal. The weight of the weighted least squares algorithm is derived from the power spectral density estimate of the second desired signal, where M is the number of microphones in the microphone array and ε is a constant. The estimate of the coefficient matrix of the linear combination is expressed through the inverse of the autocorrelation matrix of the multipath-reflection mixed data in the first data whose sound-source arrival delay is greater than the first preset threshold. (The corresponding formulas are given as images in the original publication.)
Preferably, the power spectral density estimate of the second desired signal is obtained by a power spectral density estimation model based on a second deep learning network. During training, the second deep learning network takes the power spectral density of the first data as input and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output an estimate of the latter.
Preferably, the second deep learning network adopts an LSTM network, and the output data of each cell of the LSTM network is passed through a projection before being fed to the input of the next cell.
The invention also provides a voice recognition device based on audio enhancement, comprising:
a first data generation module, configured to compute first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
a second data generation module, configured to compute the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
a single-channel audio signal generation module, configured to process the second data through a beamforming algorithm to obtain a single-channel audio signal;
a third data generation module, configured to process the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, so as to remove the environmental noise in the single-channel audio signal and obtain third data;
and a voice recognition module, configured to recognize the third data through the voice recognition model.
As a further optimization of the above solution, the second data generation module comprises a second filter function unit. The second filter function performs a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time; the coefficient matrix of the linear combination is obtained with a weighted least squares algorithm so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal.
The invention also provides an electronic device, which comprises:
a memory for storing executable instructions;
and a processor, configured to implement the audio-enhancement-based voice recognition method described above when executing the executable instructions stored in the memory.
The invention also provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the audio-enhancement-based voice recognition method described above.
The voice recognition method and device based on audio enhancement have the following beneficial effects:
1. The multichannel voice data picked up by the microphone array is first cleared of the multipath-reflection mixed voice data, that is, reflections and absorptions of the sound-source sound at different obstacles that reach the microphone with different delays relative to the direct-path voice data. Sound-source data other than the target sound-source data is then removed from the second data, and finally the environmental noise is removed. This enhances the sound-source audio data and improves the accuracy of voice recognition.
2. Because the a-th-layer outputs of the 1st to n-th LSTM network models are commonly connected to the (a+1)-th-layer input of the n-th LSTM network model, the (a+1)-th layer of the n-th LSTM network model learns from the prior knowledge provided by the 1st to (n-1)-th LSTM network models. It thereby obtains a more accurate learning direction, which effectively shortens the neural network training time and improves the accuracy of the first-type audio features of clean speech extracted by the first deep learning network model.
Drawings
FIG. 1 is an overall flow chart of a voice recognition method based on audio enhancement of the present invention;
FIG. 2 is a block diagram of a voice recognition device based on audio enhancement according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
The voice recognition method based on audio enhancement provided by the embodiment comprises the following steps:
computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model.
In this embodiment, the multichannel voice data picked up by the microphone array is first cleared of the multipath-reflection mixed voice data, that is, reflections and absorptions of the sound-source sound at different obstacles that reach the microphone with different delays relative to the direct-path voice data. Sound-source data other than the target sound-source data is then removed from the second data, and finally the environmental noise is removed. This enhances the sound-source audio data and improves the accuracy of voice recognition.
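As a rough illustration of the data flow in this embodiment, the four enhancement stages feeding the recognizer can be sketched as follows. This is only a hedged sketch: every function body is a simplified stand-in (the patent's actual filters are defined by the MMSE and weighted-least-squares criteria described here), and all shapes and constants are illustrative.

```python
import numpy as np

def first_filter(multichannel, alpha=0.9):
    # Stand-in for the first filter function (the patent specifies an
    # MMSE-optimal filter); here: simple per-channel recursive smoothing.
    out = np.copy(multichannel)
    for n in range(1, out.shape[1]):
        out[:, n] = alpha * multichannel[:, n] + (1 - alpha) * out[:, n - 1]
    return out

def second_filter(first_data, delay=3, coeff=0.1):
    # Stand-in for the second filter function: subtract a crude linear
    # prediction of late reflections built from delayed past samples.
    out = np.copy(first_data)
    out[:, delay:] -= coeff * first_data[:, :-delay]
    return out

def beamform(second_data):
    # Delay-and-sum beamforming with zero steering delays:
    # average the channels into one single-channel audio signal.
    return second_data.mean(axis=0)

def denoise(mono, noise_floor=0.01):
    # Stand-in for the deep-learning noise reduction: hard thresholding.
    return np.where(np.abs(mono) > noise_floor, mono, 0.0)

x = np.random.randn(4, 16000)   # 4-microphone array, 1 s at 16 kHz
third_data = denoise(beamform(second_filter(first_filter(x))))
# third_data is the enhanced single-channel signal handed to the recognizer
```

The chaining order mirrors the embodiment: dereverberation runs on multichannel data before beamforming, and single-channel noise reduction runs after it.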
The first environmental-noise reduction algorithm comprises:
step one, inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal;
step two, obtaining the clean voice data in the single-channel audio signal based on the single-channel audio signal and the audio features.
In step one, the first deep learning network model is composed of a plurality of LSTM network models. The input of the first deep learning network model is the single-channel audio signal, and each LSTM network model outputs the first-type audio features of clean voice under a different signal-to-noise ratio condition, based on the first-type audio features of the single-channel audio signal.
In step two, the first-type audio features of clean voice obtained under the different signal-to-noise ratio conditions are fused by taking their mean to obtain a first-type fused audio feature; the spectrum is reconstructed from the fused audio feature, and the reconstructed voice data is then obtained.
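The mean-fusion step above can be sketched briefly. The per-model outputs below are hypothetical stand-ins (the models themselves are not implemented here), and treating the feature as a log power spectrum is an assumption consistent with the feature types mentioned next.

```python
import numpy as np

# Hypothetical outputs of n = 3 LSTM models trained at different SNRs:
# each predicts a log-power spectrum of the clean speech (frames x bins).
model_outputs = [np.full((50, 257), v) for v in (1.0, 2.0, 3.0)]

# Fuse by element-wise mean, as described above.
fused_log_power = np.mean(model_outputs, axis=0)

# A magnitude spectrum can be recovered from the fused log-power feature;
# a full system would combine it with the noisy phase and an inverse STFT
# to obtain the reconstructed voice waveform.
magnitude = np.sqrt(np.exp(fused_log_power))
```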
The audio features of the above type may be, for example, the log power spectra of the clean and noisy speech, or spectral masking features, etc.
The first deep learning network model comprises a plurality of LSTM network models, with the a-th-layer outputs of the 1st to n-th LSTM network models commonly connected to the (a+1)-th-layer input of the n-th LSTM network model.
With this connection structure, the (a+1)-th layer of the n-th LSTM network model learns from the prior knowledge provided by the 1st to (n-1)-th LSTM network models, so a more accurate learning direction can be obtained, the neural network training time is effectively shortened, and the accuracy of the first-type audio features of clean voice extracted by the first deep learning network model is improved.
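A minimal sketch of this connection pattern, with plain dense layers standing in for LSTM layers (layer sizes are illustrative, and the interpretation of "commonly connected" as concatenation of the layer-a outputs is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, in_dim, hidden = 3, 64, 32

def layer(x, w):
    # Plain dense layer standing in for one LSTM layer.
    return np.tanh(x @ w)

# Layer a of each of the n models processes the same frame independently.
x = rng.standard_normal((1, in_dim))
h_a = [layer(x, rng.standard_normal((in_dim, hidden))) for _ in range(n_models)]

# Layer a+1 of model n receives the concatenation of ALL layer-a outputs,
# so it learns from the prior knowledge of models 1..n-1 as well.
w_next = rng.standard_normal((hidden * n_models, hidden))
h_next = layer(np.concatenate(h_a, axis=1), w_next)
```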
During training, clean voice data and noise data are mixed at different signal-to-noise ratios to serve as training data, and the first-type audio features of the training data are input into an LSTM network model to train that single model; only after the training of one LSTM network model is complete does the training of the next LSTM network model begin.
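Mixing clean speech and noise at a target SNR, as in the training setup above, can be sketched as follows (signal lengths and the SNR values are illustrative):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so the clean-to-noise power ratio equals snr_db,
    # then add it to the clean signal to form one training example.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)

# One mixing condition per LSTM model, e.g. -5, 0, 5 and 10 dB.
training_data = {snr: mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)}
```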
The method for acquiring the second filter function comprises:
performing a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time;
obtaining the coefficient matrix of the linear combination with a weighted least squares algorithm, so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal. The weight of the weighted least squares algorithm is derived from the power spectral density estimate of the second desired signal, where M is the number of microphones in the microphone array and ε is a constant. The estimate of the coefficient matrix of the linear combination is expressed through the inverse of the autocorrelation matrix of the multipath-reflection mixed data in the first data whose sound-source arrival delay is greater than the first preset threshold. Here x(n) denotes the first data at the current time, and the stacked first data from all times before the current time supplies the multipath-reflection mixed data whose sound-source arrival delay is greater than the first preset threshold. (The corresponding formulas are given as images in the original publication.)
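The description matches the general shape of linear-prediction dereverberation with iteratively re-estimated weighted-least-squares coefficients. Under that assumption, a single-channel numpy sketch follows; the patent's exact weighting and matrix forms are in the formulas not reproduced here, so every constant below (delay, order, ε, iteration count) is illustrative.

```python
import numpy as np

def dereverb_lp(x, delay=2, order=12, eps=1e-4, iters=3):
    # Estimate the late part of x(n) as a linear combination of past samples
    # x(n-delay) ... x(n-delay-order+1); the weight of each sample in the
    # least squares fit is the reciprocal of the current desired-signal power.
    n = len(x)
    past = np.zeros((n, order))             # columns of delayed past samples
    for k in range(order):
        past[delay + k:, k] = x[: n - delay - k]
    d = np.copy(x)                          # initial desired-signal estimate
    for _ in range(iters):
        w = 1.0 / np.maximum(d ** 2, eps)   # weighted least squares weights
        A = past.T @ (w[:, None] * past)    # weighted autocorrelation matrix
        b = past.T @ (w * x)
        g = np.linalg.solve(A + eps * np.eye(order), b)   # coefficient vector
        d = x - past @ g                    # desired = input minus late part
    return d

rng = np.random.default_rng(2)
dry = rng.standard_normal(4000)
# Synthetic "multipath reflection": a single echo arriving 5 samples late.
rev = dry.copy()
rev[5:] += 0.5 * dry[:-5]
out = dereverb_lp(rev, delay=2, order=12)
```

On this synthetic echo the predicted late part cancels most of the reflection, leaving an output closer to the dry signal than the reverberant input.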
In the process of obtaining the second filter function, the power spectral density estimate of the second desired signal is obtained by a power spectral density estimation model based on a second deep learning network. During training, the second deep learning network takes the power spectral density of the first data as input and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output an estimate of the latter. The second deep learning network adopts an LSTM network, and the output data of each cell of the LSTM network is passed through a projection before being fed to the input of the next cell.
Specifically, based on the input data x_t, each cell discards part of the data f_t at the forget gate, updates the cell state c_t at the input gate based on the previous cell's output h_(t-1) and the current input data x_t, and obtains the output data m_t at the output gate based on the output-gate activation o_t and the updated cell state c_t. The projection is then applied to the output data: the recurrent unit computes r_t = m_t * W1, the non-recurrent unit computes p_t = m_t * W2, and the data fed to the next cell is W3 * r_t + W4 * p_t + b, where W1, W2, W3 and W4 are weight parameters and b is a bias parameter. This input-output relationship between cells effectively reduces model complexity and training time.
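The projection step can be sketched in a few lines. Treating W1..W4 as matrices and b as a bias vector is an assumption, since the text specifies only the products; the dimensions below are illustrative, with the projection reducing the cell output from the hidden size to a smaller size, which is what cuts model complexity.

```python
import numpy as np

rng = np.random.default_rng(3)
hidden, proj = 64, 32

# m_t: the gated cell output described above (random stand-in here).
m_t = rng.standard_normal((1, hidden))

W1 = rng.standard_normal((hidden, proj))   # recurrent branch
W2 = rng.standard_normal((hidden, proj))   # non-recurrent branch
W3 = rng.standard_normal((proj, proj))
W4 = rng.standard_normal((proj, proj))
b = rng.standard_normal(proj)

r_t = m_t @ W1                              # r_t = m_t * W1 (recurrent unit)
p_t = m_t @ W2                              # p_t = m_t * W2 (non-recurrent unit)
next_cell_input = r_t @ W3 + p_t @ W4 + b   # W3 * r_t + W4 * p_t + b
```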
The embodiment also provides a voice recognition device based on audio enhancement, comprising:
a first data generation module, configured to compute first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
a second data generation module, configured to compute the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
a single-channel audio signal generation module, configured to process the second data through a beamforming algorithm to obtain a single-channel audio signal;
a third data generation module, configured to process the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, so as to remove the environmental noise in the single-channel audio signal and obtain third data;
and a voice recognition module, configured to recognize the third data through the voice recognition model.
The second data generation module comprises a second filter function unit. The second filter function performs a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time; the coefficient matrix of the linear combination is obtained with a weighted least squares algorithm so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal.
For specific limitations of the voice recognition device, reference may be made to the limitations of the voice recognition method above, which are not repeated here. The modules in the voice recognition device may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke them and execute the corresponding operations.
The embodiment also provides an electronic device, including:
a memory for storing executable instructions;
and a processor, configured to implement the audio-enhancement-based voice recognition method described above when executing the executable instructions stored in the memory.
The present embodiment also provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the audio-enhancement-based voice recognition method described above.
The electronic device comprises at least one processor, a memory, a user interface, and at least one network interface, coupled together by a bus system. It will be appreciated that the bus system enables communication between these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus. The user interface may include a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, touch screen, or the like. The processor of the electronic device provides computing and control capabilities. The memory may be volatile memory, non-volatile memory, or both; in this embodiment it stores an operating system, a computer program, and a database, and the computer program, when executed by the processor, implements the audio-enhancement-based voice recognition method described above.
The present invention is not limited to the specific embodiments described above; various modifications made by those skilled in the art from the above concepts, without inventive effort, fall within the scope of the present invention.
Claims (9)
1. A voice recognition method based on audio enhancement, comprising:
computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model;
wherein the first environmental-noise reduction algorithm comprises: inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal; and obtaining the clean speech data in the single-channel audio signal based on the single-channel audio signal and the audio features.
2. The audio-enhancement-based voice recognition method of claim 1, wherein the first deep learning network model comprises a plurality of LSTM network models, and the a-th-layer outputs of the 1st to n-th LSTM network models are commonly connected to the (a+1)-th-layer input of the n-th LSTM network model.
3. The audio-enhancement-based voice recognition method of claim 1, wherein the method for obtaining the second filter function comprises:
performing a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time;
the coefficient matrix of the linear combination is obtained by a weighted least squares algorithm, such that the estimate of the second desired signal in the output signal minimizes the time-domain correlation, namely:

$$\hat{d}(n) = y(n) - \hat{G}^{H}(n)\,\tilde{y}(n-\Delta)$$

wherein $\hat{d}(n)$ is the estimate of the second desired signal, $y(n)$ is the first data at the current time $n$, $\tilde{y}(n-\Delta)$ stacks the first data whose sound source arrival delay exceeds the first preset threshold $\Delta$, and $\hat{G}(n)$ is the coefficient matrix of the linear combination;

the weight estimate $w(n)$ of the weighted least squares algorithm is:

$$w(n) = \frac{1}{\frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}_{d,m}^{2}(n) + \varepsilon}$$

wherein $\hat{\sigma}_{d,m}^{2}(n)$ is the power spectral density estimate of the second desired signal at the m-th microphone, $M$ is the number of microphones in the microphone array, and $\varepsilon$ is a constant;

the estimate $\hat{G}(n)$ of the coefficient matrix of the linear combination is:

$$\hat{G}(n) = \hat{G}(n-1) + K(n)\,\hat{d}^{H}(n)$$

wherein

$$K(n) = \frac{R^{-1}(n-1)\,\tilde{y}(n-\Delta)}{\gamma / w(n) + \tilde{y}^{H}(n-\Delta)\,R^{-1}(n-1)\,\tilde{y}(n-\Delta)}, \qquad R^{-1}(n) = \frac{1}{\gamma}\left[R^{-1}(n-1) - K(n)\,\tilde{y}^{H}(n-\Delta)\,R^{-1}(n-1)\right],$$

$0<\gamma\le 1$ is a forgetting factor, and $R^{-1}(n)$ is the inverse of the autocorrelation matrix of the multipath reflection mixed data whose sound source arrival delay in the first data is greater than the first preset threshold.
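The recursive weighted-least-squares dereverberation described in claim 3 can be sketched for a single frequency bin of a single channel as below, in the spirit of weighted-prediction-error (WPE) processing; the function name, default parameters, and per-frame scalar PSD estimate are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def recursive_wpe_bin(y, delay=2, taps=4, gamma=0.99, eps=1e-6):
    """Recursive weighted-least-squares dereverberation sketch for one
    STFT frequency bin of one channel.

    y : complex STFT frames of the first data, shape (T,)
    Returns per-frame estimates of the second desired signal."""
    g = np.zeros(taps, dtype=complex)             # linear-combination coefficients
    R_inv = np.eye(taps, dtype=complex) / eps     # inverse autocorrelation matrix
    d = np.empty_like(y)
    for n in range(len(y)):
        # past frames whose arrival delay exceeds the preset threshold `delay`
        idx = n - delay - np.arange(taps)
        x = np.where(idx >= 0, y[np.clip(idx, 0, None)], 0)
        d[n] = y[n] - np.vdot(g, x)               # desired-signal estimate
        w = 1.0 / (np.abs(d[n]) ** 2 + eps)       # WLS weight from the PSD estimate
        k = R_inv @ x / (gamma / w + np.conj(x) @ R_inv @ x)  # gain vector
        g = g + k * np.conj(d[n])                 # coefficient update
        R_inv = (R_inv - np.outer(k, np.conj(x) @ R_inv)) / gamma
    return d
```

A full system would run this per frequency bin across all microphone channels and replace the instantaneous `|d|²` weight with the network-based power spectral density estimate of claim 4.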
4. The audio enhancement based speech recognition method according to claim 3, wherein the power spectral density estimate of the second desired signal is obtained from a power spectral density estimation model based on a second deep learning network, and wherein, during training, the second deep learning network takes the power spectral density of the first data as input and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output the estimate of the power spectral density of the second desired signal.
5. The voice recognition method according to claim 4, wherein the second deep learning network uses an LSTM network, and the output data of each cell of the LSTM network is fed to the input of the next cell through projection processing.
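The "projection processing" of claim 5 corresponds to an LSTM cell whose hidden output is multiplied by a projection matrix before being fed onward (an LSTMP-style cell). A minimal single-step sketch follows; all weight shapes and initializations are illustrative assumptions, not the patent's parameters:

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def lstmp_step(x, h, c, W, Wp):
    """One step of an LSTM cell with projection: the cell's hidden output
    is passed through projection matrix Wp before being fed to the next cell."""
    z = W @ np.concatenate([x, h])        # all four gate pre-activations at once
    H = len(c)
    i = sigmoid(z[:H])                    # input gate
    f = sigmoid(z[H:2 * H])               # forget gate
    o = sigmoid(z[2 * H:3 * H])           # output gate
    g = np.tanh(z[3 * H:])                # candidate cell update
    c = f * c + i * g                     # new cell state
    h_proj = Wp @ (o * np.tanh(c))        # projection processing of the output
    return h_proj, c

# Shapes: input dim 3, cell size 4, projected output dim 2.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 5)) * 0.1    # (4 * cell, input + projected)
Wp = rng.standard_normal((2, 4)) * 0.1    # (projected, cell)
h, c = lstmp_step(np.ones(3), np.zeros(2), np.zeros(4), W, Wp)
```

The projection reduces the recurrent state dimension, which keeps the per-step cost low when the cell size is large.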
6. A voice recognition apparatus based on audio enhancement, comprising:
the first data generation module is used for calculating the multichannel sound source sound data picked up by the microphone array through a first filter function to obtain first data, wherein the first filter function has a filter parameter capable of meeting the minimum mean square error between an output signal and a first expected signal;
the second data generation module is used for calculating the first data through a second filter function to eliminate multipath reflection mixed data with arrival delay of sound source sound larger than a first preset threshold value in the first data, so as to obtain second data, wherein the second filter function has a filter parameter capable of minimizing time domain correlation of a second expected signal of an output signal;
the single-channel audio signal generation module is used for processing the second data through a beam forming algorithm to obtain a single-channel audio signal;
the third data generation module is used for processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise so as to remove the environmental noise in the single-channel audio signal and obtain third data;
the voice recognition module is used for recognizing the third data through the voice recognition model;
the noise reduction algorithm based on the first environmental noise comprises: inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal; and obtaining clean speech data in the single-channel audio signal based on the single-channel audio signal and the audio features.
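The module structure of the claim-6 device maps naturally onto a small processing pipeline; in this sketch each module is supplied as a callable, and all names are placeholders rather than the patented filter implementations:

```python
class SpeechRecognitionDevice:
    """Structural sketch of the claimed device: four enhancement modules
    feed a voice recognition module (callables here are placeholders)."""

    def __init__(self, first_filter, second_filter, beamformer, denoiser, recognizer):
        self.stages = [
            first_filter,   # first data generation module (first filter function)
            second_filter,  # second data generation module (dereverberation)
            beamformer,     # single-channel audio signal generation module
            denoiser,       # third data generation module (noise reduction)
        ]
        self.recognizer = recognizer  # voice recognition module

    def recognize(self, multichannel_audio):
        data = multichannel_audio
        for stage in self.stages:
            data = stage(data)
        return self.recognizer(data)

# Usage with identity placeholders and a dummy recognizer:
ident = lambda d: d
device = SpeechRecognitionDevice(ident, ident, ident, ident, lambda d: "transcript")
result = device.recognize([0.1, 0.2, 0.3])
```

Keeping each module behind a uniform callable interface mirrors the claim's module decomposition and lets any stage be swapped independently.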
7. The voice recognition device according to claim 6, wherein the second data generation module comprises a second filter function unit which performs a linear combination of the multipath reflection mixed data whose sound source arrival delay is greater than a first preset threshold in the first data before the current time, and obtains the coefficient matrix of the linear combination by a weighted least squares algorithm, such that the estimate of the second desired signal in the output signal minimizes the time-domain correlation.
8. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing an audio enhancement based speech recognition method according to any one of claims 1 to 5 when executing executable instructions stored in said memory.
9. A computer readable storage medium storing executable instructions which when executed by a processor implement a voice recognition method based on audio enhancement as claimed in any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110955519.5A CN113823311B (en) | 2021-08-19 | 2021-08-19 | Voice recognition method and device based on audio enhancement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110955519.5A CN113823311B (en) | 2021-08-19 | 2021-08-19 | Voice recognition method and device based on audio enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113823311A CN113823311A (en) | 2021-12-21 |
CN113823311B true CN113823311B (en) | 2023-11-21 |
Family
ID=78922801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110955519.5A Active CN113823311B (en) | 2021-08-19 | 2021-08-19 | Voice recognition method and device based on audio enhancement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113823311B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106024001A (en) * | 2016-05-03 | 2016-10-12 | 电子科技大学 | Method used for improving speech enhancement performance of microphone array |
CN106531179A (en) * | 2015-09-10 | 2017-03-22 | 中国科学院声学研究所 | Multi-channel speech enhancement method based on semantic prior selective attention |
CN108053834A (en) * | 2017-12-05 | 2018-05-18 | 北京声智科技有限公司 | audio data processing method, device, terminal and system |
CN108109617A (en) * | 2018-01-08 | 2018-06-01 | 深圳市声菲特科技技术有限公司 | A kind of remote pickup method |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017063693A1 (en) * | 2015-10-14 | 2017-04-20 | Huawei Technologies Co., Ltd. | Adaptive reverberation cancellation system |
JP6480644B1 (en) * | 2016-03-23 | 2019-03-13 | Google LLC | Adaptive audio enhancement for multi-channel speech recognition
- 2021-08-19: application CN202110955519.5A filed in China (CN); patent CN113823311B, status active
Also Published As
Publication number | Publication date |
---|---|
CN113823311A (en) | 2021-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110600018B (en) | Voice recognition method and device and neural network training method and device | |
CN110444214B (en) | Speech signal processing model training method and device, electronic equipment and storage medium | |
US10123113B2 (en) | Selective audio source enhancement | |
CN107393550B (en) | Voice processing method and device | |
CN111415676B (en) | Blind source separation method and system based on separation matrix initialization frequency point selection | |
US10679617B2 (en) | Voice enhancement in audio signals through modified generalized eigenvalue beamformer | |
KR20180127171A (en) | Apparatus and method for student-teacher transfer learning network using knowledge bridge | |
CN108417224B (en) | Training and recognition method and system of bidirectional neural network model | |
CN104157293B (en) | The signal processing method of targeted voice signal pickup in a kind of enhancing acoustic environment | |
Schmid et al. | Variational Bayesian inference for multichannel dereverberation and noise reduction | |
CN106558315B (en) | Heterogeneous microphone automatic gain calibration method and system | |
CN110189761B (en) | Single-channel speech dereverberation method based on greedy depth dictionary learning | |
CN110223708B (en) | Speech enhancement method based on speech processing and related equipment | |
CN111445919A (en) | Speech enhancement method, system, electronic device, and medium incorporating AI model | |
CN110767223A (en) | Voice keyword real-time detection method of single sound track robustness | |
CN112489668B (en) | Dereverberation method, device, electronic equipment and storage medium | |
CN110660406A (en) | Real-time voice noise reduction method of double-microphone mobile phone in close-range conversation scene | |
CN112712818A (en) | Voice enhancement method, device and equipment | |
Cui et al. | Multi-objective based multi-channel speech enhancement with BiLSTM network | |
CN115424627A (en) | Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm | |
CN107360497B (en) | Calculation method and device for estimating reverberation component | |
Cho et al. | Convolutional maximum-likelihood distortionless response beamforming with steering vector estimation for robust speech recognition | |
CN114242104A (en) | Method, device and equipment for voice noise reduction and storage medium | |
CN113823311B (en) | Voice recognition method and device based on audio enhancement | |
CN107346658B (en) | Reverberation suppression method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
TA01 | Transfer of patent application right ||
Effective date of registration: 2023-10-26
Address after: No. 133 Shiling Road, Dongyong Town, Nansha District, Guangzhou City, Guangdong Province, 510000
Applicant after: Guangzhou Shengwei Electronics Co.,Ltd.
Address before: Room 1104, Building 1, Binhu Century City Guanhu Garden, intersection of Luzhou Avenue and Ziyun Road, Binhu District, Baohe District, Hefei City, Anhui Province, 230041
Applicant before: Anhui chuangbian Information Technology Co.,Ltd.
GR01 | Patent grant | ||
GR01 | Patent grant |