CN115862632A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Publication number
CN115862632A
Authority
CN
China
Prior art keywords
signal
target
microphone
noise
microphone signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211689410.2A
Other languages
Chinese (zh)
Inventor
柴丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202211689410.2A
Publication of CN115862632A
Legal status: Pending

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium. The method comprises: acquiring a microphone signal; determining a target beam of the microphone signal based on the correlation between the beams in the microphone signal, and denoising the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal; and performing voice recognition on the enhanced target signal. The method, device, electronic equipment and storage medium provided by the invention determine the target beam of the microphone signal based on the correlation between the beams in the microphone signal, and perform voice recognition using the noise-reduced target beam.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
At present, the speech recognition technology has unsatisfactory recognition accuracy in low signal-to-noise ratio scenes, such as far-field speech recognition scenes.
To improve the processing quality of the speech signal, and thereby the speech recognition rate in low signal-to-noise ratio scenes, a mainstream far-field speech recognition system is usually formed by connecting a front-end speech enhancement module in series with a back-end acoustic model for speech recognition. The front-end speech enhancement module may implement beamforming by using a deep neural network to realize MVDR (Minimum Variance Distortionless Response) or GSC (Generalized Sidelobe Canceller).
However, the network structures currently designed for deep-neural-network beamforming are simple and cannot make full use of the information contained in the directly acquired microphone signals, so the resulting improvement in voice recognition is limited.
Disclosure of Invention
The invention provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, which are used for solving the defect of poor voice recognition effect in the low signal-to-noise ratio scene in the prior art.
The invention provides a voice recognition method, which comprises the following steps:
acquiring a microphone signal;
determining a target beam of the microphone signal based on the correlation between the beams in the microphone signal, and denoising the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal;
and performing voice recognition on the enhanced target signal.
According to a voice recognition method provided by the present invention, the determining a target beam of the microphone signal based on a correlation between beams in the microphone signal includes:
determining the importance of each beam based on the correlation degree between the beams;
and carrying out weighted summation on each beam based on the importance of each beam to obtain the target beam.
According to a speech recognition method provided by the present invention, the step of determining the noise signal comprises:
determining a direction of arrival based on the beam with the highest importance;
and performing noise estimation on the microphone signal based on the target beam and the direction of arrival to obtain the noise signal of the microphone signal.
According to a speech recognition method provided by the present invention, the performing speech recognition on the enhanced target signal includes:
and performing voice recognition on the enhanced target signal based on the noise signal.
According to a voice recognition method provided by the present invention, the determining of each beam in the microphone signal comprises:
carrying out time-frequency transformation on the microphone signals to obtain multi-channel frequency domain signals;
and generating a plurality of fixed beams in different directions as each beam based on the multi-channel frequency domain signals.
According to a voice recognition method provided by the present invention, the determining a target beam of the microphone signal based on a correlation between beams in the microphone signal, and denoising the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal includes:
determining, by a voice enhancement module, the target beam of the microphone signal using the correlation between the beams in the microphone signal, and denoising the target beam using the noise signal in the microphone signal to obtain the enhanced target signal;
the performing voice recognition on the enhanced target signal comprises:
performing voice recognition on the enhanced target signal based on a voice recognition module;
the voice enhancement module and the voice recognition module form an integrated model, and the integrated model is obtained by training on a sample microphone signal and the recognition text of the sample microphone signal.
According to the speech recognition method provided by the invention, the integrated model is obtained by training based on the sample microphone signal, the recognition text of the sample microphone signal and the target beam and/or noise signal of the sample microphone signal.
The present invention also provides a voice recognition apparatus comprising:
an acquisition unit for acquiring a microphone signal;
a beam forming unit for determining a target beam of the microphone signal based on the correlation between the beams in the microphone signal, and denoising the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal;
and the recognition unit is used for carrying out voice recognition on the enhanced target signal.
The invention also provides an electronic device comprising a microphone, a memory, a processor and a computer program stored on the memory and executable on the processor;
the microphone is used for collecting microphone signals;
when the processor executes the program, a target beam of the microphone signal is determined based on the correlation between the beams in the microphone signal, and the target beam is denoised based on a noise signal in the microphone signal to obtain an enhanced target signal; and voice recognition is performed on the enhanced target signal.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a speech recognition method as described in any one of the above.
According to the voice recognition method, the voice recognition device, the electronic equipment and the storage medium provided by the invention, the target beam of the microphone signal is determined based on the correlation between the beams in the microphone signal, and the noise-reduced target beam is used for voice recognition. Because the commonality and difference among the beams are fully taken into account during beamforming, a more accurate and reliable target beam is obtained, which in turn safeguards the accuracy and reliability of the voice recognition performed on the target beam.
Drawings
In order to more clearly illustrate the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a speech recognition method according to the present invention;
FIG. 2 is a schematic diagram of a generalized sidelobe canceller according to the present invention;
FIG. 3 is a second schematic flow chart of a speech recognition method according to the present invention;
FIG. 4 is a schematic diagram of a voice recognition apparatus according to the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Far-field speech recognition is a typical low signal-to-noise ratio scene. When the target sound source is far from the sound pickup, the target signal attenuates during propagation, and a noisy environment adds multiple interfering sound sources, so the speech signal recorded by the pickup has a low signal-to-noise ratio and the recognition result is poor. Far-field speech recognition generally refers to speech at a distance of 3 to 5 meters from the microphone; common scenes include conference rooms, in-vehicle use and smart homes.
In far-field speech recognition, a microphone array is generally used as the sound pickup, and microphone array techniques are applied to extract the target speech signal and improve recognition accuracy. A microphone array consists of a group of microphones arranged in a given geometry (commonly linear or circular). It performs space-time processing on sound signals collected from different spatial directions to realize noise suppression, dereverberation, suppression of interfering speech, sound source localization, sound source tracking and array gain, which improves the processing quality of the speech signal and thereby the speech recognition rate in real environments.
A mainstream far-field speech recognition system usually consists of a front-end speech enhancement module connected in series with a back-end acoustic model for speech recognition. The front-end speech enhancement module generally includes Direction of Arrival (DOA) estimation and Beamforming (BF). DOA estimation determines the direction of the target sound source, and BF uses that direction to enhance the target signal and suppress interfering signals. In practice, the beamformer may be realized by implementing MVDR or GSC with a deep neural network.
However, current methods that realize an MVDR beamformer with a deep neural network still retain the prior assumptions of conventional MVDR based on digital signal processing: the neural network merely learns the inverse of the spatial covariance matrix and the target-signal steering vector required by the MVDR formula. The network structures designed for GSC beamforming with a deep neural network are also simple: they use only fully connected layers and an LSTM network and do not exploit the discriminative information among the channel signals, so the accuracy of voice recognition cannot be effectively improved.
In view of the above problems, FIG. 1 is a schematic flow chart of a speech recognition method provided by the present invention. As shown in FIG. 1, the method includes:
Step 110, a microphone signal is acquired.
Here, the microphone signal is a multi-channel signal acquired by a microphone array, where the microphone array may acquire a signal in a far-field environment or a near-field environment, and the embodiment of the present invention is not limited to this.
Step 120, determining a target beam of the microphone signal based on the correlation between the beams in the microphone signal, and performing noise reduction on the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal.
Specifically, after the microphone signal is obtained, it may be converted into a multi-channel frequency domain signal through a time-frequency transform, such as a Fourier transform or short-time Fourier transform. A plurality of fixed beams in different directions are then generated from the amplitude and phase differences between the channel signals, which yields the beams in the microphone signal.
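The time-frequency transform mentioned above can be illustrated with a minimal short-time Fourier transform sketch. The frame length, hop size, Hann window and function names below are illustrative assumptions and are not specified by the embodiment; a naive DFT is used only to keep the example self-contained.

```python
import cmath
import math

def stft(signal, frame_len=8, hop=4):
    """Naive STFT of one channel: Hann-windowed frames, one-sided DFT per frame."""
    # Periodic Hann window (an assumed choice of analysis window).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

def multichannel_stft(channels, frame_len=8, hop=4):
    """Apply the same STFT to every microphone channel."""
    return [stft(ch, frame_len, hop) for ch in channels]
```

Each channel yields a list of frames, and each frame holds one complex value per frequency bin; the amplitude and phase of these values across channels are what a fixed beamformer operates on.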
After each beam in the microphone signal is obtained, beamforming may be implemented based thereon to determine a target beam for the microphone signal. In order to utilize the information contained in the microphone signal as much as possible, and thus improve the reliability of beam forming and subsequent speech recognition, the embodiments of the present invention refer to the correlation between the beams when determining the target beam. It can be understood that the correlation between the beams can reflect the commonality and difference between the beams, and the target beam is determined based on the correlation, so that the information contained in the microphone signal is fully applied, and the reliability and accuracy of the determination of the target beam are guaranteed.
Because the target beam still contains residual noise interference, once the target beam is determined, a noise signal obtained by performing noise estimation on the microphone signal may be applied to denoise it. The target beam after noise reduction is the enhanced target signal.
Step 130, performing voice recognition on the enhanced target signal.
Specifically, the enhanced target signal obtained through step 110 and step 120 comes from a beamforming process that takes the correlation between the beams in the microphone signal into account, so performing speech recognition on it can improve the accuracy and reliability of speech recognition.
In the method provided by the embodiment of the invention, the target beam of the microphone signal is determined based on the correlation between the beams in the microphone signal, and the noise-reduced target beam is used for voice recognition.
Based on the foregoing embodiment, in step 120, the determining a target beam of the microphone signal based on a correlation between beams in the microphone signal includes:
determining the importance of each beam based on the correlation degree between the beams;
and carrying out weighted summation on each beam based on the importance of each beam to obtain the target beam.
Here, the correlation between the beams is calculated pairwise, that is, each correlation value reflects the relationship between two beams. For any one beam, its importance among all beams may be determined from the correlation between that beam and the remaining beams. For example, the correlations between the beam and each of the remaining beams may be accumulated and used directly as the importance of the beam. Alternatively, the accumulated value may be taken as the total correlation of the beam, and the total correlations of all beams may be normalized to obtain the importance of each beam. The correlation may also be computed by an attention mechanism: for example, the fixed beams in different directions may be input into a multi-head attention network as the heads, and, taking each beam in turn as the reference, attention coefficients are computed against the other beams; these coefficients serve as the importance of each beam.
After the importance of each beam is obtained, the importance can be used as a weight to perform weighted summation on the beams, so as to obtain the beam pointing in the direction of the target source, that is, the target beam.
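The accumulate-and-normalize variant described above can be sketched as follows. The correlation measure (normalized absolute inner product of real-valued beams), the sum normalization and all function names are assumptions made for illustration; the embodiment does not fix a particular correlation formula.

```python
import math

def correlation(a, b):
    """Assumed pairwise measure: normalized absolute inner product of two beams."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return abs(dot) / (norm_a * norm_b)

def beam_importance(beams):
    """Accumulate each beam's correlation with the others, then normalize to sum to 1."""
    totals = [
        sum(correlation(bi, bj) for j, bj in enumerate(beams) if j != i)
        for i, bi in enumerate(beams)
    ]
    total = sum(totals)
    if total == 0.0:
        # No beam correlates with any other: fall back to uniform weights.
        return [1.0 / len(beams)] * len(beams)
    return [t / total for t in totals]

def target_beam(beams, weights):
    """Weighted sum of the beams, sample by sample."""
    length = len(beams[0])
    return [sum(w * b[k] for w, b in zip(weights, beams)) for k in range(length)]
```

With this choice, a beam that agrees with many other beams receives a large weight, while a beam that is uncorrelated with the rest contributes little to the target beam.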
According to any of the above embodiments, the determining the noise signal includes:
determining a direction of arrival based on the beam with the highest importance;
and performing noise estimation on the microphone signal based on the target beam and the direction of arrival to obtain the noise signal of the microphone signal.
Specifically, after the importance of each beam is obtained, the beam with the highest importance may be selected, and the direction in which that beam points may be taken as the direction of arrival (DOA).
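A minimal sketch of this selection step is given below. It assumes that each fixed beam has a known look direction in degrees; the function name and the direction list are hypothetical.

```python
def estimate_doa(importance, beam_directions_deg):
    """Return the look direction of the beam with the highest importance."""
    if len(importance) != len(beam_directions_deg):
        raise ValueError("one direction is required per beam")
    best = max(range(len(importance)), key=lambda i: importance[i])
    return beam_directions_deg[best]
```

For example, with importances 0.1, 0.6 and 0.3 for beams steered to 0, 90 and 180 degrees, the estimated direction of arrival is 90 degrees.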
After the target beam and the direction of arrival are obtained, noise estimation may be performed on the microphone signal based on both. For example, the target beam, the direction of arrival and the microphone signal may be fed into a pre-trained noise estimation model; alternatively, the target beam, the direction of arrival and the multi-channel signal obtained by time-frequency transformation of the microphone signal may be fed into the pre-trained noise estimation model. Either way, the model outputs the noise signal contained in the microphone signal.
Based on any of the above embodiments, step 130 includes:
and performing voice recognition on the enhanced target signal based on the noise signal.
Specifically, whereas the related art performs voice recognition directly on the enhanced target signal, the embodiment of the invention also refers to the noise signal obtained by noise estimation on the microphone signal. Feeding both the noise signal and the enhanced target signal into the back-end speech recognition model provides it with richer information and thereby improves the final recognition result.
In any of the above embodiments, the determining of each beam in the microphone signals includes:
carrying out time-frequency transformation on the microphone signals to obtain multi-channel frequency domain signals;
and generating a plurality of fixed beams in different directions as each beam based on the multi-channel frequency domain signals.
Specifically, beamforming techniques mainly comprise fixed beamforming and adaptive beamforming. Fixed beamforming has limited noise suppression capability because the assumed noise field does not match the actual one. Adaptive beamforming can improve noise robustness by incorporating higher-order statistics, but it generally relies on accurate noise mask estimates and target sound source location estimates; if this information cannot be obtained reliably, adaptive beamforming loses more recognition accuracy than fixed beamforming.
Therefore, to preserve noise suppression capability while minimizing the loss of recognition rate, the embodiment of the invention obtains the beams by fixed beamforming. Specifically, the microphone signal may be converted into a multi-channel frequency domain signal by a time-frequency transform, such as a Fourier transform or short-time Fourier transform, and a plurality of fixed beams in different directions are generated from the amplitude and phase differences between the channel signals. Here, a complex convolutional neural network may be used to fit the weight parameters of the Fixed Beamforming Filter (FBF) so as to generate the fixed beams in the different directions.
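For intuition, fixed beams in several directions can be sketched with classical delay-and-sum weights for a single frequency bin. This is a stand-in for the learned filter weights: in the embodiment the weights are fitted by a complex convolutional neural network, whereas the uniform linear array, the speed of sound and the function names below are assumptions of this sketch.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed)

def steering_phase(freq_hz, spacing_m, mic_index, angle_deg):
    """Phase lag (radians) at one microphone of a uniform linear array."""
    delay = spacing_m * mic_index * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay

def fixed_beams(bin_values, freq_hz, spacing_m, directions_deg):
    """Delay-and-sum output for one frequency bin, one value per look direction."""
    num_mics = len(bin_values)
    beams = []
    for angle in directions_deg:
        total = 0j
        for m, x in enumerate(bin_values):
            # Undo the phase lag a plane wave from `angle` would have at mic m.
            total += x * cmath.exp(1j * steering_phase(freq_hz, spacing_m, m, angle))
        beams.append(total / num_mics)
    return beams

def plane_wave(num_mics, freq_hz, spacing_m, angle_deg):
    """Unit-amplitude plane wave observed at each microphone (test signal)."""
    return [cmath.exp(-1j * steering_phase(freq_hz, spacing_m, m, angle_deg))
            for m in range(num_mics)]
```

A plane wave arriving from 60 degrees passes through the 60-degree beam with unit gain and is attenuated in beams steered elsewhere, which is the kind of directional difference the later correlation step relies on.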
Further, because the noise suppression capability of fixed beamforming is relatively limited, after the beams are obtained and the target beam is determined, noise estimation may be performed based on the target beam and the direction of arrival, and the estimated noise signal may be used to denoise the target beam, which safeguards the noise suppression capability.
Although the related art also includes schemes that realize far-field speech recognition with a speech enhancement module and a speech recognition module, the two modules are generally optimized independently. In other words, current schemes do not establish a connection between front-end speech enhancement and back-end speech recognition, so the optimization target of front-end speech enhancement is inconsistent with the recognition target of back-end speech recognition.
For example, conventional microphone array processing filters are not designed or tested with the goal of improving speech recognition accuracy, and changes in front-end filter parameters in far-field scenes easily cause a mismatch with the back-end acoustic model, which reduces the recognition rate. How to fuse front-end algorithm optimization with the acoustic model, with the final recognition rate as the objective, so as to maximize the recognition result is therefore an open problem.
To solve this problem, according to any of the above embodiments, in step 120, the determining a target beam of the microphone signal based on a correlation between beams in the microphone signal, and performing noise reduction on the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal includes:
determining, by a voice enhancement module, the target beam of the microphone signal using the correlation between the beams in the microphone signal, and denoising the target beam using the noise signal in the microphone signal to obtain the enhanced target signal;
in step 130, the performing speech recognition on the enhanced target signal includes:
performing voice recognition on the enhanced target signal based on a voice recognition module;
the voice enhancement module and the voice recognition module form an integrated model, and the integrated model is obtained by training on a sample microphone signal and the recognition text of the sample microphone signal.
Specifically, in the embodiment of the present invention, step 120 and step 130 may be performed by two modules in an integrated model, namely, a speech enhancement module and a speech recognition module. In order to ensure that the optimization target of the speech enhancement module and the recognition target of the speech recognition module can be kept consistent, the integrated model can be integrally trained by taking the sample microphone signal as training data and the recognition text of the sample microphone signal as a training label. Therefore, the uniformity of the optimization target of the voice enhancement module and the recognition target of the voice recognition module in the integrated model is ensured, and the problem of reduced recognition rate caused by the fact that the voice enhancement module and the voice recognition module are not matched in the related scheme is solved.
Here, the training of the integrated model may include the following steps:
firstly, acquiring an initialized initial voice enhancement module and an initial voice recognition module to construct an initial integrated model;
then, the sample microphone signal is input into the initial integrated model; the initial voice enhancement module predicts a target beam of the sample microphone signal based on the correlation between the beams in the sample microphone signal, and denoises the predicted target beam based on a noise signal in the sample microphone signal to obtain a predicted enhanced target signal; the initial voice recognition module then performs voice recognition based on the predicted enhanced target signal and the noise signal to obtain a sample recognition result for the sample microphone signal;
finally, the sample recognition result obtained by the initial integrated model is compared with the recognition text of the sample microphone signal, a loss function is generated from the difference between them, and the parameters of the initial integrated model are iterated based on the loss function. For example, cross entropy may be used as the loss function. Because the parameters of the whole initial integrated model are iterated, the initial speech enhancement module and the initial speech recognition module are tuned jointly, which keeps the objectives of speech enhancement and speech recognition consistent.
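As an illustration of the cross-entropy loss mentioned above, the sketch below computes the mean token-level cross entropy from raw recognizer scores. The per-token formulation, the use of unnormalized logits and the function name are assumptions; the embodiment only states that cross entropy may be used.

```python
import math

def cross_entropy(logits, target_ids):
    """Mean cross entropy between per-token logits and reference token indices."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)
        # Log-sum-exp with the maximum subtracted for numerical stability.
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(target_ids)
```

In joint training, the gradient of this loss would flow back through the recognition module into the enhancement module, which is what aligns the two objectives.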
Based on any of the above embodiments, the integrated model is trained based on the sample microphone signal, the recognition text of the sample microphone signal, and the target beam and/or noise signal of the sample microphone signal.
Specifically, when training the integrated model including the speech enhancement module and the speech recognition module, the labels of the training samples may include not only the recognition text of the sample microphone signal, but also the target beam and/or noise signal of the sample microphone signal.
It can be understood that the target beam and/or the noise signal of the sample microphone signal are also used as the label for training, so that the speech enhancement performance of the speech enhancement module can be further improved and the training efficiency of the integrated model can be accelerated on the premise of ensuring the uniformity of the speech enhancement and speech recognition targets.
To reduce the cost of manual labeling, avoid errors introduced by manual labeling, and reduce the cost of collecting real multi-channel microphone signals, the sample microphone signals may be synthesized by simulation from single-channel near-field data.
Specifically, Room Impulse Responses (RIRs) may be simulated for rooms of different sizes and for different distances (3 to 5 meters) from the target sound source, and RIRs containing steering vectors for different azimuths may be generated with the image source method. The single-channel near-field data is then convolved with the simulated RIRs to simulate the microphone array and generate multi-channel data. Finally, interference noise, comprising both non-speech noise and speech noise, is added to the multi-channel data at a signal-to-noise ratio of, for example, 0 to 10 dB, which yields the final far-field multi-channel data, that is, the sample microphone signal used as training data in the embodiment of the invention.
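The convolution and noise-mixing steps of this simulation can be sketched as follows. The RIRs are taken as given inputs (generating them with the image source method is outside this sketch), and Gaussian noise stands in for the non-speech and speech interference; both choices, as well as the function names, are assumptions.

```python
import math
import random

def convolve(x, h):
    """Full linear convolution of a signal with an impulse response."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def power(x):
    """Mean squared value of a signal."""
    return sum(v * v for v in x) / len(x)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean-to-noise power ratio equals `snr_db`, then mix."""
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must have non-zero power")
    scale = math.sqrt(power(clean) / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [c + scale * n for c, n in zip(clean, noise)]

def simulate_far_field(near_field, rirs, snr_db, rng):
    """One reverberant, noisy channel per RIR (one RIR per microphone)."""
    channels = []
    for rir in rirs:
        reverberant = convolve(near_field, rir)
        noise = [rng.gauss(0.0, 1.0) for _ in reverberant]  # placeholder interference
        channels.append(add_noise_at_snr(reverberant, noise, snr_db))
    return channels
```

A call such as `simulate_far_field(speech, rirs, snr_db=5.0, rng=random.Random(0))` would produce one simulated channel per RIR at an SNR inside the 0 to 10 dB range mentioned above.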
Specifically, in the training process of the integrated model, the target beam and/or noise signal predicted by the initial speech enhancement module may be compared with the target beam and/or noise signal of the sample microphone signal to generate a first loss function for performing parameter iteration on the initial speech enhancement module itself; in addition, the sample recognition result obtained by the initial integrated model is compared with the recognition text of the sample microphone signal to generate a second loss function for parameter iteration of the initial integrated model as a whole.
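One way to picture the two losses is the sketch below, which uses mean squared error for the first (enhancement) loss and combines it with the second (recognition) loss through a weighted sum. The choice of mean squared error, the weighted-sum combination, the weight value and the function names are all assumptions; the embodiment does not specify how the two losses are formed or combined.

```python
def mse(pred, ref):
    """Mean squared error between a predicted and a reference signal."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)

def multitask_loss(recognition_loss, pred_beam, ref_beam,
                   pred_noise=None, ref_noise=None, enh_weight=0.5):
    """Return (total loss, enhancement loss) for one training sample."""
    enhancement_loss = mse(pred_beam, ref_beam)
    if pred_noise is not None and ref_noise is not None:
        # The noise label is optional, matching the "and/or" in the text.
        enhancement_loss += mse(pred_noise, ref_noise)
    total = recognition_loss + enh_weight * enhancement_loss
    return total, enhancement_loss
```

The enhancement loss supervises only the speech enhancement module, while the recognition term continues to drive the whole integrated model.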
According to the method provided by the embodiment of the invention, through multi-task learning including beam forming and voice recognition, the model training efficiency is effectively improved and the model training effect is ensured while the target uniformity of voice enhancement and voice recognition is ensured.
Based on any of the above embodiments, FIG. 2 is a schematic structural diagram of the generalized sidelobe canceller provided in the present invention. As shown in FIG. 2, a conventional generalized sidelobe canceller consists of Fixed Beamforming (FBF), a Blocking Matrix (BM) and an Adaptive Noise Canceller (ANC). The upper branch of the generalized sidelobe canceller consists of a delay-and-sum fixed beamformer, which passes only the target beam. The lower branch consists of the blocking matrix and the adaptive noise canceller; the blocking matrix passes only noise and interference. The output of the fixed beamformer, the output of the blocking matrix and the adaptive noise canceller together form a multi-channel adaptive filtering structure.
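The conventional structure can be sketched in the time domain as follows. The sketch assumes the channels are already time-aligned toward the target, uses channel averaging as the fixed beamformer, adjacent-channel differences as the blocking matrix, and a normalized LMS filter as the adaptive noise canceller; these are common textbook choices, not details taken from FIG. 2.

```python
def gsc(channels, taps=4, mu=0.5, eps=1e-8):
    """Conventional GSC on time-aligned channels; returns the enhanced signal."""
    num_mics = len(channels)
    length = len(channels[0])
    # Upper branch: delay-and-sum fixed beamformer (delays assumed already applied).
    fbf = [sum(ch[t] for ch in channels) / num_mics for t in range(length)]
    # Lower branch: blocking matrix as adjacent differences, which cancel the
    # aligned target and leave noise references.
    refs = [[channels[i][t] - channels[i + 1][t] for t in range(length)]
            for i in range(num_mics - 1)]
    weights = [[0.0] * taps for _ in refs]
    output = []
    for t in range(length):
        bufs = [[ref[t - k] if t - k >= 0 else 0.0 for k in range(taps)]
                for ref in refs]
        estimate = sum(wk * bk for wr, br in zip(weights, bufs)
                       for wk, bk in zip(wr, br))
        error = fbf[t] - estimate  # enhanced sample = FBF output minus noise estimate
        norm = eps + sum(b * b for br in bufs for b in br)
        for wr, br in zip(weights, bufs):
            for k in range(taps):
                wr[k] += mu * error * br[k] / norm  # NLMS weight update
        output.append(error)
    return output
```

When every channel carries the same aligned target, the noise references are zero and the target passes unchanged; when a noise component differs across channels, the adaptive filter learns to subtract it from the fixed beamformer output.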
Based on this, in the embodiment of the invention, the generalized sidelobe canceller GSC is fitted into a voice enhancement module, and voice recognition is realized by combining a voice recognition module. Fig. 3 is a second flowchart of the speech recognition method according to the present invention, and as shown in fig. 3, the generalized sidelobe canceller GSC may include a complex convolutional neural network, an attention network, a Transformer network, and a noise reduction network.
Channels 1 to N represent the multi-channel microphone signals. The microphone signals undergo a short-time Fourier transform, and the transformed signals are input into a complex convolutional neural network that fits the fixed beamforming filter weight parameters. Using the amplitude and phase differences among the multi-channel microphone signals, the complex convolutional neural network generates D fixed beams in different directions, namely beams 1 to N in FIG. 3.
After obtaining the D beams, the D beams may be input into the attention network as multiple heads, the attention network is used to learn the cross-correlation coefficients between different beams, that is, each beam is used as a reference, the attention coefficients are calculated with beams in other directions, and the learned attention coefficients are used to perform weighted summation on the beams in the D different directions, so as to obtain the beam in which the target source direction is located, that is, the target beam.
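The attention-based weighting can be pictured with a plain scaled dot-product sketch. It omits the learned query, key and value projections of a real multi-head attention network, includes each beam's score against itself, and averages the attention rows to obtain one weight per beam; all of these simplifications, and the function names, are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_coefficients(beams):
    """Row i holds the attention of reference beam i over all beams."""
    dim = len(beams[0])
    rows = []
    for ref in beams:
        scores = [sum(a * b for a, b in zip(ref, other)) / math.sqrt(dim)
                  for other in beams]
        rows.append(softmax(scores))
    return rows

def beam_weights(rows):
    """Average the attention each beam receives across all reference beams."""
    count = len(rows)
    return [sum(rows[i][j] for i in range(count)) / count for j in range(count)]
```

The resulting weights sum to one and can be used directly in the weighted summation of the D beams; the index of the largest weight corresponds to the Argmax used for the direction of arrival.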
In addition, since the target beam still contains residual noise and needs further noise reduction, the noise signal is estimated by the Transformer network, and the target beam is then further denoised by the noise reduction network. Here, taking the argmax over the attention coefficients of the beams learned by the attention network gives the beam with the largest attention coefficient, whose direction is the estimated direction of arrival. The direction of arrival, the target beam, and the original microphone signal are input into the Transformer network to learn the noise signal. The noise signal output by the Transformer network and the target beam are then fed into the noise reduction network, which further denoises the target beam to obtain the final enhanced target signal.
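The two simplest parts of this step can be sketched directly. The direction-of-arrival estimate is just an argmax over the per-beam coefficients. For the noise reduction itself, the patent uses a learned network; the example below substitutes classic magnitude spectral subtraction, and takes the noise estimate as a given array rather than modelling the Transformer. The function names and the `floor` parameter are hypothetical.

```python
import numpy as np

def estimate_doa(importance, angles_deg):
    """Direction of arrival = look direction of the most important beam."""
    return angles_deg[int(np.argmax(importance))]

def spectral_subtract(target, noise, floor=0.05):
    """Stand-in denoiser: subtract the noise magnitude, keep the target phase.

    target, noise: complex spectrograms of the same shape.
    """
    mag = np.abs(target)
    # The spectral floor avoids negative magnitudes after subtraction.
    clean = np.maximum(mag - np.abs(noise), floor * mag)
    return clean * np.exp(1j * np.angle(target))
```

A usage example: with coefficients `[0.1, 0.6, 0.3]` for beams steered to 0, 90, and 180 degrees, `estimate_doa` returns 90.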
Finally, the enhanced target signal may be fed into a speech recognition module along with the noise signal to achieve speech recognition.
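One plausible way to pass both signals to a recognizer is to concatenate their features frame by frame. The patent does not specify the feature type or how the two streams are combined, so the log-magnitude features and the function name `recognizer_input` below are assumptions for illustration only.

```python
import numpy as np

def recognizer_input(enhanced, noise, eps=1e-8):
    """Stack log-magnitude features of both streams.

    enhanced, noise: complex spectrograms of shape (frames, bins).
    Returns a real array of shape (frames, 2 * bins).
    """
    feats = [np.log(np.abs(s) + eps) for s in (enhanced, noise)]
    return np.concatenate(feats, axis=-1)
```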
The method provided by the embodiment of the invention makes full use of the discriminative information among the beams, and sends both the enhanced target signal and the noise signal to the back-end voice recognition module. Compared with sending only a single enhanced target signal, this provides richer information to the back-end voice recognition module and thereby improves the final recognition result. In addition, introducing the complex convolutional network, the attention network designed on the multi-head attention idea, and the Transformer network makes the NN (Neural Network)-GSC modeling more powerful and reasonable.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of a speech recognition apparatus provided by the present invention, and as shown in fig. 4, the apparatus includes:
an acquisition unit 410 for acquiring a microphone signal;
a beam forming unit 420, configured to determine a target beam of the microphone signal based on a correlation between beams in the microphone signal, and perform noise reduction on the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal;
a recognition unit 430, configured to perform speech recognition on the enhanced target signal.
The device provided by the embodiment of the invention determines the target wave beams of the microphone signals based on the correlation degree among the wave beams in the microphone signals, and performs voice recognition by applying the noise-reduced target wave beams.
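The three-unit structure of the apparatus can be sketched as a small pipeline object. This is only a structural illustration: the class name, the field names, and the use of plain callables for each unit are hypothetical, and the reference numerals in the comments map each field to the unit described above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Signal = Sequence[Sequence[float]]            # channels x samples

@dataclass
class SpeechRecognitionApparatus:
    acquire: Callable[[], Signal]                     # acquisition unit 410
    enhance: Callable[[Signal], Sequence[float]]      # beam forming unit 420
    recognize: Callable[[Sequence[float]], str]       # recognition unit 430

    def run(self) -> str:
        mic = self.acquire()              # acquire the microphone signal
        enhanced = self.enhance(mic)      # target beam selection + denoising
        return self.recognize(enhanced)   # recognition on the enhanced signal
```

Keeping the units as interchangeable callables mirrors the statement that the units may or may not be physically separate.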
Based on any of the above embodiments, the beamforming unit is specifically configured to:
determining the importance of each beam based on the correlation degree between the beams;
and carrying out weighted summation on each beam based on the importance of each beam to obtain the target beam.
Based on any embodiment above, the beamforming unit is further configured to:
determining a direction of arrival based on the beam with the highest importance;
and performing noise estimation on the microphone signal based on the target wave beam and the direction of arrival to obtain a noise signal of the microphone signal.
Based on any of the embodiments above, the identification unit is specifically configured to:
and performing voice recognition on the enhanced target signal based on the noise signal.
Based on any of the above embodiments, the beamforming unit is further configured to:
carrying out time-frequency transformation on the microphone signals to obtain multi-channel frequency domain signals;
and generating a plurality of fixed beams in different directions as each beam based on the multi-channel frequency domain signals.
Based on any of the above embodiments, the beamforming unit is specifically configured to:
based on a voice enhancement module, determining a target wave beam of a microphone signal by applying the correlation degree between wave beams in the microphone signal, and reducing noise of the target wave beam by applying a noise signal in the microphone signal to obtain an enhanced target signal;
the performing voice recognition on the enhanced target signal comprises:
performing voice recognition on the enhanced target signal based on a voice recognition module;
the voice enhancement module and the voice recognition module form an integrated model, and the integrated model is obtained based on a sample microphone signal and recognition text training of the sample microphone signal.
Based on any of the above embodiments, the integrated model is trained based on the sample microphone signal, the recognition text of the sample microphone signal, and the target beam and/or noise signal of the sample microphone signal.
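One way to read this joint training is as a multi-task objective: a recognition loss computed against the recognition text, plus optional supervision terms on the target beam and on the noise signal. The patent does not state the loss form, so the mean-squared-error terms, the weights `alpha` and `beta`, and the function name `joint_loss` below are assumptions.

```python
import numpy as np

def joint_loss(asr_loss, target_beam=None, ref_beam=None,
               noise=None, ref_noise=None, alpha=0.1, beta=0.1):
    """Recognition loss plus optional MSE terms on target beam and noise."""
    total = float(asr_loss)
    if target_beam is not None and ref_beam is not None:
        # Auxiliary supervision on the target beam (assumed MSE).
        total += alpha * float(np.mean(np.abs(target_beam - ref_beam) ** 2))
    if noise is not None and ref_noise is not None:
        # Auxiliary supervision on the estimated noise (assumed MSE).
        total += beta * float(np.mean(np.abs(noise - ref_noise) ** 2))
    return total
```

Leaving both auxiliary terms out reduces the objective to training on the sample microphone signal and its recognition text alone.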
Fig. 5 illustrates a physical structure diagram of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 510, a communications interface 520, a memory 530, a communication bus 540, and a microphone 550, wherein the processor 510, the communications interface 520, the memory 530, and the microphone 550 communicate with each other via the communication bus 540. The microphone 550 is used to collect microphone signals, and the processor 510 may invoke logic instructions in the memory 530 to perform a speech recognition method comprising: acquiring a microphone signal; determining a target beam of the microphone signal based on the correlation degree between the beams in the microphone signal, and denoising the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal; and performing voice recognition on the enhanced target signal.
In addition, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing a speech recognition method provided by the above methods, the method comprising: acquiring a microphone signal; determining a target wave beam of the microphone signals based on the correlation degree between the wave beams in the microphone signals, and denoising the target wave beam based on a noise signal in the microphone signals to obtain an enhanced target signal; and performing voice recognition on the enhanced target signal.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the speech recognition method provided by the above methods, the method comprising: acquiring a microphone signal; determining a target beam of the microphone signal based on the correlation degree between the beams in the microphone signal, and denoising the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal; and performing voice recognition on the enhanced target signal.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and can of course also be implemented by hardware. Based on this understanding, the above technical solutions in essence, or the parts thereof that contribute over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech recognition method, comprising:
acquiring a microphone signal;
determining a target wave beam of the microphone signals based on the correlation degree between the wave beams in the microphone signals, and denoising the target wave beam based on a noise signal in the microphone signals to obtain an enhanced target signal;
and performing voice recognition on the enhanced target signal.
2. The speech recognition method of claim 1, wherein determining the target beam of the microphone signal based on the correlation between the beams of the microphone signal comprises:
determining the importance of each beam based on the correlation degree between the beams;
and carrying out weighted summation on each wave beam based on the importance degree of each wave beam to obtain the target wave beam.
3. The speech recognition method of claim 2, wherein the noise signal determining step comprises:
determining a direction of arrival based on the beam with the highest importance;
and performing noise estimation on the microphone signal based on the target wave beam and the direction of arrival to obtain a noise signal of the microphone signal.
4. The speech recognition method of claim 1, wherein the performing speech recognition on the enhanced target signal comprises:
and performing voice recognition on the enhanced target signal based on the noise signal.
5. The speech recognition method of claim 1, wherein the step of determining each beam in the microphone signal comprises:
carrying out time-frequency transformation on the microphone signals to obtain multi-channel frequency domain signals;
and generating a plurality of fixed beams in different directions as each beam based on the multi-channel frequency domain signals.
6. The speech recognition method according to any one of claims 1 to 5, wherein the determining a target beam of the microphone signal based on a correlation between beams in the microphone signal, and denoising the target beam based on a noise signal in the microphone signal to obtain an enhanced target signal comprises:
based on a voice enhancement module, determining a target wave beam of a microphone signal by applying the correlation degree between wave beams in the microphone signal, and reducing noise of the target wave beam by applying a noise signal in the microphone signal to obtain an enhanced target signal;
the performing voice recognition on the enhanced target signal comprises:
performing voice recognition on the enhanced target signal based on a voice recognition module;
the voice enhancement module and the voice recognition module form an integrated model, and the integrated model is obtained based on a sample microphone signal and recognition text training of the sample microphone signal.
7. The speech recognition method of claim 6, wherein the integrated model is trained based on a sample microphone signal, a recognition text of the sample microphone signal, and a target beam and/or noise signal of the sample microphone signal.
8. A speech recognition apparatus, comprising:
an acquisition unit configured to acquire a microphone signal;
the beam forming unit is used for determining a target beam of the microphone signals based on the correlation degree between the beams in the microphone signals, and carrying out noise reduction on the target beam based on a noise signal in the microphone signals to obtain an enhanced target signal;
and the recognition unit is used for carrying out voice recognition on the enhanced target signal.
9. An electronic device comprising a microphone, a memory, a processor, and a computer program stored on the memory and executable on the processor,
the microphone is used for collecting microphone signals;
when the processor executes the program, the target wave beam of the microphone signal is determined based on the correlation degree between the wave beams in the microphone signal, and the noise of the target wave beam is reduced based on the noise signal in the microphone signal, so that an enhanced target signal is obtained; and performing voice recognition on the enhanced target signal.
10. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the speech recognition method according to any one of claims 1 to 7.
CN202211689410.2A 2022-12-27 2022-12-27 Voice recognition method and device, electronic equipment and storage medium Pending CN115862632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211689410.2A CN115862632A (en) 2022-12-27 2022-12-27 Voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115862632A true CN115862632A (en) 2023-03-28

Family

ID=85653540

Country Status (1)

Country Link
CN (1) CN115862632A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination