CN113823311B - Voice recognition method and device based on audio enhancement - Google Patents
- Publication number
- CN113823311B (publication); application CN202110955519A
- Authority
- CN
- China
- Prior art keywords
- data
- signal
- voice recognition
- channel audio
- sound source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a voice recognition method and device based on audio enhancement. The method comprises: computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function; computing second data from the first data using a second filter function; processing the second data with a beamforming algorithm to obtain a single-channel audio signal; processing the single-channel audio signal with a noise reduction algorithm based on the first environmental noise to obtain third data; and recognizing the third data with a voice recognition model. In this way, the multichannel voice data picked up by the microphone array is first cleared of the multipath-reflection mixed voice data, with its various delays, produced when the sound-source sound is reflected and absorbed by different obstacles; non-target sound-source data is then removed from the second data; and finally the environmental noise is removed. This enhances the sound-source audio data and improves the accuracy of voice recognition.
Description
Technical Field
The invention relates to the technical field of voice recognition, and in particular to a voice recognition method and device based on audio enhancement.
Background
Compared with a single microphone, a microphone array offers higher gain, flexible beam steering, strong interference resistance, and other advantages. For example, when picking up a distant voice signal, the high gain of the array makes it easier to acquire the weak voice signal in a far-field environment. The microphone array also has spatial filtering characteristics and can flexibly suppress interference from different directions, so it is widely applied in fields such as blind source separation and sound source localization.
Audio data contains various noises that degrade, to different degrees, the quality of voice communication and human-computer interaction. Because application scenarios are complex and noise is diverse, existing algorithms still fall short of the ideal in certain specific scenes, so developing a robust microphone-array speech enhancement algorithm is particularly important.
Disclosure of Invention
In view of the foregoing problems with the prior art, the present invention provides a voice recognition method based on audio enhancement, comprising:
computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model.
Preferably, the first environmental-noise reduction algorithm comprises:
inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal;
obtaining the clean speech data in the single-channel audio signal based on the single-channel audio signal and the audio features.
Preferably, the first deep learning network model comprises a plurality of LSTM network models, with the a-th-layer outputs of the 1st to n-th LSTM network models commonly connected to the (a+1)-th-layer input of the n-th LSTM network model.
Preferably, the method for obtaining the second filter function comprises:
performing a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time;
obtaining the coefficient matrix of the linear combination with a weighted least squares algorithm, so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal. The weight of the weighted least squares algorithm is derived from the power spectral density estimate of the second desired signal, where M is the number of microphones in the microphone array and ε is a constant. The estimate of the coefficient matrix of the linear combination is expressed through the inverse of the autocorrelation matrix of the multipath-reflection mixed data in the first data whose sound-source arrival delay is greater than the first preset threshold. (The corresponding formulas are given as images in the original publication.)
Preferably, the power spectral density estimate of the second desired signal is obtained by a power spectral density estimation model based on a second deep learning network. During training, the second deep learning network takes the power spectral density of the first data as input and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output an estimate of the latter.
Preferably, the second deep learning network adopts an LSTM network, and the output data of each cell of the LSTM network is passed through a projection before being fed to the input of the next cell.
The invention also provides a voice recognition device based on audio enhancement, comprising:
a first data generation module, configured to compute first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
a second data generation module, configured to compute the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
a single-channel audio signal generation module, configured to process the second data through a beamforming algorithm to obtain a single-channel audio signal;
a third data generation module, configured to process the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, so as to remove the environmental noise in the single-channel audio signal and obtain third data;
and a voice recognition module, configured to recognize the third data through the voice recognition model.
As a further optimization of the above solution, the second data generation module comprises a second filter function unit. The second filter function performs a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time; the coefficient matrix of the linear combination is obtained with a weighted least squares algorithm so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal.
The invention also provides an electronic device, which comprises:
a memory for storing executable instructions;
and a processor, configured to implement the audio-enhancement-based voice recognition method described above when executing the executable instructions stored in the memory.
The invention also provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the audio-enhancement-based voice recognition method described above.
The voice recognition method and device based on audio enhancement have the following beneficial effects:
1. The multichannel voice data picked up by the microphone array is first cleared of the multipath-reflection mixed voice data, that is, reflections and absorptions of the sound-source sound at different obstacles that reach the microphone with different delays relative to the direct-path voice data. Sound-source data other than the target sound-source data is then removed from the second data, and finally the environmental noise is removed. This enhances the sound-source audio data and improves the accuracy of voice recognition.
2. Because the a-th-layer outputs of the 1st to n-th LSTM network models are commonly connected to the (a+1)-th-layer input of the n-th LSTM network model, the (a+1)-th layer of the n-th LSTM network model learns from the prior knowledge provided by the 1st to (n-1)-th LSTM network models. It thereby obtains a more accurate learning direction, which effectively shortens the neural network training time and improves the accuracy of the first-type audio features of clean speech extracted by the first deep learning network model.
Drawings
FIG. 1 is an overall flow chart of a voice recognition method based on audio enhancement of the present invention;
FIG. 2 is a block diagram of a voice recognition device based on audio enhancement according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
The voice recognition method based on audio enhancement provided by the embodiment comprises the following steps:
computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model.
In this embodiment, the multichannel voice data picked up by the microphone array is first cleared of the multipath-reflection mixed voice data, that is, reflections and absorptions of the sound-source sound at different obstacles that reach the microphone with different delays relative to the direct-path voice data. Sound-source data other than the target sound-source data is then removed from the second data, and finally the environmental noise is removed. This enhances the sound-source audio data and improves the accuracy of voice recognition.
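As a rough illustration of the data flow in this embodiment, the four enhancement stages feeding the recognizer can be sketched as follows. This is only a hedged sketch: every function body is a simplified stand-in (the patent's actual filters are defined by the MMSE and weighted-least-squares criteria described here), and all shapes and constants are illustrative.

```python
import numpy as np

def first_filter(multichannel, alpha=0.9):
    # Stand-in for the first filter function (the patent specifies an
    # MMSE-optimal filter); here: simple per-channel recursive smoothing.
    out = np.copy(multichannel)
    for n in range(1, out.shape[1]):
        out[:, n] = alpha * multichannel[:, n] + (1 - alpha) * out[:, n - 1]
    return out

def second_filter(first_data, delay=3, coeff=0.1):
    # Stand-in for the second filter function: subtract a crude linear
    # prediction of late reflections built from delayed past samples.
    out = np.copy(first_data)
    out[:, delay:] -= coeff * first_data[:, :-delay]
    return out

def beamform(second_data):
    # Delay-and-sum beamforming with zero steering delays:
    # average the channels into one single-channel audio signal.
    return second_data.mean(axis=0)

def denoise(mono, noise_floor=0.01):
    # Stand-in for the deep-learning noise reduction: hard thresholding.
    return np.where(np.abs(mono) > noise_floor, mono, 0.0)

x = np.random.randn(4, 16000)   # 4-microphone array, 1 s at 16 kHz
third_data = denoise(beamform(second_filter(first_filter(x))))
# third_data is the enhanced single-channel signal handed to the recognizer
```

The chaining order mirrors the embodiment: dereverberation runs on multichannel data before beamforming, and single-channel noise reduction runs after it.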
The first environmental-noise reduction algorithm comprises:
step one, inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal;
step two, obtaining the clean voice data in the single-channel audio signal based on the single-channel audio signal and the audio features.
In step one, the first deep learning network model is composed of a plurality of LSTM network models. The input of the first deep learning network model is the single-channel audio signal, and each LSTM network model outputs the first-type audio features of clean voice under a different signal-to-noise ratio condition, based on the first-type audio features of the single-channel audio signal.
In step two, the first-type audio features of clean voice obtained under the different signal-to-noise ratio conditions are fused by taking their mean to obtain a first-type fused audio feature; the spectrum is reconstructed from the fused audio feature, and the reconstructed voice data is then obtained.
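The mean-fusion step above can be sketched briefly. The per-model outputs below are hypothetical stand-ins (the models themselves are not implemented here), and treating the feature as a log power spectrum is an assumption consistent with the feature types mentioned next.

```python
import numpy as np

# Hypothetical outputs of n = 3 LSTM models trained at different SNRs:
# each predicts a log-power spectrum of the clean speech (frames x bins).
model_outputs = [np.full((50, 257), v) for v in (1.0, 2.0, 3.0)]

# Fuse by element-wise mean, as described above.
fused_log_power = np.mean(model_outputs, axis=0)

# A magnitude spectrum can be recovered from the fused log-power feature;
# a full system would combine it with the noisy phase and an inverse STFT
# to obtain the reconstructed voice waveform.
magnitude = np.sqrt(np.exp(fused_log_power))
```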
The audio features of the above type may be, for example, the log power spectra of the clean and noisy speech, or spectral masking features, etc.
The first deep learning network model comprises a plurality of LSTM network models, with the a-th-layer outputs of the 1st to n-th LSTM network models commonly connected to the (a+1)-th-layer input of the n-th LSTM network model.
With this connection structure, the (a+1)-th layer of the n-th LSTM network model learns from the prior knowledge provided by the 1st to (n-1)-th LSTM network models, so a more accurate learning direction can be obtained, the neural network training time is effectively shortened, and the accuracy of the first-type audio features of clean voice extracted by the first deep learning network model is improved.
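A minimal sketch of this connection pattern, with plain dense layers standing in for LSTM layers (layer sizes are illustrative, and the interpretation of "commonly connected" as concatenation of the layer-a outputs is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, in_dim, hidden = 3, 64, 32

def layer(x, w):
    # Plain dense layer standing in for one LSTM layer.
    return np.tanh(x @ w)

# Layer a of each of the n models processes the same frame independently.
x = rng.standard_normal((1, in_dim))
h_a = [layer(x, rng.standard_normal((in_dim, hidden))) for _ in range(n_models)]

# Layer a+1 of model n receives the concatenation of ALL layer-a outputs,
# so it learns from the prior knowledge of models 1..n-1 as well.
w_next = rng.standard_normal((hidden * n_models, hidden))
h_next = layer(np.concatenate(h_a, axis=1), w_next)
```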
During training, clean voice data and noise data are mixed at different signal-to-noise ratios to serve as training data, and the first-type audio features of the training data are input into an LSTM network model to train that single model; only after the training of one LSTM network model is complete does the training of the next LSTM network model begin.
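Mixing clean speech and noise at a target SNR, as in the training setup above, can be sketched as follows (signal lengths and the SNR values are illustrative):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so the clean-to-noise power ratio equals snr_db,
    # then add it to the clean signal to form one training example.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)

# One mixing condition per LSTM model, e.g. -5, 0, 5 and 10 dB.
training_data = {snr: mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)}
```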
The method for acquiring the second filter function comprises:
performing a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time;
obtaining the coefficient matrix of the linear combination with a weighted least squares algorithm, so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal. The weight of the weighted least squares algorithm is derived from the power spectral density estimate of the second desired signal, where M is the number of microphones in the microphone array and ε is a constant. The estimate of the coefficient matrix of the linear combination is expressed through the inverse of the autocorrelation matrix of the multipath-reflection mixed data in the first data whose sound-source arrival delay is greater than the first preset threshold. Here x(n) denotes the first data at the current time, and the stacked first data from all times before the current time supplies the multipath-reflection mixed data whose sound-source arrival delay is greater than the first preset threshold. (The corresponding formulas are given as images in the original publication.)
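The description matches the general shape of linear-prediction dereverberation with iteratively re-estimated weighted-least-squares coefficients. Under that assumption, a single-channel numpy sketch follows; the patent's exact weighting and matrix forms are in the formulas not reproduced here, so every constant below (delay, order, ε, iteration count) is illustrative.

```python
import numpy as np

def dereverb_lp(x, delay=2, order=12, eps=1e-4, iters=3):
    # Estimate the late part of x(n) as a linear combination of past samples
    # x(n-delay) ... x(n-delay-order+1); the weight of each sample in the
    # least squares fit is the reciprocal of the current desired-signal power.
    n = len(x)
    past = np.zeros((n, order))             # columns of delayed past samples
    for k in range(order):
        past[delay + k:, k] = x[: n - delay - k]
    d = np.copy(x)                          # initial desired-signal estimate
    for _ in range(iters):
        w = 1.0 / np.maximum(d ** 2, eps)   # weighted least squares weights
        A = past.T @ (w[:, None] * past)    # weighted autocorrelation matrix
        b = past.T @ (w * x)
        g = np.linalg.solve(A + eps * np.eye(order), b)   # coefficient vector
        d = x - past @ g                    # desired = input minus late part
    return d

rng = np.random.default_rng(2)
dry = rng.standard_normal(4000)
# Synthetic "multipath reflection": a single echo arriving 5 samples late.
rev = dry.copy()
rev[5:] += 0.5 * dry[:-5]
out = dereverb_lp(rev, delay=2, order=12)
```

On this synthetic echo the predicted late part cancels most of the reflection, leaving an output closer to the dry signal than the reverberant input.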
In the process of obtaining the second filter function, the power spectral density estimate of the second desired signal is obtained by a power spectral density estimation model based on a second deep learning network. During training, the second deep learning network takes the power spectral density of the first data as input and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output an estimate of the latter. The second deep learning network adopts an LSTM network, and the output data of each cell of the LSTM network is passed through a projection before being fed to the input of the next cell.
Specifically, based on the input data x_t, each cell discards part of the data f_t at the forget gate, updates the cell state c_t at the input gate based on the previous cell's output h_(t-1) and the current input data x_t, and obtains the output data m_t at the output gate based on the output-gate activation o_t and the updated cell state c_t. The projection is then applied to the output data: the recurrent unit computes r_t = m_t * W1, the non-recurrent unit computes p_t = m_t * W2, and the data fed to the next cell is W3 * r_t + W4 * p_t + b, where W1, W2, W3 and W4 are weight parameters and b is a bias parameter. This input-output relationship between cells effectively reduces model complexity and training time.
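The projection step can be sketched in a few lines. Treating W1..W4 as matrices and b as a bias vector is an assumption, since the text specifies only the products; the dimensions below are illustrative, with the projection reducing the cell output from the hidden size to a smaller size, which is what cuts model complexity.

```python
import numpy as np

rng = np.random.default_rng(3)
hidden, proj = 64, 32

# m_t: the gated cell output described above (random stand-in here).
m_t = rng.standard_normal((1, hidden))

W1 = rng.standard_normal((hidden, proj))   # recurrent branch
W2 = rng.standard_normal((hidden, proj))   # non-recurrent branch
W3 = rng.standard_normal((proj, proj))
W4 = rng.standard_normal((proj, proj))
b = rng.standard_normal(proj)

r_t = m_t @ W1                              # r_t = m_t * W1 (recurrent unit)
p_t = m_t @ W2                              # p_t = m_t * W2 (non-recurrent unit)
next_cell_input = r_t @ W3 + p_t @ W4 + b   # W3 * r_t + W4 * p_t + b
```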
The embodiment also provides a voice recognition device based on audio enhancement, comprising:
a first data generation module, configured to compute first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
a second data generation module, configured to compute the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
a single-channel audio signal generation module, configured to process the second data through a beamforming algorithm to obtain a single-channel audio signal;
a third data generation module, configured to process the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, so as to remove the environmental noise in the single-channel audio signal and obtain third data;
and a voice recognition module, configured to recognize the third data through the voice recognition model.
The second data generation module comprises a second filter function unit. The second filter function performs a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time; the coefficient matrix of the linear combination is obtained with a weighted least squares algorithm so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal.
For specific limitations of the voice recognition device, reference may be made to the limitations of the voice recognition method above, which are not repeated here. The modules in the voice recognition device may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke them and execute the corresponding operations.
The embodiment also provides an electronic device, including:
a memory for storing executable instructions;
and a processor, configured to implement the audio-enhancement-based voice recognition method described above when executing the executable instructions stored in the memory.
The present embodiment also provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the audio-enhancement-based voice recognition method described above.
The electronic device comprises at least one processor, a memory, a user interface, and at least one network interface, coupled together by a bus system. It will be appreciated that the bus system enables communication between these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus. The user interface may include a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, touch screen, or the like. The processor of the electronic device provides computing and control capabilities. The memory may be volatile memory, non-volatile memory, or both; in this embodiment it stores an operating system, a computer program, and a database, and the computer program, when executed by the processor, implements the audio-enhancement-based voice recognition method described above.
The present invention is not limited to the specific embodiments described above; various modifications made by those skilled in the art from the above concepts, without inventive effort, fall within the scope of the present invention.
Claims (9)
1. A voice recognition method based on audio enhancement, comprising:
computing first data from the multichannel sound-source audio data picked up by a microphone array using a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing the first data through a second filter function to eliminate, from the first data, the multipath-reflection mixed data whose sound-source arrival delay is greater than a first preset threshold, obtaining second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model;
wherein the first environmental-noise reduction algorithm comprises: inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal; and obtaining the clean speech data in the single-channel audio signal based on the single-channel audio signal and the audio features.
2. The audio-enhancement-based voice recognition method of claim 1, wherein the first deep learning network model comprises a plurality of LSTM network models, and the a-th-layer outputs of the 1st to n-th LSTM network models are commonly connected to the (a+1)-th-layer input of the n-th LSTM network model.
3. The audio-enhancement-based voice recognition method of claim 1, wherein the method for obtaining the second filter function comprises:
performing a linear combination of the multipath-reflection mixed data, in the first data at all times before the current time, whose sound-source arrival delay is greater than the first preset threshold, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current time;
the coefficient matrix of the linear combination is obtained by a weighted least squares algorithm, such that the estimate of the second desired signal in the output signal minimizes the time-domain correlation, namely:

$$\hat{d}(n) = y(n) - \hat{G}^{H}(n)\,\tilde{y}(n-\Delta)$$

wherein $\hat{d}(n)$ is the estimate of the second desired signal, $y(n)$ is the first data at the current time $n$, $\tilde{y}(n-\Delta)$ stacks the first data whose sound source arrival delay exceeds the first preset threshold $\Delta$, and $\hat{G}(n)$ is the coefficient matrix of the linear combination;

the weight estimate $w(n)$ of the weighted least squares algorithm is:

$$w(n) = \frac{1}{\frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}_{d,m}^{2}(n) + \varepsilon}$$

wherein $\hat{\sigma}_{d,m}^{2}(n)$ is the power spectral density estimate of the second desired signal at the m-th microphone, $M$ is the number of microphones in the microphone array, and $\varepsilon$ is a constant;

the estimate $\hat{G}(n)$ of the coefficient matrix of the linear combination is:

$$\hat{G}(n) = \hat{G}(n-1) + K(n)\,\hat{d}^{H}(n)$$

wherein

$$K(n) = \frac{R^{-1}(n-1)\,\tilde{y}(n-\Delta)}{\gamma / w(n) + \tilde{y}^{H}(n-\Delta)\,R^{-1}(n-1)\,\tilde{y}(n-\Delta)}, \qquad R^{-1}(n) = \frac{1}{\gamma}\left[R^{-1}(n-1) - K(n)\,\tilde{y}^{H}(n-\Delta)\,R^{-1}(n-1)\right],$$

$0<\gamma\le 1$ is a forgetting factor, and $R^{-1}(n)$ is the inverse of the autocorrelation matrix of the multipath reflection mixed data whose sound source arrival delay in the first data is greater than the first preset threshold.
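The recursive weighted-least-squares dereverberation described in claim 3 can be sketched for a single frequency bin of a single channel as below, in the spirit of weighted-prediction-error (WPE) processing; the function name, default parameters, and per-frame scalar PSD estimate are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def recursive_wpe_bin(y, delay=2, taps=4, gamma=0.99, eps=1e-6):
    """Recursive weighted-least-squares dereverberation sketch for one
    STFT frequency bin of one channel.

    y : complex STFT frames of the first data, shape (T,)
    Returns per-frame estimates of the second desired signal."""
    g = np.zeros(taps, dtype=complex)             # linear-combination coefficients
    R_inv = np.eye(taps, dtype=complex) / eps     # inverse autocorrelation matrix
    d = np.empty_like(y)
    for n in range(len(y)):
        # past frames whose arrival delay exceeds the preset threshold `delay`
        idx = n - delay - np.arange(taps)
        x = np.where(idx >= 0, y[np.clip(idx, 0, None)], 0)
        d[n] = y[n] - np.vdot(g, x)               # desired-signal estimate
        w = 1.0 / (np.abs(d[n]) ** 2 + eps)       # WLS weight from the PSD estimate
        k = R_inv @ x / (gamma / w + np.conj(x) @ R_inv @ x)  # gain vector
        g = g + k * np.conj(d[n])                 # coefficient update
        R_inv = (R_inv - np.outer(k, np.conj(x) @ R_inv)) / gamma
    return d
```

A full system would run this per frequency bin across all microphone channels and replace the instantaneous `|d|²` weight with the network-based power spectral density estimate of claim 4.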
4. The audio enhancement based speech recognition method according to claim 3, wherein the power spectral density estimate of the second desired signal is obtained from a power spectral density estimation model based on a second deep learning network, and wherein, during training, the second deep learning network takes the power spectral density of the first data as input and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output the estimate of the power spectral density of the second desired signal.
5. The voice recognition method according to claim 4, wherein the second deep learning network uses an LSTM network, and the output data of each cell of the LSTM network is fed to the input of the next cell through projection processing.
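The "projection processing" of claim 5 corresponds to an LSTM cell whose hidden output is multiplied by a projection matrix before being fed onward (an LSTMP-style cell). A minimal single-step sketch follows; all weight shapes and initializations are illustrative assumptions, not the patent's parameters:

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def lstmp_step(x, h, c, W, Wp):
    """One step of an LSTM cell with projection: the cell's hidden output
    is passed through projection matrix Wp before being fed to the next cell."""
    z = W @ np.concatenate([x, h])        # all four gate pre-activations at once
    H = len(c)
    i = sigmoid(z[:H])                    # input gate
    f = sigmoid(z[H:2 * H])               # forget gate
    o = sigmoid(z[2 * H:3 * H])           # output gate
    g = np.tanh(z[3 * H:])                # candidate cell update
    c = f * c + i * g                     # new cell state
    h_proj = Wp @ (o * np.tanh(c))        # projection processing of the output
    return h_proj, c

# Shapes: input dim 3, cell size 4, projected output dim 2.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 5)) * 0.1    # (4 * cell, input + projected)
Wp = rng.standard_normal((2, 4)) * 0.1    # (projected, cell)
h, c = lstmp_step(np.ones(3), np.zeros(2), np.zeros(4), W, Wp)
```

The projection reduces the recurrent state dimension, which keeps the per-step cost low when the cell size is large.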
6. A voice recognition apparatus based on audio enhancement, comprising:
the first data generation module is used for calculating the multichannel sound source sound data picked up by the microphone array through a first filter function to obtain first data, wherein the first filter function has a filter parameter capable of meeting the minimum mean square error between an output signal and a first expected signal;
the second data generation module is used for calculating the first data through a second filter function to eliminate multipath reflection mixed data with arrival delay of sound source sound larger than a first preset threshold value in the first data, so as to obtain second data, wherein the second filter function has a filter parameter capable of minimizing time domain correlation of a second expected signal of an output signal;
the single-channel audio signal generation module is used for processing the second data through a beam forming algorithm to obtain a single-channel audio signal;
the third data generation module is used for processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise so as to remove the environmental noise in the single-channel audio signal and obtain third data;
the voice recognition module is used for recognizing the third data through the voice recognition model;
the noise reduction algorithm based on the first environmental noise comprises: inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal; and obtaining clean speech data in the single-channel audio signal based on the single-channel audio signal and the audio features.
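The module structure of the claim-6 device maps naturally onto a small processing pipeline; in this sketch each module is supplied as a callable, and all names are placeholders rather than the patented filter implementations:

```python
class SpeechRecognitionDevice:
    """Structural sketch of the claimed device: four enhancement modules
    feed a voice recognition module (callables here are placeholders)."""

    def __init__(self, first_filter, second_filter, beamformer, denoiser, recognizer):
        self.stages = [
            first_filter,   # first data generation module (first filter function)
            second_filter,  # second data generation module (dereverberation)
            beamformer,     # single-channel audio signal generation module
            denoiser,       # third data generation module (noise reduction)
        ]
        self.recognizer = recognizer  # voice recognition module

    def recognize(self, multichannel_audio):
        data = multichannel_audio
        for stage in self.stages:
            data = stage(data)
        return self.recognizer(data)

# Usage with identity placeholders and a dummy recognizer:
ident = lambda d: d
device = SpeechRecognitionDevice(ident, ident, ident, ident, lambda d: "transcript")
result = device.recognize([0.1, 0.2, 0.3])
```

Keeping each module behind a uniform callable interface mirrors the claim's module decomposition and lets any stage be swapped independently.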
7. The voice recognition device according to claim 6, wherein the second data generation module comprises a second filter function unit which performs a linear combination of the multipath reflection mixed data whose sound source arrival delay is greater than a first preset threshold in the first data before the current time, and obtains the coefficient matrix of the linear combination by a weighted least squares algorithm, such that the estimate of the second desired signal in the output signal minimizes the time-domain correlation.
8. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing an audio enhancement based speech recognition method according to any one of claims 1 to 5 when executing executable instructions stored in said memory.
9. A computer readable storage medium storing executable instructions which when executed by a processor implement a voice recognition method based on audio enhancement as claimed in any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110955519.5A CN113823311B (en) | 2021-08-19 | 2021-08-19 | Voice recognition method and device based on audio enhancement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110955519.5A CN113823311B (en) | 2021-08-19 | 2021-08-19 | Voice recognition method and device based on audio enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113823311A CN113823311A (en) | 2021-12-21 |
CN113823311B true CN113823311B (en) | 2023-11-21 |
Family
ID=78922801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110955519.5A Active CN113823311B (en) | 2021-08-19 | 2021-08-19 | Voice recognition method and device based on audio enhancement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113823311B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106024001A (en) * | 2016-05-03 | 2016-10-12 | 电子科技大学 | Method used for improving speech enhancement performance of microphone array |
CN106531179A (en) * | 2015-09-10 | 2017-03-22 | 中国科学院声学研究所 | Multi-channel speech enhancement method based on semantic prior selective attention |
CN108053834A (en) * | 2017-12-05 | 2018-05-18 | 北京声智科技有限公司 | audio data processing method, device, terminal and system |
CN108109617A (en) * | 2018-01-08 | 2018-06-01 | 深圳市声菲特科技技术有限公司 | A kind of remote pickup method |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017063693A1 (en) * | 2015-10-14 | 2017-04-20 | Huawei Technologies Co., Ltd. | Adaptive reverberation cancellation system |
JP6480644B1 (en) * | 2016-03-23 | 2019-03-13 | Google LLC | Adaptive audio enhancement for multi-channel speech recognition
- 2021-08-19: application CN202110955519.5A filed in China (CN); patent CN113823311B, status active
Also Published As
Publication number | Publication date |
---|---|
CN113823311A (en) | 2021-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110600018B (en) | Voice recognition method and device and neural network training method and device | |
CN110444214B (en) | Speech signal processing model training method and device, electronic equipment and storage medium | |
US10123113B2 (en) | Selective audio source enhancement | |
CN107393550B (en) | Voice processing method and device | |
CN111415676B (en) | Blind source separation method and system based on separation matrix initialization frequency point selection | |
US10679617B2 (en) | Voice enhancement in audio signals through modified generalized eigenvalue beamformer | |
KR20180127171A (en) | Apparatus and method for student-teacher transfer learning network using knowledge bridge | |
CN108417224B (en) | Training and recognition method and system of bidirectional neural network model | |
CN104157293B (en) | The signal processing method of targeted voice signal pickup in a kind of enhancing acoustic environment | |
Schmid et al. | Variational Bayesian inference for multichannel dereverberation and noise reduction | |
CN106558315B (en) | Heterogeneous microphone automatic gain calibration method and system | |
CN110189761B (en) | Single-channel speech dereverberation method based on greedy depth dictionary learning | |
CN110223708B (en) | Speech enhancement method based on speech processing and related equipment | |
CN111445919A (en) | Speech enhancement method, system, electronic device, and medium incorporating AI model | |
CN110767223A (en) | Voice keyword real-time detection method of single sound track robustness | |
CN112489668B (en) | Dereverberation method, device, electronic equipment and storage medium | |
CN110660406A (en) | Real-time voice noise reduction method of double-microphone mobile phone in close-range conversation scene | |
CN112712818A (en) | Voice enhancement method, device and equipment | |
Cui et al. | Multi-objective based multi-channel speech enhancement with BiLSTM network | |
CN115424627A (en) | Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm | |
CN107360497B (en) | Calculation method and device for estimating reverberation component | |
Cho et al. | Convolutional maximum-likelihood distortionless response beamforming with steering vector estimation for robust speech recognition | |
CN114242104A (en) | Method, device and equipment for voice noise reduction and storage medium | |
CN113823311B (en) | Voice recognition method and device based on audio enhancement | |
CN107346658B (en) | Reverberation suppression method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
TA01 | Transfer of patent application right ||
Effective date of registration: 2023-10-26
Address after: No. 133 Shiling Road, Dongyong Town, Nansha District, Guangzhou City, Guangdong Province, 510000
Applicant after: Guangzhou Shengwei Electronics Co.,Ltd.
Address before: Room 1104, Building 1, Binhu Century City Guanhu Garden, intersection of Luzhou Avenue and Ziyun Road, Binhu District, Baohe District, Hefei City, Anhui Province, 230041
Applicant before: Anhui chuangbian Information Technology Co.,Ltd.
GR01 | Patent grant | ||
GR01 | Patent grant |