CN113823311B - Voice recognition method and device based on audio enhancement - Google Patents

Voice recognition method and device based on audio enhancement

Info

Publication number
CN113823311B
CN113823311B (application CN202110955519.5A)
Authority
CN
China
Prior art keywords
data
signal
voice recognition
channel audio
sound source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110955519.5A
Other languages
Chinese (zh)
Other versions
CN113823311A (en)
Inventor
戴李
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shengwei Electronics Co ltd
Original Assignee
Guangzhou Shengwei Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shengwei Electronics Co., Ltd.
Priority: CN202110955519.5A
Publication of CN113823311A
Application granted
Publication of CN113823311B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Noise Elimination (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a voice recognition method and device based on audio enhancement. The method comprises: computing first data from the multichannel sound-source audio data picked up by a microphone array through a first filter function; computing second data from the first data through a second filter function; processing the second data through a beamforming algorithm to obtain a single-channel audio signal; processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise to obtain third data; and recognizing the third data through a voice recognition model. In this way, the multichannel voice data picked up by the microphone array are first cleared of the multipath-reflection mixed voice data, whose differing delays arise from reflection and absorption of the source sound by different obstacles; non-target sound-source data are then removed from the second data; and environmental noise is removed last. This realizes enhancement of the sound-source audio data and improves the accuracy of voice recognition.

Description

Voice recognition method and device based on audio enhancement
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method and device based on audio enhancement.
Background
Compared with a single microphone, a microphone array offers higher gain, flexible beam steering, and stronger interference rejection. For example, when a remote voice signal is picked up, the high gain of the array makes it easier to acquire the weak voice signal in a far-field environment. In addition, a microphone array has spatial filtering characteristics and can flexibly suppress interference from different directions, so it is widely applied in fields such as blind source separation and sound source localization.
Audio data contain various noises that degrade, to different degrees, the quality of voice communication and human-computer interaction. Owing to the complexity of application scenarios and the diversity of noise, existing algorithms still cannot achieve ideal results in certain specific scenes, so developing a robust microphone-array speech enhancement algorithm is particularly important.
Disclosure of Invention
In view of the foregoing problems with the prior art, the present invention provides a voice recognition method based on audio enhancement, including:
computing first data from the multichannel sound-source audio data picked up by a microphone array through a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing second data from the first data through a second filter function, so as to eliminate from the first data the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal contained in the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model.
Preferably, the first ambient noise reduction algorithm includes:
inputting the single-channel audio signal into a first deep learning network model to obtain the audio characteristics of environmental noise in the single-channel audio signal;
clean speech data in the single-channel audio signal is obtained based on the single-channel audio signal and the audio features.
Preferably, the first deep learning network model comprises a plurality of LSTM network models, with the a-th layer outputs of the 1st to n-th LSTM network models commonly connected to the (a+1)-th layer input of the n-th LSTM network model.
Preferably, the method for obtaining the second filter function includes:
forming a linear combination of the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold in the first data of all moments before the current moment, to obtain an estimate of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data at the current moment;
obtaining the coefficient matrix of the linear combination by a weighted least squares algorithm, so that the estimated second desired signal contained in the output signal has minimal time-domain correlation (the defining formulas appear as images in the source and are not reproduced), wherein d̂(n) denotes the estimate of the second desired signal;
the weight estimate λ̂(n) of the weighted least squares algorithm is computed from the power spectral density estimate of the second desired signal, where M is the number of microphones in the microphone array and ε is a constant;
and the estimate ĝ of the coefficient matrix of the linear combination is expressed through the inverse of the autocorrelation matrix of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data.
Preferably, the power spectral density estimate of the second desired signal is obtained by a power spectral density estimation model based on a second deep learning network, wherein the second deep learning network takes the power spectral density of the first data as input during training and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output an estimate of the power spectral density of the second desired signal.
Preferably, the second deep learning network adopts an LSTM network, and output data of each cell of the LSTM network is input to an input of a next cell through projection processing.
The invention also provides a voice recognition device based on audio enhancement, which comprises:
the first data generation module, used for computing first data from the multichannel sound-source audio data picked up by the microphone array through a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
the second data generation module, used for computing second data from the first data through a second filter function, so as to eliminate from the first data the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal contained in the output signal;
the single-channel audio signal generation module, used for processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
the third data generation module, used for processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, so as to remove the environmental noise in the single-channel audio signal and obtain third data;
and the voice recognition module, used for recognizing the third data through the voice recognition model.
As a further optimization of the above solution, the second data generation module includes a second filter function unit. The second filter function forms a linear combination of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data of all moments before the current moment, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current moment, and obtains the coefficient matrix of the linear combination by a weighted least squares algorithm, so that the estimated second desired signal contained in the output signal has minimal time-domain correlation.
The invention also provides an electronic device, which comprises:
a memory for storing executable instructions;
and the processor is used for realizing the voice recognition method based on the audio enhancement when executing the executable instructions stored in the memory.
The invention also provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the audio-enhancement-based voice recognition method described above.
The voice recognition method and device based on audio enhancement have the following beneficial effects:
1. The multichannel voice data picked up by the microphone array are first cleared of the multipath-reflection mixed voice data whose delays differ from those of the direct-path microphone data, caused by reflection and absorption of the source sound by different obstacles; sound-source data other than the target sound source are then removed from the second data; and environmental noise is removed last. This realizes enhancement of the sound-source audio data and improves the accuracy of voice recognition.
2. The a-th layer outputs of the 1st to n-th LSTM network models are commonly connected to the (a+1)-th layer input of the n-th LSTM network model, so that the (a+1)-th layer of the n-th LSTM network model learns from the prior knowledge provided by the 1st to (n-1)-th LSTM network models. This yields a more accurate learning direction, effectively shortens neural network training time, and improves the accuracy of the first-type audio features of clean voice extracted by the first deep learning network model.
Drawings
FIG. 1 is an overall flow chart of a voice recognition method based on audio enhancement of the present invention;
fig. 2 is a block diagram of a voice recognition apparatus based on audio enhancement according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
The voice recognition method based on audio enhancement provided by the embodiment comprises the following steps:
computing first data from the multichannel sound-source audio data picked up by a microphone array through a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing second data from the first data through a second filter function, so as to eliminate from the first data the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal contained in the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model.
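As a rough illustration of the five steps above, the following sketch wires the pipeline together with deliberately simplified stand-ins: the patent's first and second filter functions are replaced by identity placeholders, beamforming is a trivial delay-and-sum average, and the learned noise reduction is replaced by crude magnitude spectral subtraction. All function names and parameters are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def first_filter(x):
    # placeholder for the patent's first (MMSE) filter function; identity here
    return x

def second_filter(x):
    # placeholder for the second (dereverberation) filter function; identity here
    return x

def beamform(x):
    # trivial delay-and-sum: zero steering delays, equal weights
    return x.mean(axis=0)                 # (channels, samples) -> (samples,)

def denoise(y, noise_mag, n_fft=256):
    # crude magnitude spectral subtraction standing in for the learned model;
    # assumes len(y) is a multiple of n_fft
    frames = y.reshape(-1, n_fft)
    spec = np.fft.rfft(frames, axis=1)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=n_fft, axis=1).ravel()

def enhance(multichannel, noise_mag=0.1):
    first = first_filter(multichannel)    # step 1: first data
    second = second_filter(first)         # step 2: second data
    mono = beamform(second)               # step 3: single-channel audio signal
    return denoise(mono, noise_mag)       # step 4: third data (fed to the recognizer)

audio = np.random.randn(4, 1024)          # 4-mic array, 1024 samples
third_data = enhance(audio)
```

The third data would then be passed to the voice recognition model (step 5), which is outside the scope of this sketch.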
In this embodiment, the multichannel voice data picked up by the microphone array are first cleared of the multipath-reflection mixed voice data whose delays differ from those of the direct-path microphone data, caused by reflection and absorption of the source sound by different obstacles; sound-source data other than the target sound source are then removed from the second data; and environmental noise is removed last. This realizes enhancement of the sound-source audio data and improves the accuracy of voice recognition.
Wherein the first ambient noise reduction algorithm comprises:
inputting a single-channel audio signal into a first deep learning network model to obtain the audio characteristics of environmental noise in the single-channel audio signal;
and a second step of obtaining clean voice data in the single-channel audio signal based on the single-channel audio signal and the audio characteristics.
In the first step, the first deep learning network model is composed of a plurality of LSTM network models. Its input is the single-channel audio signal, and each LSTM network model outputs the first-type audio features of clean voice under a different signal-to-noise-ratio condition, based on the first-type audio features of the single-channel audio signal.
In the second step, based on the first-type audio features of clean voice under the different signal-to-noise-ratio conditions, the multiple first-type audio features are fused by mean-value calculation to obtain first-type fused audio features; the spectrum is reconstructed from the fused audio features, and the reconstructed voice data are then obtained.
The audio features of the above type may be, for example, logarithmic power spectra or power-spectrum masking features of clean and noisy speech.
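A minimal sketch of the mean-value fusion and spectrum reconstruction in the second step, assuming log-power-spectrum features and a single FFT frame. The per-SNR model outputs are faked as perturbed copies of the noisy feature, and the noisy phase is reused for reconstruction; all shapes and names are illustrative assumptions.

```python
import numpy as np

n_fft = 256
noisy = np.random.randn(n_fft)
spec = np.fft.rfft(noisy)
phase = np.angle(spec)
noisy_lps = np.log(np.abs(spec) ** 2 + 1e-12)   # log-power-spectrum feature

# stand-ins for the outputs of LSTM models trained at different SNR conditions
model_outputs = [noisy_lps - 0.1, noisy_lps, noisy_lps + 0.1]

fused_lps = np.mean(model_outputs, axis=0)       # mean-value fusion
clean_mag = np.sqrt(np.exp(fused_lps))           # back from log-power to magnitude
clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=n_fft)  # spectrum reconstruction
```

With these symmetric fake outputs, the fusion averages back to the noisy feature, so the reconstructed frame is close to the input; with real per-SNR model outputs it would approximate the clean speech instead.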
The first deep learning network model includes a plurality of LSTM network models, with the a-th layer outputs of the 1st to n-th LSTM network models commonly connected to the (a+1)-th layer input of the n-th LSTM network model.
With this connection structure, the (a+1)-th layer of the n-th LSTM network model learns from the prior knowledge provided by the 1st to (n-1)-th LSTM network models, so a more accurate learning direction can be obtained, neural network training time is effectively shortened, and the accuracy of the first-type audio features of clean voice extracted by the first deep learning network model is improved.
During training, clean voice data and noise data are mixed at different signal-to-noise ratios to serve as training data; the first-type audio features of the training data are input into an LSTM network model to train a single model, and the training of each LSTM network model is completed before the training process of the next LSTM network model begins.
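The SNR-controlled mixing used to build the training data can be sketched as follows; one model would then be trained per SNR condition. The particular SNR list and signal lengths are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # scale the noise so that 10*log10(P_clean / P_scaled_noise) == snr_db
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # 1 s of "clean speech" at 16 kHz
noise = rng.standard_normal(16000)
# one training mixture per SNR condition (SNR list is an assumption)
training_sets = {snr: mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)}

# verify the 0 dB mixture really has equal speech and noise power
mixed = training_sets[0]
residual = mixed - clean
snr_achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```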
The method for obtaining the second filter function is as follows:
forming a linear combination of the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold in the first data of all moments before the current moment, to obtain an estimate of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data at the current moment;
obtaining the coefficient matrix of the linear combination by a weighted least squares algorithm, so that the estimated second desired signal contained in the output signal has minimal time-domain correlation (the defining formulas appear as images in the source and are not reproduced), wherein d̂(n) denotes the estimate of the second desired signal;
the weight estimate λ̂(n) of the weighted least squares algorithm is computed from the power spectral density estimate of the second desired signal, where M is the number of microphones in the microphone array and ε is a constant;
and the estimate ĝ of the coefficient matrix of the linear combination is expressed through the inverse of the autocorrelation matrix of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data;
wherein x(n) is the first data at the current moment, and x̃(n) collects the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data of all moments before the current moment.
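The procedure just described, linear prediction from past frames with a weighted least squares fit whose weight is the desired signal's power spectral density, matches the standard weighted prediction error (WPE) dereverberation formulation. The patent's own formulas are rendered as images in the source and could not be recovered, so the following is only a hedged sketch of the usual textbook form; the symbols d̂(n), x̃(n), λ̂(n), ĝ, the prediction delay D, and the φ̂_d notation are assumptions, not the patent's notation:

```latex
\hat{d}(n) = x(n) - \hat{g}^{\mathsf{H}}\,\tilde{x}(n-D), \qquad
\hat{\lambda}(n) = \max\!\left( \hat{\phi}_d(n),\; \frac{\varepsilon}{M} \sum_{m=1}^{M} \lvert x_m(n) \rvert^2 \right),
```
```latex
\hat{g} = \hat{R}^{-1}\hat{p}, \qquad
\hat{R} = \sum_{n} \frac{\tilde{x}(n-D)\,\tilde{x}^{\mathsf{H}}(n-D)}{\hat{\lambda}(n)}, \qquad
\hat{p} = \sum_{n} \frac{\tilde{x}(n-D)\,x^{*}(n)}{\hat{\lambda}(n)}.
```

A recursive variant with a forgetting factor γ, 0 < γ ≤ 1 (a constant γ with this range appears in claim 3), would update R̂(n) = γ R̂(n-1) + x̃(n-D) x̃ᴴ(n-D)/λ̂(n) and track its inverse, which is consistent with the "inverting the autocorrelation matrix" step in the description.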
The power spectral density estimate of the second desired signal used in obtaining the second filter function is produced by a power spectral density estimation model based on a second deep learning network. During training, this network takes the power spectral density of the first data as input and learns the mapping from the power spectral density of the first data to that of the second desired signal, so as to output an estimate of the latter. The second deep learning network adopts an LSTM network, and the output data of each cell of the LSTM network is fed to the input of the next cell through a projection process.
Specifically, each cell selects, based on the input data x_t, the partial data f_t to discard at the forget gate; updates the cell state c_t at the input gate based on the previous cell's output h_{t-1} and the current input data x_t; and obtains the output data m_t at the output gate based on the output-gate activation o_t and the updated cell state c_t. The projection is then performed on the output data: a recurrent unit processes m_t * W1 to give r_t, and a non-recurrent unit processes m_t * W2 to give p_t; the data fed to the next cell is W3 * r_t + W4 * p_t + b, where W1, W2, W3 and W4 are weight parameters and b is a bias parameter. Through this inter-cell input-output relationship, the complexity of the model is effectively reduced and the training time is shortened.
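The cell-output projection described above can be sketched in a few lines of numpy. The dimensions, the tanh nonlinearity of the recurrent branch, and the random initialization are illustrative assumptions; the point is the wiring, with a recurrent branch through W1, a plain linear branch through W2, and a W3/W4-weighted recombination plus bias before the next cell.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, proj = 8, 4                       # cell output dim > projection dim
W1 = rng.standard_normal((hidden, proj))  # recurrent-branch projection
W2 = rng.standard_normal((hidden, proj))  # non-recurrent-branch projection
W3 = rng.standard_normal((proj, proj))
W4 = rng.standard_normal((proj, proj))
b = np.zeros(proj)                        # bias parameter

def project(m_t):
    r_t = np.tanh(m_t @ W1)   # recurrent unit (tanh is an assumed nonlinearity)
    p_t = m_t @ W2            # non-recurrent (plain linear) unit
    return r_t @ W3 + p_t @ W4 + b        # data fed to the next cell

m_t = rng.standard_normal(hidden)         # stand-in for a cell's output data
next_input = project(m_t)
```

Projecting from the hidden size down to a smaller dimension between cells is what reduces the parameter count, and hence the model complexity and training time, relative to feeding the full cell output forward.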
The embodiment also provides a voice recognition device based on audio enhancement, which comprises:
the first data generation module, used for computing first data from the multichannel sound-source audio data picked up by the microphone array through a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
the second data generation module, used for computing second data from the first data through a second filter function, so as to eliminate from the first data the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal contained in the output signal;
the single-channel audio signal generation module, used for processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
the third data generation module, used for processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise, so as to remove the environmental noise in the single-channel audio signal and obtain third data;
and the voice recognition module, used for recognizing the third data through the voice recognition model.
The second data generation module includes a second filter function unit. The second filter function forms a linear combination of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data of all moments before the current moment, to obtain an estimate of the corresponding multipath-reflection mixed data in the first data at the current moment, and obtains the coefficient matrix of the linear combination by a weighted least squares algorithm, so that the estimated second desired signal contained in the output signal has minimal time-domain correlation.
For specific limitations of the voice recognition device, reference may be made to the limitations of the voice recognition method above, which are not repeated here. The modules in the voice recognition device may be implemented wholly or partly in software, hardware, or a combination of the two; they may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can invoke them and execute the operations corresponding to the modules.
The embodiment also provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the voice recognition method based on the audio enhancement when executing the executable instructions stored in the memory.
The present embodiment also provides a computer-readable storage medium storing executable instructions that when executed by a processor implement an audio enhancement-based speech recognition method as described above.
The electronic device includes: at least one processor, memory, a user interface, and at least one network interface. The various components in the electronic device are coupled together by a bus system. It will be appreciated that a bus system is used to enable connected communications between these components. The bus system includes a power bus, a control bus, and a status signal bus in addition to the data bus. The user interface may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc. The processor of the electronic device is configured to provide computing and control capabilities, and the memory of the electronic device may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory, and the memory in this embodiment stores an operating system, a computer program, and a database, the computer program when executed by the processor implementing an audio enhancement-based speech recognition method as described above.
The present invention is not limited to the above-described specific embodiments, and various modifications may be made by those skilled in the art without inventive effort from the above-described concepts, and are within the scope of the present invention.

Claims (9)

1. A voice recognition method based on audio enhancement, comprising:
computing first data from the multichannel sound-source audio data picked up by a microphone array through a first filter function, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
computing second data from the first data through a second filter function, so as to eliminate from the first data the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal contained in the output signal;
processing the second data through a beamforming algorithm to obtain a single-channel audio signal;
processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise to remove the environmental noise in the single-channel audio signal and obtain third data;
and recognizing the third data through a voice recognition model;
the first ambient noise reduction algorithm includes: inputting the single-channel audio signal into a first deep learning network model to obtain the audio characteristics of environmental noise in the single-channel audio signal; clean speech data in the single-channel audio signal is obtained based on the single-channel audio signal and the audio features.
2. The audio enhancement based speech recognition method of claim 1, wherein the first deep learning network model comprises a plurality of LSTM network models, and wherein a-th layer outputs of the 1 st to n-th LSTM network models are commonly connected to an a+1-th layer input of the n-th LSTM network model.
3. The voice recognition method based on audio enhancement according to claim 1, wherein the method for obtaining the second filter function comprises:
forming a linear combination of the multipath-reflection mixed data whose sound-source arrival delay exceeds a first preset threshold in the first data of all moments before the current moment, to obtain an estimate of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data at the current moment;
obtaining the coefficient matrix of the linear combination by a weighted least squares algorithm, so that the estimated second desired signal contained in the output signal has minimal time-domain correlation (the claim's formulas appear as images in the source and are not reproduced), wherein d̂(n) denotes the estimate of the second desired signal;
the weight estimate λ̂(n) of the weighted least squares algorithm is computed from the power spectral density estimate of the second desired signal, where M is the number of microphones in the microphone array and ε is a constant;
and the estimate ĝ of the coefficient matrix of the linear combination is expressed, with a constant γ satisfying 0 < γ ≤ 1, through the inverse of the autocorrelation matrix of the multipath-reflection mixed data whose sound-source arrival delay exceeds the first preset threshold in the first data.
4. The voice recognition method based on audio enhancement according to claim 3, wherein the power spectral density estimate of the second desired signal is obtained by a power spectral density estimation model based on a second deep learning network, and wherein the second deep learning network takes the power spectral density of the first data as input during training and learns the mapping from the power spectral density of the first data to the power spectral density of the second desired signal, so as to output an estimate of the power spectral density of the second desired signal.
5. The voice recognition method according to claim 4, wherein the second deep learning network uses an LSTM network, and the output of each cell of the LSTM network is passed through a projection layer before being fed to the input of the next cell.
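The projected-recurrence idea of claim 5 (an LSTM cell whose output is reduced by a projection matrix before being fed back) can be sketched as a single NumPy time step; weight shapes and names below are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def lstmp_step(x, h_proj, c, W, Wp):
    """One step of an LSTM cell with output projection (a sketch of the
    'projection before the next cell' idea in the claim).

    x      : (D,)      input, e.g. one PSD frame of the first data
    h_proj : (P,)      projected hidden state fed back to the next step
    c      : (H,)      cell state
    W      : (4H, D+P) stacked gate weights (input, forget, cell, output)
    Wp     : (P, H)    projection matrix applied to the cell output
    """
    z = W @ np.concatenate([x, h_proj])
    i, f, g, o = np.split(z, 4)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_new = f * c + i * np.tanh(g)
    h = o * np.tanh(c_new)        # raw cell output (dimension H)
    h_proj_new = Wp @ h           # projection before the next step (dimension P < H)
    return h_proj_new, c_new
```

The projection shrinks the recurrent state from H to P units, cutting the recurrent weight count while keeping a large cell state, which is the usual motivation for this structure.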
6. A voice recognition apparatus based on audio enhancement, comprising:
the first data generation module is used for processing the multichannel sound source sound data picked up by the microphone array through a first filter function to obtain first data, wherein the first filter function has filter parameters that minimize the mean square error between the output signal and a first desired signal;
the second data generation module is used for processing the first data through a second filter function to eliminate the multipath reflection mixed data whose arrival delay of the sound source sound is greater than a first preset threshold in the first data, so as to obtain second data, wherein the second filter function has filter parameters that minimize the time-domain correlation of the second desired signal of the output signal;
the single-channel audio signal generation module is used for processing the second data through a beam forming algorithm to obtain a single-channel audio signal;
the third data generation module is used for processing the single-channel audio signal through a noise reduction algorithm based on the first environmental noise so as to remove the environmental noise in the single-channel audio signal and obtain third data;
the voice recognition module is used for recognizing the third data through the voice recognition model;
wherein the noise reduction algorithm based on the first environmental noise comprises: inputting the single-channel audio signal into a first deep learning network model to obtain the audio features of the environmental noise in the single-channel audio signal; and obtaining clean speech data in the single-channel audio signal based on the single-channel audio signal and the audio features.
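One common way to realize the device's last enhancement step — subtracting a network-estimated noise feature from the single-channel signal to recover clean speech — is magnitude spectral subtraction. The sketch below assumes the first deep learning network outputs a noise magnitude spectrogram; the function name, the subtraction rule, and the flooring constant are illustrative assumptions:

```python
import numpy as np

def spectral_subtract(mixed_mag, noise_mag, floor=0.05):
    """Spectral-subtraction sketch of the noise-reduction step: given the
    magnitude spectrum of the single-channel signal and a noise-magnitude
    estimate (assumed to come from the first deep learning network),
    return an estimate of the clean speech magnitude.
    """
    clean = mixed_mag - noise_mag
    # Floor the result at a fraction of the mixture to avoid negative
    # magnitudes and limit "musical noise" artifacts.
    return np.maximum(clean, floor * mixed_mag)
```

In a full pipeline this would operate on STFT magnitudes of the beamformed signal, with the mixture phase reused for resynthesis before the result is passed to the voice recognition model.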
7. The voice recognition device according to claim 6, wherein the second data generating module includes a second filter function unit that performs a linear combination based on the multipath reflection mixed data whose arrival delay of the sound source sound is greater than a first preset threshold in the first data before the current time, and obtains the coefficient matrix of the linear combination by a weighted least squares algorithm, so that the estimate minimizes the time-domain correlation of the second desired signal of the output signal.
8. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing an audio enhancement based speech recognition method according to any one of claims 1 to 5 when executing executable instructions stored in said memory.
9. A computer readable storage medium storing executable instructions which when executed by a processor implement a voice recognition method based on audio enhancement as claimed in any one of claims 1 to 5.
CN202110955519.5A 2021-08-19 2021-08-19 Voice recognition method and device based on audio enhancement Active CN113823311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110955519.5A CN113823311B (en) 2021-08-19 2021-08-19 Voice recognition method and device based on audio enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110955519.5A CN113823311B (en) 2021-08-19 2021-08-19 Voice recognition method and device based on audio enhancement

Publications (2)

Publication Number Publication Date
CN113823311A CN113823311A (en) 2021-12-21
CN113823311B true CN113823311B (en) 2023-11-21

Family

ID=78922801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110955519.5A Active CN113823311B (en) 2021-08-19 2021-08-19 Voice recognition method and device based on audio enhancement

Country Status (1)

Country Link
CN (1) CN113823311B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106024001A (en) * 2016-05-03 2016-10-12 电子科技大学 Method used for improving speech enhancement performance of microphone array
CN106531179A (en) * 2015-09-10 2017-03-22 中国科学院声学研究所 Multi-channel speech enhancement method based on semantic prior selective attention
CN108053834A (en) * 2017-12-05 2018-05-18 北京声智科技有限公司 audio data processing method, device, terminal and system
CN108109617A (en) * 2018-01-08 2018-06-01 深圳市声菲特科技技术有限公司 A kind of remote pickup method
CN113030862A (en) * 2021-03-12 2021-06-25 中国科学院声学研究所 Multi-channel speech enhancement method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017063693A1 (en) * 2015-10-14 2017-04-20 Huawei Technologies Co., Ltd. Adaptive reverberation cancellation system
JP6480644B1 (en) * 2016-03-23 2019-03-13 グーグル エルエルシー Adaptive audio enhancement for multi-channel speech recognition


Also Published As

Publication number Publication date
CN113823311A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
CN110600018B (en) Voice recognition method and device and neural network training method and device
CN110444214B (en) Speech signal processing model training method and device, electronic equipment and storage medium
US10123113B2 (en) Selective audio source enhancement
CN107393550B (en) Voice processing method and device
CN111415676B (en) Blind source separation method and system based on separation matrix initialization frequency point selection
US10679617B2 (en) Voice enhancement in audio signals through modified generalized eigenvalue beamformer
KR20180127171A (en) Apparatus and method for student-teacher transfer learning network using knowledge bridge
CN108417224B (en) Training and recognition method and system of bidirectional neural network model
CN104157293B (en) The signal processing method of targeted voice signal pickup in a kind of enhancing acoustic environment
Schmid et al. Variational Bayesian inference for multichannel dereverberation and noise reduction
CN106558315B (en) Heterogeneous microphone automatic gain calibration method and system
CN110189761B (en) Single-channel speech dereverberation method based on greedy depth dictionary learning
CN110223708B (en) Speech enhancement method based on speech processing and related equipment
CN111445919A (en) Speech enhancement method, system, electronic device, and medium incorporating AI model
CN110767223A (en) Voice keyword real-time detection method of single sound track robustness
CN112489668B (en) Dereverberation method, device, electronic equipment and storage medium
CN110660406A (en) Real-time voice noise reduction method of double-microphone mobile phone in close-range conversation scene
CN112712818A (en) Voice enhancement method, device and equipment
Cui et al. Multi-objective based multi-channel speech enhancement with BiLSTM network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN107360497B (en) Calculation method and device for estimating reverberation component
Cho et al. Convolutional maximum-likelihood distortionless response beamforming with steering vector estimation for robust speech recognition
CN114242104A (en) Method, device and equipment for voice noise reduction and storage medium
CN113823311B (en) Voice recognition method and device based on audio enhancement
CN107346658B (en) Reverberation suppression method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231026

Address after: No. 133 Shiling Road, Dongyong Town, Nansha District, Guangzhou City, Guangdong Province, 510000

Applicant after: Guangzhou Shengwei Electronics Co.,Ltd.

Address before: 230041 room 1104, building 1, Binhu Century City Guanhu garden, intersection of Luzhou Avenue and Ziyun Road, Binhu District, Baohe District, Hefei City, Anhui Province

Applicant before: Anhui chuangbian Information Technology Co.,Ltd.

GR01 Patent grant