CN113744752A - Voice processing method and device - Google Patents
Voice processing method and device
- Publication number
- CN113744752A (application CN202111003630.0A)
- Authority
- CN
- China
- Prior art keywords
- processed
- audio signal
- signal
- estimation
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The present disclosure provides a voice processing method and apparatus, relating to the technical field of voice. The method comprises: obtaining at least two audio signals to be processed, the at least two audio signals to be processed comprising audio signals acquired by a microphone array; performing direction-of-arrival estimation on any two microphones in the microphone array; carrying out beam forming processing on the audio signals to be processed according to the direction-of-arrival estimation and a beam forming algorithm; carrying out noise suppression on the audio signals to be processed after the beam forming processing to obtain a target audio signal; and outputting the target audio signal. Audio pickup and enhancement are thereby realized, and the accuracy of audio recognition is improved.
Description
Technical Field
The present disclosure relates to the field of speech technologies, and in particular, to a speech processing method and apparatus.
Background
With the continuous development of artificial intelligence technology, traditional equipment in various fields is gradually being replaced by corresponding intelligent terminals. An intelligent terminal is a fully open platform with functions of monitoring, sensing, communication and intelligent interaction: it carries an operating system, can install and uninstall various application software, and continuously expands and upgrades its functions. In the aspect of intelligent interaction, many complex operations cannot be accomplished with the remote control and touch screen in common use today; voice control is better suited to them, and the key to voice control is the acquisition and recognition of the speech signal.
In the related art, when a speech signal is acquired, the speech signal is usually filtered and output directly.
However, in the above-described technique, if the acquired speech signal includes speech from multiple directions, filtering alone leaves a large amount of noise in the finally obtained speech signal, which reduces the accuracy of speech recognition.
Disclosure of Invention
The embodiment of the disclosure provides a voice processing method and device, which can solve the problem that the accuracy of voice recognition is reduced in the prior art. The technical scheme is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a speech processing method, the method including:
acquiring at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array;
estimating the direction of arrival of any two microphones in the microphone array;
carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm;
carrying out noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal;
and outputting the target audio signal.
The embodiment of the disclosure provides a voice processing method. When a plurality of audio signals to be processed are acquired, direction-of-arrival estimation is performed on any two microphones in the microphone array; beamforming processing is performed on the audio signals to be processed according to the direction-of-arrival estimation and a beamforming algorithm; noise suppression is performed on the audio signals to be processed after the beamforming processing; and finally the target audio signal obtained after noise suppression is output. In this way, direction-of-arrival estimation is carried out on every two audio signals to be processed, and noise suppression is applied to the audio signals to be processed after beam forming, so that audio pickup and enhancement are realized and the accuracy of audio recognition is improved.
In one embodiment, before the estimating the direction of arrival of any two microphones in the microphone array, the method further includes:
performing voice activity detection and noise estimation on each audio signal to be processed, and determining the existence probability of the audio signal according to the results of the voice activity detection and the noise estimation;
the estimating direction of arrival of any two microphones of the microphone array comprises:
and estimating the direction of arrival of any two microphones in the microphone array according to the existence probability of the audio signal.
In one embodiment, the estimating the direction of arrival of any two microphones of the microphone array according to the audio signal existence probability comprises:
and calculating the time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal, and calculating the relative angle between a target sound source and the microphone array according to the time delay estimation result.
In one embodiment, the performing voice activity detection and noise estimation on each of the audio signals to be processed includes:
determining whether there is a synchronous input signal;
when the synchronous input signal is determined, performing echo cancellation processing on each audio signal to be processed;
performing voice activity detection and noise estimation on each audio signal to be processed after echo cancellation processing;
and when the synchronous input signal is determined not to exist, carrying out voice activity detection and noise estimation on each audio signal to be processed.
In one embodiment, the obtaining at least two audio signals to be processed comprises:
acquiring at least two original audio signals; the original audio signal is a signal output by an audio input module;
and carrying out short-time Fourier transform on each original audio signal to obtain the audio signal to be processed.
In one embodiment, the performing echo cancellation processing on each of the audio signals to be processed includes:
according to the formulas $y(t,m)=\sum_{l=0}^{L-1} h_l\,s(t-l)$, $e(t,m)=x(t,m)-\hat h^{T}(t,m)\,\mathbf{s}(k,m)$ and $\hat h(t+1,m)=\hat h(t,m)+\mu\,e(t,m)\,\mathbf{s}(k,m)\,/\,\big(\mathbf{s}^{T}(k,m)\,\mathbf{s}(k,m)\big)$, performing echo cancellation processing on each audio signal to be processed;
wherein y(t,m) represents the synchronous input signal collected by the mth microphone at time t; s(t−l) represents the synchronous input signal at time t−l; $h_l$ represents the channel between the synchronous input signal and each microphone, l being the index in the accumulation operator and L the time length; $h(t,m)=[h_0\ h_1\ \dots\ h_{L-1}]$ represents the channel between the synchronous input signal and the mth microphone at time t; $\hat h(t+1,m)$ represents the channel estimate of the synchronous input signal acquired by the mth microphone at time t+1; $\hat h(t,m)$ represents the channel estimate at time t; e(t,m) represents the error signal; μ represents the step-size (smoothing) factor; $\hat h^{T}(t,m)\,\mathbf{s}(k,m)$ denotes the echo estimate of the mth microphone at time t; x(t,m) denotes the near-end signal of the mth microphone at time t; $\mathbf{s}(k,m)=[s(k,m)\ s(k-1,m)\ \dots\ s(k-L+1,m)]$ represents the synchronous input signal vector; and $\mathbf{s}^{T}(k,m)$ represents the transpose of $\mathbf{s}(k,m)$.
In one embodiment, the performing voice activity detection and noise estimation on each of the audio signals to be processed, and determining the existence probability of the audio signal according to the results of the voice activity detection and noise estimation includes:
according to the formula $V(k,t)=\alpha\,V(k,t-1)+(1-\alpha)\,|X(k,t)|$, with $\alpha=\alpha_s$ when speech is present and $\alpha=\alpha_n$ when it is absent, obtaining a noise spectrum estimate;
according to the formula $Y(k,t)=\beta\,Y(k,t-1)+(1-\beta)\,|X(k,t)|$, with $\beta=\beta_s$ when speech is present and $\beta=\beta_n$ when it is absent, obtaining a signal spectrum estimate, and determining $\mathrm{SNR}(k,t)=Y(k,t)/V(k,t)$ and the presence decision $P(k,t)=1$ if $\mathrm{SNR}(k,t)>TH_{SNR}$, otherwise $P(k,t)=0$;
wherein $\alpha_s$ represents the smoothing factor of the noise estimation when speech is present and $\alpha_n$ when speech is absent; V(k,t−1) and V(k,t) represent the noise spectrum estimates of the kth frequency bin at times t−1 and t; X(k,t) represents the short-time Fourier transform of the kth frequency bin at time t; $\beta_s$ represents the smoothing factor of the signal estimation when speech is present and $\beta_n$ when speech is absent; Y(k,t−1) and Y(k,t) represent the signal spectrum estimates of the kth frequency bin at times t−1 and t; SNR(k,t) represents the estimated signal-to-noise ratio; P(k,t) represents the speech presence probability of the kth frequency bin at time t; and $TH_{SNR}$ represents a signal-to-noise-ratio threshold.
In one embodiment, the calculating the time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal includes:
according to the formula $\tau=\arg\max_m \Psi(m)$, with $\Psi(m)=\sum_k \varphi(k)\,X(k,1)X^{*}(k,2)\,e^{j2\pi km/K}$ and $\varphi(k)=1/\,|E\{X(k,1)X^{*}(k,2)\}|$, calculating the time delay estimation of any two microphones in the microphone array;
according to the formula $\theta=\arccos\big(c\,\tau/d\big)$, calculating the relative angle between a target sound source and the microphone array;
where τ represents the estimated time delay between two audio signals to be processed, Ψ(m) represents the generalized cross-correlation of the two audio signals to be processed, φ(k) represents the weight, $E\{X(k,1)X^{*}(k,2)\}$ represents the expectation of the signal energy, θ represents the direction of arrival, c represents the speed of sound in air, and d represents the distance between the two microphones corresponding to the two audio signals to be processed.
In one embodiment, the beamforming the audio signal to be processed according to the direction of arrival estimation and beamforming algorithm includes:
according to the formula $h_{BF}=\arg\min_{h}\, h^{T}R\,h\ \ \text{subject to}\ \ d^{T}(\theta)\,h=1$, whose solution is $h_{BF}=R^{-1}d(\theta)\,/\,\big(d^{T}(\theta)\,R^{-1}\,d(\theta)\big)$;
wherein $R=E\{X(t)X^{T}(t)\}$ represents the covariance matrix of the signal,
$d(\theta)=[1\ \ e^{-j\omega\delta\cos\theta/c}\ \dots\ e^{-j(M-1)\omega\delta\cos\theta/c}]^{T}$ represents the steering vector,
$h_{BF}^{T}$ represents the transposed matrix of $h_{BF}$, "subject to" denotes the constraint that $d^{T}(\theta)\,h_{BF}$ is equal to 1, $d^{T}(\theta)$ denotes the transposed matrix of d(θ), X(t) denotes the short-time Fourier transform at time t, and $X^{T}(t)$ denotes the transposed matrix of X(t).
In an embodiment, the performing noise suppression on the to-be-processed audio signal after the beamforming processing to obtain the target audio signal includes:
according to the formula $h_{NR}(k)=\sqrt{\max\big(|X(k,t)|^{2}-V(k,t)^{2},\,0\big)\,/\,|X(k,t)|^{2}}$ and the formula $S(k,t)=h_{NR}(k)\,X(k,t)$, obtaining the target audio signal;
wherein S(k,t) represents the audio signal to be processed after noise reduction processing, $h_{NR}(k)$ denotes the noise reduction filter, and X(k,t) denotes the short-time-Fourier-transformed audio signal to be processed.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, the apparatus including:
the acquisition module is used for acquiring at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array;
the first processing module is used for estimating the direction of arrival of any two microphones in the microphone array;
the second processing module is used for carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm;
the third processing module is used for carrying out noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal;
and the output module is used for outputting the target audio signal.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a speech processing method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of speech processing provided by an embodiment of the present disclosure;
FIG. 3a is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 3b is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 3c is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 3d is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
fig. 3e is a structural diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 3f is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 3g is a block diagram of a speech processing apparatus according to an embodiment of the disclosure;
FIG. 3h is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
fig. 3i is a structural diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 3j is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a block diagram of a speech processing device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
An embodiment of the present disclosure provides a speech processing method, as shown in fig. 1, the method includes the following steps:
Step 101, acquiring at least two audio signals to be processed.
The audio signals to be processed are all signals output by the audio input module, and the at least two audio signals to be processed comprise audio signals obtained by the microphone array.
And 102, estimating the direction of arrival of any two microphones in the microphone array.
And 103, carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm.
And step 104, performing noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal.
And 105, outputting the target audio signal.
The embodiment of the disclosure provides a voice processing method. When a plurality of audio signals to be processed are acquired, direction-of-arrival estimation is performed on any two microphones in the microphone array; beamforming processing is performed on the audio signals to be processed according to the direction-of-arrival estimation and a beamforming algorithm; noise suppression is performed on the audio signals to be processed after the beamforming processing; and finally the target audio signal obtained after noise suppression is output. In this way, direction-of-arrival estimation is carried out on every two audio signals to be processed, and noise suppression is applied to the audio signals to be processed after beam forming, so that audio pickup and enhancement are realized and the accuracy of audio recognition is improved.
An embodiment of the present disclosure provides a speech processing method, as shown in fig. 2, the method includes the following steps:
Step 201, acquiring at least two original audio signals.
The original audio signal is a signal output by an audio input module, and the original audio signal comprises an audio signal output by a microphone array and/or an audio signal output by an intelligent microphone.
Illustratively, a multi-channel original audio signal is obtained from an audio input module at a fixed period, and the source of the original audio signal may be a microphone array or other smart microphones.
It should be noted that the audio input module may include a sound collection module and at least one input channel, for example, the audio input module includes 16 input channels; the sound collection module may include analog-to-digital conversion devices, a microphone array, a smart microphone, and the like, for example, the sound collection module includes 8 analog microphone inputs and 2 analog-to-digital conversion devices; the input sources of the overall audio signal may include: microphone arrays, third party analog or digital audio streams, other smart microphones.
Step 202, carrying out short-time Fourier transform on each original audio signal to obtain the audio signals to be processed.
Optionally, according to the formula $X(k,t,m)=\sum_{n=0}^{N-1} w(n)\,x(n+t,m)\,e^{-j\omega_k n}$, short-time Fourier transform is carried out on each original audio signal to obtain the audio signal to be processed.
Wherein X(k,t,m) represents the short-time Fourier transform of the kth frequency bin of the mth channel at time t, namely the audio signal to be processed; N represents the length of the time window; w(n) represents the value of the window function at the nth point; x(n+t,m) represents the audio signal of the mth channel at time n+t; N is an integer greater than or equal to 1; $\omega_k=2\pi k/K$ denotes the angular frequency; K denotes the length of the short-time Fourier transform; and e is the base of the natural exponential.
Illustratively, the acquired multi-channel original audio signal is converted from the time domain to the frequency domain by a short-time fourier transform.
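As a minimal sketch of this time-to-frequency conversion (the Hann window and the frame/FFT sizes are illustrative assumptions, not taken from the patent), the per-frame transform $X(k,t)=\sum_n w(n)\,x(n+t)\,e^{-j2\pi kn/K}$ can be written as:

```python
import numpy as np

def stft_frame(x, t, N=256, K=256):
    """One frame of the short-time Fourier transform starting at sample t:
    X(k, t) = sum_n w(n) * x(n + t) * exp(-j * 2*pi*k*n / K).
    A Hann window is used here as an assumed choice of w(n)."""
    w = np.hanning(N)                  # analysis window w(n)
    frame = x[t:t + N] * w             # windowed time-domain segment
    return np.fft.fft(frame, n=K)      # K-point DFT over bins k = 0..K-1

# toy check: a pure tone concentrates its energy at the matching DFT bin
fs, f0, N = 8000, 1000, 256
x = np.sin(2 * np.pi * f0 * np.arange(fs) / fs)
X = stft_frame(x, 0, N=N, K=N)
peak_bin = int(np.argmax(np.abs(X[:N // 2])))
print(peak_bin)  # f0 / fs * N = 32
```

The same transform would be applied per channel to produce X(k,t,m); the inverse transform in step 209 undoes it frame by frame.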
Step 203, determining whether there is a synchronous input signal.
For example, the synchronous input signal generally refers to a third-party analog or digital audio stream, mainly carried by a sound source played in the current environment, for example sound played by a loudspeaker box or a television. Detection of the synchronous input signal is a necessary condition for performing echo cancellation, so whether a synchronous input signal exists directly determines whether echo cancellation is performed. Specifically, the detection is usually done by energy detection: the signal energy of the synchronous input channel is calculated, and when the signal energy is greater than or equal to a set threshold, it is determined that a synchronous input signal exists and echo cancellation is required; when the signal energy is less than the set threshold, it is determined that no synchronous input signal exists and echo cancellation is not needed.
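The energy-based decision described above can be sketched as follows (the frame length and threshold value are illustrative assumptions):

```python
def has_sync_input(sync_frame, threshold=1e-4):
    """Decide whether a synchronous input signal is present by comparing the
    mean power of the reference channel against a set threshold."""
    energy = sum(s * s for s in sync_frame) / len(sync_frame)
    return energy >= threshold

# silence carries no sync signal; a played-back signal exceeds the threshold
print(has_sync_input([0.0] * 160))       # False -> skip echo cancellation
print(has_sync_input([0.1, -0.1] * 80))  # True  -> run echo cancellation
```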
And 204, when the synchronous input signal is determined, performing echo cancellation processing on each audio signal to be processed.
The echo cancellation means that artificially played sound, i.e., a synchronization signal, is removed from the acquired audio signal to be processed, and other sound is retained to the maximum extent.
Optionally, the channel estimation is performed according to the normalized least mean square (NLMS) method, i.e. according to the formulas $y(t,m)=\sum_{l=0}^{L-1} h_l\,s(t-l)$, $e(t,m)=x(t,m)-\hat h^{T}(t,m)\,\mathbf{s}(k,m)$ and $\hat h(t+1,m)=\hat h(t,m)+\mu\,e(t,m)\,\mathbf{s}(k,m)\,/\,\big(\mathbf{s}^{T}(k,m)\,\mathbf{s}(k,m)\big)$.
Wherein y(t,m) represents the synchronous input signal collected by the mth microphone at time t; s(t−l) represents the synchronous input signal at time t−l; $h_l$ represents the channel between the synchronous input signal and each microphone, l being the index in the accumulation operator and L the time length; $h(t,m)=[h_0\ h_1\ \dots\ h_{L-1}]$ represents the channel between the synchronous input signal and the mth microphone at time t; $\hat h(t+1,m)$ represents the channel estimate of the synchronous input signal acquired by the mth microphone at time t+1; $\hat h(t,m)$ represents the channel estimate at time t; e(t,m) represents the error signal; μ represents the step-size (smoothing) factor; $\hat h^{T}(t,m)\,\mathbf{s}(k,m)$ denotes the echo estimate of the mth microphone at time t; x(t,m) denotes the near-end signal of the mth microphone at time t; $\mathbf{s}(k,m)=[s(k,m)\ s(k-1,m)\ \dots\ s(k-L+1,m)]$ represents the synchronous input signal vector; and $\mathbf{s}^{T}(k,m)$ represents the transpose of $\mathbf{s}(k,m)$.
It should be noted that echo cancellation can also be performed by other methods in the prior art, which is not limited by the present disclosure.
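As a sketch of the NLMS update above (the filter length, step size and regularisation constant are assumptions), the echo-cancelled output is the error signal of an adaptive filter driven by the synchronous reference:

```python
import numpy as np

def nlms_echo_cancel(x, s, L=8, mu=0.5, eps=1e-8):
    """NLMS echo canceller: x is the near-end microphone signal, s the
    synchronous reference. Returns the error signal e(t) = x(t) - h^T s(t),
    with h updated as h <- h + mu * e * s / (s^T s)."""
    h = np.zeros(L)                               # channel estimate h_hat
    e = np.zeros(len(x))
    s_pad = np.concatenate([np.zeros(L - 1), s])  # history for t < L-1
    for t in range(len(x)):
        sv = s_pad[t:t + L][::-1]                 # [s(t), s(t-1), ..., s(t-L+1)]
        e[t] = x[t] - h @ sv                      # error / echo-cancelled output
        h += mu * e[t] * sv / (sv @ sv + eps)     # normalised LMS update
    return e

# toy check: a pure echo through a short channel is largely removed
rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
x = np.convolve(s, [0.5, -0.3, 0.2])[:4000]       # echo only, no near-end speech
e = nlms_echo_cancel(x, s)
residual = float(np.mean(e[-500:] ** 2) / np.mean(x ** 2))
print(residual < 1e-3)
```

With no near-end speech present, the converged error signal is close to zero; when near-end speech is present it survives in e(t), which is the desired behaviour.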
Step 205, performing voice activity detection and noise estimation on each audio signal to be processed, and determining the audio signal existence probability according to the results.
Specifically, in a real acoustic scene a speech signal is not always present: speech segments and noise segments mostly alternate, and noise-only segments may even dominate, so voice activity detection is required. Voice activity detection is realized by detecting the energy or amplitude of the real-time audio stream and, on that basis, tracking the changes of speech and noise in the audio stream. To obtain a better noise reduction effect, noise estimation is also necessary; it tracks the acoustic spectrum in real time by following changes of features such as signal-to-noise ratio and amplitude in the audio stream signal. The most typical way is to estimate the signal-to-noise ratio of the audio in real time by tracking the spectra of the speech and the noise, and then update the speech and noise spectra according to the estimated signal-to-noise ratio.
Optionally, when it is determined that there is a synchronous input signal, voice activity detection and noise estimation are performed on each audio signal to be processed after echo cancellation processing, so as to obtain an audio signal existence probability.
Optionally, when it is determined that no synchronous input signal exists, voice activity detection and noise estimation are directly performed on each audio signal to be processed, so as to obtain the existence probability of the audio signal.
Illustratively, according to the formula $V(k,t)=\alpha\,V(k,t-1)+(1-\alpha)\,|X(k,t)|$, with $\alpha=\alpha_s$ when speech is present and $\alpha=\alpha_n$ when it is absent, the noise spectrum estimate is updated.
Wherein $\alpha_s$ represents the smoothing factor of the noise estimation when speech is present, $\alpha_n$ the smoothing factor when speech is absent, V(k,t−1) the noise spectrum estimate of the kth frequency bin at time t−1, V(k,t) the noise spectrum estimate of the kth frequency bin at time t, and X(k,t) the short-time Fourier transform of the kth frequency bin at time t.
According to the formula $Y(k,t)=\beta\,Y(k,t-1)+(1-\beta)\,|X(k,t)|$, with $\beta=\beta_s$ when speech is present and $\beta=\beta_n$ when it is absent, the signal spectrum estimate is updated.
Wherein $\beta_s$ represents the smoothing factor of the signal estimation when speech is present, $\beta_n$ the smoothing factor when speech is absent, Y(k,t−1) the signal spectrum estimate of the kth frequency bin at time t−1, and Y(k,t) the signal spectrum estimate of the kth frequency bin at time t.
The signal-to-noise ratio is then estimated as $\mathrm{SNR}(k,t)=Y(k,t)/V(k,t)$, where SNR(k,t) represents the estimated signal-to-noise ratio.
Finally, $P(k,t)=1$ if $\mathrm{SNR}(k,t)>TH_{SNR}$ and $P(k,t)=0$ otherwise, wherein P(k,t) represents the audio signal existence probability of the kth frequency bin at time t and $TH_{SNR}$ represents the signal-to-noise-ratio threshold.
And step 206, estimating the direction of arrival of any two microphones in the microphone array according to the existence probability of the audio signal.
The direction of arrival is the relative angle between the target sound source and the microphone array, and the estimation of the direction of arrival is divided into two steps: and calculating the time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal, and then calculating the relative angle between the target sound source and the microphone array according to the time delay estimation result.
Illustratively, according to the formula $\tau=\arg\max_m \Psi(m)$, with $\Psi(m)=\sum_k \varphi(k)\,X(k,1)X^{*}(k,2)\,e^{j2\pi km/K}$, the time delay estimate of any two microphones in the microphone array is calculated.
According to the formula $\theta=\arccos\big(c\,\tau/d\big)$, the relative angle of the target sound source to the microphone array is calculated.
Where τ represents the estimated time delay between two audio signals to be processed; Ψ(m) represents the generalized cross-correlation of the two audio signals to be processed; $\varphi(k)=1/\,|E\{X(k,1)X^{*}(k,2)\}|$ represents the weight; $E\{X(k,1)X^{*}(k,2)\}$ denotes the expectation of the signal energy; θ denotes the direction of arrival; c denotes the speed of sound in air; and d denotes the distance between the two microphones corresponding to the two audio signals to be processed.
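A sketch of the two-step estimate — PHAT-weighted generalized cross-correlation for the delay τ, then θ = arccos(cτ/d) — under an assumed sampling rate and microphone spacing:

```python
import numpy as np

def gcc_phat_doa(x1, x2, fs, d, c=343.0):
    """Delay estimate via PHAT-weighted generalized cross-correlation,
    then the arrival angle theta = arccos(c * tau / d)."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    psi = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # phi(k) = 1/|cross|
    max_shift = int(fs * d / c) + 1                          # physically possible lags
    cc = np.concatenate([psi[-max_shift:], psi[:max_shift + 1]])
    tau = (int(np.argmax(np.abs(cc))) - max_shift) / fs      # delay in seconds
    theta = np.degrees(np.arccos(np.clip(tau * c / d, -1.0, 1.0)))
    return tau, theta

# toy check: x2 lags x1 by 2 samples, so the delay magnitude is 2 / fs
fs, d = 16000, 0.1
sig = np.random.default_rng(1).standard_normal(2048)
x1 = sig
x2 = np.concatenate([np.zeros(2), sig[:-2]])
tau, theta = gcc_phat_doa(x1, x2, fs, d)
print(abs(round(tau * fs)))  # 2
```

The sign convention of τ (and hence whether θ is measured from one end-fire direction or the other) depends on which microphone is taken as the reference.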
And step 207, performing beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm.
Specifically, when the direction of arrival is determined, the spatial information of the signal can be utilized to the maximum extent by the beam forming algorithm, and noise and reverberation from directions other than the sound source direction can be eliminated. The beamforming is to perform phase compensation on each microphone in different frequency bands, so as to achieve the effects of enhancing a target signal and suppressing noise and interference. Specifically, spatial filters are respectively designed on different frequency bands, and each audio signal to be processed is spatially filtered.
For example, the beamforming coefficients may be designed according to a distortionless minimum-variance criterion: the overall output energy is minimized while the signal from the direction of arrival is kept unchanged. That is, according to the formula $h_{BF}=\arg\min_{h}\, h^{T}R\,h\ \ \text{subject to}\ \ d^{T}(\theta)\,h=1$, whose closed-form solution is $h_{BF}=R^{-1}d(\theta)\,/\,\big(d^{T}(\theta)\,R^{-1}\,d(\theta)\big)$, beam forming processing is carried out on the audio signal to be processed.
Wherein $R=E\{X(t)X^{T}(t)\}$ represents the covariance matrix of the signal,
$d(\theta)=[1\ \ e^{-j\omega\delta\cos\theta/c}\ \dots\ e^{-j(M-1)\omega\delta\cos\theta/c}]^{T}$ represents the steering vector,
$h_{BF}^{T}$ represents the transposed matrix of $h_{BF}$, "subject to" denotes the constraint that $d^{T}(\theta)\,h_{BF}$ is equal to 1, $d^{T}(\theta)$ denotes the transposed matrix of d(θ), X(t) denotes the short-time Fourier transform at time t, and $X^{T}(t)$ denotes the transposed matrix of X(t).
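The closed form of this constrained minimisation, h = R⁻¹d / (dᴴR⁻¹d), can be sketched per frequency bin as follows (the conjugate transpose is used for complex data, an implementation choice; the steering vector is illustrative):

```python
import numpy as np

def mvdr_weights(R, d_theta):
    """Distortionless minimum-variance weights: minimise h^H R h
    subject to d^H h = 1, giving h = R^-1 d / (d^H R^-1 d)."""
    rinv_d = np.linalg.solve(R, d_theta)      # R^-1 d without forming the inverse
    return rinv_d / (d_theta.conj() @ rinv_d)

# toy check: the distortionless constraint d^H h = 1 must hold exactly
M = 4
d_theta = np.exp(-1j * np.pi * np.arange(M) * 0.3)  # example steering vector
h = mvdr_weights(np.eye(M), d_theta)                # R = I: isotropic noise
constraint = d_theta.conj() @ h
print(abs(constraint - 1) < 1e-12)
```

With R = I the weights reduce to a simple delay-and-sum beamformer d/‖d‖²; a measured covariance matrix instead steers nulls toward interferers.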
And step 208, performing noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal.
Specifically, noise suppression is necessary because noise is ubiquitous in the real environment. Here it is achieved by frequency-domain filtering, where the filter is found by minimizing the difference between the clean signal and the estimated signal. Noise reduction is usually performed by spectral subtraction: for each frequency bin, a ratio of the clean signal to the observed signal is computed from the energy of the current signal and the energy of the noise estimate, and that ratio is then applied as a frequency-domain gain.
According to the formula $h_{NR}(k)=\sqrt{\max\big(|X(k,t)|^{2}-V(k,t)^{2},\,0\big)\,/\,|X(k,t)|^{2}}$ and the formula $S(k,t)=h_{NR}(k)\,X(k,t)$, the target audio signal is obtained.
Wherein S(k,t) represents the audio signal to be processed after noise reduction processing, $h_{NR}(k)$ denotes the noise reduction filter, and X(k,t) denotes the short-time-Fourier-transformed audio signal to be processed.
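A per-bin sketch of the spectral-subtraction gain described above (the flooring constant is an assumption, commonly used to avoid negative energies and musical noise):

```python
def noise_reduction_gain(x_mag, v_mag, floor=0.05):
    """Spectral-subtraction gain h_NR(k): estimated clean energy over observed
    energy, clipped below by a small floor."""
    if x_mag <= 0.0:
        return floor
    clean_energy = max(x_mag * x_mag - v_mag * v_mag, 0.0)  # |X|^2 - V^2, >= 0
    return max((clean_energy / (x_mag * x_mag)) ** 0.5, floor)

# S(k,t) = h_NR(k) * X(k,t); e.g. observed magnitude 1.0, noise estimate 0.6
g = noise_reduction_gain(1.0, 0.6)
print(round(g, 3))  # sqrt(1 - 0.36) = 0.8
```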
And 209, performing inverse short-time Fourier transform on each target audio signal and outputting the target audio signal.
For example, after the target audio signal is determined, the target audio signal is converted from the frequency domain to the time domain again by using short-time inverse fourier transform to obtain a finally output digital audio stream, the output digital audio stream may be output through an audio output module, and the audio output module may be a headphone interface, a USB sound card, or other smart microphones.
The embodiment of the disclosure provides a voice processing method. When a plurality of audio signals to be processed are obtained, whether a synchronous input signal exists is detected, and when it exists, echo cancellation processing is performed on each audio signal to be processed; then, voice activity detection and noise estimation are carried out on each audio signal to be processed after echo cancellation processing, and the audio signal existence probability is obtained; the direction-of-arrival estimation between every two audio signals to be processed is determined according to the audio signal existence probability, noise reduction processing is performed on each audio signal to be processed according to the direction-of-arrival estimation, and finally the target audio signal after noise reduction processing is output. In this way, the method not only performs synchronous-input detection on the received audio signals to be processed, but also performs voice activity detection and noise estimation; it then performs direction-of-arrival estimation on every two audio signals to be processed according to the audio signal existence probability obtained from the voice activity detection and noise estimation, and performs noise reduction processing on all the audio signals to be processed according to that estimation. This further reduces various noises in the target audio signal, realizes audio pickup and enhancement, and further improves the accuracy of audio recognition. In addition, the method can simultaneously acquire and process the audio signals to be processed output by a plurality of intelligent microphones, realizing joint processing of multiple intelligent microphones, so that complex scenes that are difficult to process can be handled and the adaptability is high. A system designed by this method has a comprehensive middle- and far-field voice enhancement effect, can be applied to all scenes with middle- and far-field voice enhancement requirements, and has extremely high universality.
Based on the speech processing method described in the above embodiments, the following are apparatus embodiments of the present disclosure, which can be used to execute the method embodiments of the present disclosure.
The embodiment of the present disclosure provides a voice processing apparatus, as shown in fig. 3a, the voice processing apparatus 30 includes: an acquisition module 301, a first processing module 302, a second processing module 303, a third processing module 304 and an output module 305.
The acquiring module 301 is configured to acquire at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array.
A first processing module 302, configured to perform direction-of-arrival estimation on any two microphones in the microphone array.
The second processing module 303 is configured to perform beamforming processing on the audio signal to be processed according to the direction of arrival estimation and a beamforming algorithm.
The third processing module 304 is configured to perform noise suppression on the audio signal to be processed after the beamforming processing, so as to obtain a target audio signal.
An output module 305, configured to output the target audio signal.
In one embodiment, as shown in fig. 3b, the apparatus further comprises a determination module 306, and the first processing module 302 comprises a first processing sub-module 3021.
The determining module 306 is configured to perform voice activity detection and noise estimation on each to-be-processed audio signal, and determine an existence probability of the audio signal according to results of the voice activity detection and the noise estimation.
The first processing sub-module 3021 is configured to perform direction-of-arrival estimation on any two microphones in the microphone array according to the audio signal existence probability.
In one embodiment, as shown in fig. 3c, the first processing submodule 3021 comprises a calculation unit 30211.
The calculating unit 30211 is configured to calculate time delay estimates of any two microphones in the microphone array according to the existence probability of the audio signal, and calculate a relative angle between a target sound source and the microphone array according to a result of the time delay estimates.
In one embodiment, as shown in FIG. 3d, the determination module 306 includes a first determination submodule 3061, a second processing submodule 3062, a third processing submodule 3063, and a fourth processing submodule 3064.
The first determining submodule 3061 is configured to determine whether a synchronous input signal exists.
The second processing submodule 3062 is configured to perform echo cancellation processing on each to-be-processed audio signal when it is determined that the synchronization input signal exists.
The third processing submodule 3063 is configured to perform voice activity detection and noise estimation on each of the to-be-processed audio signals after the echo cancellation processing.
The fourth processing submodule 3064 is configured to, when it is determined that the synchronization input signal is not present, perform voice activity detection and noise estimation on each of the audio signals to be processed.
In one embodiment, as shown in fig. 3e, the obtaining module 301 includes an obtaining sub-module 3011 and a transform sub-module 3012.
The obtaining sub-module 3011 is configured to obtain at least two original audio signals; the original audio signal is a signal output by the audio input module.
The transform submodule 3012 is configured to perform short-time fourier transform on each original audio signal to obtain the to-be-processed audio signal.
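The acquisition-and-transform step above can be sketched for a multi-microphone input; `scipy.signal.stft`, the sampling rate, and the frame length are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

def to_processed_signals(raw_channels, fs=16000, nperseg=512):
    """Short-time Fourier transform each raw channel output by the audio
    input module, yielding the frequency-domain audio signals to be
    processed, stacked as (channels, freq_bins, frames)."""
    specs = [stft(ch, fs=fs, nperseg=nperseg)[2] for ch in raw_channels]
    return np.stack(specs)

rng = np.random.default_rng(0)
raw = rng.standard_normal((2, 16000))   # two microphones, 1 s at 16 kHz
X = to_processed_signals(raw)
print(X.shape[0], X.shape[1])           # 2 channels, 512 // 2 + 1 = 257 bins
```

Downstream modules (voice activity detection, DOA estimation, beamforming) would then operate on this complex-valued per-channel spectrum.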
In one embodiment, as shown in FIG. 3f, the second processing submodule 3062 includes a processing unit 30621.
The processing unit 30621 is configured to perform echo cancellation processing according to the formulas y(t, m) = Σ_{l=0}^{L−1} h_l · s(t − l), ĥ(t + 1, m) = ĥ(t, m) + μ · e(t, m) · s(k, m) / (s^T(k, m) · s(k, m)), ŷ(t, m) = ĥ^T(t, m) · s(k, m), and e(t, m) = x(t, m) − ŷ(t, m); wherein y(t, m) represents the synchronous input signal collected by the mth microphone at time t, s(t − l) represents the synchronous input signal at time t − l, h_l represents the channel between the synchronous input signal and each microphone, l is the summation index of the accumulation operator, L represents the time length, and h(t, m) = [h_0 h_1 ... h_{L−1}] represents the channel between the synchronous input signal and the mth microphone at time t; ĥ(t + 1, m) represents the channel estimate of the synchronous input signal acquired by the mth microphone at time t + 1, ĥ(t, m) represents the channel estimate of the synchronous input signal acquired by the mth microphone at time t, e(t, m) represents the error signal, μ represents the smoothing factor, ŷ(t, m) represents the echo estimate of the mth microphone at time t, x(t, m) represents the near-end signal of the mth microphone at time t, s(k, m) = [s(k, m) s(k − 1, m) … s(k − L + 1, m)] represents the vector of synchronous input signals, and s^T(k, m) represents the transpose of s(k, m).
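The channel-estimate update described by the variable definitions above is a normalized-LMS adaptive filter. A minimal single-channel sketch follows; the filter length, step size, and the pure-echo test scenario are illustrative assumptions, not values from the patent:

```python
import numpy as np

def nlms_echo_cancel(far, near, L=8, mu=0.5, eps=1e-8):
    """Normalized-LMS acoustic echo canceller for one microphone channel.
    far  : synchronous input (loudspeaker) signal s(t)
    near : microphone signal x(t) containing the echo
    Returns the error signal e(t) = x(t) minus the echo estimate."""
    h = np.zeros(L)                                     # channel estimate
    e = np.zeros(len(near))
    for t in range(L - 1, len(near)):
        s_vec = far[t - L + 1:t + 1][::-1]              # [s(t) s(t-1) ... s(t-L+1)]
        e[t] = near[t] - h @ s_vec                      # error = near end - echo estimate
        h += mu * e[t] * s_vec / (s_vec @ s_vec + eps)  # normalized update
    return e

# Pure-echo scenario: the near end is only a filtered copy of the far end,
# so a converged canceller drives the error toward zero.
rng = np.random.default_rng(0)
far = rng.standard_normal(8000)
near = np.convolve(far, [0.5, 0.3, -0.2])[:8000]
e = nlms_echo_cancel(far, near)
print(np.mean(e[-1000:] ** 2) < 1e-6)   # residual echo after convergence
```

When local speech is present, e(t) carries the near-end speech plus residual echo, which is why the method runs voice activity detection and noise estimation on the echo-cancelled signal afterwards.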
In one embodiment, as shown in FIG. 3g, the determination module 306 includes a detection sub-module 3065, a fifth processing sub-module 3066, and a second determination sub-module 3067.
The detection submodule 3065 is configured to calculate the noise spectrum estimate according to the formula V(k, t) = α_s · V(k, t − 1) + (1 − α_s) · |X(k, t)|² when speech is present, and V(k, t) = α_n · V(k, t − 1) + (1 − α_n) · |X(k, t)|² when speech is absent.
A fifth processing submodule 3066 is configured to process according to the formulas Y(k, t) = β_s · Y(k, t − 1) + (1 − β_s) · |X(k, t)|² when speech is present, Y(k, t) = β_n · Y(k, t − 1) + (1 − β_n) · |X(k, t)|² when speech is absent, and SNR(k, t) = Y(k, t) / V(k, t), with P(k, t) determined by comparing SNR(k, t) against TH_SNR.
Here α_s represents the smoothing factor of the noise estimation when speech is present, α_n represents the smoothing factor of the noise estimation when speech is absent, V(k, t − 1) represents the noise spectrum estimate of the kth frequency point at time t − 1, V(k, t) represents the noise spectrum estimate of the kth frequency point at time t, and X(k, t) represents the short-time Fourier transform of the kth frequency point at time t; β_s represents the smoothing factor of the signal estimation when speech is present, β_n represents the smoothing factor of the signal estimation when speech is absent, Y(k, t − 1) represents the signal spectrum estimate of the kth frequency point at time t − 1, and Y(k, t) represents the signal spectrum estimate of the kth frequency point at time t; SNR(k, t) represents the signal-to-noise ratio estimate, P(k, t) represents the speech presence probability of the kth frequency point at time t, and TH_SNR represents a signal-to-noise ratio threshold.
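The recursive smoothing described by these variable definitions can be sketched per frame. The smoothing-factor values, the hard SNR-threshold presence decision, and the initialization are assumptions for illustration; the patent's exact formulas are not reproduced:

```python
import numpy as np

def update_noise_and_probability(X_frame, V_prev, Y_prev,
                                 alpha_s=0.98, alpha_n=0.9,
                                 beta_s=0.7, beta_n=0.98, th_snr=2.0):
    """One frame of recursive noise/signal spectrum estimation.
    X_frame is the complex STFT of frame t; V_prev/Y_prev are the previous
    noise and signal spectrum estimates.  Returns updated estimates and a
    per-bin speech presence probability."""
    power = np.abs(X_frame) ** 2
    snr = Y_prev / np.maximum(V_prev, 1e-12)     # SNR estimate from previous frame
    p = (snr > th_snr).astype(float)             # hard presence decision per bin
    alpha = np.where(p > 0, alpha_s, alpha_n)    # noise updates slowly during speech
    beta = np.where(p > 0, beta_s, beta_n)
    V = alpha * V_prev + (1 - alpha) * power     # noise spectrum estimate
    Y = beta * Y_prev + (1 - beta) * power       # signal spectrum estimate
    return V, Y, p

# Feed steady unit-power noise: V converges to the noise power and the
# speech presence probability stays at zero.
V = np.full(4, 0.01)
Y = np.full(4, 0.01)
for _ in range(300):
    V, Y, p = update_noise_and_probability(np.ones(4), V, Y)
print(np.all(np.abs(V - 1) < 0.01), np.all(p == 0))
```

The resulting per-bin probability P(k, t) is what the direction-of-arrival stage uses to weight reliable (speech-dominated) frequency points.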
In one embodiment, as shown in fig. 3h, the calculation unit 30211 includes a first calculation subunit 302111 and a second calculation subunit 302112.
The first calculating subunit 302111 is configured to calculate the time delay estimate of any two microphones in the microphone array according to the formula τ = argmax_m Ψ(m), where Ψ(m) = Σ_k φ(k) · X(k, 1) · X*(k, 2) · e^{j2πkm/N}.
The second calculating subunit 302112 is configured to calculate the relative angle between the target sound source and the microphone array according to the formula θ = arccos(c · τ / d).
Here τ represents the time delay estimate between the two audio signals to be processed, Ψ(m) represents the generalized cross-correlation of the two audio signals to be processed, φ(k) = 1/|E{X(k, 1) · X*(k, 2)}| represents the weight value, E{X(k, 1) · X*(k, 2)} represents the expectation of the signal energy, θ represents the direction of arrival, c represents the speed of sound in air, and d represents the distance between the two microphones corresponding to the two audio signals to be processed.
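The generalized cross-correlation delay estimate and the delay-to-angle conversion above can be sketched as follows. PHAT weighting (dividing the cross-spectrum by its magnitude) is used as the weight φ(k), and the sampling rate and microphone spacing are illustrative assumptions:

```python
import numpy as np

def gcc_phat_delay(x1, x2, fs):
    """Time-delay estimate between two microphone signals via the
    generalized cross-correlation with PHAT weighting; a positive tau
    means x2 lags x1."""
    n = len(x1) + len(x2)
    cross = np.fft.rfft(x2, n) * np.conj(np.fft.rfft(x1, n))
    psi = np.fft.irfft(cross / np.maximum(np.abs(cross), 1e-12), n)
    max_shift = n // 2
    psi = np.concatenate((psi[-max_shift:], psi[:max_shift + 1]))
    return (np.argmax(np.abs(psi)) - max_shift) / fs

def doa_angle(tau, d, c=343.0):
    """Relative angle of the source to the two-microphone axis:
    theta = arccos(c * tau / d), in degrees."""
    return np.degrees(np.arccos(np.clip(c * tau / d, -1.0, 1.0)))

fs = 16000
rng = np.default_rng(1) if hasattr(np, "default_rng") else np.random.default_rng(1)
rng = np.random.default_rng(1)
x1 = rng.standard_normal(4096)
x2 = np.roll(x1, 5)                 # microphone 2 hears the source 5 samples later
tau = gcc_phat_delay(x1, x2, fs)
print(round(tau * fs))              # recovers the integer sample delay
```

A zero delay corresponds to a broadside source (θ = 90°), which is a quick sanity check for the angle formula.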
In one embodiment, as shown in fig. 3i, the second processing module 303 comprises a sixth processing submodule 3031.
The sixth processing submodule 3031 is configured to perform beamforming processing on the audio signal to be processed according to the formula h_BF = argmin_h h^T R h, subject to h^T d(θ) = 1, whose closed-form solution is h_BF = R^{−1} d(θ) / (d^T(θ) R^{−1} d(θ));
wherein R = E{X(t) X^T(t)},
d(θ) = [1 e^{−jωδcosθ/c} ... e^{−j(M−1)ωδcosθ/c}]^T,
h_BF^T represents the transposed matrix of h_BF, "subject to" denotes the constraint that h_BF^T d(θ) is equal to 1, d^T(θ) represents the transposed matrix of d(θ), X(t) represents the short-time Fourier transform at time t, and X^T(t) represents the transposed matrix of X(t).
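This constrained minimization is the minimum-variance distortionless-response (MVDR) beamformer. A minimal sketch follows, using a Hermitian transpose in place of the patent's real-notation transpose; the array geometry, frequency, and toy covariance are illustrative assumptions:

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR weights h = R^-1 d / (d^H R^-1 d), which minimize h^H R h
    subject to the distortionless constraint h^H d = 1."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def steering_vector(theta, M, omega, delta, c=343.0):
    """Uniform-linear-array steering vector
    d(theta) = [1, e^{-j w delta cos(theta)/c}, ..., e^{-j(M-1) w delta cos(theta)/c}]^T."""
    m = np.arange(M)
    return np.exp(-1j * m * omega * delta * np.cos(theta) / c)

M, omega, delta = 4, 2 * np.pi * 1000.0, 0.05      # 4 mics, 1 kHz, 5 cm spacing
d = steering_vector(np.pi / 3, M, omega, delta)
R = np.eye(M) + 0.1 * np.outer(d, d.conj())        # toy spatial covariance matrix
h = mvdr_weights(R, d)
print(abs(h.conj() @ d - 1) < 1e-10)               # distortionless constraint holds
```

The weights pass the look direction θ with unit gain while minimizing output power from all other directions, which is exactly the constraint the formula expresses.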
In one embodiment, as shown in FIG. 3j, the third processing module 304 includes a seventh processing submodule 3041.
The seventh processing submodule 3041 is configured to obtain the target audio signal according to a noise reduction filter h_NR(k) and the formula S(k, t) = h_NR(k) · X(k, t);
wherein S(k, t) represents the audio signal to be processed after noise reduction processing, h_NR(k) represents the noise reduction filter, and X(k, t) represents the audio signal to be processed after short-time Fourier transform.
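The per-bin multiplication S(k, t) = h_NR(k) · X(k, t) can be sketched as follows. The patent does not specify h_NR here, so a common Wiener-style gain snr/(1 + snr) with a spectral floor is used as a stand-in; the test values are illustrative:

```python
import numpy as np

def suppress_noise(X, V, floor=0.05):
    """Apply a per-bin noise reduction gain to the beamformed spectrum:
    S(k, t) = h_NR(k) * X(k, t).  Here h_NR is a Wiener-style gain derived
    from the noise power estimate V (an assumed stand-in, not the
    patent's filter)."""
    snr = np.maximum(np.abs(X) ** 2 - V, 0) / np.maximum(V, 1e-12)
    h_nr = np.maximum(snr / (1.0 + snr), floor)   # spectral floor limits musical noise
    return h_nr * X

X = np.array([10.0 + 0j, 0.1 + 0j])   # strong speech bin, noise-only bin
V = np.array([1.0, 1.0])              # noise power estimate per bin
S = suppress_noise(X, V)
print(abs(S[0]) > 9, abs(S[1]) < 0.01)   # speech preserved, noise attenuated
```

The speech-dominated bin is passed nearly unchanged while the noise-only bin is reduced to the floor, which is the intended effect of the suppression stage.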
The embodiment of the present disclosure provides a voice processing apparatus. When a plurality of audio signals to be processed are acquired, the apparatus performs direction-of-arrival estimation on any two microphones in the microphone array, performs beamforming processing on the audio signals to be processed according to the direction-of-arrival estimation and a beamforming algorithm, performs noise suppression on the beamformed audio signals, and finally outputs the resulting target audio signal. By estimating the direction of arrival between every two audio signals to be processed and suppressing noise after the beamforming processing, the apparatus realizes the audio pickup and enhancement functions and improves the accuracy of audio identification.
Referring to fig. 4, an embodiment of the present disclosure further provides a speech processing apparatus, where the speech processing apparatus includes a receiver 401, a transmitter 402, a memory 403, and a processor 404, where the transmitter 402 and the memory 403 are respectively connected to the processor 404, the memory 403 stores at least one computer instruction, and the processor 404 is configured to load and execute the at least one computer instruction to implement the speech processing method described in the embodiment corresponding to fig. 1.
Based on the voice processing method described in the embodiment corresponding to fig. 1, an embodiment of the present disclosure further provides a computer-readable storage medium, for example, the non-transitory computer-readable storage medium may be a Read Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like. The storage medium stores computer instructions for executing the voice processing method described in the embodiment corresponding to fig. 1, which is not described herein again.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Claims (11)
1. A method of speech processing, the method comprising:
acquiring at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array;
estimating the direction of arrival of any two microphones in the microphone array;
carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm;
carrying out noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal;
and outputting the target audio signal.
2. The method of claim 1, wherein before the direction-of-arrival estimation for any two microphones of the microphone array, the method further comprises:
performing voice activity detection and noise estimation on each audio signal to be processed, and determining the existence probability of the audio signal according to the results of the voice activity detection and the noise estimation;
the estimating direction of arrival of any two microphones of the microphone array comprises:
and estimating the direction of arrival of any two microphones in the microphone array according to the existence probability of the audio signal.
3. The method of claim 2, wherein the estimating direction of arrival for any two microphones in the microphone array according to the audio signal presence probability comprises:
calculating time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal;
and calculating the relative angle between the target sound source and the microphone array according to the time delay estimation result.
4. The method of claim 3, wherein the performing voice activity detection and noise estimation on each of the audio signals to be processed comprises:
determining whether there is a synchronous input signal;
when the synchronous input signal is determined, performing echo cancellation processing on each audio signal to be processed;
performing voice activity detection and noise estimation on each audio signal to be processed after echo cancellation processing;
and when the synchronous input signal is determined not to exist, carrying out voice activity detection and noise estimation on each audio signal to be processed.
5. The method of claim 1, wherein the obtaining at least two audio signals to be processed comprises:
acquiring at least two original audio signals; the original audio signal is a signal output by an audio input module;
and carrying out short-time Fourier transform on each original audio signal to obtain the audio signal to be processed.
6. The method of claim 4, wherein the performing echo cancellation processing on each of the audio signals to be processed comprises:
performing echo cancellation processing on each of the audio signals to be processed according to the formulas y(t, m) = Σ_{l=0}^{L−1} h_l · s(t − l), ĥ(t + 1, m) = ĥ(t, m) + μ · e(t, m) · s(k, m) / (s^T(k, m) · s(k, m)), ŷ(t, m) = ĥ^T(t, m) · s(k, m), and e(t, m) = x(t, m) − ŷ(t, m);
wherein y(t, m) represents the synchronous input signal collected by the mth microphone at time t, s(t − l) represents the synchronous input signal at time t − l, h_l represents the channel between the synchronous input signal and each microphone, l is the summation index of the accumulation operator, L represents the time length, and h(t, m) = [h_0 h_1 ... h_{L−1}] represents the channel between the synchronous input signal and the mth microphone at time t; ĥ(t + 1, m) represents the channel estimate of the synchronous input signal acquired by the mth microphone at time t + 1, ĥ(t, m) represents the channel estimate of the synchronous input signal acquired by the mth microphone at time t, e(t, m) represents the error signal, μ represents the smoothing factor, ŷ(t, m) represents the echo estimate of the mth microphone at time t, x(t, m) represents the near-end signal of the mth microphone at time t, s(k, m) = [s(k, m) s(k − 1, m) … s(k − L + 1, m)] represents the vector of synchronous input signals, and s^T(k, m) represents the transpose of s(k, m).
7. The method of claim 6, wherein performing voice activity detection and noise estimation on each of the audio signals to be processed, and determining the audio signal existence probability according to the results of the voice activity detection and noise estimation comprises:
estimating the noise spectrum according to the formulas V(k, t) = α_s · V(k, t − 1) + (1 − α_s) · |X(k, t)|² when speech is present, and V(k, t) = α_n · V(k, t − 1) + (1 − α_n) · |X(k, t)|² when speech is absent;
estimating the signal spectrum and the speech presence probability according to the formulas Y(k, t) = β_s · Y(k, t − 1) + (1 − β_s) · |X(k, t)|² when speech is present, Y(k, t) = β_n · Y(k, t − 1) + (1 − β_n) · |X(k, t)|² when speech is absent, and SNR(k, t) = Y(k, t) / V(k, t), with P(k, t) determined by comparing SNR(k, t) against TH_SNR;
wherein α_s represents the smoothing factor of the noise estimation when speech is present, α_n represents the smoothing factor of the noise estimation when speech is absent, V(k, t − 1) represents the noise spectrum estimate of the kth frequency point at time t − 1, V(k, t) represents the noise spectrum estimate of the kth frequency point at time t, and X(k, t) represents the short-time Fourier transform of the kth frequency point at time t; β_s represents the smoothing factor of the signal estimation when speech is present, β_n represents the smoothing factor of the signal estimation when speech is absent, Y(k, t − 1) represents the signal spectrum estimate of the kth frequency point at time t − 1, and Y(k, t) represents the signal spectrum estimate of the kth frequency point at time t; SNR(k, t) represents the signal-to-noise ratio estimate, P(k, t) represents the speech presence probability of the kth frequency point at time t, and TH_SNR represents a signal-to-noise ratio threshold.
8. The method of claim 7, wherein calculating an estimate of time delay for any two microphones in the array of microphones based on the probability of existence of the audio signal, and wherein calculating a relative angle of a target sound source to the array of microphones based on the result of the estimate of time delay comprises:
calculating the time delay estimate of any two microphones in the microphone array according to the formula τ = argmax_m Ψ(m), where Ψ(m) = Σ_k φ(k) · X(k, 1) · X*(k, 2) · e^{j2πkm/N};
calculating the relative angle between the target sound source and the microphone array according to the formula θ = arccos(c · τ / d);
wherein τ represents the time delay estimate between the two audio signals to be processed, Ψ(m) represents the generalized cross-correlation of the two audio signals to be processed, φ(k) = 1/|E{X(k, 1) · X*(k, 2)}| represents the weight value, E{X(k, 1) · X*(k, 2)} represents the expectation of the signal energy, θ represents the direction of arrival, c represents the speed of sound in air, and d represents the distance between the two microphones corresponding to the two audio signals to be processed.
9. The method of claim 8, wherein the beamforming the audio signal to be processed according to the direction of arrival estimation and beamforming algorithm comprises:
performing beamforming processing on the audio signal to be processed according to the formula h_BF = argmin_h h^T R h, subject to h_BF^T d(θ) = 1;
wherein R = E{X(t) X^T(t)},
d(θ) = [1 e^{−jωδcosθ/c} ... e^{−j(M−1)ωδcosθ/c}]^T.
10. The method of claim 9, wherein the performing noise suppression on the beamformed audio signal to be processed to obtain a target audio signal comprises:
obtaining the target audio signal according to a noise reduction filter h_NR(k) and the formula S(k, t) = h_NR(k) · X(k, t);
wherein S(k, t) represents the audio signal to be processed after noise reduction processing, h_NR(k) represents the noise reduction filter, and X(k, t) represents the audio signal to be processed after short-time Fourier transform.
11. A speech processing apparatus, comprising:
the acquisition module is used for acquiring at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array;
the first processing module is used for estimating the direction of arrival of any two microphones in the microphone array;
the second processing module is used for carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm;
the third processing module is used for carrying out noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal;
and the output module is used for outputting the target audio signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111003630.0A CN113744752A (en) | 2021-08-30 | 2021-08-30 | Voice processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113744752A true CN113744752A (en) | 2021-12-03 |
Family
ID=78733797
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111003630.0A Pending CN113744752A (en) | 2021-08-30 | 2021-08-30 | Voice processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113744752A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114783441A (en) * | 2022-05-14 | 2022-07-22 | 云知声智能科技股份有限公司 | Voice recognition method, device, equipment and medium |
CN115579016A (en) * | 2022-12-07 | 2023-01-06 | 成都海普迪科技有限公司 | Method and system for eliminating acoustic echo |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007147732A (en) * | 2005-11-24 | 2007-06-14 | Japan Advanced Institute Of Science & Technology Hokuriku | Noise reduction system and noise reduction method |
CN106251877A (en) * | 2016-08-11 | 2016-12-21 | 珠海全志科技股份有限公司 | Voice sound source direction estimation method and device |
CN108831508A (en) * | 2018-06-13 | 2018-11-16 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, device and equipment |
CN108899044A (en) * | 2018-07-27 | 2018-11-27 | 苏州思必驰信息科技有限公司 | Audio signal processing method and device |
CN108922553A (en) * | 2018-07-19 | 2018-11-30 | 苏州思必驰信息科技有限公司 | Wave arrival direction estimating method and system for sound-box device |
CN110097891A (en) * | 2019-04-22 | 2019-08-06 | 广州视源电子科技股份有限公司 | Microphone signal processing method, device, equipment and storage medium |
CN110164446A (en) * | 2018-06-28 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Voice signal recognition methods and device, computer equipment and electronic equipment |
CN110556103A (en) * | 2018-05-31 | 2019-12-10 | 阿里巴巴集团控股有限公司 | Audio signal processing method, apparatus, system, device and storage medium |
CN111161751A (en) * | 2019-12-25 | 2020-05-15 | 声耕智能科技(西安)研究院有限公司 | Distributed microphone pickup system and method under complex scene |
CN111624553A (en) * | 2020-05-26 | 2020-09-04 | 锐迪科微电子科技(上海)有限公司 | Sound source positioning method and system, electronic equipment and storage medium |
CN111856402A (en) * | 2020-07-23 | 2020-10-30 | 海尔优家智能科技(北京)有限公司 | Signal processing method and device, storage medium, and electronic device |
CN113270106A (en) * | 2021-05-07 | 2021-08-17 | 深圳市友杰智新科技有限公司 | Method, device and equipment for inhibiting wind noise of double microphones and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10123113B2 (en) | Selective audio source enhancement | |
JP4815661B2 (en) | Signal processing apparatus and signal processing method | |
CN106710601B (en) | Noise-reduction and pickup processing method and device for voice signals and refrigerator | |
EP3542547B1 (en) | Adaptive beamforming | |
CN107479030B (en) | Frequency division and improved generalized cross-correlation based binaural time delay estimation method | |
KR101449433B1 (en) | Noise cancelling method and apparatus from the sound signal through the microphone | |
KR101456866B1 (en) | Method and apparatus for extracting the target sound signal from the mixed sound | |
US8462962B2 (en) | Sound processor, sound processing method and recording medium storing sound processing program | |
CN109285557B (en) | Directional pickup method and device and electronic equipment | |
KR20040044982A (en) | Selective sound enhancement | |
CN106887239A (en) | For the enhanced blind source separation algorithm of the mixture of height correlation | |
CN110610718B (en) | Method and device for extracting expected sound source voice signal | |
JP2007523514A (en) | Adaptive beamformer, sidelobe canceller, method, apparatus, and computer program | |
CN111863015B (en) | Audio processing method, device, electronic equipment and readable storage medium | |
CN108109617A (en) | A kind of remote pickup method | |
CN113744752A (en) | Voice processing method and device | |
WO2007123047A1 (en) | Adaptive array control device, method, and program, and its applied adaptive array processing device, method, and program | |
KR100917460B1 (en) | Noise cancellation apparatus and method thereof | |
CN113903353A (en) | Directional noise elimination method and device based on spatial discrimination detection | |
CN112802490B (en) | Beam forming method and device based on microphone array | |
CN117169812A (en) | Sound source positioning method based on deep learning and beam forming | |
CN116106826A (en) | Sound source positioning method, related device and medium | |
CN116760442A (en) | Beam forming method, device, electronic equipment and storage medium | |
CN113948101B (en) | Noise suppression method and device based on space distinguishing detection | |
KR20090098552A (en) | Apparatus and method for automatic gain control using phase information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20211203 |