CN113744752A

CN113744752A - Voice processing method and device

Info

Publication number: CN113744752A
Application number: CN202111003630.0A
Authority: CN
Inventors: 聂玮奇; 刘煜; 刘博洋; 季经伟
Original assignee: Xi'an Shengbijie Information Technology Co ltd
Current assignee: Xi'an Shengbijie Information Technology Co ltd
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2021-12-03

Abstract

The present disclosure provides a voice processing method and apparatus, relating to the technical field of voice, wherein the method comprises obtaining at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array; estimating the direction of arrival of any two microphones in the microphone array; carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm; carrying out noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal; and outputting the target audio signal. The audio frequency picking and enhancing function is realized, and the accuracy of audio frequency identification is improved.

Description

Voice processing method and device

Technical Field

The present disclosure relates to the field of speech technologies, and in particular, to a speech processing method and apparatus.

Background

With the continuous development of artificial intelligence technology, traditional equipment in various fields is gradually replaced by corresponding intelligent terminals. The intelligent terminal is a fully-open platform with multiple functions of monitoring, sensing, communication and intelligent interaction, carries an operating system, can automatically install and uninstall various application software, and continuously expands and upgrades functions. In the aspect of intelligent interaction, many complicated items are not realized only by a remote control and a touch screen which are commonly used by a target, wherein the best method is to adopt a voice remote control, and the key of the voice remote control is the acquisition and recognition of a voice signal.

In the related art, when a speech signal is acquired, the speech signal is usually filtered and output directly.

However, in the above-described technique, if the acquired speech signal includes speech in a plurality of directions, only filtering results in a large amount of noise in the finally obtained speech signal, and thus accuracy of speech recognition is reduced.

Disclosure of Invention

The embodiment of the disclosure provides a voice processing method and device, which can solve the problem that the accuracy of voice recognition is reduced in the prior art. The technical scheme is as follows:

according to a first aspect of embodiments of the present disclosure, there is provided a speech processing method, the method including:

acquiring at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array;

estimating the direction of arrival of any two microphones in the microphone array;

carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm;

carrying out noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal;

and outputting the target audio signal.

The embodiment of the disclosure provides a voice processing method, which performs direction-of-arrival estimation on any two microphones in a microphone array when a plurality of audio signals to be processed are acquired, performs beamforming processing on the audio signals to be processed according to the direction-of-arrival estimation and a beamforming algorithm, performs noise suppression on the audio signals to be processed after the beamforming processing, and finally outputs a target audio signal obtained after the noise reduction and suppression. Therefore, the method and the device have the advantages that the direction of arrival estimation is carried out on every two audio signals to be processed, and the noise suppression processing is carried out on the audio signals to be processed after the beam forming processing, so that the audio picking and enhancing functions are realized, and the accuracy of audio identification is improved.

In one embodiment, before the estimating the direction of arrival of any two microphones in the microphone array, the method further includes:

performing voice activity detection and noise estimation on each audio signal to be processed, and determining the existence probability of the audio signal according to the results of the voice activity detection and the noise estimation;

the estimating direction of arrival of any two microphones of the microphone array comprises:

and estimating the direction of arrival of any two microphones in the microphone array according to the existence probability of the audio signal.

In one embodiment, the estimating the direction of arrival of any two microphones of the microphone array according to the audio signal existence probability comprises:

and calculating the time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal, and calculating the relative angle between a target sound source and the microphone array according to the time delay estimation result.

In one embodiment, the performing voice activity detection and noise estimation on each of the audio signals to be processed includes:

determining whether there is a synchronous input signal;

when the synchronous input signal is determined, performing echo cancellation processing on each audio signal to be processed;

performing voice activity detection and noise estimation on each audio signal to be processed after echo cancellation processing;

and when the synchronous input signal is determined not to exist, carrying out voice activity detection and noise estimation on each audio signal to be processed.

In one embodiment, the obtaining at least two audio signals to be processed comprises:

acquiring at least two original audio signals; the original audio signal is a signal output by an audio input module;

and carrying out short-time Fourier transform on each original audio signal to obtain the audio signal to be processed.

In one embodiment, the performing echo cancellation processing on each of the audio signals to be processed includes:

according to the formula

And

formula (II)

Performing echo cancellation processing on each audio signal to be processed;

wherein y (t, m) represents the synchronous input signal collected by the mth microphone at the time t, s (t-l) represents the synchronous input signal at the time t-l, h_lIndicating the channel between the synchronous input signal to each microphone, L being an identifier in the accumulation operator, L indicating the length of time, h (t, m) [ h ]₀h₁...h_L-1]Representing the channel between the synchronous input signal to the mth microphone at time t;

representing the channel estimate of the synchronous input signal acquired by the mth microphone at time t +1,

representing the channel estimate of the synchronous input signal acquired by the mth microphone at time t,

representing the error signal, mu the smoothing factor,

denotes the echo estimate of the mth microphone at time t, x (t, m) denotes the near-end signal of the mth microphone at time t, and s (k, m) [ s (k, m) s (k-1, m) … s (k-L + 1),m)]Representing a vector of synchronous input signals, s^T(k, m) represents the transpose of s (k, m).

In one embodiment, the performing voice activity detection and noise estimation on each of the audio signals to be processed, and determining the existence probability of the audio signal according to the results of the voice activity detection and noise estimation includes:

according to the formula

Performing voice activity detection on each audio signal to be processed;

according to the formula

Performing noise estimation on each audio signal to be processed;

according to the formula

And

formula (II)

Determining the audio signal presence probability;

wherein alpha is_sSmoothing factor, alpha, representing noise estimation in presence of speech_nThe smoothing factor represents the noise estimation in the absence of voice, V (k, t-1) represents the noise spectrum estimation value of the kth frequency point at the t-1 moment, V (k, t) represents the noise spectrum estimation value of the kth frequency point at the t moment, and X (k, t) represents the short-time Fourier transform of the kth frequency point at the t moment; beta is a_sSmoothing factor, beta, representing the estimation of the signal in the presence of speech_nThe smoothing factor represents the signal estimation without voice, Y (k, t-1) represents the signal spectrum estimation value of the kth frequency point at the t-1 moment, and Y (k, t) represents the signal spectrum estimation value of the kth frequency point at the t moment; SNR (k, t) represents the estimated value of the SNR, P (k, t) represents the existence probability of the voice of the kth frequency point at the time t, TH_SNRRepresenting a signal-to-noise threshold.

In one embodiment, the calculating the time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal includes:

according to the formula

Calculating time delay estimation of any two microphones in the microphone array;

according to the formula

Calculating the relative angle between a target sound source and the microphone array;

where τ represents an estimate of the time delay between two audio signals to be processed, Ψ (m) represents the generalized cross-correlation of the two audio signals to be processed,

the weight value is represented by a weight value,

representing the expectation of the energy of the signal, theta represents the direction of arrival, c represents the speed of sound in air, and d represents the distance between the two microphones for the two audio signals to be processed.

In one embodiment, the beamforming the audio signal to be processed according to the direction of arrival estimation and beamforming algorithm includes:

according to the formula

Carrying out beam forming processing on the audio signal to be processed;

wherein R ═ E { X (t) X^T(t)}，

d(θ)＝[1e^{-jωδcosθ/c}...e^{-j(M-1)ωδcosθ/c}]^T，

The method can obtain the following results by a Lagrange multiplier method:

represents h_BFIs transposed matrix, subject to denotes such that

Is equal to 1, d^T(theta) denotes a transposed matrix of d (theta), X (t) denotes a short-time Fourier transform at time t, X^T(t) denotes a transposed matrix of X (t).

In an embodiment, the performing noise suppression on the to-be-processed audio signal after the beamforming processing to obtain the target audio signal includes:

according to the formula

And the formula S (k, t) ═ h_NR(k) X (k, t) obtaining the target audio signal;

wherein S (k, t) represents the audio signal to be processed after noise reduction processing, h_NR(k) Denotes a noise reduction filter, and X (k, t) denotes a short-time fourier-transformed audio signal to be processed.

According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, the method including:

the acquisition module is used for acquiring at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array;

the first processing module is used for estimating the direction of arrival of any two microphones in the microphone array;

the second processing module is used for carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm;

the third processing module is used for carrying out noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal;

and the output module is used for outputting the target audio signal.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

Fig. 1 is a flowchart of a speech processing method provided by an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method of speech processing provided by an embodiment of the present disclosure;

FIG. 3a is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;

FIG. 3b is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;

FIG. 3c is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;

FIG. 3d is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;

fig. 3e is a structural diagram of a speech processing apparatus according to an embodiment of the present disclosure;

FIG. 3f is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;

FIG. 3g is a block diagram of a speech processing apparatus according to an embodiment of the disclosure;

FIG. 3h is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;

fig. 3i is a structural diagram of a speech processing apparatus according to an embodiment of the present disclosure;

FIG. 3j is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;

fig. 4 is a block diagram of a speech processing device according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

An embodiment of the present disclosure provides a speech processing method, as shown in fig. 1, the method includes the following steps:

step 101, at least two audio signals to be processed are obtained.

The audio signals to be processed are all signals output by the audio input module, and the at least two audio signals to be processed comprise audio signals obtained by the microphone array.

And 102, estimating the direction of arrival of any two microphones in the microphone array.

And 103, carrying out beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm.

And step 104, performing noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal.

And 105, outputting the target audio signal.

An embodiment of the present disclosure provides a speech processing method, as shown in fig. 2, the method includes the following steps:

step 201, at least two original audio signals are obtained.

The original audio signal is a signal output by an audio input module, and the original audio signal comprises an audio signal output by a microphone array and/or an audio signal output by an intelligent microphone.

Illustratively, a multi-channel original audio signal is obtained from an audio input module at a fixed period, and the source of the original audio signal may be a microphone array or other smart microphones.

It should be noted that the audio input module may include a sound collection module and at least one input channel, for example, the audio input module includes 16 input channels; the sound collection module may include analog-to-digital conversion devices, a microphone array, a smart microphone, and the like, for example, the sound collection module includes 8 analog microphone inputs and 2 analog-to-digital conversion devices; the input sources of the overall audio signal may include: microphone arrays, third party analog or digital audio streams, other smart microphones.

Step 202, performing short-time fourier transform on each original audio signal to obtain the audio signal to be processed.

Optionally, according to a formula

Wherein, X (k, t, m) represents the short-time Fourier transform of the kth frequency point of the mth channel at the time of t, namely the audio signal to be processed, N represents the length of a time window, w (N) represents the function value of the nth window, X (N + t, m) represents the audio signal to be processed of the mth channel at the time of N + t, N is an integer greater than or equal to 1, w_k2 pi K/K denotes the angular frequency, K denotes the length of the short-time fourier transform, and e is the natural index.

Illustratively, the acquired multi-channel original audio signal is converted from the time domain to the frequency domain by a short-time fourier transform.

Step 203, determining whether there is a synchronous input signal.

For example, the synchronous input signal generally refers to analog and digital audio streams of a third party, and is mainly carried by a sound source of sound played in the current environment, for example, sound played by a sound box or a television; the detection of the synchronous input signal is a necessary condition for performing echo cancellation processing, so that whether the synchronous input signal is directly related to whether echo cancellation is performed is determined; specifically, the detection of the synchronous input signal is usually completed by energy detection, that is, the signal energy of the synchronous input channel is calculated, and when the signal energy is greater than or equal to a set threshold, it is determined that there is a synchronous input signal and echo cancellation is required; and when the signal energy is less than the set threshold value, determining that no synchronous input signal exists and not needing echo cancellation.

And 204, when the synchronous input signal is determined, performing echo cancellation processing on each audio signal to be processed.

The echo cancellation means that artificially played sound, i.e., a synchronization signal, is removed from the acquired audio signal to be processed, and other sound is retained to the maximum extent.

Optionally, the channel estimation is performed according to a normalized minimum mean square error method, i.e. according to a formula

And

formula (II)

Performing echo cancellation processing on each audio signal to be processed;

wherein y (t, m) represents the synchronous input signal collected by the mth microphone at the time t, s (t-l) represents the synchronous input signal at the time t-l, h_lIndicating the channel between the synchronous input signal to each microphone, L being an identifier in the accumulation operator, L indicating the length of time, h (t, m) [ h ]₀h₁...h_L-1]Indicates that at time tStep one, inputting a channel between the signal and the mth microphone;

representing the error signal, mu the smoothing factor,

denotes the echo estimate of the mth microphone at time t, x (t, m) denotes the near-end signal of the mth microphone at time t, s (k, m) [ s (k, m) s (k-1, m) … s (k-L +1, m) ]]Representing a vector of synchronous input signals, s^T(k, m) represents the transpose of s (k, m).

It should be noted that echo cancellation can also be performed by other methods in the prior art, which is not limited by the present disclosure.

Step 205, performing voice activity detection and noise estimation on each audio signal to be processed.

Specifically, in a real acoustic scene, a speech signal does not always exist in an environment, most of the speech segments and noise segments alternate, and even most of the speech segments and noise segments are noise segments, so that speech activity detection is required, and the speech activity detection is realized by detecting the energy or amplitude of a real-time audio stream and tracking the change of speech and noise in the audio stream on the basis of the energy or amplitude. In order to obtain better noise reduction effect, noise estimation is necessary, and the noise estimation tracks the change of the acoustic frequency spectrum in real time by tracking the change of characteristics such as signal-to-noise ratio, amplitude and the like in the audio stream signal. The most typical way is to estimate the signal-to-noise ratio of the audio in real time by tracking the frequency spectrums of the voice and the noise, and then update the frequency spectrums of the voice and the noise according to the estimated signal-to-noise ratio of the audio.

Optionally, when it is determined that there is a synchronous input signal, voice activity detection and noise estimation are performed on each audio signal to be processed after echo cancellation processing, so as to obtain an audio signal existence probability.

Optionally, when it is determined that no synchronous input signal exists, voice activity detection and noise estimation are directly performed on each audio signal to be processed, so as to obtain the existence probability of the audio signal.

Illustratively, according to a formula

Noise estimation is performed for each audio signal to be processed.

Wherein alpha is_sSmoothing factor, alpha, representing noise estimation in presence of speech_nAnd the smoothing factor represents the noise estimation in the absence of voice, V (k, t-1) represents the noise spectrum estimation value of the kth frequency point at the t-1 moment, V (k, t) represents the noise spectrum estimation value of the kth frequency point at the t moment, and X (k, t) represents the short-time Fourier transform of the kth frequency point at the t moment.

According to the formula

Voice activity detection is performed for each audio signal to be processed.

Wherein, beta_sSmoothing factor, beta, representing the estimation of the signal in the presence of speech_nAnd the smoothing factor represents the signal estimation in the absence of voice, Y (k, t-1) represents the signal spectrum estimation value of the kth frequency point at the t-1 moment, and Y (k, t) represents the signal spectrum estimation value of the kth frequency point at the t moment.

Where SNR (k, t) represents an estimate of the signal-to-noise ratio.

Wherein, P (k, t) represents the audio signal existence probability of the kth frequency point at the time t, TH_SNRRepresenting a signal-to-noise threshold.

And step 206, estimating the direction of arrival of any two microphones in the microphone array according to the existence probability of the audio signal.

The direction of arrival is the relative angle between the target sound source and the microphone array, and the estimation of the direction of arrival is divided into two steps: and calculating the time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal, and then calculating the relative angle between the target sound source and the microphone array according to the time delay estimation result.

Illustratively, according to a formula

Time delay estimates for any two microphones in the array of microphones are calculated.

According to the formula

The relative angle of the target sound source to the microphone array is calculated.

phi (k) represents a weight, phi (k) 1/| E { X (k,1) X^*(k,2)|，E{X(k,1)X^*(k,2) } denotes the expectation of the energy of the signal, θ denotes the direction of arrival, c denotes the speed of sound in air, and d denotes the distance between two microphones corresponding to two audio signals to be processed.

And step 207, performing beam forming processing on the audio signal to be processed according to the direction of arrival estimation and the beam forming algorithm.

Specifically, when the direction of arrival is determined, the spatial information of the signal can be utilized to the maximum extent by the beam forming algorithm, and noise and reverberation from directions other than the sound source direction can be eliminated. The beamforming is to perform phase compensation on each microphone in different frequency bands, so as to achieve the effects of enhancing a target signal and suppressing noise and interference. Specifically, spatial filters are respectively designed on different frequency bands, and each audio signal to be processed is spatially filtered.

For example, the beamforming coefficients may be designed according to a distortion-free minimum mean square error, and the overall energy may be minimized while ensuring that the direction of arrival signal is unchanged. I.e. according to the formula

And carrying out beam forming processing on the audio signal to be processed.

Wherein R ═ E { X (t) X^T(t)}，

d(θ)＝[1e^{-jωδcosθ/c}...e^{-j(M-1)ωδcosθ/c}]^T，

The method can obtain the following results by a Lagrange multiplier method:

represents h_BFIs transposed matrix, subject to denotes such that

And step 208, performing noise suppression on the audio signal to be processed after the beam forming processing to obtain a target audio signal.

Specifically, noise cancellation is necessary because noise is ubiquitous in the real environment, where noise cancellation is achieved by frequency filtering, which can be found by minimizing the difference between the clean signal and the estimated signal. The noise reduction is usually performed by spectral subtraction, and the idea of spectral subtraction is to calculate a ratio of a clean signal to an observed signal by using the energy of a current signal and the energy of noise estimation for each frequency point, and then perform frequency filtering by using the ratio.

Illustratively, according to a formula

And formula

S(k，t)＝h_NR(k) X (k, t) results in the target audio signal.

And 209, performing short-time Fourier inversion on each target audio signal and outputting the target audio signal.

For example, after the target audio signal is determined, the target audio signal is converted from the frequency domain to the time domain again by using short-time inverse fourier transform to obtain a finally output digital audio stream, the output digital audio stream may be output through an audio output module, and the audio output module may be a headphone interface, a USB sound card, or other smart microphones.

The embodiment of the disclosure provides a voice processing method, when a plurality of audio signals to be processed are obtained, whether a synchronous input signal exists is detected, and when the synchronous input signal exists, echo cancellation processing is performed on each audio signal to be processed; then, voice activity detection and noise estimation are carried out on each audio signal to be processed after echo cancellation processing, and the existence probability of the audio signal is obtained; and determining direction-of-arrival estimation between every two audio signals to be processed according to the audio signal existence probability, performing noise reduction processing on each audio signal to be processed according to the direction-of-arrival estimation, and finally outputting the target audio signal subjected to the noise reduction processing. Therefore, the method not only detects the synchronous input signals of the received audio signals to be processed, but also performs voice activity detection and noise estimation, and finally performs direction-of-arrival estimation on every two audio signals to be processed according to the existence probability of the audio signals obtained by the voice activity detection and the noise estimation, and performs noise reduction processing on all the audio signals to be processed according to the direction-of-arrival estimation, thereby further reducing various noises in the target audio signal, realizing audio pickup and enhancement functions, and further improving the accuracy of audio identification; in addition, the method and the device can simultaneously acquire the audio signals to be processed output by the plurality of intelligent microphones and simultaneously process the audio signals to be processed output by the plurality of intelligent microphones, so that the combined processing of the plurality of intelligent microphones is realized, complex scenes with high processing difficulty can be matched, and the adaptability is high. The microphone designed by the method has a comprehensive middle and far field voice enhancement effect, can be applied to all scenes with middle and far field voice enhancement requirements, and has extremely high universality.

Based on the speech processing method described in the above embodiments, the following is an embodiment of the apparatus of the present disclosure, which can be used to execute the embodiment of the method of the present disclosure.

The embodiment of the present disclosure provides a voice processing apparatus, as shown in fig. 3a, the voice processing apparatus 30 includes: an acquisition module 301, a first processing module 302, a second processing module 303, a third processing module 304 and an output module 305.

The acquiring module 301 is configured to acquire at least two audio signals to be processed; the at least two audio signals to be processed comprise audio signals acquired by a microphone array.

A first processing module 302, configured to perform direction-of-arrival estimation on any two microphones in the microphone array.

The second processing module 303 is configured to perform beamforming processing on the audio signal to be processed according to the direction of arrival estimation and a beamforming algorithm.

The third processing module 304 is configured to perform noise suppression on the audio signal to be processed after the beamforming processing, so as to obtain a target audio signal.

An output module 305, configured to output the target audio signal.

In one embodiment, as shown in fig. 3b, the apparatus further comprises a determination module 306, and the first processing module 302 comprises a first processing sub-module 3021.

The determining module 306 is configured to perform voice activity detection and noise estimation on each to-be-processed audio signal, and determine an existence probability of the audio signal according to results of the voice activity detection and the noise estimation.

The first processing sub-module 3021 is configured to perform direction-of-arrival estimation on any two microphones in the microphone array according to the audio signal existence probability.

In one embodiment, as shown in fig. 3c, the first processing submodule 3021 comprises a calculation unit 30211.

The calculating unit 30211 is configured to calculate time delay estimates of any two microphones in the microphone array according to the existence probability of the audio signal, and calculate a relative angle between a target sound source and the microphone array according to a result of the time delay estimates.

In one embodiment, as shown in FIG. 3d, the determination module 306 includes a first determination submodule 3061, a second processing submodule 3062, a third processing submodule 3063, and a fourth processing submodule 3064.

Therein, the first determining submodule 3061 is used for determining whether there is a synchronous input signal.

The second processing submodule 3062 is configured to perform echo cancellation processing on each to-be-processed audio signal when it is determined that the synchronization input signal exists.

The third processing submodule 3063 is configured to perform voice activity detection and noise estimation on each of the to-be-processed audio signals after the echo cancellation processing.

The fourth processing submodule 3064 is configured to, when it is determined that the synchronization input signal is not present, perform voice activity detection and noise estimation on each of the audio signals to be processed.

In one embodiment, as shown in fig. 3e, the obtaining module 301 includes a obtaining sub-module 3011 and a transforming sub-module 3012.

The obtaining sub-module 3011 is configured to obtain at least two original audio signals; the original audio signal is a signal output by the audio input module.

The transform submodule 3012 is configured to perform short-time fourier transform on each original audio signal to obtain the to-be-processed audio signal.

In one embodiment, as shown in FIG. 3f, the second processing submodule 3062 includes a processing unit 30621.

Wherein the processing unit 30621 is used for processing according to a formula

And

formula (II)

And performing echo cancellation processing on each audio signal to be processed.

Wherein y (t, m) represents the synchronous input signal collected by the mth microphone at the time t, s (t-l) represents the synchronous input signal at the time t-l, h_lIndicating the channel between the synchronous input signal to each microphone, L being an identifier in the accumulation operator, L indicating the length of time, h (t, m) [ h ]₀ h₁...h_L-1]Representing the channel between the synchronous input signal to the mth microphone at time t;

representing the error signal, mu the smoothing factor,

denotes the echo estimate of the mth microphone at time t, x (t, m) denotes the near-end signal of the mth microphone at time t, s (k, m) denotes [ s (k, m) s (k-1, m) … s (k-L +1, m)]Representing a vector of synchronous input signals, s^T(k, m) represents the transpose of s (k, m).

In one embodiment, as shown in FIG. 3g, the determination module 306 includes a detection sub-module 3065, a fifth processing sub-module 3066, and a second determination sub-module 3067.

Wherein the detection submodule 3065 is used for calculating a formula

And carrying out voice activity detection on each audio signal to be processed.

A fifth processing submodule 3066 for processing according to a formula

And carrying out noise estimation on each audio signal to be processed.

A second determination submodule 3067 for determining a formula

And

formula (II)

Determining the audio signal presence probability.

In one embodiment, as shown in fig. 3h, the calculation unit 30211 includes a first calculation subunit 302111 and a second calculation subunit 302112.

Wherein the first calculating subunit 302111 is configured to calculate the first calculation according to the formula

The second calculating subunit 302112, configured to calculate the formula

the weight value is represented by a weight value,

In one embodiment, as shown in fig. 3i, the second processing module 303 comprises a sixth processing submodule 3031.

Wherein the sixth processing submodule 3031 is configured to perform processing according to a formula

And carrying out beam forming processing on the audio signal to be processed.

Wherein R ═ E { X (t) X^T(t)}，

d(θ)＝[1e^{-jωδcosθ/c}...e^{-j(M-1)ωδcosθ/c}]^T，

The method can obtain the following results by a Lagrange multiplier method:

represents h_BFIs transposed matrix, subject to denotes such that

In one embodiment, as shown in FIG. 3j, the third processing module 304 includes a seventh processing submodule 3041.

Wherein the seventh processing submodule 3041 is configured to according to a formula

And the formula S (k, t) ═ h_NR(k) X (k, t) results in the target audio signal.

The embodiment of the present disclosure provides a voice processing apparatus, which performs direction-of-arrival estimation on any two microphones in a microphone array when acquiring a plurality of audio signals to be processed, performs beamforming processing on the audio signals to be processed according to the direction-of-arrival estimation and a beamforming algorithm, performs noise suppression on the audio signals to be processed after the beamforming processing, and finally outputs a target audio signal obtained after the noise reduction and suppression. Therefore, the method and the device have the advantages that the direction of arrival estimation is carried out on every two audio signals to be processed, and the noise suppression processing is carried out on the audio signals to be processed after the beam forming processing, so that the audio picking and enhancing functions are realized, and the accuracy of audio identification is improved.

Referring to fig. 4, an embodiment of the present disclosure further provides a speech processing apparatus, where the speech processing apparatus includes a receiver 401, a transmitter 402, a memory 403, and a processor 404, where the transmitter 402 and the memory 403 are respectively connected to the processor 404, the memory 403 stores at least one computer instruction, and the processor 404 is configured to load and execute the at least one computer instruction to implement the speech processing method described in the embodiment corresponding to fig. 1.

Based on the voice processing method described in the embodiment corresponding to fig. 1, an embodiment of the present disclosure further provides a computer-readable storage medium, for example, the non-transitory computer-readable storage medium may be a Read Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like. The storage medium stores computer instructions for executing the voice processing method described in the embodiment corresponding to fig. 1, which is not described herein again.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A method of speech processing, the method comprising:

and outputting the target audio signal.

2. The method of claim 1, wherein prior to the estimating the direction of arrival for any two microphones of the array of microphones, further comprising:

3. The method of claim 2, wherein the estimating direction of arrival for any two microphones in the microphone array according to the audio signal presence probability comprises:

calculating time delay estimation of any two microphones in the microphone array according to the existence probability of the audio signal;

and calculating the relative angle between the target sound source and the microphone array according to the time delay estimation result.

4. The method of claim 3, wherein the performing voice activity detection and noise estimation on each of the audio signals to be processed comprises:

determining whether there is a synchronous input signal;

5. The method of claim 1, wherein the obtaining at least two audio signals to be processed comprises:

6. The method of claim 4, wherein the performing echo cancellation processing on each of the audio signals to be processed comprises:

according to the formula

And

formula (II)

Performing echo cancellation processing on each audio signal to be processed;

wherein y (t, m) represents the m-thSynchronous input signal collected by microphone at t moment, s (t-l) represents synchronous input signal at t-l moment, h_lIndicating the channel between the synchronous input signal to each microphone, L being an identifier in the accumulation operator, L indicating the length of time, h (t, m) [ h ]₀h₁...h_L-1]Representing the channel between the synchronous input signal to the mth microphone at time t;

representing the error signal, mu the smoothing factor,

7. The method of claim 6, wherein performing voice activity detection and noise estimation on each of the audio signals to be processed, and determining the audio signal existence probability according to the results of the voice activity detection and noise estimation comprises:

according to the formula

Performing voice activity detection on each audio signal to be processed;

according to the formula

Performing noise estimation on each audio signal to be processed;

according to the formula

And

formula (II)

Determining the audio signal presence probability;

8. The method of claim 7, wherein calculating an estimate of time delay for any two microphones in the array of microphones based on the probability of existence of the audio signal, and wherein calculating a relative angle of a target sound source to the array of microphones based on the result of the estimate of time delay comprises:

according to the formula

according toFormula (II)

the weight value is represented by a weight value,

E{X(k,1)X^*(k,2) } denotes the expectation of the energy of the signal, θ denotes the direction of arrival, c denotes the speed of sound in air, and d denotes the distance between two microphones corresponding to two audio signals to be processed.

9. The method of claim 8, wherein the beamforming the audio signal to be processed according to the direction of arrival estimation and beamforming algorithm comprises:

according to the formula

subject to

Carrying out beam forming processing on the audio signal to be processed;

wherein R ═ E { X (t) X^T(t)}，

d(θ)＝[1e^{-jωδcosθ/c} ... e^{-j(M-1)ωδcosθ/c}]^T，

The method can obtain the following results by a Lagrange multiplier method:

represents h_BFIs transposed matrix, subject to denotes such that

10. The method of claim 9, wherein the performing noise suppression on the beamformed audio signal to be processed to obtain a target audio signal comprises:

according to the formula

11. A speech processing apparatus, comprising:

and the output module is used for outputting the target audio signal.