CN117292700A - Voice enhancement method and device for distributed wakeup and storage medium - Google Patents

Voice enhancement method and device for distributed wakeup and storage medium

Info

Publication number
CN117292700A
CN117292700A
Authority
CN
China
Prior art keywords
delay
beams
frequency domain
domain data
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210700223.3A
Other languages
Chinese (zh)
Inventor
邓邱伟
郝斌
王迪
张丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Qingdao Haier Intelligent Home Appliance Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Qingdao Haier Intelligent Home Appliance Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Qingdao Haier Intelligent Home Appliance Technology Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202210700223.3A priority Critical patent/CN117292700A/en
Priority to PCT/CN2023/085266 priority patent/WO2023246223A1/en
Publication of CN117292700A publication Critical patent/CN117292700A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/26: Speech to text systems
    • G10L 17/00: Speaker identification or verification
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Processing in the frequency domain
    • G10L 21/0264: Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166: Microphone arrays; Beamforming

Abstract

The application discloses a voice enhancement method and device for distributed wake-up and a storage medium, relating to the technical field of smart home. The voice enhancement method for distributed wake-up comprises the following steps: acquiring N frequency domain data corresponding to the N microphones of a microphone array, where the N frequency domain data are obtained by performing a Fourier transform on voice data received by the microphone array; determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions, to obtain M delay adding beams and M delay subtracting beams, where the N frequency domain data correspond to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and inputting the signal amplitudes of the M delay adding beams and the signal amplitudes of the M delay subtracting beams into a speech enhancement model, so as to enhance a target speech in the voice data through the speech enhancement model.

Description

Voice enhancement method and device for distributed wakeup and storage medium
Technical Field
The application relates to the technical field of smart home, in particular to a distributed wake-up voice enhancement method and device and a storage medium.
Background
With the development of technology, intelligent voice devices have gradually entered daily life; for such a device to hear sound, it must rely on a microphone array.
A microphone array is a system for sampling and processing the spatial characteristics of a sound field. It is composed of a certain number of acoustic sensors (e.g., microphones) arranged in a certain geometric structure, such as a linear, circular, or spherical geometry. When the microphones are arranged in a linear structure, the array is classified by microphone count into a two-microphone linear array, a four-microphone linear array, a six-microphone linear array, and so on. The microphone array performs space-time processing on the sound signals collected from different spatial directions, thereby realizing functions such as echo cancellation, dereverberation, noise reduction, and sound source separation, and in turn improving the processing quality of the speech signal.
In the related art, sound source separation for a microphone array can be realized by techniques such as beamforming and AuxIVA (auxiliary-function-based independent vector analysis). However, the beam of a four-microphone linear array has a wide low-frequency main lobe, resulting in poor speech signal processing quality, and AuxIVA on a four-microphone linear array is computationally expensive, making it difficult to meet real-time requirements.
For the problems in the related art, such as the wide low-frequency main lobe of the microphone array beam and poor speech signal processing quality, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the application provides a voice enhancement method and device for distributed wake-up and a storage medium, which are used for at least solving the problems of wider beam low-frequency main lobe, poor voice signal processing quality and the like of a microphone array in the related technology.
According to an embodiment of the present application, there is provided a voice enhancement method for distributed wake-up, including: acquiring N frequency domain data corresponding to N microphones of a microphone array; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array; determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2; the signal amplitudes of the M delay adding beams and the signal amplitudes of the M delay subtracting beams are input into a voice enhancement model so as to enhance target voice in the voice data through the voice enhancement model.
In an exemplary embodiment, determining the delay adding beam and the delay subtracting beam of the N frequency domain data in each of the M pickup directions, to obtain M delay adding beams and M delay subtracting beams, includes: for each of the M pickup directions, determining target frequency domain data according to the N frequency domain data, and determining a weight vector corresponding to the target frequency domain data according to a time delay between the N microphones in each pickup direction; and determining the M delay adding beams and the M delay subtracting beams according to the target frequency domain data and the weight vector corresponding to the target frequency domain data, wherein the target frequency domain data is used for indicating the array signals corresponding to the N frequency domain data.
In an exemplary embodiment, determining the target frequency domain data from the N frequency domain data includes: determining a corresponding first matrix of the N frequency domain data, wherein row information of the first matrix is used for indicating the N frequency domain data; and determining the target frequency domain data according to the first matrix.
In an exemplary embodiment, determining a weight vector corresponding to the target frequency domain data according to a time delay between the N microphones of the microphone array in each pickup direction includes: determining a time delay of each microphone in N microphones relative to a target microphone and determining a sub-weight vector corresponding to each microphone according to the time delay, wherein the target microphone is the microphone which receives the voice data first; determining a corresponding second matrix of the sub-weight vectors, wherein column information of the second matrix is used for indicating the sub-weight vector corresponding to each microphone; and determining a weight vector corresponding to the target frequency domain data according to the number N of the microphones in the microphone array and the second matrix.
In one exemplary embodiment, determining a time delay of each of the N microphones relative to the target microphone includes: determining the abscissa and the ordinate of any one of the N microphones on the coordinate axes; determining a first product of the abscissa and the cosine of the pickup direction of that microphone, and a second product of the ordinate and the sine of the pickup direction of that microphone; determining the time delay of that microphone relative to the target microphone according to the sound velocity, the first product, and the second product, wherein the coordinate point of the target microphone is the origin of the coordinate axes; and performing the determining steps in a loop until the time delay of each of the N microphones relative to the target microphone is determined.
In an exemplary embodiment, determining the M delay adding beams and the M delay subtracting beams according to the target frequency domain data and the weight vector corresponding to the target frequency domain data includes: determining the convolution result of the target frequency domain data and the weight vector corresponding to the target frequency domain data in each pickup direction, and determining the M delay adding beams according to the convolution results; and determining the complex conjugate of the convolution result of the target frequency domain data and the weight vector corresponding to the target frequency domain data in each pickup direction, and determining the M delay subtracting beams according to the complex conjugates.
In an exemplary embodiment, after inputting the signal amplitudes of the M delay adding beams and the signal amplitudes of the M delay subtracting beams into the speech enhancement model to enhance the target speech in the voice data through the speech enhancement model, in the case of M = 3, the method further includes: performing linear filtering on first voice enhancement data in a first pickup direction according to a first preset algorithm to obtain a voice enhancement result in the first pickup direction, wherein the first preset algorithm comprises: taking the delay adding beam in the first pickup direction as the main beam, and taking the voice enhancement results in the second and third pickup directions as noise parameters; performing linear filtering on second voice enhancement data in a second pickup direction according to a second preset algorithm to obtain a voice enhancement result in the second pickup direction, wherein the second preset algorithm comprises: taking the delay adding beam in the second pickup direction as the main beam, and taking the voice enhancement results in the first and third pickup directions as noise parameters; and performing linear filtering on third voice enhancement data in a third pickup direction according to a third preset algorithm to obtain a voice enhancement result in the third pickup direction, wherein the third preset algorithm comprises: taking the delay adding beam in the third pickup direction as the main beam, and taking the voice enhancement results in the first and second pickup directions as noise parameters; the first, second, and third voice enhancement data are output results of the speech enhancement model.
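The "preset algorithms" are not specified beyond linear filtering with a main beam and noise references; one common realisation of such filtering is an adaptive noise canceller. The NLMS sketch below is purely illustrative: the filter length, step size, and the NLMS choice itself are assumptions, not the patent's method.

```python
import numpy as np

def nlms_cancel(main, noise_refs, taps=8, mu=0.5, eps=1e-8):
    """Subtract from `main` the component linearly predictable from the
    noise references (normalized least-mean-squares adaptation).

    main: (T,) main-beam samples for one pickup direction.
    noise_refs: list of (T,) arrays, the enhancement results of the other
                pickup directions, used here as noise parameters.
    Returns: (T,) filtered output e[t] = main[t] - w . u[t].
    """
    refs = np.vstack(noise_refs)
    R, T = refs.shape
    # Prepend zeros so a full tapped-delay line exists at every t
    padded = np.concatenate([np.zeros((R, taps - 1)), refs], axis=1)
    w = np.zeros(R * taps)
    out = np.empty(T)
    for t in range(T):
        u = padded[:, t:t + taps][:, ::-1].ravel()  # newest sample first
        e = main[t] - w @ u                         # prediction error
        w += mu * e * u / (u @ u + eps)             # NLMS weight update
        out[t] = e
    return out

# Hypothetical usage: direction-1 result = main beam minus what the
# direction-2/3 enhancement results (enh_2, enh_3) can linearly predict:
#   out1 = nlms_cancel(b_sum_1, [enh_2, enh_3])
```

The design intuition is that interference leaking into one pickup direction correlates with the signals enhanced in the other directions, so an adaptive filter can estimate and remove it.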
According to another embodiment of the present application, there is further provided a voice enhancement apparatus for distributed wake-up, including: an acquisition module, configured to acquire N frequency domain data corresponding to the N microphones of a microphone array, where the N frequency domain data are obtained by performing a Fourier transform on voice data received by the microphone array; a determining module, configured to determine delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions, to obtain M delay adding beams and M delay subtracting beams, where the N frequency domain data correspond to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and an input module, configured to input the signal amplitudes of the M delay adding beams and the signal amplitudes of the M delay subtracting beams into a voice enhancement model, so as to enhance the target voice in the voice data through the voice enhancement model.
According to yet another aspect of the embodiments of the present application, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described distributed wake-up speech enhancement method when run.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above-mentioned voice enhancement method for distributed wake-up through the computer program.
In the embodiment of the application, N frequency domain data corresponding to the N microphones of a microphone array are acquired, where the N frequency domain data are obtained by performing a Fourier transform on the voice data received by the microphone array; delay adding beams and delay subtracting beams of the N frequency domain data are determined in each of M pickup directions, to obtain M delay adding beams and M delay subtracting beams, where the N frequency domain data correspond to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and the signal amplitudes of the M delay adding beams and the signal amplitudes of the M delay subtracting beams are input into a voice enhancement model, so as to enhance the target voice in the voice data through the voice enhancement model. This technical scheme addresses the problems that the low-frequency main lobe of the microphone array beam is wide and the speech signal processing quality is poor: the delay adding beams and delay subtracting beams in the M pickup directions serve as the input of the voice enhancement model, thereby improving the speech signal processing quality.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of a distributed wake-up speech enhancement method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of voice enhancement for distributed wakeup according to an embodiment of the present application;
FIG. 3 is a schematic diagram (one) of a voice enhancement method of distributed wakeup according to an embodiment of the present application;
FIG. 4 is a schematic diagram (II) of a voice enhancement method of distributed wakeup according to an embodiment of the present application;
FIG. 5 is a block diagram of a distributed wake-up voice enhancement apparatus according to an embodiment of the present application.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described in detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of the embodiments of the present application, a voice enhancement method for distributed wake-up is provided. The method is widely applicable to whole-house intelligent digital control scenarios such as the Smart Home, the smart home device ecosystem, and the intelligent house ecosystem. Optionally, in the present embodiment, the above voice enhancement method for distributed wake-up may be applied to a hardware environment formed by the terminal device 102 and the server 104 as shown in fig. 1. As shown in fig. 1, the server 104 is connected to the terminal device 102 through a network and may be used to provide services (such as application services) for the terminal or for a client installed on the terminal; a database may be set up on the server, or independently of it, to provide data storage services for the server 104; and cloud computing and/or edge computing services may be configured on the server, or independently of it, to provide data computing services for the server 104.
The network may include, but is not limited to, at least one of: wired network, wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network, and the wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity ), bluetooth. The terminal device 102 may not be limited to a PC, a mobile phone, a tablet computer, an intelligent air conditioner, an intelligent smoke machine, an intelligent refrigerator, an intelligent oven, an intelligent cooking range, an intelligent washing machine, an intelligent water heater, an intelligent washing device, an intelligent dish washer, an intelligent projection device, an intelligent television, an intelligent clothes hanger, an intelligent curtain, an intelligent video, an intelligent socket, an intelligent sound box, an intelligent fresh air device, an intelligent kitchen and toilet device, an intelligent bathroom device, an intelligent sweeping robot, an intelligent window cleaning robot, an intelligent mopping robot, an intelligent air purifying device, an intelligent steam box, an intelligent microwave oven, an intelligent kitchen appliance, an intelligent purifier, an intelligent water dispenser, an intelligent door lock, and the like.
In this embodiment, a method for enhancing voice in distributed wake-up is provided and applied to a terminal device, and fig. 2 is a flowchart of the method for enhancing voice in distributed wake-up according to an embodiment of the present application, where the flowchart includes the following steps:
Step S202, N frequency domain data corresponding to N microphones of a microphone array are obtained; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array;
It should be noted that the N microphones respectively receive voice data x_1(t), x_2(t), ..., x_N(t); performing a Fourier transform on the voice data received by each microphone yields the corresponding N frequency domain data X_1(f), X_2(f), ..., X_N(f).
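As a hedged illustration of this step (not the patent's implementation; the frame length, Hann window, and real FFT are assumptions), the per-microphone transform can be sketched with NumPy:

```python
import numpy as np

def mic_frames_to_freq(frames):
    """Convert one time-domain frame per microphone into frequency domain data.

    frames: array of shape (N, L) -- N microphones, L samples per frame
            (a Hann analysis window is assumed here for illustration).
    Returns: array of shape (N, L//2 + 1) of complex bins X_1(f) ... X_N(f).
    """
    frames = np.asarray(frames, dtype=float)
    window = np.hanning(frames.shape[1])          # assumed analysis window
    return np.fft.rfft(frames * window, axis=1)   # one spectrum per microphone

# Example: N = 4 microphones, 512-sample frames of random data
X = mic_frames_to_freq(np.random.randn(4, 512))
print(X.shape)  # (4, 257)
```

In practice this would run frame by frame over overlapping windows (short-time Fourier transform); the single-frame version above is the minimal form the description needs.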
Step S204, determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
step S206, inputting the signal amplitude of the M delay adding beams and the signal amplitude of the M delay subtracting beams into a voice enhancement model to enhance the target voice in the voice data through the voice enhancement model.
Through the steps, N frequency domain data corresponding to N microphones of the microphone array are obtained; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array; determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2; the signal amplitude of the M delay adding beams and the signal amplitude of the M delay subtracting beams are input into a voice enhancement model, so that target voice in voice data is enhanced through the voice enhancement model, the problems that in the related technology, the low-frequency main lobe of the beams of a microphone array is wider, the voice signal processing quality is poor and the like are solved, and furthermore, in the embodiment of the application, the delay adding beams and the delay subtracting beams in the M pickup directions are used as the input of the voice enhancement model, and the voice signal processing quality is enhanced.
Optionally, determining the delay adding beam and the delay subtracting beam of the N frequency domain data in each of the M pickup directions, to obtain M delay adding beams and M delay subtracting beams, includes: for each of M pickup directions, determining target frequency domain data according to the N frequency domain data, and determining a weight vector corresponding to the target frequency domain data according to a time delay between the N microphones in each pickup direction, wherein the target frequency domain data is used for indicating an array signal corresponding to the N frequency domain data; and determining the M delay adding beams and the M delay subtracting beams according to the target frequency domain data and the weight vectors corresponding to the target frequency domain data.
That is, the target frequency domain data is determined from the N frequency domain data, and in the case of determining the target frequency domain data and the weight vector corresponding to the target frequency domain data, M delay adding beams and M delay subtracting beams may be determined from the target frequency domain data and the weight vector.
Optionally, determining the target frequency domain data according to the N frequency domain data includes: determining a corresponding first matrix of the N frequency domain data, wherein row information of the first matrix is used for indicating the N frequency domain data; and determining the target frequency domain data according to the first matrix.
Specifically, the target frequency domain data X(f, θ) is determined according to the following formula:

X(f, θ) = [X_1(f), X_2(f), ..., X_N(f)]^T

where X_1(f), X_2(f), ..., X_N(f) respectively denote the N frequency domain data and f is the frequency.
The N frequency domain data are arranged in matrix form to obtain the target frequency domain data X(f, θ). For example, when N is 4, X(f, θ) = [X_1(f), X_2(f), X_3(f), X_4(f)]^T, where θ is the pickup direction. For example, when the microphone array is a four-microphone linear array, θ may be 30°, 90°, and 150°; when the microphone array is a two-microphone linear array, θ may be 45° and 135°.
Further, determining a weight vector corresponding to the target frequency domain data according to the time delay between the N microphones of the microphone array in each pickup direction includes: determining a time delay of each microphone in N microphones relative to a target microphone and determining a sub-weight vector corresponding to each microphone according to the time delay, wherein the target microphone is the microphone which receives the voice data first; determining a corresponding second matrix of the sub-weight vectors, wherein column information of the second matrix is used for indicating the sub-weight vector corresponding to each microphone; and determining a weight vector corresponding to the target frequency domain data according to the number N of the microphones in the microphone array and the second matrix.
Optionally, determining the time delay of each of the N microphones relative to the target microphone includes: determining the abscissa and the ordinate of any one of the N microphones on the coordinate axes; determining a first product of the abscissa and the cosine of the pickup direction of that microphone, and a second product of the ordinate and the sine of the pickup direction of that microphone; determining the time delay of that microphone relative to the target microphone according to the sound velocity, the first product, and the second product, wherein the coordinate point of the target microphone is the origin of the coordinate axes; and performing the determining steps in a loop until the time delay of each of the N microphones relative to the target microphone is determined.
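The looped delay computation above can be sketched as follows. The coordinates, spacing, and the value c = 343 m/s for the speed of sound are illustrative assumptions, not values from the patent:

```python
import math

C_SOUND = 343.0  # speed of sound in m/s (assumed)

def mic_delays(coords, theta_deg, c=C_SOUND):
    """Delay of each microphone relative to the target microphone at the origin.

    coords: list of (a, b) coordinates per microphone; coords[0] is the
            target microphone and must be (0, 0).
    theta_deg: pickup direction in degrees.
    Returns: list of delays tau_n1 = (a*cos(theta) + b*sin(theta)) / c,
             i.e. (first product + second product) / sound velocity.
    """
    theta = math.radians(theta_deg)
    return [(a * math.cos(theta) + b * math.sin(theta)) / c for a, b in coords]

# Four-microphone linear array on the x-axis with 3 cm spacing (assumed)
coords = [(0.0, 0.0), (0.03, 0.0), (0.06, 0.0), (0.09, 0.0)]
delays = mic_delays(coords, 90.0)
print(delays[0])  # 0.0: the target microphone has zero delay
```

For a linear array on the x-axis, the ordinate term vanishes and broadside pickup (θ = 90°) gives essentially zero delay for every microphone, which matches the geometric intuition.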
Specifically, the weight vector d(f, θ) corresponding to the target frequency domain data is determined according to the following formula:

d(f, θ) = (1/N)·[1, e^(−j2πf·τ21), e^(−j2πf·τ31), …, e^(−j2πf·τN1)]^T

where τ21 is the delay of microphone 2 relative to microphone 1, τ31 is the delay of microphone 3 relative to microphone 1, and τN1 is the delay of microphone N relative to microphone 1, with

τN1 = (aN·cos θ + bN·sin θ)/c

where θ is each pickup direction, c is the sound velocity, microphone 1 is the target microphone, aN is the abscissa of microphone N on the coordinate axes, and bN is the ordinate of microphone N on the coordinate axes.
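A minimal NumPy sketch of building this weight vector for one direction; the frequency grid and the delay values are illustrative, and the normalisation by the microphone count N follows the text's statement that the weight vector is determined from N and the sub-weight vectors (`weight_vector` is a hypothetical helper name):

```python
import numpy as np

def weight_vector(freqs, taus, N):
    """Weight (steering) vector d(f, theta) for one pickup direction.
    taus[n] is the delay of microphone n+1 relative to microphone 1
    (so taus[0] = 0); one row per frequency bin, one column per mic."""
    taus = np.asarray(taus)                    # shape (N,)
    f = np.asarray(freqs)[:, None]             # shape (F, 1)
    return np.exp(-2j * np.pi * f * taus) / N  # shape (F, N)

freqs = np.fft.rfftfreq(512, d=1 / 16000)      # 257 bins, as in the text
taus = [0.0, 1.2e-4, 2.4e-4, 3.6e-4]           # illustrative delays (s)
d = weight_vector(freqs, taus, N=4)
```

Each entry has magnitude 1/N, so the vector only applies a per-microphone phase shift that time-aligns the channels for the chosen direction.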
For example, in the case where the microphone array is a four-microphone linear array, the weight vectors d(f, θ) corresponding to the target frequency domain data in the 30°, 90° and 150° directions are obtained by substituting θ = 30°, θ = 90° and θ = 150°, respectively, into the above formula.
optionally, determining the M delay adding beams and the M delay subtracting beams according to the target frequency domain data and the weight vectors corresponding to the target frequency domain data includes: determining convolution results of the target frequency domain data and the weight vectors corresponding to the target frequency domain data in each pickup direction, and determining the M delay adding beams according to the convolution results; and determining the complex conjugates of the convolution results of the target frequency domain data and the weight vectors corresponding to the target frequency domain data in each pickup direction, and determining the M delay subtracting beams according to the complex conjugates.
Specifically, determining the M delay adding beams and M delay subtracting beams according to X(f, θ) and the weight vectors corresponding to X(f, θ) includes: the delay adding beam b_sum(f, θ) corresponding to each pickup direction θ among the M delay adding beams is determined by the following formula: b_sum(f, θ) = d(f, θ) * X(f, θ); the delay subtracting beam b_sub(f, θ) corresponding to each pickup direction θ among the M delay subtracting beams is determined by the following formula: b_sub(f, θ) = conj[d(f, θ)] * X(f, θ).
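A sketch of forming the two beams for one direction; reading the product d(f, θ) * X(f, θ) as a per-frequency-bin inner product over the N channels is an assumption (the text calls it a convolution), and all data here are random placeholders:

```python
import numpy as np

def beams(d, X):
    """Delay adding and delay subtracting beams for one direction.
    d: (F, N) weight vector; X: (N, F) per-channel spectra.
    b_sum = d * X, b_sub = conj(d) * X, combined per frequency bin."""
    b_sum = np.sum(d.T * X, axis=0)           # shape (F,)
    b_sub = np.sum(np.conj(d).T * X, axis=0)  # shape (F,)
    return b_sum, b_sub

rng = np.random.default_rng(0)
F, N = 257, 4
d = np.exp(-2j * np.pi * rng.random((F, N))) / N        # random steering
X = rng.standard_normal((N, F)) + 1j * rng.standard_normal((N, F))
b_sum, b_sub = beams(d, X)
```

The absolute values of these complex beams then give the real-valued signal amplitudes that are fed to the speech enhancement model.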
In an exemplary embodiment, in a case where M=3, after inputting the signal amplitudes of the M delay adding beams and the signal amplitudes of the M delay subtracting beams to a speech enhancement model to enhance a target speech in the speech data by the speech enhancement model, the method further includes: performing linear filtering on first voice enhancement data in a first pickup direction according to a first preset algorithm to obtain a voice enhancement result in the first pickup direction, wherein the first preset algorithm includes: taking the delay adding beam in the first pickup direction as the main beam, and taking the voice enhancement results in the second and third pickup directions as noise parameters; performing linear filtering on second voice enhancement data in a second pickup direction according to a second preset algorithm to obtain a voice enhancement result in the second pickup direction, wherein the second preset algorithm includes: taking the delay adding beam in the second pickup direction as the main beam, and taking the voice enhancement results in the first and third pickup directions as noise parameters; and performing linear filtering on third voice enhancement data in a third pickup direction according to a third preset algorithm to obtain a voice enhancement result in the third pickup direction, wherein the third preset algorithm includes: taking the delay adding beam in the third pickup direction as the main beam, and taking the voice enhancement results in the first and second pickup directions as noise parameters; the first, second and third voice enhancement data are output results of the speech enhancement model.
It should be noted that, to further enhance the noise reduction effect, microphone array signal processing generally includes two parts: beam forming and post-filtering. After the voice enhancement data are obtained, they are filtered, generally using a normalized least mean squares (NLMS) or recursive least squares (RLS) method. Specifically, when linear filtering is performed on the first voice enhancement data in the first pickup direction, the delay adding beam in the first pickup direction is taken as the main beam, the voice enhancement results in the second and third pickup directions are taken as noise parameters, and the first voice enhancement data are linearly filtered using the NLMS or RLS method; when linear filtering is performed on the second voice enhancement data in the second pickup direction, the delay adding beam in the second pickup direction is taken as the main beam, the voice enhancement results in the first and third pickup directions are taken as noise parameters, and the second voice enhancement data are linearly filtered using the NLMS or RLS method; when linear filtering is performed on the third voice enhancement data in the third pickup direction, the delay adding beam in the third pickup direction is taken as the main beam, the voice enhancement results in the first and second pickup directions are taken as noise parameters, and the third voice enhancement data are linearly filtered using the NLMS or RLS method.
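An NLMS post-filter along these lines can be sketched as below; the filter order, step size and the way the noise references are combined are illustrative choices, not specified by the application (`nlms` is a hypothetical helper name):

```python
import numpy as np

def nlms(main, noise_refs, order=8, mu=0.5, eps=1e-8):
    """NLMS post-filter sketch: adaptively estimate the noise present in
    the main beam from the noise-reference beams, then subtract it.
    main: (T,) main-beam samples; noise_refs: list of (T,) references."""
    refs = np.stack(noise_refs)              # (R, T)
    R, T = refs.shape
    w = np.zeros((R, order))                 # one tapped filter per reference
    out = np.zeros(T)
    for t in range(T):
        # Tapped-delay-line input from each reference channel
        u = np.stack([refs[r, max(0, t - order + 1):t + 1] for r in range(R)])
        u = np.pad(u, ((0, 0), (order - u.shape[1], 0)))
        noise_est = np.sum(w * u)
        e = main[t] - noise_est              # enhanced (noise-reduced) sample
        out[t] = e
        w += mu * e * u / (np.sum(u * u) + eps)  # normalized LMS update
    return out

rng = np.random.default_rng(1)
noise = rng.standard_normal(2000)
cleaned = nlms(noise, [noise], order=4, mu=0.5)  # reference = the noise itself
```

When the references correlate with the residual noise in the main beam, the filter converges and the output error shrinks, which is the sense in which post-filtering reduces distortion left by the beamformer and mask.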
In order to better understand the process of the above-mentioned distributed wake-up voice enhancement method, the following describes the implementation method flow of the above-mentioned distributed wake-up voice enhancement method with reference to an optional embodiment, but is not limited to the technical solution of the embodiment of the present application.
In this embodiment, a voice enhancement method for distributed wake-up is provided, where the microphone array is a four-microphone linear array, and fig. 3 is a schematic diagram (one) of the voice enhancement method for distributed wake-up according to an embodiment of the present application, as shown in fig. 3, which is a beam direction of the four-microphone linear array.
Taking a microphone array as a four-microphone linear array as an example, fig. 4 is a schematic diagram (two) of a voice enhancement method for distributed wake-up according to an embodiment of the present application, as shown in fig. 4, specifically including the following steps:
step S401: x1, X2, X3, X4 are subjected to short-time fourier transform (STFT) to obtain frequency domain data X1, X2, X3, X4, where X1, X2, X3, X4 respectively represent time domain data (corresponding to voice data in the above embodiment) of four microphones;
step S402: a delay adding beam SumN and a delay subtracting beam SubN are respectively formed towards the 30°, 90° and 150° directions, where Y1 represents the concatenation of the 30° beam results, i.e. Y1 = [Sum1, Sub1]; Y2 represents the concatenation of the 90° beam results, i.e. Y2 = [Sum2, Sub2]; Y3 represents the concatenation of the 150° beam results, i.e. Y3 = [Sum3, Sub3].
The target frequency domain data X (f, θ) is determined according to the following formula:
X(f, θ) = [X1(f), X2(f), X3(f), X4(f)]^T
where X1(f), X2(f), X3(f), X4(f) respectively represent the 4 frequency domain data, and f is the frequency corresponding to the target frequency domain data;
determining the weight vector d(f, θ) corresponding to the target frequency domain data according to the following formula:
d(f, θ) = (1/4)·[1, e^(−j2πf·τ21), e^(−j2πf·τ31), e^(−j2πf·τ41)]^T
where τ21 is the delay of microphone 2 relative to microphone 1, τ31 is the delay of microphone 3 relative to microphone 1, τ41 is the delay of microphone 4 relative to microphone 1, τN1 = (aN·cos θ + bN·sin θ)/c, aN is the abscissa of microphone N on the coordinate axes, and bN is the ordinate of microphone N on the coordinate axes.
b_sum(f, θ) = d(f, θ) * X(f, θ); b_sub(f, θ) = conj[d(f, θ)] * X(f, θ), where b_sum represents the (complex) delay adding beam and b_sub represents the (complex) delay subtracting beam.
Sum1 = abs[b_sum(θ = 30°)];
Sum2 = abs[b_sum(θ = 90°)];
Sum3 = abs[b_sum(θ = 150°)]. Note that each SumN is a real-valued array.
Step S403: taking absolute values of Y1, Y2 and Y3;
step S404: constructing a multidimensional array as the input of the voice enhancement model, and outputting the masks Mask1, Mask2 and Mask3 for the delay adding beams;
in the present embodiment, the sampling rate of 16000, the stft frame length of 512, the frame shift of 256, and the window length of hanning window of 512 are used. Y1 length 257×2=514, Y length 514×3=1542. The speech enhancement model uses 2 sets of stacked TCNs, each set containing two sets of TCN blocks, convolution kernels 3,dilation rate{1,2,5, 9. Three full-connection layers Fc are connected, and a sigmoid activation function is selected.
The model structure of the speech enhancement model selects stacked TCN blocks; the outermost layer has three outputs, namely the mask values of the delay adding beams corresponding to the 30°, 90° and 150° directions, each lying in [0, 1]. Mask1 represents the mask value of b_sum(θ=30°). Because of spectral leakage, the delay adding beam result contains noise from other directions, and Mask1 can suppress the noise from these directions. Out1 = b_sum(θ=30°)·Mask1 is the result of the 30° direction enhancement; Out2 = b_sum(θ=90°)·Mask2 is the result of the 90° direction enhancement; Out3 = b_sum(θ=150°)·Mask3 is the result of the 150° direction enhancement.
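The masking step above can be sketched as follows; interpreting Out1 = b_sum(θ=30°)·Mask1 as an element-wise product per frequency bin is the natural reading, and all data below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 257  # frequency bins for a 512-point STFT

# Complex delay adding beam for the 30° direction (random placeholder)
b_sum_30 = rng.standard_normal(F) + 1j * rng.standard_normal(F)

# Mask values lie in [0, 1]; a sigmoid output guarantees this
mask1 = 1.0 / (1.0 + np.exp(-rng.standard_normal(F)))

# Out1 = b_sum(θ=30°) · Mask1: element-wise suppression per frequency bin
out1 = b_sum_30 * mask1
```

Because every mask value is at most 1, masking can only attenuate each bin, never amplify it, which is how it suppresses leakage from the other directions.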
Step S405: determining a loss function of the speech enhancement model;
wherein the loss function selects the MSE loss, and T1, T2 and T3 respectively represent the delay adding beam results for each direction computed on the clean speech of the target sound source only, serving as the noise-free training targets.
After the speech enhancement signals Out1, Out2 and Out3 are obtained, they are input into the post-processing module for filtering: taking Sum1 as the main beam and Out2 and Out3 as noise references, filtering with the NLMS or RLS method yields the 30° output result; taking Sum2 as the main beam and Out1 and Out3 as noise references yields the 90° output result; and taking Sum3 as the main beam and Out1 and Out2 as noise references yields the 150° output result.
In the embodiment of the invention, the delay adding and delay subtracting beams in the three directions are spliced together as the input of the model, so that more spatial information can be learned than when only the delay adding beams are used as input, which also aids convergence. Meanwhile, the stacked TCN structure can process time series and has a wider receptive field than an LSTM. Compared with directly taking the model output as the final result, using the post-processing module to linearly filter the estimated noise component can effectively reduce voice distortion and improve voice quality, wake-up rate and recognition rate.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present application.
FIG. 5 is a block diagram of a voice enhancement apparatus for distributed wake-up according to an embodiment of the present application; as shown in fig. 5, the apparatus includes:
an acquiring module 52, configured to acquire N frequency domain data corresponding to N microphones of the microphone array; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array;
a determining module 54, configured to determine delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions, to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
an input module 56, configured to input the signal amplitudes of the M delay-added beams and the signal amplitudes of the M delay-subtracted beams to a speech enhancement model, so as to enhance the target speech in the speech data through the speech enhancement model.
Acquiring N frequency domain data corresponding to N microphones of the microphone array through the device; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array; determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2; the signal amplitude of the M delay adding beams and the signal amplitude of the M delay subtracting beams are input into a voice enhancement model, so that target voice in voice data is enhanced through the voice enhancement model, the problems that in the related technology, the low-frequency main lobe of the beams of a microphone array is wider, the voice signal processing quality is poor and the like are solved, and furthermore, in the embodiment of the application, the delay adding beams and the delay subtracting beams in the M pickup directions are used as the input of the voice enhancement model, and the voice signal processing quality is enhanced.
In an exemplary embodiment, the determining module 54 is configured to determine, for each of M pickup directions, target frequency domain data according to the N frequency domain data, and determine a weight vector corresponding to the target frequency domain data according to a time delay between the N microphones in each of the M pickup directions, where the target frequency domain data is used to indicate an array signal corresponding to the N frequency domain data; and determining the M delay adding beams and the M delay subtracting beams according to the target frequency domain data and the weight vectors corresponding to the target frequency domain data.
In an exemplary embodiment, the determining module 54 is configured to determine a corresponding first matrix of the N frequency domain data, where row information of the first matrix is used to indicate the N frequency domain data; and determining the target frequency domain data according to the first matrix.
In an exemplary embodiment, the determining module 54 is configured to determine a time delay of each of the N microphones relative to a target microphone, and determine a sub-weight vector corresponding to each microphone according to the time delay, where the target microphone is a microphone that receives the voice data first; determining a corresponding second matrix of the sub-weight vectors, wherein column information of the second matrix is used for indicating the sub-weight vector corresponding to each microphone; and determining a weight vector corresponding to the target frequency domain data according to the number N of the microphones in the microphone array and the second matrix.
In one exemplary embodiment, the determining module 54 is configured to perform the following determining step: determining the abscissa and the ordinate of any one of the N microphones on the coordinate axes; determining a first product of the abscissa and the cosine of the pickup direction of the arbitrary microphone, and a second product of the ordinate and the sine of the pickup direction of the arbitrary microphone; determining the time delay of the arbitrary microphone relative to the target microphone according to the sound velocity, the first product and the second product; and performing the determining step in a loop until the time delay of each of the N microphones relative to the target microphone is determined.
In an exemplary embodiment, the determining module 54 is configured to determine convolution results of the target frequency domain data and the weight vectors corresponding to the target frequency domain data in each pickup direction, and determine the M delay adding beams according to the convolution results; and determine the complex conjugates of the convolution results of the target frequency domain data and the weight vectors corresponding to the target frequency domain data in each pickup direction, and determine the M delay subtracting beams according to the complex conjugates.
In an exemplary embodiment, in the case where M=3, the determining module 54 is configured to perform linear filtering on first voice enhancement data in a first pickup direction according to a first preset algorithm to obtain a voice enhancement result in the first pickup direction, wherein the first preset algorithm includes: taking the delay adding beam in the first pickup direction as the main beam, and taking the voice enhancement results in the second and third pickup directions as noise parameters; perform linear filtering on second voice enhancement data in a second pickup direction according to a second preset algorithm to obtain a voice enhancement result in the second pickup direction, wherein the second preset algorithm includes: taking the delay adding beam in the second pickup direction as the main beam, and taking the voice enhancement results in the first and third pickup directions as noise parameters; and perform linear filtering on third voice enhancement data in a third pickup direction according to a third preset algorithm to obtain a voice enhancement result in the third pickup direction, wherein the third preset algorithm includes: taking the delay adding beam in the third pickup direction as the main beam, and taking the voice enhancement results in the first and second pickup directions as noise parameters; the first, second and third voice enhancement data are output results of the speech enhancement model.
Embodiments of the present application also provide a storage medium including a stored program, wherein the program performs the method of any one of the above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for performing the steps of:
s1, N frequency domain data corresponding to N microphones of a microphone array are obtained; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array;
s2, determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
s3, inputting the signal amplitude of the M delay adding beams and the signal amplitude of the M delay subtracting beams into a voice enhancement model so as to enhance target voice in the voice data through the voice enhancement model.
Embodiments of the present application also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, N frequency domain data corresponding to N microphones of a microphone array are obtained; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array;
s2, determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
s3, inputting the signal amplitude of the M delay adding beams and the signal amplitude of the M delay subtracting beams into a voice enhancement model so as to enhance target voice in the voice data through the voice enhancement model.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices, and they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices and, in some cases, the steps shown or described may be performed in a different order than shown or described here, or they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application and are intended to be comprehended within the scope of the present application.

Claims (10)

1. A method of voice enhancement for distributed wakeup, comprising:
acquiring N frequency domain data corresponding to N microphones of a microphone array; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array;
determining delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
the signal amplitudes of the M delay adding beams and the signal amplitudes of the M delay subtracting beams are input into a voice enhancement model so as to enhance target voice in the voice data through the voice enhancement model.
2. The method of claim 1, wherein determining the delay-added and delay-subtracted beams for the N frequency-domain data in each of the M pickup directions results in M delay-added and M delay-subtracted beams, comprising:
for each of M pickup directions, determining target frequency domain data according to the N frequency domain data, and determining a weight vector corresponding to the target frequency domain data according to a time delay between the N microphones of the microphone array in each pickup direction, wherein the target frequency domain data is used for indicating an array signal corresponding to the N frequency domain data;
and determining the M delay adding beams and the M delay subtracting beams according to the target frequency domain data and the weight vectors corresponding to the target frequency domain data.
3. The method of distributed wake-up speech enhancement according to claim 2, wherein determining the target frequency domain data from the N frequency domain data comprises:
determining a corresponding first matrix of the N frequency domain data, wherein row information of the first matrix is used for indicating the N frequency domain data;
And determining the target frequency domain data according to the first matrix.
4. The method of claim 2, wherein determining the weight vector corresponding to the target frequency domain data based on the time delays between the N microphones of the microphone array in each pickup direction, comprises:
determining a time delay of each microphone in the N microphones relative to a target microphone and determining a sub-weight vector corresponding to each microphone according to the time delay, wherein the target microphone is the microphone which receives the voice data first;
determining a corresponding second matrix of the sub-weight vectors, wherein column information of the second matrix is used for indicating the sub-weight vector corresponding to each microphone;
and determining a weight vector corresponding to the target frequency domain data according to the number N of the microphones in the microphone array and the second matrix.
5. The method of distributed wake-up speech enhancement of claim 4, wherein determining a delay of each of the N microphones relative to the target microphone comprises:
performing a determining step: determining the abscissa of any one of the N microphones on the coordinate axis and the ordinate of any one of the N microphones on the coordinate axis; determining a first product of the abscissa and a cosine value of a pickup direction of the arbitrary microphone, and a second product of the ordinate and a sine value of a pickup direction of the arbitrary microphone; determining the time delay of any microphone relative to a target microphone according to the sound velocity, the first product and the second product, wherein a coordinate point of the target microphone is an origin of the coordinate axis;
The determining step is performed in a loop until a time delay of each of the N microphones with respect to the target microphone is determined.
6. The method of claim 2, wherein determining the M delay-added beams and the M delay-subtracted beams from the target frequency-domain data and weight vectors corresponding to the target frequency-domain data comprises:
determining convolution results of the target frequency domain data and weight vectors corresponding to the target frequency domain data in each pickup direction, and determining the M delay adding beams according to the convolution results;
and determining conjugate complex numbers of convolution results of the target frequency domain data and weight vectors corresponding to the target frequency domain data in each pickup direction, and determining M delay subtraction beams according to the conjugate complex numbers.
7. The distributed wake-up speech enhancement method according to claim 1, wherein, in the case of m=3, after inputting the signal amplitudes of the M delay-added beams and the signal amplitudes of the M delay-subtracted beams to a speech enhancement model to enhance a target speech in the speech data by the speech enhancement model, the method further comprises:
Performing linear filtering on first voice enhancement data in a first pickup direction according to a first preset algorithm to obtain a voice enhancement result in the first pickup direction, wherein the first preset algorithm comprises: taking the delay added beam in the first pickup direction as a main beam, and taking the voice enhancement results in the second pickup direction and the third pickup direction as noise parameters; and
performing linear filtering on second voice enhancement data in a second pickup direction according to a second preset algorithm to obtain a voice enhancement result in the second pickup direction, wherein the second preset algorithm comprises: taking the delay added beam in the second pickup direction as a main beam, and taking the voice enhancement results in the first pickup direction and the third pickup direction as noise parameters; and
performing linear filtering on third voice enhancement data in a third pickup direction according to a third preset algorithm to obtain a voice enhancement result in the third pickup direction, wherein the third preset algorithm comprises: taking the delay added beam in the third pickup direction as a main beam, and taking the voice enhancement results in the first pickup direction and the second pickup direction as noise parameters;
The first voice enhancement data, the second voice enhancement data and the third voice enhancement data are output results of the voice enhancement model.
8. A voice enhancement apparatus for distributed wakeup, comprising:
the acquisition module is used for acquiring N frequency domain data corresponding to N microphones of the microphone array; the N frequency domain data are obtained by carrying out Fourier transform on voice data received by the microphone array;
a determining module, configured to determine delay adding beams and delay subtracting beams of the N frequency domain data in each of M pickup directions, to obtain M delay adding beams and M delay subtracting beams; the N frequency domain data are respectively corresponding to delay adding beams and delay subtracting beams in different pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
and the input module is used for inputting the signal amplitude of the M delay adding beams and the signal amplitude of the M delay subtracting beams into a voice enhancement model so as to enhance target voice in the voice data through the voice enhancement model.
9. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run performs the method of any of the preceding claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202210700223.3A 2022-06-20 2022-06-20 Voice enhancement method and device for distributed wakeup and storage medium Pending CN117292700A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210700223.3A CN117292700A (en) 2022-06-20 2022-06-20 Voice enhancement method and device for distributed wakeup and storage medium
PCT/CN2023/085266 WO2023246223A1 (en) 2022-06-20 2023-03-30 Speech enhancement method and apparatus for distributed wake-up, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210700223.3A CN117292700A (en) 2022-06-20 2022-06-20 Voice enhancement method and device for distributed wakeup and storage medium

Publications (1)

Publication Number Publication Date
CN117292700A true CN117292700A (en) 2023-12-26

Family

ID=89243192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210700223.3A Pending CN117292700A (en) 2022-06-20 2022-06-20 Voice enhancement method and device for distributed wakeup and storage medium

Country Status (2)

Country Link
CN (1) CN117292700A (en)
WO (1) WO2023246223A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020121545A1 (en) * 2018-12-14 2020-06-18 日本電信電話株式会社 Signal processing device, signal processing method, and program
CN113393856B (en) * 2020-03-11 2024-01-16 华为技术有限公司 Pickup method and device and electronic equipment
CN112712818A (en) * 2020-12-29 2021-04-27 苏州科达科技股份有限公司 Voice enhancement method, device and equipment
CN113050035B (en) * 2021-03-12 2022-11-25 云知声智能科技股份有限公司 Two-dimensional directional pickup method and device

Also Published As

Publication number Publication date
WO2023246223A1 (en) 2023-12-28

Similar Documents

Publication Publication Date Title
TWI661684B (en) Method and apparatus for adaptive beam forming
CN109102822B (en) Filtering method and device based on fixed beam forming
JP6196320B2 (en) Filter and method for infomed spatial filtering using multiple instantaneous arrival direction estimates
CN106710601A (en) Voice signal de-noising and pickup processing method and apparatus, and refrigerator
CN103856866B (en) Low noise differential microphone array
CN105792074B (en) A kind of audio signal processing method and device
TW200404477A (en) System and method for automatic room acoustic correction in multi-channel audio environments
Koutrouvelis et al. A low-cost robust distributed linearly constrained beamformer for wireless acoustic sensor networks with arbitrary topology
CN111128210A (en) Audio signal processing with acoustic echo cancellation
Gil-Cacho et al. Wiener variable step size and gradient spectral variance smoothing for double-talk-robust acoustic echo cancellation and acoustic feedback cancellation
CN106031196A (en) Signal-processing device, method, and program
JP2000035788A (en) Multiple channel adaptive filtering
Schwartz et al. Nested generalized sidelobe canceller for joint dereverberation and noise reduction
WO2023231552A1 (en) Distributed voice wake-up method and apparatus, storage medium, and electronic apparatus
CN117292700A (en) Voice enhancement method and device for distributed wakeup and storage medium
Markovich-Golan et al. Low-complexity addition or removal of sensors/constraints in LCMV beamformers
Tran et al. Adaptive feedback control using improved variable step-size affine projection algorithm for hearing aids
JPS5972295A (en) Multipoint sound receiving device
Guo et al. Distributed node-specific block-diagonal LCMV beamforming in wireless acoustic sensor networks
US11195540B2 (en) Methods and apparatus for an adaptive blocking matrix
Vairetti Efficient parametric modeling, identification and equalization of room acoustics
Szurley et al. Improved tracking performance for distributed node-specific signal enhancement inwireless acoustic sensor networks
Ruiz et al. Distributed combined acoustic echo cancellation and noise reduction using GEVD-based distributed adaptive node specific signal estimation with prior knowledge
Hioka et al. Estimating power spectral density for spatial audio signal separation: An effective approach for practical applications
Shin et al. Prefiltering Approach of Adaptive Eigenvalue Decomposition Method for an Improved Time Delay Estimation in Room Environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination