WO2023246223A1 - Speech enhancement method and apparatus for distributed wake-up, and storage medium - Google Patents

Speech enhancement method and apparatus for distributed wake-up, and storage medium

Info

Publication number
WO2023246223A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequency domain
domain data
beams
speech enhancement
microphone
Prior art date
Application number
PCT/CN2023/085266
Other languages
French (fr)
Chinese (zh)
Inventor
邓邱伟
郝斌
王迪
张丽
Original Assignee
青岛海尔科技有限公司
青岛海尔智能家电科技有限公司
海尔智家股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 青岛海尔科技有限公司, 青岛海尔智能家电科技有限公司, 海尔智家股份有限公司 filed Critical 青岛海尔科技有限公司
Publication of WO2023246223A1 publication Critical patent/WO2023246223A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A speech enhancement method and apparatus for distributed wake-up, and a storage medium, which relate to the technical field of smart homes. The method comprises: acquiring N pieces of frequency-domain data corresponding to N microphones of a microphone array, wherein the N pieces of frequency-domain data are obtained by means of performing a Fourier transform on speech data which is received by the microphone array (S202); determining a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each sound pickup direction among M sound pickup directions, so as to obtain M delayed addition beams and M delayed subtraction beams, wherein the N pieces of frequency-domain data all correspond to delayed addition beams and delayed subtraction beams in different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2 (S204); and inputting the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model (S206).

Description

Speech enhancement method and apparatus for distributed wake-up, and storage medium
The present disclosure claims priority to the Chinese patent application No. 202210700223.3, filed with the Chinese Patent Office on June 20, 2022 and entitled "Speech Enhancement Method and Apparatus for Distributed Wake-up, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of smart homes, and in particular to a speech enhancement method and apparatus for distributed wake-up, and a storage medium.
Background
With the development of technology, intelligent voice devices have gradually entered daily life. For an intelligent voice device to hear sound, it needs to rely on a microphone array.
In the related art, sound source separation for a microphone array can be implemented with techniques such as beamforming and AuxIva. However, the low-frequency main lobe of the beams of a four-microphone linear array is wide, so the quality of speech signal processing is poor, and AuxIva on a four-microphone linear array is computationally expensive, making it difficult to meet real-time requirements.
No effective solution has yet been proposed for the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor.
Summary
Embodiments of the present disclosure provide a speech enhancement method and apparatus for distributed wake-up, and a storage medium, so as to at least solve the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor.
According to an embodiment of the present disclosure, a speech enhancement method for distributed wake-up is provided, including: acquiring N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array; determining a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and inputting the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
According to another embodiment of the present disclosure, a speech enhancement apparatus for distributed wake-up is further provided, including: an acquisition module, configured to acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array; a determination module, configured to determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and an input module, configured to input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program, where the computer program is configured to perform the above speech enhancement method for distributed wake-up when run.
According to yet another aspect of the embodiments of the present disclosure, an electronic apparatus is further provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor performs the above speech enhancement method for distributed wake-up by means of the computer program.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that a person of ordinary skill in the art can still derive other drawings from these drawings without creative effort.
Figure 1 is a schematic diagram of a hardware environment of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 2 is a flowchart of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 3 is a schematic diagram (1) of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 4 is a schematic diagram (2) of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 5 is a structural block diagram of a speech enhancement apparatus for distributed wake-up according to an embodiment of the present disclosure;
Figure 6 is a structural block diagram of an optional electronic apparatus according to an embodiment of the present disclosure.
Detailed Description of the Embodiments
To enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings of the embodiments. Apparently, the described embodiments are only some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings of the present disclosure are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
According to one aspect of the embodiments of the present disclosure, a speech enhancement method for distributed wake-up is provided. The method is widely applicable to whole-house intelligent digital control application scenarios such as Smart Home, smart home appliances, smart home device ecosystems, and Intelligence House ecosystems. Optionally, in this embodiment, the above speech enhancement method for distributed wake-up can be applied to a hardware environment composed of a terminal device 102 and a server 104 as shown in Figure 1. As shown in Figure 1, the server 104 is connected to the terminal device 102 through a network and can be used to provide services (such as application services) for the terminal or a client installed on the terminal. A database can be set up on the server or independently of the server to provide data storage services for the server 104, and cloud computing and/or edge computing services can be configured on the server or independently of the server to provide data computing services for the server 104.
The above network may include, but is not limited to, at least one of the following: a wired network and a wireless network. The wired network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, and a local area network; the wireless network may include, but is not limited to, at least one of the following: WIFI (Wireless Fidelity) and Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart stove, a smart washing machine, a smart water heater, a smart washing device, a smart dishwasher, a smart projection device, a smart TV, a smart clothes-drying rack, smart curtains, smart audio-video equipment, a smart socket, smart audio equipment, a smart speaker, smart fresh-air equipment, smart kitchen and bathroom equipment, smart bathroom fixtures, a smart sweeping robot, a smart window-cleaning robot, a smart mopping robot, a smart air purifier, a smart steamer, a smart microwave oven, a smart kitchen appliance, a smart purifier, a smart water dispenser, a smart door lock, and the like.
This embodiment provides a speech enhancement method for distributed wake-up, applied to a terminal device. Figure 2 is a flowchart of the speech enhancement method for distributed wake-up according to an embodiment of the present disclosure, and the flow includes the following steps:
Step S202: acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array.
It should be noted that the N microphones respectively receive speech data x1, x2, x3, ..., xn, and the speech data x1(t), x2(t), x3(t), ..., xn(t) respectively received by the N microphones are Fourier-transformed to obtain the corresponding N pieces of frequency-domain data X1(f), X2(f), X3(f), ..., XN(f).
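As a minimal sketch of this step (the frame length, hop size and variable names here are illustrative assumptions rather than values taken from this paragraph; the concrete STFT parameters of the example embodiment appear later in the text), the per-microphone frequency-domain data can be computed as follows:

```python
import numpy as np

def stft_per_mic(x, frame_len=512, hop=256):
    """Short-time Fourier transform of one microphone signal x_n(t).

    Returns an array of shape (num_frames, frame_len // 2 + 1) holding the
    frequency-domain data X_n(f) for each frame (a Hann analysis window is assumed).
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, axis=-1)

def mic_array_spectra(mic_signals):
    """mic_signals: list of N equally long 1-D arrays x_1 ... x_N.

    Returns the N pieces of frequency-domain data stacked as (N, frames, bins).
    """
    return np.stack([stft_per_mic(x) for x in mic_signals])
```

The stacked output of mic_array_spectra() is then the raw material from which the target frequency-domain data and the beams of the following steps are built.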
Step S204: determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2.
Step S206: input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
Through the above steps, N pieces of frequency-domain data corresponding to the N microphones of the microphone array are acquired, the delayed addition beam and the delayed subtraction beam of the N pieces of frequency-domain data in each of the M sound pickup directions are determined to obtain M delayed addition beams and M delayed subtraction beams, and the signal amplitudes of the M delayed addition beams and of the M delayed subtraction beams are input into the speech enhancement model, so that the target speech in the speech data is enhanced by the speech enhancement model. This solves the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor: in the embodiments of the present disclosure, the delayed addition beams and delayed subtraction beams of the M sound pickup directions are used as the input of the speech enhancement model, which improves the quality of speech signal processing.
Optionally, determining the delayed addition beam and the delayed subtraction beam of the N pieces of frequency-domain data in each of the M sound pickup directions to obtain the M delayed addition beams and the M delayed subtraction beams includes: for each of the M sound pickup directions, determining target frequency-domain data according to the N pieces of frequency-domain data, and determining a weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones in that sound pickup direction, where the target frequency-domain data is used to indicate the array signal corresponding to the N pieces of frequency-domain data; and determining the M delayed addition beams and the M delayed subtraction beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
That is, the target frequency-domain data is determined from the N pieces of frequency-domain data. Once the target frequency-domain data and the weight vector corresponding to the target frequency-domain data are determined, the M delayed addition beams and the M delayed subtraction beams can be determined according to the target frequency-domain data and the weight vector.
Optionally, determining the target frequency-domain data according to the N pieces of frequency-domain data includes: determining a first matrix corresponding to the N pieces of frequency-domain data, where the row information of the first matrix is used to indicate the N pieces of frequency-domain data; and determining the target frequency-domain data according to the first matrix.
Specifically, the target frequency-domain data X(f, θ) is determined according to the following formula:
X(f, θ) = [X1(f), X2(f), X3(f), ..., XN(f)]^T;
where X1(f), X2(f), X3(f), ..., XN(f) respectively denote the N pieces of frequency-domain data, and f is the frequency.
The N pieces of frequency-domain data are arranged in matrix form to obtain the target frequency-domain data X(f, θ). For example, when N is 4, X(f, θ) = [X1(f), X2(f), X3(f), X4(f)]^T, where θ is the sound pickup direction. For example, when the microphone array is a four-microphone linear array, θ may specifically be 30°, 90° and 150°; when the microphone array is a two-microphone linear array, θ may specifically be 45° and 135°.
Further, determining the weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones of the microphone array in each sound pickup direction includes: determining the time delay of each of the N microphones relative to a target microphone and determining a sub-weight vector corresponding to each microphone according to the time delay, where the target microphone is the microphone that first receives the speech data; determining a second matrix corresponding to the sub-weight vectors, where the column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and determining the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
Optionally, determining the time delay of each of the N microphones relative to the target microphone includes: determining the abscissa of any one of the N microphones on the coordinate axes and the ordinate of that microphone on the coordinate axes; determining a first product of the abscissa and the cosine of the sound pickup direction of that microphone, and a second product of the ordinate and the sine of the sound pickup direction of that microphone; determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product, where the coordinate point of the target microphone is the origin of the coordinate axes; and performing the above determination steps cyclically until the time delay of each of the N microphones relative to the target microphone is determined.
Specifically, the weight vector d(f, θ) corresponding to the target frequency-domain data is determined according to the corresponding formula (shown as an image in the original publication and not reproduced in this text), where τ21 is the time delay of microphone 2 relative to microphone 1, τ31 is the time delay of microphone 3 relative to microphone 1, and τN1 is the time delay of microphone N relative to microphone 1, with τN1 = (aN·cosθ + bN·sinθ)/c, where θ is the sound pickup direction, c is the speed of sound, microphone 1 is the target microphone, aN is the abscissa of microphone N on the coordinate axes, and bN is the ordinate of microphone N on the coordinate axes.
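Because the weight-vector matrix itself is not reproduced in the extracted text, the sketch below reconstructs a standard delay-and-sum steering vector that is consistent with the surrounding definitions (the per-microphone delays relative to microphone 1 and the division by the number of microphones N); the phase convention and the speed-of-sound value are assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s, assumed value for c

def mic_delays(coords, theta):
    """Delay tau_n1 of every microphone relative to microphone 1.

    coords: (N, 2) array of microphone coordinates (a_n, b_n), with microphone 1
            at the origin of the coordinate axes (so tau_11 == 0).
    theta:  sound pickup direction in radians.
    Follows the description above: (a_n * cos(theta) + b_n * sin(theta)) / c.
    """
    a, b = coords[:, 0], coords[:, 1]
    return (a * np.cos(theta) + b * np.sin(theta)) / SPEED_OF_SOUND

def steering_vector(coords, theta, f):
    """Assumed delay-and-sum weight vector d(f, theta).

    Reconstructed as (1/N) * [1, e^{-j2*pi*f*tau_21}, ..., e^{-j2*pi*f*tau_N1}]^T;
    the sign of the phase term is a convention choice not fixed by the text.
    """
    tau = mic_delays(coords, theta)
    return np.exp(-2j * np.pi * f * tau) / len(coords)
```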
For example, when the microphone array is a four-microphone linear array, corresponding weight vectors d(f, θ) are obtained in this way for the 30° direction, the 90° direction and the 150° direction (the three resulting weight-vector formulas appear as images in the original publication).
Optionally, determining the M delayed addition beams and the M delayed subtraction beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data includes: determining the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data, and determining the M delayed addition beams according to the convolution result; and determining the complex conjugate of the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data, and determining the M delayed subtraction beams according to the complex conjugate result.
Specifically, determining the M delayed addition beams and the M delayed subtraction beams according to X(f, θ) and the weight vector corresponding to X(f, θ) includes: for the M delayed addition beams, the delayed addition beam bsum(f, θ) corresponding to each sound pickup direction θ is determined by the formula bsum(f, θ) = d(f, θ)*X(f, θ); and for the M delayed subtraction beams, the delayed subtraction beam bsub(f, θ) corresponding to each sound pickup direction θ is determined by the formula bsub(f, θ) = conj[d(f, θ)]*X(f, θ).
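A small sketch of this step, interpreting the products d(f, θ)*X(f, θ) and conj[d(f, θ)]*X(f, θ) as inner products over the N microphone channels (an assumption about the notation; shapes and names are illustrative):

```python
import numpy as np

def delay_beams(X_frame, d_all):
    """Delayed addition and delayed subtraction beams for one STFT frame.

    X_frame: (N, bins) complex array; row n holds X_n(f) for this frame.
    d_all:   (bins, N) array of weight vectors d(f, theta), one per frequency bin
             (for example produced by the steering-vector sketch above).
    """
    b_sum = np.einsum('fn,nf->f', d_all, X_frame)           # d(f,theta) * X(f,theta)
    b_sub = np.einsum('fn,nf->f', np.conj(d_all), X_frame)  # conj[d(f,theta)] * X(f,theta)
    return b_sum, b_sub

# The model input uses the signal amplitudes, e.g. Sum = np.abs(b_sum).
```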
In an exemplary embodiment, in the case where M = 3, after the signal amplitudes of the M delayed addition beams and of the M delayed subtraction beams are input into the speech enhancement model so that the target speech in the speech data is enhanced by the speech enhancement model, the method further includes: performing linear filtering on first speech enhancement data in a first sound pickup direction according to a first preset algorithm to obtain a speech enhancement result in the first sound pickup direction, where the first preset algorithm includes taking the delayed addition beam of the first sound pickup direction as the main beam and taking the speech enhancement results in the second and third sound pickup directions as noise parameters; performing linear filtering on second speech enhancement data in the second sound pickup direction according to a second preset algorithm to obtain a speech enhancement result in the second sound pickup direction, where the second preset algorithm includes taking the delayed addition beam of the second sound pickup direction as the main beam and taking the speech enhancement results in the first and third sound pickup directions as noise parameters; and performing linear filtering on third speech enhancement data in the third sound pickup direction according to a third preset algorithm to obtain a speech enhancement result in the third sound pickup direction, where the third preset algorithm includes taking the delayed addition beam of the third sound pickup direction as the main beam and taking the speech enhancement results in the first and second sound pickup directions as noise parameters; the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
It should be noted that microphone array signal processing usually consists of two parts: beamforming and post-filtering. To further improve the noise reduction effect, after the speech enhancement data is obtained, the speech enhancement data is filtered, generally with the NLMS or RLS method. Specifically, when linear filtering is performed on the first speech enhancement data in the first sound pickup direction, the delayed addition beam of the first sound pickup direction is taken as the main beam, the speech enhancement results in the second and third sound pickup directions are taken as noise parameters, and the first speech enhancement data is linearly filtered with the NLMS or RLS method; when linear filtering is performed on the second speech enhancement data in the second sound pickup direction, the delayed addition beam of the second sound pickup direction is taken as the main beam, the speech enhancement results in the first and third sound pickup directions are taken as noise parameters, and the second speech enhancement data is linearly filtered with the NLMS or RLS method; and when linear filtering is performed on the third speech enhancement data in the third sound pickup direction, the delayed addition beam of the third sound pickup direction is taken as the main beam, the speech enhancement results in the first and second sound pickup directions are taken as noise parameters, and the third speech enhancement data is linearly filtered with the NLMS or RLS method.
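The text names NLMS or RLS without giving the filter details, so the following is a generic NLMS adaptive noise canceller sketch rather than the patented post-filter; the tap count, step size and real-valued signal model are assumptions:

```python
import numpy as np

def nlms_cancel(main, refs, taps=8, mu=0.1, eps=1e-8):
    """Normalized LMS noise canceller (a generic sketch, not the disclosed filter).

    main: (T,) samples of the main-beam signal.
    refs: (R, T) noise-reference signals (the enhanced outputs of the other directions).
    Returns e[t] = main[t] - w^T u[t], where u stacks the last `taps` samples of
    every reference and the weights w are adapted sample by sample.
    """
    R, T = refs.shape
    w = np.zeros(R * taps)
    out = np.zeros(T)
    for t in range(T):
        # build the stacked reference vector u[t], zero-padded at the start
        u = np.zeros(R * taps)
        for r in range(R):
            past = refs[r, max(0, t - taps + 1): t + 1][::-1]
            u[r * taps: r * taps + len(past)] = past
        y = w @ u                         # estimated noise component
        e = main[t] - y                   # error = filtered (enhanced) output
        w += mu * e * u / (u @ u + eps)   # NLMS weight update
        out[t] = e
    return out
```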
To better understand the process of the above speech enhancement method for distributed wake-up, the implementation flow of the method is described below with reference to optional embodiments, which are not intended to limit the technical solutions of the embodiments of the present disclosure.
This embodiment provides a speech enhancement method for distributed wake-up in which the microphone array is a four-microphone linear array. Figure 3 is a schematic diagram (1) of the speech enhancement method for distributed wake-up according to an embodiment of the present disclosure; as shown in Figure 3, it illustrates the beam directions of the four-microphone linear array.
Taking a four-microphone linear array as an example of the microphone array, Figure 4 is a schematic diagram (2) of the speech enhancement method for distributed wake-up according to an embodiment of the present disclosure. As shown in Figure 4, the specific steps are as follows:
Step S401: x1, x2, x3 and x4 are subjected to a short-time Fourier transform (STFT) to obtain frequency-domain data X1, X2, X3 and X4, where x1, x2, x3 and x4 respectively denote the time-domain data of the four microphones (equivalent to the speech data in the above embodiments).
Step S402: delayed addition beams SumN and delayed subtraction beams SubN are formed in the 30°, 90° and 150° directions respectively, where Y1 denotes the concatenation of the 30° beam results, i.e. Y1 = [Sum1, Sub1]; Y2 denotes the concatenation of the 90° beam results, i.e. Y2 = [Sum2, Sub2]; and Y3 denotes the concatenation of the 150° beam results, i.e. Y3 = [Sum3, Sub3].
It should be noted that the target frequency-domain data X(f, θ) is determined according to the following formula:
X(f, θ) = [X1(f), X2(f), X3(f), X4(f)]^T;
where X1(f), X2(f), X3(f) and X4(f) respectively denote the four pieces of frequency-domain data, and f is the frequency corresponding to the target frequency-domain data.
The weight vector d(f, θ) corresponding to the target frequency-domain data is determined according to the corresponding formula (shown as an image in the original publication), where τ21 is the time delay of microphone 2 relative to microphone 1, τ31 is the time delay of microphone 3 relative to microphone 1, τ41 is the time delay of microphone 4 relative to microphone 1, aN is the abscissa of the microphone on the coordinate axes, and bN is the ordinate of the microphone on the coordinate axes.
bsum(f, θ) = d(f, θ)*X(f, θ); bsub(f, θ) = conj[d(f, θ)]*X(f, θ), where bsum denotes the delayed addition beam (complex-valued) and bsub denotes the delayed subtraction beam (complex-valued).
Sum1 = abs[bsum(θ = 30°)];
Sum2 = abs[bsum(θ = 90°)];
Sum3 = abs[bsum(θ = 150°)]. It should be noted that SumN is an array of real numbers.
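Putting steps S401 and S402 and the magnitude values above together, one frame of the network input for the four-microphone example could be assembled as below; the helper names, the precomputed steering vectors and the 257-bin size (which follows from the 512-point frame mentioned later) are assumptions:

```python
import numpy as np

PICKUP_DIRECTIONS = (30.0, 90.0, 150.0)   # degrees, as in the example above

def frame_features(X_frame, steering):
    """Network input for one STFT frame of the four-microphone linear array.

    X_frame:  (4, 257) complex array holding X1..X4 for this frame
              (257 bins assumed, matching a 512-point frame).
    steering: dict mapping direction in degrees -> (257, 4) weight vectors d(f, theta).
    Returns Y = [|Sum1|, |Sub1|, |Sum2|, |Sub2|, |Sum3|, |Sub3|], length 6 * 257 = 1542.
    """
    parts = []
    for theta in PICKUP_DIRECTIONS:
        d = steering[theta]
        b_sum = np.einsum('fn,nf->f', d, X_frame)            # delayed addition beam
        b_sub = np.einsum('fn,nf->f', np.conj(d), X_frame)   # delayed subtraction beam
        parts.extend([np.abs(b_sum), np.abs(b_sub)])         # SumN and SubN magnitudes
    return np.concatenate(parts)
```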
Step S403: take the absolute values of Y1, Y2 and Y3.
Step S404: construct a multi-dimensional array as the input of the speech enhancement model, and output Mask1, Mask2 and Mask3 for the delayed addition beams.
It should be noted that in this example of the present disclosure, a sampling rate of 16000 is used, the STFT frame length is 512, the frame shift is 256, and the Hanning window length is 512. The length of Y1 is 257*2 = 514, and the length of Y is 514*3 = 1542. The speech enhancement model uses two groups of stacked TCNs, each group containing two sets of TCN blocks, with a convolution kernel of 3 and dilation rates {1, 2, 5, 9}, followed by three fully connected layers Fc with a sigmoid activation function.
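One way to realize a model with these hyper-parameters is sketched below in PyTorch; the block wiring, channel count, normalization and the reading of "three fully connected layers" as three parallel sigmoid output heads (one mask per direction) are assumptions, since the text does not fix these details:

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """A simple dilated 1-D convolution block with a residual connection."""
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # keep the frame axis length
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=pad)
        self.act = nn.PReLU()

    def forward(self, x):                                # x: (batch, channels, frames)
        return x + self.act(self.conv(x))

class MaskEstimator(nn.Module):
    """Sketch of the mask network: stacked TCN blocks plus three sigmoid heads."""
    def __init__(self, in_dim=1542, channels=256, bins=257,
                 dilations=(1, 2, 5, 9), repeats=2):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, 1)       # per-frame feature projection
        blocks = [TCNBlock(channels, d) for _ in range(repeats) for d in dilations]
        self.tcn = nn.Sequential(*blocks)
        # one head per direction: masks for the 30, 90 and 150 degree addition beams
        self.heads = nn.ModuleList([nn.Linear(channels, bins) for _ in range(3)])

    def forward(self, y):                                # y: (batch, frames, 1542)
        h = self.tcn(self.proj(y.transpose(1, 2))).transpose(1, 2)
        return [torch.sigmoid(head(h)) for head in self.heads]  # each (batch, frames, 257)
```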
The model structure of the speech enhancement model uses stacked TCN blocks, and the outermost layer has three outputs, corresponding to mask values in [0, 1] for the delayed addition beams of the 30°, 90° and 150° directions. Mask1 denotes the masking value of bsum(θ = 30°). Because of spectral leakage, the delayed addition beam result contains noise from other directions, and Mask1 can suppress the noise from those directions. Out1 = bsum(θ = 30°)*Mask1 denotes the enhanced result for the 30° direction; Out2 = bsum(θ = 90°)*Mask2 denotes the enhanced result for the 90° direction; and Out3 = bsum(θ = 150°)*Mask3 denotes the enhanced result for the 150° direction.
Step S405: determine the loss function of the speech enhancement model.
The loss function is the MSE loss. T1, T2 and T3 respectively denote, for each direction, the delayed addition beam result when there is only the clean speech of the target sound source / the beam result of the noisy speech when noise is present.
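A minimal sketch of such a training loss, assuming the three per-direction MSE terms are simply summed (the text only states that MSE loss is used with T1, T2 and T3 as targets):

```python
import torch
import torch.nn.functional as F

def mask_loss(outs, targets):
    """MSE training loss over the three directions.

    outs:    [Out1, Out2, Out3] masked delayed-addition beam results (tensors).
    targets: [T1, T2, T3] corresponding clean-speech beam results for the same directions.
    Summing the three per-direction terms is an assumption made for this sketch.
    """
    return sum(F.mse_loss(o, t) for o, t in zip(outs, targets))
```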
It should be noted that after the speech enhancement signals Out1, Out2 and Out3 are obtained, they are input into a post-processing module to be filtered: with Sum1 as the main beam and Out2 and Out3 as noise references, filtering with the NLMS or RLS method gives the 30° output result; with Sum2 as the main beam and Out1 and Out3 as noise references, the 90° output result is obtained; and with Sum3 as the main beam and Out1 and Out2 as noise references, the 150° output result is obtained.
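The per-direction wiring of this post-processing can be sketched as follows, reusing an adaptive canceller such as the NLMS sketch shown earlier; names and shapes are illustrative:

```python
import numpy as np

def post_filter_all(sums, outs, canceller):
    """Post-processing wiring for the three directions.

    sums:      {30: Sum1, 90: Sum2, 150: Sum3} delayed-addition main beams.
    outs:      {30: Out1, 90: Out2, 150: Out3} model-enhanced signals.
    canceller: an adaptive filter called as canceller(main, refs), for example the
               nlms_cancel() sketch shown earlier (an RLS filter would be an alternative).
    """
    results = {}
    for theta in (30, 90, 150):
        refs = np.stack([outs[other] for other in (30, 90, 150) if other != theta])
        results[theta] = canceller(sums[theta], refs)   # e.g. 30: main Sum1, refs Out2, Out3
    return results
```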
In the embodiments of the present disclosure, the delayed addition and delayed subtraction beams of the three directions are concatenated as the input of the model. Compared with using only the delayed addition beams as the input, the model can learn more spatial information, which is conducive to convergence. Meanwhile, the stacked-TCN structure can process time series and has a wider receptive field than an LSTM. Compared with directly taking the model output as the final result, linearly filtering the estimated noise components with the post-processing module can effectively reduce speech distortion and helps improve the speech quality, wake-up rate and recognition rate.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods of the embodiments of the present disclosure.
Figure 5 is a structural block diagram of a speech enhancement apparatus for distributed wake-up according to an embodiment of the present disclosure. As shown in Figure 5, the apparatus includes:
an acquisition module 52, configured to acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;
a determination module 54, configured to determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and
an input module 56, configured to input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
With the above apparatus, N pieces of frequency-domain data corresponding to the N microphones of the microphone array are acquired, the delayed addition beam and the delayed subtraction beam of the N pieces of frequency-domain data in each of the M sound pickup directions are determined to obtain M delayed addition beams and M delayed subtraction beams, and the signal amplitudes of the M delayed addition beams and of the M delayed subtraction beams are input into the speech enhancement model so that the target speech in the speech data is enhanced by the speech enhancement model. This solves the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor: in the embodiments of the present disclosure, the delayed addition beams and delayed subtraction beams of the M sound pickup directions are used as the input of the speech enhancement model, which improves the quality of speech signal processing.
In an exemplary embodiment, the determination module 54 is configured to: for each of the M sound pickup directions, determine target frequency-domain data according to the N pieces of frequency-domain data, and determine a weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones in that sound pickup direction, where the target frequency-domain data is used to indicate the array signal corresponding to the N pieces of frequency-domain data; and determine the M delayed addition beams and the M delayed subtraction beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
In an exemplary embodiment, the determination module 54 is configured to determine a first matrix corresponding to the N pieces of frequency-domain data, where the row information of the first matrix is used to indicate the N pieces of frequency-domain data, and to determine the target frequency-domain data according to the first matrix.
In an exemplary embodiment, the determination module 54 is configured to: determine the time delay of each of the N microphones relative to a target microphone and determine a sub-weight vector corresponding to each microphone according to the time delay, where the target microphone is the microphone that first receives the speech data; determine a second matrix corresponding to the sub-weight vectors, where the column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and determine the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
In an exemplary embodiment, the determination module 54 is configured to perform the following determination steps: determining first frequency-domain data corresponding to any one of the N microphones and second frequency-domain data corresponding to the target microphone; determining a first product of the first frequency-domain data and the cosine of the sound pickup direction of that microphone, and a second product of the second frequency-domain data and the sine of the sound pickup direction of that microphone; determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product; and performing the determination steps cyclically until the time delay of each of the N microphones relative to the target microphone is determined.
In an exemplary embodiment, the determination module 54 is configured to determine the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data and determine the M delayed addition beams according to the convolution result, and to determine the complex conjugate of the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data and determine the M delayed subtraction beams according to the complex conjugate result.
In an exemplary embodiment, in the case where M = 3, the determination module 54 is configured to: perform linear filtering on first speech enhancement data in a first sound pickup direction according to a first preset algorithm to obtain a speech enhancement result in the first sound pickup direction, where the first preset algorithm includes taking the delayed addition beam of the first sound pickup direction as the main beam and taking the speech enhancement results in the second and third sound pickup directions as noise parameters; perform linear filtering on second speech enhancement data in the second sound pickup direction according to a second preset algorithm to obtain a speech enhancement result in the second sound pickup direction, where the second preset algorithm includes taking the delayed addition beam of the second sound pickup direction as the main beam and taking the speech enhancement results in the first and third sound pickup directions as noise parameters; and perform linear filtering on third speech enhancement data in the third sound pickup direction according to a third preset algorithm to obtain a speech enhancement result in the third sound pickup direction, where the third preset algorithm includes taking the delayed addition beam of the third sound pickup direction as the main beam and taking the speech enhancement results in the first and second sound pickup directions as noise parameters; the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
Embodiments of the present disclosure further provide a storage medium, which includes a stored program, where the program performs any one of the above methods when run.
Optionally, in this embodiment, the above storage medium may be configured to store program code for performing the following steps:
S1: acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;
S2: determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
S3: input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
根据本公开实施例的又一个方面,还提供了一种用于实施上述语义转换方法的电子装置,如图6所示,该电子装置包括存储器602和处理器604,该存储器602中存储有计算机程序,该处理器604被设置为通过计算机程序执行上述任一项方法实施例中的步骤。According to yet another aspect of the embodiment of the present disclosure, an electronic device for implementing the above semantic conversion method is also provided. As shown in Figure 6, the electronic device includes a memory 602 and a processor 604. The memory 602 stores a computer Program, the processor 604 is configured to execute the steps in any of the above method embodiments through a computer program.
可选地,在本实施例中,上述电子装置可以位于计算机网络的多个网络设备中的至少一个网络设备。Optionally, in this embodiment, the above-mentioned electronic device may be located in at least one network device among multiple network devices of the computer network.
可选地,在本实施例中,上述处理器可以被设置为通过计算机程序执行以下步骤:Optionally, in this embodiment, the above-mentioned processor may be configured to perform the following steps through a computer program:
S1: obtain N pieces of frequency-domain data corresponding to the N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on the speech data received by the microphone array;

S2: determine, for each of M pick-up directions, a delay-and-sum beam and a delay-and-subtract beam of the N pieces of frequency-domain data, to obtain M delay-and-sum beams and M delay-and-subtract beams, where the N pieces of frequency-domain data have a corresponding delay-and-sum beam and delay-and-subtract beam in each pick-up direction, N is an integer greater than 3, and M is an integer greater than 2;

S3: input the signal magnitudes of the M delay-and-sum beams and the M delay-and-subtract beams into a speech enhancement model, so that the target speech in the speech data is enhanced by the speech enhancement model.
Optionally, those of ordinary skill in the art will understand that the structure shown in Figure 6 is merely illustrative; the electronic device may also be a terminal device such as a smartphone (e.g., an Android or iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. Figure 6 does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components than shown in Figure 6 (such as a network interface), or may have a configuration different from that shown in Figure 6.
The memory 602 may be used to store software programs and modules, such as the program instructions/modules corresponding to the speech enhancement method and apparatus for distributed wake-up in the embodiments of the present disclosure. The processor 604 runs the software programs and modules stored in the memory 602 to perform various functional applications and data processing, that is, to implement the above speech enhancement method. The memory 602 may include a high-speed random access memory, and may further include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 602 may further include memory located remotely from the processor 604, and such remote memory may be connected to the terminal through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof. As an example, as shown in Figure 6, the memory 602 may include, but is not limited to, the acquisition module 52, the determination module 54, and the input module 56 of the above speech enhancement apparatus for distributed wake-up; it may further include other module units of that apparatus, which are not described again in this example.
Optionally, the transmission device 606 is configured to receive or send data via a network. Specific examples of the network may include wired networks and wireless networks. In one example, the transmission device 606 includes a network interface controller (NIC), which can be connected to other network devices and a router through a network cable so as to communicate with the Internet or a local area network. In another example, the transmission device 606 is a radio frequency (RF) module, which communicates with the Internet wirelessly.

In addition, the electronic device further includes: a display 608, configured to display the frequency-domain data; and a connection bus 610, configured to connect the module components of the electronic device.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the above modules or steps of the present disclosure may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described here, or they may each be made into an individual integrated circuit module, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present disclosure is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present disclosure. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present disclosure, and such improvements and refinements shall also fall within the protection scope of the present disclosure.

Claims (16)

  1. A speech enhancement method for distributed wake-up, comprising:

    obtaining N pieces of frequency-domain data corresponding to N microphones of a microphone array, wherein the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;

    determining, for each of M pick-up directions, a delay-and-sum beam and a delay-and-subtract beam of the N pieces of frequency-domain data, to obtain M delay-and-sum beams and M delay-and-subtract beams, wherein the N pieces of frequency-domain data have a corresponding delay-and-sum beam and delay-and-subtract beam in each pick-up direction, N is an integer greater than 3, and M is an integer greater than 2; and

    inputting signal magnitudes of the M delay-and-sum beams and signal magnitudes of the M delay-and-subtract beams into a speech enhancement model, so as to enhance target speech in the speech data through the speech enhancement model.
  2. The speech enhancement method for distributed wake-up according to claim 1, wherein determining, for each of the M pick-up directions, the delay-and-sum beam and the delay-and-subtract beam of the N pieces of frequency-domain data to obtain the M delay-and-sum beams and the M delay-and-subtract beams comprises:

    for each of the M pick-up directions, determining target frequency-domain data according to the N pieces of frequency-domain data, and determining a weight vector corresponding to the target frequency-domain data according to time delays between the N microphones of the microphone array in that pick-up direction, wherein the target frequency-domain data is used to indicate an array signal corresponding to the N pieces of frequency-domain data; and

    determining the M delay-and-sum beams and the M delay-and-subtract beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
  3. The speech enhancement method for distributed wake-up according to claim 2, wherein determining the target frequency-domain data according to the N pieces of frequency-domain data comprises:

    determining a first matrix corresponding to the N pieces of frequency-domain data, wherein row information of the first matrix is used to indicate the N pieces of frequency-domain data; and

    determining the target frequency-domain data according to the first matrix.
  4. The speech enhancement method for distributed wake-up according to claim 2, wherein determining the weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones of the microphone array in each pick-up direction comprises:

    determining a time delay of each of the N microphones relative to a target microphone, and determining a sub-weight vector corresponding to each microphone according to the time delay, wherein the target microphone is the microphone that first receives the speech data;

    determining a second matrix corresponding to the sub-weight vectors, wherein column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and

    determining the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
  5. The speech enhancement method for distributed wake-up according to claim 4, wherein determining the time delay of each of the N microphones relative to the target microphone comprises:

    a determining step: determining an abscissa of any one of the N microphones on a coordinate axis and an ordinate of that microphone on the coordinate axis; determining a first product of the abscissa and a cosine of the pick-up direction of that microphone, and a second product of the ordinate and a sine of the pick-up direction of that microphone; and determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product, wherein the coordinate point of the target microphone is the origin of the coordinate axes; and

    performing the determining step in a loop until the time delay of each of the N microphones relative to the target microphone is determined.
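Written out under the geometry of claim 5 (target microphone at the origin of the coordinate axes, pick-up direction θ, speed of sound c), the time delay of the n-th microphone with coordinates (x_n, y_n) can be read as

$$\tau_n = \frac{x_n\cos\theta + y_n\sin\theta}{c}.$$

This is only one natural reading: the claim names the two products and the speed of sound but does not fix the exact way they are combined.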
  6. The speech enhancement method for distributed wake-up according to claim 2, wherein determining the M delay-and-sum beams and the M delay-and-subtract beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data comprises:

    determining, for each pick-up direction, a convolution result of the target frequency-domain data and the weight vector corresponding to the target frequency-domain data, and determining the M delay-and-sum beams according to the convolution result; and

    determining, for each pick-up direction, a complex conjugate of the convolution result of the target frequency-domain data and the weight vector corresponding to the target frequency-domain data, and determining the M delay-and-subtract beams according to the complex conjugate.
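In symbols, one possible reading of claim 6 is the following, where X(f) denotes the target frequency-domain data, w_m(f) the weight vector of the m-th pick-up direction, and the claimed "convolution result" is taken to be the frequency-domain weighted combination of the array signal (an assumption; the claim does not spell the operation out):

$$Y_m^{+}(f) = \mathbf{w}_m(f)^{H}\,\mathbf{X}(f), \qquad Y_m^{-}(f)\ \text{is obtained from}\ \bigl(Y_m^{+}(f)\bigr)^{*},$$

with $Y_m^{+}$ the delay-and-sum beam and $Y_m^{-}$ the delay-and-subtract beam of that direction.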
  7. The speech enhancement method for distributed wake-up according to claim 1, wherein, when M = 3, after inputting the signal magnitudes of the M delay-and-sum beams and the signal magnitudes of the M delay-and-subtract beams into the speech enhancement model to enhance the target speech in the speech data through the speech enhancement model, the method further comprises:

    performing linear filtering on first speech enhancement data in a first pick-up direction according to a first preset algorithm to obtain a speech enhancement result in the first pick-up direction, wherein the first preset algorithm comprises: using the delay-and-sum beam of the first pick-up direction as a main beam, and using the speech enhancement results in a second pick-up direction and a third pick-up direction as noise parameters;

    performing linear filtering on second speech enhancement data in the second pick-up direction according to a second preset algorithm to obtain a speech enhancement result in the second pick-up direction, wherein the second preset algorithm comprises: using the delay-and-sum beam of the second pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the third pick-up direction as noise parameters; and

    performing linear filtering on third speech enhancement data in the third pick-up direction according to a third preset algorithm to obtain a speech enhancement result in the third pick-up direction, wherein the third preset algorithm comprises: using the delay-and-sum beam of the third pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the second pick-up direction as noise parameters;

    wherein the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
  8. A speech enhancement apparatus for distributed wake-up, comprising:

    an acquisition module, configured to obtain N pieces of frequency-domain data corresponding to N microphones of a microphone array, wherein the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;

    a determination module, configured to determine, for each of M pick-up directions, a delay-and-sum beam and a delay-and-subtract beam of the N pieces of frequency-domain data, to obtain M delay-and-sum beams and M delay-and-subtract beams, wherein the N pieces of frequency-domain data have a corresponding delay-and-sum beam and delay-and-subtract beam in each pick-up direction, N is an integer greater than 3, and M is an integer greater than 2; and

    an input module, configured to input signal magnitudes of the M delay-and-sum beams and signal magnitudes of the M delay-and-subtract beams into a speech enhancement model, so as to enhance target speech in the speech data through the speech enhancement model.
  9. The speech enhancement apparatus for distributed wake-up according to claim 8, wherein:

    the determination module is configured to: for each of the M pick-up directions, determine target frequency-domain data according to the N pieces of frequency-domain data, and determine a weight vector corresponding to the target frequency-domain data according to time delays between the N microphones in that pick-up direction, wherein the target frequency-domain data is used to indicate an array signal corresponding to the N pieces of frequency-domain data; and determine the M delay-and-sum beams and the M delay-and-subtract beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
  10. The speech enhancement apparatus for distributed wake-up according to claim 9, wherein:

    the determination module is configured to determine a first matrix corresponding to the N pieces of frequency-domain data, wherein row information of the first matrix is used to indicate the N pieces of frequency-domain data, and to determine the target frequency-domain data according to the first matrix.
  11. The speech enhancement apparatus for distributed wake-up according to claim 9, wherein:

    the determination module is configured to: determine a time delay of each of the N microphones relative to a target microphone, and determine a sub-weight vector corresponding to each microphone according to the time delay, wherein the target microphone is the microphone that first receives the speech data; determine a second matrix corresponding to the sub-weight vectors, wherein column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and determine the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
  12. The speech enhancement apparatus for distributed wake-up according to claim 11, wherein:
    the determination module is configured to perform a determining step: determining an abscissa of any one of the N microphones on a coordinate axis and an ordinate of that microphone on the coordinate axis; determining a first product of the abscissa and a cosine of the pick-up direction of that microphone, and a second product of the ordinate and a sine of the pick-up direction of that microphone; and determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product; and to perform the determining step in a loop until the time delay of each of the N microphones relative to the target microphone is determined.
  13. The speech enhancement apparatus for distributed wake-up according to claim 9, wherein:

    the determination module is configured to: determine, for each pick-up direction, a convolution result of the target frequency-domain data and the weight vector corresponding to the target frequency-domain data, and determine the M delay-and-sum beams according to the convolution result; and determine, for each pick-up direction, a complex conjugate of that convolution result, and determine the M delay-and-subtract beams according to the complex conjugate.
  14. The speech enhancement apparatus for distributed wake-up according to claim 8, wherein:

    when M = 3, the determination module is configured to: perform linear filtering on first speech enhancement data in a first pick-up direction according to a first preset algorithm to obtain a speech enhancement result in the first pick-up direction, wherein the first preset algorithm comprises: using the delay-and-sum beam of the first pick-up direction as a main beam, and using the speech enhancement results in a second pick-up direction and a third pick-up direction as noise parameters; perform linear filtering on second speech enhancement data in the second pick-up direction according to a second preset algorithm to obtain a speech enhancement result in the second pick-up direction, wherein the second preset algorithm comprises: using the delay-and-sum beam of the second pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the third pick-up direction as noise parameters; and perform linear filtering on third speech enhancement data in the third pick-up direction according to a third preset algorithm to obtain a speech enhancement result in the third pick-up direction, wherein the third preset algorithm comprises: using the delay-and-sum beam of the third pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the second pick-up direction as noise parameters; wherein the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
  15. A computer-readable storage medium, comprising a stored program, wherein, when the program is run, the method according to any one of claims 1 to 7 is performed.
  16. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to perform the method according to any one of claims 1 to 7 through the computer program.
PCT/CN2023/085266 2022-06-20 2023-03-30 Speech enhancement method and apparatus for distributed wake-up, and storage medium WO2023246223A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210700223.3 2022-06-20
CN202210700223.3A CN117292700A (en) 2022-06-20 2022-06-20 Voice enhancement method and device for distributed wakeup and storage medium

Publications (1)

Publication Number Publication Date
WO2023246223A1 true WO2023246223A1 (en) 2023-12-28

Family

ID=89243192

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085266 WO2023246223A1 (en) 2022-06-20 2023-03-30 Speech enhancement method and apparatus for distributed wake-up, and storage medium

Country Status (2)

Country Link
CN (1) CN117292700A (en)
WO (1) WO2023246223A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712818A (en) * 2020-12-29 2021-04-27 苏州科达科技股份有限公司 Voice enhancement method, device and equipment
CN113050035A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Two-dimensional directional pickup method and device
CN113393856A (en) * 2020-03-11 2021-09-14 华为技术有限公司 Sound pickup method and device and electronic equipment
US20220068288A1 (en) * 2018-12-14 2022-03-03 Nippon Telegraph And Telephone Corporation Signal processing apparatus, signal processing method, and program

Also Published As

Publication number Publication date
CN117292700A (en) 2023-12-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23825878

Country of ref document: EP

Kind code of ref document: A1