CN112735459A - Voice signal enhancement method, server and system based on distributed microphones - Google Patents



Publication number
CN112735459A
Authority
CN
China
Prior art keywords
delay time
microphones
microphone
voice
sound source
Prior art date
Legal status
Granted
Application number
CN201911032121.3A
Other languages
Chinese (zh)
Other versions
CN112735459B (en)
Inventor
何源
王伟国
李金明
金梦
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201911032121.3A
Publication of CN112735459A
Application granted
Publication of CN112735459B
Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 — Processing in the time domain
    • G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window
    • G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 — Microphone arrays; Beamforming

Abstract

An embodiment of the invention provides a voice signal enhancement method, server, and system based on distributed microphones. The method comprises: determining a target sound source to be subjected to voice signal enhancement; aligning the voice chirp signals received by any two microphones and then computing the cross-correlation function of their sound signals; obtaining a delay time difference from the peak information of the cross-correlation function within a delay time difference estimation window; obtaining the delay time differences of the remaining microphone pairs from the relation among the delay time differences of every two of three microphones together with the peak information; and obtaining the delay times of pairs of microphones relative to the target sound source, so that the received signals can be aligned and enhanced with respect to that source. By deploying a distributed microphone array, the embodiment overcomes the shortcomings of existing centralized microphone arrays; clock synchronization of the distributed array is achieved with the aid of the voice chirp signal, and alignment of the array's voice signals and enhancement of the target sound source's signal are effectively realized.

Description

Voice signal enhancement method, server and system based on distributed microphones
Technical Field
The invention relates to the technical field of communication, in particular to a voice signal enhancement method, a server and a system based on distributed microphones.
Background
Currently, sound is an important input source for many diagnostic systems. Machine diagnosis in an industrial setting is a typical example: a machine emits different running sounds in different states, and an experienced inspector can determine its running state by listening to them. In practice, however, a factory floor is very noisy; various sounds interfere with one another, and the noise may even be louder than the target machine, which severely hampers the inspector's judgment. The inspector has to approach the machine and put an ear close to it to diagnose its condition. Obviously, working for long periods in such an extremely noisy environment greatly impairs the inspector's hearing.
The more mature speech enhancement techniques at present are based on beamforming with a centralized microphone array. However, these techniques have the following disadvantages: (1) Low resolution: when the directions of arrival (DOAs) of multiple sound sources are the same or close, a centralized microphone array can hardly distinguish the sources. (2) Limited coverage: although a centralized microphone array can improve coverage to some extent by increasing the number of microphones, the sound signal still exhibits very significant attenuation when the source is far from the array.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method, a server, and a system for enhancing a voice signal based on a distributed microphone.
In a first aspect, an embodiment of the present invention provides a distributed microphone-based speech signal enhancement method, including: determining a target sound source to be subjected to voice signal enhancement; aligning the voice chirp signals in the sound signals received by every two microphones in a distributed microphone array, and then computing the cross-correlation function of the sound signals received by the two microphones, wherein the voice chirp signal is sent by a chirp voice signal source placed in the sound field in advance; obtaining the peak information of each cross-correlation function in the corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak, the unique peak corresponds to the delay time difference of the two microphones, the delay time difference being the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source, and the delay time difference estimation window being the window corresponding to the estimation range of the delay time difference, within which the delay time difference corresponds to a peak; on the basis of the determined delay time differences of pairs of microphones, iteratively obtaining the delay time differences of the remaining pairs according to the relation among the delay time differences of every two of three microphones and the peak information in the delay time difference estimation windows; obtaining the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and obtaining the second delay time of every two microphones with respect to the target sound source based on the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array.
Further, the relationship among the delay time differences corresponding to every two of three microphones is:

Δ_AB + Δ_BC = Δ_AC

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source; Δ_AC represents the difference between the delay times of microphone A and microphone C with respect to the voice chirp signal and with respect to the target sound source; and Δ_BC represents the difference between the delay times of microphone B and microphone C with respect to the voice chirp signal and with respect to the target sound source.
Further, the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is expressed as:

Δ_AB = [(d_A^C − d_B^C) − (d_A^S − d_B^S)] / c

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, d_A^C represents the distance between microphone A and the chirp voice signal source, d_A^S represents the distance between microphone A and the target sound source, d_B^C represents the distance between microphone B and the chirp voice signal source, d_B^S represents the distance between microphone B and the target sound source, and c represents the speed of sound.
Further, the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source satisfies:

|ε_AB| ≤ 4·e_d/c_min + |D_AB|·(1/c_min − 1/c_max)

wherein ε_AB represents the error of the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, D_AB = (d_A^C − d_B^C) − (d_A^S − d_B^S) is the distance term of the preceding expression, e_d represents an upper bound of the distance measurement error, and c_min and c_max represent the minimum and maximum values of the speed of sound, respectively.
Further, the determining a target sound source to be subjected to speech signal enhancement comprises: determining, according to a click position on a display screen, the sound source whose icon is closest to the click position as the target sound source.
Further, before the determining a target sound source to be subjected to speech signal enhancement, the method further comprises: acquiring the sound signals received by all microphones in the distributed microphone array.
In a second aspect, an embodiment of the present invention provides a server, including: a target audio source determination module to: determining a target sound source to be subjected to voice signal enhancement; a voice chirp signal alignment module configured to: aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance; a first delay time difference acquisition module configured to: acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window; a second delay time difference obtaining module, configured to: iteratively acquiring the delay time differences of other two microphones according to a relation of the delay time differences corresponding to the two microphones in the three microphones and peak information in the delay time difference estimation window on the basis of the determined delay time differences of the two microphones; a target audio source delay time acquisition module for: acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay 
time of each two microphones relative to the target sound source based on the first delay time and the delay time difference; a target source speech signal alignment enhancement module for: and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
In a third aspect, an embodiment of the present invention provides a speech signal enhancement system based on distributed microphones, including: the system comprises a wireless node, a distributed microphone array, at least one sound source, a chirp voice signal source and a server; wherein the wireless node is connected with at least one microphone in the microphone array and is used for transmitting the sound signals received by the connected microphone to the server.
In a fourth aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the computer program.
In a fifth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method as provided in the first aspect.
According to the distributed microphone-based voice signal enhancement method, the server and the system, the defects of the existing centralized microphone array are overcome by adopting the arrangement mode of the distributed microphone array, the clock synchronization of the distributed microphone array is realized by the aid of the voice chirp signals, and the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source are effectively realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating display contents of a display screen in the distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 3 is a schematic view of a usage scenario in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a signal processing process of a distributed microphone array in the distributed microphone based speech signal enhancement method according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating alignment of a voice chirp signal in a distributed microphone-based voice signal enhancement method according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating coarse grain alignment in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 7 is a schematic diagram of fine-grained alignment in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
FIG. 8 is a block diagram of a server according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a distributed microphone-based speech signal enhancement system according to an embodiment of the present invention;
fig. 10 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step 101, determining a target sound source to be subjected to voice signal enhancement;
the voice signal enhancement method based on the distributed microphone is not only suitable for enhancing single-voice source voice signals, but also suitable for enhancing multi-voice source voice signals. The method provided by the embodiment of the invention is operated on the server. When only one sound source in the sound field needs to be monitored, the sound source can be directly used as a fixed target sound source to enhance signals for monitoring. When a sound field with multiple sound sources needs to be monitored, a target sound source to be subjected to speech signal enhancement needs to be determined. Since the enhancement of the voice signal of a certain sound source is to hear the sound of the corresponding sound source more clearly, the enhancement of the voice signal of the remaining sound sources is not performed in the process of enhancing the voice signal of the certain sound source.
The method for determining the target sound source to be subjected to speech signal enhancement can be preset and can be realized by adopting various methods. For example, the selection may be performed in a sound source list.
Step 102, aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance;
because the position of each microphone in the distributed microphone array is different, the sound signals of the same sound source received by each microphone have time difference, so that the time delay time of the sound signals of the target sound source received by each microphone needs to be calculated, the sound signals of the target sound source received by each microphone can be aligned, and after the time alignment, the sound signals of the target sound source received by each microphone are superposed, so that the sound signals of the target sound source can be enhanced.
Here, the voice chirp signal is introduced as a reference signal to assist alignment. A chirp is a sinusoidal signal whose frequency varies rapidly and linearly with time, which makes it very sensitive to misalignment: any misalignment of two voice chirp signals in the time domain causes a sharp drop in the cross-power spectral strength, producing a very narrow peak in the cross-power spectrum, so the chirp signals can be aligned accurately.
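The sensitivity of a chirp to misalignment can be illustrated with a short sketch; the sampling rate, sweep range, and shift below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

# Illustrative parameters: 16 kHz sampling, a 0.5 s linear chirp
# sweeping 500 Hz -> 4000 Hz.
fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
f0, f1 = 500.0, 4000.0
k = (f1 - f0) / t[-1]                       # sweep rate (Hz/s)
chirp = np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))

def norm_corr(a, b):
    """Normalized correlation of two equal-length frames."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

aligned = norm_corr(chirp, chirp)           # perfectly aligned
shifted = norm_corr(chirp[:-8], chirp[8:])  # misaligned by 8 samples (0.5 ms)
# the correlation collapses even for a sub-millisecond misalignment
```

The collapse from near 1.0 to a small value for a 0.5 ms shift is what makes the chirp a sharp alignment reference.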
After the voice chirp signals in the sound signals received by every two microphones in the distributed microphone array are aligned, the cross-correlation function (CCF) of the sound signals received by any two microphones is computed.
Step 103, obtaining peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window;
the sound signals received by the two microphones include a sound signal of a target sound source, a sound signal of a chirp voice signal source, a sound signal of another sound source (if any), a sound signal of an interfering sound source (if any), and the like. To obtain the delay time of the target sound source, a delay time difference estimation window corresponding to the target sound source needs to be set. The voice chirp signals received by the two microphones have a delay time, which is called as a first delay time; the sound signals of the target sound source received by the two microphones also have a delay time, which is called a second delay time. The difference between the first delay time and the second delay time is defined as a delay time difference. The delay time difference estimation window is a window corresponding to the estimation range of the delay time difference. The true values of the delay time differences correspond to peaks in the respective delay time difference estimation windows.
Since the delay time difference estimation windows of different sound sources may overlap, several peaks may appear in the estimation window of the target sound source; however, these peaks necessarily include the one corresponding to the target source's delay time difference. The peak information of each cross-correlation function within its estimation window is therefore obtained, and if any estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones with respect to the target sound source.
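A minimal sketch of this peak-in-window selection follows; the toy correlation values and window bounds are invented for illustration:

```python
import numpy as np

def peaks_in_window(ccf, lags, lo, hi):
    """Lags of the local maxima of a cross-correlation function that fall
    inside the delay time difference estimation window [lo, hi]."""
    is_peak = (ccf[1:-1] > ccf[:-2]) & (ccf[1:-1] > ccf[2:])
    peak_lags = lags[1:-1][is_peak]
    return peak_lags[(peak_lags >= lo) & (peak_lags <= hi)]

# Toy CCF with peaks at lags -3 and +5; the window [2, 8] isolates the latter.
lags = np.arange(-10, 11)
ccf = np.exp(-0.5 * (lags + 3) ** 2) + 0.8 * np.exp(-0.5 * (lags - 5) ** 2)
cands = peaks_in_window(ccf, lags, 2, 8)
# a unique in-window peak -> it is taken as the pair's delay time difference
```

When the window contains more than one peak, the three-microphone constraint described in step 104 is used to disambiguate.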
Step 104, on the basis of the determined delay time difference of one pair of microphones, iteratively obtaining the delay time differences of the remaining pairs according to the relation among the delay time differences of every two of three microphones and the peak information in the delay time difference estimation windows;
the relation of the delay time difference corresponding to every two of the three microphones can be obtained through calculation, and the relation shows the restriction relation of the delay time difference corresponding to every two of the three microphones. This relationship or constraint can be obtained according to the existing techniques, as long as the expression is correct and not limited.
The delay time difference estimation window corresponds to the estimation range of the difference between the first delay time of two microphones with respect to the voice chirp signal and their second delay time with respect to the target sound source. Thus, each pair of microphones has its own delay time difference estimation window, and the peak in that window corresponds to the pair's difference between the first and second delay times.
Similarly, several peaks may appear in each delay time difference estimation window; however, based on the above relation among the delay time differences of every two of three microphones, the peak corresponding to the difference between the first delay time (with respect to the voice chirp signal) and the second delay time (with respect to the target sound source) of the two microphones can be identified, i.e., the delay time differences of the other microphone pairs can be obtained.
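As a sketch of this disambiguation, the three-microphone constraint (reconstructed here as Δ_AB + Δ_BC = Δ_AC) can prune candidate peaks for the remaining pairs once one pairwise delay time difference is fixed; the function, names, and values are illustrative assumptions:

```python
def resolve(delta_ab, candidates_bc, candidates_ac, tol=1e-3):
    """Given a known delay time difference for pair (A, B), pick the
    candidate peaks for pairs (B, C) and (A, C) that satisfy the
    constraint delta_ab + delta_bc == delta_ac within a tolerance."""
    for bc in candidates_bc:
        for ac in candidates_ac:
            if abs(delta_ab + bc - ac) <= tol:
                return bc, ac
    return None  # no consistent pair of candidates found

# Δ_AB is known to be 2.0 ms; among the candidate peaks only
# (Δ_BC, Δ_AC) = (1.5, 3.5) satisfies 2.0 + 1.5 = 3.5
pair = resolve(2.0, [1.5, -0.7], [0.9, 3.5])
```

Each newly resolved pair can in turn anchor the constraint for further pairs, which is the iterative procedure of step 104.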
Step 105, obtaining the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and obtaining the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference;
the delay time difference is a difference value between the first delay time with respect to the voice chirp signal and the second delay time with respect to the target sound source. And the first delay times of the two microphones with respect to the voice chirp signal are easy to obtain, the second delay times of the respective two microphones with respect to the target sound source are obtained based on the first delay time and the delay time difference, that is, the second delay times of any two microphones with respect to the target sound source can be obtained.
Step 106, aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array;
after the second delay time of each two microphones with respect to the target sound source is obtained, the delay information of the sound signal of the target sound source received between the microphones is clarified, so that the sound received by each microphone can be aligned with respect to the target sound source according to the second delay time between each two microphones of the distributed microphone array, and then the signals after alignment are superposed to enhance the sound signal of the target sound source.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting the deployment mode of the distributed microphone array, realizes the clock synchronization of the distributed microphone array by utilizing the voice chirp signal assistance, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
Further, based on the above embodiment, the relationship among the delay time differences corresponding to every two of three microphones is:

Δ_AB + Δ_BC = Δ_AC

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, i.e., the delay time difference corresponding to microphones A and B; Δ_AC represents the corresponding difference for microphones A and C; and Δ_BC represents the corresponding difference for microphones B and C.
The delay time differences corresponding to every two of any three microphones satisfy the above formula; A, B, and C are used only to distinguish the microphones and do not refer to specific ones.
As can be seen from the above relationship, the delay time differences corresponding to every two of any three microphones satisfy a simple constraint. Thus, once the delay time differences of one or two microphone pairs are known, the unknown delay time differences of the remaining pairs can be determined from the peaks of the cross-correlation functions in the corresponding delay time difference estimation windows.
On the basis of the above embodiment, the embodiment of the present invention provides this simple constraint among the delay time differences of every two of three microphones, which makes it quick and easy to obtain an unknown pairwise delay time difference from the known ones.
Further, based on the above-described embodiment, the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is expressed as:

Δ_AB = [(d_A^C − d_B^C) − (d_A^S − d_B^S)] / c

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, d_A^C represents the distance between microphone A and the chirp voice signal source, d_A^S represents the distance between microphone A and the target sound source, d_B^C represents the distance between microphone B and the chirp voice signal source, d_B^S represents the distance between microphone B and the target sound source, and c represents the speed of sound.
The difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source satisfies the above formula; A and B are used only to distinguish the microphones and do not refer to specific ones.
It can be seen that this difference can be calculated from the distances between the two microphones and the chirp voice signal source, the distances between the two microphones and the target sound source, and the speed of sound.
Since the distance calculation has errors and the sound velocity changes with the temperature, the delay time difference between any two microphones cannot be directly calculated by using the formula of the difference between the delay times of any two microphones with respect to the voice chirp signal and the target sound source, but the delay time difference is obtained by adopting the method of corresponding the peak values through the cross-correlation function. However, it is necessary to calculate the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source to obtain the delay time difference estimation window, because the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the range of the delay time difference can be estimated based on the result of the above calculation to obtain the delay time difference estimation window.
On the basis of the above embodiments, the embodiments of the present invention provide a basis for obtaining the delay time difference estimation window by giving an expression of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source.
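The coarse estimate described above can be sketched as follows; the distances and nominal sound speed are hypothetical values, not taken from the embodiment:

```python
# Coarse estimate of the delay time difference from measured distances
# (hypothetical distances in metres; nominal sound speed).
c = 343.0                        # nominal speed of sound, m/s

d_A_chirp, d_A_src = 6.2, 4.8    # mic A to chirp source / to target source
d_B_chirp, d_B_src = 3.1, 7.5    # mic B to chirp source / to target source

# delta_AB = ((d_A^S - d_A^C) - (d_B^S - d_B^C)) / c
delta_AB_coarse = ((d_A_src - d_A_chirp) - (d_B_src - d_B_chirp)) / c
```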
Further, based on the above-described embodiment, the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is expressed as:

$$\varepsilon_{AB} = \frac{4 e_d}{c_{\min}} + \left(\left|d_A^S - d_A^C\right| + \left|d_B^S - d_B^C\right|\right)\left(\frac{1}{c_{\min}} - \frac{1}{c_{\max}}\right)$$

where $\varepsilon_{AB}$ represents the absolute value of the error of the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, $e_d$ denotes the upper bound of the distance measurement error, and $c_{\min}$ and $c_{\max}$ respectively denote the minimum and maximum values of the speed of sound.
The absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is related to the distance measurement error, the minimum value and the maximum value of the sound velocity, and can be represented by the above equation.
The embodiment of the present invention provides an expression for the absolute value $\varepsilon_{AB}$ of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source, that is, the upper and lower limits of the error range. The range of the delay time difference estimation window can therefore be determined, ensuring that the peak corresponding to the delay time difference lies within the delay time difference estimation window.
On the basis of the above embodiments, the embodiment of the present invention obtains the range of the delay time difference estimation window from the expression for the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source, and ensures that the peak corresponding to the delay time difference appears within that window. Consequently, when a delay time difference estimation window contains a unique peak, that peak is determined to correspond to the delay time difference of the corresponding two microphones.
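A sketch of how the error bound and the resulting estimation window might be computed, under the assumption that the four measured distances each carry at most e_d of error and the sound speed lies in [c_min, c_max]; all numbers are hypothetical:

```python
# Error bound and estimation window sketch, assuming each of the four
# measured distances is off by at most e_d and the sound speed lies in
# [c_min, c_max] (all numbers hypothetical).
e_d = 0.10                       # m, per-distance measurement error bound
c_min, c_max = 331.0, 350.0      # m/s

d_A_chirp, d_A_src = 6.2, 4.8
d_B_chirp, d_B_src = 3.1, 7.5

# First term: four distances, each off by at most e_d.
# Second term: uncertainty in 1/c scaled by the path-difference magnitudes.
eps_AB = (4 * e_d / c_min
          + (abs(d_A_src - d_A_chirp) + abs(d_B_src - d_B_chirp))
          * (1 / c_min - 1 / c_max))

delta_hat = ((d_A_src - d_A_chirp) - (d_B_src - d_B_chirp)) / 343.0
window = (delta_hat - eps_AB, delta_hat + eps_AB)   # estimation window
```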
Further, based on the above embodiment, the determining of a target sound source to be subjected to speech signal enhancement includes: determining, according to the click position on the display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
For more convenient operation and visualization, each sound source can be represented by a different icon displayed on the display screen of the server. When monitoring personnel want to obtain the enhanced voice of a certain sound source, they can click the corresponding sound source icon or click near it. After receiving the click information from the display screen, the server obtains the click position on the display screen and determines the sound source corresponding to the sound source icon closest to the click position as the target sound source.
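A minimal sketch of the nearest-icon selection (icon names and coordinates hypothetical):

```python
# Hypothetical nearest-icon selection for the click-to-select interaction.
import math

icons = {                     # hypothetical icon name -> screen coordinates
    "source_1": (120, 340),
    "source_2": (560, 90),
    "source_3": (300, 500),
}
click = (540, 110)            # click position on the display screen

# Pick the sound source whose icon is closest to the click.
target = min(icons, key=lambda name: math.dist(icons[name], click))
```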
On the basis of the above embodiment, according to the embodiment of the present invention, the sound source corresponding to the sound source icon closest to the click position is determined as the target sound source according to the click position on the display screen, so that convenience in obtaining the target sound source is improved.
Further, based on the above embodiment, before the determining of the target sound source to be subjected to speech signal enhancement, the method further includes: acquiring the sound signals received by each microphone in the distributed microphone array.
The server performs speech enhancement processing of the target sound source, and naturally needs to acquire a sound signal of the target sound source. Each of the sound sources disposed in the sound field may become a target sound source, and sound signals of each of the sound sources are received by the microphones. Therefore, the server needs to acquire the sound signals it receives from each microphone in the microphone array. Specifically, the microphone may be connected to the wireless module, and then the sound signal received by the microphone may be transmitted to the server through the wireless module.
On the basis of the above embodiments, the embodiments of the present invention provide a basis for performing multi-source speech signal enhancement by obtaining the sound signals received by each microphone in the distributed microphone array.
Fig. 2 is a schematic diagram of display contents of a display screen in the distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. Fig. 3 is a schematic view of a usage scenario in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. Fig. 4 is a schematic diagram of a signal processing process of a distributed microphone array in the distributed microphone based speech signal enhancement method according to an embodiment of the present invention. Fig. 5 is a schematic diagram of aligning the voice chirp signals in the distributed microphone-based voice signal enhancement method according to an embodiment of the present invention. Fig. 6 is a schematic coarse-grained alignment diagram in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. Fig. 7 is a schematic diagram of fine-grained alignment in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. The following describes the distributed microphone-based speech signal enhancement method provided by the embodiment of the present invention in further detail with reference to fig. 2 to 7.
To overcome the inherent drawbacks of conventional centralized microphone arrays, embodiments of the present invention propose a distributed microphone array named ChordMics. ChordMics utilizes distributed beamforming technology to achieve highly controllable multi-source signal enhancement. Fig. 2 shows a schematic diagram of the distribution of ChordMics on a display screen. Unlike existing centralized microphone array technology, the microphone nodes in ChordMics are deployed dispersedly throughout the sound field, which brings abundant spatial diversity and greatly improves the coverage range. In addition, since each microphone node is connected wirelessly (the microphones communicate with the server over a wireless link), the nodes are not constrained by wiring, and ChordMics is highly scalable: the user can arbitrarily increase or decrease the array size and coverage area simply by adding or removing microphones. More importantly, ChordMics can realize highly controllable target signal enhancement, i.e., enhancing the signal of a sound source near any point in the sound field.
The embodiment of the invention aims to realize a distributed microphone system to perform multi-source target signal enhancement. Specifically, the embodiment of the invention deploys a plurality of microphones in the monitoring environment, and realizes signal enhancement and interference elimination for a certain sound source by coherently superposing signals collected by the microphones. With this system, the inspector simply needs to sit remotely in front of the computer display and click on a location in the screen, as shown in fig. 3, and the system will emphasize and play out the audio source near the clicked target area.
The following is a detailed description of the implementation principles of embodiments of the present invention.
Beamforming can make full use of spatial information: it delays and combines multiple signals, enhancing signals from a particular direction and suppressing signals from other directions, thereby achieving speech enhancement. Specifically, as shown in fig. 4, three microphones are placed at equal intervals with spacing d. If the sound source is sufficiently far away, its wavefronts can be regarded as planes and the propagation paths to the microphones can be regarded as approximately parallel. Assuming the angle between the propagation path and the microphone array is θ, the relative delay between two adjacent microphones is τ = d·cos(θ)/c, where c is the speed of sound. By compensating for the relative delay of each microphone signal and summing, the speech signal can be enhanced; that is, the beamformed output is:

$$y(t) = \sum_{m=1}^{M} x_m\big(t + (m-1)\tau\big)$$

where M is the total number of microphones and $x_m$ is the voice signal of the m-th microphone.
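A minimal delay-and-sum sketch of the scheme above, using integer-sample delays for simplicity (a real system would interpolate fractional delays); the geometry and signal are hypothetical:

```python
# Delay-and-sum beamforming sketch (hypothetical geometry and signal).
# Delays are compensated in whole samples for simplicity.
import math

fs = 16000                  # sampling rate, Hz
c = 343.0                   # speed of sound, m/s
d = 0.30                    # microphone spacing, m
theta = math.radians(60.0)  # arrival angle of the plane wave
M = 3                       # number of microphones

tau = d * math.cos(theta) / c   # relative delay between adjacent microphones
shift = round(tau * fs)         # whole-sample equivalent of tau

# Toy source signal and the per-microphone observations x_m(t) = s(t - m*tau).
N = 400
s = [math.sin(2 * math.pi * 440 * n / fs) for n in range(N)]
x = [[s[n - m * shift] if n - m * shift >= 0 else 0.0 for n in range(N)]
     for m in range(M)]

# Beamformed output: advance each channel by its delay, then sum.
y = [sum(x[m][n + m * shift] for m in range(M) if n + m * shift < N)
     for n in range(N)]
```

In the interior of the buffer the three aligned channels add coherently, so the output is three times the source signal.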
From the above, it can be seen that the central problem in achieving multi-microphone enhancement is calculating the time delay of the target signal relative to each microphone so that the signals can be aligned. For a distributed microphone scenario, however, the following practical problems exist: (1) there is a significant clock synchronization problem between nodes, so the signals cannot be aligned using absolute timestamps; (2) there is some measurement or deployment error in the positions of the nodes; and (3) the speed of sound is not strictly fixed and varies with temperature, making it difficult to calculate the relative delays accurately.
To address the above problems, the embodiment of the present invention provides a method combining coarse-grained alignment and fine-grained alignment, which can accurately align the target signals.
(1) Coarse-grained alignment
The first task of ChordMics is to bound the error of the relative delay estimate. The estimation error consists of three parts: time synchronization error, distance measurement error, and the uncertain speed of sound (which varies with air temperature). Without a time synchronization mechanism, time synchronization errors accumulate gradually, making it impossible to determine an upper limit on the estimation error; on the other hand, current mainstream time synchronization mechanisms fall far short of the precision required by ChordMics. To solve this problem, an additional voice chirp signal is introduced as a reference signal, thereby eliminating the time synchronization error. For easier understanding of the design of the embodiments of the present invention, first consider a simple example:
as shown in fig. 5, it is assumed that an additional voice signal source is provided at each target sound source. The additional signal source broadcasts a voice chirp reference signal. Because the target signal and the chirp signal travel around the same ground direction, the relative delays of the two signals to reach the respective microphones are the same. Thus, the chirp signal is mainly aligned, and the target signal can be aligned. The problem of calculating the relative delay is translated into the problem of detecting the chirp signal.
The chirp signal is selected as the reference signal because it is very sensitive to misalignment. Specifically, a chirp is a sinusoidal signal whose frequency varies rapidly and linearly with time. Misalignment of two chirp signals in the time domain causes a sharp drop in their correlation strength, so the cross-correlation function has a very narrow peak, which facilitates accurate alignment of the chirp signals.
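The sensitivity of a chirp to misalignment can be illustrated with a short self-correlation experiment (parameters hypothetical): the normalized correlation is 1 when aligned and collapses after a shift of only a few samples:

```python
# A linear chirp correlated against shifted copies of itself
# (parameters hypothetical).
import math

fs = 8000
N = 2000                        # 0.25 s of samples
f0, f1 = 100.0, 3000.0          # frequency sweep endpoints, Hz
T = N / fs

def chirp(n):
    t = n / fs
    # Linear chirp phase: 2*pi*(f0*t + (f1 - f0)/(2*T) * t^2)
    return math.sin(2 * math.pi * (f0 * t + (f1 - f0) / (2 * T) * t * t))

sig = [chirp(n) for n in range(N)]

def norm_corr(shift):
    # Normalized correlation of the chirp with a shifted copy of itself.
    a = sig[:N - shift]
    b = sig[shift:]
    num = sum(u * v for u, v in zip(a, b))
    den = math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))
    return num / den

aligned = norm_corr(0)          # perfectly aligned
off_by_5 = norm_corr(5)         # misaligned by only 5 samples (625 us)
```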
Now consider the more general case in which the chirp signal is not co-located with the target signal. Without loss of generality, a two-microphone scenario is used to introduce ChordMics (see fig. 6). In this example, the target signal and the chirp signal are located at two different positions, and the two microphone nodes receive the target signal at different times.
Let $x_A(t)$ and $x_B(t)$ denote the speech signals received by microphones A and B, respectively, and let $\tilde{x}_A(t)$ and $\tilde{x}_B(t)$ denote the signals after alignment of the voice chirp signal:

$$\tilde{x}_A(t) = x_A\!\left(t + \frac{index_A}{F_s}\right),\qquad \tilde{x}_B(t) = x_B\!\left(t + \frac{index_B}{F_s}\right) \tag{1}$$

Here $index_A$ and $index_B$ indicate the sampling-point locations of the chirp signal in $x_A(t)$ and $x_B(t)$, respectively, and $F_s$ denotes the sampling rate of the microphones. Let $\tau_{AB}^{C}$ denote the relative delay of microphones A and B with respect to the chirp signal; from fig. 6 it can be observed that $\tau_{AB}^{C} = (index_A - index_B)/F_s$.

It is emphasized that all that is required is to align the target sound source, i.e., to find the relative delay $\tau_{AB}^{S}$ of the microphones with respect to the target sound source. From fig. 6 it can be seen that

$$\Delta_{AB} = \tau_{AB}^{S} - \tau_{AB}^{C} \tag{2}$$

According to the propagation speed of the signal,

$$\tau_{AB}^{S} = \frac{d_A^S - d_B^S}{c} \tag{3}$$

where $c$ denotes the speed of sound. Similarly,

$$\tau_{AB}^{C} = \frac{d_A^C - d_B^C}{c} \tag{4}$$

Substituting formula (3) and formula (4) into formula (2) gives

$$\Delta_{AB} = \frac{\left(d_A^S - d_A^C\right) - \left(d_B^S - d_B^C\right)}{c} \tag{5}$$

Equation (1) illustrates that aligning the chirp signal cancels $\tau_{AB}^{C}$; therefore, as long as $\Delta_{AB}$ can be obtained, $\tau_{AB}^{S}$ can be found.
However, in practice the positions of the microphones and the chirp source are not known accurately, and even the speed of sound varies with temperature, so it is difficult to calculate $\Delta_{AB}$ precisely. Note, however, that even if $\Delta_{AB}$ cannot be calculated accurately, its estimation error can be bounded, which assists the subsequent steps of the embodiments of the present invention in finally determining $\Delta_{AB}$. The error of $\Delta_{AB}$ has the upper bound

$$\varepsilon_{AB} = \frac{4 e_d}{c_{\min}} + \left(\left|d_A^S - d_A^C\right| + \left|d_B^S - d_B^C\right|\right)\left(\frac{1}{c_{\min}} - \frac{1}{c_{\max}}\right) \tag{6}$$

where $e_d$ is the upper bound of the distance measurement error, and $c_{\min}$ and $c_{\max}$ are the minimum and maximum possible values of the speed of sound. It can be estimated that, in a room 20 m in length and width, $\varepsilon_{AB}$ is less than 20 milliseconds.
(2) Fine-grained alignment
The method for accurately determining the relative delay is described below. Consider the speech signals $\tilde{x}_A(t)$ and $\tilde{x}_B(t)$ received by microphones A and B as defined by equation (1). Their cross-correlation function (CCF) is defined as

$$Cor_{AB}(p) = \sum_{t} \tilde{x}_A(t)\, \tilde{x}_B(t + p) \tag{7}$$

Obviously, when $p = \Delta_{AB}$ the two signals are perfectly aligned and $Cor_{AB}(p)$ exhibits a peak. Naturally, $\Delta_{AB}$ can be obtained by

$$\Delta_{AB} = \arg\max_{p \,\in\, \left[\hat{\Delta}_{AB} - \varepsilon_{AB},\; \hat{\Delta}_{AB} + \varepsilon_{AB}\right]} Cor_{AB}(p) \tag{8}$$

Here $\hat{\Delta}_{AB}$ is the rough estimate obtained by substituting the coarse distances and speed of sound into the calculation formula for $\Delta_{AB}$, and $\varepsilon_{AB}$ is the maximum estimation error of $\Delta_{AB}$. The interval $[\hat{\Delta}_{AB} - \varepsilon_{AB}, \hat{\Delta}_{AB} + \varepsilon_{AB}]$ is called the estimation window of $\Delta_{AB}$, also referred to as the maximum error window.
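The fine-grained step — maximizing the CCF only within the estimation window around the coarse estimate — can be sketched on synthetic data as follows (all signals and window values hypothetical):

```python
# Fine-grained delay search restricted to the estimation window
# (synthetic burst signal; all numbers hypothetical).
import math

N = 1000
fs = 8000
true_shift = 12         # true delay difference between the two channels, samples

# A short windowed tone standing in for the target signal.
s = [math.sin(2 * math.pi * 700 * n / fs) * math.exp(-((n - 300) / 60.0) ** 2)
     for n in range(N)]
xA = s
xB = [s[n - true_shift] if n >= true_shift else 0.0 for n in range(N)]

def ccf(p):
    # Cross-correlation of xA and xB at non-negative lag p.
    return sum(xA[n] * xB[n + p] for n in range(N - p))

coarse, eps = 10, 6     # coarse estimate and window half-width, samples
window = range(coarse - eps, coarse + eps + 1)
best = max(window, key=ccf)     # peak restricted to the estimation window
```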
However, in an actual sound field there are multiple sound sources, and there may be multiple peaks within the maximum error window, making it difficult for ChordMics to directly determine which peak corresponds to the target signal. Specifically, assume there are two sound sources, a target sound source $s_D(t)$ and an interfering sound source $s_I(t)$. The signal received by a microphone $\omega \in \Omega = \{A, B\}$ can be expressed as

$$x_\omega(t) = \alpha_\omega^D\, s_D\!\left(t - \tau_\omega^D\right) + \alpha_\omega^I\, s_I\!\left(t - \tau_\omega^I\right) + n_\omega(t) \tag{9}$$

where $\alpha_\omega^D$ and $\alpha_\omega^I$ represent the attenuation coefficients of the two signals, $\tau_\omega^D$ and $\tau_\omega^I$ represent the propagation delays from the two sources to microphone $\omega$, and $n_\omega(t)$ represents noise.
The CCF between the signals $x_A(t)$ and $x_B(t)$ is (with the noise-related terms omitted)

$$Cor_{AB}(p) = \alpha_A^D \alpha_B^D\, Cor_{s_D}\!\left(p - \tau_{AB}^{D}\right) + \alpha_A^I \alpha_B^I\, Cor_{s_I}\!\left(p - \tau_{AB}^{I}\right) \tag{10}$$

and the CCF of the aligned signals is

$$\widetilde{Cor}_{AB}(p) = \alpha_A^D \alpha_B^D\, Cor_{s_D}\!\left(p - \Delta_{AB}^{D}\right) + \alpha_A^I \alpha_B^I\, Cor_{s_I}\!\left(p - \Delta_{AB}^{I}\right) \tag{11}$$

From the above formula it can be seen that $\widetilde{Cor}_{AB}$ exhibits peaks at the two locations $\Delta_{AB}^{D}$ and $\Delta_{AB}^{I}$. Obviously, $\Delta_{AB}^{D}$ falls into its estimation window, but $\Delta_{AB}^{I}$ may fall into that window as well, so it is difficult to directly judge which peak in $\widetilde{Cor}_{AB}$ is the peak of the target signal. Here $Cor_{s_D}$ denotes the autocorrelation function of $s_D(t)$, and $Cor_{s_I}$ denotes the autocorrelation function of $s_I(t)$.
To solve this problem, the embodiment of the present invention proposes a method of successive disambiguation. The method fully exploits the spatial diversity of the distributed microphones and determines the peak position of the target signal in an iterative manner. Specifically, taking fig. 7 as an example, suppose the target signal reaches microphones A, B, and C at times $t_A^S$, $t_B^S$, and $t_C^S$, respectively. The relative delay of microphones A and B with respect to the target signal is then $\tau_{AB}^{S} = t_A^S - t_B^S$, and it further follows that

$$\tau_{AB}^{S} = \tau_{AC}^{S} - \tau_{BC}^{S} \tag{12}$$

Similarly, for the relative delays with respect to the chirp signal,

$$\tau_{AB}^{C} = \tau_{AC}^{C} - \tau_{BC}^{C} \tag{13}$$

Subtracting formula (13) from formula (12) gives

$$\Delta_{AB} = \Delta_{AC} - \Delta_{BC} \tag{14}$$
The above formula reveals a very important relationship between the relative delays (the formula carries no superscript, meaning that it applies to all sources). Using this relationship, the target signal can be determined: as long as the CCF of one pair of microphones has only a single peak in its maximum error window, the peaks corresponding to the target signal between the other microphone pairs can be found iteratively. Fig. 7 gives a specific example (in which each CCF has been normalized by the corresponding coarse estimate of the target signal): looking at the CCF of microphones B and C, it can be seen that there is only one peak within the maximum error window (the delay time difference estimation window). Since the alignment procedure guarantees that the peak of the target signal must lie within the maximum error window, it can be concluded that this peak in the maximum error window of $\widetilde{Cor}_{BC}$ is the peak of the target signal, and the exact value of $\Delta_{BC}$ can thus be determined. Further, according to formula (14), $\Delta_{AB}$ and $\Delta_{AC}$ can be determined: among the candidate peaks of $\widetilde{Cor}_{AB}$ and $\widetilde{Cor}_{AC}$, only the pair of peaks satisfying formula (14) can be determined as $\Delta_{AB}$ and $\Delta_{AC}$.
this process can be extended for multiple microphones as well. Fig. 7 is only an example for explaining the calculation process of the delay time difference, and it should be noted that the delay time difference estimation windows corresponding to the cross-correlation functions for two different microphones are not necessarily the same.
In summary, the method of successive disambiguation is summarized as follows:
1. calculating CCF of every two microphone signals;
2. find a CCF that has only one peak within its maximum error window, and determine that peak as a target peak;
3. the peaks of the other target signals are iteratively found by equation (14).
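The three steps above can be sketched with a toy example (all peak lags hypothetical): the pair with a unique peak seeds the iteration, and the remaining pairs are resolved by the constraint of equation (14):

```python
# Toy successive-disambiguation example (all peak lags hypothetical, in ms).
peaks = {
    ("B", "C"): [1.5],          # unique peak -> taken as the target's delta_BC
    ("A", "C"): [0.9, 3.2],     # ambiguous: two peaks in the window
    ("A", "B"): [-1.1, 1.7],    # ambiguous: two peaks in the window
}

delta_BC = peaks[("B", "C")][0]

# Keep only the (A,B)/(A,C) peak combination consistent with
# delta_AB = delta_AC - delta_BC.
tol = 1e-6
resolved = None
for d_AC in peaks[("A", "C")]:
    for d_AB in peaks[("A", "B")]:
        if abs(d_AB - (d_AC - delta_BC)) < tol:
            resolved = (d_AB, d_AC)
```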
The hardware for realizing ChordMics mainly comprises wireless nodes, microphone sensors, and a server. In one embodiment, 6 Raspberry Pi boards (Raspberry Pi 3 Model B+) equipped with WiFi modules are used as wireless nodes. Each Raspberry Pi is connected to two microphones via USB interfaces (12 microphones in total). The 12 microphones are randomly distributed in a 10 m × 12 m room. A plurality of JBL loudspeakers serve as the target sound source and interfering sound sources. All microphones and loudspeakers are commercially available, inexpensive devices. Each Raspberry Pi streams the signals collected by its microphones to the server, and all signal detection, alignment, and enhancement is performed at the server.
On the one hand, the ChordMics system introduces an additional chirp voice signal to realize clock synchronization among distributed nodes; by referring to the chirp signal, ChordMics can eliminate clock errors among the nodes. On the other hand, by calculating the relative delays of the signals received by each microphone and exploiting the geometric diversity of the array, ChordMics can accurately find the relative delay of the target sound source between the microphones, so that the voice signals of the microphones can be precisely aligned and coherently superposed, thereby realizing enhancement of the target sound source and elimination of interference.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention. As shown in fig. 8, the server 1 includes a target sound source determining module 10, a voice chirp signal aligning module 20, a first delay time difference obtaining module 30, a second delay time difference obtaining module 40, a target sound source delay time obtaining module 50, and a target sound source voice signal aligning enhancing module 60, wherein: the target audio source determination module 10 is configured to: determining a target sound source to be subjected to voice signal enhancement; the voice chirp signal alignment module 20 is configured to: aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance; the first delay time difference obtaining module 30 is configured to: acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window; the second delay time difference obtaining module 40 is configured to: iteratively acquiring the delay time differences of other two microphones according to a relation of the delay time differences 
corresponding to the two microphones in the three microphones and peak information in the delay time difference estimation window on the basis of the determined delay time differences of the two microphones; the target sound source delay time acquisition module 50 is configured to: acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference; the target source speech signal alignment enhancement module 60 is configured to: and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting the deployment mode of the distributed microphone array, realizes the clock synchronization of the distributed microphone array by utilizing the voice chirp signal assistance, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
Further, based on the above embodiment, the target sound source determining module 10 is specifically configured to: determine, according to the click position on the display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
On the basis of the above embodiment, according to the embodiment of the present invention, the sound source corresponding to the sound source icon closest to the click position is determined as the target sound source according to the click position on the display screen, so that convenience in obtaining the target sound source is improved.
Further, based on the above embodiment, the server further includes a sound signal obtaining module, where the sound signal obtaining module is configured to: acquire the sound signals received by each microphone in the distributed microphone array.
On the basis of the above embodiments, the embodiments of the present invention provide a basis for performing multi-source speech signal enhancement by obtaining the sound signals received by each microphone in the distributed microphone array.
Fig. 9 is a schematic structural diagram of a distributed microphone-based speech signal enhancement system according to an embodiment of the present invention. As shown in fig. 9, the system includes: the system comprises a wireless node 2, a distributed microphone array 3, at least one sound source 4, a chirp voice signal source 5 and a server 1; wherein the wireless node 2 is connected to at least one microphone of the microphone array 3 for transmitting sound signals received by the connected microphone to the server 1.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting the deployment mode of the distributed microphone array, realizes the clock synchronization of the distributed microphone array by utilizing the voice chirp signal assistance, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
The device provided by the embodiment of the present invention is used for the method, and specific functions may refer to the above method flow, which is not described herein again.
Fig. 10 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 10, the electronic device may include: a processor (processor)1010, a communication Interface (Communications Interface)1020, a memory (memory)1030, and a communication bus 1040, wherein the processor 1010, the communication Interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may call logic instructions in memory 1030 to perform the following method: determining a target sound source to be subjected to voice signal enhancement; aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance; acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window; iteratively acquiring the delay time differences of other two microphones according to a relation of the delay time differences corresponding to the two microphones in the three microphones and peak information in the delay time difference estimation window on the basis of the determined delay time differences 
of the two microphones; acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
Furthermore, the logic instructions in the memory 1030 can be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided by the foregoing embodiments, for example including: determining a target sound source for which voice signal enhancement is to be performed; aligning the voice chirp signals in the sound signals received by every two microphones in a distributed microphone array, and then computing a cross-correlation function of the sound signals received by the two microphones, where the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance; acquiring peak information of each cross-correlation function within its corresponding delay time difference estimation window, where if any delay time difference estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones; the delay time difference is the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source, and the delay time difference estimation window is the window corresponding to the estimation range of the delay time difference, within which the delay time difference corresponds to a peak; iteratively acquiring the delay time differences of the remaining microphone pairs, on the basis of the delay time differences already determined, according to the relation among the delay time differences of every two microphones in any three microphones and the peak information in the corresponding delay time difference estimation windows; acquiring the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and acquiring the second delay time of the two microphones with respect to the target sound source based on the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone according to the second delay times between every two microphones of the distributed microphone array.
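As an illustrative sketch (not the patented implementation), the core step above of locating the cross-correlation peak of two chirp-aligned microphone signals inside a bounded estimation window can be written as follows; all function and parameter names are ours:

```python
import numpy as np

def delay_difference(sig_a, sig_b, fs, window_s):
    """Estimate the delay between two microphone signals (already aligned
    on the chirp) by locating the cross-correlation peak inside a bounded
    delay-difference estimation window of +/- window_s seconds."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    # Lag axis matching numpy's full-mode correlation output.
    lags = np.arange(-len(sig_b) + 1, len(sig_a))
    # Restrict the peak search to the estimation window.
    mask = np.abs(lags) <= int(window_s * fs)
    peak_lag = lags[mask][np.argmax(corr[mask])]
    return peak_lag / fs
```

A delayed copy of a noise signal recovers its own delay, which is the sanity check the estimation window is designed to pass.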
The above-described apparatus embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which one of ordinary skill in the art can understand and implement without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for distributed microphone-based speech signal enhancement, comprising:
determining a target sound source to be subjected to voice signal enhancement;
aligning the voice chirp signals in the sound signals received by every two microphones in a distributed microphone array, and then computing a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance;
acquiring peak information of each cross-correlation function within its corresponding delay time difference estimation window, wherein if any delay time difference estimation window contains a unique peak, the unique peak corresponds to the delay time difference of the corresponding two microphones; wherein the delay time difference is the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; and the delay time difference estimation window is the window corresponding to the estimation range of the delay time difference, within which the delay time difference corresponds to a peak;
iteratively acquiring the delay time differences of the remaining microphone pairs, on the basis of the delay time differences already determined, according to the relation among the delay time differences of every two microphones in any three microphones and the peak information in the corresponding delay time difference estimation windows;
acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference;
and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
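The final step of claim 1, aligning the microphone signals by their pairwise second delay times and summing them, corresponds to classic delay-and-sum processing; a minimal integer-sample sketch under our own naming, not the patented implementation:

```python
import numpy as np

def align_and_sum(signals, delays_s, fs):
    """Delay-and-sum enhancement: shift each microphone signal by its
    estimated delay (seconds) relative to the earliest microphone and
    average the overlapping portions. Integer-sample shifts only."""
    shifts = [int(round(d * fs)) for d in delays_s]
    base = min(shifts)
    shifts = [s - base for s in shifts]  # make all shifts non-negative
    n = min(len(sig) - s for sig, s in zip(signals, shifts))
    return np.mean([sig[s:s + n] for sig, s in zip(signals, shifts)], axis=0)
```

With sub-sample delays, fractional-delay interpolation would replace the integer shift, but the alignment principle is the same.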
2. The distributed microphone based speech signal enhancement method of claim 1, wherein the delay time differences for every two of the three microphones are expressed by the following relation:
$\Delta\tau_{AB} + \Delta\tau_{BC} = \Delta\tau_{AC}$
wherein $\Delta\tau_{AB}$ represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source; $\Delta\tau_{AC}$ represents the difference between the delay times of microphone A and microphone C with respect to the voice chirp signal and with respect to the target sound source; and $\Delta\tau_{BC}$ represents the difference between the delay times of microphone B and microphone C with respect to the voice chirp signal and with respect to the target sound source.
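By the definitions in claim 1, the pairwise delay time differences satisfy a triangle identity (the difference for pair A,C equals the sum of the differences for A,B and B,C), which lets an estimation window with several candidate peaks be disambiguated once two of the three pairs are determined; a hedged sketch with illustrative names:

```python
def propagate_delay_difference(d_ab, d_ac, candidate_peaks):
    """Given the delay time differences for pairs (A, B) and (A, C),
    the triangle relation d_ab + d_bc = d_ac predicts d_bc; among the
    candidate correlation-peak lags for pair (B, C), pick the one
    closest to that prediction."""
    predicted = d_ac - d_ab
    return min(candidate_peaks, key=lambda p: abs(p - predicted))
```

Iterating this over microphone triples is one way the delay time differences of the remaining pairs can be acquired from the already-determined ones.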
3. The distributed microphone based speech signal enhancement method of claim 2, wherein the expression of the difference between the delay times of any two microphones with respect to the speech chirp signal and with respect to the target sound source is:
$\Delta\tau_{AB} = \frac{\left(d_A^{chirp} - d_B^{chirp}\right) - \left(d_A^{src} - d_B^{src}\right)}{c}$
wherein $\Delta\tau_{AB}$ represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source; $d_A^{chirp}$ represents the distance between microphone A and the chirp voice signal source; $d_A^{src}$ represents the distance between microphone A and the target sound source; $d_B^{chirp}$ represents the distance between microphone B and the chirp voice signal source; $d_B^{src}$ represents the distance between microphone B and the target sound source; and $c$ represents the speed of sound.
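Under the definitions of claim 3, the delay time difference follows directly from the four microphone-to-source distances and a nominal speed of sound; the variable names below are ours:

```python
def delay_difference_from_distances(dA_chirp, dB_chirp, dA_src, dB_src, c=343.0):
    """Difference between the chirp TDOA and the target-source TDOA of
    microphones A and B, computed from the four mic-to-source distances
    (meters) and the speed of sound c (m/s)."""
    return ((dA_chirp - dB_chirp) - (dA_src - dB_src)) / c
```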
4. The distributed microphone based speech signal enhancement method of claim 3, wherein the expression of the absolute value of the error of the difference of the delay times of any two microphones with respect to the speech chirp signal and with respect to the target sound source is:
$\left|e_{\Delta\tau_{AB}}\right| \le \frac{4\,e_d}{c_{min}} + \left(\frac{1}{c_{min}} - \frac{1}{c_{max}}\right)\left|\left(d_A^{chirp} - d_B^{chirp}\right) - \left(d_A^{src} - d_B^{src}\right)\right|$
wherein $e_{\Delta\tau_{AB}}$ represents the error of the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, $e_d$ represents an upper bound of the distance measurement error, and $c_{min}$ and $c_{max}$ respectively represent the minimum and maximum values of the speed of sound.
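A worst-case bound of this kind can be derived by assuming each of the four distance measurements errs by at most e_d and that the true speed of sound lies in [c_min, c_max]; the expression below is our reconstruction and the exact patented formula may differ:

```python
def delay_difference_error_bound(distances, e_d, c_min, c_max):
    """Worst-case error of the distance-based delay time difference:
    4*e_d of accumulated distance error divided by the slowest plausible
    sound speed, plus the spread caused by sound-speed uncertainty.
    distances = (dA_chirp, dB_chirp, dA_src, dB_src), all in meters."""
    dA_c, dB_c, dA_s, dB_s = distances
    numerator = abs((dA_c - dB_c) - (dA_s - dB_s))
    return 4 * e_d / c_min + numerator * (1 / c_min - 1 / c_max)
```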
5. The distributed microphone-based speech signal enhancement method of claim 1, wherein the determining a target sound source for which speech signal enhancement is to be performed comprises:
determining, according to a click position on a display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
6. The distributed microphone-based speech signal enhancement method of claim 5, wherein prior to determining the target sound source for which voice signal enhancement is to be performed, the method further comprises: acquiring the sound signals received by all microphones in the distributed microphone array.
7. A server, comprising:
a target audio source determination module to: determining a target sound source to be subjected to voice signal enhancement;
a voice chirp signal alignment module configured to: aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance;
a first delay time difference acquisition module configured to: acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window;
a second delay time difference obtaining module, configured to: iteratively acquire the delay time differences of the remaining microphone pairs, on the basis of the delay time differences already determined, according to the relation among the delay time differences of every two microphones in any three microphones and the peak information in the corresponding delay time difference estimation windows;
a target audio source delay time acquisition module for: acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference;
a target source speech signal alignment enhancement module for: and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
8. A voice signal enhancement system based on distributed microphones is characterized by comprising a wireless node, a distributed microphone array, at least one sound source, a chirp voice signal source and a server; wherein the wireless node is connected with at least one microphone in the microphone array and is used for transmitting the sound signals received by the connected microphone to the server.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the distributed microphone based speech signal enhancement method according to any of claims 1 to 6 when executing the computer program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the distributed microphone based speech signal enhancement method according to any one of claims 1 to 6.
CN201911032121.3A 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone Active CN112735459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911032121.3A CN112735459B (en) 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone


Publications (2)

Publication Number Publication Date
CN112735459A true CN112735459A (en) 2021-04-30
CN112735459B CN112735459B (en) 2024-03-26

Family

ID=75588832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911032121.3A Active CN112735459B (en) 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone

Country Status (1)

Country Link
CN (1) CN112735459B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102347028A (en) * 2011-07-14 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
CN102800325A (en) * 2012-08-31 2012-11-28 厦门大学 Ultrasonic-assisted microphone array speech enhancement device
JP2014174393A (en) * 2013-03-11 2014-09-22 Research Organization Of Information & Systems Apparatus and method for voice signal processing
US9208794B1 (en) * 2013-08-07 2015-12-08 The Intellisis Corporation Providing sound models of an input signal using continuous and/or linear fitting
CN107017003A (en) * 2017-06-02 2017-08-04 厦门大学 A kind of microphone array far field speech sound enhancement device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Lin Yan, Wang Xiutan, Peng Yingning, Xu Jia, Zhang, Xia Xianggen: "Maximum likelihood parameter estimation of chirp signals based on MCMC", Journal of Tsinghua University (Science and Technology), no. 04, pages 511 - 514 *
Wu Mingqin; Yu Fengqin: "Time-frequency structure analysis of speech signals based on chirp atom decomposition", Journal of Jiangnan University (Natural Science Edition), no. 06, pages 685 - 687 *
Wu Mingqin; Yu Fengqin; Han: "A speech enhancement method based on chirp atom decomposition", Microelectronics & Computer, no. 12, pages 74 - 76 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409817A (en) * 2021-06-24 2021-09-17 浙江松会科技有限公司 Audio signal real-time tracking comparison method based on voiceprint technology
CN113409817B (en) * 2021-06-24 2022-05-13 浙江松会科技有限公司 Audio signal real-time tracking comparison method based on voiceprint technology
WO2023206686A1 (en) * 2022-04-29 2023-11-02 青岛海尔科技有限公司 Control method for smart device, and storage medium and electronic apparatus

Also Published As

Publication number Publication date
CN112735459B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN104429100A (en) Systems and methods for surround sound echo reduction
EP2810453B1 (en) Audio source position estimation
KR101415026B1 (en) Method and apparatus for acquiring the multi-channel sound with a microphone array
Ono et al. Blind alignment of asynchronously recorded signals for distributed microphone array
EP1600791B1 (en) Sound source localization based on binaural signals
EP2976898B1 (en) Method and apparatus for determining a position of a microphone
JP2008236077A (en) Target sound extracting apparatus, target sound extracting program
KR20210091034A (en) Multiple-source tracking and voice activity detections for planar microphone arrays
JP6741004B2 (en) Sound source position detecting device, sound source position detecting method, sound source position detecting program, and storage medium
CN112735459A (en) Voice signal enhancement method, server and system based on distributed microphones
CN102428717A (en) A system and method for estimating the direction of arrival of a sound
CN103561387A (en) Indoor positioning method and system based on TDoA
US20150172842A1 (en) Sound processing apparatus, sound processing method, and sound processing program
KR20140126788A (en) Position estimation system using an audio-embedded time-synchronization signal and position estimation method using thereof
JP2010212818A (en) Method of processing multi-channel signals received by a plurality of microphones
CN114788302B (en) Signal processing device, method and system
EP3182734B1 (en) Method for using a mobile device equipped with at least two microphones for determining the direction of loudspeakers in a setup of a surround sound system
KR100730297B1 (en) Sound source localization method using Head Related Transfer Function database
KR20110109620A (en) Microphone module, apparatus for measuring location of sound source using the module and method thereof
JP2007017415A (en) Method of measuring time difference of impulse response
EP2214420A1 (en) Sound emission and collection device
CN111505583B (en) Sound source positioning method, device, equipment and readable storage medium
JP6433630B2 (en) Noise removing device, echo canceling device, abnormal sound detecting device, and noise removing method
KR20190013264A (en) Location determination system and method of smart device using non-audible sound wave
KR20160127259A (en) Configuration method of planar array sensor for underwater sound detection and underwater sound measurement system using thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant