CN112735459B - Voice signal enhancement method, server and system based on distributed microphone - Google Patents

Publication number: CN112735459B
Authority: CN (China)
Prior art keywords: delay time, microphones, microphone, sound source, respect
Legal status: Active (the listed status is an assumption, not a legal conclusion)
Application number: CN201911032121.3A
Original language: Chinese (zh)
Other versions: CN112735459A
Inventors: 何源, 王伟国, 李金明, 金梦
Original and current assignee: Tsinghua University
Application CN201911032121.3A filed by Tsinghua University; published as CN112735459A, granted and published as CN112735459B; legal status: active.


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

An embodiment of the invention provides a distributed-microphone-based voice signal enhancement method, server, and system. The method comprises: determining a target sound source whose voice signal is to be enhanced; aligning the voice chirp signals received by any two microphones and then computing their cross-correlation function; obtaining the delay time difference from the peak information of the cross-correlation function within a delay time difference estimation window; obtaining the delay time differences of the remaining microphone pairs from the relation among the delay time differences of every two of three microphones together with the peak information; and acquiring the delay time of each pair of microphones relative to the target sound source, then performing alignment and enhancement relative to the target sound source. By adopting a distributed microphone array deployment, the embodiment overcomes the shortcomings of existing centralized microphone arrays; with the assistance of the voice chirp signal it achieves clock synchronization of the distributed array, and it effectively aligns the array's voice signals and enhances the signal of the target sound source.

Description

Voice signal enhancement method, server and system based on distributed microphone
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method, a server, and a system for enhancing a voice signal based on a distributed microphone.
Background
Currently, sound is an important input source for many diagnostic systems. Machine diagnostics in an industrial setting is a typical example: a machine emits different operating sounds in different states, and an experienced inspector can tell a machine's operating state by listening to it. In practice, however, a factory floor is very noisy: various sounds interfere with one another, and the noise may even be louder than the target machine, which severely hinders the inspector's judgment. The inspector has to approach the machine and put an ear close to it to diagnose its condition. Clearly, working for long periods in such an extremely noisy environment greatly impairs the inspector's hearing.
The current state-of-the-art technology for speech enhancement is the beamforming-based centralized microphone array. These techniques have the following disadvantages: (1) low resolution: when the directions of arrival (DOA) of multiple sound sources are the same or similar, a centralized microphone array has difficulty distinguishing the sources; (2) limited coverage: although a centralized array can extend its coverage to some extent by adding microphones, the received sound signals still exhibit very significant attenuation when the sound source is far from the array.
Disclosure of Invention
In order to solve the problems in the prior art, the embodiment of the invention provides a voice signal enhancement method, a server and a system based on a distributed microphone.
In a first aspect, an embodiment of the present invention provides a method for enhancing a voice signal based on distributed microphones, including: determining a target sound source whose voice signal is to be enhanced; aligning the voice chirp signals within the sound signals received by every two microphones in a distributed microphone array, and then computing the cross-correlation function of the sound signals received by the two microphones, wherein the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance; acquiring the peak information of each cross-correlation function within the corresponding delay time difference estimation window, wherein if any delay time difference estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones; the delay time difference is the difference between the two microphones' first delay time with respect to the voice chirp signal and their second delay time with respect to the target sound source, and the delay time difference estimation window is the window corresponding to the estimated range of the delay time difference, within which the delay time difference corresponds to a peak; based on the determined delay time differences of pairs of microphones, iteratively acquiring the delay time differences of the remaining pairs according to the relation among the delay time differences of every two microphones of three microphones and the peak information in the delay time difference estimation windows; acquiring the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and acquiring the second delay time of the corresponding pairs with respect to the target sound source from the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay times between the microphone pairs of the distributed microphone array.
Further, the relation among the delay time differences corresponding to every two microphones of the three microphones is:

$\Delta\tau_{AB} + \Delta\tau_{BC} = \Delta\tau_{AC}$

wherein $\Delta\tau_{AB}$ represents the difference in delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source; $\Delta\tau_{AC}$ represents the difference in delay times of microphone A and microphone C with respect to the speech chirp signal and with respect to the target sound source; $\Delta\tau_{BC}$ represents the difference in delay times of microphone B and microphone C with respect to the speech chirp signal and with respect to the target sound source.
Further, the expression of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is:

$\Delta\tau_{AB} = \dfrac{(d_A^{c} - d_B^{c}) - (d_A^{s} - d_B^{s})}{c}$

wherein $\Delta\tau_{AB}$ represents the difference in delay time of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source, $d_A^{c}$ represents the distance between microphone A and the chirp voice signal source, $d_A^{s}$ represents the distance between microphone A and the target sound source, $d_B^{c}$ represents the distance between microphone B and the chirp voice signal source, $d_B^{s}$ represents the distance between microphone B and the target sound source, and $c$ represents the speed of sound.
Further, the expression of the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is:

$|e_{AB}| \le \dfrac{4 e_d}{c_{\min}} + \left(\dfrac{1}{c_{\min}} - \dfrac{1}{c_{\max}}\right)\bigl|(d_A^{c} - d_B^{c}) - (d_A^{s} - d_B^{s})\bigr|$

wherein $e_{AB}$ represents the error of the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source, $e_d$ represents the upper bound of the distance measurement error, and $c_{\min}$ and $c_{\max}$ represent the minimum and maximum values of the sound velocity, respectively.
Further, the determining the target sound source to be subjected to voice signal enhancement includes: determining, according to a click position on a display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
Further, before the determining the target sound source to be subjected to the speech signal enhancement, the method further includes: sound signals received by each microphone in the distributed microphone array are acquired.
In a second aspect, an embodiment of the present invention provides a server, including: a target sound source determining module, configured to determine a target sound source whose voice signal is to be enhanced; a voice chirp signal alignment module, configured to align the voice chirp signals within the sound signals received by every two microphones in a distributed microphone array and then compute the cross-correlation function of the sound signals received by the two microphones, wherein the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance; a first delay time difference acquisition module, configured to acquire the peak information of each cross-correlation function within the corresponding delay time difference estimation window, wherein if any delay time difference estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones; the delay time difference is the difference between the two microphones' first delay time with respect to the voice chirp signal and their second delay time with respect to the target sound source, and the delay time difference estimation window is the window corresponding to the estimated range of the delay time difference, within which the delay time difference corresponds to a peak; a second delay time difference acquisition module, configured to iteratively acquire, based on the determined delay time differences of pairs of microphones, the delay time differences of the remaining pairs according to the relation among the delay time differences of every two microphones of three microphones and the peak information in the delay time difference estimation windows; a target sound source delay time acquisition module, configured to acquire the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal and to acquire, from the first delay time and the delay time difference, the second delay time of the corresponding pairs with respect to the target sound source; and a target sound source voice signal alignment enhancement module, configured to align and enhance the sound received by each microphone with respect to the target sound source according to the second delay times between the microphone pairs of the distributed microphone array.
In a third aspect, an embodiment of the present invention provides a voice signal enhancement system based on a distributed microphone, including: the system comprises a wireless node, a distributed microphone array, at least one sound source, a chirp voice signal source and the server; the wireless node is connected with at least one microphone in the microphone array and is used for transmitting sound signals received by the connected microphones to the server.
In a fourth aspect, an embodiment of the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method as provided in the first aspect when executing the computer program.
In a fifth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as provided by the first aspect.
According to the voice signal enhancement method, the server and the system based on the distributed microphone, the defects of the existing centralized microphone array are overcome by adopting the deployment mode of the distributed microphone array, clock synchronization of the distributed microphone array is realized by utilizing voice chirp signal assistance, and alignment of voice signals of the distributed microphone array and signal enhancement of a target sound source are effectively realized.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; other drawings may be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for enhancing speech signals based on distributed microphones according to an embodiment of the present invention;
fig. 2 is a schematic diagram of display contents of a display screen in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a usage scenario in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a signal processing procedure of a distributed microphone array in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating alignment of voice chirp signals in a method for enhancing voice signals based on distributed microphones according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of coarse-granularity alignment in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention;
FIG. 7 is a schematic illustration of fine grain alignment in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a server according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a voice signal enhancement system based on distributed microphones according to an embodiment of the present invention;
fig. 10 is a schematic physical structure of an electronic device according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step 101, determining a target sound source to be subjected to voice signal enhancement;
The voice signal enhancement method based on distributed microphones provided by the embodiment of the invention is suitable for enhancing not only single-source but also multi-source voice signals. The method runs on the server. When only one sound source in the sound field needs to be monitored, that source can directly serve as the fixed target sound source whose signal is enhanced for monitoring. When several sound sources in the sound field need to be monitored, the target sound source to be enhanced must be determined. Since enhancing a given source's voice signal serves to hear that source more clearly, the remaining sources are not enhanced while that source's signal is being enhanced.
The method for determining the target sound source to be subjected to the speech signal enhancement may be preset, and may be implemented in various ways. For example, the selection may be performed in a sound source list.
Step 102, aligning voice chirp signals of sound signals received by two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is sent by a chirp voice signal source which is placed in the sound field in advance;
Because the microphones in the distributed microphone array sit at different positions, there is a time offset between the sound signals of the same source received by different microphones. The delay times of the target sound source's signal at the microphones must therefore be obtained so that the received copies of the target source's signal can be aligned; once delayed into alignment, those copies superimpose and the target sound source's signal is enhanced.
Here, a voice chirp signal is introduced as a reference signal to assist alignment. The chirp is a sinusoidal signal whose frequency varies linearly and rapidly over time, which makes it very sensitive to misalignment: misaligning two voice chirp signals in the time domain causes a steep drop in the cross-power-spectrum intensity, so the cross-power spectrum exhibits a very narrow peak and the chirp signals can be aligned precisely.
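As an illustration of such a reference signal, the sketch below generates a linear chirp; the sweep range, duration, and sample rate are illustrative choices, not parameters taken from the patent:

```python
import numpy as np

def linear_chirp(f0, f1, duration, fs):
    """Generate a linear chirp sweeping from f0 to f1 Hz over `duration` seconds."""
    t = np.arange(int(duration * fs)) / fs
    k = (f1 - f0) / duration                       # sweep rate in Hz per second
    phase = 2 * np.pi * (f0 * t + 0.5 * k * t ** 2)  # instantaneous phase of a linear sweep
    return np.sin(phase)

# Illustrative parameters: 500 Hz -> 3 kHz over 0.5 s at a 16 kHz sample rate.
chirp = linear_chirp(500.0, 3000.0, 0.5, 16000)
```

Because the instantaneous frequency sweeps quickly, shifting two copies of this waveform by even a few samples sharply reduces their correlation, which is what makes the chirp a good alignment reference.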
After the voice chirp signals in the sound signals received by the microphones of the distributed array have been aligned, a cross-correlation function (CCF) is computed for the sound signals received by each pair of microphones.
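A minimal sketch of this step, assuming equal-length recordings and integer-sample delays (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def ccf_and_delay(x, y):
    """Cross-correlation of two equal-length signals and the lag (in samples)
    at which it peaks; a positive lag means y lags (arrives after) x."""
    ccf = np.correlate(y, x, mode="full")
    lags = np.arange(-(len(x) - 1), len(y))
    return ccf, int(lags[np.argmax(ccf)])

# Toy check: white noise delayed by 5 samples peaks at lag 5.
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
delayed = np.concatenate([np.zeros(5), s[:-5]])
_, lag = ccf_and_delay(s, delayed)
```

In practice the two recordings would first be aligned on the chirp as described above, so that the remaining peak lags reflect only the source geometry rather than clock offsets.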
Step 103, obtaining peak information of each cross-correlation function in a corresponding delay time difference estimation window, and if any one of the delay time difference estimation windows has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the delay time difference estimation range, and the delay time difference corresponds to a peak value in the delay time difference estimation window;
The sound signals received by the two microphones include the sound signal of the target sound source, the sound signal of the chirp sound source, the sound signals of other sound sources (if present), the sound signals of the interfering sound sources (if present), and the like. To obtain the delay time of the target sound source, a delay time difference estimation window corresponding to the target sound source needs to be set. The voice chirp signals received by the two microphones have delay time, which is called first delay time; the sound signals of the target sound source received by the two microphones also have a delay time, which is called a second delay time. The difference between the first delay time and the second delay time is defined as a delay time difference. The delay time difference estimation window is a window corresponding to the delay time difference estimation range. The true values of the delay time differences correspond to peaks in the respective delay time difference estimation windows.
Since the delay time difference estimation windows of the respective sound sources may have an overlap, there may be a plurality of peaks in the delay time difference estimation window of the target sound source, but the plurality of peaks that occur necessarily include the peak corresponding to the delay time difference corresponding to the target sound source. Therefore, the peak information of each cross-correlation function in the corresponding delay time difference estimation window is obtained, and if any one of the delay time difference estimation windows has a unique peak value, the unique peak value corresponds to the delay time difference of the corresponding two microphones with respect to the target sound source.
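The unique-peak check within a window can be sketched as follows; the relative-height threshold used to qualify a peak is an illustrative assumption, not a value from the patent:

```python
import numpy as np

def peaks_in_window(ccf, lags, window, rel_threshold=0.5):
    """Return the lags of local maxima of `ccf` that fall inside
    `window` = (lo, hi) and reach rel_threshold * the global maximum."""
    lo, hi = window
    found = []
    for i in range(1, len(ccf) - 1):
        if lo <= lags[i] <= hi and ccf[i] >= ccf[i - 1] and ccf[i] >= ccf[i + 1] \
                and ccf[i] >= rel_threshold * ccf.max():
            found.append(int(lags[i]))
    return found

# Synthetic CCF: a strong peak at lag +3 and another at lag -8.
lags = np.arange(-10, 11)
ccf = np.zeros(21)
ccf[13] = 1.0   # lag +3
ccf[2] = 0.9    # lag -8, outside the target's window below
```

If exactly one peak survives inside a pair's window, it directly gives that pair's delay time difference; when several survive, the triangle relation among three microphones is needed to disambiguate, as the next step describes.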
Step 104, based on the determined delay time difference of every two microphones, iteratively acquiring the delay time difference of every other two microphones according to a relation of the delay time difference corresponding to every two microphones in three microphones and peak information in the delay time difference estimation window;
The relation among the delay time differences corresponding to every two microphones of the three can be obtained by calculation; it expresses a constraint among those pairwise delay time differences. The relation can be derived with existing techniques, and its form of expression is not limited here.
The delay time difference estimation window is the window corresponding to the estimated range of the difference between the two microphones' first delay time with respect to the voice chirp signal and their second delay time with respect to the target sound source. Every pair of microphones therefore has a corresponding delay time difference estimation window, and the peak within that window corresponds to the pair's delay time difference.
Likewise, several peaks may appear in each delay time difference estimation window, but the peak corresponding to the difference between the two microphones' first delay time with respect to the voice chirp signal and their second delay time with respect to the target sound source can be identified from the relation, given above, among the delay time differences of every two microphones of three microphones; in this way the delay time differences of the remaining microphone pairs are obtained.
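One way this resolution step could look, using the triangle constraint that the delay time differences of pairs (A, B) and (B, C) must sum to that of pair (A, C); the sample tolerance is an assumed value, not from the patent:

```python
def resolve_delay_difference(delta_ab, delta_bc, candidate_peaks_ac, tol=1):
    """Pick, among candidate peak lags for pair (A, C), the one consistent
    with the constraint delta_AC = delta_AB + delta_BC, to within `tol`
    samples. Returns None when zero or several candidates fit."""
    expected = delta_ab + delta_bc
    matches = [c for c in candidate_peaks_ac if abs(c - expected) <= tol]
    return matches[0] if len(matches) == 1 else None

# With delta_AB = 4 and delta_BC = -1, only the candidate near 3 is consistent.
picked = resolve_delay_difference(4, -1, [3, 9])
```

Iterating this over the remaining pairs, each newly resolved pair enlarges the set of known delay time differences that can disambiguate the next window.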
Step 105, acquiring the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and acquiring the second delay time of every two corresponding microphones with respect to the target sound source based on the first delay time and the delay time difference;
The delay time difference is the difference between the first delay time, which is with respect to the voice chirp signal, and the second delay time, which is with respect to the target sound source. Since the first delay time of two microphones with respect to the voice chirp signal is easy to obtain, the second delay time of the corresponding two microphones with respect to the target sound source follows from the first delay time and the delay time difference; in this way the second delay time of any two microphones with respect to the target sound source can be obtained.
Step 106, aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array;
Once the second delay time of each pair of microphones with respect to the target sound source has been acquired, the delays between the target source's signals as received by the microphones are known. The sound received by each microphone can therefore be aligned with respect to the target sound source according to the second delay times between the microphone pairs of the distributed array, after which the aligned signals are superimposed to enhance the target sound source's signal.
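The align-and-superimpose step can be sketched as a delay-and-sum over the array, assuming integer-sample delays relative to a reference microphone (names are illustrative):

```python
import numpy as np

def align_and_enhance(signals, delays):
    """Advance each microphone signal by its delay (in samples, relative to a
    reference microphone) and average; copies of the target source add
    coherently while uncorrelated noise averages down. np.roll wraps the
    edge samples, which a real implementation would trim."""
    n = min(len(s) for s in signals)
    aligned = [np.roll(np.asarray(s[:n], dtype=float), -int(d))
               for s, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)

# Toy check: the same impulse arriving 0, 2, and 4 samples late re-aligns.
base = np.zeros(32)
base[10] = 1.0
sigs = [base, np.roll(base, 2), np.roll(base, 4)]
out = align_and_enhance(sigs, [0, 2, 4])
```

Averaging rather than summing keeps the target source at its original amplitude while attenuating components that do not share the target's delays.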
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting a deployment mode of the distributed microphone array, realizes clock synchronization of the distributed microphone array by utilizing the assistance of the voice chirp signal, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
Further, based on the above embodiment, the relation among the delay time differences corresponding to every two microphones of the three microphones is:

$\Delta\tau_{AB} + \Delta\tau_{BC} = \Delta\tau_{AC}$

wherein $\Delta\tau_{AB}$ represents the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source, i.e., the delay time difference corresponding to microphone A and microphone B; $\Delta\tau_{AC}$ represents the difference between the delay times of microphone A and microphone C with respect to the speech chirp signal and with respect to the target sound source, i.e., the delay time difference corresponding to microphone A and microphone C; $\Delta\tau_{BC}$ represents the difference between the delay times of microphone B and microphone C with respect to the speech chirp signal and with respect to the target sound source, i.e., the delay time difference corresponding to microphone B and microphone C.
The delay time differences corresponding to every two microphones of any three microphones satisfy the above formula; A, B, and C are merely labels that distinguish microphones and do not refer to specific microphones.
It follows from the above relation that the delay time differences corresponding to every two of any three microphones satisfy this simple constraint; hence, once some pairwise delay time differences are known, the unknown ones can be obtained from the constraint together with the peak behaviour of the cross-correlation function in the corresponding delay time difference estimation window.
Based on the above embodiments, the embodiment of the present invention provides a simple constraint relationship of delay time differences corresponding to two microphones in three microphones, so that the unknown delay time differences corresponding to two microphones can be obtained quickly and simply according to the known delay time differences corresponding to two microphones.
Further, based on the above embodiment, the expression of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is:

$\Delta\tau_{AB} = \dfrac{(d_A^{c} - d_B^{c}) - (d_A^{s} - d_B^{s})}{c}$

wherein $\Delta\tau_{AB}$ represents the difference in delay time of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source, $d_A^{c}$ represents the distance between microphone A and the chirp voice signal source, $d_A^{s}$ represents the distance between microphone A and the target sound source, $d_B^{c}$ represents the distance between microphone B and the chirp voice signal source, $d_B^{s}$ represents the distance between microphone B and the target sound source, and $c$ represents the speed of sound.
The difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source satisfies the above formula, and A, B is merely for distinguishing microphones, and is not limited to a specific microphone.
It can be seen that the difference in delay times of the two microphones with respect to the speech chirp signal and with respect to the target sound source can be calculated from the distance of the two microphones from the chirp speech signal source, the distance of the two microphones from the target sound source, and the speed of sound.
Because distance measurements carry error and the speed of sound varies with temperature, the delay time differences of all microphone pairs are not computed directly from the above formula; they are instead obtained from the peaks of the cross-correlation functions. Evaluating the formula is nevertheless necessary in order to obtain the delay time difference estimation window: since the window corresponds to the estimated range of the delay time difference, that range can be estimated from the computed value, yielding the window.
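A sketch of building the window from measured distances. The half-width uses an assumed form of the error bound in terms of $e_d$, $c_{\min}$, and $c_{\max}$ (each of the four distances off by at most $e_d$, and the true sound speed anywhere in the given range); the exact bound in the patent is not reproduced here:

```python
def delay_diff_window(dA_chirp, dB_chirp, dA_src, dB_src, e_d,
                      c_min=331.0, c_max=350.0, c_nom=343.0):
    """Distance-based estimate (seconds) of the delay time difference for
    microphones A and B, plus a window wide enough to cover a distance error
    of up to e_d metres per measurement and an uncertain sound speed."""
    path = (dA_chirp - dB_chirp) - (dA_src - dB_src)   # net path difference, metres
    estimate = path / c_nom
    # Assumed worst-case half-width: 4 distance errors plus sound-speed spread.
    half_width = 4 * e_d / c_min + abs(path) * (1 / c_min - 1 / c_max)
    return estimate - half_width, estimate + half_width

# Illustrative geometry: the window brackets the nominal estimate.
lo, hi = delay_diff_window(5.0, 4.0, 3.0, 6.0, 0.05)
```

Any cross-correlation peak falling inside `(lo, hi)` is then a candidate for the pair's delay time difference; a wider distance error or sound-speed range simply widens the window.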
On the basis of the above embodiments, the embodiment of the present invention provides a basis for acquiring the delay time difference estimation window by giving an expression for the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source.
Further, based on the above embodiment, the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source satisfies:

$|\varepsilon_{AB}| \le \frac{4 e_d}{c_{min}} + \left|\left(d_A^c - d_B^c\right) - \left(d_A^s - d_B^s\right)\right|\left(\frac{1}{c_{min}} - \frac{1}{c_{max}}\right)$

wherein $\varepsilon_{AB}$ represents the error of the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, $e_d$ represents the upper bound of the distance measurement error, and $c_{min}$ and $c_{max}$ represent the minimum and maximum values of the sound velocity, respectively.
The absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is determined by the distance measurement error and by the minimum and maximum values of the sound velocity, and can be expressed by the above formula.

The embodiment of the invention gives an expression for the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source, i.e., the upper and lower limits of the error range. The range of the delay time difference estimation window can therefore be determined, ensuring that the peak corresponding to the delay time difference appears within that window.

On the basis of the above embodiment, the embodiment of the present invention obtains the range of the delay time difference estimation window from the expression for the absolute value of this error, and guarantees that the peak corresponding to the delay time difference appears within the window; consequently, when a delay time difference estimation window contains a unique peak, that peak can be determined to correspond to the delay time difference of the corresponding two microphones.
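As a concrete illustration of how the window can be obtained from measured quantities, the sketch below computes the coarse delay-difference estimate from the four microphone-to-source distances and derives an estimation window covering both the distance error bound and the possible sound-speed range. All function names, distances, and numeric bounds are illustrative assumptions, not part of the patent.

```python
# Sketch: coarse estimate of the chirp/target delay-time difference for
# microphones A and B, plus an estimation window derived from the distance
# error bound e_d and the sound-speed range [c_min, c_max]. Illustrative only.

def delay_diff_estimate(dA_c, dA_s, dB_c, dB_s, c=343.0):
    """Coarse estimate: ((dA_c - dB_c) - (dA_s - dB_s)) / c."""
    return ((dA_c - dB_c) - (dA_s - dB_s)) / c

def delay_diff_window(dA_c, dA_s, dB_c, dB_s, e_d, c_min=331.0, c_max=350.0):
    """Return (low, high) bounds of the delay-difference estimation window.

    Each of the four measured distances may be off by up to e_d, and the true
    sound speed lies in [c_min, c_max]; the window covers every combination.
    """
    num = (dA_c - dB_c) - (dA_s - dB_s)
    # Worst-case numerator error: 4 * e_d; worst-case speed: c_min or c_max.
    candidates = [(num + s * 4 * e_d) / c
                  for s in (-1.0, 1.0) for c in (c_min, c_max)]
    return min(candidates), max(candidates)

center = delay_diff_estimate(8.0, 5.0, 3.0, 6.0)          # coarse estimate (s)
lo, hi = delay_diff_window(8.0, 5.0, 3.0, 6.0, e_d=0.1)   # window bounds (s)
```

With exact distances the coarse estimate always falls inside the window; the cross-correlation peak is then searched only within `[lo, hi]`.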
Further, based on the above embodiment, determining the target sound source to be subjected to speech signal enhancement includes: determining, according to the click position on the display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.

For easier operation and better visualization, each sound source may be displayed as a distinct icon on the server's display screen. When monitoring personnel want to acquire the enhanced voice of a certain sound source, they click the corresponding sound source icon or a position near it. After receiving the click information, the server obtains the click position on the display screen and determines the sound source whose icon is closest to that position as the target sound source.
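The click-to-source selection described above reduces to a nearest-neighbour lookup over the icon positions. A minimal sketch, in which the source identifiers and pixel coordinates are invented for illustration:

```python
# Minimal sketch of the click-to-source selection: given the on-screen icon
# positions of the deployed sound sources, pick the source whose icon is
# closest to the clicked pixel. Names and coordinates are illustrative.
import math

def pick_target_source(click_xy, source_icons):
    """source_icons: dict mapping source id -> (x, y) icon position."""
    return min(source_icons,
               key=lambda s: math.dist(click_xy, source_icons[s]))

icons = {"source-1": (100, 80), "source-2": (400, 300), "source-3": (250, 520)}
target = pick_target_source((390, 310), icons)  # nearest icon wins
```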
On the basis of the embodiment, according to the embodiment of the invention, the sound source corresponding to the sound source icon closest to the click position is determined as the target sound source according to the click position on the display screen, so that the convenience of acquiring the target sound source is improved.
Further, based on the above embodiment, before the determining the target sound source to be subjected to the speech signal enhancement, the method further includes: sound signals received by each microphone in the distributed microphone array are acquired.
Since the server performs the voice enhancement processing of the target sound source, it naturally must acquire the sound signal of the target sound source. Any sound source deployed in the sound field may become the target sound source, and the sound signals of the respective sources are received by the microphones. The server therefore needs to obtain, from each microphone in the microphone array, the sound signal that microphone receives. Specifically, each microphone may be connected to a wireless module, which transmits the sound signal received by the microphone to the server.
Based on the above embodiments, the embodiments of the present invention provide a basis for multi-source speech signal enhancement by acquiring sound signals received by each microphone in the distributed microphone array.
Fig. 2 is a schematic diagram of display contents of a display screen in a voice signal enhancement method based on a distributed microphone according to an embodiment of the present invention. Fig. 3 is a schematic diagram of a usage scenario in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention. Fig. 4 is a schematic diagram of a signal processing procedure of a distributed microphone array in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention. Fig. 5 is a schematic diagram of alignment of voice chirp signals in a voice signal enhancement method based on a distributed microphone according to an embodiment of the present invention. Fig. 6 is a schematic diagram of coarse-granularity alignment in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention. Fig. 7 is a schematic illustration of fine granularity alignment in a method for enhancing a voice signal based on a distributed microphone according to an embodiment of the present invention. The voice signal enhancement method based on the distributed microphone according to the embodiment of the invention is further described in detail below with reference to fig. 2 to 7.
In order to overcome the inherent defects of the traditional centralized microphone array, the embodiment of the invention provides a distributed microphone array named ChordMics. ChordMics utilizes distributed beamforming techniques to achieve highly controllable multi-source signal enhancement. Fig. 2 shows a schematic distribution of ChordMics on the display screen. Unlike prior-art centralized microphone arrays, the microphone nodes in ChordMics are distributed throughout the sound field, bringing rich spatial diversity and greatly improving coverage. In addition, since each microphone node is connected wirelessly (the microphones communicate with the server over wireless links), the nodes are not constrained by cabling, giving ChordMics strong scalability: the user can arbitrarily increase or decrease the array size and coverage simply by adding or removing microphones. More importantly, ChordMics enables highly controllable target signal enhancement, i.e., enhancement of the signal of a sound source near any point in the sound field.
Embodiments of the present invention are directed to implementing a distributed microphone system for multi-source target signal enhancement. Specifically, a plurality of microphones are deployed in a monitoring environment, and signal reinforcement and interference elimination for a given sound source are achieved through coherent superposition of the signals acquired by the microphones. With this system, as shown in fig. 3, the monitoring personnel simply sit remotely in front of a computer display and click a position on the screen, and the system enhances and plays the sound source near the clicked target area.
The implementation principle of the embodiment of the present invention is described in detail below.
Beamforming makes full use of spatial information: by delaying and combining the multichannel signals, it strengthens the signal from a specific direction and attenuates signals from other directions, thereby achieving speech enhancement. Specifically, as shown in fig. 4, three microphones are placed at equal intervals of distance d. If a sound source is far enough away, its wavefront can be regarded as a plane and the arrival paths to the microphones can be regarded as approximately parallel. Let the angle between the propagation path and the microphone line be θ; the relative delay between two adjacent microphones is then τ = d·cos(θ)/c, where c is the sound velocity. By compensating for the relative delay of each microphone signal and superimposing, the speech signal can be enhanced, i.e. the beamformed output is

$y(t) = \sum_{m=1}^{M} x_m\left(t + (m-1)\tau\right)$

where M is the total number of microphones and $x_m$ is the speech signal of the m-th microphone.
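The delay-compensate-and-sum operation above can be sketched in a few lines. This is a hedged illustration with synthetic signals and integer-sample delays, not the patent's implementation; `np.roll` is used for brevity, which wraps circularly at the array edges.

```python
# Hedged sketch of delay-and-sum beamforming: compensate each microphone's
# relative delay (in samples) and average. Signals and delays are synthetic.
import numpy as np

def delay_and_sum(signals, delays_samples):
    """signals: list of 1-D arrays; delays_samples[m]: integer delay of mic m
    relative to the reference mic. Shift each channel back by its delay, sum."""
    out = np.zeros_like(signals[0], dtype=float)
    for x, d in zip(signals, delays_samples):
        out += np.roll(x, -d)  # integer-sample compensation (circular for brevity)
    return out / len(signals)

fs, f0 = 16000, 400.0
t = np.arange(1024) / fs
clean = np.sin(2 * np.pi * f0 * t)
delays = [0, 3, 6]                      # m*tau in samples for m = 0, 1, 2
mics = [np.roll(clean, d) for d in delays]
enhanced = delay_and_sum(mics, delays)  # channels realign coherently
```

After compensation the three channels add coherently, so the output reproduces the clean signal; uncompensated summation would partially cancel it.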
It can be seen from the above that the most central problem in achieving multi-microphone enhancement is to calculate the time delay of the target signal relative to each microphone, so that the individual signals can be aligned. For a distributed microphone scenario, however, the following practical problems arise: (1) there is significant clock misalignment between the nodes, so the signals cannot be aligned by absolute time stamps; (2) there is some measurement or deployment error in the node positions; and (3) the speed of sound is not strictly fixed and varies with temperature, making the relative delays difficult to calculate accurately.

To address these problems, the embodiment of the invention provides a method combining coarse-grained alignment and fine-grained alignment to accurately lock onto the target signal.
(1) Coarse-grained alignment

The first task of ChordMics is to bound the error of the relative delay estimate. The estimation error consists of three parts: the time synchronization error, the distance measurement error, and the uncertainty of the sound speed (the sound speed changes with air temperature). Without a time synchronization mechanism, time synchronization errors gradually accumulate, so the upper limit of the estimation error cannot be determined; on the other hand, the currently prevailing time synchronization mechanisms fall far short of the precision ChordMics requires. To solve this problem, an additional voice chirp signal is introduced as a reference signal, thereby eliminating the time synchronization error. For an easier understanding of this design, consider a simple example:
as shown in fig. 5, assume that an additional voice signal source is placed at each target sound source, broadcasting a voice chirp reference signal. Because the target signal and the chirp signal propagate along the same path, the relative delays of the two signals to each microphone are identical. Thus, first aligning the chirp signals also aligns the target signals, and the problem of calculating the relative delay is converted into the problem of detecting the chirp signal.
The chirp signal is chosen as the reference because it is very sensitive to misalignment. Specifically, a chirp is a sinusoidal signal whose frequency varies linearly and rapidly with time. Misalignment of two chirp signals in the time domain causes a sharp drop in the cross-power spectrum intensity, so the cross-power spectrum exhibits a very narrow peak, which facilitates accurate alignment of the chirp signals.
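The narrow-peak behaviour is easy to demonstrate numerically: cross-correlating a linear chirp with a delayed copy of itself yields a single sharp maximum at the true sample offset. The sweep parameters below are illustrative, not those of the patent.

```python
# Demo of why a chirp makes a good reference: cross-correlating a linear
# chirp with a delayed copy recovers the delay from a single narrow peak.
import numpy as np

fs = 16000
dur = 0.05                                # 50 ms chirp
t = np.arange(int(dur * fs)) / fs
f0, f1 = 500.0, 3000.0
k = (f1 - f0) / dur                       # linear sweep rate (Hz/s)
chirp = np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))

true_offset = 37                          # extra delay, in samples
delayed = np.concatenate([np.zeros(true_offset), chirp])

# Full cross-correlation; the peak index recovers the delay.
cc = np.correlate(delayed, chirp, mode="full")
est_offset = int(np.argmax(cc)) - (len(chirp) - 1)
```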
Now consider the more general problem in which the chirp signal is not co-located with the target signal. Without loss of generality, a two-microphone scenario is used to introduce ChordMics (as in fig. 6). In this example, the target signal and the chirp signal are located at two different positions, and the two microphone nodes receive the target signal at different times.
Let $x_A(t)$ and $x_B(t)$ denote the speech signals received by microphones A and B, respectively, and let $\tilde{x}_A(t)$ and $\tilde{x}_B(t)$ denote the signals after alignment of the voice chirp signal:

$\tilde{x}_A(t) = x_A\left(t + \frac{index_A}{F_s}\right),\quad \tilde{x}_B(t) = x_B\left(t + \frac{index_B}{F_s}\right) \quad (1)$

Here $index_A$ and $index_B$ denote the sample positions of the chirp signal in $x_A(t)$ and $x_B(t)$, respectively, $F_s$ denotes the sampling rate of the microphones, and $\delta^c_{AB}$ denotes the relative delay of microphones A and B with respect to the chirp signal. From fig. 6 it can be seen that

$\delta^c_{AB} = \frac{index_A - index_B}{F_s}$

It is emphasized that the goal is to align the target sound source, that is, to find the relative delay $\delta^s_{AB}$ of the microphones with respect to the target sound source. From fig. 6 it can also be seen that

$\Delta_{AB} = \delta^c_{AB} - \delta^s_{AB} \quad (2)$
According to the propagation speed of the signal,

$\delta^c_{AB} = \frac{d_A^c - d_B^c}{c} \quad (3)$

Here c denotes the sound velocity. Likewise,

$\delta^s_{AB} = \frac{d_A^s - d_B^s}{c} \quad (4)$

Substituting formulas (3) and (4) into formula (2):

$\Delta_{AB} = \frac{\left(d_A^c - d_B^c\right) - \left(d_A^s - d_B^s\right)}{c} \quad (5)$
Equation (1) shows that aligning the chirp signal eliminates $\delta^c_{AB}$ from consideration; by equation (2), as long as $\Delta_{AB}$ can be obtained, $\delta^s_{AB}$ can be obtained. In practice, however, the positions of the microphones and of the chirp source are inaccurate, and the sound speed itself varies with temperature, so it is difficult to calculate $\Delta_{AB}$ precisely.
Note, however, that even if $\Delta_{AB}$ cannot be calculated precisely, an upper bound on its estimation error can be determined, which helps the following steps of the embodiment finally determine $\delta^s_{AB}$. The upper error bound of the estimate of $\Delta_{AB}$ is

$\varepsilon = \frac{4 e_d}{c_{min}} + \left|\left(d_A^c - d_B^c\right) - \left(d_A^s - d_B^s\right)\right|\left(\frac{1}{c_{min}} - \frac{1}{c_{max}}\right) \quad (6)$

Here $e_d$ is the upper bound of the distance measurement error, and $c_{min}$ and $c_{max}$ denote the minimum and maximum possible values of the sound velocity. It can be estimated that, in a room 20 meters in length and width, $\varepsilon$ is less than 20 milliseconds.
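The 20-millisecond figure can be checked with a rough back-of-the-envelope computation. The values below are assumptions chosen for illustration: a 0.1 m distance-error bound and sound speeds spanning roughly 0-30 °C; the bound takes the worst case over all four distance errors and the speed range.

```python
# Rough numeric check of the claimed error bound under assumed values:
# distance-error bound e_d, a 20 m x 20 m room, sound speed in [331, 350] m/s.
import math

e_d = 0.1                 # assumed distance measurement error bound (m)
c_min, c_max = 331.0, 350.0
diagonal = math.hypot(20.0, 20.0)
num_max = 2 * diagonal    # |(dA_c - dB_c) - (dA_s - dB_s)| cannot exceed this

eps = 4 * e_d / c_min + num_max * (1 / c_min - 1 / c_max)
# eps stays comfortably under the 20 ms figure quoted in the text
```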
(2) Fine-grained alignment
The method for accurately determining the relative delay is described below. Consider the chirp-aligned speech signals $\tilde{x}_A(t)$ and $\tilde{x}_B(t)$ defined by (1). Their cross-correlation function (CCF) is defined as

$Cor_{AB}(p) = \sum_{t} \tilde{x}_A(t)\, \tilde{x}_B(t + p) \quad (7)$

Obviously, when $p = \Delta_{AB}$ the two signals are perfectly aligned and $Cor_{AB}(p)$ exhibits a peak. Naturally, $\Delta_{AB}$ can be found by

$\Delta_{AB} = \arg\max_{p \in [\hat{\Delta}_{AB} - \varepsilon,\ \hat{\Delta}_{AB} + \varepsilon]} Cor_{AB}(p) \quad (8)$

Here $\hat{\Delta}_{AB}$ is the rough estimate obtained by substituting the coarse distances and sound velocity into the formula for $\Delta_{AB}$, and $\varepsilon$ is the maximum estimation error of $\Delta_{AB}$; $\hat{\Delta}_{AB}$ is referred to as the coarse estimate of $\Delta_{AB}$, and $[\hat{\Delta}_{AB} - \varepsilon,\ \hat{\Delta}_{AB} + \varepsilon]$ is referred to as the maximum error window.
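The windowed argmax of formula (8) is a one-liner over a lag grid. The sketch below uses a synthetic CCF with one peak inside the window and a larger spurious structure outside it; all arrays and parameters are illustrative.

```python
# Sketch of the fine-grained step: search the cross-correlation only inside
# the maximum error window around the coarse estimate. Arrays are synthetic.
import numpy as np

def peak_in_window(ccf_lags, ccf_vals, center, half_width):
    """Return the lag of the highest CCF value within [center-hw, center+hw]."""
    mask = np.abs(ccf_lags - center) <= half_width
    idx = np.flatnonzero(mask)
    return ccf_lags[idx[np.argmax(ccf_vals[idx])]]

lags = np.arange(-50, 51)
vals = np.exp(-0.5 * ((lags - 12) / 2.0) ** 2)          # true peak at lag 12
vals += 0.8 * np.exp(-0.5 * ((lags + 30) / 2.0) ** 2)   # spurious peak outside window
best = peak_in_window(lags, vals, center=10, half_width=8)
```

Restricting the search to the window is what makes the coarse error bound useful: the spurious peak at lag -30 is never considered.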
However, an actual sound field contains multiple sound sources, and there may be multiple peaks within the maximum error window, so ChordMics cannot directly determine which peak corresponds to the target signal. Specifically, suppose there are two sound sources, a target sound source $s_D(t)$ and an interfering sound source $s_I(t)$; the signal received by a microphone $\omega \in \Omega = \{A, B\}$ can be expressed as

$x_\omega(t) = \alpha_\omega^D\, s_D\left(t - t_\omega^D\right) + \alpha_\omega^I\, s_I\left(t - t_\omega^I\right) + n_\omega(t) \quad (9)$

Here $\alpha_\omega^D$ and $\alpha_\omega^I$ denote the attenuation coefficients of the two signals, $t_\omega^D$ and $t_\omega^I$ denote the propagation delays of the two sound sources to microphone $\omega$, and $n_\omega(t)$ denotes noise.

The CCF between $x_A(t)$ and $x_B(t)$ is obtained by substituting (9) into the definition (7); for the chirp-aligned signals it is, neglecting the noise and cross terms, approximately

$Cor_{AB}(p) \approx \alpha_A^D \alpha_B^D\, R_{s_D}\left(p - \Delta_{AB}^D\right) + \alpha_A^I \alpha_B^I\, R_{s_I}\left(p - \Delta_{AB}^I\right) \quad (11)$

From the above it can be seen that $Cor_{AB}$ has peaks at the two positions $\Delta_{AB}^D$ and $\Delta_{AB}^I$. Obviously, when $\Delta_{AB}^I$ falls within the estimation window of $\Delta_{AB}^D$, it is difficult to judge directly which peak in the window belongs to the target signal. Here $R_{s_D}$ denotes the autocorrelation function of $s_D(t)$, and $R_{s_I}$ denotes the autocorrelation function of $s_I(t)$.
To solve this problem, the embodiment of the invention proposes a method of continuous disambiguation, which fully exploits the diversity of the distributed microphone set to iteratively determine the location of the target signal's peak. Specifically, taking fig. 7 as an example, the target signal reaches microphones A, B and C at times $t_A$, $t_B$ and $t_C$, respectively. The relative delay of microphones A and B with respect to the target signal is

$\delta^s_{AB} = t_A - t_B$

from which it further follows that

$\delta^s_{AB} = (t_A - t_C) - (t_B - t_C) = \delta^s_{AC} - \delta^s_{BC} \quad (12)$

Likewise, the relative delays with respect to the chirp signal satisfy

$\delta^c_{AB} = \delta^c_{AC} - \delta^c_{BC} \quad (13)$

Subtracting formula (12) from formula (13) gives

$\Delta_{AB} = \Delta_{AC} - \Delta_{BC} \quad (14)$
The above equation reveals a very important relation between the relative delays (the equation carries no source label, indicating that it applies to all sources). Using this relation, the target signal can be determined: as long as the CCF of some pair of microphones has only one peak in its maximum error window, the peaks corresponding to the target signal between the other microphone pairs can be found iteratively. Fig. 7 gives a specific example (in this example, the CCFs have been normalized by the corresponding coarse estimates of the target signal): looking at the CCF of microphones B and C, there is only one peak within the maximum error window (the delay time difference estimation window). Because the maximum error window is guaranteed to contain a peak of the target signal, that single peak must be the peak of the target signal, which yields the precise value of $\Delta_{BC}$. Further, according to formula (14), among the candidate peaks of $\Delta_{AB}$ and $\Delta_{AC}$ in their respective estimation windows, only the peak pair satisfying formula (14) can be determined as $\Delta_{AB}$ and $\Delta_{AC}$. This process can be extended to multiple microphones. Fig. 7 is merely an example for illustrating the calculation of the delay time difference; note that the delay time difference estimation windows corresponding to the cross-correlation functions of different microphone pairs are not necessarily the same.
In summary, the method of continuous disambiguation proceeds as follows:

1. calculate the CCF of each pair of microphone signals;

2. find a CCF that has only one peak within its maximum error window, and determine that peak as the target peak;

3. iteratively find the target-signal peaks of the other microphone pairs using formula (14).
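For the three-microphone case of fig. 7, the disambiguation step can be sketched as follows: given the unambiguous value $\Delta_{BC}$, keep only the candidate pair that satisfies $\Delta_{AB} = \Delta_{AC} - \Delta_{BC}$. The candidate values below are synthetic and purely illustrative.

```python
# Hedged sketch of the continuous-disambiguation step for three microphones:
# use the consistency relation Delta_AB = Delta_AC - Delta_BC (formula (14))
# to pick the correct peaks among the candidates. Peak values are synthetic.

def resolve_three_mics(cand_AB, cand_AC, delta_BC, tol=1e-4):
    """delta_BC: the unambiguous delay difference of microphones B and C.
    Return the (Delta_AB, Delta_AC) candidate pair satisfying formula (14)."""
    for d_ab in cand_AB:
        for d_ac in cand_AC:
            if abs(d_ab - (d_ac - delta_BC)) < tol:
                return d_ab, d_ac
    return None  # no consistent pair found

# Two peaks in each ambiguous window; only one combination is consistent.
pair = resolve_three_mics(cand_AB=[0.001, 0.0035],
                          cand_AC=[0.0075, 0.010],
                          delta_BC=0.004)
```

A full implementation would repeat this check pairwise, propagating each newly resolved value, until every microphone pair has a unique target peak.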
The ChordMics hardware implementation mainly comprises wireless nodes, microphone sensors and a server. In one embodiment, six Raspberry Pi boards (Raspberry Pi 3 Model B+) equipped with WiFi modules are used as wireless nodes. Each Raspberry Pi is connected to two microphones (12 microphones in total) through a USB interface. These 12 microphones are randomly distributed in a 10 m x 12 m room. Several JBL speakers serve as the target sound source and the interfering sound sources. All microphones and speakers are inexpensive commercial devices. The Raspberry Pis stream the signals collected by the microphones to the server, where all signal detection, alignment and enhancement are performed.
On the one hand, the ChordMics system introduces an additional chirp voice signal to achieve clock synchronization among the distributed nodes; by referring to the chirp signal, ChordMics can eliminate the clock errors between the nodes. On the other hand, by calculating the relative timing of the signals received by each microphone and exploiting the geometric diversity of the array, ChordMics can accurately find the relative delay of the target sound source between the microphones, so the speech signals of the microphones can be accurately aligned and coherently superimposed, thereby achieving enhancement of the target sound source and elimination of interference.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention. As shown in fig. 8, the server 1 includes a target sound source determining module 10, a voice chirp signal alignment module 20, a first delay time difference obtaining module 30, a second delay time difference obtaining module 40, a target sound source delay time obtaining module 50, and a target sound source voice signal alignment enhancing module 60, wherein: the target sound source determining module 10 is configured to: determining a target sound source to be subjected to voice signal enhancement; the voice chirp signal alignment module 20 is configured to: aligning voice chirp signals of sound signals received by two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is sent by a chirp voice signal source which is placed in the sound field in advance; the first delay time difference acquisition module 30 is configured to: acquiring peak information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any one of the delay time difference estimation windows has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the delay time difference estimation range, and the delay time difference corresponds to a peak value in the delay time difference estimation window; the second delay time difference acquisition module 40 is configured to: based on the determined delay time difference of every two microphones, iteratively acquiring the delay time difference of 
every other two microphones according to a relation of the delay time difference corresponding to every two microphones in three microphones and peak information in the delay time difference estimation window; the target sound source delay time acquisition module 50 is configured to: acquiring the first delay time of every two microphones in the distributed microphone array about the voice chirp signal, and acquiring the second delay time of every two corresponding microphones about the target sound source based on the first delay time and the delay time difference; the target audio source speech signal alignment enhancement module 60 is configured to: and aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting a deployment mode of the distributed microphone array, realizes clock synchronization of the distributed microphone array by utilizing the assistance of the voice chirp signal, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
Further, based on the above embodiment, the target sound source determining module 10 is specifically configured to: and determining a sound source corresponding to the sound source icon closest to the click position as the target sound source according to the click position on the display screen.
On the basis of the embodiment, according to the embodiment of the invention, the sound source corresponding to the sound source icon closest to the click position is determined as the target sound source according to the click position on the display screen, so that the convenience of acquiring the target sound source is improved.
Further, based on the above embodiment, the server further includes a sound signal acquisition module, where the sound signal acquisition module is configured to: sound signals received by each microphone in the distributed microphone array are acquired.
Based on the above embodiments, the embodiments of the present invention provide a basis for multi-source speech signal enhancement by acquiring sound signals received by each microphone in the distributed microphone array.
Fig. 9 is a schematic structural diagram of a voice signal enhancement system based on a distributed microphone according to an embodiment of the present invention. As shown in fig. 9, the system includes: a wireless node 2, a distributed microphone array 3, at least one sound source 4, a chirp voice signal source 5 and the server 1; wherein the wireless node 2 is connected to at least one microphone in the microphone array 3, and is configured to transmit a sound signal received by the connected microphone to the server 1.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting a deployment mode of the distributed microphone array, realizes clock synchronization of the distributed microphone array by utilizing the assistance of the voice chirp signal, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
The device provided in the embodiment of the present invention is used in the above method, and specific functions may refer to the above method flow, which is not described herein again.
Fig. 10 is a schematic physical structure of an electronic device according to an embodiment of the invention. As shown in fig. 10, the electronic device may include: a processor 1010, a communication interface (Communications Interface) 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may call logic instructions in memory 1030 to perform the following methods: determining a target sound source to be subjected to voice signal enhancement; aligning voice chirp signals of sound signals received by two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is sent by a chirp voice signal source which is placed in the sound field in advance; acquiring peak information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any one of the delay time difference estimation windows has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the delay time difference estimation range, and the delay time difference corresponds to a peak value in the delay time difference estimation window; based on the determined delay time difference of every two microphones, iteratively acquiring the delay time difference of every other two microphones according to a relation of the delay time difference corresponding to every two microphones in three microphones and peak information in the delay time difference estimation window; acquiring 
the first delay time of every two microphones in the distributed microphone array about the voice chirp signal, and acquiring the second delay time of every two corresponding microphones about the target sound source based on the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array.
Further, the logic instructions in the memory 1030 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the method provided in the above embodiments, for example, including: determining a target sound source to be subjected to voice signal enhancement; aligning voice chirp signals of sound signals received by two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is sent by a chirp voice signal source which is placed in the sound field in advance; acquiring peak information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any one of the delay time difference estimation windows has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the delay time difference estimation range, and the delay time difference corresponds to a peak value in the delay time difference estimation window; based on the determined delay time differences of the microphones, iteratively acquiring the delay time differences of other microphones according to the relation of the delay time differences corresponding to every two microphones in the three microphones and peak information in the delay time difference estimation window; acquiring the first delay time of every two microphones in the distributed microphone array about the voice chirp signal, and acquiring the second delay time of every two corresponding microphones about the target sound source based on the first 
delay time and the delay time difference; and aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for enhancing a speech signal based on a distributed microphone, comprising:
determining a target sound source to be subjected to voice signal enhancement;
aligning the voice chirp signals in the sound signals received by every two microphones of a distributed microphone array, and then computing a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance;
acquiring peak information of each cross-correlation function within a corresponding delay time difference estimation window, wherein if any delay time difference estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones; the delay time difference is the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is the window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak within that window;
based on the delay time differences already determined for some microphone pairs, iteratively acquiring the delay time differences of the remaining pairs according to the relation among the delay time differences of the three pairs formed by any three microphones, together with the peak information within the delay time difference estimation windows;
acquiring the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and acquiring the second delay time of every two corresponding microphones with respect to the target sound source based on the first delay time and the delay time difference;
aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array; the relation of the delay time difference corresponding to every two microphones in the three microphones is as follows:
Δτ_AB = Δτ_AC − Δτ_BC
wherein Δτ_AB represents the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source; Δτ_AC represents the difference between the delay times of microphone A and microphone C with respect to the speech chirp signal and with respect to the target sound source; Δτ_BC represents the difference between the delay times of microphone B and microphone C with respect to the speech chirp signal and with respect to the target sound source;
The expression of the difference between the delay time of any two microphones with respect to the voice chirp signal and with respect to the target sound source is:
Δτ_AB = [(d_A,c − d_A,s) − (d_B,c − d_B,s)] / c
wherein Δτ_AB represents the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source; d_A,c represents the distance between microphone A and the chirp speech signal source; d_A,s represents the distance between microphone A and the target sound source; d_B,c represents the distance between microphone B and the chirp speech signal source; d_B,s represents the distance between microphone B and the target sound source; and c represents the speed of sound;
the expression of the absolute value of the error of any two microphones with respect to the difference of the delay time of the voice chirp signal and with respect to the target sound source is:
|ε_AB| ≤ 4·e_d/c_min + |d_A,c − d_A,s − d_B,c + d_B,s|·(1/c_min − 1/c_max)
wherein |ε_AB| represents the absolute value of the error of the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source; e_d represents the upper bound of the distance measurement error; and c_min and c_max represent the minimum and maximum values of the sound velocity, respectively.
2. The method of claim 1, wherein the determining a target sound source for speech signal enhancement comprises:
determining, according to a click position on a display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
3. The method of distributed microphone-based speech signal enhancement according to claim 2, wherein prior to said determining a target sound source for speech signal enhancement, the method further comprises: sound signals received by each microphone in the distributed microphone array are acquired.
4. A server, comprising:
the target sound source determining module is used for: determining a target sound source to be subjected to voice signal enhancement;
a voice chirp signal alignment module for: aligning the voice chirp signals in the sound signals received by every two microphones of a distributed microphone array, and then computing a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance;
a first delay time difference acquisition module for: acquiring peak information of each cross-correlation function within a corresponding delay time difference estimation window, wherein if any delay time difference estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones; the delay time difference is the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is the window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak within that window;
a second delay time difference acquisition module for: based on the delay time differences already determined for some microphone pairs, iteratively acquiring the delay time differences of the remaining pairs according to the relation among the delay time differences of the three pairs formed by any three microphones, together with the peak information within the delay time difference estimation windows;
a target sound source delay time acquisition module for: acquiring the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and acquiring the second delay time of every two corresponding microphones with respect to the target sound source based on the first delay time and the delay time difference;
the target sound source voice signal alignment enhancement module is used for: aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array;
the relation of the delay time difference corresponding to every two microphones in the three microphones is as follows:
Δτ_AB = Δτ_AC − Δτ_BC
wherein Δτ_AB represents the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source; Δτ_AC represents the difference between the delay times of microphone A and microphone C with respect to the speech chirp signal and with respect to the target sound source; Δτ_BC represents the difference between the delay times of microphone B and microphone C with respect to the speech chirp signal and with respect to the target sound source;
the expression of the difference between the delay time of any two microphones with respect to the voice chirp signal and with respect to the target sound source is:
Δτ_AB = [(d_A,c − d_A,s) − (d_B,c − d_B,s)] / c
wherein Δτ_AB represents the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source; d_A,c represents the distance between microphone A and the chirp speech signal source; d_A,s represents the distance between microphone A and the target sound source; d_B,c represents the distance between microphone B and the chirp speech signal source; d_B,s represents the distance between microphone B and the target sound source; and c represents the speed of sound;
the expression of the absolute value of the error of any two microphones with respect to the difference of the delay time of the voice chirp signal and with respect to the target sound source is:
|ε_AB| ≤ 4·e_d/c_min + |d_A,c − d_A,s − d_B,c + d_B,s|·(1/c_min − 1/c_max)
wherein |ε_AB| represents the absolute value of the error of the difference between the delay times of microphone A and microphone B with respect to the speech chirp signal and with respect to the target sound source; e_d represents the upper bound of the distance measurement error; and c_min and c_max represent the minimum and maximum values of the sound velocity, respectively.
5. A distributed microphone-based speech signal enhancement system comprising a wireless node, a distributed microphone array, at least one sound source, a chirp speech signal source, and the server of claim 4; the wireless node is connected with at least one microphone in the microphone array and is used for transmitting sound signals received by the connected microphones to the server.
6. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the distributed microphone-based speech signal enhancement method according to any of claims 1 to 3.
7. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the distributed microphone based speech signal enhancement method according to any of claims 1 to 3.
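The claimed pipeline can be pictured end-to-end in two steps: (i) cross-correlate each microphone pair and search for a peak only inside the delay time difference estimation window, and (ii) when a pair's window is ambiguous, recover its delay time difference from the three-microphone relation Δτ_BC = Δτ_AC − Δτ_AB. A hedged sketch in Python/NumPy (the function names and the rectangular search window are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def windowed_tdoa(x, y, fs, window_s):
    """Return the lag (seconds) of the cross-correlation peak of x
    relative to y, searched only within +/- window_s around zero lag --
    the analogue of the delay time difference estimation window."""
    corr = np.correlate(x, y, mode="full")
    lags = np.arange(-(len(y) - 1), len(x))           # lag in samples
    mask = np.abs(lags) <= int(round(window_s * fs))  # restrict search
    return lags[mask][np.argmax(corr[mask])] / fs

def chain_pair(delta_ab, delta_ac):
    """Three-microphone consistency: the pair (B, C) need not be
    estimated directly -- it follows from the other two pairs."""
    return delta_ac - delta_ab
```

For impulses arriving at samples 40, 45 and 50 on microphones A, B and C (fs = 1000 Hz), `windowed_tdoa(x_b, x_a, fs, 0.02)` gives 0.005 s, `windowed_tdoa(x_c, x_a, fs, 0.02)` gives 0.010 s, and `chain_pair` reproduces the direct B–C estimate of 0.005 s, illustrating why ambiguous pairs can be resolved iteratively from pairs whose windows contain a unique peak.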
CN201911032121.3A 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone Active CN112735459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911032121.3A CN112735459B (en) 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone


Publications (2)

Publication Number Publication Date
CN112735459A CN112735459A (en) 2021-04-30
CN112735459B (en) 2024-03-26

Family

ID=75588832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911032121.3A Active CN112735459B (en) 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone

Country Status (1)

Country Link
CN (1) CN112735459B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409817B (en) * 2021-06-24 2022-05-13 浙江松会科技有限公司 Audio signal real-time tracking comparison method based on voiceprint technology
CN117014246A (en) * 2022-04-29 2023-11-07 青岛海尔科技有限公司 Control method of intelligent equipment, storage medium and electronic device

Citations (5)

Publication number Priority date Publication date Assignee Title
CN102347028A (en) * 2011-07-14 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
CN102800325A (en) * 2012-08-31 2012-11-28 厦门大学 Ultrasonic-assisted microphone array speech enhancement device
JP2014174393A (en) * 2013-03-11 2014-09-22 Research Organization Of Information & Systems Apparatus and method for voice signal processing
US9208794B1 (en) * 2013-08-07 2015-12-08 The Intellisis Corporation Providing sound models of an input signal using continuous and/or linear fitting
CN107017003A (en) * 2017-06-02 2017-08-04 厦门大学 A kind of microphone array far field speech sound enhancement device


Non-Patent Citations (3)

Title
A speech enhancement method based on Chirp atomic decomposition; Wu Mingqin, Yu Fengqin, Han; Microelectronics & Computer (12); 74-76 *
Time-frequency structure analysis of speech signals based on Chirp atomic decomposition; Wu Mingqin, Yu Fengqin; Journal of Jiangnan University (Natural Science Edition) (06); 685-687 *
Maximum likelihood parameter estimation of chirp signals based on MCMC; Lin Yan, Wang Xiutan, Peng Yingning, Xu Jia, Zhang, Xia Xianggen; Journal of Tsinghua University (Science and Technology) (04); 511-514 *

Also Published As

Publication number Publication date
CN112735459A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
EP2810453B1 (en) Audio source position estimation
EP1600791B1 (en) Sound source localization based on binaural signals
CN112735459B (en) Voice signal enhancement method, server and system based on distributed microphone
JP6741004B2 (en) Sound source position detecting device, sound source position detecting method, sound source position detecting program, and storage medium
CN103561387A (en) Indoor positioning method and system based on TDoA
CN103117064A (en) Processing signals
KR101777381B1 (en) Device for Estimating DOA of a target echo signal using Adaptive Filters in PCL receivers, and DOA Estimation Method using the same
CN102428717A (en) A system and method for estimating the direction of arrival of a sound
JP2008236077A (en) Target sound extracting apparatus, target sound extracting program
CN101685153B (en) Microphone space measuring method and device
Xu et al. Underwater acoustic source localization method based on TDOA with particle filtering
KR20110060183A (en) Signal processing apparatus and method for removing reflected wave generated by robot platform
KR20130046779A (en) Appratus and method for estimating direction of sound source
EP3182734B1 (en) Method for using a mobile device equipped with at least two microphones for determining the direction of loudspeakers in a setup of a surround sound system
KR100730297B1 (en) Sound source localization method using Head Related Transfer Function database
RU2476899C1 (en) Hydroacoustic complex to measure azimuthal angle and horizon of sound source in shallow sea
KR20190013264A (en) Location determination system and method of smart device using non-audible sound wave
JP2018032931A (en) Acoustic signal processing device, program and method
KR20160127259A (en) Configuration method of planar array sensor for underwater sound detection and underwater sound measurement system using thereof
JP6610224B2 (en) Bistatic active sonar device and its receiver
CN111337881B (en) Underwater target detection method utilizing propeller noise
JP2009130908A (en) Noise control method, apparatus, program, and its recording medium
CN110487282B (en) Time delay estimation correction method and system for improving baseline positioning robustness
Grubesa et al. The development and analysis of beamforming algorithms used for designing an acoustic camera
WO2022163307A1 (en) Information processing system, information processing method, and information processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant