CN112735459A - Voice signal enhancement method, server and system based on distributed microphones - Google Patents



Publication number
CN112735459A
Authority
CN
China
Prior art keywords
delay time
microphones
microphone
voice
sound source
Prior art date
Legal status
Granted
Application number
CN201911032121.3A
Other languages
Chinese (zh)
Other versions
CN112735459B (en)
Inventor
何源
王伟国
李金明
金梦
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201911032121.3A
Publication of CN112735459A
Application granted
Publication of CN112735459B
Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 — Processing in the time domain
    • G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window
    • G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 — Microphone arrays; Beamforming

Abstract

An embodiment of the invention provides a voice signal enhancement method, server, and system based on distributed microphones. The method comprises: determining a target sound source to be subjected to voice signal enhancement; aligning the voice chirp signals received by any two microphones and then computing the cross-correlation function of their sound signals; obtaining a delay time difference from the peak information of the cross-correlation function within a delay time difference estimation window; obtaining the delay time differences of the remaining microphone pairs from the relation among the delay time differences of every two of three microphones together with the peak information; and obtaining the delay times of pairs of microphones relative to the target sound source, so that the received signals can be aligned and enhanced with respect to that source. By deploying a distributed microphone array, the embodiment overcomes the shortcomings of existing centralized microphone arrays; clock synchronization of the distributed array is achieved with the aid of the voice chirp signal, and alignment of the array's voice signals and enhancement of the target sound source's signal are effectively realized.

Description

Voice signal enhancement method, server and system based on distributed microphones
Technical Field
The invention relates to the technical field of communication, in particular to a voice signal enhancement method, a server and a system based on distributed microphones.
Background
Currently, sound is an important input source for many diagnostic systems. Machine diagnosis in an industrial setting is a typical example: a machine emits different running sounds in different states, and an experienced inspector can determine its running state by listening to them. In practice, however, a factory floor is very noisy; various sounds interfere with one another, and the noise may even be louder than the target machine, which severely hampers the inspector's judgment. The inspector has to approach the machine and put an ear close to it to diagnose its condition. Obviously, working for long periods in such an extremely noisy environment greatly impairs the inspector's hearing.
The more mature speech enhancement techniques at present are based on beamforming with a centralized microphone array. However, these techniques have the following disadvantages: (1) Low resolution: when the directions of arrival (DOAs) of multiple sound sources are the same or close, a centralized microphone array can hardly distinguish the sources. (2) Limited coverage: although a centralized microphone array can improve coverage to some extent by increasing the number of microphones, the sound signal still exhibits very significant attenuation when the source is far from the array.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method, a server, and a system for enhancing a voice signal based on a distributed microphone.
In a first aspect, an embodiment of the present invention provides a distributed microphone-based speech signal enhancement method, including: determining a target sound source to be subjected to voice signal enhancement; aligning the voice chirp signals in the sound signals received by every two microphones in a distributed microphone array, and then computing the cross-correlation function of the sound signals received by the two microphones, wherein the voice chirp signal is sent by a chirp voice signal source placed in the sound field in advance; obtaining the peak information of each cross-correlation function in the corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak, the unique peak corresponds to the delay time difference of the two microphones, the delay time difference being the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source, and the delay time difference estimation window being the window corresponding to the estimation range of the delay time difference, within which the delay time difference corresponds to a peak; on the basis of the determined delay time differences of pairs of microphones, iteratively obtaining the delay time differences of the remaining pairs according to the relation among the delay time differences of every two of three microphones and the peak information in the delay time difference estimation windows; obtaining the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and obtaining the second delay time of every two microphones with respect to the target sound source based on the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array.
Further, the relationship among the delay time differences corresponding to every two of three microphones is:

Δ_AB + Δ_BC = Δ_AC

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source; Δ_AC represents the difference between the delay times of microphone A and microphone C with respect to the voice chirp signal and with respect to the target sound source; and Δ_BC represents the difference between the delay times of microphone B and microphone C with respect to the voice chirp signal and with respect to the target sound source.
Further, the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is expressed as:

Δ_AB = [(d_A^C − d_B^C) − (d_A^S − d_B^S)] / c

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, d_A^C represents the distance between microphone A and the chirp voice signal source, d_A^S represents the distance between microphone A and the target sound source, d_B^C represents the distance between microphone B and the chirp voice signal source, d_B^S represents the distance between microphone B and the target sound source, and c represents the speed of sound.
Further, the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source satisfies:

|ε_AB| ≤ 4·e_d/c_min + |D_AB|·(1/c_min − 1/c_max)

wherein ε_AB represents the error of the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, D_AB = (d_A^C − d_B^C) − (d_A^S − d_B^S) is the distance term of the preceding expression, e_d represents an upper bound of the distance measurement error, and c_min and c_max represent the minimum and maximum values of the speed of sound, respectively.
Further, the determining a target sound source to be subjected to speech signal enhancement comprises: determining, according to a click position on a display screen, the sound source whose icon is closest to the click position as the target sound source.
Further, before the determining a target sound source to be subjected to speech signal enhancement, the method further comprises: acquiring the sound signals received by all microphones in the distributed microphone array.
In a second aspect, an embodiment of the present invention provides a server, including: a target audio source determination module to: determining a target sound source to be subjected to voice signal enhancement; a voice chirp signal alignment module configured to: aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance; a first delay time difference acquisition module configured to: acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window; a second delay time difference obtaining module, configured to: iteratively acquiring the delay time differences of other two microphones according to a relation of the delay time differences corresponding to the two microphones in the three microphones and peak information in the delay time difference estimation window on the basis of the determined delay time differences of the two microphones; a target audio source delay time acquisition module for: acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay 
time of each two microphones relative to the target sound source based on the first delay time and the delay time difference; a target source speech signal alignment enhancement module for: and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
In a third aspect, an embodiment of the present invention provides a speech signal enhancement system based on distributed microphones, including: the system comprises a wireless node, a distributed microphone array, at least one sound source, a chirp voice signal source and a server; wherein the wireless node is connected with at least one microphone in the microphone array and is used for transmitting the sound signals received by the connected microphone to the server.
In a fourth aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the computer program.
In a fifth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method as provided in the first aspect.
According to the distributed microphone-based voice signal enhancement method, the server and the system, the defects of the existing centralized microphone array are overcome by adopting the arrangement mode of the distributed microphone array, the clock synchronization of the distributed microphone array is realized by the aid of the voice chirp signals, and the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source are effectively realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating display contents of a display screen in the distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 3 is a schematic view of a usage scenario in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a signal processing process of a distributed microphone array in the distributed microphone based speech signal enhancement method according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating alignment of a voice chirp signal in a distributed microphone-based voice signal enhancement method according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating coarse grain alignment in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
fig. 7 is a schematic diagram of fine-grained alignment in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention;
FIG. 8 is a block diagram of a server according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a distributed microphone-based speech signal enhancement system according to an embodiment of the present invention;
fig. 10 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step 101, determining a target sound source to be subjected to voice signal enhancement;
the voice signal enhancement method based on the distributed microphone is not only suitable for enhancing single-voice source voice signals, but also suitable for enhancing multi-voice source voice signals. The method provided by the embodiment of the invention is operated on the server. When only one sound source in the sound field needs to be monitored, the sound source can be directly used as a fixed target sound source to enhance signals for monitoring. When a sound field with multiple sound sources needs to be monitored, a target sound source to be subjected to speech signal enhancement needs to be determined. Since the enhancement of the voice signal of a certain sound source is to hear the sound of the corresponding sound source more clearly, the enhancement of the voice signal of the remaining sound sources is not performed in the process of enhancing the voice signal of the certain sound source.
The method for determining the target sound source to be subjected to speech signal enhancement can be preset and can be realized by adopting various methods. For example, the selection may be performed in a sound source list.
Step 102, aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance;
because the position of each microphone in the distributed microphone array is different, the sound signals of the same sound source received by each microphone have time difference, so that the time delay time of the sound signals of the target sound source received by each microphone needs to be calculated, the sound signals of the target sound source received by each microphone can be aligned, and after the time alignment, the sound signals of the target sound source received by each microphone are superposed, so that the sound signals of the target sound source can be enhanced.
Here, the voice chirp signal is introduced as a reference signal to assist alignment. A chirp is a sinusoidal signal whose frequency varies rapidly and linearly with time, which makes it very sensitive to misalignment: any misalignment of two voice chirp signals in the time domain causes a sharp drop in the cross-power spectral strength, producing a very narrow peak in the cross-power spectrum, so the chirp signals can be aligned accurately.
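The sensitivity of a chirp to misalignment can be illustrated with a short sketch; the sampling rate, sweep range, and shift below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

# Illustrative parameters: 16 kHz sampling, a 0.5 s linear chirp
# sweeping 500 Hz -> 4000 Hz.
fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
f0, f1 = 500.0, 4000.0
k = (f1 - f0) / t[-1]                       # sweep rate (Hz/s)
chirp = np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))

def norm_corr(a, b):
    """Normalized correlation of two equal-length frames."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

aligned = norm_corr(chirp, chirp)           # perfectly aligned
shifted = norm_corr(chirp[:-8], chirp[8:])  # misaligned by 8 samples (0.5 ms)
# the correlation collapses even for a sub-millisecond misalignment
```

The collapse from near 1.0 to a small value for a 0.5 ms shift is what makes the chirp a sharp alignment reference.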
After the voice chirp signals in the sound signals received by every two microphones in the distributed microphone array are aligned, the cross-correlation function (CCF) of the sound signals received by any two microphones is computed.
Step 103, obtaining peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window;
the sound signals received by the two microphones include a sound signal of a target sound source, a sound signal of a chirp voice signal source, a sound signal of another sound source (if any), a sound signal of an interfering sound source (if any), and the like. To obtain the delay time of the target sound source, a delay time difference estimation window corresponding to the target sound source needs to be set. The voice chirp signals received by the two microphones have a delay time, which is called as a first delay time; the sound signals of the target sound source received by the two microphones also have a delay time, which is called a second delay time. The difference between the first delay time and the second delay time is defined as a delay time difference. The delay time difference estimation window is a window corresponding to the estimation range of the delay time difference. The true values of the delay time differences correspond to peaks in the respective delay time difference estimation windows.
Since the delay time difference estimation windows of different sound sources may overlap, several peaks may appear in the estimation window of the target sound source; however, these peaks necessarily include the one corresponding to the target source's delay time difference. The peak information of each cross-correlation function within its estimation window is therefore obtained, and if any estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones with respect to the target sound source.
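A minimal sketch of this peak-in-window selection follows; the toy correlation values and window bounds are invented for illustration:

```python
import numpy as np

def peaks_in_window(ccf, lags, lo, hi):
    """Lags of the local maxima of a cross-correlation function that fall
    inside the delay time difference estimation window [lo, hi]."""
    is_peak = (ccf[1:-1] > ccf[:-2]) & (ccf[1:-1] > ccf[2:])
    peak_lags = lags[1:-1][is_peak]
    return peak_lags[(peak_lags >= lo) & (peak_lags <= hi)]

# Toy CCF with peaks at lags -3 and +5; the window [2, 8] isolates the latter.
lags = np.arange(-10, 11)
ccf = np.exp(-0.5 * (lags + 3) ** 2) + 0.8 * np.exp(-0.5 * (lags - 5) ** 2)
cands = peaks_in_window(ccf, lags, 2, 8)
# a unique in-window peak -> it is taken as the pair's delay time difference
```

When the window contains more than one peak, the three-microphone constraint described in step 104 is used to disambiguate.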
Step 104, on the basis of the determined delay time difference of one pair of microphones, iteratively obtaining the delay time differences of the remaining pairs according to the relation among the delay time differences of every two of three microphones and the peak information in the delay time difference estimation windows;
the relation of the delay time difference corresponding to every two of the three microphones can be obtained through calculation, and the relation shows the restriction relation of the delay time difference corresponding to every two of the three microphones. This relationship or constraint can be obtained according to the existing techniques, as long as the expression is correct and not limited.
The delay time difference estimation window corresponds to the estimation range of the difference between the first delay time of two microphones with respect to the voice chirp signal and their second delay time with respect to the target sound source. Thus, each pair of microphones has its own delay time difference estimation window, and the peak in that window corresponds to the pair's difference between the first and second delay times.
Similarly, several peaks may appear in each delay time difference estimation window; however, based on the above relation among the delay time differences of every two of three microphones, the peak corresponding to the difference between the first delay time (with respect to the voice chirp signal) and the second delay time (with respect to the target sound source) of the two microphones can be identified, i.e., the delay time differences of the other microphone pairs can be obtained.
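As a sketch of this disambiguation, the three-microphone constraint (reconstructed here as Δ_AB + Δ_BC = Δ_AC) can prune candidate peaks for the remaining pairs once one pairwise delay time difference is fixed; the function, names, and values are illustrative assumptions:

```python
def resolve(delta_ab, candidates_bc, candidates_ac, tol=1e-3):
    """Given a known delay time difference for pair (A, B), pick the
    candidate peaks for pairs (B, C) and (A, C) that satisfy the
    constraint delta_ab + delta_bc == delta_ac within a tolerance."""
    for bc in candidates_bc:
        for ac in candidates_ac:
            if abs(delta_ab + bc - ac) <= tol:
                return bc, ac
    return None  # no consistent pair of candidates found

# Δ_AB is known to be 2.0 ms; among the candidate peaks only
# (Δ_BC, Δ_AC) = (1.5, 3.5) satisfies 2.0 + 1.5 = 3.5
pair = resolve(2.0, [1.5, -0.7], [0.9, 3.5])
```

Each newly resolved pair can in turn anchor the constraint for further pairs, which is the iterative procedure of step 104.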
Step 105, obtaining the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and obtaining the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference;
the delay time difference is a difference value between the first delay time with respect to the voice chirp signal and the second delay time with respect to the target sound source. And the first delay times of the two microphones with respect to the voice chirp signal are easy to obtain, the second delay times of the respective two microphones with respect to the target sound source are obtained based on the first delay time and the delay time difference, that is, the second delay times of any two microphones with respect to the target sound source can be obtained.
Step 106, aligning and enhancing the sound received by each microphone with respect to the target sound source according to the second delay time between every two microphones of the distributed microphone array;
after the second delay time of each two microphones with respect to the target sound source is obtained, the delay information of the sound signal of the target sound source received between the microphones is clarified, so that the sound received by each microphone can be aligned with respect to the target sound source according to the second delay time between each two microphones of the distributed microphone array, and then the signals after alignment are superposed to enhance the sound signal of the target sound source.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting the deployment mode of the distributed microphone array, realizes the clock synchronization of the distributed microphone array by utilizing the voice chirp signal assistance, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
Further, based on the above embodiment, the relationship among the delay time differences corresponding to every two of three microphones is:

Δ_AB + Δ_BC = Δ_AC

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, i.e., the delay time difference corresponding to microphones A and B; Δ_AC represents the corresponding difference for microphones A and C; and Δ_BC represents the corresponding difference for microphones B and C.
The delay time differences corresponding to every two of any three microphones satisfy the above formula; A, B, and C are used only to distinguish the microphones and do not refer to specific ones.
As can be seen from the above relationship, the delay time differences corresponding to every two of any three microphones satisfy a simple constraint. Thus, once the delay time differences of one or two microphone pairs are known, the unknown delay time differences of the remaining pairs can be determined from the peaks of the cross-correlation functions in the corresponding delay time difference estimation windows.
On the basis of the above embodiment, the embodiment of the present invention provides this simple constraint among the delay time differences of every two of three microphones, which makes it quick and easy to obtain an unknown pairwise delay time difference from the known ones.
Further, based on the above-described embodiment, the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is expressed as:

Δ_AB = [(d_A^C − d_B^C) − (d_A^S − d_B^S)] / c

wherein Δ_AB represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, d_A^C represents the distance between microphone A and the chirp voice signal source, d_A^S represents the distance between microphone A and the target sound source, d_B^C represents the distance between microphone B and the chirp voice signal source, d_B^S represents the distance between microphone B and the target sound source, and c represents the speed of sound.
The difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source satisfies the above formula; A and B are used only to distinguish the microphones and do not refer to specific ones.
It can be seen that this difference can be calculated from the distances between the two microphones and the chirp voice signal source, the distances between the two microphones and the target sound source, and the speed of sound.
Since the distance calculation has errors and the sound velocity changes with the temperature, the delay time difference between any two microphones cannot be directly calculated by using the formula of the difference between the delay times of any two microphones with respect to the voice chirp signal and the target sound source, but the delay time difference is obtained by adopting the method of corresponding the peak values through the cross-correlation function. However, it is necessary to calculate the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source to obtain the delay time difference estimation window, because the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the range of the delay time difference can be estimated based on the result of the above calculation to obtain the delay time difference estimation window.
On the basis of the above embodiments, the embodiments of the present invention provide a basis for obtaining the delay time difference estimation window by giving an expression of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source.
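The coarse estimate described above can be sketched as follows; the distances and nominal sound speed are hypothetical values, not taken from the embodiment:

```python
# Coarse estimate of the delay time difference from measured distances
# (hypothetical distances in metres; nominal sound speed).
c = 343.0                        # nominal speed of sound, m/s

d_A_chirp, d_A_src = 6.2, 4.8    # mic A to chirp source / to target source
d_B_chirp, d_B_src = 3.1, 7.5    # mic B to chirp source / to target source

# delta_AB = ((d_A^S - d_A^C) - (d_B^S - d_B^C)) / c
delta_AB_coarse = ((d_A_src - d_A_chirp) - (d_B_src - d_B_chirp)) / c
```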
Further, based on the above-described embodiment, the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is expressed as:

$$\varepsilon_{AB} = \frac{4 e_d}{c_{\min}} + \left(\left|d_A^S - d_A^C\right| + \left|d_B^S - d_B^C\right|\right)\left(\frac{1}{c_{\min}} - \frac{1}{c_{\max}}\right)$$

where $\varepsilon_{AB}$ represents the absolute value of the error of the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, $e_d$ denotes the upper bound of the distance measurement error, and $c_{\min}$ and $c_{\max}$ respectively denote the minimum and maximum values of the speed of sound.
The absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source is related to the distance measurement error, the minimum value and the maximum value of the sound velocity, and can be represented by the above equation.
The embodiment of the present invention provides an expression for the absolute value $\varepsilon_{AB}$ of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source, that is, the upper and lower limits of the error range. The range of the delay time difference estimation window can therefore be determined, ensuring that the peak corresponding to the delay time difference lies within the delay time difference estimation window.
On the basis of the above embodiments, the embodiment of the present invention obtains the range of the delay time difference estimation window from the expression for the absolute value of the error of the difference between the delay times of any two microphones with respect to the voice chirp signal and with respect to the target sound source, and ensures that the peak corresponding to the delay time difference appears within that window. Consequently, when a delay time difference estimation window contains a unique peak, that peak is determined to correspond to the delay time difference of the corresponding two microphones.
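A sketch of how the error bound and the resulting estimation window might be computed, under the assumption that the four measured distances each carry at most e_d of error and the sound speed lies in [c_min, c_max]; all numbers are hypothetical:

```python
# Error bound and estimation window sketch, assuming each of the four
# measured distances is off by at most e_d and the sound speed lies in
# [c_min, c_max] (all numbers hypothetical).
e_d = 0.10                       # m, per-distance measurement error bound
c_min, c_max = 331.0, 350.0      # m/s

d_A_chirp, d_A_src = 6.2, 4.8
d_B_chirp, d_B_src = 3.1, 7.5

# First term: four distances, each off by at most e_d.
# Second term: uncertainty in 1/c scaled by the path-difference magnitudes.
eps_AB = (4 * e_d / c_min
          + (abs(d_A_src - d_A_chirp) + abs(d_B_src - d_B_chirp))
          * (1 / c_min - 1 / c_max))

delta_hat = ((d_A_src - d_A_chirp) - (d_B_src - d_B_chirp)) / 343.0
window = (delta_hat - eps_AB, delta_hat + eps_AB)   # estimation window
```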
Further, based on the above embodiment, the determining of a target sound source to be subjected to speech signal enhancement includes: determining, according to the click position on the display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
For more convenient operation and visualization, each sound source can be represented by a different icon displayed on the display screen of the server. When monitoring personnel want to obtain the enhanced voice of a certain sound source, they can click the corresponding sound source icon or click near it. After receiving the click information from the display screen, the server obtains the click position on the display screen and determines the sound source corresponding to the sound source icon closest to the click position as the target sound source.
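A minimal sketch of the nearest-icon selection (icon names and coordinates hypothetical):

```python
# Hypothetical nearest-icon selection for the click-to-select interaction.
import math

icons = {                     # hypothetical icon name -> screen coordinates
    "source_1": (120, 340),
    "source_2": (560, 90),
    "source_3": (300, 500),
}
click = (540, 110)            # click position on the display screen

# Pick the sound source whose icon is closest to the click.
target = min(icons, key=lambda name: math.dist(icons[name], click))
```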
On the basis of the above embodiment, according to the embodiment of the present invention, the sound source corresponding to the sound source icon closest to the click position is determined as the target sound source according to the click position on the display screen, so that convenience in obtaining the target sound source is improved.
Further, based on the above embodiment, before the determining of the target sound source to be subjected to speech signal enhancement, the method further includes: acquiring the sound signals received by each microphone in the distributed microphone array.
The server performs speech enhancement processing of the target sound source, and naturally needs to acquire a sound signal of the target sound source. Each of the sound sources disposed in the sound field may become a target sound source, and sound signals of each of the sound sources are received by the microphones. Therefore, the server needs to acquire the sound signals it receives from each microphone in the microphone array. Specifically, the microphone may be connected to the wireless module, and then the sound signal received by the microphone may be transmitted to the server through the wireless module.
On the basis of the above embodiments, the embodiments of the present invention provide a basis for performing multi-source speech signal enhancement by obtaining the sound signals received by each microphone in the distributed microphone array.
Fig. 2 is a schematic diagram of display contents of a display screen in the distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. Fig. 3 is a schematic view of a usage scenario in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. Fig. 4 is a schematic diagram of a signal processing process of a distributed microphone array in the distributed microphone based speech signal enhancement method according to an embodiment of the present invention. Fig. 5 is a schematic diagram of aligning the voice chirp signals in the distributed microphone-based voice signal enhancement method according to an embodiment of the present invention. Fig. 6 is a schematic coarse-grained alignment diagram in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. Fig. 7 is a schematic diagram of fine-grained alignment in a distributed microphone-based speech signal enhancement method according to an embodiment of the present invention. The following describes the distributed microphone-based speech signal enhancement method provided by the embodiment of the present invention in further detail with reference to fig. 2 to 7.
To overcome the inherent drawbacks of conventional centralized microphone arrays, embodiments of the present invention propose a distributed microphone array named ChordMics. ChordMics utilizes distributed beamforming technology to achieve highly controllable multi-source signal enhancement. Fig. 2 shows a schematic diagram of the distribution of ChordMics on a display screen. Unlike existing centralized microphone array technology, the microphone nodes in ChordMics are deployed dispersedly throughout the sound field, which brings abundant spatial diversity and greatly improves the coverage range. In addition, since each microphone node is connected wirelessly (the microphones communicate with the server over a wireless link), the nodes are not constrained by wiring, and ChordMics is highly scalable: the user can arbitrarily increase or decrease the array size and coverage area simply by adding or removing microphones. More importantly, ChordMics can realize highly controllable target signal enhancement, i.e., enhancing the signal of a sound source near any point in the sound field.
The embodiment of the invention aims to realize a distributed microphone system to perform multi-source target signal enhancement. Specifically, the embodiment of the invention deploys a plurality of microphones in the monitoring environment, and realizes signal enhancement and interference elimination for a certain sound source by coherently superposing signals collected by the microphones. With this system, the inspector simply needs to sit remotely in front of the computer display and click on a location in the screen, as shown in fig. 3, and the system will emphasize and play out the audio source near the clicked target area.
The following is a detailed description of the implementation principles of embodiments of the present invention.
Beamforming can make full use of spatial information: it delays and combines multiple signals, enhancing signals from a particular direction and suppressing signals from other directions, thereby achieving speech enhancement. Specifically, as shown in fig. 4, three microphones are placed at equal intervals with spacing d. If the sound source is sufficiently far away, its wavefronts can be regarded as planes and the propagation paths to the microphones can be regarded as approximately parallel. Assuming the angle between the propagation path and the microphone array is θ, the relative delay between two adjacent microphones is τ = d·cos(θ)/c, where c is the speed of sound. By compensating for the relative delay of each microphone signal and summing, the speech signal can be enhanced; that is, the beamformed output is:

$$y(t) = \sum_{m=1}^{M} x_m\big(t + (m-1)\tau\big)$$

where M is the total number of microphones and $x_m$ is the voice signal of the m-th microphone.
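A minimal delay-and-sum sketch of the scheme above, using integer-sample delays for simplicity (a real system would interpolate fractional delays); the geometry and signal are hypothetical:

```python
# Delay-and-sum beamforming sketch (hypothetical geometry and signal).
# Delays are compensated in whole samples for simplicity.
import math

fs = 16000                  # sampling rate, Hz
c = 343.0                   # speed of sound, m/s
d = 0.30                    # microphone spacing, m
theta = math.radians(60.0)  # arrival angle of the plane wave
M = 3                       # number of microphones

tau = d * math.cos(theta) / c   # relative delay between adjacent microphones
shift = round(tau * fs)         # whole-sample equivalent of tau

# Toy source signal and the per-microphone observations x_m(t) = s(t - m*tau).
N = 400
s = [math.sin(2 * math.pi * 440 * n / fs) for n in range(N)]
x = [[s[n - m * shift] if n - m * shift >= 0 else 0.0 for n in range(N)]
     for m in range(M)]

# Beamformed output: advance each channel by its delay, then sum.
y = [sum(x[m][n + m * shift] for m in range(M) if n + m * shift < N)
     for n in range(N)]
```

In the interior of the buffer the three aligned channels add coherently, so the output is three times the source signal.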
From the above, it can be seen that the central problem in achieving multi-microphone enhancement is calculating the time delay of the target signal relative to each microphone so that the signals can be aligned. For a distributed microphone scenario, however, the following practical problems exist: (1) there is a significant clock synchronization problem between nodes, so the signals cannot be aligned using absolute timestamps; (2) there is some measurement or deployment error in the positions of the nodes; and (3) the speed of sound is not strictly fixed and varies with temperature, making it difficult to calculate the relative delays accurately.
To address the above problems, the embodiment of the present invention provides a method combining coarse-grained alignment and fine-grained alignment, which can accurately align the target signals.
(1) Coarse-grained alignment
The first task of ChordMics is to bound the error of the relative delay estimate. The estimation error consists of three parts: time synchronization error, distance measurement error, and the uncertain speed of sound (which varies with air temperature). Without a time synchronization mechanism, time synchronization errors accumulate gradually, making it impossible to determine an upper limit on the estimation error; on the other hand, current mainstream time synchronization mechanisms fall far short of the precision required by ChordMics. To solve this problem, an additional voice chirp signal is introduced as a reference signal, thereby eliminating the time synchronization error. For easier understanding of the design of the embodiments of the present invention, first consider a simple example:
as shown in fig. 5, it is assumed that an additional voice signal source is provided at each target sound source. The additional signal source broadcasts a voice chirp reference signal. Because the target signal and the chirp signal travel around the same ground direction, the relative delays of the two signals to reach the respective microphones are the same. Thus, the chirp signal is mainly aligned, and the target signal can be aligned. The problem of calculating the relative delay is translated into the problem of detecting the chirp signal.
The chirp signal is selected as the reference signal because it is very sensitive to misalignment. Specifically, a chirp is a sinusoidal signal whose frequency varies rapidly and linearly with time. Misalignment of two chirp signals in the time domain causes a sharp drop in their correlation strength, so the cross-correlation function has a very narrow peak, which facilitates accurate alignment of the chirp signals.
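The sensitivity of a chirp to misalignment can be illustrated with a short self-correlation experiment (parameters hypothetical): the normalized correlation is 1 when aligned and collapses after a shift of only a few samples:

```python
# A linear chirp correlated against shifted copies of itself
# (parameters hypothetical).
import math

fs = 8000
N = 2000                        # 0.25 s of samples
f0, f1 = 100.0, 3000.0          # frequency sweep endpoints, Hz
T = N / fs

def chirp(n):
    t = n / fs
    # Linear chirp phase: 2*pi*(f0*t + (f1 - f0)/(2*T) * t^2)
    return math.sin(2 * math.pi * (f0 * t + (f1 - f0) / (2 * T) * t * t))

sig = [chirp(n) for n in range(N)]

def norm_corr(shift):
    # Normalized correlation of the chirp with a shifted copy of itself.
    a = sig[:N - shift]
    b = sig[shift:]
    num = sum(u * v for u, v in zip(a, b))
    den = math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))
    return num / den

aligned = norm_corr(0)          # perfectly aligned
off_by_5 = norm_corr(5)         # misaligned by only 5 samples (625 us)
```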
Now consider the more general case in which the chirp signal is not co-located with the target signal. Without loss of generality, a two-microphone scenario is used to introduce ChordMics (see fig. 6). In this example, the target signal and the chirp signal are located at two different positions, and the two microphone nodes receive the target signal at different times.
Let $x_A(t)$ and $x_B(t)$ denote the speech signals received by microphones A and B, respectively, and let $\tilde{x}_A(t)$ and $\tilde{x}_B(t)$ denote the signals after alignment of the voice chirp signal:

$$\tilde{x}_A(t) = x_A\!\left(t + \frac{index_A}{F_s}\right),\qquad \tilde{x}_B(t) = x_B\!\left(t + \frac{index_B}{F_s}\right) \tag{1}$$

Here $index_A$ and $index_B$ indicate the sampling-point locations of the chirp signal in $x_A(t)$ and $x_B(t)$, respectively, and $F_s$ denotes the sampling rate of the microphones. Let $\tau_{AB}^{C}$ denote the relative delay of microphones A and B with respect to the chirp signal; from fig. 6 it can be observed that $\tau_{AB}^{C} = (index_A - index_B)/F_s$.

It is emphasized that all that is required is to align the target sound source, i.e., to find the relative delay $\tau_{AB}^{S}$ of the microphones with respect to the target sound source. From fig. 6 it can be seen that

$$\Delta_{AB} = \tau_{AB}^{S} - \tau_{AB}^{C} \tag{2}$$

According to the propagation speed of the signal,

$$\tau_{AB}^{S} = \frac{d_A^S - d_B^S}{c} \tag{3}$$

where $c$ denotes the speed of sound. Similarly,

$$\tau_{AB}^{C} = \frac{d_A^C - d_B^C}{c} \tag{4}$$

Substituting formula (3) and formula (4) into formula (2) gives

$$\Delta_{AB} = \frac{\left(d_A^S - d_A^C\right) - \left(d_B^S - d_B^C\right)}{c} \tag{5}$$

Equation (1) illustrates that aligning the chirp signal cancels $\tau_{AB}^{C}$; therefore, as long as $\Delta_{AB}$ can be obtained, $\tau_{AB}^{S}$ can be found.
However, in practice the positions of the microphones and the chirp source are not known accurately, and even the speed of sound varies with temperature, so it is difficult to calculate $\Delta_{AB}$ precisely. Note, however, that even if $\Delta_{AB}$ cannot be calculated accurately, its estimation error can be bounded, which assists the subsequent steps of the embodiments of the present invention in finally determining $\Delta_{AB}$. The error of $\Delta_{AB}$ has the upper bound

$$\varepsilon_{AB} = \frac{4 e_d}{c_{\min}} + \left(\left|d_A^S - d_A^C\right| + \left|d_B^S - d_B^C\right|\right)\left(\frac{1}{c_{\min}} - \frac{1}{c_{\max}}\right) \tag{6}$$

where $e_d$ is the upper bound of the distance measurement error, and $c_{\min}$ and $c_{\max}$ are the minimum and maximum possible values of the speed of sound. It can be estimated that, in a room 20 m in length and width, $\varepsilon_{AB}$ is less than 20 milliseconds.
(2) Fine-grained alignment
The method for accurately determining the relative delay is described below. Consider the speech signals $\tilde{x}_A(t)$ and $\tilde{x}_B(t)$ received by microphones A and B as defined by equation (1). Their cross-correlation function (CCF) is defined as

$$Cor_{AB}(p) = \sum_{t} \tilde{x}_A(t)\, \tilde{x}_B(t + p) \tag{7}$$

Obviously, when $p = \Delta_{AB}$ the two signals are perfectly aligned and $Cor_{AB}(p)$ exhibits a peak. Naturally, $\Delta_{AB}$ can be obtained by

$$\Delta_{AB} = \arg\max_{p \,\in\, \left[\hat{\Delta}_{AB} - \varepsilon_{AB},\; \hat{\Delta}_{AB} + \varepsilon_{AB}\right]} Cor_{AB}(p) \tag{8}$$

Here $\hat{\Delta}_{AB}$ is the rough estimate obtained by substituting the coarse distances and speed of sound into the calculation formula for $\Delta_{AB}$, and $\varepsilon_{AB}$ is the maximum estimation error of $\Delta_{AB}$. The interval $[\hat{\Delta}_{AB} - \varepsilon_{AB}, \hat{\Delta}_{AB} + \varepsilon_{AB}]$ is called the estimation window of $\Delta_{AB}$, also referred to as the maximum error window.
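The fine-grained step — maximizing the CCF only within the estimation window around the coarse estimate — can be sketched on synthetic data as follows (all signals and window values hypothetical):

```python
# Fine-grained delay search restricted to the estimation window
# (synthetic burst signal; all numbers hypothetical).
import math

N = 1000
fs = 8000
true_shift = 12         # true delay difference between the two channels, samples

# A short windowed tone standing in for the target signal.
s = [math.sin(2 * math.pi * 700 * n / fs) * math.exp(-((n - 300) / 60.0) ** 2)
     for n in range(N)]
xA = s
xB = [s[n - true_shift] if n >= true_shift else 0.0 for n in range(N)]

def ccf(p):
    # Cross-correlation of xA and xB at non-negative lag p.
    return sum(xA[n] * xB[n + p] for n in range(N - p))

coarse, eps = 10, 6     # coarse estimate and window half-width, samples
window = range(coarse - eps, coarse + eps + 1)
best = max(window, key=ccf)     # peak restricted to the estimation window
```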
However, in an actual sound field there are multiple sound sources, and there may be multiple peaks within the maximum error window, making it difficult for ChordMics to directly determine which peak corresponds to the target signal. Specifically, assume there are two sound sources, a target sound source $s_D(t)$ and an interfering sound source $s_I(t)$. The signal received by a microphone $\omega \in \Omega = \{A, B\}$ can be expressed as

$$x_\omega(t) = \alpha_\omega^D\, s_D\!\left(t - \tau_\omega^D\right) + \alpha_\omega^I\, s_I\!\left(t - \tau_\omega^I\right) + n_\omega(t) \tag{9}$$

where $\alpha_\omega^D$ and $\alpha_\omega^I$ represent the attenuation coefficients of the two signals, $\tau_\omega^D$ and $\tau_\omega^I$ represent the propagation delays from the two sources to microphone $\omega$, and $n_\omega(t)$ represents noise.
The CCF between the signals $x_A(t)$ and $x_B(t)$ is (with the noise-related terms omitted)

$$Cor_{AB}(p) = \alpha_A^D \alpha_B^D\, Cor_{s_D}\!\left(p - \tau_{AB}^{D}\right) + \alpha_A^I \alpha_B^I\, Cor_{s_I}\!\left(p - \tau_{AB}^{I}\right) \tag{10}$$

and the CCF of the aligned signals is

$$\widetilde{Cor}_{AB}(p) = \alpha_A^D \alpha_B^D\, Cor_{s_D}\!\left(p - \Delta_{AB}^{D}\right) + \alpha_A^I \alpha_B^I\, Cor_{s_I}\!\left(p - \Delta_{AB}^{I}\right) \tag{11}$$

From the above formula it can be seen that $\widetilde{Cor}_{AB}$ exhibits peaks at the two locations $\Delta_{AB}^{D}$ and $\Delta_{AB}^{I}$. Obviously, $\Delta_{AB}^{D}$ falls into its estimation window, but $\Delta_{AB}^{I}$ may fall into that window as well, so it is difficult to directly judge which peak in $\widetilde{Cor}_{AB}$ is the peak of the target signal. Here $Cor_{s_D}$ denotes the autocorrelation function of $s_D(t)$, and $Cor_{s_I}$ denotes the autocorrelation function of $s_I(t)$.
To solve this problem, the embodiment of the present invention proposes a method of successive disambiguation. The method fully exploits the spatial diversity of the distributed microphones and determines the peak position of the target signal in an iterative manner. Specifically, taking fig. 7 as an example, suppose the target signal reaches microphones A, B, and C at times $t_A^S$, $t_B^S$, and $t_C^S$, respectively. The relative delay of microphones A and B with respect to the target signal is then $\tau_{AB}^{S} = t_A^S - t_B^S$, and it further follows that

$$\tau_{AB}^{S} = \tau_{AC}^{S} - \tau_{BC}^{S} \tag{12}$$

Similarly, for the relative delays with respect to the chirp signal,

$$\tau_{AB}^{C} = \tau_{AC}^{C} - \tau_{BC}^{C} \tag{13}$$

Subtracting formula (13) from formula (12) gives

$$\Delta_{AB} = \Delta_{AC} - \Delta_{BC} \tag{14}$$
The above formula reveals a very important relationship between the relative delays (the formula carries no superscript, meaning that it applies to all sources). Using this relationship, the target signal can be determined: as long as the CCF of one pair of microphones has only a single peak in its maximum error window, the peaks corresponding to the target signal between the other microphone pairs can be found iteratively. Fig. 7 gives a specific example (in which each CCF has been normalized by the corresponding coarse estimate of the target signal): looking at the CCF of microphones B and C, it can be seen that there is only one peak within the maximum error window (the delay time difference estimation window). Since the alignment procedure guarantees that the peak of the target signal must lie within the maximum error window, it can be concluded that this peak in the maximum error window of $\widetilde{Cor}_{BC}$ is the peak of the target signal, and the exact value of $\Delta_{BC}$ can thus be determined. Further, according to formula (14), $\Delta_{AB}$ and $\Delta_{AC}$ can be determined: among the candidate peaks of $\widetilde{Cor}_{AB}$ and $\widetilde{Cor}_{AC}$, only the pair of peaks satisfying formula (14) can be determined as $\Delta_{AB}$ and $\Delta_{AC}$.
this process can be extended for multiple microphones as well. Fig. 7 is only an example for explaining the calculation process of the delay time difference, and it should be noted that the delay time difference estimation windows corresponding to the cross-correlation functions for two different microphones are not necessarily the same.
In summary, the method of successive disambiguation is summarized as follows:
1. calculating CCF of every two microphone signals;
2. find a CCF that has only one peak within its maximum error window, and determine that peak as a target peak;
3. the peaks of the other target signals are iteratively found by equation (14).
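The three steps above can be sketched with a toy example (all peak lags hypothetical): the pair with a unique peak seeds the iteration, and the remaining pairs are resolved by the constraint of equation (14):

```python
# Toy successive-disambiguation example (all peak lags hypothetical, in ms).
peaks = {
    ("B", "C"): [1.5],          # unique peak -> taken as the target's delta_BC
    ("A", "C"): [0.9, 3.2],     # ambiguous: two peaks in the window
    ("A", "B"): [-1.1, 1.7],    # ambiguous: two peaks in the window
}

delta_BC = peaks[("B", "C")][0]

# Keep only the (A,B)/(A,C) peak combination consistent with
# delta_AB = delta_AC - delta_BC.
tol = 1e-6
resolved = None
for d_AC in peaks[("A", "C")]:
    for d_AB in peaks[("A", "B")]:
        if abs(d_AB - (d_AC - delta_BC)) < tol:
            resolved = (d_AB, d_AC)
```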
The hardware for realizing ChordMics mainly comprises wireless nodes, microphone sensors, and a server. In one embodiment, 6 Raspberry Pi boards (Raspberry Pi 3 Model B+) equipped with WiFi modules are used as wireless nodes. Each Raspberry Pi is connected to two microphones via USB interfaces (12 microphones in total). The 12 microphones are randomly distributed in a 10 m × 12 m room. A plurality of JBL loudspeakers serve as the target sound source and interfering sound sources. All microphones and loudspeakers are commercially available, inexpensive devices. Each Raspberry Pi streams the signals collected by its microphones to the server, and all signal detection, alignment, and enhancement is performed at the server.
On the one hand, the ChordMics system introduces an additional chirp voice signal to realize clock synchronization among distributed nodes; by referring to the chirp signal, ChordMics can eliminate clock errors among the nodes. On the other hand, by calculating the relative delays of the signals received by each microphone and exploiting the geometric diversity of the array, ChordMics can accurately find the relative delay of the target sound source between the microphones, so that the voice signals of the microphones can be precisely aligned and coherently superposed, thereby realizing enhancement of the target sound source and elimination of interference.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention. As shown in fig. 8, the server 1 includes a target sound source determining module 10, a voice chirp signal aligning module 20, a first delay time difference obtaining module 30, a second delay time difference obtaining module 40, a target sound source delay time obtaining module 50, and a target sound source voice signal aligning enhancing module 60, wherein: the target audio source determination module 10 is configured to: determining a target sound source to be subjected to voice signal enhancement; the voice chirp signal alignment module 20 is configured to: aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance; the first delay time difference obtaining module 30 is configured to: acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window; the second delay time difference obtaining module 40 is configured to: iteratively acquiring the delay time differences of other two microphones according to a relation of the delay time differences 
corresponding to the two microphones in the three microphones and peak information in the delay time difference estimation window on the basis of the determined delay time differences of the two microphones; the target sound source delay time acquisition module 50 is configured to: acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference; the target source speech signal alignment enhancement module 60 is configured to: and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting the deployment mode of the distributed microphone array, realizes the clock synchronization of the distributed microphone array by utilizing the voice chirp signal assistance, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
Further, based on the above embodiment, the target sound source determining module 10 is specifically configured to: determine, according to the click position on the display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
On the basis of the above embodiment, according to the embodiment of the present invention, the sound source corresponding to the sound source icon closest to the click position is determined as the target sound source according to the click position on the display screen, so that convenience in obtaining the target sound source is improved.
Further, based on the above embodiment, the server further includes a sound signal obtaining module, where the sound signal obtaining module is configured to: acquire the sound signals received by each microphone in the distributed microphone array.
On the basis of the above embodiments, the embodiments of the present invention provide a basis for performing multi-source speech signal enhancement by obtaining the sound signals received by each microphone in the distributed microphone array.
Fig. 9 is a schematic structural diagram of a distributed microphone-based speech signal enhancement system according to an embodiment of the present invention. As shown in fig. 9, the system includes: the system comprises a wireless node 2, a distributed microphone array 3, at least one sound source 4, a chirp voice signal source 5 and a server 1; wherein the wireless node 2 is connected to at least one microphone of the microphone array 3 for transmitting sound signals received by the connected microphone to the server 1.
The embodiment of the invention overcomes the defects of the existing centralized microphone array by adopting the deployment mode of the distributed microphone array, realizes the clock synchronization of the distributed microphone array by utilizing the voice chirp signal assistance, and effectively realizes the alignment of the voice signals of the distributed microphone array and the signal enhancement of the target sound source.
The device provided by the embodiment of the present invention is used for the method, and specific functions may refer to the above method flow, which is not described herein again.
Fig. 10 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 10, the electronic device may include: a processor (processor)1010, a communication Interface (Communications Interface)1020, a memory (memory)1030, and a communication bus 1040, wherein the processor 1010, the communication Interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may call logic instructions in memory 1030 to perform the following method: determining a target sound source to be subjected to voice signal enhancement; aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance; acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window; iteratively acquiring the delay time differences of other two microphones according to a relation of the delay time differences corresponding to the two microphones in the three microphones and peak information in the delay time difference estimation window on the basis of the determined delay time differences 
of the two microphones; acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
Furthermore, the logic instructions in the memory 1030 can be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided by the foregoing embodiments, for example including: determining a target sound source for which voice signal enhancement is to be performed; aligning the voice chirp signals in the sound signals received by every two microphones in a distributed microphone array, and then computing a cross-correlation function of the sound signals received by the two microphones, where the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance; acquiring peak information of each cross-correlation function within its corresponding delay time difference estimation window, where if any delay time difference estimation window contains a unique peak, that peak corresponds to the delay time difference of the two microphones; the delay time difference is the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source, and the delay time difference estimation window is the window corresponding to the estimation range of the delay time difference, within which the delay time difference corresponds to a peak; iteratively acquiring the delay time differences of the remaining microphone pairs, on the basis of the delay time differences already determined, according to the relation among the delay time differences of every two microphones in any three microphones and the peak information in the corresponding delay time difference estimation windows; acquiring the first delay time of every two microphones in the distributed microphone array with respect to the voice chirp signal, and acquiring the second delay time of the two microphones with respect to the target sound source based on the first delay time and the delay time difference; and aligning and enhancing the sound received by each microphone according to the second delay times between every two microphones of the distributed microphone array.
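As an illustrative sketch (not the patented implementation), the core step above of locating the cross-correlation peak of two chirp-aligned microphone signals inside a bounded estimation window can be written as follows; all function and parameter names are ours:

```python
import numpy as np

def delay_difference(sig_a, sig_b, fs, window_s):
    """Estimate the delay between two microphone signals (already aligned
    on the chirp) by locating the cross-correlation peak inside a bounded
    delay-difference estimation window of +/- window_s seconds."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    # Lag axis matching numpy's full-mode correlation output.
    lags = np.arange(-len(sig_b) + 1, len(sig_a))
    # Restrict the peak search to the estimation window.
    mask = np.abs(lags) <= int(window_s * fs)
    peak_lag = lags[mask][np.argmax(corr[mask])]
    return peak_lag / fs
```

A delayed copy of a noise signal recovers its own delay, which is the sanity check the estimation window is designed to pass.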
The above-described apparatus embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which one of ordinary skill in the art can understand and implement without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for distributed microphone-based speech signal enhancement, comprising:
determining a target sound source to be subjected to voice signal enhancement;
aligning the voice chirp signals in the sound signals received by every two microphones in a distributed microphone array, and then computing a cross-correlation function of the sound signals received by the two microphones; wherein the voice chirp signal is emitted by a chirp voice signal source placed in the sound field in advance;
acquiring peak information of each cross-correlation function within its corresponding delay time difference estimation window, wherein if any delay time difference estimation window contains a unique peak, the unique peak corresponds to the delay time difference of the corresponding two microphones; wherein the delay time difference is the difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; and the delay time difference estimation window is the window corresponding to the estimation range of the delay time difference, within which the delay time difference corresponds to a peak;
iteratively acquiring the delay time differences of the remaining microphone pairs, on the basis of the delay time differences already determined, according to the relation among the delay time differences of every two microphones in any three microphones and the peak information in the corresponding delay time difference estimation windows;
acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference;
and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
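The final step of claim 1, aligning the microphone signals by their pairwise second delay times and summing them, corresponds to classic delay-and-sum processing; a minimal integer-sample sketch under our own naming, not the patented implementation:

```python
import numpy as np

def align_and_sum(signals, delays_s, fs):
    """Delay-and-sum enhancement: shift each microphone signal by its
    estimated delay (seconds) relative to the earliest microphone and
    average the overlapping portions. Integer-sample shifts only."""
    shifts = [int(round(d * fs)) for d in delays_s]
    base = min(shifts)
    shifts = [s - base for s in shifts]  # make all shifts non-negative
    n = min(len(sig) - s for sig, s in zip(signals, shifts))
    return np.mean([sig[s:s + n] for sig, s in zip(signals, shifts)], axis=0)
```

With sub-sample delays, fractional-delay interpolation would replace the integer shift, but the alignment principle is the same.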
2. The distributed microphone based speech signal enhancement method of claim 1, wherein the delay time differences for every two of the three microphones are expressed by the following relation:
$\Delta\tau_{AB} + \Delta\tau_{BC} = \Delta\tau_{AC}$
wherein $\Delta\tau_{AB}$ represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source; $\Delta\tau_{AC}$ represents the difference between the delay times of microphone A and microphone C with respect to the voice chirp signal and with respect to the target sound source; and $\Delta\tau_{BC}$ represents the difference between the delay times of microphone B and microphone C with respect to the voice chirp signal and with respect to the target sound source.
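By the definitions in claim 1, the pairwise delay time differences satisfy a triangle identity (the difference for pair A,C equals the sum of the differences for A,B and B,C), which lets an estimation window with several candidate peaks be disambiguated once two of the three pairs are determined; a hedged sketch with illustrative names:

```python
def propagate_delay_difference(d_ab, d_ac, candidate_peaks):
    """Given the delay time differences for pairs (A, B) and (A, C),
    the triangle relation d_ab + d_bc = d_ac predicts d_bc; among the
    candidate correlation-peak lags for pair (B, C), pick the one
    closest to that prediction."""
    predicted = d_ac - d_ab
    return min(candidate_peaks, key=lambda p: abs(p - predicted))
```

Iterating this over microphone triples is one way the delay time differences of the remaining pairs can be acquired from the already-determined ones.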
3. The distributed microphone based speech signal enhancement method of claim 2, wherein the expression of the difference between the delay times of any two microphones with respect to the speech chirp signal and with respect to the target sound source is:
$\Delta\tau_{AB} = \frac{\left(d_A^{chirp} - d_B^{chirp}\right) - \left(d_A^{src} - d_B^{src}\right)}{c}$
wherein $\Delta\tau_{AB}$ represents the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source; $d_A^{chirp}$ represents the distance between microphone A and the chirp voice signal source; $d_A^{src}$ represents the distance between microphone A and the target sound source; $d_B^{chirp}$ represents the distance between microphone B and the chirp voice signal source; $d_B^{src}$ represents the distance between microphone B and the target sound source; and $c$ represents the speed of sound.
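Under the definitions of claim 3, the delay time difference follows directly from the four microphone-to-source distances and a nominal speed of sound; the variable names below are ours:

```python
def delay_difference_from_distances(dA_chirp, dB_chirp, dA_src, dB_src, c=343.0):
    """Difference between the chirp TDOA and the target-source TDOA of
    microphones A and B, computed from the four mic-to-source distances
    (meters) and the speed of sound c (m/s)."""
    return ((dA_chirp - dB_chirp) - (dA_src - dB_src)) / c
```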
4. The distributed microphone based speech signal enhancement method of claim 3, wherein the expression of the absolute value of the error of the difference of the delay times of any two microphones with respect to the speech chirp signal and with respect to the target sound source is:
$\left|e_{\Delta\tau_{AB}}\right| \le \frac{4\,e_d}{c_{min}} + \left(\frac{1}{c_{min}} - \frac{1}{c_{max}}\right)\left|\left(d_A^{chirp} - d_B^{chirp}\right) - \left(d_A^{src} - d_B^{src}\right)\right|$
wherein $e_{\Delta\tau_{AB}}$ represents the error of the difference between the delay times of microphone A and microphone B with respect to the voice chirp signal and with respect to the target sound source, $e_d$ represents an upper bound of the distance measurement error, and $c_{min}$ and $c_{max}$ respectively represent the minimum and maximum values of the speed of sound.
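A worst-case bound of this kind can be derived by assuming each of the four distance measurements errs by at most e_d and that the true speed of sound lies in [c_min, c_max]; the expression below is our reconstruction and the exact patented formula may differ:

```python
def delay_difference_error_bound(distances, e_d, c_min, c_max):
    """Worst-case error of the distance-based delay time difference:
    4*e_d of accumulated distance error divided by the slowest plausible
    sound speed, plus the spread caused by sound-speed uncertainty.
    distances = (dA_chirp, dB_chirp, dA_src, dB_src), all in meters."""
    dA_c, dB_c, dA_s, dB_s = distances
    numerator = abs((dA_c - dB_c) - (dA_s - dB_s))
    return 4 * e_d / c_min + numerator * (1 / c_min - 1 / c_max)
```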
5. The distributed microphone-based speech signal enhancement method of claim 1, wherein the determining a target sound source for which speech signal enhancement is to be performed comprises:
determining, according to a click position on a display screen, the sound source corresponding to the sound source icon closest to the click position as the target sound source.
6. The distributed microphone-based speech signal enhancement method of claim 5, wherein prior to determining the target sound source for which voice signal enhancement is to be performed, the method further comprises: acquiring the sound signals received by all microphones in the distributed microphone array.
7. A server, comprising:
a target audio source determination module to: determining a target sound source to be subjected to voice signal enhancement;
a voice chirp signal alignment module configured to: aligning voice chirp signals of sound signals received by every two microphones in a distributed microphone array, and then solving a cross-correlation function of the sound signals received by every two microphones; the voice chirp signal is sent by a chirp voice signal source which is put into a sound field in advance;
a first delay time difference acquisition module configured to: acquiring peak value information of each cross-correlation function in a corresponding delay time difference estimation window, wherein if any delay time difference estimation window has a unique peak value, the unique peak value corresponds to the delay time difference of the two microphones; wherein the delay time difference is a difference between a first delay time of the two microphones with respect to the voice chirp signal and a second delay time with respect to the target sound source; the delay time difference estimation window is a window corresponding to the estimation range of the delay time difference, and the delay time difference corresponds to a peak value in the delay time difference estimation window;
a second delay time difference obtaining module, configured to: iteratively acquire the delay time differences of the remaining microphone pairs, on the basis of the delay time differences already determined, according to the relation among the delay time differences of every two microphones in any three microphones and the peak information in the corresponding delay time difference estimation windows;
a target audio source delay time acquisition module for: acquiring the first delay time of each two microphones in the distributed microphone array relative to the voice chirp signal, and acquiring the second delay time of each two microphones relative to the target sound source based on the first delay time and the delay time difference;
a target source speech signal alignment enhancement module for: and aligning and enhancing the sound received by each microphone according to the second delay time between every two microphones of the distributed microphone array.
8. A voice signal enhancement system based on distributed microphones is characterized by comprising a wireless node, a distributed microphone array, at least one sound source, a chirp voice signal source and a server; wherein the wireless node is connected with at least one microphone in the microphone array and is used for transmitting the sound signals received by the connected microphone to the server.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the distributed microphone based speech signal enhancement method according to any of claims 1 to 6 when executing the computer program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the distributed microphone based speech signal enhancement method according to any one of claims 1 to 6.
CN201911032121.3A 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone Active CN112735459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911032121.3A CN112735459B (en) 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone


Publications (2)

Publication Number Publication Date
CN112735459A true CN112735459A (en) 2021-04-30
CN112735459B CN112735459B (en) 2024-03-26

Family

ID=75588832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911032121.3A Active CN112735459B (en) 2019-10-28 2019-10-28 Voice signal enhancement method, server and system based on distributed microphone

Country Status (1)

Country Link
CN (1) CN112735459B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102347028A (en) * 2011-07-14 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
CN102800325A (en) * 2012-08-31 2012-11-28 厦门大学 Ultrasonic-assisted microphone array speech enhancement device
JP2014174393A (en) * 2013-03-11 2014-09-22 Research Organization Of Information & Systems Apparatus and method for voice signal processing
US9208794B1 (en) * 2013-08-07 2015-12-08 The Intellisis Corporation Providing sound models of an input signal using continuous and/or linear fitting
CN107017003A (en) * 2017-06-02 2017-08-04 厦门大学 A kind of microphone array far field speech sound enhancement device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Lin Yan, Wang Xiutan, Peng Yingning, Xu Jia, Zhang, Xia Xianggen: "Maximum likelihood parameter estimation of chirp signals based on MCMC", Journal of Tsinghua University (Science and Technology), no. 04, pages 511 - 514 *
Wu Mingqin; Yu Fengqin: "Time-frequency structure analysis of speech signals based on chirp atom decomposition", Journal of Jiangnan University (Natural Science Edition), no. 06, pages 685 - 687 *
Wu Mingqin; Yu Fengqin; Han: "A speech enhancement method based on chirp atom decomposition", Microelectronics & Computer, no. 12, pages 74 - 76 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409817A (en) * 2021-06-24 2021-09-17 浙江松会科技有限公司 Audio signal real-time tracking comparison method based on voiceprint technology
CN113409817B (en) * 2021-06-24 2022-05-13 浙江松会科技有限公司 Audio signal real-time tracking comparison method based on voiceprint technology
WO2023206686A1 (en) * 2022-04-29 2023-11-02 青岛海尔科技有限公司 Control method for smart device, and storage medium and electronic apparatus

Also Published As

Publication number Publication date
CN112735459B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN104429100A (en) Systems and methods for surround sound echo reduction
EP2810453B1 (en) Audio source position estimation
KR101415026B1 (en) Method and apparatus for acquiring the multi-channel sound with a microphone array
Ono et al. Blind alignment of asynchronously recorded signals for distributed microphone array
EP1600791B1 (en) Sound source localization based on binaural signals
EP2976898B1 (en) Method and apparatus for determining a position of a microphone
JP2008236077A (en) Target sound extracting apparatus, target sound extracting program
KR20210091034A (en) Multiple-source tracking and voice activity detections for planar microphone arrays
JP6741004B2 (en) Sound source position detecting device, sound source position detecting method, sound source position detecting program, and storage medium
CN112735459A (en) Voice signal enhancement method, server and system based on distributed microphones
CN102428717A (en) A system and method for estimating the direction of arrival of a sound
CN103561387A (en) Indoor positioning method and system based on TDoA
US20150172842A1 (en) Sound processing apparatus, sound processing method, and sound processing program
KR20140126788A (en) Position estimation system using an audio-embedded time-synchronization signal and position estimation method using thereof
JP2010212818A (en) Method of processing multi-channel signals received by a plurality of microphones
CN114788302B (en) Signal processing device, method and system
EP3182734B1 (en) Method for using a mobile device equipped with at least two microphones for determining the direction of loudspeakers in a setup of a surround sound system
KR100730297B1 (en) Sound source localization method using Head Related Transfer Function database
KR20110109620A (en) Microphone module, apparatus for measuring location of sound source using the module and method thereof
JP2007017415A (en) Method of measuring time difference of impulse response
EP2214420A1 (en) Sound emission and collection device
CN111505583B (en) Sound source positioning method, device, equipment and readable storage medium
JP6433630B2 (en) Noise removing device, echo canceling device, abnormal sound detecting device, and noise removing method
KR20190013264A (en) Location determination system and method of smart device using non-audible sound wave
KR20160127259A (en) Configuration method of planar array sensor for underwater sound detection and underwater sound measurement system using thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant