WO2020037555A1 - Method, device, apparatus and system for evaluating the consistency of a microphone array


Info

Publication number
WO2020037555A1
Authority
WO
WIPO (PCT)
Prior art keywords
microphone
microphones
signal
reference microphone
difference
Application number
PCT/CN2018/101766
Other languages
English (en)
Chinese (zh)
Inventor
李国梁
罗朝洪
程树青
Original Assignee
深圳市汇顶科技股份有限公司
Application filed by 深圳市汇顶科技股份有限公司
Priority to PCT/CN2018/101766 priority Critical patent/WO2020037555A1/fr
Priority to CN202310466643.4A priority patent/CN116437280A/zh
Priority to CN201880001199.6A priority patent/CN109313909B/zh
Publication of WO2020037555A1 publication Critical patent/WO2020037555A1/fr


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R29/00 - Monitoring arrangements; Testing arrangements
    • H04R29/001 - Monitoring arrangements; Testing arrangements for loudspeakers

Definitions

  • the present application relates to the field of voice communication and intelligent voice interaction, and more particularly, to a method, a device, an apparatus, and a system for evaluating the consistency of a microphone array.
  • in voice communication, speech enhancement technology can improve people's hearing experience and the intelligibility of voice communication.
  • in intelligent voice interaction applications, speech enhancement technology can improve the accuracy of speech recognition and enhance the user experience. Therefore, speech enhancement technology is vital in both traditional voice communication and voice interaction.
  • the speech enhancement technology is divided into single-channel speech enhancement technology and multi-channel speech enhancement technology.
  • single-channel speech enhancement technology can eliminate steady-state noise but cannot eliminate non-steady-state noise, and its improvement of the signal-to-noise ratio comes at the expense of speech damage: the more the signal-to-noise ratio is increased, the greater the speech damage. Multi-channel speech enhancement technology uses a microphone array to collect multiple signals and uses the phase and coherence information between the multiple microphone signals to eliminate noise; it can eliminate non-steady-state noise and causes little speech damage.
  • the consistency between different microphones in the microphone array directly affects the performance of the algorithm.
  • the existing scheme proposes an improved algorithm for the multi-channel enhancement technology, which increases the robustness of the algorithm and at the same time reduces its requirement on microphone consistency. However, when the consistency between microphones is very low, the algorithm performance will still be affected, which degrades the user experience.
  • the present application provides a method, device, apparatus and system for evaluating the consistency of a microphone array, which can evaluate the consistency between different microphones in the microphone array, thereby guiding the calibration of the microphone array and evaluating the robustness of the multi-channel enhancement algorithm based on the consistency evaluation result, improving the user experience.
  • a method for assessing the consistency of a microphone array including:
  • a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined, the reference microphone being any one of the N microphones;
  • a consistency evaluation is performed on the N microphones.
  • the consistency evaluation of the N microphones can be used to guide the microphone distribution in the microphone array, to redesign the microphone distribution in the microphone array, to redesign the microphone array, or to evaluate the robustness of the multi-channel enhancement algorithm.
  • the distribution of microphone 1 or microphone 2 in the microphone array can be guided, or the microphone 1 or microphone 2 can be redesigned.
  • the distribution of the microphone 1 in the microphone array can be guided, or the microphone 1 can be redesigned, or the microphone array can be redesigned.
  • a phase spectrum difference and/or a power spectrum difference between each microphone and a reference microphone is determined according to the N audio signals collected by the N microphones, so as to perform a consistency evaluation on the N microphones, eliminate the impact of inter-microphone consistency on multi-channel speech enhancement algorithms, and improve the user experience.
  • performing a consistency evaluation on the N microphones according to a phase spectrum difference value between each of the N microphones except the reference microphone and the reference microphone includes:
  • the phase consistency between the corresponding microphone and the reference microphone is evaluated.
  • the phase spectrum difference between the microphone 1 and the reference microphone is A, and the smaller A is, the better the phase consistency between the microphone 1 and the reference microphone is.
  • a threshold can be set. If the phase spectrum difference between two microphones is smaller than this threshold, the phase consistency between the two microphones meets the design requirements, and the effect of the consistency between the two microphones on the multi-channel speech enhancement algorithm can be ignored, or the consistency between the two microphones has no effect on the multi-channel speech enhancement algorithm.
  • thresholds can be flexibly configured according to different multi-channel speech enhancement algorithms.
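The threshold check described above can be sketched as follows; the function name, the threshold value, and the example difference curves are illustrative assumptions, not values from the application:

```python
import numpy as np

def phase_consistency_ok(pdiff, threshold):
    """Return True if the phase spectrum difference curve stays below the
    algorithm-dependent threshold at every frequency point."""
    return bool(np.max(np.abs(pdiff)) < threshold)

# Hypothetical difference curves (radians) for two microphones measured
# against the reference microphone, with an assumed 0.1 rad threshold.
good = phase_consistency_ok(np.array([0.01, -0.05, 0.02]), threshold=0.1)
bad = phase_consistency_ok(np.array([0.30, -0.10, 0.05]), threshold=0.1)
```

The same check applies to power spectrum differences, with a threshold chosen per multi-channel enhancement algorithm.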
  • the method further includes:
  • phase spectrum difference values are calibrated respectively.
  • for example, if the fixed phase difference between microphone 1 and the reference microphone is A, and the phase spectrum difference between microphone 1 and the reference microphone is B, then the calibrated phase spectrum difference C between microphone 1 and the reference microphone is C = B - A.
  • the calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone according to the measured distance includes:
  • Y i ( ⁇ ) represents the frequency spectrum of the i-th microphone
  • Y 1 ( ⁇ ) represents the frequency spectrum reference microphone
  • represents the frequency
  • d i represents the distance from the i-th microphone and reference microphone to the sound source of the difference
  • c denotes the speed of sound
  • 2 ⁇ d i / c represents a fixed phase difference between the i-th microphone and the reference microphone.
  • performing a consistency evaluation on the N microphones according to a phase spectrum difference value between each of the N microphones except the reference microphone and the reference microphone includes:
  • the amplitude consistency between the corresponding microphone and the reference microphone is evaluated.
  • the power spectrum difference between the microphone 1 and the reference microphone is A, and the smaller the A, the better the amplitude consistency between the microphone 1 and the reference microphone.
  • a threshold can be set. If the power spectrum difference between two microphones is smaller than this threshold, the amplitude consistency between the two microphones meets the design requirements, and the effect of the consistency between the two microphones on the multi-channel speech enhancement algorithm can be ignored, or the consistency between the two microphones has no effect on the multi-channel speech enhancement algorithm.
  • thresholds can be flexibly configured according to different multi-channel speech enhancement algorithms.
  • the N audio signals are signals collected in an environment in which the frequency-sweep signal data is played.
  • the N audio signals are signals collected in an environment where Gaussian white noise data or frequency-sweep signal data is played.
  • the frequency sweep signal is any one of a linear frequency sweep signal, a logarithmic frequency sweep signal, a linear step frequency sweep signal, and a logarithmic step frequency sweep signal.
  • determining a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone includes:
  • a phase spectrum difference value and / or a power spectrum difference value between each of the N microphones except the reference microphone and the reference microphone are determined.
  • K represents the total number of frames of the signal collected by each microphone.
  • each of the K signal frames may be processed by adding a Hamming window.
  • any two adjacent signal frames in the K signal frames overlap by R%, and R> 0.
  • R is 25 or 50.
  • the signal amplitude remains unchanged after overlapping and windowing.
  • each frame of the signal after the overlap has a component of the previous frame to prevent discontinuity between the two frames.
  • x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T
  • x i (t) represents the i-th audio signal
  • K represents the total number of frames collected by each microphone
  • [] T represents the transpose of a vector or a matrix.
  • the determining, according to the K target signal frames corresponding to each audio signal, the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone includes:
  • imag(·) means taking the imaginary part
  • ln(·) means taking the natural logarithm (for a spectrum Y, imag(ln(Y)) is its phase)
  • Y_{i,j} represents the j-th target signal frame of the i-th microphone, and ω_j indicates the main frequency of that frame.
  • the determining, according to the K target signal frames corresponding to each audio signal, the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone includes:
  • a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined.
  • determining the power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal includes:
  • P i ( ⁇ ) represents the power spectrum of the i-th audio signal
  • Yi j ( ⁇ ) represents the j-th target signal frame in the i-th audio signal
  • K represents the total frame of the signal received by each microphone Number
  • represents frequency
  • the determining a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the power spectrum of each audio signal includes:
  • PD i ( ⁇ ) represents the power spectrum difference between the i-th microphone and the reference microphone
  • P 1 ( ⁇ ) represents the power spectrum of the reference microphone
  • P i ( ⁇ ) represents the power spectrum of the i-th microphone
  • the acquiring the N audio signals respectively acquired by the N microphones includes:
  • the frequency sweep signal data is composed of M + 1 segments of equal length and different frequencies.
  • the number of FFT points N_fft is an even number, generally 32, 64, 128, ..., 1024, etc.; the larger the number of points, the greater the amount of calculation.
  • f i represents the frequency of the i-th stage signal
  • F s represents the sampling frequency
  • N fft represents the number of FFT points
  • S i (t) represents the i-th stage signal
  • the frequency sweep signal data played by the speaker can be written in the following vector form:
  • S (t) represents the frequency sweep signal data played by the speaker
  • S i (t) represents the i-th segment signal
  • [] T represents the transpose of a vector or matrix.
  • the N microphones respectively collect N audio signals
  • the audio signal collected by the i-th microphone is represented as x i (t)
  • x i (t) can be written as the following vector form:
  • x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T
  • x i (t) represents the audio signal collected by the i-th microphone
  • K represents the total number of frames of the signal collected by each microphone
  • [] T represents the transpose of the vector or matrix.
  • the acquiring the N audio signals respectively acquired by the N microphones includes:
  • placing the N microphones in a test room in which a speaker is arranged, the N microphones being located directly in front of the speaker;
  • the test room has an anechoic room environment
  • the speaker is an artificial mouth dedicated for audio testing
  • the artificial mouth is calibrated with a standard microphone before use.
  • before controlling the speaker to play Gaussian white noise data or frequency sweep signal data, the method further includes:
  • a device for evaluating the consistency of a microphone array including:
  • An obtaining unit configured to obtain N audio signals respectively collected by N microphones, where the N microphones form a microphone array, and N ⁇ 2;
  • a processing unit configured to determine a phase spectrum difference and / or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the N audio signals, and
  • the reference microphone is any one of the N microphones;
  • the processing unit is further configured to perform a consistency evaluation on the N microphones based on a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
  • the processing unit is specifically configured to:
  • the phase consistency between the corresponding microphone and the reference microphone is evaluated.
  • the processing unit is further configured to:
  • the corresponding phase spectrum difference values thereof are respectively calibrated.
  • the processing unit is specifically configured to:
  • Y i ( ⁇ ) represents the frequency spectrum of the i-th microphone
  • Y 1 ( ⁇ ) represents the frequency spectrum reference microphone
  • represents the frequency
  • d i represents the distance from the i-th microphone and reference microphone to the sound source of the difference
  • c denotes the speed of sound
  • 2 ⁇ d i / c represents a fixed phase difference between the i-th microphone and the reference microphone.
  • the processing unit is specifically configured to:
  • An amplitude consistency between a corresponding microphone and the reference microphone is evaluated according to a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
  • the N audio signals are signals collected in an environment in which the frequency sweep signal data is played.
  • the N audio signals are signals collected in an environment in which Gaussian white noise data or frequency-sweep signal data is played.
  • the frequency sweep signal is any one of a linear frequency sweep signal, a logarithmic frequency sweep signal, a linear step frequency sweep signal, and a logarithmic step frequency sweep signal.
  • the processing unit is specifically configured to:
  • any two adjacent signal frames in the K signal frames overlap by R%, and R> 0.
  • the R is 25 or 50.
  • x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T
  • x i (t) represents the i-th audio signal
  • K represents the total number of frames collected by each microphone
  • [] T represents the transpose of a vector or a matrix.
  • the processing unit is specifically configured to:
  • imag(·) means taking the imaginary part
  • ln(·) means taking the natural logarithm (for a spectrum Y, imag(ln(Y)) is its phase)
  • Y_{i,j} represents the j-th target signal frame of the i-th microphone, and ω_j indicates the main frequency of that frame.
  • the processing unit is specifically configured to:
  • a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined.
  • the processing unit is specifically configured to:
  • P i ( ⁇ ) represents the power spectrum of the i-th audio signal
  • Yi j ( ⁇ ) represents the j-th target signal frame in the i-th audio signal
  • K represents the total frame of the signal collected by each microphone Number
  • represents frequency
  • the processing unit is specifically configured to:
  • PD i ( ⁇ ) represents the power spectrum difference between the i-th microphone and the reference microphone
  • P 1 ( ⁇ ) represents the power spectrum of the reference microphone
  • P i ( ⁇ ) represents the power spectrum of the i-th microphone
  • the processing unit is specifically configured to:
  • determining the sampling frequency F_s and the number of FFT points N_fft of the N microphones during audio signal collection, using a speaker to play Gaussian white noise data or frequency sweep signal data, and controlling the N microphones to acquire the N audio signals, wherein, if the data played by the speaker is frequency-sweep signal data, the frequency-sweep signal data is composed of M+1 segments of equal length and different frequencies.
  • the processing unit is further configured to:
  • f i represents the frequency of the i-th stage signal
  • F s represents the sampling frequency
  • N fft represents the number of FFT points
  • S i (t) represents the i-th stage signal
  • the frequency sweep signal data played by the speaker is written in the following vector form:
  • S (t) represents the frequency sweep signal data played by the speaker
  • S i (t) represents the i-th segment signal
  • [] T represents the transpose of a vector or matrix.
  • the N microphones respectively collect N audio signals
  • the audio signal collected by the i-th microphone is represented as x i (t)
  • x i (t) can be written as the following vector form :
  • x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T
  • x i (t) represents the audio signal collected by the i-th microphone
  • K represents the total number of frames of the signal collected by each microphone
  • [] T represents the transpose of the vector or matrix.
  • the obtaining unit is specifically configured to:
  • the test room has an anechoic room environment
  • the speaker is an artificial mouth dedicated for audio testing
  • the artificial mouth is calibrated with a standard microphone before use.
  • the processing unit controls the speaker to play Gaussian white noise data or frequency sweep signal data
  • the obtaining unit is further configured to:
  • triggering the processing unit to calculate the signal-to-noise ratio SNR according to a formula, and ensuring that the SNR is greater than a first threshold.
  • a device for evaluating the consistency of a microphone array including:
  • a processor configured to call and run programs and data stored in the memory
  • the apparatus is configured to perform the method in the first aspect described above or any possible implementation thereof.
  • a system for assessing the consistency of a microphone array including:
  • N microphones forming a microphone array, N ≥ 2;
  • at least one audio source;
  • the device comprises a memory for storing programs and data and a processor for calling and running the programs and data stored in the memory, and the device is configured to perform the method in the first aspect or any possible implementation thereof.
  • a computer storage medium stores program code, and the program code may be used to instruct execution of the method in the first aspect or any possible implementation manner thereof.
  • a computer program product containing instructions which, when run on a computer, causes the computer to execute the method in the first aspect or any possible implementation thereof.
  • FIG. 1 is a schematic flowchart of a method for evaluating consistency of a microphone array according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a test environment according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of calculating a phase spectrum difference according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of calculating a power spectrum difference according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a phase spectrum difference between two microphones according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a phase spectrum difference value after calibration between two microphones according to an embodiment of the present application.
  • FIG. 7a is a schematic diagram of a power spectrum of two microphones according to an embodiment of the present application.
  • FIG. 7b is a schematic diagram of a power spectrum difference between two microphones according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a device for evaluating consistency of a microphone array according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an apparatus for evaluating consistency of a microphone array according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a system for evaluating consistency of a microphone array according to an embodiment of the present application.
  • Microphone array refers to a system composed of a certain number of microphones (acoustic sensors) that is used to sample and process the spatial characteristics of the sound field. The difference between the phases of the sound waves received by the microphones is used to filter the sound waves, which can suppress the ambient background sound to the maximum extent, leaving only the required sound waves.
  • the multi-channel speech enhancement technology algorithm assumes that the target speech components of multiple microphones in the microphone array are highly correlated, and the target speech is not related to non-target interference, so the consistency between different microphones in the microphone array directly affects the algorithm performance.
  • Quantitative evaluation of microphone consistency can be used to guide the design of microphones and the design of microphone arrays.
  • Microphone array circuits, electronic components, and acoustic structures all affect the consistency of microphones.
  • the effect of each factor on consistency can be tested item by item, so that the design of microphone consistency meets the system requirements.
  • Quantitative evaluation of microphone consistency can be used to compare the robustness of different algorithms: for the same speech enhancement performance, the lower an algorithm's requirement on the consistency indicators, the better its robustness.
  • consistency is measured from two aspects, the amplitude (power) spectrum difference and the phase spectrum difference, which is objective and accurate; the quantitative consistency evaluation method can objectively guide the design of the microphone array and can also objectively compare the robustness of multi-channel speech enhancement algorithms.
  • FIG. 1 is a schematic flowchart of a method for evaluating consistency of a microphone array according to an embodiment of the present application. It should be understood that FIG. 1 shows steps or operations of the method, but these steps or operations are merely examples, and other operations or variations of each operation in FIG. 1 may be performed in the embodiment of the present application.
  • the method may be executed by a device for evaluating the consistency of the microphone array, where the device for evaluating the consistency of the microphone array may be a mobile phone, a tablet computer, a portable computer, a Personal Digital Assistant (PDA), or the like.
  • S110 Obtain N audio signals collected by N microphones respectively, where the N microphones form a microphone array, and N ⁇ 2.
  • a microphone array 201 composed of the N microphones is placed in a test room 202, and a speaker 203 is disposed in the test room 202.
  • the microphone array 201 is located directly in front of the speaker 203.
  • the microphone array 201 and the speaker 203 are connected to a control device 204, such as a computer.
  • the control device 204 can control the speaker 203 to play specific audio data, for example, to play Gaussian white noise data or frequency-sweep signal data.
  • the control device 204 can obtain the N audio signals from the microphone array 201.
  • microphone consistency evaluation requires that the signal-to-noise ratio of the collected audio signals be sufficiently high and the background noise sufficiently weak, so the test must be performed in a quiet environment.
  • an anechoic room environment is required in the test room 202.
  • the speaker 203 requires a high signal-to-noise ratio and a flat frequency response curve.
  • the speaker uses an artificial mouth dedicated for audio testing, and is calibrated with a standard microphone before use.
  • the microphone array 201 is placed directly in front of the speaker 203, and in particular, it is required to be placed at a position calibrated by a standard microphone.
  • specifically, the signal-to-noise ratio (SNR) detection is performed as follows: first, in a quiet environment, the first audio data X_1(n) collected by the N microphones within a first duration T_1 is acquired; then, in the environment where Gaussian white noise data or frequency-sweep signal data is played (that is, the control device 204 controls the speaker 203 to play Gaussian white noise data or frequency-sweep signal data), the second audio data X_2(n) collected by the N microphones within a second duration T_2 is acquired, and the SNR is calculated according to the following Formula 1; finally, if the SNR is greater than a set threshold, the detection passes; otherwise, the detection fails.
  • T 1 represents the first duration
  • T 2 represents the second duration
  • X 1 (n) represents the first audio data
  • X 2 (n) represents the second audio data.
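Formula 1 itself is not reproduced in the text, but one plausible reading of it, the ratio of the average powers of the two recordings expressed in dB, can be sketched as follows (the signals, durations, and threshold value are illustrative):

```python
import numpy as np

def snr_db(x1, x2):
    """Plausible form of Formula 1: ratio, in dB, of the average power of
    X2(n) (recorded while the speaker plays) to the average power of X1(n)
    (recorded in the quiet room)."""
    p1 = np.mean(np.asarray(x1, dtype=float) ** 2)
    p2 = np.mean(np.asarray(x2, dtype=float) ** 2)
    return 10.0 * np.log10(p2 / p1)

rng = np.random.default_rng(0)
x1 = 0.01 * rng.standard_normal(16000)  # background recording, duration T1
x2 = rng.standard_normal(16000)         # playback recording, duration T2
snr = snr_db(x1, x2)
passes = snr > 30.0                     # hypothetical first threshold
```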
  • if the test fails, the above test environment needs to be adjusted or calibrated to eliminate factors that may affect the signal-to-noise ratio, until the SNR calculated according to the above Formula 1 is greater than the set threshold.
  • acquiring audio signals by using the test environment shown in FIG. 2 described above may specifically include:
  • the sampling frequency F s and the number of FFT points N fft of the N microphones during audio signal collection are determined, and Gaussian white noise data or frequency-sweep signal data is played using a speaker, and the N microphones collect the N audio signals.
  • the frequency-sweep signal data is composed of M + 1 segments of equal length and different frequencies.
  • each signal in the M + 1 segment signal can be calculated according to the following formula 2
  • each signal in the M + 1 segment signal can be calculated according to the following formula 3.
  • f i is the frequency of the ith signal
  • F s is the sampling frequency
  • N fft is the number of FFT points.
  • S_i(t) represents the i-th segment signal
  • f_i is the frequency of the i-th segment signal.
  • the frequency sweep signal data played by the speaker can be written in the following vector form:
  • S (t) represents the frequency sweep signal data played by the speaker
  • S i (t) represents the i-th segment signal
  • [] T represents the transpose of a vector or matrix.
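Under the assumption that the M+1 segments are sinusoids at bin-aligned frequencies f_i = i·F_s/N_fft with M = N_fft/2 (an assumption, since Formulas 2 and 3 are not reproduced in the text), the frequency-sweep data could be generated as:

```python
import numpy as np

def stepped_sweep(fs=16000, n_fft=64, seg_duration=0.1):
    """Build S(t) as M+1 equal-length sinusoidal segments at the assumed
    bin-aligned frequencies f_i = i*fs/n_fft, i = 0..M, M = n_fft//2."""
    m = n_fft // 2
    t = np.arange(int(fs * seg_duration)) / fs
    segments = [np.sin(2.0 * np.pi * (i * fs / n_fft) * t)
                for i in range(m + 1)]
    return np.concatenate(segments), m

sweep, m = stepped_sweep()
```

Bin-aligned frequencies avoid spectral leakage when each segment is analyzed with an N_fft-point FFT.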
  • the N microphones respectively acquire N audio signals
  • the audio signal collected by the i-th microphone is represented as x i (t)
  • x i (t) can be written as the following vector form:
  • x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T
  • x i (t) represents the audio signal collected by the i-th microphone
  • K represents the total number of frames of the signal collected by each microphone
  • [] T represents the transpose of the vector or matrix.
  • the reference microphone is any one of the N microphones.
  • the audio signals may be framed, each frame of the audio signal may be windowed, and each windowed signal frame may be FFT-transformed to obtain the phase spectrum difference between different microphones.
  • the N audio signals are x 1 (t), x 2 (t), ..., x N (t), and each of the N audio signals is divided into Frames, to obtain K signal frames of equal length, K ⁇ 2.
  • frame the i-th audio signal to obtain K signal frames of equal length and write the following vector form:
  • x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T
  • x i (t) represents the i-th audio signal
  • K represents the total number of frames collected by each microphone
  • [] T represents the transpose of a vector or a matrix
  • the K signal frames are windowed to obtain K windowed signal frames. For example, the j-th frame x_{i,j} of the i-th audio signal is windowed to obtain the j-th windowed signal frame of the i-th audio signal:
  • y_{i,j} = x_{i,j}·Win, where Win represents the window function;
  • imag(·) means taking the imaginary part
  • ln(·) means taking the natural logarithm (for a spectrum Y, imag(ln(Y)) is its phase)
  • Y_{i,j} represents the j-th target signal frame of the i-th microphone, and ω_j indicates the main frequency of that frame.
  • the first microphone is used as the reference microphone, that is, the phase spectrum difference between each microphone except the first microphone and the first microphone is calculated separately, and
  • the first microphone corresponds to the audio signal x 1 (t)
  • the second microphone corresponds to the audio signal x 2 (t)
  • ... the Nth microphone corresponds to the audio signal x N (t).
  • K represents the total number of frames of signals received by each microphone.
  • each of the K signal frames may be processed by adding a Hamming window.
  • any two adjacent signal frames in the K signal frames overlap by R%, and R> 0.
  • R is 25 or 50.
  • any two adjacent signal frames in the K signal frames overlap by 25% or 50%.
  • the signal amplitude remains unchanged after overlapping and windowing.
  • each frame of the signal after the overlap has a component of the previous frame to prevent discontinuity between the two frames.
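The framing-with-overlap and Hamming-windowing steps above can be sketched as follows; the frame length and the stand-in signal are illustrative:

```python
import numpy as np

def frame_and_window(x, frame_len, overlap_ratio=0.5):
    """Split x into K equal-length frames where adjacent frames overlap by
    overlap_ratio (25% or 50% in the text) and apply a Hamming window to
    each frame."""
    hop = int(frame_len * (1.0 - overlap_ratio))
    k = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([x[j * hop: j * hop + frame_len] * win
                     for j in range(k)])

x = np.arange(1024, dtype=float)              # stand-in audio signal
frames = frame_and_window(x, frame_len=256)   # 50% overlap -> hop of 128
```

Overlapping gives each frame a component of the previous frame, which is what prevents discontinuities between frames.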
  • the N audio signals are signals collected in an environment where the frequency sweep signal data is played.
  • the phase difference at any frequency ω can thus be calculated, that is, the phase spectrum difference PDiff_i(ω) between the i-th microphone and the reference microphone described above.
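Combining the steps, a per-bin phase spectrum difference can be sketched as below, taking the phase of each frame's FFT as imag(ln(Y)) as in the text; averaging the difference across the K frames is an assumption:

```python
import numpy as np

def phase_spectrum_difference(frames_i, frames_ref):
    """Phase spectrum difference between microphone i and the reference
    microphone, per frequency bin, averaged over the K windowed frames."""
    yi = np.fft.rfft(frames_i, axis=-1)
    yr = np.fft.rfft(frames_ref, axis=-1)
    # imag(ln(Y)) equals the phase angle of the complex spectrum Y
    return np.mean(np.imag(np.log(yi)) - np.imag(np.log(yr)), axis=0)

rng = np.random.default_rng(1)
ref = rng.standard_normal((7, 256))   # K = 7 windowed frames, reference mic
same = ref.copy()                      # a perfectly consistent microphone
pdiff = phase_spectrum_difference(same, ref)
```

A perfectly consistent microphone yields a zero difference curve; in practice the curve is compared against the algorithm-dependent threshold.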
  • the audio signals may be framed, each frame of the audio signal windowed, and each windowed signal frame FFT-transformed; the power spectrum of each signal frame is obtained from the FFT, and the power spectrum difference between different microphones is then found.
  • the N audio signals are x_1(t), x_2(t), ..., x_N(t), and each of the N audio signals is divided into frames to obtain K signal frames of equal length, K ≥ 2.
  • frame the i-th audio signal to obtain K signal frames of equal length and write the following vector form:
  • x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T
  • x i (t) represents the i-th audio signal
  • K represents the total number of frames received by each microphone
  • [] T represents the transpose of a vector or a matrix
  • the K signal frames are windowed to obtain K windowed signal frames. For example, windowing the j-th frame x i,j of the i-th audio signal yields the j-th windowed signal frame y i,j = x i,j ⋅ Win, where Win denotes the window function;
  • for each audio signal, the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined. For example, the power spectrum difference between the i-th microphone and the reference microphone is calculated.
  • P i ( ⁇ ) represents the power spectrum of the i-th audio signal
  • Y i,j (ω) represents the j-th target signal frame in the i-th audio signal
  • ω represents the frequency
  • K represents the total number of frames of the signal collected by each microphone.
  • PD i ( ⁇ ) represents the power spectrum difference between the i-th microphone and the reference microphone
  • P 1 ( ⁇ ) represents the power spectrum of the reference microphone
  • P i ( ⁇ ) represents the power spectrum of the i-th microphone
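A minimal sketch of the power spectrum P i (ω), averaged over the K windowed frames, and of the difference PD i (ω) expressed in decibels. The 1/K normalization and the dB expression are assumptions consistent with the symbols above, and the function names are hypothetical:

```python
import numpy as np

def power_spectrum(frames, nfft=512):
    """P_i(w): power spectrum averaged over the K windowed frames,
    (1/K) * sum_j |Y_{i,j}(w)|^2 (the 1/K normalization is assumed)."""
    Y = np.fft.rfft(frames, nfft, axis=1)
    return np.mean(np.abs(Y) ** 2, axis=0)

def power_spectrum_diff_db(P_i, P_ref, eps=1e-12):
    """PD_i(w): power spectrum difference between microphone i and the
    reference microphone, expressed in decibels."""
    return 10 * np.log10((P_i + eps) / (P_ref + eps))

# Example: doubling the amplitude raises the power spectrum by ~6.02 dB.
rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 512))
d = power_spectrum_diff_db(power_spectrum(2 * frames), power_spectrum(frames))
print(round(float(d.max()), 2))  # → 6.02
```

A flat, frequency-independent difference like this indicates a pure sensitivity (amplitude) mismatch rather than a frequency-response mismatch.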
  • the first microphone is used as the reference microphone, that is, the power spectrum difference between each microphone except the first microphone and the first microphone is calculated separately,
  • the first microphone corresponds to the audio signal x 1 (t)
  • the second microphone corresponds to the audio signal x 2 (t)
  • ... the Nth microphone corresponds to the audio signal x N (t).
  • each of the K signal frames may be processed by adding a Hamming window.
  • any two adjacent signal frames in the K signal frames overlap by R%, and R> 0.
  • R is 25 or 50.
  • any two adjacent signal frames in the K signal frames overlap by 25% or 50%.
  • the signal amplitude remains unchanged after overlapping and windowing.
  • each frame of the signal after the overlap has a component of the previous frame to prevent discontinuity between the two frames.
  • the N audio signals are signals collected in an environment in which Gaussian white noise data or frequency-sweep signal data is played.
  • S130: Perform a consistency evaluation on the N microphones according to the phase spectrum difference and/or power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
  • phase spectrum difference value is used for phase consistency evaluation
  • power spectrum difference value is used for amplitude consistency evaluation
  • the phase consistency between a corresponding microphone and the reference microphone is evaluated according to the phase spectrum difference value between each of the N microphones except the reference microphone and the reference microphone.
  • the phase spectrum difference between the microphone 1 and the reference microphone is A, and the smaller A is, the better the phase consistency between the microphone 1 and the reference microphone is.
  • a threshold can be set. If the phase spectrum difference between the two microphones is smaller than this threshold, the phase consistency between the two microphones meets the design requirements, and the effect of the consistency between the two microphones on the multi-channel speech enhancement algorithm can be ignored, or the consistency between the two microphones has no effect on the multi-channel speech enhancement algorithm.
  • thresholds can be flexibly configured according to different multi-channel speech enhancement algorithms.
  • the phase spectrum difference value may be calibrated by using a fixed phase difference.
  • d i represents the difference between the distances from the i-th microphone and the reference microphone to the sound source
  • a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone is calculated.
  • the fixed phase difference between the i-th microphone and the reference microphone can be calculated according to the following Formula 7, and the phase spectrum difference values are calibrated accordingly.
  • Y i ( ⁇ ) represents the frequency spectrum of the i-th microphone
  • Y 1 (ω) represents the frequency spectrum of the reference microphone
  • ω represents the frequency
  • d i represents the difference between the distances from the i-th microphone and the reference microphone to the sound source
  • c denotes the speed of sound
  • 2πfd i /c represents the fixed phase difference between the i-th microphone and the reference microphone.
  • the linear phase can be used to determine the fixed phase difference.
  • the fixed phase difference between the microphone 1 and the reference microphone is A
  • the phase spectrum difference between the microphone 1 and the reference microphone is B
  • the straight line represents the fitted linear phase between the microphone 1 and the reference microphone.
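The fixed phase difference 2πfd i /c and the linear-phase fit described above can be sketched as follows. The speed-of-sound value and the least-squares fit are assumptions; the text does not prescribe a particular fitting method:

```python
import numpy as np

C = 343.0  # assumed speed of sound, m/s

def fixed_phase_difference(freqs, d_i, c=C):
    """Fixed phase difference 2*pi*f*d_i/c caused by the path-length
    difference d_i between microphone i and the reference microphone."""
    return 2 * np.pi * freqs * d_i / c

def estimate_path_difference(freqs, phase_diff, c=C):
    """Fit a straight line (linear phase) to the measured phase
    spectrum difference; the slope gives the fixed delay, hence d_i."""
    slope = np.polyfit(freqs, phase_diff, 1)[0]  # radians per hertz
    return slope * c / (2 * np.pi)

# Example: a 1 cm path difference is recovered from its linear phase,
# then subtracted to calibrate the phase spectrum difference.
freqs = np.linspace(100.0, 8000.0, 200)
pd = fixed_phase_difference(freqs, 0.01)
d_est = estimate_path_difference(freqs, pd)
calibrated = pd - fixed_phase_difference(freqs, d_est)
print(round(float(d_est), 6))  # → 0.01
```

After subtracting the fitted fixed phase, any residual phase difference reflects genuine phase inconsistency of the microphones rather than their geometry.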
  • the amplitude consistency between the corresponding microphone and the reference microphone is evaluated based on the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
  • FIG. 7a shows the power spectrum of the microphone 1 and the power spectrum of the reference microphone
  • FIG. 7b shows the power spectrum difference between the microphone 1 and the reference microphone.
  • the maximum value of the power spectrum difference is within ±1 decibel (dB)
  • a threshold can be set. If the power spectrum difference between the two microphones is smaller than this threshold, the amplitude consistency between the two microphones meets the design requirements, and the effect of the consistency between the two microphones on the multi-channel speech enhancement algorithm can be ignored, or the consistency between the two microphones has no effect on the multi-channel speech enhancement algorithm.
  • thresholds can be flexibly configured according to different multi-channel speech enhancement algorithms.
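A sketch of the threshold-based amplitude consistency check, using the ±1 dB value from the example above as the default; as the text notes, real thresholds should be configured per the multi-channel speech enhancement algorithm:

```python
import numpy as np

def amplitude_consistent(pd_db, threshold_db=1.0):
    """True when the power spectrum difference (in dB) stays within
    +/- threshold_db across all frequencies; the 1 dB default mirrors
    the example in the text and is not a prescribed value."""
    return bool(np.max(np.abs(pd_db)) < threshold_db)

print(amplitude_consistent(np.array([0.2, -0.5, 0.8])))   # → True
print(amplitude_consistent(np.array([0.2, -1.5, 0.8])))   # → False
```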
  • the influence of factors such as the circuit, electronic components, and acoustic structure of the microphone array on the consistency of the microphone can be tested item by item to guide the calibration of the microphone array.
  • the phase spectrum difference and/or power spectrum difference between each microphone and the reference microphone may be determined according to the N audio signals collected by the N microphones, so as to perform a consistency evaluation on the N microphones, eliminate the impact of inconsistency between microphones on multi-channel speech enhancement algorithms, and improve user experience.
  • an embodiment of the present application provides a device 800 for evaluating the consistency of a microphone array, including:
  • the obtaining unit 810 is configured to obtain N audio signals collected by N microphones respectively, where the N microphones form a microphone array, and N ⁇ 2;
  • a processing unit 820 configured to determine a phase spectrum difference and / or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the N audio signals,
  • the reference microphone is any one of the N microphones;
  • the processing unit 820 is further configured to perform a consistency evaluation on the N microphones according to a phase spectrum difference value and/or a power spectrum difference value between each of the N microphones except the reference microphone and the reference microphone.
  • processing unit 820 is specifically configured to:
  • the phase consistency between the corresponding microphone and the reference microphone is evaluated.
  • processing unit 820 is further configured to:
  • the corresponding phase spectrum difference values thereof are respectively calibrated.
  • processing unit 820 is specifically configured to:
  • Y i ( ⁇ ) represents the frequency spectrum of the i-th microphone
  • Y 1 (ω) represents the frequency spectrum of the reference microphone
  • ω represents the frequency
  • d i represents the difference between the distances from the i-th microphone and the reference microphone to the sound source
  • c denotes the speed of sound
  • 2πfd i /c represents the fixed phase difference between the i-th microphone and the reference microphone.
  • processing unit 820 is specifically configured to:
  • An amplitude consistency between a corresponding microphone and the reference microphone is evaluated according to a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
  • the N audio signals are signals collected in an environment where the frequency sweep signal data is played.
  • the N audio signals are signals collected in an environment where Gaussian white noise data or frequency-sweep signal data is played.
  • the frequency sweep signal is any one of a linear frequency sweep signal, a logarithmic frequency sweep signal, a linear step frequency sweep signal, and a logarithmic step frequency sweep signal.
  • processing unit 820 is specifically configured to:
  • any two adjacent signal frames of the K signal frames overlap by R%, and R> 0.
  • the R is 25 or 50.
  • x i (t) = [x i,1 (t), x i,2 (t), ..., x i,K (t)] T
  • x i (t) represents the i-th audio signal
  • K represents the total number of frames collected by each microphone
  • [] T represents the transpose of a vector or a matrix.
  • processing unit 820 is specifically configured to:
  • imag () means take the imaginary part
  • ln () means take the natural logarithm
  • Y i,j represents the j-th target signal frame of the i-th microphone, and ω j indicates the main frequency of that frame.
  • processing unit 820 is specifically configured to:
  • a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined.
  • processing unit 820 is specifically configured to:
  • P i ( ⁇ ) represents the power spectrum of the i-th audio signal
  • Y i,j (ω) represents the j-th target signal frame in the i-th audio signal
  • K represents the total number of frames of the signal collected by each microphone
  • ω represents the frequency
  • processing unit 820 is specifically configured to:
  • PD i ( ⁇ ) represents the power spectrum difference between the i-th microphone and the reference microphone
  • P 1 ( ⁇ ) represents the power spectrum of the reference microphone
  • P i ( ⁇ ) represents the power spectrum of the i-th microphone
  • processing unit 820 is specifically configured to:
  • determine the sampling frequency F s and the number of FFT points N fft of the N microphones during audio signal collection; use a speaker to play Gaussian white noise data or frequency sweep signal data, and control the N microphones to acquire the N audio signals, wherein, if the data played by the speaker is frequency-sweep signal data, the frequency-sweep signal data is composed of M+1 segments of equal length and different frequencies.
  • processing unit 820 is further configured to:
  • f i represents the frequency of the i-th stage signal
  • F s represents the sampling frequency
  • N fft represents the number of FFT points
  • S i (t) represents the i-th stage signal
  • the frequency sweep signal data played by the speaker is written in the following vector form:
  • S (t) represents the frequency sweep signal data played by the speaker
  • S i (t) represents the i-th segment signal
  • [] T represents the transpose of a vector or matrix.
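The stepped frequency-sweep data S(t) built from M+1 equal-length segments can be sketched as follows. The bin-aligned frequency mapping f i = i·F s /N fft is an assumption suggested by the F s and N fft parameters above, not a formula quoted from the text, and the segment length is an example value:

```python
import numpy as np

def stepped_sweep(fs=16000, nfft=512, seg_len=1024, m=256):
    """S(t): concatenation of M+1 equal-length segments S_i(t), each a
    tone at a different frequency f_i; f_i = i * fs / nfft is an
    assumed mapping onto the FFT bin frequencies."""
    t = np.arange(seg_len) / fs
    segments = [np.sin(2 * np.pi * (i * fs / nfft) * t) for i in range(m + 1)]
    return np.concatenate(segments)

s = stepped_sweep()
print(s.shape)  # (m + 1) * seg_len samples in total
```

Aligning each segment's frequency with an FFT bin keeps each tone's energy concentrated in a single bin when the collected audio is later framed and transformed.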
  • the N microphones respectively acquire N audio signals, where the audio signal collected by the i-th microphone is represented as x i (t), and x i (t) can be written in the following vector form:
  • x i (t) = [x i,1 (t), x i,2 (t), ..., x i,K (t)] T
  • x i (t) represents the audio signal collected by the i-th microphone
  • K represents the total number of frames of the signal collected by each microphone
  • [] T represents the transpose of the vector or matrix.
  • the obtaining unit 810 is specifically configured to:
  • the test room is an anechoic room environment
  • the speaker is an artificial mouth dedicated for audio testing
  • the artificial mouth is calibrated with a standard microphone before use.
  • the processing unit 820 controls the speaker to play Gaussian white noise data or frequency sweep signal data
  • the obtaining unit 810 is further configured to:
  • trigger the processing unit 820 to calculate the signal-to-noise ratio (SNR) according to a formula, and ensure that the SNR is greater than a first threshold.
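A sketch of the SNR check that the obtaining unit triggers. The exact formula is not reproduced in the text above, so the standard power-ratio definition 10·log10(P signal /P noise ) is used as an assumption, and the threshold value is hypothetical:

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio as 10*log10(P_signal / P_noise); the
    power-ratio definition is an assumption, not a quoted formula."""
    return 10 * np.log10(np.mean(np.square(signal)) /
                         np.mean(np.square(noise)))

def snr_ok(signal, noise, first_threshold_db=20.0):
    """Ensure the SNR is greater than a first threshold (value assumed)."""
    return snr_db(signal, noise) > first_threshold_db

# Example: unit-amplitude signal against 0.01-amplitude noise → 40 dB.
sig = np.ones(1000)
noise = 0.01 * np.ones(1000)
print(round(float(snr_db(sig, noise)), 1))  # → 40.0
```

Gating the measurement on a minimum SNR ensures that the phase and power spectrum differences reflect the microphones themselves rather than ambient noise in the test room.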
  • an embodiment of the present application provides a device 900 for evaluating consistency of a microphone array, including:
  • a memory 910 for storing programs and data
  • a processor 920 configured to call and run a program and data stored in the memory
  • the device 900 is configured to perform the methods shown in FIGS. 1 to 7 described above.
  • an embodiment of the present application provides a system 1000 for evaluating consistency of a microphone array, including:
  • at least one audio source 1020;
  • the device 1030 includes a memory 1031 for storing programs and data and a processor 1032 for calling and running the programs and data stored in the memory, and the device 1030 is configured to perform the methods shown in FIGS. 1 to 7 described above.
  • the size of the sequence numbers of the above processes does not mean the order of execution.
  • the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, which may be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.
  • When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of this application, or the part of it that contributes to the existing technology, can essentially be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.
  • the foregoing storage media include: USB flash disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program codes.

Landscapes

  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

Embodiments of the present application provide a method, device, apparatus, and system for evaluating the consistency of a microphone array, capable of evaluating the consistency between different microphones in a microphone array, thereby guiding the calibration of the microphone array, evaluating the robustness of a multi-channel enhancement algorithm according to the consistency evaluation result, and improving user experience. The method comprises: obtaining N audio signals collected by N microphones respectively, the N microphones forming the microphone array, and N being greater than or equal to 2; determining, according to the N audio signals, a phase spectrum difference value and/or a power spectrum difference value between each of the N microphones except a reference microphone and the reference microphone, the reference microphone being any one of the N microphones; and performing a consistency evaluation on the N microphones according to the phase spectrum difference value and/or the power spectrum difference value between each of the N microphones except the reference microphone and the reference microphone.
PCT/CN2018/101766 2018-08-22 2018-08-22 Procédé, dispositif, appareil et système d'évaluation de la cohérence d'un réseau de microphones WO2020037555A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/101766 WO2020037555A1 (fr) 2018-08-22 2018-08-22 Procédé, dispositif, appareil et système d'évaluation de la cohérence d'un réseau de microphones
CN202310466643.4A CN116437280A (zh) 2018-08-22 2018-08-22 评估麦克风阵列一致性的方法、设备、装置和系统
CN201880001199.6A CN109313909B (zh) 2018-08-22 2018-08-22 评估麦克风阵列一致性的方法、设备、装置和系统

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/101766 WO2020037555A1 (fr) 2018-08-22 2018-08-22 Procédé, dispositif, appareil et système d'évaluation de la cohérence d'un réseau de microphones

Publications (1)

Publication Number Publication Date
WO2020037555A1 true WO2020037555A1 (fr) 2020-02-27

Family

ID=65221692

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/101766 WO2020037555A1 (fr) 2018-08-22 2018-08-22 Procédé, dispositif, appareil et système d'évaluation de la cohérence d'un réseau de microphones

Country Status (2)

Country Link
CN (2) CN109313909B (fr)
WO (1) WO2020037555A1 (fr)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111807B (zh) * 2019-04-27 2022-01-11 南京理工大学 一种基于麦克风阵列的室内声源跟随与增强方法
CN110636432A (zh) * 2019-09-29 2019-12-31 深圳市火乐科技发展有限公司 麦克风测试方法及相关设备
CN111065036B (zh) * 2019-12-26 2021-08-31 北京声智科技有限公司 一种麦克风阵列的频响测试方法及装置
CN112672265B (zh) * 2020-10-13 2022-06-28 珠海市杰理科技股份有限公司 检测麦克风阵一致性的方法及系统、计算机可读存储介质
WO2022150950A1 (fr) * 2021-01-12 2022-07-21 华为技术有限公司 Procédé et appareil d'évaluation de la cohérence d'un réseau de microphones
CN113259830B (zh) * 2021-04-26 2023-03-21 歌尔股份有限公司 一种多麦克一致性测试系统及方法
CN114390421A (zh) * 2021-12-03 2022-04-22 伟创力电子技术(苏州)有限公司 一种麦克风矩阵和喇叭的自动测试方法
CN114222234A (zh) * 2021-12-31 2022-03-22 思必驰科技股份有限公司 麦克风阵列一致性的检测方法、电子设备和存储介质
CN114449434B (zh) * 2022-04-07 2022-08-16 北京荣耀终端有限公司 麦克风校准方法及电子设备
CN115776626B (zh) * 2023-02-10 2023-05-02 杭州兆华电子股份有限公司 一种麦克风阵列的频响校准方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103871420A (zh) * 2012-12-13 2014-06-18 华为技术有限公司 麦克风阵列的信号处理方法及装置
CN106161751A (zh) * 2015-04-14 2016-11-23 电信科学技术研究院 一种噪声抑制方法及装置
US20170078819A1 (en) * 2014-05-05 2017-03-16 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. System, apparatus and method for consistent acoustic scene reproduction based on adaptive functions
CN107864444A (zh) * 2017-11-01 2018-03-30 大连理工大学 一种麦克风阵列频响校准方法

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006033734A (ja) * 2004-07-21 2006-02-02 Sanyo Electric Co Ltd 電気製品の音検査方法及び電気製品の音検査装置
CN1756444B (zh) * 2004-09-30 2011-09-28 富迪科技股份有限公司 电声系统的自我检测校正方法
US8126156B2 (en) * 2008-12-02 2012-02-28 Hewlett-Packard Development Company, L.P. Calibrating at least one system microphone
US8620672B2 (en) * 2009-06-09 2013-12-31 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for phase-based processing of multichannel signal
WO2011057346A1 (fr) * 2009-11-12 2011-05-19 Robert Henry Frater Réseaux de postes téléphoniques à haut-parleur et/ou de microphones et procédés et systèmes d'utilisation associés
CN102111697B (zh) * 2009-12-28 2015-03-25 歌尔声学股份有限公司 一种麦克风阵列降噪控制方法及装置
CN102075848B (zh) * 2011-02-17 2014-05-21 深圳市豪恩声学股份有限公司 阵列麦克风的测试方法、系统及转动装置
EP2565667A1 (fr) * 2011-08-31 2013-03-06 Friedrich-Alexander-Universität Erlangen-Nürnberg Évaluation de direction d'arrivée à l'aide de signaux audio filigranés et réseaux de microphone
US9609141B2 (en) * 2012-10-26 2017-03-28 Avago Technologies General Ip (Singapore) Pte. Ltd. Loudspeaker localization with a microphone array
CN103247298B (zh) * 2013-04-28 2015-09-09 华为技术有限公司 一种灵敏度校准方法和音频设备
CN103559330B (zh) * 2013-10-10 2017-04-12 上海华为技术有限公司 数据一致性检测方法及系统
WO2016209098A1 (fr) * 2015-06-26 2016-12-29 Intel Corporation Correction de réponse en phase inadaptée pour de multiples microphones
CN105554674A (zh) * 2015-12-28 2016-05-04 努比亚技术有限公司 一种麦克风校准方法、装置及移动终端

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103871420A (zh) * 2012-12-13 2014-06-18 华为技术有限公司 麦克风阵列的信号处理方法及装置
US20170078819A1 (en) * 2014-05-05 2017-03-16 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. System, apparatus and method for consistent acoustic scene reproduction based on adaptive functions
CN106161751A (zh) * 2015-04-14 2016-11-23 电信科学技术研究院 一种噪声抑制方法及装置
CN107864444A (zh) * 2017-11-01 2018-03-30 大连理工大学 一种麦克风阵列频响校准方法

Also Published As

Publication number Publication date
CN109313909B (zh) 2023-05-12
CN116437280A (zh) 2023-07-14
CN109313909A (zh) 2019-02-05

Similar Documents

Publication Publication Date Title
WO2020037555A1 (fr) Procédé, dispositif, appareil et système d'évaluation de la cohérence d'un réseau de microphones
CN106486131B (zh) 一种语音去噪的方法及装置
CN109831733B (zh) 音频播放性能的测试方法、装置、设备和存储介质
JP6889698B2 (ja) 音声を増幅する方法及び装置
CN110880329B (zh) 一种音频识别方法及设备、存储介质
CN108766454A (zh) 一种语音噪声抑制方法及装置
US20140337021A1 (en) Systems and methods for noise characteristic dependent speech enhancement
US11069366B2 (en) Method and device for evaluating performance of speech enhancement algorithm, and computer-readable storage medium
KR20120116442A (ko) 노이즈 억제 시스템을 위한 왜곡 측정
WO2021135547A1 (fr) Procédé de détection de voix humaine, appareil, dispositif et support de stockage
CN109256139A (zh) 一种基于Triplet-Loss的说话人识别方法
WO2021000498A1 (fr) Procédé, dispositif et appareil de reconnaissance de parole composite et support d'informations lisible par ordinateur
US20140321655A1 (en) Sensitivity Calibration Method and Audio Device
CN110290453B (zh) 无线播放设备的延时测试方法及系统
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
CN110169082A (zh) 组合音频信号输出
Enzinger et al. Mismatched distances from speakers to telephone in a forensic-voice-comparison case
CN110875037A (zh) 语音数据处理方法、装置及电子设备
Raikar et al. Effect of Microphone Position Measurement Error on RIR and its Impact on Speech Intelligibility and Quality.
CN106710602B (zh) 一种声学混响时间估计方法和装置
US20200388275A1 (en) Voice processing device and voice processing method
CN111885474A (zh) 麦克风测试方法及装置
CN113593604A (zh) 检测音频质量方法、装置及存储介质
CN113314127A (zh) 基于空间方位的鸟鸣识别方法、系统、计算机设备与介质
WO2022150950A1 (fr) Procédé et appareil d'évaluation de la cohérence d'un réseau de microphones

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18930587

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18930587

Country of ref document: EP

Kind code of ref document: A1