WO2020037555A1

WO2020037555A1 - Method, device, apparatus, and system for evaluating microphone array consistency

Info

Publication number: WO2020037555A1
Application number: PCT/CN2018/101766
Authority: WO
Inventors: 李国梁; 罗朝洪; 程树青
Original assignee: 深圳市汇顶科技股份有限公司
Priority date: 2018-08-22
Filing date: 2018-08-22
Publication date: 2020-02-27
Also published as: CN109313909A; CN116437280A; CN109313909B

Abstract

Embodiments of the present application provide a method, device, apparatus, and system for evaluating microphone array consistency, which can evaluate the consistency among different microphones in a microphone array and, based on the consistency evaluation result, guide calibration of the microphone array and evaluate the robustness of a multi-channel enhancement algorithm, thereby improving the user experience. The method comprises: obtaining N audio signals collected by N microphones respectively, the N microphones forming the microphone array, and N being greater than or equal to 2; determining, according to the N audio signals, a phase spectrum difference and/or a power spectrum difference between each of the N microphones other than a reference microphone and the reference microphone, the reference microphone being any one of the N microphones; and performing consistency evaluation on the N microphones according to the phase spectrum difference and/or the power spectrum difference between each of the N microphones other than the reference microphone and the reference microphone.

Description

Method, device, apparatus, and system for evaluating microphone array consistency

Technical field

The present application relates to the field of voice communication and intelligent voice interaction, and more particularly, to a method, a device, an apparatus, and a system for evaluating the consistency of a microphone array.

Background

In voice communication applications, speech enhancement technology can improve the listening experience and the intelligibility of speech. In intelligent voice interaction applications, speech enhancement technology can improve the accuracy of speech recognition and enhance the user experience. Speech enhancement technology is therefore vital in both traditional voice communication and voice interaction. It is divided into single-channel speech enhancement and multi-channel speech enhancement. Single-channel speech enhancement can eliminate steady-state noise but cannot eliminate non-steady-state noise, and it improves the signal-to-noise ratio at the expense of speech damage: the more the signal-to-noise ratio is increased, the greater the speech damage. Multi-channel speech enhancement uses a microphone array to collect multiple signals and uses the phase information and coherence information between the microphone signals to eliminate noise, so it can eliminate non-steady-state noise while causing little speech damage.

In multi-channel speech enhancement technology, the consistency between different microphones in the microphone array directly affects the performance of the algorithm. Existing schemes propose improved algorithms for multi-channel enhancement that increase the robustness of the algorithm and at the same time lower the performance requirement. However, when the consistency between microphones is very low, algorithm performance is still affected, which degrades the user experience.

Summary of the Invention

The present application provides a method, device, apparatus, and system for evaluating the consistency of a microphone array, which can evaluate the consistency between different microphones in the microphone array and, based on the consistency evaluation result, guide the calibration of the microphone array and evaluate the robustness of a multi-channel enhancement algorithm, thereby improving the user experience.

In a first aspect, a method for assessing the consistency of a microphone array is provided, including:

Obtain N audio signals collected by N microphones respectively, and the N microphones form a microphone array, N≥2;

According to the N audio signals, determine a phase spectrum difference and/or a power spectrum difference between each of the N microphones except a reference microphone and the reference microphone, where the reference microphone is any one of the N microphones;

According to the phase spectrum difference and / or power spectrum difference between each of the N microphones except the reference microphone and the reference microphone, a consistency evaluation is performed on the N microphones.

It should be noted that the consistency evaluation of the N microphones can be used to guide the microphone distribution in the microphone array, to redesign the microphone distribution in the microphone array, to redesign the microphone array, or to evaluate the robustness of a multi-channel enhancement algorithm.

For example, when the evaluation results show that the consistency between microphone 1 and microphone 2 is poor, the distribution of microphone 1 or microphone 2 in the microphone array can be guided, or the microphone 1 or microphone 2 can be redesigned.

For another example, when the evaluation result shows that the consistency between the microphone 1 and multiple microphones is poor, the distribution of the microphone 1 in the microphone array can be guided, or the microphone 1 can be redesigned, or the microphone array can be redesigned.

In the embodiment of the present application, a phase spectrum difference and/or a power spectrum difference between each microphone and a reference microphone is determined according to the N audio signals collected by the N microphones respectively, and consistency evaluation is performed on the N microphones on that basis, so as to eliminate the impact of inconsistency between microphones on multi-channel speech enhancement algorithms and improve the user experience.

In some possible implementation manners, performing a consistency evaluation on the N microphones according to a phase spectrum difference value between each of the N microphones except the reference microphone and the reference microphone includes:

According to the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone, the phase consistency between the corresponding microphone and the reference microphone is evaluated.

It should be noted that the smaller the phase spectrum difference between the two microphones, the better the phase consistency between the two microphones.

For example, the phase spectrum difference between the microphone 1 and the reference microphone is A, and the smaller A is, the better the phase consistency between the microphone 1 and the reference microphone is.

Optionally, a threshold can be set. If the phase spectrum difference between two microphones is smaller than this threshold, the phase consistency between the two microphones meets the design requirements, and the effect of the consistency between the two microphones on the multi-channel speech enhancement algorithm can be ignored, or the consistency between the two microphones has no effect on the multi-channel speech enhancement algorithm.

It should be noted that the above thresholds can be flexibly configured according to different multi-channel speech enhancement algorithms.

In some possible implementation manners, the method further includes:

Separately measure the distance difference between each of the N microphones except the reference microphone and the reference microphone to the sound source;

Calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone according to the measured distance difference;

According to a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone, the corresponding phase spectrum difference values are calibrated respectively.

For example, the fixed phase difference between microphone 1 and the reference microphone is A, and the phase spectrum difference between microphone 1 and the reference microphone is B. After calibration, the phase spectrum difference between microphone 1 and the reference microphone is C. In this case, C = B - A.

In some possible implementation manners, the calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone according to the measured distance includes:

According to formula

Calculate a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone,

Wherein, Y_i(ω) represents the frequency spectrum of the i-th microphone, Y_1(ω) represents the frequency spectrum of the reference microphone, ω represents the frequency, d_i represents the difference between the distance from the i-th microphone to the sound source and the distance from the reference microphone to the sound source, c represents the speed of sound, and 2πωd_i/c represents the fixed phase difference between the i-th microphone and the reference microphone.
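The calibration described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of the present application: the function names are hypothetical, the speed of sound is assumed to be 343 m/s, ω is assumed to be expressed in Hz and d_i in metres, and the fixed phase difference is taken to be 2πωd_i/c exactly as defined in the text.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees Celsius


def fixed_phase_difference(freq_hz, distance_diff_m, c=SPEED_OF_SOUND):
    """Fixed phase difference 2*pi*omega*d_i/c in radians (hypothetical helper)."""
    return 2.0 * math.pi * freq_hz * distance_diff_m / c


def calibrate_phase_difference(measured_diff, freq_hz, distance_diff_m,
                               c=SPEED_OF_SOUND):
    """Remove the fixed phase difference A from the measured value B."""
    return measured_diff - fixed_phase_difference(freq_hz, distance_diff_m, c)
```

With these assumptions, a microphone that sits 1 cm farther from the source than the reference microphone would have about 0.18 rad removed from its measured phase spectrum difference at 1 kHz.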

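In some possible implementation manners, performing a consistency evaluation on the N microphones according to a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone includes:

According to the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone, the amplitude consistency between the corresponding microphone and the reference microphone is evaluated.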

It should be noted that the smaller the power spectrum difference between the two microphones, the better the amplitude consistency between the two microphones.

For example, the power spectrum difference between the microphone 1 and the reference microphone is A, and the smaller the A, the better the amplitude consistency between the microphone 1 and the reference microphone.

Optionally, a threshold can be set. If the power spectrum difference between two microphones is smaller than this threshold, the amplitude consistency between the two microphones meets the design requirements, and the effect of the consistency between the two microphones on the multi-channel speech enhancement algorithm can be ignored, or the consistency between the two microphones has no effect on the multi-channel speech enhancement algorithm.
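The threshold comparison used for both the phase and the amplitude evaluation might look like the following sketch. The function name is hypothetical, and comparing the absolute value of each difference is an assumption made here because a difference can be negative; the text only states that the difference should be smaller than the threshold.

```python
def meets_consistency_requirement(differences, threshold):
    """Return True if every difference value stays below the threshold.

    differences: phase or power spectrum differences between one microphone
    and the reference microphone, one value per frequency.
    threshold: limit chosen for the target multi-channel enhancement algorithm.
    """
    return all(abs(d) < threshold for d in differences)
```

The threshold itself would be configured per algorithm, as noted above for the phase case.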

In some possible implementation manners, when performing phase consistency evaluation, the N audio signals are signals collected in an environment in which the frequency-sweep signal data is played.

In some possible implementation manners, when performing amplitude consistency evaluation, the N audio signals are signals collected in an environment where Gaussian white noise data or frequency-sweep signal data is played.

In some possible implementation manners, the frequency sweep signal is any one of a linear frequency sweep signal, a logarithmic frequency sweep signal, a linear step frequency sweep signal, and a logarithmic step frequency sweep signal.

In some possible implementation manners, determining, according to the N audio signals, a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone includes:

Frame each of the N audio signals to obtain K signal frames of equal length, K≥2;

Perform windowing on each of the K signal frames to obtain K windowed signal frames;

Perform a fast Fourier transform (FFT) on each of the K windowed signal frames to obtain K target signal frames;

According to the K target signal frames corresponding to each audio signal, a phase spectrum difference value and / or a power spectrum difference value between each of the N microphones except the reference microphone and the reference microphone are determined.

Optionally, K represents the total number of frames of the signal collected by each microphone.

It should be noted that the windowing process is used to eliminate the truncation effect brought by the framing. Optionally, each of the K signal frames may be processed by adding a Hamming window.

In some possible implementation manners, any two adjacent signal frames in the K signal frames overlap by R%, and R> 0. For example, R is 25 or 50.

Optionally, the signal amplitude remains unchanged after overlapping and windowing.

It should be understood that each frame of the signal after the overlap has a component of the previous frame to prevent discontinuity between the two frames.
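The framing, windowing, and FFT steps can be illustrated with NumPy as follows. This is a sketch under assumptions that the text does not fix: the frame length is taken to be equal to the number of FFT points, trailing samples that do not fill a whole frame are dropped, and a real-input FFT (`numpy.fft.rfft`) is used so that only non-negative frequencies are kept. The function name and default values are hypothetical.

```python
import numpy as np


def frame_window_fft(signal, n_fft=256, overlap_percent=50):
    """Split a 1-D signal into overlapping frames, window them, and FFT each frame.

    Returns a complex array of shape (K, n_fft // 2 + 1), one row per frame.
    """
    signal = np.asarray(signal, dtype=float)
    hop = int(n_fft * (100 - overlap_percent) / 100)   # samples between frame starts
    if hop <= 0 or len(signal) < n_fft:
        raise ValueError("signal too short or overlap too large")
    num_frames = 1 + (len(signal) - n_fft) // hop      # K; leftover samples are dropped
    window = np.hamming(n_fft)                         # Hamming window, as suggested above
    frames = np.stack([signal[j * hop: j * hop + n_fft] for j in range(num_frames)])
    return np.fft.rfft(frames * window, axis=1)
```

For example, a 1024-sample signal with 256-point frames and R = 50 yields K = 7 target signal frames of 129 frequency bins each.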

In some possible implementation manners, the i-th audio signal is framed to obtain K signal frames of equal length, which are written in the following vector form:

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

Among them, x_i(t) represents the i-th audio signal, K represents the total number of frames of the signal collected by each microphone, and []^T represents the transpose of a vector or matrix.

In some possible implementation manners, determining the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the K target signal frames corresponding to each audio signal includes:

According to formula

Determine a phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone,

Among them, imag() means taking the imaginary part and ln() means taking the natural logarithm; the remaining symbols in the formula represent, respectively, the phase spectrum difference between the i-th microphone and the reference microphone, the j-th target signal frame of the reference microphone, the j-th target signal frame of the i-th microphone, and the main frequency.
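One way to realize a phase spectrum difference built from imag() and ln() is sketched below. Several points are assumptions rather than statements of the present application: the difference for frame j is taken as imag(ln(Y_1 / Y_i)) with the reference microphone in the numerator, the main frequency of a frame is taken as the bin where the reference frame has the largest magnitude, and the function and variable names are hypothetical.

```python
import numpy as np


def phase_spectrum_difference(ref_frames, mic_frames):
    """Per-frame phase difference at each frame's main frequency.

    ref_frames, mic_frames: complex arrays of shape (K, n_bins) holding the
    target signal frames of the reference microphone and of the i-th microphone.
    Returns (main_bins, phase_diff), each of length K.
    """
    ref_frames = np.asarray(ref_frames)
    mic_frames = np.asarray(mic_frames)
    main_bins = np.argmax(np.abs(ref_frames), axis=1)   # assumed main-frequency bin
    rows = np.arange(ref_frames.shape[0])
    ratio = ref_frames[rows, main_bins] / mic_frames[rows, main_bins]
    phase_diff = np.imag(np.log(ratio))                 # angle of the ratio, in (-pi, pi]
    return main_bins, phase_diff
```

Because the imaginary part of a complex logarithm is the argument of the number, the result is wrapped to (-π, π]; a sweep with large delays would need unwrapping, which this sketch does not attempt.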

In some possible implementation manners, determining the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the K target signal frames corresponding to each audio signal includes:

Determine the power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal;

According to the power spectrum of each audio signal, a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined.

In some possible implementation manners, determining the power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal includes:

According to formula

Calculate the power spectrum of each audio signal,

Among them, P_i(ω) represents the power spectrum of the i-th audio signal, Y_{i,j}(ω) represents the j-th target signal frame of the i-th audio signal, K represents the total number of frames of the signal collected by each microphone, and ω represents the frequency.

In some possible implementation manners, the determining a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the power spectrum of each audio signal includes:

Calculate the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the formula PD_i(ω) = P_1(ω) - P_i(ω),

Among them, PD_i(ω) represents the power spectrum difference between the i-th microphone and the reference microphone, P_1(ω) represents the power spectrum of the reference microphone, and P_i(ω) represents the power spectrum of the i-th microphone.
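The power spectrum and its difference can be sketched as follows. The difference PD_i(ω) = P_1(ω) - P_i(ω) follows the text; the power spectrum itself is assumed here to be the squared magnitude of the target signal frames averaged over the K frames, on a linear scale, which is a common estimate but is not confirmed by the text. The function names are hypothetical.

```python
import numpy as np


def power_spectrum(frames):
    """Assumed estimate: mean of |Y_{i,j}(w)|^2 over the K frames (rows)."""
    frames = np.asarray(frames)
    return np.mean(np.abs(frames) ** 2, axis=0)


def power_spectrum_difference(ref_frames, mic_frames):
    """PD_i(w) = P_1(w) - P_i(w), reference microphone minus i-th microphone."""
    return power_spectrum(ref_frames) - power_spectrum(mic_frames)
```

If the comparison is wanted in decibels, the same difference could be taken after applying 10·log10 to each power spectrum; that variant is a design choice and is not shown here.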

In some possible implementation manners, the acquiring the N audio signals respectively acquired by the N microphones includes:

Determine the sampling frequency F_s and the number of FFT points N_fft used by the N microphones during audio signal collection, use the speaker to play Gaussian white noise data or frequency sweep signal data, and have the N microphones collect the N audio signals, where, if the data played by the speaker is frequency sweep signal data, the frequency sweep signal data is composed of M + 1 signal segments of equal length and different frequencies.

It should be noted that the number of FFT points N_fft is an even number, generally 32, 64, 128, ..., 1024, and so on; the fewer the points, the greater the savings in the amount of calculation.

In some possible implementations, according to the formula

Calculate the frequency of each signal in the M + 1 segment signal, and

Calculate each signal in the M + 1 segment signal according to the formula S_i(t) = sin(2πf_i t),

Among them, f_i represents the frequency of the i-th segment signal, F_s represents the sampling frequency, N_fft represents the number of FFT points, S_i(t) represents the i-th segment signal, and the length of S_1(t) is an integer multiple of the period T, where T = 1/f_1.

In some possible implementations, the frequency sweep signal data played by the speaker can be written in the following vector form:

S(t) = [S_0(t), S_1(t), ..., S_M(t)]^T

Among them, S(t) represents the frequency sweep signal data played by the speaker, S_i(t) represents the i-th segment signal, and []^T represents the transpose of a vector or matrix.
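A stepped sweep of this kind could be generated as in the sketch below. Only S_i(t) = sin(2πf_i t) and the M + 1 equal-length segments come from the text. The frequency grid f_i = i·F_s/N_fft with M = N_fft/2 is an assumption, chosen so that each segment falls on an FFT bin, and the segment length is assumed to be a whole number of periods of f_1. The function name and default values are hypothetical.

```python
import numpy as np


def stepped_sweep(fs=16000, n_fft=64, periods_per_segment=4):
    """Concatenate M + 1 equal-length sine segments S_i(t) = sin(2*pi*f_i*t)."""
    m = n_fft // 2                                  # assumed: last segment sits at fs / 2
    seg_len = periods_per_segment * n_fft           # whole periods of f_1 = fs / n_fft
    t = np.arange(seg_len) / fs
    freqs = np.arange(m + 1) * fs / n_fft           # assumed grid f_i = i * fs / n_fft
    segments = [np.sin(2 * np.pi * f * t) for f in freqs]
    return freqs, np.concatenate(segments)
```

Under this assumed grid the first segment (0 Hz) and the last segment (F_s/2) sample to essentially zero, so a practical sweep would probably skip or offset them.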

In some possible implementation manners, the N microphones respectively collect N audio signals, the audio signal collected by the i-th microphone is represented as x_i(t), and x_i(t) can be written in the following vector form:

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

Among them, x_i(t) represents the audio signal collected by the i-th microphone, K represents the total number of frames of the signal collected by each microphone, and []^T represents the transpose of a vector or matrix.

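In some possible implementation manners, acquiring the N audio signals respectively collected by the N microphones includes:

Placing the N microphones in a test room in which a speaker is arranged, the N microphones being located directly in front of the speaker;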

Controlling the speaker to play Gaussian white noise data or frequency sweep signal data, and controlling the N microphones to respectively acquire the N audio signals.

In some possible implementations, the test room has an anechoic room environment, the speaker is an artificial mouth dedicated for audio testing, and the artificial mouth is calibrated with a standard microphone before use.

In some possible implementation manners, before controlling the speaker to play Gaussian white noise data or frequency sweep signal data, the method further includes:

In a quiet environment, acquiring first audio data X ₁ (n) collected by the N microphones within a first duration T ₁ ;

Acquiring the second audio data X ₂ (n) collected by the N microphones within the second duration T ₂ under the environment of playing Gaussian white noise data or frequency sweep signal data;

According to formula

Calculate the signal-to-noise ratio SNR, and ensure that the SNR is greater than the first threshold.

In a second aspect, a device for evaluating the consistency of a microphone array is provided, including:

An obtaining unit, configured to obtain N audio signals respectively collected by N microphones, where the N microphones form a microphone array, and N≥2;

A processing unit, configured to determine, according to the N audio signals, a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone, where the reference microphone is any one of the N microphones;

The processing unit is further configured to perform consistency evaluation on the N microphones according to the phase spectrum difference and/or the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.

In some possible implementation manners, the processing unit is specifically configured to:

In some possible implementation manners, the processing unit is further configured to:

Separately measure a distance difference between each of the N microphones except the reference microphone and the reference microphone to a sound source;

According to a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone, the corresponding phase spectrum difference values thereof are respectively calibrated.

According to formula

Respectively calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone,

An amplitude consistency between a corresponding microphone and the reference microphone is evaluated according to a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.

In some possible implementation manners, the N audio signals are signals collected in an environment in which the frequency sweep signal data is played.

In some possible implementation manners, the N audio signals are signals collected in an environment in which Gaussian white noise data or frequency-sweep signal data is played.

Framing each of the N audio signals to obtain K signal frames of equal length, K≥2;

Performing windowing processing on each of the K signal frames to obtain K windowed signal frames;

Perform FFT transformation on each of the K windowed signal frames to obtain K target signal frames;

Determine, according to the K target signal frames corresponding to each audio signal, a phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone and / or Power spectrum difference.

In some possible implementation manners, any two adjacent signal frames in the K signal frames overlap by R%, and R> 0.

In some possible implementation manners, the R is 25 or 50.

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

According to formula

Determining a phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone,

Represents the j-th target signal frame of the reference microphone,

Represents the j-th target signal frame of the i-th microphone,

Indicates the main frequency.

Determining a power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal;

According to formula

Calculating a power spectrum of each audio signal,

Among them, P_i(ω) represents the power spectrum of the i-th audio signal, Y_{i,j}(ω) represents the j-th target signal frame of the i-th audio signal, K represents the total number of frames of the signal collected by each microphone, and ω represents the frequency.

Determining the sampling frequency F_s and the number of FFT points N_fft used by the N microphones during audio signal collection, using a speaker to play Gaussian white noise data or frequency sweep signal data, and controlling the N microphones to acquire the N audio signals, where, if the data played by the speaker is frequency-sweep signal data, the frequency-sweep signal data is composed of M + 1 signal segments of equal length and different frequencies.

According to formula

Calculating the frequency of each signal in the M + 1 segment signal, and

Calculate each signal in the M + 1 segment signals according to the formula S_i(t) = sin(2πf_i t),

In some possible implementation manners, the frequency sweep signal data played by the speaker is written in the following vector form:

S(t) = [S_0(t), S_1(t), ..., S_M(t)]^T

where []^T represents the transpose of a vector or matrix.

In some possible implementation manners, the N microphones respectively collect N audio signals, the audio signal collected by the i-th microphone is represented as x_i(t), and x_i(t) can be written in the following vector form:

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

In some possible implementation manners, the obtaining unit is specifically configured to:

Placing the N microphones in a test room, where speakers are arranged in the test room, and the N microphones are located directly in front of the speakers;

Controlling the speaker to play Gaussian white noise data or frequency sweep signal data, and controlling the N microphones to collect the N audio signals, respectively.

In some possible implementation manners, the test room has an anechoic room environment, the speaker is an artificial mouth dedicated for audio testing, and the artificial mouth is calibrated with a standard microphone before use.

In some possible implementation manners, before the processing unit controls the speaker to play Gaussian white noise data or frequency sweep signal data, the obtaining unit is further configured to:

Acquiring second audio data X ₂ (n) collected by the N microphones within a second duration T ₂ in an environment where Gaussian white noise data or frequency sweep signal data is played;

Triggering the processing unit according to a formula

Calculate the signal-to-noise ratio SNR, and ensure that the SNR is greater than a first threshold.

In a third aspect, an apparatus for evaluating the consistency of a microphone array is provided, including:

Memory for storing programs and data; and

A processor, configured to call and run programs and data stored in the memory;

The apparatus is configured to perform the method in the first aspect described above or any possible implementation thereof.

In a fourth aspect, a system for assessing the consistency of a microphone array is provided, including:

N microphones forming a microphone array, N≥2;

At least one audio source;

A device comprising a memory for storing programs and data and a processor for calling and running the programs and data stored in the memory, the device being configured to perform the method in the first aspect or any possible implementation thereof.

According to a fifth aspect, a computer storage medium is provided, and the computer storage medium stores program code, and the program code may be used to instruct execution of the method in the first aspect or any possible implementation manner thereof.

According to a sixth aspect, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to execute the method in the first aspect or any possible implementation thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flowchart of a method for evaluating consistency of a microphone array according to an embodiment of the present application.

FIG. 2 is a schematic diagram of a test environment according to an embodiment of the present application.

FIG. 3 is a schematic diagram of calculating a phase spectrum difference according to an embodiment of the present application.

FIG. 4 is a schematic diagram of calculating a power spectrum difference according to an embodiment of the present application.

FIG. 5 is a schematic diagram of a phase spectrum difference between two microphones according to an embodiment of the present application.

FIG. 6 is a schematic diagram of a calibrated phase spectrum difference between two microphones according to an embodiment of the present application.

FIG. 7a is a schematic diagram of a power spectrum of two microphones according to an embodiment of the present application.

FIG. 7b is a schematic diagram of a power spectrum difference between two microphones according to an embodiment of the present application.

FIG. 8 is a schematic structural diagram of a device for evaluating consistency of a microphone array according to an embodiment of the present application.

FIG. 9 is a schematic structural diagram of an apparatus for evaluating consistency of a microphone array according to an embodiment of the present application.

FIG. 10 is a schematic structural diagram of a system for evaluating consistency of a microphone array according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application.

Microphone array refers to a system composed of a certain number of microphones (acoustic sensors) that is used to sample and process the spatial characteristics of the sound field. The difference between the phases of the sound waves received by the two microphones is used to filter the sound waves, which can eliminate the ambient background sound to the maximum, leaving only the required sound waves.

The multi-channel speech enhancement technology algorithm assumes that the target speech components of multiple microphones in the microphone array are highly correlated, and the target speech is not related to non-target interference, so the consistency between different microphones in the microphone array directly affects the algorithm performance.

Quantitative evaluation of microphone consistency can be used to guide the design of microphones and of microphone arrays. Microphone array circuits, electronic components, and acoustic structures all affect the consistency of microphones. When designing a microphone array, the effect of each factor on consistency can be tested item by item, so that the consistency of the microphone design meets the system requirements.

Quantitative evaluation of microphone consistency can also be used to compare the robustness of different algorithms: for the same speech enhancement performance, the lower an algorithm's requirement on the consistency indicators, the better its robustness.

In the embodiments of the present application, consistency is measured from two aspects, the amplitude spectrum difference and the phase spectrum difference, which makes the measurement objective and accurate. The quantitative consistency evaluation method can objectively guide the design of the microphone array and can also objectively compare the robustness of multi-channel speech enhancement algorithms.

Hereinafter, a method for evaluating the consistency of a microphone array according to an embodiment of the present application will be described in detail with reference to FIGS. 1 to 7.

FIG. 1 is a schematic flowchart of a method for evaluating consistency of a microphone array according to an embodiment of the present application. It should be understood that FIG. 1 shows steps or operations of the method, but these steps or operations are merely examples, and other operations or variations of each operation in FIG. 1 may be performed in the embodiment of the present application. The method may be executed by a device for evaluating the consistency of the microphone array, where the device for evaluating the consistency of the microphone array may be a mobile phone, a tablet computer, a portable computer, a Personal Digital Assistant (PDA), or the like.

S110: Obtain N audio signals collected by N microphones respectively, where the N microphones form a microphone array, and N≥2.

When performing consistency evaluation on N microphones, it is necessary to limit the environment in which the N microphones are located, that is, the N audio signals are collected in a special test environment.

Specifically, as shown in FIG. 2, a microphone array 201 composed of the N microphones is placed in a test room 202, and a speaker 203 is disposed in the test room 202. The microphone array 201 is located directly in front of the speaker 203. The microphone array 201 and the speaker 203 are connected to a control device 204, such as a computer. The control device 204 can control the speaker 203 to play specific audio data, for example, Gaussian white noise data or frequency-sweep signal data. At the same time, the control device 204 can obtain, from the microphone array 201, the N audio signals collected by the N microphones.

It should be noted that the microphone consistency evaluation requires that the signal-to-noise ratio of the collected audio signal is sufficiently high and the background noise is sufficiently weak, so the test environment is required to be in a quiet environment. In particular, an anechoic room environment is required in the test room 202. The speaker 203 requires a high signal-to-noise ratio and a flat frequency response curve. In particular, the speaker uses an artificial mouth dedicated for audio testing, and is calibrated with a standard microphone before use. The microphone array 201 is placed directly in front of the speaker 203, and in particular, it is required to be placed at a position calibrated by a standard microphone.

Optionally, before performing formal audio signal acquisition, it is also necessary to perform signal-to-noise ratio (SNR) detection on the above-mentioned test environment.

Specifically, in the test environment shown in FIG. 2, first, in a quiet environment (that is, with the speaker 203 turned off), the first audio data X₁(n) collected by the N microphones within a first duration T₁ is acquired; then, in an environment where Gaussian white noise data or frequency-sweep signal data is played (that is, the control device 204 controls the speaker 203 to play Gaussian white noise data or frequency-sweep signal data), the second audio data X₂(n) collected by the N microphones within a second duration T₂ is acquired; then, the SNR is calculated according to the following formula 1; finally, when the SNR is greater than a set threshold, the detection passes; otherwise, the detection fails.

In formula 1, T₁ represents the first duration, T₂ represents the second duration, X₁(n) represents the first audio data, and X₂(n) represents the second audio data.

It should be noted that if the detection fails, the above test environment needs to be adjusted or calibrated to eliminate factors that may affect the signal-to-noise ratio, until the SNR calculated according to the above formula 1 is greater than the set threshold.
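The SNR detection described above can be sketched as follows. Formula 1 is not reproduced in this text, so the sketch assumes a common definition: the ratio, in decibels, of the average power of the recording made during playback to the average power of the recording made in the quiet environment. The function names and the 30 dB threshold are illustrative assumptions, not values taken from the application.

```python
import math

def mean_power(samples):
    """Average power (mean of squared samples) of one recording."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(quiet, playback):
    """SNR in dB, assuming SNR = 10*log10(P_playback / P_quiet).

    quiet    -- X1(n): samples recorded with the speaker off (duration T1)
    playback -- X2(n): samples recorded while the test signal plays (duration T2)
    This definition is an assumption; the source's formula 1 may differ.
    """
    return 10.0 * math.log10(mean_power(playback) / mean_power(quiet))

def snr_check(quiet, playback, threshold_db=30.0):
    """Return True if the detection passes (SNR above the set threshold)."""
    return snr_db(quiet, playback) > threshold_db

# Synthetic example: weak background noise vs. a much stronger test tone.
quiet = [0.001 * math.sin(0.37 * n) for n in range(8000)]
playback = [0.5 * math.sin(0.05 * n) for n in range(16000)]
print(snr_check(quiet, playback))
```

Dividing each sum of squares by its own length keeps the comparison fair when the two durations T₁ and T₂ differ.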

Optionally, in the embodiment of the present application, acquiring audio signals by using the test environment shown in FIG. 2 described above may specifically include:

The sampling frequency F_s and the number of FFT points N_fft used by the N microphones during audio signal collection are determined; Gaussian white noise data or frequency-sweep signal data is played using a speaker; and the N microphones collect the N audio signals.

Optionally, the number of FFT points N_fft is an even number, typically a power of two such as 32, 64, 128, ..., 1024; the larger the number of points, the greater the amount of computation.

It should be noted that if the data played by the speaker is frequency-sweep signal data, the frequency-sweep signal data is composed of M + 1 signal segments of equal length and different frequencies.

Optionally, the frequency of each of the M + 1 signal segments can be calculated according to the following formula 2, and each of the M + 1 signal segments can be generated according to the following formula 3.

Where f_i is the frequency of the i-th signal segment, F_s is the sampling frequency, and N_fft is the number of FFT points.

S_i(t) = sin(2πf_i t)    Equation 3

Where S_i(t) represents the i-th signal segment, and f_i is the frequency of the i-th signal segment.

It should be noted that the length of the first segment signal S ₁ (t) is an integer multiple of the period T, and T = 1 / f ₁ .

Optionally, the frequency sweep signal data played by the speaker can be written in the following vector form:

S (t) = [S ₀ (t), S ₁ (t), ..., S _M (t)] ^T

[] ^T represents the transpose of a vector or matrix.
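As an illustration of how such a stepped frequency-sweep signal could be generated, the following sketch builds the M + 1 segments from formula 3. Formula 2 is not reproduced in this text, so the sketch assumes f_i = i · F_s / N_fft with M = N_fft / 2 (one tone per FFT bin); the function name `stepped_sweep` and the `periods` parameter are hypothetical.

```python
import math

def stepped_sweep(fs, n_fft, periods=4):
    """Build S(t) = [S_0(t), ..., S_M(t)] as a list of equal-length segments.

    Assumption (formula 2 is not reproduced in the text): f_i = i * fs / n_fft
    for i = 0..M with M = n_fft // 2, i.e. one tone per FFT bin.
    Each segment lasts `periods` periods of f_1, so its length is an
    integer multiple of T = 1 / f_1.
    """
    m = n_fft // 2
    seg_len = periods * n_fft  # one period of f_1 spans n_fft samples
    segments = []
    for i in range(m + 1):
        f_i = i * fs / n_fft
        # Formula 3: S_i(t) = sin(2*pi*f_i*t), sampled at t = n / fs
        segments.append([math.sin(2.0 * math.pi * f_i * n / fs) for n in range(seg_len)])
    return segments

sweep = stepped_sweep(fs=16000, n_fft=64)
print(len(sweep), len(sweep[0]))
```

Under this assumed frequency grid, the 0 Hz and Nyquist segments sample to (numerically) zero, so a practical sweep might skip or phase-shift those two segments.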

Optionally, the N microphones respectively acquire N audio signals, and the audio signal collected by the i-th microphone is represented as x _i (t), and x _i (t) can be written as the following vector form:

x _i (t) = [x _{i, 1} (t), x _{i, 2} (t), ..., x _{i, K} (t)] ^T

S120: Determine, according to the N audio signals, a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone, where the reference microphone is any one of the N microphones.

Optionally, in the embodiment of the present application, after the N audio signals are collected, the audio signals may be divided into frames, each frame may be windowed, and each windowed frame may be FFT-transformed to obtain the phase spectrum difference between different microphones.

Specifically, as shown in FIG. 3, it is assumed that the N audio signals are x₁(t), x₂(t), ..., x_N(t), and each of the N audio signals is divided into frames to obtain K signal frames of equal length, K ≥ 2. For example, the i-th audio signal is divided into K signal frames of equal length, written in the following vector form:

x _i (t) = [x _{i, 1} (t), x _{i, 2} (t), ..., x _{i, K} (t)] ^T

Among them, x _i (t) represents the i-th audio signal, K represents the total number of frames collected by each microphone, and [] ^T represents the transpose of a vector or a matrix;

Perform windowing on each of the K signal frames to obtain K windowed signal frames. For example, window the j-th frame x_{i,j}(t) of the i-th audio signal to obtain the j-th windowed signal frame of the i-th audio signal, y_{i,j}(t) = x_{i,j}(t) × Win, where Win is the window function;

Perform FFT transformation on each of the K windowed signal frames to obtain K target signal frames. For example, perform FFT transformation on the j-th windowed signal frame y_{i,j}(t) of the i-th audio signal to obtain the j-th target signal frame Y_{i,j}(ω) of the i-th audio signal;

Determine the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the K target signal frames corresponding to each audio signal. For example, assuming that the main frequency of the j-th target signal frame is ω_j, the phase spectrum difference between the i-th microphone and the reference microphone at the main frequency ω_j can be calculated according to the following formula 4.

Where Y_{1,j}(ω_j) represents the j-th target signal frame of the reference microphone, Y_{i,j}(ω_j) represents the j-th target signal frame of the i-th microphone, and ω_j indicates the main frequency.

It should be noted that in FIG. 3 above, the first microphone is used as the reference microphone, that is, the phase spectrum difference between each microphone except the first microphone and the first microphone is calculated separately. The first microphone corresponds to the audio signal x₁(t), the second microphone corresponds to the audio signal x₂(t), ..., and the N-th microphone corresponds to the audio signal x_N(t).

Optionally, K represents the total number of frames of signals received by each microphone.

In some possible implementation manners, any two adjacent signal frames in the K signal frames overlap by R%, and R> 0. For example, R is 25 or 50. In other words, any two adjacent signal frames in the K signal frames overlap by 25% or 50%.
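The framing, windowing, and FFT steps above can be sketched as follows. The Hann window is only one possible choice for Win, the frame length of 64 samples is an arbitrary illustrative value, and the naive DFT stands in for an FFT routine so that the sketch needs no external library; all function names are hypothetical.

```python
import cmath
import math

def frame_signal(x, frame_len, overlap_pct=50):
    """Split x into equal-length frames; adjacent frames overlap by overlap_pct %."""
    hop = frame_len - frame_len * overlap_pct // 100
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def hann(n):
    """Hann window of length n (one possible choice for Win)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]

def dft(frame):
    """Naive DFT over bins 0..n/2; adequate for short illustrative frames."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def target_frames(x, frame_len=64, overlap_pct=50):
    """x_i(t) -> windowed frames y_{i,j}(t) -> target frames Y_{i,j}(w)."""
    win = hann(frame_len)
    return [dft([s * w for s, w in zip(f, win)])
            for f in frame_signal(x, frame_len, overlap_pct)]

# A tone centred on bin 8 of a 64-point transform, 256 samples long.
x = [math.sin(2.0 * math.pi * 8 * n / 64) for n in range(256)]
Y = target_frames(x)
peak_bin = max(range(len(Y[0])), key=lambda k: abs(Y[0][k]))
print(len(Y), peak_bin)
```

With R = 50 the hop between frames is half the frame length, so a 256-sample signal yields 7 frames of 64 samples; with R = 25 the hop would be 48 samples.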

Optionally, in the embodiment of the present application, when the phase consistency evaluation is performed, the N audio signals are signals collected in an environment where the frequency sweep signal data is played. In other words, when calculating the above-mentioned phase spectrum difference value, the N audio signals are signals collected in an environment where the frequency sweep signal data is played.

Therefore, the phase difference at any frequency ω can be calculated, that is, the phase spectrum difference PDiff_i(ω) between the i-th microphone and the reference microphone.
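A minimal sketch of the phase spectrum difference at the main frequency of one frame is given below. Formula 4 is not reproduced in this text; because the claims describe it in terms of imag() and ln(), the sketch assumes the form imag(ln(Y_ref / Y_mic)), which equals the phase of the reference frame minus the phase of the other frame. The ordering of the two terms, the helper names, and the test values are assumptions.

```python
import cmath
import math

def bin_value(frame, k):
    """DFT coefficient of `frame` at bin k (single-bin DFT)."""
    n = len(frame)
    return sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))

def phase_diff(ref_frame, mic_frame, k):
    """Phase spectrum difference at the main-frequency bin k.

    Assumed form: imag(ln(Y_ref / Y_mic)), i.e. phase(Y_ref) - phase(Y_mic)
    wrapped to (-pi, pi]. The source's formula 4 may order the terms differently.
    """
    return cmath.log(bin_value(ref_frame, k) / bin_value(mic_frame, k)).imag

n, k, lag = 64, 4, 0.3  # lag: phase lag (radians) of the second microphone
ref = [math.sin(2.0 * math.pi * k * t / n) for t in range(n)]
mic = [math.sin(2.0 * math.pi * k * t / n - lag) for t in range(n)]
print(round(phase_diff(ref, mic, k), 6))
```

Taking the imaginary part of the complex logarithm of the ratio avoids computing two separate angles and wraps the result automatically.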

Optionally, in the embodiment of the present application, after the N audio signals are collected, the audio signals may be divided into frames, each frame may be windowed, and each windowed frame may be FFT-transformed. After the FFT transformation, the power spectrum of each signal frame is obtained, and the power spectrum difference between different microphones is then found.

Specifically, as shown in FIG. 4, it is assumed that the N audio signals are x₁(t), x₂(t), ..., x_N(t), and each of the N audio signals is divided into frames to obtain K signal frames of equal length, K ≥ 2. For example, the i-th audio signal is divided into K signal frames of equal length, written in the following vector form:

x _i (t) = [x _{i, 1} (t), x _{i, 2} (t), ..., x _{i, K} (t)] ^T

Among them, x _i (t) represents the i-th audio signal, K represents the total number of frames received by each microphone, and [] ^T represents the transpose of a vector or a matrix;

Determine the power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal, for example, calculate the power spectrum of the i-th audio signal according to the following formula 5;

Where P_i(ω) represents the power spectrum of the i-th audio signal, Y_{i,j}(ω) represents the j-th target signal frame of the i-th audio signal, ω represents the frequency, and K represents the total number of signal frames collected by each microphone.

According to the power spectrum of each audio signal, determine the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone. For example, calculate the power spectrum difference between the i-th microphone and the reference microphone according to the following formula 6.

PD_i(ω) = P₁(ω) − P_i(ω)    Equation 6

Where PD_i(ω) represents the power spectrum difference between the i-th microphone and the reference microphone, P₁(ω) represents the power spectrum of the reference microphone, and P_i(ω) represents the power spectrum of the i-th microphone.

It should be noted that in FIG. 4 above, the first microphone is used as the reference microphone, that is, the power spectrum difference between each microphone except the first microphone and the first microphone is calculated separately. The first microphone corresponds to the audio signal x₁(t), the second microphone corresponds to the audio signal x₂(t), ..., and the N-th microphone corresponds to the audio signal x_N(t).

Optionally, in the embodiment of the present application, when the amplitude consistency evaluation is performed, the N audio signals are signals collected in an environment in which Gaussian white noise data or frequency-sweep signal data is played. In other words, when calculating the above power spectrum difference value, the N audio signals are signals collected in an environment where Gaussian white noise data or frequency-sweep signal data is played.
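The power spectrum difference can be sketched as follows. Formula 5 is not reproduced in this text, so the sketch assumes the power spectrum is the squared magnitude of the target frames averaged over the K frames, and it expresses the result in decibels so that formula 6 yields a difference in dB; both choices, and the function names, are assumptions.

```python
import math

def power_spectrum_db(target_frames):
    """Average power spectrum over K target frames, in dB.

    Assumed form of formula 5: P_i(w) = (1/K) * sum_j |Y_{i,j}(w)|^2,
    expressed here in dB so that differences read directly in decibels.
    """
    k_frames = len(target_frames)
    n_bins = len(target_frames[0])
    return [10.0 * math.log10(sum(abs(f[b]) ** 2 for f in target_frames) / k_frames)
            for b in range(n_bins)]

def power_spectrum_diff(ref_frames, mic_frames):
    """Formula 6: PD_i(w) = P_1(w) - P_i(w), per frequency bin."""
    p_ref = power_spectrum_db(ref_frames)
    p_mic = power_spectrum_db(mic_frames)
    return [a - b for a, b in zip(p_ref, p_mic)]

# Two frames, three bins; the second microphone has half the amplitude.
ref_frames = [[2 + 0j, 4j, 1 - 1j], [2j, 4 + 0j, 1 + 1j]]
mic_frames = [[c / 2 for c in f] for f in ref_frames]
pd = power_spectrum_diff(ref_frames, mic_frames)
print([round(v, 2) for v in pd])
```

In this synthetic case every bin differs by 10·log10(4) dB, because halving the amplitude quarters the power. Bins with zero power would need a small floor before taking the logarithm.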

S130: Perform a consistency evaluation on the N microphones according to a phase spectrum difference and / or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.

Specifically, the phase spectrum difference value is used for phase consistency evaluation, and the power spectrum difference value is used for amplitude consistency evaluation.

Optionally, in the embodiment of the present application, the phase consistency between each of the N microphones except the reference microphone and the reference microphone is evaluated according to the phase spectrum difference between that microphone and the reference microphone.

It should be noted that since the distances between different microphones and the sound source are difficult to be completely consistent when collecting data, there is a fixed phase difference between the different microphones.

Optionally, in the embodiment of the present application, the phase spectrum difference value may be calibrated by using a fixed phase difference.

Specifically, for each of the N microphones except the reference microphone, the difference between its distance to the sound source and the reference microphone's distance to the sound source is measured. For example, d_i represents the difference between the distances from the i-th microphone and from the reference microphone to the sound source;

According to the measured distance differences, a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone is calculated. For example, the fixed phase difference between the i-th microphone and the reference microphone can be calculated according to the following formula 7.

It should be noted that the fixed phase difference is linear in the signal frequency. Therefore, linear fitting can be used to determine the fixed phase difference.

For example, let the fixed phase difference between microphone 1 and the reference microphone be A, and the phase spectrum difference between microphone 1 and the reference microphone be B. As shown in FIG. 5, the straight line represents the fitted fixed phase difference between microphone 1 and the reference microphone. Overall, as the frequency increases from 0 Hz to 8000 Hz, the phase spectrum difference between microphone 1 and the reference microphone decreases from 0 radians to −2 radians. After calibration, the phase spectrum difference between microphone 1 and the reference microphone is C, as shown by the curve in FIG. 6, where C = B − A. As the frequency increases from 0 Hz to 8000 Hz, the calibrated phase spectrum difference between microphone 1 and the reference microphone fluctuates around 0 radians within ±0.5 radians.

It can be seen from the comparison between FIG. 5 and FIG. 6 that the fixed phase difference greatly affects the phase spectrum difference between the two microphones. Therefore, when phase consistency evaluation is performed on the two microphones, it is necessary to eliminate the effect of the fixed phase difference.
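The calibration C = B − A can be sketched in two ways: from a measured distance difference, using the expression 2πωd_i/c given in the claims, or from a straight line fitted to the measured phase spectrum difference. The least-squares fit through the origin, the speed-of-sound value, the function names, and the synthetic data below are all illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

def fixed_phase_from_distance(freqs_hz, d_m, c=SPEED_OF_SOUND):
    """Fixed phase difference 2*pi*f*d/c for a measured distance difference d."""
    return [2.0 * math.pi * f * d_m / c for f in freqs_hz]

def fixed_phase_from_fit(freqs_hz, phase_diff):
    """Least-squares line through the origin fitted to the measured difference B."""
    slope = (sum(f * p for f, p in zip(freqs_hz, phase_diff))
             / sum(f * f for f in freqs_hz))
    return [slope * f for f in freqs_hz]

def calibrate(phase_diff, fixed_phase):
    """C = B - A: remove the fixed phase difference A from the measured B."""
    return [b - a for b, a in zip(phase_diff, fixed_phase)]

freqs = [250.0 * i for i in range(1, 33)]            # 250 Hz .. 8000 Hz
ripple = [0.05 * math.sin(i) for i in range(1, 33)]  # synthetic mismatch
measured = [-2.0 * f / 8000.0 + r for f, r in zip(freqs, ripple)]        # B
calibrated = calibrate(measured, fixed_phase_from_fit(freqs, measured))  # C
print(max(abs(v) for v in calibrated))
```

The fit passes through the origin because a pure propagation delay contributes no phase at 0 Hz. Measured phase differences that exceed ±π would need unwrapping before fitting.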

Optionally, in the embodiment of the present application, the amplitude consistency between each of the N microphones except the reference microphone and the reference microphone is evaluated according to the power spectrum difference between that microphone and the reference microphone.

For example, as shown in FIG. 7, FIG. 7a shows the power spectrum of microphone 1 and the power spectrum of the reference microphone, and FIG. 7b shows the power spectrum difference between microphone 1 and the reference microphone. The power spectra of the two microphones differ little, and the magnitude of the power spectrum difference stays below 1 decibel (dB).

Optionally, in the embodiments of the present application, the influence of factors such as the circuit, electronic components, and acoustic structure of the microphone array on microphone consistency can be tested item by item, so as to guide the calibration and design of the microphone array and to evaluate the robustness of the multi-channel enhancement algorithm.

Therefore, in the embodiment of the present application, the phase spectrum difference and/or power spectrum difference between each microphone and the reference microphone may be determined according to the N audio signals collected by the N microphones, so as to perform consistency evaluation on the N microphones, thereby eliminating the impact of inter-microphone consistency on multi-channel speech enhancement algorithms and improving the user experience.

Optionally, as shown in FIG. 8, an embodiment of the present application provides a device 800 for evaluating the consistency of a microphone array, including:

The obtaining unit 810 is configured to obtain N audio signals collected by N microphones respectively, where the N microphones form a microphone array, and N≥2;

A processing unit 820, configured to determine, according to the N audio signals, a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone, where the reference microphone is any one of the N microphones;

The processing unit 820 is further configured to perform consistency evaluation on the N microphones according to the phase spectrum difference and/or the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.

Optionally, the processing unit 820 is specifically configured to:

Optionally, the processing unit 820 is further configured to:

Optionally, the processing unit 820 is specifically configured to:

According to formula

Optionally, the processing unit 820 is specifically configured to:

Optionally, the N audio signals are signals collected in an environment where the frequency sweep signal data is played.

Optionally, the N audio signals are signals collected in an environment where Gaussian white noise data or frequency-sweep signal data is played.

Optionally, the frequency sweep signal is any one of a linear frequency sweep signal, a logarithmic frequency sweep signal, a linear step frequency sweep signal, and a logarithmic step frequency sweep signal.

Optionally, the processing unit 820 is specifically configured to:

Optionally, any two adjacent signal frames of the K signal frames overlap by R%, and R> 0.

Optionally, the R is 25 or 50.

Optionally, frame the i-th audio signal to obtain K signal frames of equal length and write the following vector forms:

x _i (t) = [x _{i, 1} (t), x _{i, 2} (t), ..., x _{i, K} (t)] ^T

Optionally, the processing unit 820 is specifically configured to:

According to formula

Where Y_{1,j}(ω_j) represents the j-th target signal frame of the reference microphone, Y_{i,j}(ω_j) represents the j-th target signal frame of the i-th microphone, and ω_j indicates the main frequency.

Optionally, the processing unit 820 is specifically configured to:

According to formula

Calculating a power spectrum of each audio signal,

Optionally, the processing unit 820 is specifically configured to:

Optionally, the processing unit 820 is further configured to:

According to formula

Calculating the frequency of each signal in the M + 1 segment signal, and

Optionally, the frequency sweep signal data played by the speaker is written in the following vector form:

S (t) = [S ₀ (t), S ₁ (t), ..., S _M (t)] ^T

[] ^T represents the transpose of a vector or matrix.

Optionally, the N microphones respectively acquire N audio signals, where the audio signal collected by the i-th microphone is represented as x _i (t), and x _i (t) can be written in the following vector form:

x _i (t) = [x _{i, 1} (t), x _{i, 2} (t), ..., x _{i, K} (t)] ^T

Optionally, the obtaining unit 810 is specifically configured to:

Optionally, the test room has an anechoic room environment, the speaker is an artificial mouth dedicated for audio testing, and the artificial mouth is calibrated with a standard microphone before use.

Optionally, before the processing unit 820 controls the speaker to play Gaussian white noise data or frequency sweep signal data, the obtaining unit 810 is further configured to:

Trigger the processing unit 820 according to a formula

Optionally, as shown in FIG. 9, an embodiment of the present application provides a device 900 for evaluating consistency of a microphone array, including:

A memory 910 for storing programs and data; and

A processor 920, configured to call and run a program and data stored in the memory;

The device 900 is configured to perform the methods shown in FIGS. 1 to 7 described above.

Optionally, as shown in FIG. 10, an embodiment of the present application provides a system 1000 for evaluating consistency of a microphone array, including:

N microphones constituting the microphone array 1010, N≥2;

At least one audio source 1020;

A device 1030, including a memory 1031 for storing programs and data and a processor 1032 for calling and running the programs and data stored in the memory, where the device 1030 is configured to perform the methods shown in FIGS. 1 to 7 described above.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

Those of ordinary skill in the art may realize that the units and algorithm steps of each example described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professional technicians can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working processes of the systems, devices, and units described above can refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are only schematic. For example, the division of the units is only a logical function division; in actual implementation, there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of software functional unit.

When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of the present application. The foregoing storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above is only a specific implementation of this application, but the scope of protection of this application is not limited to this. Any change or replacement that a person skilled in the art can easily think of within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

A method for evaluating the consistency of a microphone array, comprising:

Acquiring N audio signals respectively collected by N microphones, the N microphones forming a microphone array, N≥2;

Determine, according to the N audio signals, a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone, where the reference microphone is any one of the N microphones;

Perform consistency evaluation on the N microphones according to a phase spectrum difference and / or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
The method according to claim 1, wherein performing consistency evaluation on the N microphones according to the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone comprises:

According to the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone, the phase consistency between the corresponding microphone and the reference microphone is evaluated.
The method according to claim 2, further comprising:

Separately measure a distance difference between each of the N microphones except the reference microphone and the reference microphone to a sound source;

Calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone according to the measured distance difference;

According to a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone, the corresponding phase spectrum difference values thereof are respectively calibrated.
The method according to claim 3, wherein calculating, according to the measured distance differences, a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone comprises:

According to formula
Respectively calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone,

Wherein Y_i(ω) represents the frequency spectrum of the i-th microphone, Y₁(ω) represents the frequency spectrum of the reference microphone, ω represents the frequency, d_i represents the difference between the distances from the i-th microphone and from the reference microphone to the sound source, c denotes the speed of sound, and 2πωd_i/c represents the fixed phase difference between the i-th microphone and the reference microphone.
The method according to any one of claims 1 to 4, wherein performing consistency evaluation on the N microphones according to the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone comprises:

An amplitude consistency between a corresponding microphone and the reference microphone is evaluated according to a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
The method according to any one of claims 2 to 4, wherein the N audio signals are signals collected in an environment in which frequency-sweep signal data is played.
The method according to claim 5, wherein the N audio signals are signals collected in an environment where Gaussian white noise data or frequency-sweep signal data is played.
The method according to claim 6 or 7, wherein the frequency sweep signal is any one of a linear frequency sweep signal, a logarithmic frequency sweep signal, a linear step frequency sweep signal, and a logarithmic step frequency sweep signal.
The method according to any one of claims 1 to 8, wherein determining, according to the N audio signals, the phase spectrum difference and/or the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone comprises:

Framing each of the N audio signals to obtain K signal frames of equal length, K≥2;

Performing windowing processing on each of the K signal frames to obtain K windowed signal frames;

Perform FFT transformation on each of the K windowed signal frames to obtain K target signal frames;

Determine, according to the K target signal frames corresponding to each audio signal, the phase spectrum difference and/or the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
The method according to claim 9, wherein any two adjacent signal frames of the K signal frames overlap by R%, and R> 0.
The method according to claim 10, wherein the R is 25 or 50.
The method according to any one of claims 9 to 11, wherein the i-th audio signal is divided into K signal frames of equal length, written in the following vector form:

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

Wherein x_i(t) represents the i-th audio signal, K represents the total number of frames collected by each microphone, and []^T represents the transpose of a vector or a matrix.
The method according to any one of claims 9 to 12, wherein determining, according to the K target signal frames corresponding to each audio signal, the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone comprises:

According to formula
Determining a phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone,

Wherein imag() means taking the imaginary part, ln() means taking the natural logarithm, PDiff_i(ω_j) represents the phase spectrum difference between the i-th microphone and the reference microphone, Y_{1,j}(ω_j) represents the j-th target signal frame of the reference microphone, Y_{i,j}(ω_j) represents the j-th target signal frame of the i-th microphone, and ω_j indicates the main frequency.
The method according to any one of claims 9 to 13, wherein determining, according to the K target signal frames corresponding to each audio signal, the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone comprises:

Determining a power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal;

According to the power spectrum of each audio signal, a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined.
The method according to claim 14, wherein determining the power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal comprises:

According to formula
Calculating a power spectrum of each audio signal,

Wherein P_i(ω) represents the power spectrum of the i-th audio signal, Y_{i,j}(ω) represents the j-th target signal frame of the i-th audio signal, K represents the total number of signal frames collected by each microphone, and ω represents the frequency.
The method according to claim 14 or 15, wherein determining, according to the power spectrum of each audio signal, the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone comprises:

Calculating the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the formula PD_i(ω) = P₁(ω) − P_i(ω),

Wherein PD_i(ω) represents the power spectrum difference between the i-th microphone and the reference microphone, P₁(ω) represents the power spectrum of the reference microphone, and P_i(ω) represents the power spectrum of the i-th microphone.
The method according to any one of claims 1 to 16, wherein the acquiring N audio signals collected by each of the N microphones comprises:

Determining the sampling frequency F_s and the number of FFT points N_fft used by the N microphones during audio signal collection, playing Gaussian white noise data or frequency-sweep signal data using a speaker, and collecting the N audio signals with the N microphones, wherein, if the data played by the speaker is frequency-sweep signal data, the frequency-sweep signal data is composed of M + 1 signal segments of equal length and different frequencies.
The method according to claim 17, wherein:

According to formula
Calculating the frequency of each signal in the M + 1 segment signal, and

Calculating each of the M + 1 signal segments according to the formula S_i(t) = sin(2πf_i t),

Wherein f_i represents the frequency of the i-th signal segment, F_s represents the sampling frequency, N_fft represents the number of FFT points, S_i(t) represents the i-th signal segment, and the length of S₁(t) is an integer multiple of the period T, T = 1/f₁.
The method according to claim 18, wherein the frequency sweep signal data played by the speaker is written in the following vector form:

S(t) = [S_0(t), S_1(t), ..., S_M(t)]^T

Where S(t) represents the frequency sweep signal data played by the speaker, S_i(t) represents the i-th segment signal, and [·]^T represents the transpose of a vector or matrix.
The method according to any one of claims 1 to 19, wherein the N microphones respectively collect the N audio signals, the audio signal collected by the i-th microphone is represented as x_i(t), and x_i(t) can be written as the following vector:

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

Where x_i(t) represents the audio signal collected by the i-th microphone, K represents the total number of frames of the signal collected by each microphone, and [·]^T represents the transpose of a vector or matrix.
The method according to any one of claims 1 to 20, wherein the acquiring N audio signals respectively collected by N microphones comprises:

Placing the N microphones in a test room, where speakers are arranged in the test room, and the N microphones are located directly in front of the speakers;

Controlling the speaker to play Gaussian white noise data or frequency sweep signal data, and controlling the N microphones to collect the N audio signals, respectively.
The method according to claim 21, wherein the test room has an anechoic room environment, the speaker is an artificial mouth dedicated for audio testing, and the artificial mouth is calibrated with a standard microphone before use.
The method according to claim 21 or 22, wherein before controlling the speaker to play Gaussian white noise data or frequency sweep signal data, the method further comprises:

In a quiet environment, acquiring first audio data X 1 (n) collected by the N microphones within a first duration T 1 ;

Acquiring second audio data X 2 (n) collected by the N microphones within a second duration T 2 in an environment where Gaussian white noise data or frequency sweep signal data is played;

According to formula
Calculating the signal-to-noise ratio SNR, and ensuring that the SNR is greater than a first threshold.
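The pre-test SNR check can be sketched as below. The claim's SNR formula is not reproduced in this text, so the sketch assumes the common definition SNR = 10·log10(mean power of X_2 / mean power of X_1), with X_1 recorded in quiet and X_2 recorded during playback. Some definitions subtract the noise power from the numerator first. The threshold value and all names are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a recording."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(quiet, playing):
    """ASSUMPTION: SNR = 10*log10(mean power of X2 / mean power of X1)."""
    return 10.0 * math.log10(mean_power(playing) / mean_power(quiet))


# Hypothetical first threshold; the source does not give a value.
SNR_THRESHOLD_DB = 30.0

x1 = [0.01, -0.01, 0.01, -0.01]  # recorded in a quiet environment
x2 = [1.0, -1.0, 1.0, -1.0]      # recorded while the test signal plays

snr = snr_db(x1, x2)
snr_ok = snr > SNR_THRESHOLD_DB
```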
A device for evaluating the consistency of a microphone array, comprising:

An obtaining unit, configured to obtain N audio signals respectively collected by N microphones, where the N microphones form a microphone array, and N≥2;

A processing unit, configured to determine, according to the N audio signals, a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone, where the reference microphone is any one of the N microphones;

The processing unit is further configured to perform consistency evaluation on the N microphones according to the phase spectrum difference and/or the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
The device according to claim 24, wherein the processing unit is specifically configured to:

According to the phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone, the phase consistency between the corresponding microphone and the reference microphone is evaluated.
The device according to claim 25, wherein the processing unit is further configured to:

Separately measure a distance difference between each of the N microphones except the reference microphone and the reference microphone to a sound source;

Calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone according to the measured distance difference;

According to a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone, the corresponding phase spectrum difference values thereof are respectively calibrated.
The device according to claim 26, wherein the processing unit is specifically configured to:

According to formula
Respectively calculating a fixed phase difference between each of the N microphones except the reference microphone and the reference microphone,

Where Y_i(ω) represents the frequency spectrum of the i-th microphone, Y_1(ω) represents the frequency spectrum of the reference microphone, ω represents the frequency, d_i represents the difference between the distance from the i-th microphone to the sound source and the distance from the reference microphone to the sound source, c represents the speed of sound, and 2πωd_i/c represents the fixed phase difference between the i-th microphone and the reference microphone.
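The fixed phase difference 2πωd_i/c named in the claim can be computed directly, as in the sketch below. The speed of sound value, the sign convention used when calibrating a measured phase spectrum difference (here the fixed term is subtracted), and the wrapping step are assumptions; the function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; ASSUMPTION: air at roughly 20 degrees Celsius


def fixed_phase_difference(freq_hz, distance_diff_m, c=SPEED_OF_SOUND):
    """Fixed phase difference 2*pi*w*d_i/c caused by the path-length difference."""
    return 2.0 * math.pi * freq_hz * distance_diff_m / c


def wrap_phase(phi):
    """Wrap an angle in radians into the range [-pi, pi]."""
    return math.atan2(math.sin(phi), math.cos(phi))


def calibrate(phase_diff, freq_hz, distance_diff_m, c=SPEED_OF_SOUND):
    """Remove the geometric phase term from a measured phase spectrum difference.

    ASSUMPTION: the fixed term is subtracted; the correct sign depends on how
    d_i and the phase spectrum difference are defined.
    """
    fixed = fixed_phase_difference(freq_hz, distance_diff_m, c)
    return wrap_phase(phase_diff - fixed)


# A 3.43 cm path difference at 1 kHz gives 0.2*pi radians.
phi_fixed = fixed_phase_difference(1000.0, 0.0343)
residual = calibrate(phi_fixed, 1000.0, 0.0343)
```

After calibration, the residual reflects the mismatch between the microphones rather than their placement relative to the sound source.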
The device according to any one of claims 24 to 27, wherein the processing unit is specifically configured to:

An amplitude consistency between a corresponding microphone and the reference microphone is evaluated according to a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
The device according to any one of claims 25 to 27, wherein the N audio signals are signals collected in an environment in which the frequency sweep signal data is played.
The device according to claim 28, wherein the N audio signals are signals collected in an environment in which Gaussian white noise data or frequency-sweep signal data is played.
The device according to claim 29 or 30, wherein the frequency sweep signal is any one of a linear frequency sweep signal, a logarithmic frequency sweep signal, a linear step frequency sweep signal, and a logarithmic step frequency sweep signal.
The device according to any one of claims 24 to 31, wherein the processing unit is specifically configured to:

Framing each of the N audio signals to obtain K signal frames of equal length, K≥2;

Performing windowing processing on each of the K signal frames to obtain K windowed signal frames;

Perform FFT transformation on each of the K windowed signal frames to obtain K target signal frames;

Determining, according to the K target signal frames corresponding to each audio signal, a phase spectrum difference and/or a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone.
The device according to claim 32, wherein any two adjacent signal frames of the K signal frames overlap by R%, and R> 0.
The device according to claim 33, wherein the R is 25 or 50.
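The framing, windowing, and FFT steps can be sketched with the Python standard library alone. The claims do not name a window function, so the Hann window here is an assumption, and a naive DFT stands in for an FFT to keep the example self-contained (it gives the same result but is slower). All function names are hypothetical.

```python
import cmath
import math


def split_frames(signal, frame_len, overlap_percent):
    """Split a signal into equal-length frames that overlap by R percent."""
    hop = frame_len - frame_len * overlap_percent // 100
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def hann(frame_len):
    """Periodic Hann window. ASSUMPTION: the source does not name a window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
            for n in range(frame_len)]


def dft(frame):
    """Naive DFT used in place of an FFT for a dependency-free sketch."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def target_frames(signal, frame_len, overlap_percent):
    """Frame, window, and transform: returns the K target signal frames."""
    window = hann(frame_len)
    return [dft([x * w for x, w in zip(frame, window)])
            for frame in split_frames(signal, frame_len, overlap_percent)]


# A sine that completes 2 cycles per 8 samples, framed with R = 50.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
frames = split_frames(signal, 8, 50)
spectra = target_frames(signal, 8, 50)
peak_bin = max(range(5), key=lambda k: abs(spectra[0][k]))
```

With R = 50 and a frame length of 8, the hop is 4 samples, so a 32-sample signal yields K = 7 frames.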
The device according to any one of claims 32 to 34, wherein the i-th audio signal is framed to obtain K signal frames of equal length and written into the following vector form:

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

Where x_i(t) represents the i-th audio signal, K represents the total number of frames collected by each microphone, and [·]^T represents the transpose of a vector or a matrix.
The device according to any one of claims 32 to 35, wherein the processing unit is specifically configured to:

According to formula
Determining a phase spectrum difference between each of the N microphones except the reference microphone and the reference microphone,

Among them, imag () means taking the imaginary part, ln () means taking the natural logarithm,
Represents the phase spectrum difference between the i-th microphone and the reference microphone,
Represents the j-th target signal frame of the reference microphone,
Represents the j-th target signal frame of the i-th microphone,
Indicates the main frequency.
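One way to read the description above is sketched below. The source states that the formula uses imag() and ln() on the j-th target signal frames of the reference microphone and the i-th microphone at the main frequency, but the formula itself is not reproduced in this text. The sketch therefore assumes the ratio is taken as reference over i-th microphone and that the result is averaged over the K frames; both choices and all names are hypothetical.

```python
import cmath


def phase_spectrum_difference(ref_frames, mic_frames, bin_index):
    """Mean over frames of imag(ln(Y_1j / Y_ij)) at one frequency bin.

    ASSUMPTIONS: the ratio direction (reference over i-th microphone) and the
    averaging over the K frames are guesses; the original formula is missing.
    """
    k = len(ref_frames)
    total = 0.0
    for y_ref, y_mic in zip(ref_frames, mic_frames):
        # imag(ln(z)) is the phase angle of z in radians.
        total += cmath.log(y_ref[bin_index] / y_mic[bin_index]).imag
    return total / k


# Made-up spectra: K = 2 frames, main frequency at bin 1.
ref = [[0j, cmath.exp(0.5j)], [0j, 2 * cmath.exp(1.0j)]]
mic = [[0j, cmath.exp(0.2j)], [0j, 3 * cmath.exp(0.7j)]]

phase_diff = phase_spectrum_difference(ref, mic, bin_index=1)
```

Because imag(ln(z)) depends only on the angle of z, amplitude differences between the two microphones do not affect this value.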
The device according to any one of claims 32 to 36, wherein the processing unit is specifically configured to:

Determining a power spectrum of each audio signal according to the K target signal frames corresponding to each audio signal;

According to the power spectrum of each audio signal, a power spectrum difference between each of the N microphones except the reference microphone and the reference microphone is determined.
The device according to claim 37, wherein the processing unit is specifically configured to:

According to formula
Calculating a power spectrum of each audio signal,

Where P_i(ω) represents the power spectrum of the i-th audio signal, Y_{i,j}(ω) represents the j-th target signal frame in the i-th audio signal, K represents the total number of frames of the signal collected by each microphone, and ω represents frequency.
The device according to claim 37 or 38, wherein the processing unit is specifically configured to:

Calculating the power spectrum difference between each of the N microphones except the reference microphone and the reference microphone according to the formula PD_i(ω) = P_1(ω) - P_i(ω),

Where PD_i(ω) represents the power spectrum difference between the i-th microphone and the reference microphone, P_1(ω) represents the power spectrum of the reference microphone, and P_i(ω) represents the power spectrum of the i-th microphone.
The device according to any one of claims 24 to 39, wherein the processing unit is specifically configured to:

Determining the sampling frequency F_s and the number of FFT points N_fft used by the N microphones during audio signal collection, playing Gaussian white noise data or frequency sweep signal data through a speaker, and controlling the N microphones to collect the N audio signals, wherein, if the data played by the speaker is frequency sweep signal data, the frequency sweep signal data is composed of M + 1 segments of equal length and different frequencies.
The device according to claim 40, wherein the processing unit is further configured to:

According to formula
Calculating the frequency of each signal in the M + 1 segment signal, and

Calculating each signal in the M + 1 segment signals according to the formula S_i(t) = sin(2πf_i t),

Where f_i represents the frequency of the i-th segment signal, F_s represents the sampling frequency, N_fft represents the number of FFT points, S_i(t) represents the i-th segment signal, and the length of S_1(t) is an integer multiple of the period T, T = 1/f_1.
The device according to claim 41, wherein the frequency sweep signal data played by the speaker is written in the following vector form:

S(t) = [S_0(t), S_1(t), ..., S_M(t)]^T

Where S(t) represents the frequency sweep signal data played by the speaker, S_i(t) represents the i-th segment signal, and [·]^T represents the transpose of a vector or matrix.
The device according to any one of claims 24 to 42, wherein the N microphones respectively collect the N audio signals, the audio signal collected by the i-th microphone is represented as x_i(t), and x_i(t) can be written as the following vector:

x_i(t) = [x_{i,1}(t), x_{i,2}(t), ..., x_{i,K}(t)]^T

Where x_i(t) represents the audio signal collected by the i-th microphone, K represents the total number of frames of the signal collected by each microphone, and [·]^T represents the transpose of a vector or matrix.
The device according to any one of claims 24 to 43, wherein the obtaining unit is specifically configured to:

Placing the N microphones in a test room, where speakers are arranged in the test room, and the N microphones are located directly in front of the speakers;

Controlling the speaker to play Gaussian white noise data or frequency sweep signal data, and controlling the N microphones to collect the N audio signals, respectively.
The device according to claim 44, wherein the test room has an anechoic room environment, the speaker is an artificial mouth for audio testing, and the artificial mouth is calibrated with a standard microphone before use.
The device according to claim 44 or 45, wherein before the processing unit controls the speaker to play Gaussian white noise data or frequency sweep signal data, the obtaining unit is further configured to:

In a quiet environment, acquiring first audio data X 1 (n) collected by the N microphones within a first duration T 1 ;

Acquiring second audio data X 2 (n) collected by the N microphones within a second duration T 2 in an environment where Gaussian white noise data or frequency sweep signal data is played;

Triggering the processing unit according to a formula
To calculate the signal-to-noise ratio SNR, and to ensure that the SNR is greater than a first threshold.
A device for evaluating the consistency of a microphone array, comprising:

Memory for storing programs and data; and

A processor, configured to call and run programs and data stored in the memory;

The device is configured to perform the method according to any one of claims 1 to 23.
A system for evaluating the consistency of a microphone array, comprising:

N microphones forming a microphone array, N≥2;

At least one audio source;

An apparatus including a memory for storing programs and data and a processor for calling and running the programs and data stored in the memory, the apparatus being configured to perform the method according to any one of claims 1 to 23.