WO2012079459A1

WO2012079459A1 - Method and apparatus for audio mixing of multiple microphones

Info

Publication number: WO2012079459A1
Application number: PCT/CN2011/083165
Authority: WO
Inventors: 彭远疆
Original assignee: 中兴通讯股份有限公司
Priority date: 2010-12-17
Filing date: 2011-11-29
Publication date: 2012-06-21
Also published as: CN102056053A; CN102056053B

Abstract

The present invention relates to the field of audio information processing. Disclosed are a method and apparatus for the audio mixing of multiple microphones. The method comprises: computing statistics on the signal strength of each input channel during the current period; selecting, on that basis, at least two input channels with higher signal strength for voice detection; identifying each tested input channel that contains voice as a voice input channel; and, when there are at least two voice input channels, determining the signal similarity between the signals of the voice input channels, controlling the gating of the voice input channels on that basis, and outputting a weighted mix of the signals of the gated voice input channels. The method and apparatus of the present invention can reduce the misjudgment rate of input channel gating and improve the audio quality after mixing.

Description

Multi-microphone mixing method and device

The present invention relates to the field of audio information processing, and in particular, to a multi-microphone mixing method and apparatus.

Background

In a video conferencing system, microphones are needed to collect the voice of the local speaker. The sound is audio-encoded and transmitted to the far end, decoded in the remote system, and output to a loudspeaker for playback. To reduce the effects of room reverberation and background noise, directional microphones are generally used in video conferencing systems to collect sound (i.e., pickup). Because a directional microphone picks up sound best in the direction it faces, multiple directional microphones are generally required to collect the voices of speakers in different orientations, so that pickup remains good when speakers talk from different directions. This pickup method is called distributed pickup. Figure 1 shows a schematic diagram of a distributed pickup system: a typical conference room layout in a video conferencing system, in which each participant uses a separate microphone as the pickup device. In distributed pickup, to prevent crosstalk between the speech signals collected by adjacent microphones, each microphone is required to be close to one or several speakers, and the spacing between microphones is generally greater than the distance between each microphone and its corresponding speaker.

Sometimes, to reduce the total number of microphones, an array microphone is used in the video conferencing system for centralized pickup. Figure 2 is a schematic diagram of the centralized pickup mode: a centralized sound collection scheme in which all participants share one array microphone as the sound pickup device. An array microphone assembles a plurality of sound pickup units into a single device in a certain layout. Array microphones are mostly disk-shaped or polygonal, and each sound pickup unit is generally placed on the outer edge of the device and pointed outward. The spacing between adjacent sound pickup units in an array microphone is typically much smaller than the distance from the array microphone to the speaker. When a single array microphone cannot effectively cover the entire room, multiple array microphones can be used to pick up the sound region by region. Figure 3 is a schematic diagram of centralized pickup using multiple array microphones: in a larger room, each array microphone is responsible for the pickup of a single area.

Considering codec complexity, transmission bandwidth, system compatibility, and so on, the multi-channel signals collected by multiple microphones (sound pickup units) must be mixed into a single-channel or two-channel stereo signal, which is then encoded (single-channel/stereo coding) and transmitted. Multi-microphone mixing technology is evaluated mainly by the signal-to-noise ratio, sound quality, and stability of the output voice after mixing. For stereo systems, the fidelity of the sound image (phase) information is also an important measure.

Traditional video conferencing systems use a simple mixing method based on signal strength (short-time energy or signal amplitude) to mix and output the speech signals collected by multiple microphones. Typical mixing methods are:

1. Direct mixing method: The input signals of all channels are simply added together and output to a single channel. The drawbacks are that the background noise increases after mixing, the signal-to-noise ratio (SNR) drops significantly, and heavy reverberation makes the speech blurred and the sound quality poor.

2. First-microphone-priority mixing method: The signal strength of each input channel is counted, and the voice channel with the highest signal strength is used directly as the output channel. This method does not reduce the signal-to-noise ratio, but when two or more people in different positions speak at the same time, channel switching is clearly audible, and the volume of the voice and of the background noise changes noticeably.

3. Dynamic weighted mixing method: The signal strength of each voice channel is counted and sorted by magnitude. Only the channels with the highest signal strength are weighted and mixed; the other channels do not participate in the mix. This method alleviates channel switching when speakers at different positions talk at the same time, but because only the strength of the signal is used, a single talker can also open two or more physically adjacent channels, which lowers the signal-to-noise ratio, aggravates reverberation, and blurs the speech.
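The three traditional methods listed above can be sketched as follows. This is only an illustration under stated assumptions: signal strength is taken as short-time energy, each channel is one frame of samples, and the strength-proportional weights in the third method are an assumption, since the text does not say how the weights are chosen. All function names are hypothetical.

```python
def strength(frame):
    """Short-time energy of one frame (sum of squared samples)."""
    return sum(x * x for x in frame)


def direct_mix(frames):
    """Method 1: add all channels sample by sample."""
    return [sum(samples) for samples in zip(*frames)]


def loudest_channel_mix(frames):
    """Method 2: output only the channel with the highest strength."""
    return list(max(frames, key=strength))


def dynamic_weighted_mix(frames, top_k=2):
    """Method 3: mix only the top_k strongest channels.

    Weights proportional to strength are an assumption made for this sketch.
    """
    ranked = sorted(frames, key=strength, reverse=True)[:top_k]
    total = sum(strength(f) for f in ranked)
    if total == 0:
        return [0.0] * len(frames[0])
    weights = [strength(f) / total for f in ranked]
    return [sum(w * x for w, x in zip(weights, samples))
            for samples in zip(*ranked)]
```

None of these functions looks at anything except strength, which is why a loud reflection or a single talker between two microphones can open the wrong channels.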

The above mixing methods all decide channel gating on the basis of signal strength. In many application scenarios they perform poorly and are prone to misjudgment:

1) In a typical array microphone application, as shown in Figure 2, when a speaker far from the array microphone talks, the difference between the signal strengths collected by the individual pickup units of the array microphone is very small, which leads to misjudgment during mixing.

2) Even in distributed microphone applications, sound is reflected by desktops, whiteboards, walls, and so on, as shown in Figure 4, a schematic diagram of distributed pickup with a reflector. A method based on signal strength is then prone to misjudgment: a channel with strong reflection/reverberation is erroneously gated, which seriously degrades the voice quality after mixing.

In stereo/multi-channel systems, in addition to the energy mixing of the different channels, the mixed signal must also preserve the orientation (position) information of the original sound source. Microphones in different positions often correspond to different sound source positions. Incorrect gating can cause a sudden change in the position of the sound image, which disturbs the far-end listener considerably.

Summary of the Invention

The invention provides a multi-microphone mixing method and device, which can reduce the misjudgment rate of input channel gating and improve the audio quality after mixing.

In order to achieve the above object, the technical solution of the present invention is achieved as follows:

A multi-microphone mixing method, including:

Counting the signal strength of each input channel in the current time period, selecting on that basis at least two input channels with high signal strength for voice detection, and determining each detected input channel with voice as a voice input channel;

When there are at least two voice input channels, determining the signal similarity between the signals of the voice input channels, controlling the gating of the voice input channels accordingly, and outputting a weighted mix of the signals of the gated voice input channels.

The method further includes: when there is only one voice input channel, directly gating that voice input channel.

The method for controlling the gating of the voice input channels according to each signal similarity is: for any two voice input channels, if the signal similarity of the two voice input channels is less than and/or equal to a preset first threshold, both input channels are gated.

The method further includes:

If there are two voice input channels whose signal similarity is greater than and/or equal to the preset first threshold, controlling the gating of the two voice input channels according to the signal strength of the two voice input channels and the delay of the two signals corresponding to the signal similarity.

The process of controlling the gating of the two voice input channels according to the signal strength of the two voice input channels and the delay of the two signals corresponding to the signal similarity includes:

Gating one of the two voice input channels when the signal strength difference of the two voice input channels is greater than a set value;

When the signal strength difference of the two voice input channels is less than the set value, determining the signal similarity of the two voice input channels, taking the relative delay of the two signals corresponding to the maximum value of the similarity function, and gating one or both of the two voice input channels according to that relative delay;

When the signal strength difference of the two voice input channels is equal to the set value, either gating one of the two voice input channels, or determining the signal similarity of the two voice input channels, taking the relative delay of the two signals corresponding to the maximum value of the similarity function, and gating one or both of the two voice input channels according to that relative delay;

The process of gating the voice input channels according to the relative delay includes: if the relative delay of the two signals is greater than a set duration, gating one of the two voice input channels; if the relative delay of the two signals is less than the set duration, gating both voice input channels; if the relative delay of the two signals is equal to the set duration, gating either one or both of the two voice input channels.

The method for gating one of the two voice input channels is:

Gating whichever of the two voice input channels has the higher signal strength.

The process of determining the signal similarity between the signals of the voice input channels includes: performing band-pass filtering preprocessing on the signal of each voice input channel; and, for every pair of preprocessed signals, determining the signal similarity using a normalized cross-correlation function or an average amplitude difference function.
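As an illustration of the similarity measure named above, the following sketch computes the peak of a normalized cross-correlation function over a range of candidate delays and returns both the peak and the delay at which it occurs. It is a plain-Python sketch, not the patented implementation: the band-pass preprocessing stage is omitted, both frames are assumed to have the same length, and the name `nccf_peak` and the `max_lag` parameter are hypothetical.

```python
import math


def nccf_peak(a, b, max_lag):
    """Return (peak, lag) of the normalized cross-correlation of a and b.

    A positive lag means that b is a delayed copy of a, in samples.
    """
    n = len(a)
    best_value, best_lag = -1.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Overlapping index range for the pairs a[i], b[i + lag].
        start = max(0, -lag)
        stop = min(n, n - lag)
        xs = a[start:stop]
        ys = b[start + lag:stop + lag]
        energy = math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))
        if energy == 0.0:
            continue  # no signal in the overlap, skip this lag
        value = sum(x * y for x, y in zip(xs, ys)) / energy
        if value > best_value:
            best_value, best_lag = value, lag
    return best_value, best_lag
```

A peak close to 1 indicates that both channels carry the same source, and the returned lag is the relative delay used in the gating decision.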

A multi-microphone mixing device comprising:

a statistical module, configured to count the signal strength of each input channel in the current time period and, on that basis, select at least two input channels with high signal strength for voice detection;

a similarity determination module, configured to determine each input channel selected by the statistical module in which voice is detected as a voice input channel and, when there are at least two voice input channels, determine the signal similarity between the signals of the voice input channels;

a gating module, configured to control the gating of the voice input channels according to each signal similarity determined by the similarity determination module;

and a mixing module, configured to output a weighted mix of the signals of the voice input channels gated by the gating module.

The gating module is further configured to gate the voice input channel directly when there is only one voice input channel.

The gating module is configured to: for any two voice input channels, if their signal similarity is less than or equal to a preset first threshold, gate both input channels.

The multi-microphone mixing method provided by the embodiment of the present invention considers both the signal strength of each input channel and the signal similarity between channels when making the gating decision for the input channels, so that the probability of erroneous channel gating is greatly reduced, which greatly improves the voice quality after mixing.

Brief Description of the Drawings

Figure 1 is a schematic diagram of a distributed sound pickup mode;

Figure 2 is a schematic diagram of a centralized pickup method;

Figure 3 is a schematic diagram of a centralized pickup method using a plurality of array microphones;

Figure 4 is a schematic diagram of a distributed sound collection method containing a reflector;

Figure 5 is a flowchart of a multi-microphone mixing method according to an embodiment of the present invention;

Figure 6 is a flowchart of a multi-microphone mixing method according to Embodiment 1 of the present invention;

Figure 7 is a flowchart of a multi-microphone mixing method according to Embodiment 2 of the present invention;

Figure 8 is a structural diagram of a multi-microphone mixing device according to an embodiment of the present invention.

Detailed Description

An embodiment of the present invention provides a multi-microphone mixing method, as shown in FIG. 5, including:

S501. Count the signal strength of each input channel in the current time period, and select at least two input channels with the highest signal strength for voice detection;

In this step, the input channels with the highest signal strength, at least two of them, are selected for voice detection. Selecting too many input channels complicates the subsequent mixing calculation, so generally 2 to 4 are selected.

S502. Determine each detected input channel with voice as a voice input channel, and count the voice input channels. If there are at least two voice input channels, perform step S503; if there is one, perform step S505; if there are none, perform step S506;

S503. When the number of voice input channels is at least two, determine the signal similarity between the signals of the voice input channels;

When there are only two voice input channels, there is only one signal similarity. When there are more than two voice input channels, there is signal similarity between every two voice input channels.

S504. Control a strobe of the voice input channel according to each signal similarity, and perform weighted mixing output on the signal of the strobed voice input channel.

Specifically:

1) If the signal similarity of two voice input channels is less than or equal to the first threshold, the two input channels are pre-gated;

When the signal similarity of the two voice input channels is equal to the first threshold, step 2) may be performed instead.

If the signal similarity of every pair of voice input channels is less than or equal to the first threshold, all channels are pre-gated, and the pre-gated channels can be gated directly.

If the signal similarity of some pair of voice input channels is greater than the first threshold, step 2) is further performed on the basis of step 1) to ensure the accuracy of the mixing. Of course, if the similarity of every pair of signals is greater than the first threshold, step 1) may be omitted and only step 2) performed.

2) If the signal similarity of two voice input channels is greater than or equal to the first threshold, the gating of the two voice input channels is controlled according to their signal strength and the delay of the two signals corresponding to the signal similarity.

The signal similarity is the extreme value of the similarity function of the two signals (the maximum value of the normalized cross-correlation function or the minimum value of the average amplitude difference function), and the delay of the two signals corresponding to the signal similarity is the relative delay of the two signals at which the similarity function reaches that extreme value.

Controlling the gating of the two voice input channels according to their signal strength and the delay of the two signals corresponding to the signal similarity is specifically:

When the signal strength difference of the two voice input channels is greater than or equal to the set value, one of the two input channels is gated; in the "equal" case, the following step may be performed instead.

When the signal strength difference of the two voice input channels is less than the set value, the delay of the two signals corresponding to the signal similarity of the two voice input channels is determined. If the delay of the two signals is greater than the set duration, one of the two voice input channels is gated. If the delay of the two signals is less than or equal to the set duration, both voice input channels are gated.

As an example of the above steps, consider three voice input channels A, B, and C. When the similarity of A and B is less than the first threshold, the similarity of A and C is less than the first threshold, and the similarity of B and C is greater than the first threshold, then A, B, and C are all pre-gated according to the A, B similarity and the A, C similarity, and one of B and C is then gated according to the B, C similarity. The result is that A and C (or A and B) are gated.

The method for determining the similarity in step S503 is specifically:

Performing band-pass filtering preprocessing on the signal of each voice input channel;

Determining the signal similarity for every pair of preprocessed signals using a normalized cross-correlation function. When the signal similarity is determined using the normalized cross-correlation function, the signal similarity is the maximum value of the normalized cross-correlation function.

Alternatively, the average amplitude difference function is used to determine the similarity, specifically:

Performing band-pass filtering preprocessing on the signal of each voice input channel;

Determining the signal similarity for every pair of preprocessed signals using the average amplitude difference function. When the signal similarity is determined by the average amplitude difference function, the signal similarity is the minimum value of the average amplitude difference function, and the condition that the signal similarity is greater than the first threshold corresponds to the minimum value of the average amplitude difference function being smaller than a set second threshold.

S505. When there is only one voice input channel, that voice input channel is gated directly and output.

S506. When the number of voice input channels is 0, the input channels are gated using the previous gating result.

When the number of voice input channels is 0, the gating discrimination is not performed again for this period; the previous gating of the input channels is reused directly for gating and output.

According to the method of the embodiment of the present invention, both the signal strength of each input channel and the signal similarity between channels are considered when the input channels are gated, so that the probability of erroneous channel gating is greatly reduced, thereby greatly improving the sound quality after mixing.
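The control flow of steps S501 to S506 can be sketched as a small dispatcher. The sketch below is illustrative only: the voice detector `has_voice`, the similarity-based decision `decide_pairwise`, and the stored `previous_gating` are hypothetical stand-ins supplied by the caller, because the text does not prescribe a particular voice detection method.

```python
def select_candidates(strengths, count=2):
    """S501: indices of the `count` strongest input channels."""
    order = sorted(range(len(strengths)),
                   key=lambda i: strengths[i], reverse=True)
    return order[:count]


def gate_channels(strengths, has_voice, decide_pairwise,
                  previous_gating, count=2):
    """S502-S506: return the list of gated channel indices for this period.

    has_voice(i) -> bool and decide_pairwise(indices) -> list are
    caller-supplied callables (assumptions of this sketch).
    """
    candidates = select_candidates(strengths, count)
    voiced = [i for i in candidates if has_voice(i)]
    if not voiced:
        # S506: no voice detected, reuse the previous gating result.
        return list(previous_gating)
    if len(voiced) == 1:
        # S505: a single voice channel is gated directly.
        return voiced
    # S503-S504: at least two voice channels, decide by similarity.
    return decide_pairwise(voiced)
```

Keeping the previous gating when no voice is detected avoids reopening channels on background noise alone.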

The method of the embodiment of the present invention will be described in detail below with reference to the accompanying drawings.

Embodiment 1

As shown in FIG. 6, a flowchart of a multi-microphone mixing method according to Embodiment 1 of the present invention includes:

S601. Count the signal strength of each input channel in the current time period, and select the two input channels A and B with the largest signal strength for voice detection;

S602. When neither input channel A nor B has voice, the previous determination result is used directly;

S603. When input channel A has voice and B has no voice, that is, A is a voice input channel, input channel A is gated directly;

S604. When both input channels A and B have voice, that is, both A and B are voice input channels, the signals of channel A and channel B are each preprocessed by a band-pass filter of 80 Hz to 800 Hz, the normalized cross-correlation function (NCCF) of the two preprocessed signals is calculated, and the maximum value of the NCCF is determined, together with the signal delay τ between A and B corresponding to that maximum;

The definition and calculation method of NCCF are well known in the art and will not be described herein.

For each delay, determine the NCCF value to find the maximum value of the NCCF value and determine the delay corresponding to the maximum value;

S605. Determine whether the maximum value is less than or equal to the set threshold value VI; if yes, execute step S608; if no, execute step S606;

S606. When the maximum value is greater than the set threshold value VI, determine the signal strength difference between channels A and B, and judge whether it is less than or equal to the set value; if yes, execute step S607; if no, execute step S609;

When the maximum value of the normalized cross-correlation function of the signals of channels A and B is greater than the set threshold, it can be considered that only one local speaker is talking; the gating of channels A and B is then controlled further according to the signal strength difference of the two channels and the delay.

Of course, in this step, when the difference value is equal to the set value, step S609 may also be performed. The signal strength difference between channels A and B can be measured in various ways: directly as the signal strength of A minus the signal strength of B, as the ratio of the two signal strengths (smaller / larger), or as the difference between the two divided by the signal strength of either one. A signal strength difference smaller than the set value indicates that the signal strengths of the two channels do not differ much.

S607. Determine whether the delay τ corresponding to the maximum value is less than or equal to the set duration; if yes, execute step S608; if no, execute step S609;

Of course, when the delay is equal to the set duration, step S609 can also be performed.

S608. Channels A and B are both gated;

When the maximum value is less than or equal to the set threshold value VI, channels A and B are both gated: in this case it is considered that different people are speaking at the same time in front of the microphones corresponding to channels A and B, so both channels should be opened, and output = A * 0.5 + B * 0.5;

Of course, when the maximum value is equal to the set threshold value VI, step S606 can also be performed. When the maximum value is greater than or equal to the set threshold value VI, a single speaker is talking in front of the A and B microphones; when the signal strength difference of channels A and B is small and the signal delay corresponding to the maximum NCCF value is very small, it can be considered that the distances from the speaker to the microphones corresponding to the two channels are very close, so channels A and B can be opened at the same time, and output = A * 0.5 + B * 0.5;

S609. Gate one of channels A and B;

One of channels A and B is gated; preferably, the channel with the higher signal strength is gated.

In step S606, when the maximum value is greater than or equal to the set threshold value VI, step S609 may also be performed directly to gate one of channels A and B, which likewise completes the mixing. Of course, the judgment of the signal strength difference in step S606, the judgment of the signal delay in step S607, and the execution of step S608 make the gating decision more precise and further improve the quality of the multi-microphone mix.
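The decision tree of steps S605 to S609 can be sketched as follows, assuming the NCCF maximum and its delay have already been computed for the current period. The function and parameter names are hypothetical, the strength difference is taken as the difference divided by the larger strength (one of the options the text allows), and the boundary cases follow the "less than or equal" choices of this embodiment.

```python
def gate_pair_nccf(nccf_max, delay, strength_a, strength_b,
                   similarity_threshold, strength_limit, max_delay):
    """Return the gated channels as a tuple: ("A",), ("B",) or ("A", "B")."""
    both = ("A", "B")
    stronger = ("A",) if strength_a >= strength_b else ("B",)
    # S605: low similarity suggests two different talkers, open both.
    if nccf_max <= similarity_threshold:
        return both                                   # S608
    # S606: one talker; compare strengths by relative difference.
    larger = max(strength_a, strength_b)
    difference = abs(strength_a - strength_b) / larger if larger > 0 else 0.0
    if difference > strength_limit:
        return stronger                               # S609
    # S607: similar strengths; a small delay means the talker is
    # about equally far from both microphones.
    if abs(delay) <= max_delay:
        return both                                   # S608
    return stronger                                   # S609


def mix(gated, frame_a, frame_b):
    """Mono output with fixed 0.5/0.5 weights when both channels are gated."""
    if gated == ("A", "B"):
        return [0.5 * a + 0.5 * b for a, b in zip(frame_a, frame_b)]
    return list(frame_a if gated == ("A",) else frame_b)
```

For example, with a threshold of 0.6, a pair whose NCCF maximum is 0.9 but whose strengths differ by 80 percent resolves to the stronger channel alone, which is the case a pure strength ranking would get wrong by opening both.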

Embodiment 2

FIG. 7 is a flowchart of a multi-microphone mixing method according to Embodiment 2 of the present invention.

S701. Count the signal strength of each input channel in the current time period, and select the two input channels A and B with the largest signal strength for voice detection;

S702. When neither input channel A nor B has voice, the previous discrimination result is used directly;

S703. When input channel A has voice and B has no voice, input channel A is gated directly;

S704. When both input channels A and B have voice, the signals of channel A and channel B are each preprocessed by a band-pass filter of 80 Hz to 800 Hz, the average amplitude difference function (AMDF) of the two preprocessed signals is calculated, and the minimum value of the AMDF is determined, together with the signal delay τ between A and B corresponding to that minimum;

The definition and calculation method of AMDF are well known in the art and will not be described herein.

For each delay τ, determine the AMDF value, find the minimum of the AMDF values, and determine the delay corresponding to the minimum;

S705. Determine whether the minimum value is greater than or equal to the set threshold; if yes, perform step S708; if no, perform step S706;

S706. When the minimum value is less than the set threshold, determine the signal strength difference between channels A and B, and judge whether it is less than or equal to the set value; if yes, perform step S707; if no, perform step S709;

When the minimum value of the average amplitude difference function of channels A and B is less than the set threshold, it can be considered that only one local speaker is talking; the gating of channels A and B is then controlled further according to the signal strength difference of the two channels and the delay.

S707. Determine whether the delay τ corresponding to the minimum value is less than the set duration; if yes, execute step S708; if no, execute step S709;

S708. Channels A and B are both gated;

When the minimum AMDF value is greater than or equal to the set threshold, channels A and B are both gated: in this case it is considered that different people are speaking at the same time in front of the microphones corresponding to channels A and B, so both channels should be opened, and output = A*0.5+B*0.5. When the minimum value is less than the set threshold, it is considered that a single speaker is talking in front of the A and B microphones; when the signal strength difference of channels A and B is small and the signal delay corresponding to the minimum AMDF value is very small, it can be considered that the distances from the speaker to the microphones corresponding to the two channels are very close, so channels A and B can be opened at the same time, and output = A*0.5+B*0.5;

S709. Gate one of channels A and B;

One of channels A and B is gated; preferably, the channel with the higher signal strength is gated.

In step S706, when the minimum AMDF value is less than the set threshold, step S709 may also be performed directly to gate one of channels A and B, which likewise completes the mixing. Of course, the judgment of the signal strength difference in step S706, the judgment of the signal delay in step S707, and the execution of step S708 make the gating decision more accurate and further improve the quality of the multi-microphone mix.
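Embodiment 2 differs from Embodiment 1 mainly in the similarity function and in the direction of the threshold comparison. The sketch below shows one common form of the average amplitude difference between two channels over a range of candidate delays; it is an assumption-laden illustration (equal-length frames, no band-pass stage, hypothetical names `amdf_minimum` and `single_talker`), not the patent's own definition, which the text leaves to the known art.

```python
def amdf_minimum(a, b, max_lag):
    """Return (minimum, lag) of the average amplitude difference of a and b.

    A positive lag means that b is a delayed copy of a, in samples.
    """
    n = len(a)
    best_value, best_lag = float("inf"), 0
    for lag in range(-max_lag, max_lag + 1):
        # Overlapping index range for the pairs a[i], b[i + lag].
        start = max(0, -lag)
        stop = min(n, n - lag)
        if stop <= start:
            continue
        total = sum(abs(a[i] - b[i + lag]) for i in range(start, stop))
        value = total / (stop - start)
        if value < best_value:
            best_value, best_lag = value, lag
    return best_value, best_lag


def single_talker(amdf_min, amdf_threshold):
    """S705: a small AMDF minimum means the two signals are highly similar."""
    return amdf_min < amdf_threshold
```

Unlike a normalized cross-correlation, this form of the AMDF is not normalized, so its minimum scales with the signal level and the threshold has to be chosen with the expected amplitudes in mind.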

It should be noted that the present invention does not limit the specific method for evaluating the signal similarity between different channels and the maximum number of channels allowed to be simultaneously opened, nor does it limit the evaluation of the mixing weight between different channels. As in the first embodiment, the specific method for judging the signal similarity between different channels is to use the NCCF function, allowing the maximum number of channels to be simultaneously opened to be 2, and the mixing weight between channels is fixed to (0.5, 0.5) in the mono system. In stereo systems, the mixing weights of different channels are related to the spatial position of their corresponding microphones, and will not be analyzed in detail here.

The embodiment of the present invention further provides a multi-microphone mixing device, as shown in FIG. 8, comprising:

The statistical module 81 is configured to count the signal strength of each input channel in the current time period, and select at least two input channels with the largest signal strength for voice detection;

The similarity determination module 82 is configured to determine each detected input channel with voice as a voice input channel, and determine the signal similarity between the signals of the voice input channels when there are at least two voice input channels;

The gating module 83 is configured to control the gating of the voice input channels according to each signal similarity;

The mixing module 84 is configured to output a weighted mix of the signals of the gated voice input channels.

Preferably, the gating module 83 is further configured to gate the voice input channel directly when there is only one voice input channel.

Preferably, the gating module 83 is specifically configured to: for any two voice input channels, if the signal similarity of the two voice input channels is less than or equal to the first threshold, gate both input channels.

It should be noted that the operation of the signal strength difference value of the two voice input channels can be flexible, for example: when the signal strength difference values of the two voice input channels are greater than the set value, the two voice input channels are controlled. a strobe; when the signal strength difference value of the two voice input channels is less than the set value, determining the signal similarity of the two voice input channels, and taking the relative delay of the two signals corresponding to the maximum value of the similarity function, Subsequent operations related to the relative delay are performed.

Moreover, when the signal strength difference value of the two voice input channels is equal to the set value, the subsequent specific operation may be the same as the subsequent operation when the signal strength difference value is greater than the set value (ie, controlling the two voice inputs A strobe in the channel may also be the same as the subsequent operation when the signal strength difference value is less than the set value (ie, determining the signal similarity of the two voice input channels, and taking the two values corresponding to the maximum value of the similarity function) The relative delay of the signals, followed by subsequent operations involving the relative delays).

In addition, the operations related to the relative delay described above may also be flexible. For example: if the relative delay of the two signals is greater than the set duration, one of the two voice input channels is controlled to be gated; if the relative delay of the two signals is less than the set duration, both voice input channels are controlled to be gated.

Moreover, when the relative delay of the two signals is equal to the set duration, the subsequent operation may be the same as that performed when the relative delay of the two signals is greater than the set duration (i.e., one of the two voice input channels is controlled to be gated), or the same as that performed when the relative delay of the two signals is less than the set duration (i.e., both voice input channels are controlled to be gated).
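The pairwise gating decision described above can be illustrated with a minimal Python sketch. It is one possible arrangement under stated assumptions, not the implementation of the gating module 83: the function and parameter names (`normalized_xcorr_peak`, `gate_pair`, `first_threshold`, `strength_gap`, `max_delay`, `max_lag`) are hypothetical, the relative delay is measured in samples, the signal strength is taken to be a single number per channel, the stronger channel is assumed to be the one gated when only one channel is selected, and each equality case is resolved in one of the two ways the description permits.

```python
import math


def normalized_xcorr_peak(x, y, max_lag):
    """Return (peak similarity, lag at the peak) of the normalized
    cross-correlation of x and y over lags in [-max_lag, max_lag]."""
    energy = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if energy == 0.0:
        return 0.0, 0  # at least one channel is silent: no similarity
    best_sim, best_lag = -1.0, 0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(len(x)):
            j = i + lag
            if 0 <= j < len(y):
                acc += x[i] * y[j]
        sim = acc / energy
        if sim > best_sim:
            best_sim, best_lag = sim, lag
    return best_sim, best_lag


def gate_pair(x, y, strength_x, strength_y,
              first_threshold, strength_gap, max_delay, max_lag):
    """Return the set of gated channels: 0 stands for x, 1 for y."""
    sim, lag = normalized_xcorr_peak(x, y, max_lag)

    # Similarity at or below the first threshold: gate both channels.
    if sim <= first_threshold:
        return {0, 1}

    # Assumption: when only one channel is gated, pick the stronger one.
    stronger = 0 if strength_x >= strength_y else 1

    # Strength difference greater than the set value: gate one channel.
    # (The equal case is handled here like the "less than" case.)
    if abs(strength_x - strength_y) > strength_gap:
        return {stronger}

    # Relative delay greater than the set duration: gate one channel;
    # otherwise gate both. (The equal case is handled like "less than".)
    if abs(lag) > max_delay:
        return {stronger}
    return {0, 1}
```

For example, if `y` is a copy of `x` delayed by two samples and the two strengths are close, `gate_pair` gates only one channel when `max_delay` is 1 and gates both channels when `max_delay` is 3.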

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention.

Industrial applicability

The invention relates to the field of audio information processing, and discloses a multi-microphone mixing method and device. The signal strength of each input channel in the current time period is counted, and on this basis at least two input channels with higher signal strength are selected for voice detection; each detected input channel that contains voice is determined to be a voice input channel. When there are at least two voice input channels, the signal similarity between the signals of the voice input channels is determined, the gating of the voice input channels is controlled accordingly, and the signals of the gated voice input channels are weighted, mixed and output. The method and device of the invention can reduce the misjudgment rate of input channel gating and improve the audio quality after mixing.

Claims

1. A multi-microphone mixing method, comprising:

2. The method according to claim 1, further comprising: directly controlling the voice input channel gating when there is only one voice input channel.

3. The method of claim 1, wherein the method of controlling the gating of the voice input channel according to each signal similarity is:

For any two voice input channels, if the signal similarity of the two voice input channels is less than or equal to the preset first threshold, both input channels are controlled to be strobed.

4. The method according to claim 1 or 3, further comprising:

5. The method according to claim 4, wherein the process of controlling the gating of the two voice input channels according to the signal strengths of the two voice input channels and the delay of the two signals corresponding to the signal similarity comprises:

when the signal strength difference of the two voice input channels is less than the set value, determining the signal similarity of the two voice input channels, taking the relative delay of the two signals corresponding to the maximum value of the similarity function, and controlling one or both of the two voice input channels to be gated accordingly;

6. The method of claim 5, wherein the method of controlling one of the two voice input channels to be gated is:

7. The method according to claim 1, wherein the process of determining signal similarity between signals of each voice input channel comprises:

Among all channels after preprocessing, the signal similarity is determined for every pair of signals using a normalized cross-correlation function or an average amplitude difference function.

8. A multi-microphone mixing device comprising:

a gating module, configured to control the gating of the voice input channels according to each signal similarity determined by the similarity determination module;

The device of claim 8, wherein the gating module is further configured to directly control the voice input channel gating when there is only one voice input channel.

The device according to claim 8, wherein the gating module is configured to: if any two voice input channels have a signal similarity less than or equal to a preset first threshold, control both input channels to be gated.