CN104219013A - Method for multi-microphone sound mixing of video conference system - Google Patents

Method for multi-microphone sound mixing of video conference system

Info

Publication number
CN104219013A
CN104219013A
Authority
CN
China
Prior art keywords
audio mixing
sound
frame
formula
microphone channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410441776.7A
Other languages
Chinese (zh)
Other versions
CN104219013B (en)
Inventor
毕永建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yealink Network Technology Co Ltd
Original Assignee
Xiamen Yealink Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yealink Network Technology Co Ltd
Priority to CN201410441776.7A
Publication of CN104219013A
Application granted
Publication of CN104219013B
Expired - Fee Related

Landscapes

  • Circuit For Audible Band Transducer (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides a method for multi-microphone sound mixing in a video conference system. The channel with the strongest echo is identified and excluded using the delay between the sound captured by each microphone channel and the sound played by the loudspeaker, together with the energy of the captured sound. The sounds captured by the best remaining channels are then selected as the mixing sources, and the mixing weight of each selected channel is determined from the frame energies captured by the candidate channels and the correlation between them. Finally, the mix is output according to the selected channels and the determined mixing weights. With this method, the sound captured by the optimal channels serves as the mixing source, and because all the algorithms are adaptive, the optimal mixing source can be adjusted dynamically, ensuring both the pickup range and the sound quality.

Description

A multi-microphone sound mixing method for a video conferencing system
Technical field
The present invention relates to an audio mixing method, and more particularly to a multi-microphone audio mixing method for video conferencing.
Background art
Early video conferencing systems generally used either directional or omnidirectional microphones for sound pickup. Omnidirectional microphones have a wide pickup range but suffer from severe reverberation and poor sound quality, while directional microphones offer good sound quality but a narrow pickup range. To resolve this trade-off between sound quality and pickup range, current video conferencing systems have begun to adopt multi-microphone schemes, combining several directional microphones to ensure both the pickup range and the sound quality. How to combine the data captured by multiple microphones to produce the best mix (mixing multiple voice channels into a single output channel) has therefore become the key problem a multi-microphone scheme must solve.
Adaptation here means that, during processing and analysis, the processing method, processing order, parameters, boundary conditions or constraints are adjusted automatically according to the characteristics of the data being processed, so that they match the statistical distribution and structure of that data and the best processing result is obtained.
Summary of the invention
The technical problem addressed by the present invention is to provide a multi-microphone sound mixing method for a video conferencing system that first identifies and excludes the channel with the strongest echo, then selects the sound captured by the optimal channels as the mixing source, and finally determines dynamically changing mixing weights and outputs the mix. Selecting the sound captured by the optimal channels as the mixing source ensures both the pickup range and the sound quality.
The present invention is achieved as a multi-microphone sound mixing method for a video conferencing system, the method comprising a sound collection process, a delay estimation process, a maximum echo channel decision process, a mixing source selection process and a mix output process:
The sound collection process is: the loudspeaker plays sound and a plurality of microphone channels capture sound; the sound played by the loudspeaker and the sound captured by each microphone channel are stored frame by frame, and the playback frame energy of the loudspeaker and the capture frame energy of each microphone channel are obtained, each frame having a fixed length and containing a number of sample points;
The delay estimation process: obtain, at each capture, the delay between the sound captured by each microphone channel and the sound played by the loudspeaker;
The maximum echo channel decision process: using the delay obtained for each microphone channel, the sound captured by each microphone channel is aligned frame by frame with the sound played by the loudspeaker; the maximum echo channel, i.e. the channel containing the largest echo component, is then determined from the average ratio of each subband energy of the aligned microphone channel to the corresponding subband energy of the loudspeaker; the maximum echo channel is excluded when selecting the mixing source, so as to ensure that the echo of the conference system is cancelled;
The mixing source selection process: after the maximum echo channel has been excluded, the sounds captured by the two remaining microphone channels with the largest capture frame energy are selected as the mixing sources, and the per-frame mixing weights of the two mixing sources are calculated;
The mix output process: the output value of each point of each mixed frame is obtained from the per-frame mixing weights of the two mixing sources, and the mix is finally output.
Further, the delay is computed as follows:
Step 21: the capture frames of each microphone channel are correlated one by one with the corresponding playback frames of the loudspeaker to obtain the delay of each microphone channel, as given by formula (1) below:
cor(n) = Σ_{i=0}^{N-1} cap(i+n)·play(i)        (1)
In formula (1), N is the number of samples corresponding to the maximum sound delay, n is a positive integer from 1 to N, i indexes the sample points of each frame, cap(i+n) is the value of the (i+n)-th point of the microphone channel's capture frame, play(i) is the value of the i-th point of the loudspeaker's playback frame, and cor(n) is the linear correlation between the microphone channel and the loudspeaker playback channel;
Step 22: each n in formula (1) corresponds to one linear correlation value; the n corresponding to the maximum linear correlation value is converted into a time t, giving the delay of the microphone channel, the conversion being as shown in formula (2) below:
t = n / sample rate        (2)
Further, the subband energy refers to the energy of each frequency bin obtained by applying an FFT to each frame of sound; the maximum echo channel decision process is as shown in formulas (3) and (4) below:
η(i, j) = P_cap(i, j) / P_play(j)        (3)
η_mean(i) = (1/M) Σ_j η(i, j)        (4)
In formulas (3) and (4), i is the channel index, j is the subband index, M is the total number of subbands, P_cap(i, j) is the j-th subband energy of the i-th microphone channel, P_play(j) is the j-th subband energy of the loudspeaker, η(i, j) is the attenuation ratio of the j-th subband energy of the i-th microphone channel, and η_mean(i) is the average subband energy attenuation ratio of the i-th microphone channel;
The values obtained from formula (3) are substituted into formula (4) to obtain the average subband energy attenuation ratio of each microphone channel; the attenuation ratios of all microphone channels are then sorted, and the microphone channel with the largest average subband energy attenuation ratio is the maximum echo channel.
Further, of the two microphone channels with the largest capture frame energy, the sound captured by one (chosen arbitrarily) is designated mixing source mix0 and the sound captured by the other is designated mixing source mix1; the per-frame mixing weights of the two mixing sources mix0 and mix1 are computed as shown in formulas (5) and (6):
λ_mix0 = γ·λ_mix0 + (1-γ)·(P_cap(mix0) / (P_cap(mix0) + P_cap(mix1)))        (5)
λ_mix1 = 1 - λ_mix0        (6)
In formulas (5) and (6), λ_mix0 and λ_mix1 are the per-frame mixing weights of mixing sources mix0 and mix1 respectively, γ is the forgetting factor, a fixed constant, P_cap(mix0) is the frame energy of mixing source mix0, and P_cap(mix1) is the frame energy of mixing source mix1;
The per-frame mixing weight of mixing source mix0 obtained from formula (5) is substituted into formula (6) to obtain the per-frame mixing weight of mixing source mix1; the mixing weights of both mixing sources are updated once per frame, and the mixing weight of mixing source mix0 has an initial value.
Further, the mix output is as shown in formula (7):
out[i] = λ_mix0·cap_mix0[i] + λ_mix1·cap_mix1[i]        (7)
In formula (7), i indexes the sample points of each frame captured by mixing sources mix0 and mix1, cap_mix0[i] is the value of the i-th point of the capture frame of mixing source mix0, cap_mix1[i] is the value of the i-th point of the capture frame of mixing source mix1, and out[i] is the output value of the i-th point of the mixed frame;
Substituting the per-frame mixing weights of mixing sources mix0 and mix1 obtained from formulas (5) and (6) into formula (7) gives the output value of the i-th point of each mixed frame.
The present invention has the following advantages:
1. Using the delay between the sound captured by each microphone channel and the sound played by the loudspeaker, together with the energy of the captured sound, the channel with the strongest echo is identified and excluded from the mixing sources, ensuring the echo cancellation performance of the conference system;
2. Based on the capture frame energies of the candidate channels and the correlation between them, the sound captured by the optimal channels is selected as the mixing source, ensuring the sound quality;
3. All the algorithms are adaptive, so the optimal mixing source is adjusted dynamically.
Brief description of the drawings
The present invention is further illustrated below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flowchart of the method of the invention.
Fig. 2 is a schematic diagram of the sound collection setup in one embodiment of the invention.
Embodiment
Referring to Fig. 1, the multi-microphone sound mixing method of the present invention for a video conferencing system comprises a sound collection process, a delay estimation process, a maximum echo channel decision process, a mixing source selection process and a mix output process:
The sound collection process is: the loudspeaker plays sound and a plurality of microphone channels capture sound; the sound played by the loudspeaker and the sound captured by each microphone channel are stored frame by frame, and the playback frame energy of the loudspeaker and the capture frame energy of each microphone channel are obtained; each frame has a fixed length and contains a number of sample points: at a sampling rate of 8 kHz, a frame length of 10 milliseconds gives 80 points per frame and a frame length of 20 milliseconds gives 160 points per frame;
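As a concrete illustration of this framing step, a minimal numpy sketch is given below; the function names frame_signal and frame_energy and the choice of sum-of-squares as the energy measure are assumptions made for illustration and are not specified in the patent text.

import numpy as np

def frame_signal(x, frame_len):
    # Split a 1-D signal into consecutive non-overlapping frames.
    # For the 8 kHz example above, frame_len would be 80 samples (10 ms)
    # or 160 samples (20 ms); trailing samples that do not fill a whole
    # frame are dropped in this sketch.
    n_frames = len(x) // frame_len
    return np.reshape(np.asarray(x, dtype=np.float64)[:n_frames * frame_len],
                      (n_frames, frame_len))

def frame_energy(frame):
    # Per-frame energy taken here as the sum of squared sample values
    # (an assumption; the text only says a frame energy is obtained).
    return float(np.sum(np.asarray(frame, dtype=np.float64) ** 2))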
As shown in Fig. 2, in a setup with three microphone channels where the loudspeaker is a television and the distance d from the microphones to the television is 2-3 meters, one of the microphone channels faces the television directly; the sound captured by that channel is generally assumed to contain the most echo, so that channel needs to be excluded when actually selecting the mixing source;
The delay estimation process: obtain, at each capture, the delay between the sound captured by each microphone channel and the sound played by the loudspeaker; the delay is computed as follows:
Step 21: the capture frames of each microphone channel are correlated one by one with the corresponding playback frames of the loudspeaker to obtain the delay of each microphone channel, as given by formula (1) below:
cor(n) = Σ_{i=0}^{N-1} cap(i+n)·play(i)        (1)
In formula (1), N is the number of samples corresponding to the maximum sound delay, n is a positive integer from 1 to N, i indexes the sample points of each frame, cap(i+n) is the value of the (i+n)-th point of the microphone channel's capture frame, play(i) is the value of the i-th point of the loudspeaker's playback frame, and cor(n) is the linear correlation between the microphone channel and the loudspeaker playback channel;
Step 22: each n in formula (1) corresponds to one linear correlation value; the n corresponding to the maximum linear correlation value is converted into a time t, giving the delay of the microphone channel, the conversion being as shown in formula (2) below:
t = n / sample rate        (2)
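Formulas (1) and (2) can be read as a brute-force correlation search; the sketch below is one possible rendering, assuming cap and play are sample buffers that start at the same instant and that cap holds at least 2·N samples (the buffering details are not spelled out in the text, and estimate_delay is a hypothetical name).

import numpy as np

def estimate_delay(cap, play, max_delay_samples, sample_rate):
    # Formula (1): cor(n) = sum_{i=0}^{N-1} cap[i+n] * play[i],
    # with N = max_delay_samples and n = 1..N.
    cap = np.asarray(cap, dtype=np.float64)
    play = np.asarray(play, dtype=np.float64)
    N = max_delay_samples
    cors = [np.dot(cap[n:n + N], play[:N]) for n in range(1, N + 1)]
    best_n = int(np.argmax(cors)) + 1   # argmax is 0-based, n starts at 1
    return best_n / float(sample_rate)  # formula (2): t = n / sample rate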
The maximum echo channel decision process: using the delay obtained for each microphone channel, the sound captured by each microphone channel is aligned frame by frame with the sound played by the loudspeaker; the maximum echo channel, i.e. the channel containing the largest echo component, is then determined from the average ratio of each subband energy of the aligned microphone channel to the corresponding subband energy of the loudspeaker; the maximum echo channel is excluded when selecting the mixing source, so as to ensure that the echo of the conference system is cancelled;
The subband energy refers to the energy of each frequency bin obtained by applying an FFT (fast Fourier transform) to each frame of sound; the maximum echo channel decision process is as shown in formulas (3) and (4) below:
η(i, j) = P_cap(i, j) / P_play(j)        (3)
η_mean(i) = (1/M) Σ_j η(i, j)        (4)
In formulas (3) and (4), i is the channel index, j is the subband index, M is the total number of subbands, P_cap(i, j) is the j-th subband energy of the i-th microphone channel, P_play(j) is the j-th subband energy of the loudspeaker, η(i, j) is the attenuation ratio of the j-th subband energy of the i-th microphone channel, and η_mean(i) is the average subband energy attenuation ratio of the i-th microphone channel;
The values obtained from formula (3) are substituted into formula (4) to obtain the average subband energy attenuation ratio of each microphone channel; the attenuation ratios of all microphone channels are then sorted, and the microphone channel with the largest average subband energy attenuation ratio is the maximum echo channel;
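A minimal sketch of formulas (3) and (4) follows, assuming the capture frames have already been delay-aligned with the playback frame and that the squared magnitudes of the FFT bins are used directly as the subband energies (the text only says the subband energies are obtained from an FFT of each frame); the function names are hypothetical.

import numpy as np

def subband_energies(frame):
    # Per-bin energy of one frame: |FFT|^2 of the real-valued frame.
    spec = np.fft.rfft(np.asarray(frame, dtype=np.float64))
    return np.abs(spec) ** 2

def find_max_echo_channel(aligned_cap_frames, play_frame, eps=1e-12):
    # Formula (3): eta(i, j) = P_cap(i, j) / P_play(j)
    # Formula (4): eta_mean(i) = (1/M) * sum_j eta(i, j)
    # The channel with the largest eta_mean is the maximum echo channel.
    p_play = subband_energies(play_frame) + eps   # eps avoids division by zero
    eta_mean = [float(np.mean(subband_energies(f) / p_play))
                for f in aligned_cap_frames]
    return int(np.argmax(eta_mean))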
The mixing source selection process: after the maximum echo channel has been excluded, the sounds captured by the two remaining microphone channels with the largest capture frame energy are selected as the mixing sources; the sound captured by one of them (chosen arbitrarily) is designated mixing source mix0 and the sound captured by the other is designated mixing source mix1, and the per-frame mixing weights of the two mixing sources are calculated;
The per-frame mixing weights of the two mixing sources are computed as shown in formulas (5) and (6):
λ_mix0 = γ·λ_mix0 + (1-γ)·(P_cap(mix0) / (P_cap(mix0) + P_cap(mix1)))        (5)
λ_mix1 = 1 - λ_mix0        (6)
In formulas (5) and (6), λ_mix0 and λ_mix1 are the per-frame mixing weights of mixing sources mix0 and mix1 respectively, γ is the forgetting factor, a fixed constant, P_cap(mix0) is the frame energy of mixing source mix0, and P_cap(mix1) is the frame energy of mixing source mix1;
The per-frame mixing weight of mixing source mix0 obtained from formula (5) is substituted into formula (6) to obtain the per-frame mixing weight of mixing source mix1; the mixing weights of both mixing sources are updated once per frame, and the mixing weight of mixing source mix0 has an initial value;
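One way to read formulas (5) and (6) as code is sketched below; the value of the forgetting factor and the initial weight are assumptions, since the text only states that γ is a fixed constant and that λ_mix0 has an initial value.

def update_mix_weights(prev_lambda_mix0, p_mix0, p_mix1, gamma=0.9):
    # Formula (5): recursive per-frame update of lambda_mix0 using the
    # forgetting factor gamma and the frame energies of the two sources.
    # gamma = 0.9 is an assumed value, not one given in the text.
    lambda_mix0 = gamma * prev_lambda_mix0 + \
        (1.0 - gamma) * (p_mix0 / (p_mix0 + p_mix1))
    # Formula (6): the two weights sum to one.
    lambda_mix1 = 1.0 - lambda_mix0
    return lambda_mix0, lambda_mix1

Calling update_mix_weights once per frame and feeding the returned lambda_mix0 back in as prev_lambda_mix0 reproduces the once-per-frame update described above.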
The mix output process: the output value of each point of each mixed frame is obtained from the per-frame mixing weights of mixing sources mix0 and mix1, and the mix is finally output;
The mix output is as shown in formula (7):
out[i] = λ_mix0·cap_mix0[i] + λ_mix1·cap_mix1[i]        (7)
In formula (7), i indexes the sample points of each frame captured by mixing sources mix0 and mix1, cap_mix0[i] is the value of the i-th point of the capture frame of mixing source mix0, cap_mix1[i] is the value of the i-th point of the capture frame of mixing source mix1, and out[i] is the output value of the i-th point of the mixed frame;
Substituting the per-frame mixing weights of mixing sources mix0 and mix1 obtained from formulas (5) and (6) into formula (7) gives the output value of the i-th point of each mixed frame.
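Tying the steps together, formula (7) and one possible per-frame flow are sketched below; this reuses the hypothetical helpers sketched earlier (frame_energy, find_max_echo_channel, update_mix_weights), omits the delay-alignment step for brevity, and is only an illustration of the described flow under those assumptions, not the patented implementation itself.

import numpy as np

def mix_output(cap_mix0, cap_mix1, lambda_mix0, lambda_mix1):
    # Formula (7): out[i] = lambda_mix0*cap_mix0[i] + lambda_mix1*cap_mix1[i]
    return (lambda_mix0 * np.asarray(cap_mix0, dtype=np.float64)
            + lambda_mix1 * np.asarray(cap_mix1, dtype=np.float64))

def mix_one_frame(cap_frames, play_frame, state):
    # Exclude the maximum echo channel, pick the two highest-energy
    # remaining channels, update the weights, and mix one frame.
    echo_ch = find_max_echo_channel(cap_frames, play_frame)
    rest = [i for i in range(len(cap_frames)) if i != echo_ch]
    mix0, mix1 = sorted(rest, key=lambda i: frame_energy(cap_frames[i]),
                        reverse=True)[:2]
    state["lambda_mix0"], lambda_mix1 = update_mix_weights(
        state.get("lambda_mix0", 0.5),   # 0.5 is an assumed initial weight
        frame_energy(cap_frames[mix0]), frame_energy(cap_frames[mix1]))
    return mix_output(cap_frames[mix0], cap_frames[mix1],
                      state["lambda_mix0"], lambda_mix1)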
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the described embodiments are illustrative rather than limiting of the scope of the present invention; equivalent modifications and variations made by those of ordinary skill in the art in accordance with the spirit of the present invention shall fall within the scope protected by the claims of the present invention.

Claims (5)

1. A multi-microphone sound mixing method for a video conferencing system, characterized in that the method comprises a sound collection process, a delay estimation process, a maximum echo channel decision process, a mixing source selection process and a mix output process:
The sound collection process is: the loudspeaker plays sound and a plurality of microphone channels capture sound; the sound played by the loudspeaker and the sound captured by each microphone channel are stored frame by frame, and the playback frame energy of the loudspeaker and the capture frame energy of each microphone channel are obtained, each frame having a fixed length and containing a number of sample points;
The delay estimation process: obtain, at each capture, the delay between the sound captured by each microphone channel and the sound played by the loudspeaker;
The maximum echo channel decision process: using the delay obtained for each microphone channel, the sound captured by each microphone channel is aligned frame by frame with the sound played by the loudspeaker; the maximum echo channel, i.e. the channel containing the largest echo component, is then determined from the average ratio of each subband energy of the aligned microphone channel to the corresponding subband energy of the loudspeaker; the maximum echo channel is excluded when selecting the mixing source, so as to ensure that the echo of the conference system is cancelled;
The mixing source selection process: after the maximum echo channel has been excluded, the sounds captured by the two remaining microphone channels with the largest capture frame energy are selected as the mixing sources, and the per-frame mixing weights of the two mixing sources are calculated;
The mix output process: the output value of each point of each mixed frame is obtained from the per-frame mixing weights of the two mixing sources, and the mix is finally output.
2. The multi-microphone sound mixing method for a video conferencing system according to claim 1, characterized in that the delay is computed as follows:
Step 21: the capture frames of each microphone channel are correlated one by one with the corresponding playback frames of the loudspeaker to obtain the delay of each microphone channel, as given by formula (1) below:
cor(n) = Σ_{i=0}^{N-1} cap(i+n)·play(i)        (1)
In formula (1), N is the number of samples corresponding to the maximum sound delay, n is a positive integer from 1 to N, i indexes the sample points of each frame, cap(i+n) is the value of the (i+n)-th point of the microphone channel's capture frame, play(i) is the value of the i-th point of the loudspeaker's playback frame, and cor(n) is the linear correlation between the microphone channel and the loudspeaker playback channel;
Step 22: each n in formula (1) corresponds to one linear correlation value; the n corresponding to the maximum linear correlation value is converted into a time t, giving the delay of the microphone channel, the conversion being as shown in formula (2) below:
t = n / sample rate        (2)
3. The multi-microphone sound mixing method for a video conferencing system according to claim 1, characterized in that the subband energy refers to the energy of each frequency bin obtained by applying an FFT to each frame of sound, and the maximum echo channel decision process is as shown in formulas (3) and (4) below:
η(i, j) = P_cap(i, j) / P_play(j)        (3)
η_mean(i) = (1/M) Σ_j η(i, j)        (4)
In formulas (3) and (4), i is the channel index, j is the subband index, M is the total number of subbands, P_cap(i, j) is the j-th subband energy of the i-th microphone channel, P_play(j) is the j-th subband energy of the loudspeaker, η(i, j) is the attenuation ratio of the j-th subband energy of the i-th microphone channel, and η_mean(i) is the average subband energy attenuation ratio of the i-th microphone channel;
The values obtained from formula (3) are substituted into formula (4) to obtain the average subband energy attenuation ratio of each microphone channel; the attenuation ratios of all microphone channels are then sorted, and the microphone channel with the largest average subband energy attenuation ratio is the maximum echo channel.
4. The multi-microphone sound mixing method for a video conferencing system according to claim 1, characterized in that, of the two microphone channels with the largest capture frame energy, the sound captured by one (chosen arbitrarily) is designated mixing source mix0 and the sound captured by the other is designated mixing source mix1, and the per-frame mixing weights of the two mixing sources mix0 and mix1 are computed as shown in formulas (5) and (6):
λ_mix0 = γ·λ_mix0 + (1-γ)·(P_cap(mix0) / (P_cap(mix0) + P_cap(mix1)))        (5)
λ_mix1 = 1 - λ_mix0        (6)
In formulas (5) and (6), λ_mix0 and λ_mix1 are the per-frame mixing weights of mixing sources mix0 and mix1 respectively, γ is the forgetting factor, a fixed constant, P_cap(mix0) is the frame energy of mixing source mix0, and P_cap(mix1) is the frame energy of mixing source mix1;
The per-frame mixing weight of mixing source mix0 obtained from formula (5) is substituted into formula (6) to obtain the per-frame mixing weight of mixing source mix1; the mixing weights of both mixing sources are updated once per frame, and the mixing weight of mixing source mix0 has an initial value.
5. The multi-microphone sound mixing method for a video conferencing system according to claim 4, characterized in that the mix output is as shown in formula (7):
out[i] = λ_mix0·cap_mix0[i] + λ_mix1·cap_mix1[i]        (7)
In formula (7), i indexes the sample points of each frame captured by mixing sources mix0 and mix1, cap_mix0[i] is the value of the i-th point of the capture frame of mixing source mix0, cap_mix1[i] is the value of the i-th point of the capture frame of mixing source mix1, and out[i] is the output value of the i-th point of the mixed frame;
Substituting the per-frame mixing weights of mixing sources mix0 and mix1 obtained from formulas (5) and (6) into formula (7) gives the output value of the i-th point of each mixed frame.
CN201410441776.7A 2014-09-01 2014-09-01 Method for multi-microphone sound mixing of video conference system Expired - Fee Related CN104219013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410441776.7A CN104219013B (en) 2014-09-01 2014-09-01 Method for multi-microphone sound mixing of video conference system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410441776.7A CN104219013B (en) 2014-09-01 2014-09-01 Method for multi-microphone sound mixing of video conference system

Publications (2)

Publication Number Publication Date
CN104219013A true CN104219013A (en) 2014-12-17
CN104219013B CN104219013B (en) 2017-05-24

Family

ID=52100188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410441776.7A Expired - Fee Related CN104219013B (en) 2014-09-01 2014-09-01 Method for multi-microphone sound mixing of video conference system

Country Status (1)

Country Link
CN (1) CN104219013B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102461205A (en) * 2009-06-02 2012-05-16 皇家飞利浦电子股份有限公司 Acoustic multi-channel cancellation
JP2011035466A (en) * 2009-07-29 2011-02-17 Tamura Seisakusho Co Ltd Delay adjustment system
JP5372643B2 (en) * 2009-07-29 2013-12-18 株式会社タムラ製作所 Delay adjustment system
CN102056053A (en) * 2010-12-17 2011-05-11 中兴通讯股份有限公司 Multi-microphone audio mixing method and device
US20140010382A1 (en) * 2012-07-09 2014-01-09 Samsung Electronics Co., Ltd. Audio signal processing system and echo signal removing method thereof

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616665B (en) * 2015-01-30 2018-04-24 深圳市云之讯网络技术有限公司 Sound mixing method based on voice similar degree
CN104616665A (en) * 2015-01-30 2015-05-13 深圳市云之讯网络技术有限公司 Voice similarity based sound mixing method
US11272286B2 (en) 2016-09-13 2022-03-08 Nokia Technologies Oy Method, apparatus and computer program for processing audio signals
CN109997369A (en) * 2016-09-13 2019-07-09 诺基亚技术有限公司 Method, apparatus and computer program for processing an audio signal
US11863946B2 (en) 2016-09-13 2024-01-02 Nokia Technologies Oy Method, apparatus and computer program for processing audio signals
CN109658946A (en) * 2017-10-12 2019-04-19 深圳前海黑鲸科技有限公司 A kind of echo processing method, device, storage medium and terminal device
CN109451194A (en) * 2018-09-28 2019-03-08 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) A kind of meeting sound mixing method and device
CN109451194B (en) * 2018-09-28 2020-11-24 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) Conference sound mixing method and device
CN110310637A (en) * 2019-06-25 2019-10-08 重庆信络威科技有限公司 A kind of sound control method and system based on distributed multi-microphone and bluetooth Mesh
CN110677208A (en) * 2019-09-11 2020-01-10 厦门亿联网络技术股份有限公司 Sound mixing method and system for conference system
CN111770413A (en) * 2020-06-30 2020-10-13 浙江大华技术股份有限公司 Multi-sound-source sound mixing method and device and storage medium
CN112885329A (en) * 2021-02-02 2021-06-01 广州广哈通信股份有限公司 Control method and device for improving sound mixing quality and storage medium
CN112885329B (en) * 2021-02-02 2023-10-31 广州广哈通信股份有限公司 Control method, device and storage medium for improving sound mixing quality

Also Published As

Publication number Publication date
CN104219013B (en) 2017-05-24


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20170524)