CN106504758A - Mixer and sound mixing method

Info

Publication number: CN106504758A
Application number: CN201610939143.8A
Authority: CN
Inventors: 陈喆; 殷福亮; 呼德
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2016-10-25
Filing date: 2016-10-25
Publication date: 2017-03-15
Anticipated expiration: 2036-10-25
Also published as: CN106504758B

Abstract

The invention discloses a mixer and a sound mixing method. The mixer comprises: a framing unit; a speech detection unit connected to the framing unit, used to detect whether each channel signal after framing contains a speech signal; a time-varying filter connected to the speech detection unit; a loudness computing unit connected to the speech detection unit; a weight calculation unit connected to the loudness computing unit; a downmixing unit connected to the time-varying filter and the weight calculation unit; and a post-processing unit connected to the downmixing unit. The invention improves voice quality while treating all participants fairly.

Description

Mixer and sound mixing method

Technical field

The present invention relates to audio mixing technology, and in particular to a mixer and a sound mixing method.

Background art

Video conferences and teleconferences are meetings held over a communication network; they give participants in different locations real-time voice communication. To approach the atmosphere of a face-to-face meeting, audio mixing is indispensable, and it directly affects the voice quality of the conference. Audio mixing is divided into analog mixing and digital mixing; digital mixing is widely used because of its high precision, large dynamic range, and low noise. The basic principle of digital mixing is to superimpose the digital signals obtained by analog-to-digital conversion of each channel's voice signal, forming a single mixed output signal.

Because digital audio signals have upper and lower quantization limits, superposition may overflow. Digital mixing therefore has to meet two requirements. 1. The mixed signal must not overflow frequently. As the number of voice channels increases, overflow becomes more and more frequent; if saturation arithmetic is applied directly to the overflowing samples, noise is introduced, so that the mixed sound is discontinuous or contains pops. 2. The voice quality of each channel must be preserved. The channels differ in level and frequency content, and how well their quality is preserved after mixing is a major criterion for evaluating a digital mixing technique.
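To illustrate the overflow problem described above, the following minimal Python sketch sums several 16-bit channels sample by sample and applies saturation arithmetic. The sample values and the helper name `saturating_mix` are illustrative assumptions and are not taken from any of the cited techniques.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturating_mix(channels):
    """Sum equal-length 16-bit channels and clip each sum to the int16 range."""
    mixed = []
    for samples in zip(*channels):
        total = sum(samples)
        # Saturation: sums outside the 16-bit range are pinned to the limits.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


# Three loud channels: the raw sums 40000 and -45000 do not fit in 16 bits.
channels = [[20000, -15000, 100], [12000, -15000, 200], [8000, -15000, 300]]
print(saturating_mix(channels))  # [32767, -32768, 600]
```

The first two output samples are clipped, which is the kind of distortion that becomes more frequent as more channels are added.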

An existing document by Zhang Chuanyong, "Audio Mixing Technology and Its Application in a VoIP Conference System", discloses a weighted mixing technique. Its main idea is to compute a weight for each channel's voice signal and then superimpose the weighted signals; the purpose of the weighting is to reduce or eliminate overflow and so preserve voice quality. The technique is implemented as follows: assume there are N channels, each frame of each channel has M samples, and f(i, j) is the i-th sample value of the j-th channel; then its corresponding weight is:

[formula missing in source]

The final output is:

[formula missing in source]

Here weight(i, j) is the weight of the i-th sample of the j-th channel and Output(i) is the i-th output sample. The weighted mixing technique disclosed in this document has the following problems. First, the smaller a channel's signal amplitude, the smaller its weight, so small signals become even smaller after mixing, which easily causes large distortion. Second, in a typical video conference no more than four people speak at the same time, yet this method feeds the channels of non-speaking participants (which contain noise) into the mix without any processing, which tends to lower the signal-to-noise ratio of the mixed speech.

An existing document by Zhou Jingli et al., "A New Real-Time Audio Mixing Scheme for Multimedia Conferencing", discloses an automatic-threshold mixing technique. Before mixing, it decides from each channel's short-time energy whether the channel contains speech (silence detection); channels without speech are judged to be in the "non-speaking state" and do not take part in mixing. During mixing, the method computes an attenuation factor for each channel from the short-time energy of its own audio data: when the short-time energy exceeds a certain threshold, the signal is attenuated by a certain ratio, and below the threshold no attenuation is applied, so each channel's weight depends only on its own short-time energy. This technique has the following problems. Because signals judged to be in the non-speaking state are excluded from mixing, the mixed sound contains no trace of them, and the corresponding participants have no sense of presence; in addition, fluctuations occur when a participant goes from silence to speaking, which affects the listening experience. Moreover, although the method ensures that the weight of a small signal is ≥ 1, it still cannot guarantee the voice quality of small signals.

Thus, in existing mixing techniques, either every channel (whether or not it contains speech) takes part in mixing without any processing, or silence detection is used to reduce the number of channels mixed. Without silence detection, unnecessary noise is added during mixing and degrades voice quality. With silence detection, participants in the non-speaking state lose their sense of participation. In addition, existing techniques determine each channel's weight from the amplitude of the voice signal before superposition, but the loudness of speech is not simply its amplitude: it depends on both the amplitude and the frequency of the voice signal.

Summary of the invention

In view of the above problems, the present invention provides a mixer and a sound mixing method that improve voice quality and treat all participants fairly.

The technical solution of the present invention is as follows:

A mixer, comprising:

a framing unit, for framing each channel signal that takes part in mixing;

a speech detection unit connected to the framing unit, for detecting whether each channel signal after framing contains a speech signal; the speech detection unit determines whether a channel signal contains a speech signal by detecting whether the current frame of that channel is in the speech-present state;

a time-varying filter connected to the speech detection unit, for applying time-varying low-pass filtering to each channel signal after framing according to the detection result of the speech detection unit; when the current frame of a channel signal is in the speech-present state, the passband of the time-varying filter gradually widens, and when the current frame of a channel signal is in the speech-absent state, the passband gradually narrows;

a loudness computing unit connected to the speech detection unit, for calculating, according to the detection result of the speech detection unit, the mean loudness of each channel signal after framing over the current predetermined time period;

a weight calculation unit connected to the loudness computing unit, for calculating, according to the result of the loudness computing unit, the weight of each sample in the current frame of each channel signal taking part in mixing;

a downmixing unit connected to the time-varying filter and the weight calculation unit, for obtaining and outputting the single-channel current-frame signal produced by mixing the current frames of all channels, from the current-frame output of each channel after time-varying low-pass filtering and the sample weights of the current frame of each channel;

a post-processing unit connected to the downmixing unit, for calculating, from the single-channel current-frame signal output by the downmixing unit, the weight of each sample of the current frame of the mixed signal and each output sample.

Further, the speech detection unit comprises:

a power computation module, for calculating the power of the current frame of each channel signal after framing;

a minimum frame power determination module connected to the power computation module, for obtaining, from the results of the power computation module, the minimum frame power of each channel signal after framing within the current predetermined time period;

a voice state determination module connected to the power computation module and the minimum frame power determination module, for detecting whether a channel signal contains a speech signal by comparing the power of the current frame of that channel with the minimum frame power.

Further,

the power computation module calculates the power of the current frame of a channel signal using the formula [formula missing in source], where pow denotes the power of the current frame, x[i] denotes the i-th input sample of the current frame, and N denotes the number of samples in a frame;

the current predetermined time period is the duration T from the r-th frame before the current frame to the current frame; the minimum frame power determination module obtains the minimum frame power of a channel signal within the current predetermined time period using the formula pow_min = min{power of the current frame, power of the frame 1 before the current frame, ..., power of the frame r before the current frame}, where min{ } denotes the minimum of all values in the braces, [formula missing in source], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;

the voice state determination module uses a flag VAD to indicate whether a channel signal contains a speech signal, with the initial value VAD = 1; when pow ≥ 32·pow_min and VAD = 0, the module sets VAD to 1, indicating that the channel is in the speech-present state; when pow ≤ 4·pow_min and VAD = 1, the module sets VAD to 0, indicating that the channel is in the speech-absent state; in all other cases of the comparison between pow and pow_min, the module leaves VAD unchanged.
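The detection rule above can be sketched in Python as a small per-channel state machine. The 32x and 4x thresholds and the initial value VAD = 1 come from the text; the mean-square form of `frame_power`, the relation r = ceil(T·F_s/N), and the class name `HysteresisVad` are assumptions made for illustration, since the corresponding formulas are not available here.

```python
import math
from collections import deque


def frame_power(frame):
    """Mean-square power of one frame (assumed form of `pow`)."""
    return sum(x * x for x in frame) / len(frame)


class HysteresisVad:
    """Tracks the voice state of one channel using the 32x / 4x thresholds."""

    def __init__(self, duration_s, sample_rate, frame_len):
        # Assumption: r = ceil(T * F_s / N) frames span the duration T.
        r = math.ceil(duration_s * sample_rate / frame_len)
        # Holds the power of the current frame plus the r frames before it.
        self.history = deque(maxlen=r + 1)
        self.vad = 1  # initial value given in the text

    def update(self, frame):
        power = frame_power(frame)
        self.history.append(power)
        pow_min = min(self.history)
        if self.vad == 0 and power >= 32 * pow_min:
            self.vad = 1  # speech-present state
        elif self.vad == 1 and power <= 4 * pow_min:
            self.vad = 0  # speech-absent state
        return self.vad  # otherwise unchanged
```

Under these assumptions, a quiet frame followed by a frame with 100 times the power switches the flag from 0 to 1, and the flag stays at 1 until the power falls back to within 4 times the tracked minimum. The gap between the two thresholds keeps the state from toggling on small power fluctuations.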

Further, the time-varying filter obtains the i-th filtered output of the current frame of a channel signal using the formula f[i] = (1-b)*x[i] + b*f[i-1], where f[i] denotes the i-th filtered output of the current frame of the channel signal, x[i] denotes the i-th input sample of the current frame, N denotes the number of samples in a frame, 0 ≤ i < N, and b denotes the filter coefficient; when the current frame is in the speech-present state, [formula missing in source]; when the current frame is in the speech-absent state, [formula missing in source]; when b < 0.18, b is set to 0.18, and when b > 0.956, b is set to 0.956; p1 denotes the number of samples over which b changes from 0.956 to 0.18, and p2 denotes the number of samples over which b changes from 0.18 to 0.956;
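A minimal Python sketch of this filter follows. The recursion f[i] = (1-b)*x[i] + b*f[i-1] and the clamp limits 0.18 and 0.956 are taken from the text. The update rule for b is not available here, so the sketch assumes a linear per-sample ramp that covers the full range in p1 (speech present) or p2 (speech absent) samples; the class name and the starting value of b are also assumptions.

```python
B_MIN, B_MAX = 0.18, 0.956  # clamp limits given in the text


class TimeVaryingLowPass:
    """One-pole low-pass filter whose coefficient b follows the voice state."""

    def __init__(self, p1, p2, b=B_MAX):
        self.step_open = (B_MAX - B_MIN) / p1   # assumed ramp, speech present
        self.step_close = (B_MAX - B_MIN) / p2  # assumed ramp, speech absent
        self.b = b          # assumed start: narrowest passband
        self.prev = 0.0     # f[i-1], carried across frame boundaries

    def process(self, frame, vad):
        out = []
        for x in frame:
            # Smaller b widens the passband, larger b narrows it.
            self.b += -self.step_open if vad else self.step_close
            self.b = min(B_MAX, max(B_MIN, self.b))
            self.prev = (1.0 - self.b) * x + self.b * self.prev
            out.append(self.prev)
        return out
```

Because b moves gradually instead of jumping, a channel fades in and out of the mix when its voice state changes, while a non-speaking channel is still passed through with a narrow passband rather than being dropped.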

Further,

the loudness computing unit comprises a DFT module, a loudness acquisition module connected to the DFT module, and a mean loudness acquisition module connected to the loudness acquisition module;

when the current frame of a channel signal is in the speech-present state, the DFT module applies a DFT to the current frame of that channel, the loudness acquisition module then calculates the loudness value of the current frame, and finally the mean loudness acquisition module calculates the mean loudness of that channel over the current predetermined time period;

when the current frame of a channel signal is in the speech-absent state, the mean loudness of that channel over the current predetermined time period equals the mean loudness over the most recent predetermined time period before the current frame that contained a speech signal;

Further,

the DFT module applies a DFT to the current frame of a channel signal using the formula [formula missing in source], where [formula missing in source], s denotes the discrete frequency, x[i] denotes the i-th input sample of the current frame, X[s] denotes the result of applying the DFT to x[i], and j denotes the imaginary unit, j² = -1;

the loudness acquisition module calculates the loudness value of the current frame using the formula [formula missing in source], where loudness denotes the loudness value of the current frame, X[s] denotes the result of applying the DFT to x[i], Equal[s] denotes an equal-loudness array with preset values, s_20 = ceil(20*N/F_s), s_20000 = floor(20000*N/F_s), ceil(x) denotes the smallest integer greater than or equal to x, floor(x) denotes the largest integer less than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;

the mean loudness acquisition module calculates the mean loudness of a channel signal over the current predetermined time period using the formula [formula missing in source]; the current predetermined time period is the duration T from the r-th frame before the current frame to the current frame; where [formula missing in source], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;
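The per-frame loudness computation can be sketched in Python as follows. The bin limits s_20 = ceil(20*N/F_s) and s_20000 = floor(20000*N/F_s) and the use of a preset array Equal[s] come from the text. The loudness formula itself is not available here, so the sketch assumes an equal-loudness-weighted sum of squared DFT magnitudes; the unnormalized DFT, the clipping of the upper bin to N/2, and all function names are likewise assumptions.

```python
import cmath
import math


def dft(frame):
    """Naive DFT, assumed form: X[s] = sum_i x[i] * exp(-2j*pi*s*i/N)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * s * i / n) for i, x in enumerate(frame))
        for s in range(n)
    ]


def audible_bins(frame_len, sample_rate):
    """Bin range [s_20, s_20000] from the text, upper end clipped to N/2."""
    s20 = math.ceil(20 * frame_len / sample_rate)
    s20000 = math.floor(20000 * frame_len / sample_rate)
    # Assumption: bins above the Nyquist bin carry no new information.
    return s20, min(s20000, frame_len // 2)


def frame_loudness(frame, sample_rate, equal):
    """Assumed loudness: sum of Equal[s] * |X[s]|^2 over the audible bins."""
    spectrum = dft(frame)
    lo, hi = audible_bins(len(frame), sample_rate)
    return sum(equal[s] * abs(spectrum[s]) ** 2 for s in range(lo, hi + 1))
```

For example, with N = 480 and F_s = 48000 Hz, `audible_bins` returns bins 1 through 200. In practice a library FFT would replace the O(N²) `dft`, which is written out here only to keep the sketch self-contained.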

Further,

the weight calculation unit obtains the current-frame weight of the k-th channel signal using the formula [formula missing in source], and obtains the weight of the i-th sample of the current frame of the k-th channel signal using the formula weight_k[i] = weight_k, 0 ≤ i < N, where weight_k denotes the current-frame weight of the k-th channel signal, LOUD_k denotes the mean loudness of the k-th channel signal over the current predetermined time period, M denotes the number of channels taking part in mixing, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal;

the downmixing unit obtains the single-channel current-frame signal produced by mixing the current frames of all channels using the formula [formula missing in source], where Mix[i] denotes the i-th sample of the mixed current-frame signal, M denotes the number of channels taking part in mixing, f_k[i] denotes the i-th output sample of the current frame of the k-th channel signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, and k = 1, 2, ..., M;

the post-processing unit is further used to obtain the maximum sample in the current frame of the mixed signal and to calculate the current-frame weight of the mixed signal;
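The mixing step can be sketched in Python as below. The mixing formula is not available here; from the symbol definitions the sketch assumes Mix[i] is the sum over the M channels of f_k[i]·weight_k[i]. The frame weights are taken as given inputs because the formula that derives them from LOUD_k is also unavailable. Function names are illustrative.

```python
def per_sample_weights(frame_weight, frame_len):
    """weight_k[i] = weight_k for 0 <= i < N (unsmoothed case from the text)."""
    return [frame_weight] * frame_len


def mix_frame(filtered, weights):
    """Assumed mixing rule: Mix[i] = sum over k of f_k[i] * weight_k[i].

    filtered[k] is the low-pass-filtered current frame of channel k and
    weights[k] holds the per-sample weights of that frame.
    """
    frame_len = len(filtered[0])
    return [
        sum(f[i] * w[i] for f, w in zip(filtered, weights))
        for i in range(frame_len)
    ]


# Two channels, two samples per frame, with hypothetical frame weights.
filtered = [[1.0, 2.0], [10.0, 20.0]]
weights = [per_sample_weights(0.5, 2), per_sample_weights(0.1, 2)]
print(mix_frame(filtered, weights))  # [1.5, 3.0]
```

In this example the louder second channel receives the smaller weight, which reflects the stated goal of bringing all channels to a similar loudness in the mix.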

Further,

the weight calculation unit is further used to smooth the frame weights of each channel signal taking part in mixing;

Further,

the weight calculation unit smooths the frame weights of a channel signal as follows:

when the current frame of a channel signal is the first frame, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal using the formula weight_k[i] = weight_k, 0 ≤ i < N;

when the current frame of a channel signal is not the first frame, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal using the formula [formula missing in source], where weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, N denotes the number of samples in a frame, and P denotes the number of samples over which weight_k[i] gradually changes from the weight of the frame preceding the current frame to the weight of the current frame;

the post-processing unit obtains the maximum sample in the current frame of the mixed signal using the formula signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample in the current frame of the mixed signal, max{ } denotes the maximum of the values in the braces, Mix[0] denotes sample 0 of the current frame of the mixed signal, Mix[1] denotes sample 1 of the current frame of the mixed signal, and Mix[N-1] denotes sample N-1 of the current frame of the mixed signal;

when signal_max ≤ 32768, the post-processing unit sets the current-frame weight of the mixed signal to weight_mix = 1; when signal_max > 32768, the post-processing unit calculates the current-frame weight of the mixed signal as [formula missing in source];

when the current frame of the mixed signal is the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal using the formula weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, the post-processing unit obtains that weight using the formula [formula missing in source], where weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal, and Q denotes the number of samples over which weight_mix[i] gradually changes from the weight of the preceding frame of the mixed signal to the weight of its current frame;

the post-processing unit obtains the i-th output sample y[i] of the current frame of the mixed signal using the formula [formula missing in source], where Final[i] = Mix[i]*weight_mix[i], 0 ≤ i < N.
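The weight smoothing and post-processing described above can be sketched in Python as follows. The threshold 32768, the first-frame rule, signal_max = max|Mix[i]|, and Final[i] = Mix[i]*weight_mix[i] come from the text. Three pieces are assumptions because their formulas are not available here: the linear ramp over P or Q samples, the gain 32768/signal_max when the peak exceeds the threshold, and the final limiting of y[i] to the signed 16-bit range. All function names are illustrative.

```python
FULL_SCALE = 32768  # threshold given in the text


def ramp_weights(prev_weight, cur_weight, frame_len, ramp_len):
    """Per-sample weights moving from prev_weight to cur_weight.

    Assumed shape: linear over ramp_len samples (P or Q in the text),
    then constant. prev_weight=None marks the first frame.
    """
    if prev_weight is None:
        return [cur_weight] * frame_len  # first frame: constant weight
    return [
        prev_weight + (cur_weight - prev_weight) * min(i + 1, ramp_len) / ramp_len
        for i in range(frame_len)
    ]


def post_process(mix, prev_weight, ramp_len):
    """Return (output samples y, current frame weight weight_mix)."""
    signal_max = max(abs(v) for v in mix)
    # Assumption: scale the peak back to full scale when it is exceeded.
    weight_mix = 1.0 if signal_max <= FULL_SCALE else FULL_SCALE / signal_max
    weights = ramp_weights(prev_weight, weight_mix, len(mix), ramp_len)
    final = [m * w for m, w in zip(mix, weights)]
    # Assumption: y[i] is Final[i] rounded and limited to the int16 range.
    y = [max(-32768, min(32767, round(v))) for v in final]
    return y, weight_mix


y, w = post_process([65536, -1000, 0, 100], prev_weight=None, ramp_len=2)
print(y, w)  # [32767, -500, 0, 50] 0.5
```

The same `ramp_weights` helper could serve the per-channel weights weight_k[i] with ramp length P. Ramping the weight instead of switching it at the frame boundary avoids an audible step in level between consecutive frames.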

A sound mixing method, comprising the following steps:

Step 1: frame each channel signal that takes part in mixing;

Step 2: detect whether each channel signal after framing contains a speech signal; whether a channel signal contains a speech signal is determined by detecting whether the current frame of that channel is in the speech-present state;

Step 3: according to the speech detection result, apply time-varying low-pass filtering to each channel signal after framing: when the current frame of a channel signal is in the speech-present state, the passband gradually widens, and when the current frame of a channel signal is in the speech-absent state, the passband gradually narrows;

Step 4: according to the speech detection result, calculate the mean loudness of each channel signal after framing over the current predetermined time period;

Step 5: according to the mean loudness result, calculate the weight of each sample in the current frame of each channel signal taking part in mixing;

Step 6: from the current-frame output of each channel after time-varying low-pass filtering and the sample weights of the current frame of each channel, obtain and output the single-channel current-frame signal produced by mixing the current frames of all channels;

Step 7: from the single-channel current-frame signal produced by mixing, calculate the weight of each sample of the current frame of the mixed signal and each output sample.
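Step 1 of the method, framing, can be sketched in Python as follows. The text does not state whether frames overlap or how a trailing partial frame is handled, so the sketch assumes non-overlapping frames of N samples and zero padding; the function name is illustrative.

```python
def split_into_frames(signal, frame_len):
    """Split one channel into consecutive, non-overlapping frames.

    Assumption: a trailing partial frame is zero-padded to frame_len.
    """
    frames = []
    for start in range(0, len(signal), frame_len):
        frame = list(signal[start:start + frame_len])
        frame.extend([0] * (frame_len - len(frame)))
        frames.append(frame)
    return frames


print(split_into_frames([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5, 0]]
```

Each channel is framed independently with the same frame length, so that frame n of every channel covers the same time span when Steps 2 through 7 are applied.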

By adopting the above technical solution, the mixer and sound mixing method of the present invention provide a new voice activity detection approach, which judges from the mean power of the voice signal whether the current frame is speech. By introducing the time-varying filter, the invention solves the problem in current mixing techniques that feeding every channel directly into the mix introduces unnecessary noise, while avoiding the problem that reducing the number of mixed channels by silence detection leaves non-speaking participants with no sense of presence. The invention adopts an equal-loudness control strategy: it derives each channel's weight from that channel's loudness, so that the mean loudness of all channels ends up nearly identical and their perceived level is similar. The invention improves the voice quality of voice signals, especially small signals, and treats all participants fairly.

Brief description of the drawings

Fig. 1 is a structural block diagram of the mixer of the present invention;

Fig. 2 is a structural block diagram of the speech detection unit of the present invention;

Fig. 3 is a structural block diagram of the loudness computing unit of the present invention;

Fig. 4 is a workflow diagram of the mixer of the present invention;

Fig. 5 is a waveform diagram of three voice signals with different characteristics;

Fig. 6 is a waveform diagram of the three voice signals after processing by the loudness computing unit and the weight calculation unit of the present invention;

Fig. 7 is a waveform diagram of the output signal of the mixer of the present invention;

Fig. 8 is a flow chart of the sound mixing method of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, Fig. 2, Fig. 3, and Fig. 4, a mixer comprises: a framing unit, for framing each channel signal that takes part in mixing; a speech detection unit connected to the framing unit, for detecting whether each channel signal after framing contains a speech signal, where the speech detection unit determines whether a channel signal contains a speech signal by detecting whether the current frame of that channel is in the speech-present state; a time-varying filter connected to the speech detection unit, for applying time-varying low-pass filtering to each channel signal after framing according to the detection result of the speech detection unit, where the passband of the time-varying filter gradually widens when the current frame of a channel signal is in the speech-present state and gradually narrows when it is in the speech-absent state; a loudness computing unit connected to the speech detection unit, for calculating, according to the detection result of the speech detection unit, the mean loudness of each channel signal after framing over the current predetermined time period; a weight calculation unit connected to the loudness computing unit, for calculating, according to the result of the loudness computing unit, the weight of each sample in the current frame of each channel signal taking part in mixing; a downmixing unit connected to the time-varying filter and the weight calculation unit, for obtaining and outputting the single-channel current-frame signal produced by mixing the current frames of all channels, from the current-frame output of each channel after time-varying low-pass filtering and the sample weights of the current frame of each channel; and a post-processing unit connected to the downmixing unit, for calculating, from the single-channel current-frame signal output by the downmixing unit, the weight of each sample of the current frame of the mixed signal and each output sample.

Further, the speech detection unit comprises: a power computation module, for calculating the power of the current frame of each channel signal after framing; a minimum frame power determination module connected to the power computation module, for obtaining, from the results of the power computation module, the minimum frame power of each channel signal after framing within the current predetermined time period; and a voice state determination module connected to the power computation module and the minimum frame power determination module, for detecting whether a channel signal contains a speech signal by comparing the power of the current frame of that channel with the minimum frame power.

Further, the power computation module calculates the power of the current frame of a channel signal using the formula [formula missing in source], where pow denotes the power of the current frame, x[i] denotes the i-th input sample of the current frame, and N denotes the number of samples in a frame.

The current predetermined time period is the duration T from the r-th frame before the current frame to the current frame. The minimum frame power determination module obtains the minimum frame power of a channel signal within the current predetermined time period using the formula pow_min = min{power of the current frame, power of the frame 1 before the current frame, ..., power of the frame r before the current frame}, where min{ } denotes the minimum of all values in the braces, [formula missing in source], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame.

The voice state determination module uses a flag VAD to indicate whether a channel signal contains a speech signal, with the initial value VAD = 1. When pow ≥ 32·pow_min and VAD = 0, the module sets VAD to 1, indicating that the channel is in the speech-present state; when pow ≤ 4·pow_min and VAD = 1, the module sets VAD to 0, indicating that the channel is in the speech-absent state; in all other cases of the comparison between pow and pow_min, the module leaves VAD unchanged.

Further, the time-varying filter obtains the i-th filtered output of the current frame of a channel signal using the formula f[i] = (1-b)*x[i] + b*f[i-1], where f[i] denotes the i-th filtered output of the current frame of the channel signal, x[i] denotes the i-th input sample of the current frame, N denotes the number of samples in a frame, 0 ≤ i < N, and b denotes the filter coefficient. When the current frame is in the speech-present state, [formula missing in source]; when the current frame is in the speech-absent state, [formula missing in source]. When b < 0.18, b is set to 0.18, and when b > 0.956, b is set to 0.956. p1 denotes the number of samples over which b changes from 0.956 to 0.18, and p2 denotes the number of samples over which b changes from 0.18 to 0.956.

Further, the loudness computing unit comprises a DFT module, a loudness acquisition module connected to the DFT module, and a mean loudness acquisition module connected to the loudness acquisition module. When the current frame of a channel signal is in the speech-present state, the DFT module applies a DFT to the current frame of that channel, the loudness acquisition module then calculates the loudness value of the current frame, and finally the mean loudness acquisition module calculates the mean loudness of that channel over the current predetermined time period. When the current frame of a channel signal is in the speech-absent state, the mean loudness of that channel over the current predetermined time period equals the mean loudness over the most recent predetermined time period before the current frame that contained a speech signal.

Further, the DFT module applies a DFT to the current frame of a channel signal using the formula [formula missing in source], where [formula missing in source], s denotes the discrete frequency, x[i] denotes the i-th input sample of the current frame, X[s] denotes the result of applying the DFT to x[i], and j denotes the imaginary unit, j² = -1.

The loudness acquisition module calculates the loudness value of the current frame using the formula [formula missing in source], where loudness denotes the loudness value of the current frame, X[s] denotes the result of applying the DFT to x[i], Equal[s] denotes an equal-loudness array with preset values, s_20 = ceil(20*N/F_s), s_20000 = floor(20000*N/F_s), ceil(x) denotes the smallest integer greater than or equal to x, floor(x) denotes the largest integer less than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame. The mean loudness acquisition module calculates the mean loudness of a channel signal over the current predetermined time period using the formula [formula missing in source]; the current predetermined time period is the duration T from the r-th frame before the current frame to the current frame; where [formula missing in source], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame.

Further, the weight calculation unit obtains the current-frame weight of the k-th channel signal using the formula [formula missing in source], and obtains the weight of the i-th sample of the current frame of the k-th channel signal using the formula weight_k[i] = weight_k, 0 ≤ i < N, where weight_k denotes the current-frame weight of the k-th channel signal, LOUD_k denotes the mean loudness of the k-th channel signal over the current predetermined time period, M denotes the number of channels taking part in mixing, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal. The downmixing unit obtains the single-channel current-frame signal produced by mixing the current frames of all channels using the formula [formula missing in source], where Mix[i] denotes the i-th sample of the mixed current-frame signal, M denotes the number of channels taking part in mixing, f_k[i] denotes the i-th output sample of the current frame of the k-th channel signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, and k = 1, 2, ..., M. The post-processing unit is further used to obtain the maximum sample in the current frame of the mixed signal and to calculate the current-frame weight of the mixed signal.

Further, the weight calculation unit is further used to smooth the frame weights of each channel signal taking part in mixing. Further, the weight calculation unit smooths the frame weights of a channel signal as follows. When the current frame of a channel signal is the first frame, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal using the formula weight_k[i] = weight_k, 0 ≤ i < N. When the current frame of a channel signal is not the first frame, the weight calculation unit obtains that weight using the formula [formula missing in source], where weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, N denotes the number of samples in a frame, and P denotes the number of samples over which weight_k[i] gradually changes from the weight of the frame preceding the current frame to the weight of the current frame.

The post-processing unit obtains the maximum sample in the current frame of the mixed signal using the formula signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample in the current frame of the mixed signal, max{ } denotes the maximum of the values in the braces, Mix[0] denotes sample 0 of the current frame of the mixed signal, Mix[1] denotes sample 1 of the current frame of the mixed signal, and Mix[N-1] denotes sample N-1 of the current frame of the mixed signal. When signal_max ≤ 32768, the post-processing unit sets the current-frame weight of the mixed signal to weight_mix = 1; when signal_max > 32768, the post-processing unit calculates the current-frame weight of the mixed signal as [formula missing in source]. When the current frame of the mixed signal is the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal using the formula weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, the post-processing unit obtains that weight using the formula [formula missing in source], where weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal, and Q denotes the number of samples over which weight_mix[i] gradually changes from the weight of the preceding frame of the mixed signal to the weight of its current frame. The post-processing unit obtains the i-th output sample y[i] of the current frame of the mixed signal using the formula [formula missing in source], where Final[i] = Mix[i]*weight_mix[i], 0 ≤ i < N.

As shown in Fig. 8, the present invention also provides a sound mixing method, comprising the following steps:

Further, step 1 specifically includes the following steps:

Step 11: calculate the power of the current frame of each channel signal after framing;

Step 12: from the calculated current-frame power of each channel signal, obtain the minimum frame power of each framed channel signal within the current predetermined period;

Step 13: detect whether a channel signal contains a speech signal by comparing the power of its current frame with the minimum frame power;

Further,

The power of the current frame of a channel signal is calculated by the formula [formula not reproduced], where pow denotes the power of the current frame, x[i] denotes the i-th input sample of the current frame, and N denotes the number of samples in a frame;

The current predetermined period is the duration T from the r-th frame before the current frame to the current frame. The minimum frame power of a channel signal within the current predetermined period is obtained by pow_min = min{power of the current frame, power of the 1st frame before the current frame, ..., power of the r-th frame before the current frame}, where min{} denotes the minimum of all values in the braces; in the formula: [formula not reproduced], where ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;
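The sliding minimum described above can be sketched in Python as follows. This is only an illustrative sketch: the class name `MinFramePower` and its interface are hypothetical and do not come from the patent; the only behaviour taken from the text is that the minimum is taken over the current frame and the r frames before it.

```python
from collections import deque


class MinFramePower:
    """Hypothetical helper: minimum power over the current frame and the r frames before it."""

    def __init__(self, r):
        # The window holds at most r + 1 values: current frame plus r earlier frames.
        self.history = deque(maxlen=r + 1)

    def update(self, pow_frame):
        # Append the newest frame power; the deque drops the oldest automatically.
        self.history.append(pow_frame)
        return min(self.history)


tracker = MinFramePower(r=2)
mins = [tracker.update(p) for p in [5.0, 3.0, 4.0, 6.0, 7.0]]
```

With r = 2 the fifth result is 4.0, because the frame with power 3.0 has by then left the window.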

Whether a channel signal contains a speech signal is indicated by a flag VAD, which is initialized to VAD = 1. When pow ≥ 32 * pow_min and VAD = 0, VAD is set to 1, indicating that the channel is in the speech state; when pow ≤ 4 * pow_min and VAD = 1, VAD is set to 0, indicating that the channel is in the non-speech state; in all other cases VAD is left unchanged;
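The two-threshold rule above behaves as a hysteresis switch, which the following Python sketch illustrates. The function names are hypothetical, and the mean-square form of `frame_power` is an assumption, since the power formula itself is not reproduced in the text; the thresholds 32 and 4 and the initial value VAD = 1 are taken from the description.

```python
def frame_power(x):
    """ASSUMPTION: mean-square power; the patent's power formula is not reproduced."""
    return sum(v * v for v in x) / len(x)


def update_vad(vad, pow_frame, pow_min):
    """Switch on at 32 * pow_min, switch off at 4 * pow_min, otherwise hold."""
    if vad == 0 and pow_frame >= 32 * pow_min:
        return 1
    if vad == 1 and pow_frame <= 4 * pow_min:
        return 0
    return vad


vad = 1  # initial value given in the text
states = []
for p in [100.0, 3.0, 10.0, 40.0, 10.0]:
    vad = update_vad(vad, p, pow_min=1.0)
    states.append(vad)
```

A power of 10 leaves the flag unchanged in either state, which is what keeps the detector from toggling on small fluctuations.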

Further, step 2 is specifically:

The i-th filtered output of the current frame of a channel signal is obtained by f[i] = (1-b) * x[i] + b * f[i-1], where f[i] denotes the i-th filtered output of the current frame of the channel signal, x[i] denotes the i-th input sample of the current frame, N denotes the number of samples in a frame, 0 ≤ i < N, and b denotes the filter coefficient. When the current frame is in the speech state, [formula not reproduced]; when the current frame is in the non-speech state, [formula not reproduced]. When b < 0.18, b is set to 0.18; when b > 0.956, b is set to 0.956. p1 denotes the number of samples over which b fades from 0.956 to 0.18, and p2 denotes the number of samples over which b fades from 0.18 to 0.956;
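A minimal Python sketch of this time-varying one-pole filter is given below. The recursion f[i] = (1-b) * x[i] + b * f[i-1] and the clamping of b to [0.18, 0.956] come from the text; the per-sample linear ramp of b is an assumption, because the update formulas for b are not reproduced, and the function name `filter_frame` is hypothetical.

```python
B_MIN, B_MAX = 0.18, 0.956


def filter_frame(x, vad, b, f_prev, p1, p2):
    """Filter one frame; returns (outputs, final b, final f) to carry into the next frame.

    ASSUMPTION: b moves linearly by one step per sample. The patent's own
    update formulas for b are not reproduced in the text.
    """
    step_down = (B_MAX - B_MIN) / p1  # speech: b falls, passband widens
    step_up = (B_MAX - B_MIN) / p2    # non-speech: b rises, passband narrows
    out = []
    for sample in x:
        b = b - step_down if vad == 1 else b + step_up
        b = min(max(b, B_MIN), B_MAX)  # clamp as stated in the text
        f_prev = (1 - b) * sample + b * f_prev
        out.append(f_prev)
    return out, b, f_prev


# Start from f[-1] = 0 and b = 0.18, the widest passband.
out, b, f_last = filter_frame([1.0] * 10, vad=0, b=0.18, f_prev=0.0, p1=4, p2=4)
```

Carrying b and the last output across frames keeps the coefficient continuous at frame boundaries, so a change of speech state produces a gradual rather than abrupt change of passband.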

Further,

When the current frame of a channel signal is in the speech state, step 3 specifically includes the following steps:

Step 31: perform a DFT on the current frame of the channel signal;

Step 32: calculate the loudness value of the current frame;

Step 33: calculate the mean loudness of the channel signal within the current predetermined period;

Further,

The DFT of the current frame of a channel signal is computed by the formula $X[s] = \sum_{i=0}^{N-1} x[i]\,W_N^{is}$, s = 0, 1, ..., N-1, where $W_N = e^{-j 2\pi / N}$, s denotes the discrete frequency, x[i] denotes the i-th input sample of the current frame, X[s] denotes the result of applying the DFT to x[i], and j denotes the imaginary unit, j² = -1;
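For illustration, a direct evaluation of the DFT can be written in Python as below. The sketch assumes the standard forward transform with kernel exp(-j*2*pi*i*s/N); the function name `dft` is hypothetical, and a practical implementation would normally use an FFT instead of this O(N²) loop.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT: X[s] = sum over i of x[i] * exp(-j*2*pi*i*s/N)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)  # twiddle factor W_N (standard convention, assumed)
    return [sum(x[i] * w ** (i * s) for i in range(n)) for s in range(n)]


X = dft([1.0, 0.0, -1.0, 0.0])
```

For this four-sample input the energy lands in bins 1 and 3, each with value 2, while bins 0 and 2 are zero.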

The loudness value of the current frame is calculated by the formula [formula not reproduced], where loudness denotes the loudness value of the current frame, X[s] denotes the result of applying the DFT to x[i], Equal[s] denotes the equal-loudness array with preset values, s₂₀ = ceil(20*N/F_s), s₂₀₀₀₀ = floor(20000*N/F_s), ceil(x) denotes the smallest integer greater than or equal to x, floor(x) denotes the largest integer less than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;
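The bin limits s₂₀ and s₂₀₀₀₀ can be evaluated directly. The short Python sketch below uses a hypothetical helper `audible_bin_range` and, as example inputs, N = 2048 and F_s = 48000 Hz, the values quoted for the test in this description.

```python
import math


def audible_bin_range(n, fs):
    """Return (s20, s20000), the DFT bins bounding 20 Hz to 20 kHz."""
    s20 = math.ceil(20 * n / fs)
    s20000 = math.floor(20000 * n / fs)
    return s20, s20000


print(audible_bin_range(2048, 48000))  # prints (1, 853)
```

With these values the loudness calculation would therefore cover bins 1 through 853 of the 2048-point DFT.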

The mean loudness of a channel signal within the current predetermined period is calculated by the formula [formula not reproduced]; the current predetermined period is the duration T from the r-th frame before the current frame to the current frame; in the formula: [formula not reproduced], where ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;

Further, the current-frame weight of the k-th channel signal is obtained by the formula [formula not reproduced], and the weight of the i-th sample of the current frame of the k-th channel signal is obtained by weight_k[i] = weight_k, 0 ≤ i < N, where weight_k denotes the current-frame weight of the k-th channel signal, LOUD_k denotes the mean loudness of the k-th channel signal within the current predetermined period, M denotes the number of channels participating in the mix, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal;

Further, the single mixed current-frame signal is obtained from the current frames of the channel signals by the formula [formula not reproduced], where Mix[i] denotes the i-th sample of the mixed current-frame signal, M denotes the number of channels participating in the mix, f_k[i] denotes the i-th output sample of the current frame of the k-th channel signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, and k = 1, 2, ..., M;
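One plausible reading of the mixing step, given the symbols defined above, is a per-sample weighted sum over the M channels. The Python sketch below assumes exactly that form, Mix[i] = weight_1[i] * f_1[i] + ... + weight_M[i] * f_M[i]; since the mixing formula is not reproduced in the text, this form and the function name `mix_frame` are assumptions.

```python
def mix_frame(filtered, weights):
    """ASSUMED form: Mix[i] = sum over k of weight_k[i] * f_k[i].

    filtered: list of M frames (each a list of N filtered samples f_k[i]).
    weights:  list of M frames of per-sample weights weight_k[i].
    """
    n = len(filtered[0])
    return [
        sum(w[i] * f[i] for f, w in zip(filtered, weights))
        for i in range(n)
    ]


mixed = mix_frame([[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [2.0, 2.0]])
```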

The method further includes the following step: obtain the maximum sample in the current frame of the mixed signal and calculate the current-frame weight of the mixed signal;

Further, the following step follows step 4: smooth the frame weights of each channel signal participating in the mix;

Further, the frame weights of a channel signal are smoothed as follows:

When the current frame of a channel signal is the first frame, the weight of the i-th sample of the current frame of the k-th channel signal is obtained by weight_k[i] = weight_k, 0 ≤ i < N;

When the current frame of a channel signal is not the first frame, the weight of the i-th sample of the current frame of the k-th channel signal is obtained by the formula [formula not reproduced];

The maximum sample in the current frame of the mixed signal is obtained by signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample of the current frame of the mixed signal, max{} denotes the maximum of the values in the braces, and Mix[0], Mix[1] and Mix[N-1] denote the 0th, 1st and (N-1)-th samples of the current frame of the mixed signal;

When signal_max ≤ 32768, the current-frame weight of the mixed signal is weight_mix = 1; when signal_max > 32768, the current-frame weight of the mixed signal is calculated by the formula [formula not reproduced];

When the current frame of the mixed signal is the first frame, the weight of the i-th sample of the current frame of the mixed signal is obtained by weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, that weight is obtained by the formula [formula not reproduced], where weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal and Q denotes the number of samples over which weight_mix[i] fades gradually from the weight of the previous mixed frame to the weight of the current mixed frame;

The i-th output sample y[i] of the current frame of the mixed signal is obtained by the formula [formula not reproduced], where Final[i] = Mix[i] * weight_mix[i], 0 ≤ i < N.
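The post-processing chain can be sketched in Python as follows. Only the threshold 32768, the rule weight_mix = 1 below that threshold, and Final[i] = Mix[i] * weight_mix[i] are taken from the text. The gain 32768 / signal_max above the threshold and the clipping of y[i] to the signed 16-bit range are assumptions, since both formulas are missing from the text, and the function names are hypothetical.

```python
FULL_SCALE = 32768


def frame_gain(mix):
    """Current-frame weight of the mixed signal."""
    signal_max = max(abs(v) for v in mix)
    if signal_max <= FULL_SCALE:
        return 1.0
    # ASSUMPTION: scale the peak back to full scale; the patent formula is not reproduced.
    return FULL_SCALE / signal_max


def saturate(final):
    """ASSUMPTION: clip Final[i] to the signed 16-bit range to produce y[i]."""
    return [max(-32768, min(32767, int(round(v)))) for v in final]


mix = [65536.0, -16384.0, 100.0]
weight_mix = frame_gain(mix)
final = [v * weight_mix for v in mix]  # Final[i] = Mix[i] * weight_mix[i]
y = saturate(final)
```

In this example the frame is scaled by 0.5, and the one sample that still reaches 32768 is clipped to 32767.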

The upper limit of the passband width of the time-varying filter of the invention is 20 kHz and the lower limit is 0.3 kHz; when the passband would exceed the upper limit or fall below the lower limit, it is held at that limit. Before time-varying low-pass filtering is performed, the time-varying filter is first initialized, specifically by setting f[-1] = 0 and b = 0.18, at which point the passband of the time-varying filter is 0 to 20 kHz. The current predetermined period of the invention may be taken as the current 4 s; the previous predetermined period refers to the 4 s preceding the current 4 s. Equal[s] in the invention denotes the equal-loudness array with preset values; the preset values in the equal-loudness array are obtained from the equal-loudness curves, and their specific values are given in Table 1;

Table 1. Values of the equal-loudness array Equal[s].

The procedure for obtaining a value of Equal[s] from Table 1 is as follows: 1. calculate the value of s*N/F_s; 2. find the frequency range in Table 1 that contains this value; 3. read the Equal[s] value corresponding to that frequency range in Table 1. For example, when s*N/F_s = 1, the value falls in the frequency range (0.985 to 1.500) of Table 1, so Equal[s] = 1.5.
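The three lookup steps above amount to a range search, sketched in Python below. Table 1 itself is not reproduced in the text, so the example table contains only the single row quoted above (0.985 to 1.500 maps to 1.5); the function name `lookup_equal`, the tuple layout, and the treatment of both range ends as inclusive are assumptions.

```python
def lookup_equal(value, table):
    """Return the Equal value of the first range containing `value`, or None.

    table: list of (low, high, equal_value) rows.
    ASSUMPTION: both ends of each range are treated as inclusive.
    """
    for low, high, equal_value in table:
        if low <= value <= high:
            return equal_value
    return None


# Only the one row quoted in the text; the rest of Table 1 is not available here.
TABLE_1_EXCERPT = [(0.985, 1.500, 1.5)]

result = lookup_equal(1.0, TABLE_1_EXCERPT)
```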

To ensure that the mean loudness of every channel signal is the same, the equal-loudness weight of each channel signal must be calculated. Through the smoothing step, the invention gives the weights of successive frames a transition of P samples, which ensures that the weights change smoothly from frame to frame, so that the speech sounds smoother and speech quality is preserved. If the weight calculation unit does not apply the preferred smoothing to the frame weights of each channel signal, the weight of every sample of the current frame of the k-th channel signal (weight_k[i], i = 0, 1, 2, ..., N-1) equals the current-frame weight weight_k of the k-th channel signal, that is, weight_k[i] = weight_k, 0 ≤ i < N. If the weight calculation unit applies the preferred smoothing to the frame weights of each channel signal, then when the current frame of a channel signal is the first frame, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by weight_k[i] = weight_k, 0 ≤ i < N; when the current frame of a channel signal is not the first frame, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by the formula [formula not reproduced]; that is, after smoothing, the sample weights of the current frame of the k-th channel signal are not all equal to the current-frame weight weight_k.

In addition, after the channel signals are mixed, post-processing can be applied to prevent the frequent overflow that may occur when multiple speech signals are added together; the post-processing also keeps the weights of successive frames steady, which benefits the smoothness of the mixed speech. The invention calculates the weights using the loudness of each channel speech signal as the criterion, so that the mean loudness of the channel speech signals is the same, and handles overflow last; as a result, after mixing, the channel speech signals are perceived as nearly equally loud and overflow is infrequent.
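The P-sample transition of the channel weights described above can be illustrated with the Python sketch below. The first-frame rule weight_k[i] = weight_k comes from the text; the linear ramp used for later frames is an assumption, because the smoothing formula is not reproduced, and the function name `smooth_weights` is hypothetical. The sketch also assumes P ≤ N.

```python
def smooth_weights(prev_weight, cur_weight, n, p):
    """Per-sample weights weight_k[i] for one frame of n samples.

    prev_weight is None for the first frame (no smoothing is applied).
    ASSUMPTION: a linear ramp over the first p samples (p <= n); the patent's
    smoothing formula is not reproduced in the text.
    """
    if prev_weight is None:
        return [cur_weight] * n
    ramp = [prev_weight + (cur_weight - prev_weight) * i / p for i in range(p)]
    return ramp + [cur_weight] * (n - p)


first = smooth_weights(None, 2.0, n=4, p=2)
later = smooth_weights(1.0, 2.0, n=4, p=2)
```

The same shape of transition would apply to weight_mix[i] over Q samples in the post-processing unit, under the same assumption.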

The effectiveness of the invention is further illustrated below by using the mixer of the invention to mix three speech signals with different characteristics. Here the sampling frequency F_s is 48 kHz; the number of samples N in one frame is 2048; the current predetermined period is the current 4 seconds; the frame count r is 100; the number of samples p1 over which b fades from 0.956 to 0.18 is 960; the number of samples p2 over which b fades from 0.18 to 0.956 is 96000; the number of samples P over which weight_k[i] fades from the weight of the previous frame to the weight of the current frame is 100; and the number of samples Q over which weight_mix[i] fades from the weight of the previous mixed frame to the weight of the current mixed frame is 100;
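To give a sense of scale, the sample counts above can be converted to durations. The Python snippet below is plain arithmetic on the quoted parameter values; the variable names are illustrative.

```python
FS = 48000             # sampling frequency F_s in Hz
N = 2048               # samples per frame
R = 100                # frame count r
P1, P2 = 960, 96000    # fade lengths of b in samples

frame_ms = 1000 * N / FS     # duration of one frame
frames_in_4s = 4 * FS / N    # frames spanned by exactly 4 s
r_span_s = R * N / FS        # duration covered by r = 100 frames
p1_ms = 1000 * P1 / FS       # time for b to fall from 0.956 to 0.18
p2_s = P2 / FS               # time for b to rise from 0.18 to 0.956

print(round(frame_ms, 2), frames_in_4s, p1_ms, p2_s)  # prints 42.67 93.75 20.0 2.0
```

The passband therefore opens within about 20 ms of speech onset but takes 2 s to close again, and r = 100 frames cover roughly 4.27 s, slightly more than the nominal 4 s period.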

Fig. 5 shows the waveforms of the three speech signals with different characteristics. As shown in Fig. 5, to verify the mixing of a weak signal, the first speech signal (signal 1) has an amplitude range of -3500 to 3500, far smaller than that of the other two speech signals; because loudness is proportional to amplitude, the loudness of the first speech signal is also far lower than that of the other two. To verify the effectiveness of the speech detection unit and the time-varying filter, the second speech signal (signal 2) alternates between the "speech" state and the "no speech" state and has uniform white noise added. The third speech signal (signal 3) has a small amplitude in its earlier part and a larger amplitude in its later part, so that the amplitude changes of its earlier and later parts after mixing can be compared and the corresponding loudness changes analyzed. Fig. 6 shows the waveforms of the three speech signals after processing by the loudness computing unit and the weight calculation unit of the invention. As shown in Fig. 6, after processing by the speech detection unit, the time-varying filter, the loudness computing unit and the weight calculation unit, the three speech signals change in different ways: the amplitude of the first speech signal (signal 1) increases markedly, and its loudness increases with it; when the second speech signal (signal 2) stays in the "no speech" state, the uniform white noise in the signal is attenuated and the signal amplitude is reduced; for the third speech signal (signal 3), the amplitude of the earlier part is increased and the amplitude of the later part is reduced. Fig. 7 shows the waveform of the output signal of the mixer of the invention. As shown in Fig. 7, the three speech signals are superimposed by the downmixing unit, and the post-processing unit then keeps overflow in the final output signal infrequent and handles any overflow by saturation. The test results above show that three speech signals of different loudness have nearly equal mean loudness after the mixer of the invention applies equal-loudness control; because the three speech signals have different characteristics, this also demonstrates the good robustness and stability of the invention.

The invention provides a new voice activity detection scheme, which judges from the mean power of the speech signal whether the current frame is speech. By introducing the time-varying filter, the invention solves the problem in current mixing techniques that every channel signal participates directly in the mix and introduces unnecessary noise, while also avoiding the problem that arises when silence detection is used to reduce the number of mixed channels, namely that "non-speaking" participants lose any sense of presence. The invention adopts an equal-loudness control strategy: the weight of each channel signal is derived from its calculated loudness, so that the mean loudness of the channel signals ends up nearly equal and their perceived loudness is similar. The invention thus improves the speech quality of weak signals while also ensuring fairness to each participant.

The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the invention.

Claims

1. A mixer, characterized in that the mixer comprises:

A speech detection unit connected to the framing unit; the speech detection unit is configured to detect whether each channel signal after framing contains a speech signal; the speech detection unit determines whether a channel signal contains a speech signal by detecting whether the current frame of that channel signal is in the speech state;

A time-varying filter connected to the speech detection unit; the time-varying filter is configured to apply time-varying low-pass filtering to each channel signal after framing according to the detection result of the speech detection unit; when the current frame of a channel signal is in the speech state, the passband of the time-varying filter gradually widens, and when the current frame of a channel signal is in the non-speech state, the passband of the time-varying filter gradually narrows;

A loudness computing unit connected to the speech detection unit; the loudness computing unit is configured to calculate, according to the detection result of the speech detection unit, the mean loudness of each framed channel signal within the current predetermined period;

A weight calculation unit connected to the loudness computing unit; the weight calculation unit is configured to calculate, according to the result of the loudness computing unit, the weight of each sample of the current frame of each channel signal participating in the mix;

A downmixing unit connected to the time-varying filter and the weight calculation unit; the downmixing unit is configured to obtain and output the single mixed current-frame signal from the current-frame output of each channel signal after time-varying low-pass filtering and from the weights of the samples of the current frame of each channel signal;

A post-processing unit connected to the downmixing unit; the post-processing unit is configured to calculate, from the single current-frame signal output by the downmixing unit, the weight of each sample of the current frame of the mixed signal and each output sample.

2. The mixer according to claim 1, characterized in that the speech detection unit comprises:

A minimum frame power determination module connected to the power computation module; the minimum frame power determination module is configured to obtain, according to the result of the power computation module, the minimum frame power of each framed channel signal within the current predetermined period;

A speech state identification module connected to the power computation module and the minimum frame power determination module; the speech state identification module is configured to detect whether a channel signal contains a speech signal by comparing the power of its current frame with the minimum frame power.

3. The mixer according to claim 2, characterized in that

The power computation module calculates the power of the current frame of a channel signal by the formula [formula not reproduced], where pow denotes the power of the current frame, x[i] denotes the i-th input sample of the current frame, and N denotes the number of samples in a frame;

The current predetermined period is the duration T from the r-th frame before the current frame to the current frame; the minimum frame power determination module obtains the minimum frame power of a channel signal within the current predetermined period by pow_min = min{power of the current frame, power of the 1st frame before the current frame, ..., power of the r-th frame before the current frame}, where min{} denotes the minimum of all values in the braces; in the formula: [formula not reproduced], where ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;

The speech state identification module indicates whether a channel signal contains a speech signal by a flag VAD, which is initialized to VAD = 1; when pow ≥ 32 * pow_min and VAD = 0, the speech state identification module sets VAD to 1, indicating that the channel is in the speech state; when pow ≤ 4 * pow_min and VAD = 1, the speech state identification module sets VAD to 0, indicating that the channel is in the non-speech state; in all other cases the speech state identification module leaves VAD unchanged.

4. The mixer according to claim 1, characterized in that the time-varying filter obtains the i-th filtered output of the current frame of a channel signal by f[i] = (1-b) * x[i] + b * f[i-1], where f[i] denotes the i-th filtered output of the current frame of the channel signal, x[i] denotes the i-th input sample of the current frame, N denotes the number of samples in a frame, 0 ≤ i < N, and b denotes the filter coefficient; when the current frame is in the speech state, [formula not reproduced]; when the current frame is in the non-speech state, [formula not reproduced]; when b < 0.18, b is set to 0.18, and when b > 0.956, b is set to 0.956; p1 denotes the number of samples over which b fades from 0.956 to 0.18, and p2 denotes the number of samples over which b fades from 0.18 to 0.956.

5. The mixer according to claim 1, characterized in that

The loudness computing unit comprises: a DFT module, a loudness acquisition module connected to the DFT module, and a mean loudness acquisition module connected to the loudness acquisition module;

When the current frame of a channel signal is in the speech state, the DFT module performs a DFT on the current frame of that channel signal, the loudness acquisition module then calculates the loudness value of the current frame, and finally the mean loudness acquisition module calculates the mean loudness of that channel signal within the current predetermined period;

When the current frame of a channel signal is in the non-speech state, the mean loudness of that channel signal within the current predetermined period equals the mean loudness within the most recent predetermined period before the current frame that contained a speech signal.

6. The mixer according to claim 5, characterized in that

The DFT module performs a DFT on the current frame of a channel signal by the formula $X[s] = \sum_{i=0}^{N-1} x[i]\,W_N^{is}$, s = 0, 1, ..., N-1, where $W_N = e^{-j 2\pi / N}$, s denotes the discrete frequency, x[i] denotes the i-th input sample of the current frame, X[s] denotes the result of applying the DFT to x[i], and j denotes the imaginary unit, j² = -1;

The mean loudness acquisition module calculates the mean loudness of a channel signal within the current predetermined period by the formula [formula not reproduced]; the current predetermined period is the duration T from the r-th frame before the current frame to the current frame; in the formula: [formula not reproduced], where ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame.

7. The mixer according to claim 6, characterized in that

The weight calculation unit obtains the current-frame weight of the k-th channel signal by the formula [formula not reproduced], and obtains the weight of the i-th sample of the current frame of the k-th channel signal by weight_k[i] = weight_k, 0 ≤ i < N, where weight_k denotes the current-frame weight of the k-th channel signal, LOUD_k denotes the mean loudness of the k-th channel signal within the current predetermined period, M denotes the number of channels participating in the mix, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal;

The downmixing unit obtains the single mixed current-frame signal by the formula [formula not reproduced], 0 ≤ i < N, where Mix[i] denotes the i-th sample of the mixed current-frame signal, M denotes the number of channels participating in the mix, f_k[i] denotes the i-th output sample of the current frame of the k-th channel signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, and k = 1, 2, ..., M;

The post-processing unit is further configured to obtain the maximum sample in the current frame of the mixed signal and to calculate the current-frame weight of the mixed signal.

8. The mixer according to claim 7, characterized in that

The weight calculation unit is further configured to smooth the frame weights of each channel signal participating in the mix.

9. The mixer according to claim 8, characterized in that

When the current frame of a channel signal is the first frame, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by weight_k[i] = weight_k, 0 ≤ i < N;

When the current frame of a channel signal is not the first frame, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by the formula [formula not reproduced], 0 ≤ i < P; where weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, N denotes the number of samples in a frame, and P denotes the number of samples over which weight_k[i] fades gradually from the weight of the previous frame to the weight of the current frame;

The post-processing unit obtains the maximum sample in the current frame of the mixed signal by signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample of the current frame of the mixed signal, max{} denotes the maximum of the values in the braces, and Mix[0], Mix[1] and Mix[N-1] denote the 0th, 1st and (N-1)-th samples of the current frame of the mixed signal;

When signal_max ≤ 32768, the post-processing unit sets the current-frame weight of the mixed signal to weight_mix = 1; when signal_max > 32768, the post-processing unit calculates the current-frame weight of the mixed signal by the formula [formula not reproduced];

When the current frame of the mixed signal is the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal by weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, the post-processing unit obtains that weight by the formula [formula not reproduced], 0 ≤ i < Q, where weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal and Q denotes the number of samples over which weight_mix[i] fades gradually from the weight of the previous mixed frame to the weight of the current mixed frame;

The post-processing unit obtains the i-th output sample y[i] of the current frame of the mixed signal by the formula [formula not reproduced], where Final[i] = Mix[i] * weight_mix[i], 0 ≤ i < N.

10. A sound mixing method, characterized in that the method comprises the following steps:

Step 2: detect whether each channel signal after framing contains a speech signal; whether a channel signal contains a speech signal is determined by detecting whether the current frame of that channel signal is in the speech state;

Step 3: according to the speech detection result, apply time-varying low-pass filtering to each channel signal after framing: when the current frame of a channel signal is in the speech state, the passband gradually widens, and when the current frame of a channel signal is in the non-speech state, the passband gradually narrows;

Step 4: according to the speech detection result, calculate the mean loudness of each framed channel signal within the current predetermined period;

Step 5: according to the calculated mean loudness, calculate the weight of each sample of the current frame of each channel signal participating in the mix;

Step 6: from the current-frame output of each channel signal after time-varying low-pass filtering and the weights of the samples of the current frame of each channel signal, obtain and output the single mixed current-frame signal;

Step 7: from the single mixed current-frame signal, calculate the weight of each sample of the current frame of the mixed signal and each output sample.