CN106504758B

CN106504758B - Mixer and sound mixing method

Info

Publication number: CN106504758B
Application number: CN201610939143.8A
Authority: CN
Inventors: 陈喆; 殷福亮; 呼德
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2016-10-25
Filing date: 2016-10-25
Publication date: 2019-07-16
Anticipated expiration: 2036-10-25
Also published as: CN106504758A

Abstract

The invention discloses a mixer and a sound mixing method. The mixer comprises: a framing unit; a speech detection unit connected to the framing unit, used to detect whether each framed channel signal contains a voice signal; a time-varying filter connected to the speech detection unit; a loudness computing unit connected to the speech detection unit; a weight calculation unit connected to the loudness computing unit; a mixing unit connected to the time-varying filter and the weight calculation unit; and a post-processing unit connected to the mixing unit. The invention can improve voice quality and also embodies fairness to each participant.

Description

Mixer and sound mixing method

Technical field

The present invention relates to audio mixing technology, and specifically to a mixer and a sound mixing method.

Background technique

Video conferences and teleconferences are forms of meeting held over a communication network; they provide real-time voice communication for participants in different locations. To achieve an atmosphere close to that of a face-to-face meeting, audio mixing technology is indispensable, and it directly affects the voice quality of the meeting. Audio mixing technology is divided into analog mixing and digital mixing; digital mixing is widely used because of its high precision, large dynamic range and low noise. The basic principle of digital mixing is to superimpose the digital signals obtained by analog-to-digital conversion of each channel's voice signal to form a single mixed output signal.

Because digital audio signals are quantized within upper and lower bounds, superposition may cause the result to overflow. The requirements on digital mixing therefore fall into two aspects: 1. Ensuring that the mixed signal does not overflow frequently. As the number of voice channels increases, overflow becomes more and more frequent; if saturation arithmetic is applied directly to the overflowing signals, noise is introduced, so that the mixed sound is discontinuous or contains pops. 2. Ensuring the quality of each channel's voice. The voices on the channels differ in level and frequency, and how well the quality of these signals is preserved after mixing is a major criterion for evaluating digital mixing technology.
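The overflow problem described above can be illustrated with a minimal sketch. It assumes signed 16-bit samples (a range of -32768 to 32767), which this paragraph does not state; the helper names `saturate` and `naive_mix` are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit signed sample range


def saturate(value):
    """Clamp a sample to the assumed signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def naive_mix(channels):
    """Sum the channels sample by sample and clip any overflow."""
    return [saturate(sum(samples)) for samples in zip(*channels)]


# Two identical loud channels: the first two sums exceed the range and are clipped.
loud = [20000, -20000, 15000]
mixed = naive_mix([loud, loud])
```

Clipping flattens the peaks of the summed waveform, which is the distortion the paragraph attributes to applying saturation directly.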

The existing document "Audio mixing technology and its application in a VoIP session system", by Zhang Chuanyong, discloses a weighted mixing technique. Its main idea is to compute a weight for each channel's voice signal and then superimpose the weighted signals; the purpose of the weighting is to reduce or eliminate overflow and thereby preserve voice quality. The technique is implemented as follows: suppose there are N channels and each frame of each channel has M samples, where f(i, j) is the i-th sample value of the j-th channel; the corresponding weight is:

The final output is:

where weight(i, j) is the weight of the i-th sample of the j-th channel and Output(i) is the i-th output sample. The weighted mixing technique disclosed in this document has the following problems: during mixing, the smaller a channel's signal amplitude, the smaller its weight, so small signals become even smaller after mixing, which readily causes considerable distortion; in addition, usually no more than 4 people speak simultaneously in a video conference, yet this method mixes the signals of lines on which nobody is speaking (which contain noise) directly without any processing, which readily lowers the signal-to-noise ratio of the mixed voice.

The existing document "A new real-time mixing scheme for multimedia conferencing", by Zhou Jingli et al., discloses an automatic-threshold mixing technique. Before mixing, each channel's signal is classified as voice or not according to its short-time energy (i.e., silence detection); lines without voice are judged to be in the "non-speaking state", and these signals do not take part in mixing. During mixing, this method computes an attenuation factor from the short-time energy of the audio data itself: when the short-time energy exceeds a threshold, the audio is attenuated by a certain proportion, and when it is below the threshold no attenuation is needed, so the weight of each channel depends only on its own short-time energy. This technique has the following problems: because signals judged to be in the "non-speaking state" do not take part in mixing, the mixed sound contains no trace of them, so the participants in that state have no sense of presence; meanwhile, a fluctuation appears when a participant goes from silence to speaking, which affects the listening experience; furthermore, although this method guarantees that the weight of a small signal is ≥ 1, it still cannot guarantee the voice quality of small signals.

Thus, existing mixing techniques either mix every channel signal directly without any processing (regardless of whether it contains voice) or use silence detection to reduce the number of signals mixed. Without silence detection, unnecessary noise is added during mixing, which degrades voice quality. With silence detection used to reduce the number of mixed channels, participants in the "non-speaking state" lose their sense of participation. In addition, existing techniques determine the weight of each channel signal from the amplitude of the voice signal before superposition, but the loudness of speech is not exactly equal to its amplitude; it depends on both the amplitude and the frequency of the voice signal.

Summary of the invention

The present invention has been made in view of the above problems, and provides a mixer and a sound mixing method that can improve voice quality and also embody fairness to each participant.

The technical means of the present invention are as follows:

A mixer, comprising:

A framing unit, for framing each channel signal that takes part in mixing;

A speech detection unit connected to the framing unit; the speech detection unit is used to detect whether each framed channel signal contains a voice signal; the speech detection unit determines whether a channel signal contains a voice signal by detecting whether the current frame of that channel signal is in the voice-present state;

A time-varying filter connected to the speech detection unit; the time-varying filter is used to apply time-varying low-pass filtering to each framed channel signal according to the detection result of the speech detection unit; when the current frame of a channel signal is in the voice-present state, the passband of the time-varying filter gradually widens, and when the current frame of a channel signal is in the voice-absent state, the passband of the time-varying filter gradually narrows;

A loudness computing unit connected to the speech detection unit; the loudness computing unit is used to calculate, according to the detection result of the speech detection unit, the mean loudness of each framed channel signal within the current predetermined time period;

A weight calculation unit connected to the loudness computing unit; the weight calculation unit is used to calculate, according to the calculation result of the loudness computing unit, the weight of each sample in the current frame of each channel signal that takes part in mixing;

A mixing unit connected to the time-varying filter and the weight calculation unit; the mixing unit is used to obtain and output a single-channel current frame signal, mixed from the current frames of all channel signals, according to the current-frame output signal of each channel signal after time-varying low-pass filtering and the weight of each sample in the current frame of each channel signal;

A post-processing unit connected to the mixing unit; the post-processing unit is used to calculate, according to the single-channel current frame signal output by the mixing unit, the weight of each sample and each output sample of the current frame of the mixed signal;

Further, the speech detection unit comprises:

A power computation module, for calculating the power of the current frame of each framed channel signal;

A minimum frame power determination module connected to the power computation module; the minimum frame power determination module is used to obtain, according to the calculation result of the power computation module, the minimum frame power of each framed channel signal within the current predetermined time period;

A voice state determination module connected to the power computation module and the minimum frame power determination module; the voice state determination module is used to detect whether a channel signal contains a voice signal by comparing the power of the current frame of that channel signal with the minimum frame power;

Further,

The power computation module calculates the power of the current frame of a channel signal by a formula in which pow denotes the power of the current frame, x[i] denotes the i-th input datum of the current frame, and N denotes the number of samples in a frame;

The current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame; the minimum frame power determination module obtains the minimum frame power of a channel signal within the current predetermined time period by the formula pow_min = min{power of the current frame, power of the 1st frame before the current frame, ..., power of the r-th frame before the current frame}, where min{ } denotes the minimum of all data in the braces; r is determined by a formula in which ceil(x) denotes the integer closest to x that is greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;

The voice state determination module uses a flag VAD to indicate whether a channel signal contains a voice signal, and initializes VAD = 1; when pow ≥ 32·pow_min and VAD = 0, the voice state determination module sets VAD to 1, indicating that the channel signal is in the voice-present state; when pow ≤ 4·pow_min and VAD = 1, the voice state determination module sets VAD to 0, indicating that the channel signal is in the voice-absent state; in all other cases of the comparison between pow and pow_min, the voice state determination module leaves VAD unchanged;
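The detection rule above, with its two thresholds (32 and 4) and the VAD flag that is only changed when a threshold is crossed, can be sketched as follows. The power and frame-count formulas are not legible in this text, so the sketch assumes mean-square power and r = ceil(T·F_s/N); both are assumptions, and the names `frame_power`, `history_length` and `HysteresisVad` are hypothetical.

```python
import math
from collections import deque


def frame_power(frame):
    # Assumption: mean-square power of the frame (formula not shown in the text).
    return sum(x * x for x in frame) / len(frame)


def history_length(T, fs, N):
    # Assumption: number of frames covering T seconds, r = ceil(T * fs / N).
    return math.ceil(T * fs / N)


class HysteresisVad:
    """Per-channel voice flag driven by pow versus pow_min, as in the text."""

    def __init__(self, r):
        self.powers = deque(maxlen=r + 1)  # current frame plus the r previous frames
        self.vad = 1                       # initial value VAD = 1, as stated

    def update(self, frame):
        power = frame_power(frame)
        self.powers.append(power)
        pow_min = min(self.powers)
        if power >= 32 * pow_min and self.vad == 0:
            self.vad = 1   # voice-present state
        elif power <= 4 * pow_min and self.vad == 1:
            self.vad = 0   # voice-absent state
        return self.vad    # otherwise unchanged


vad = HysteresisVad(r=3)
quiet, loud = [1.0] * 4, [10.0] * 4
states = [vad.update(f) for f in (quiet, loud, loud, quiet)]
```

The gap between the two thresholds gives hysteresis: a frame whose power lies between 4·pow_min and 32·pow_min never flips the flag, so the state does not chatter on borderline frames.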

Further, the time-varying filter obtains the i-th filtered output value of the current frame of a channel signal by the formula f[i] = (1-b)*x[i] + b*f[i-1], where f[i] denotes the i-th filtered output value of the current frame of the channel signal, x[i] denotes the i-th input datum of the current frame, N denotes the number of samples in a frame, 0 ≤ i < N, and b denotes the filter coefficient; b is updated by one formula when the current frame is in the voice-present state and by another when the current frame is in the voice-absent state; when b < 0.18, b is set to 0.18, and when b > 0.956, b is set to 0.956; p1 denotes the number of samples over which b fades from 0.956 to 0.18, and p2 denotes the number of samples over which b fades from 0.18 to 0.956;
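A minimal sketch of this filter follows. The recursion f[i] = (1-b)*x[i] + b*f[i-1] and the limits 0.18 and 0.956 come from the text; the per-sample update of b is not legible here, so the sketch assumes a linear step of (0.956 - 0.18)/p1 downward in the voice-present state and (0.956 - 0.18)/p2 upward otherwise. The starting value of b, the carry-over of f[i-1] across frames, and the class name `TimeVaryingLowPass` are likewise assumptions.

```python
B_MIN, B_MAX = 0.18, 0.956  # limits of the filter coefficient b, from the text


class TimeVaryingLowPass:
    """One-pole low-pass whose coefficient b follows the voice flag."""

    def __init__(self, p1, p2):
        # Assumption: b moves linearly, reaching the other limit in p1 or p2 samples.
        self.step_down = (B_MAX - B_MIN) / p1
        self.step_up = (B_MAX - B_MIN) / p2
        self.b = B_MIN    # assumption: start wide open, matching the initial VAD = 1
        self.prev = 0.0   # assumption: f[-1] of a frame is the last output of the previous frame

    def process(self, frame, vad):
        out = []
        for x in frame:
            # Smaller b widens the passband, larger b narrows it.
            self.b += -self.step_down if vad else self.step_up
            self.b = min(B_MAX, max(B_MIN, self.b))
            self.prev = (1 - self.b) * x + self.b * self.prev
            out.append(self.prev)
        return out


filt = TimeVaryingLowPass(p1=4, p2=4)
voiced = filt.process([1.0] * 4, vad=1)    # b stays clamped at 0.18
unvoiced = filt.process([1.0] * 4, vad=0)  # b ramps up towards 0.956
```

Because a channel without voice is only filtered more heavily rather than dropped, it still contributes to the mix, which is the behaviour the text contrasts with silence detection.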

Further,

The loudness computing unit comprises: a DFT module, a loudness acquisition module connected to the DFT module, and a mean loudness acquisition module connected to the loudness acquisition module;

When the current frame of a channel signal is in the voice-present state, the DFT module applies a DFT to the current frame of that channel signal, the loudness acquisition module then calculates the loudness value of the current frame, and finally the mean loudness acquisition module calculates the mean loudness of that channel signal within the current predetermined time period;

When the current frame of a channel signal is in the voice-absent state, the mean loudness of that channel signal within the current predetermined time period is equal to the mean loudness within the most recent predetermined time period before the current frame that contained a voice signal;

Further,

The DFT module applies a DFT to the current frame of a channel signal by a formula in which s denotes the discrete frequency, x[i] denotes the i-th input datum of the current frame, X[s] denotes the result obtained from x[i] by the DFT, and j denotes the imaginary unit, j² = -1;

The loudness acquisition module calculates the loudness value of the current frame by a formula in which loudness denotes the loudness value of the current frame, X[s] denotes the result obtained from x[i] by the DFT, Equal[s] denotes an equal-loudness array with preset values, s₂₀ = ceil(20*N/F_s), s₂₀₀₀₀ = floor(20000*N/F_s), ceil(x) denotes the integer closest to x that is greater than or equal to x, floor(x) denotes the integer closest to x that is less than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;

The mean loudness acquisition module calculates the mean loudness of a channel signal within the current predetermined time period by a formula; the current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame; r is determined by a formula in which ceil(x) denotes the integer closest to x that is greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame;
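The loudness computation can be sketched roughly as follows, with heavy caveats. Only the bin limits s₂₀ = ceil(20*N/F_s) and s₂₀₀₀₀ = floor(20000*N/F_s) and the existence of the Equal[s] array are taken from the text. The DFT below is the standard definition; the loudness as an Equal-weighted sum of squared magnitudes, the clipping of the upper bin at N/2, and the mean as a plain average are all assumptions made for illustration, and every function name is hypothetical.

```python
import cmath
import math


def dft(frame):
    # Standard DFT definition; any scaling used by the source is not visible in the text.
    N = len(frame)
    return [sum(frame[i] * cmath.exp(-2j * cmath.pi * i * s / N) for i in range(N))
            for s in range(N)]


def audible_bins(N, fs):
    # Bin indices of 20 Hz and 20 kHz, as given in the text.
    return math.ceil(20 * N / fs), math.floor(20000 * N / fs)


def frame_loudness(frame, fs, equal):
    # Assumption: loudness = sum of Equal[s] * |X[s]|^2 over the audible bins.
    N = len(frame)
    X = dft(frame)
    s20, s20000 = audible_bins(N, fs)
    s_hi = min(s20000, N // 2)  # assumption: do not go past the Nyquist bin
    return sum(equal[s] * abs(X[s]) ** 2 for s in range(s20, s_hi + 1))


def mean_loudness(recent_loudness):
    # Assumption: arithmetic mean over the frames of the current time period.
    return sum(recent_loudness) / len(recent_loudness)


impulse = [1.0] + [0.0] * 7  # flat spectrum: |X[s]| = 1 for every bin
value = frame_loudness(impulse, fs=16000, equal=[1.0] * 8)
```

With a flat Equal array this reduces to band-limited energy; a real equal-loudness table would weight the bins according to the ear's frequency sensitivity, which is how loudness differs from amplitude in the background discussion.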

Further,

The weight calculation unit obtains the current-frame weight of the k-th channel signal by a formula, and obtains the weight of the i-th sample of the current frame of the k-th channel signal by the formula weight_k[i] = weight_k, 0 ≤ i < N; where weight_k denotes the current-frame weight of the k-th channel signal, LOUD_k denotes the mean loudness of the k-th channel signal within the current predetermined time period, M denotes the number of channel signals taking part in mixing, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal;

The mixing unit obtains the single-channel current frame signal mixed from the current frames of all channel signals by a formula in which Mix[i] denotes the i-th sample of the mixed single-channel current frame signal, M denotes the number of channel signals taking part in mixing, f_k[i] denotes the i-th output sample of the current frame of the k-th channel signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, and k = 1, 2, ..., M;
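The mixing step can be sketched under one assumption: that Mix[i] is the sum over the M channels of weight_k[i] * f_k[i]. The formula itself is not legible in this text, so this form is inferred from the symbol definitions only. The per-channel weights are taken as given inputs because the formula relating weight_k to LOUD_k is also missing; `mix_frame` is a hypothetical name.

```python
def mix_frame(filtered, weights):
    """Weighted sum of the filtered current frames of all channels.

    filtered[k][i] plays the role of f_k[i]; weights[k][i] plays the role of weight_k[i].
    Assumption: Mix[i] = sum over k of weight_k[i] * f_k[i].
    """
    N = len(filtered[0])
    return [sum(w[i] * f[i] for f, w in zip(filtered, weights)) for i in range(N)]


filtered = [[1.0, 2.0], [10.0, 20.0]]  # two channels, two samples per frame
weights = [[0.5, 0.5], [0.1, 0.1]]     # the louder channel receives the smaller weight
mix = mix_frame(filtered, weights)
```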

The post-processing unit is also used to obtain the maximum sample in the current frame of the mixed signal and to calculate the current-frame weight of the mixed signal;

Further,

The weight calculation unit is also used to smooth the frame weights of each channel signal taking part in mixing;

Further,

The weight calculation unit smooths the frame weights of a channel signal as follows:

When the current frame is the first frame of a channel signal, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by the formula weight_k[i] = weight_k, 0 ≤ i < N;

When the current frame is not the first frame of a channel signal, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by a formula in which weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, N denotes the number of samples in a frame, and P denotes the number of samples over which weight_k[i] fades gradually from the weight of the frame preceding the current frame to the weight of the current frame;
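The smoothing rule can be sketched as below. The first-frame case (a constant weight across the frame) is taken from the text; the formula for later frames is not legible, so the sketch assumes a linear ramp over the first P samples followed by the current-frame weight. The function name `smoothed_weights` and the use of `None` to mark the first frame are hypothetical.

```python
def smoothed_weights(prev_weight, cur_weight, N, P):
    """Per-sample weights for one frame of one channel.

    prev_weight is None for the first frame; otherwise it is the previous frame's weight.
    Assumption: the weight ramps linearly over the first P samples.
    """
    if prev_weight is None:
        return [cur_weight] * N  # first frame: weight_k[i] = weight_k
    ramp = []
    for i in range(N):
        if i < P:
            ramp.append(prev_weight + (cur_weight - prev_weight) * (i + 1) / P)
        else:
            ramp.append(cur_weight)
    return ramp


first = smoothed_weights(None, 0.5, N=4, P=2)
later = smoothed_weights(1.0, 0.5, N=4, P=2)
```

Ramping the gain instead of switching it at the frame boundary avoids a step discontinuity in the output waveform, which would otherwise be audible as a click.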

The post-processing unit obtains the maximum sample in the current frame of the mixed signal by the formula signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample in the current frame of the mixed signal, max{ } denotes the maximum of the data in the braces, Mix[0] denotes the output of sample 0 of the current frame of the mixed signal, Mix[1] denotes the output of sample 1 of the current frame of the mixed signal, and Mix[N-1] denotes the output of sample N-1 of the current frame of the mixed signal;

When signal_max ≤ 32768, the post-processing unit sets the current-frame weight of the mixed signal to weight_mix = 1; when signal_max > 32768, the post-processing unit calculates the current-frame weight weight_mix of the mixed signal by a formula;

When the current frame of the mixed signal is the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal by the formula weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal by a formula in which weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal and Q denotes the number of samples over which weight_mix[i] fades gradually from the weight of the preceding frame of the mixed signal to the weight of the current frame of the mixed signal;

The post-processing unit obtains the i-th output sample y[i] of the current frame of the mixed signal by a formula in which final[i] = Mix[i]*weight_mix[i], 0 ≤ i < N.
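A rough sketch of the post-processing step follows. The peak search, the threshold 32768, the unity weight below it, and final[i] = Mix[i]*weight_mix[i] come from the text. The weight used above the threshold and the mapping from final[i] to y[i] are not legible, so the sketch assumes weight_mix = 32768/signal_max and a clamp to the signed 16-bit range; per-sample smoothing of weight_mix is left out for brevity. The names `frame_gain` and `post_process` are hypothetical.

```python
FULL_SCALE = 32768  # threshold on signal_max, from the text


def frame_gain(mix):
    """Current-frame weight of the mixed signal."""
    signal_max = max(abs(v) for v in mix)
    if signal_max <= FULL_SCALE:
        return 1.0
    # Assumption: scale the frame so that its peak lands on full scale.
    return FULL_SCALE / signal_max


def post_process(mix, weight_mix):
    """final[i] = Mix[i] * weight_mix[i], then an assumed clamp to 16-bit output."""
    out = []
    for m, w in zip(mix, weight_mix):
        final = m * w
        out.append(int(max(-32768, min(32767, round(final)))))  # assumption: saturating output
    return out


mix = [65536.0, -16384.0, 0.0]
gain = frame_gain(mix)
y = post_process(mix, [gain] * len(mix))
```

Scaling the whole frame keeps the waveform shape intact, so the final clamp only touches the single sample that lands exactly on +32768.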

A sound mixing method, comprising the following steps:

Step 1: frame each channel signal that takes part in mixing;

Step 2: detect whether each framed channel signal contains a voice signal; whether a channel signal contains a voice signal is determined by detecting whether the current frame of that channel signal is in the voice-present state;

Step 3: according to the voice signal detection result, apply time-varying low-pass filtering to each framed channel signal: when the current frame of a channel signal is in the voice-present state, the passband gradually widens, and when the current frame of a channel signal is in the voice-absent state, the passband gradually narrows;

Step 4: according to the voice signal detection result, calculate the mean loudness of each framed channel signal within the current predetermined time period;

Step 5: according to the mean loudness calculation result, calculate the weight of each sample in the current frame of each channel signal that takes part in mixing;

Step 6: according to the current-frame output signal of each channel signal after time-varying low-pass filtering and the weight of each sample in the current frame of each channel signal, obtain and output the single-channel current frame signal mixed from the current frames of all channel signals;

Step 7: according to the single-channel current frame signal mixed from the current frames of all channel signals, calculate the weight of each sample and each output sample of the current frame of the mixed signal.

By adopting the above technical solution, the mixer and sound mixing method provided by the invention offer a new voice activity detection method, namely judging whether the current frame is a voice signal from the mean power of the voice signal. By introducing a time-varying filter, the invention solves the problem in current mixing techniques that every channel signal is mixed directly and unnecessary noise is introduced, while also avoiding the problem that "non-speaking" participants have no sense of presence when silence detection is used to reduce the number of mixed channels. The invention uses an equal-loudness control strategy: the weight of each channel signal is obtained by calculating its loudness, so that the mean loudness of all channel signals ends up nearly the same and their auditory effect is also similar. The invention improves the voice quality of voice signals, especially small signals, and also embodies fairness to each participant.

Brief description of the drawings

Fig. 1 is a structural block diagram of the mixer of the present invention;

Fig. 2 is a structural block diagram of the speech detection unit of the present invention;

Fig. 3 is a structural block diagram of the loudness computing unit of the present invention;

Fig. 4 is a work flow diagram of the mixer of the present invention;

Fig. 5 is a waveform diagram of three voice signals with different characteristics;

Fig. 6 is a waveform diagram of the three voice signals after processing by the loudness computing unit and the weight calculation unit of the present invention;

Fig. 7 is a waveform diagram of the output signal of the mixer of the present invention;

Fig. 8 is a flow chart of the sound mixing method of the present invention.

Specific embodiment

A mixer as shown in Fig. 1, Fig. 2, Fig. 3 and Fig. 4 comprises: a framing unit, for framing each channel signal that takes part in mixing; a speech detection unit connected to the framing unit, used to detect whether each framed channel signal contains a voice signal, which determines whether a channel signal contains a voice signal by detecting whether the current frame of that channel signal is in the voice-present state; a time-varying filter connected to the speech detection unit, used to apply time-varying low-pass filtering to each framed channel signal according to the detection result of the speech detection unit, where the passband of the time-varying filter gradually widens when the current frame of a channel signal is in the voice-present state and gradually narrows when the current frame of a channel signal is in the voice-absent state; a loudness computing unit connected to the speech detection unit, used to calculate, according to the detection result of the speech detection unit, the mean loudness of each framed channel signal within the current predetermined time period; a weight calculation unit connected to the loudness computing unit, used to calculate, according to the calculation result of the loudness computing unit, the weight of each sample in the current frame of each channel signal that takes part in mixing; a mixing unit connected to the time-varying filter and the weight calculation unit, used to obtain and output a single-channel current frame signal, mixed from the current frames of all channel signals, according to the current-frame output signal of each channel signal after time-varying low-pass filtering and the weight of each sample in the current frame of each channel signal; and a post-processing unit connected to the mixing unit, used to calculate, according to the single-channel current frame signal output by the mixing unit, the weight of each sample and each output sample of the current frame of the mixed signal.

Further, the speech detection unit comprises: a power computation module, for calculating the power of the current frame of each framed channel signal; a minimum frame power determination module connected to the power computation module, used to obtain, according to the calculation result of the power computation module, the minimum frame power of each framed channel signal within the current predetermined time period; and a voice state determination module connected to the power computation module and the minimum frame power determination module, used to detect whether a channel signal contains a voice signal by comparing the power of the current frame of that channel signal with the minimum frame power.

Further, the power computation module calculates the power of the current frame of a channel signal by a formula in which pow denotes the power of the current frame, x[i] denotes the i-th input datum of the current frame, and N denotes the number of samples in a frame.

The current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame. The minimum frame power determination module obtains the minimum frame power of a channel signal within the current predetermined time period by the formula pow_min = min{power of the current frame, power of the 1st frame before the current frame, ..., power of the r-th frame before the current frame}, where min{ } denotes the minimum of all data in the braces; r is determined by a formula in which ceil(x) denotes the integer closest to x that is greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame.

The voice state determination module uses a flag VAD to indicate whether a channel signal contains a voice signal, and initializes VAD = 1. When pow ≥ 32·pow_min and VAD = 0, the voice state determination module sets VAD to 1, indicating that the channel signal is in the voice-present state; when pow ≤ 4·pow_min and VAD = 1, the voice state determination module sets VAD to 0, indicating that the channel signal is in the voice-absent state; in all other cases of the comparison between pow and pow_min, the voice state determination module leaves VAD unchanged.

Further, the time-varying filter obtains the i-th filtered output value of the current frame of a channel signal by the formula f[i] = (1-b)*x[i] + b*f[i-1], where f[i] denotes the i-th filtered output value of the current frame of the channel signal, x[i] denotes the i-th input datum of the current frame, N denotes the number of samples in a frame, 0 ≤ i < N, and b denotes the filter coefficient. b is updated by one formula when the current frame is in the voice-present state and by another when the current frame is in the voice-absent state; when b < 0.18, b is set to 0.18, and when b > 0.956, b is set to 0.956; p1 denotes the number of samples over which b fades from 0.956 to 0.18, and p2 denotes the number of samples over which b fades from 0.18 to 0.956.

Further, the loudness computing unit comprises: a DFT module, a loudness acquisition module connected to the DFT module, and a mean loudness acquisition module connected to the loudness acquisition module. When the current frame of a channel signal is in the voice-present state, the DFT module applies a DFT to the current frame of that channel signal, the loudness acquisition module then calculates the loudness value of the current frame, and finally the mean loudness acquisition module calculates the mean loudness of that channel signal within the current predetermined time period. When the current frame of a channel signal is in the voice-absent state, the mean loudness of that channel signal within the current predetermined time period is equal to the mean loudness within the most recent predetermined time period before the current frame that contained a voice signal.

Further, the DFT module applies a DFT to the current frame of a channel signal by a formula in which s denotes the discrete frequency, x[i] denotes the i-th input datum of the current frame, X[s] denotes the result obtained from x[i] by the DFT, and j denotes the imaginary unit, j² = -1.

The loudness acquisition module calculates the loudness value of the current frame by a formula in which loudness denotes the loudness value of the current frame, X[s] denotes the result obtained from x[i] by the DFT, Equal[s] denotes an equal-loudness array with preset values, s₂₀ = ceil(20*N/F_s), s₂₀₀₀₀ = floor(20000*N/F_s), ceil(x) denotes the integer closest to x that is greater than or equal to x, floor(x) denotes the integer closest to x that is less than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame. The mean loudness acquisition module calculates the mean loudness of a channel signal within the current predetermined time period by a formula; the current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame; r is determined by a formula in which ceil(x) denotes the integer closest to x that is greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in a frame.

Further, the weight calculation unit obtains the current-frame weight of the k-th channel signal by a formula, and obtains the weight of the i-th sample of the current frame of the k-th channel signal by the formula weight_k[i] = weight_k, 0 ≤ i < N, where weight_k denotes the current-frame weight of the k-th channel signal, LOUD_k denotes the mean loudness of the k-th channel signal within the current predetermined time period, M denotes the number of channel signals taking part in mixing, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal. The mixing unit obtains the single-channel current frame signal mixed from the current frames of all channel signals by a formula in which Mix[i] denotes the i-th sample of the mixed single-channel current frame signal, M denotes the number of channel signals taking part in mixing, f_k[i] denotes the i-th output sample of the current frame of the k-th channel signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, and k = 1, 2, ..., M. The post-processing unit is also used to obtain the maximum sample in the current frame of the mixed signal and to calculate the current-frame weight of the mixed signal.

Further, the weight calculation unit is also used to smooth the frame weights of each channel signal taking part in mixing. Further, the weight calculation unit smooths the frame weights of a channel signal as follows: when the current frame is the first frame of a channel signal, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by the formula weight_k[i] = weight_k, 0 ≤ i < N; when the current frame is not the first frame of a channel signal, the weight calculation unit obtains the weight of the i-th sample of the current frame of the k-th channel signal by a formula in which weight_k[i] denotes the weight of the i-th sample of the current frame of the k-th channel signal, N denotes the number of samples in a frame, and P denotes the number of samples over which weight_k[i] fades gradually from the weight of the frame preceding the current frame to the weight of the current frame.

The post-processing unit obtains the maximum sample in the current frame of the mixed signal by the formula signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample in the current frame of the mixed signal, max{ } denotes the maximum of the data in the braces, Mix[0] denotes the output of sample 0 of the current frame of the mixed signal, Mix[1] denotes the output of sample 1 of the current frame of the mixed signal, and Mix[N-1] denotes the output of sample N-1 of the current frame of the mixed signal. When signal_max ≤ 32768, the post-processing unit sets the current-frame weight of the mixed signal to weight_mix = 1; when signal_max > 32768, the post-processing unit calculates the current-frame weight weight_mix of the mixed signal by a formula. When the current frame of the mixed signal is the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal by the formula weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal by a formula in which weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal and Q denotes the number of samples over which weight_mix[i] fades gradually from the weight of the preceding frame of the mixed signal to the weight of the current frame of the mixed signal. The post-processing unit obtains the i-th output sample y[i] of the current frame of the mixed signal by a formula in which final[i] = Mix[i]*weight_mix[i], 0 ≤ i < N.

The present invention also provides a sound mixing method which, as shown in Fig. 8, comprises the following steps:

Further, step 1 specifically comprises the following steps:

Step 11: calculate the power of the current frame in each signal after framing;

Step 12: according to the calculated power of the current frame in each signal, obtain the minimum frame power of each signal after framing within the current predetermined time period;

Step 13: detect whether a signal contains a voice signal by comparing the power of the current frame in that signal with the minimum frame power;

Further,

The power of the current frame in a signal is calculated by the formula [formula not reproduced], where pow denotes the power of the current frame, x[i] denotes the i-th input sample of the current frame, and N denotes the number of samples in one frame;

The current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame. The minimum frame power of a signal within the current predetermined time period is obtained by the formula pow_min = min{power of the current frame, power of the 1st frame before the current frame, ..., power of the r-th frame before the current frame}, where min{ } denotes the minimum of all values in the braces; in the formula: [formula not reproduced], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in one frame;

A flag VAD is set to indicate whether a signal contains a voice signal, and VAD is initialized to 1. When pow ≥ 32*pow_min and VAD = 0, VAD is set to 1, indicating that the signal is in the voice-present state; when pow ≤ 4*pow_min and VAD = 1, VAD is set to 0, indicating that the signal is in the no-voice state; for any other result of the comparison between pow and pow_min, VAD remains unchanged;
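The detection rule above can be sketched as follows. This is only an illustrative sketch: the power formula is not reproduced in this text, so the mean-square form in `frame_power` is an assumption, and the names `frame_power` and `SpeechDetector` are hypothetical. The thresholds 32 and 4, the initial value VAD = 1 and the window of the current frame plus the r previous frames are taken from the description.

```python
from collections import deque


def frame_power(x):
    """ASSUMPTION: pow is the mean of the squared samples of one frame."""
    return sum(v * v for v in x) / len(x)


class SpeechDetector:
    """Hysteresis detector driven by pow and pow_min (hypothetical helper)."""

    def __init__(self, r):
        # Keeps the power of the current frame plus the r previous frames.
        self.history = deque(maxlen=r + 1)
        self.vad = 1  # initial value stated in the description

    def update(self, x):
        p = frame_power(x)
        self.history.append(p)
        p_min = min(self.history)
        if p >= 32 * p_min and self.vad == 0:
            self.vad = 1   # enter the voice-present state
        elif p <= 4 * p_min and self.vad == 1:
            self.vad = 0   # enter the no-voice state
        return self.vad    # otherwise unchanged
```

The gap between the two thresholds gives the detector hysteresis, so a frame whose power lies between 4*pow_min and 32*pow_min never toggles the state.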

Further, step 2 is specifically:

The i-th filtered output value of the current frame in a signal is obtained by the formula f[i] = (1-b)*x[i] + b*f[i-1], where f[i] denotes the i-th filtered output value of the current frame in the signal, x[i] denotes the i-th input sample of the current frame, N denotes the number of samples in one frame, 0 ≤ i < N, and b denotes the filter coefficient. When the current frame is in the voice-present state, b is given by [formula not reproduced]; when the current frame is in the no-voice state, b is given by [formula not reproduced]. When b < 0.18, b is set to 0.18; when b > 0.956, b is set to 0.956. p1 denotes the number of samples in the time span over which b ramps from 0.956 to 0.18, and p2 denotes the number of samples in the time span over which b ramps from 0.18 to 0.956;
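A minimal sketch of this one-pole time-varying low-pass filter is given below. The recursion f[i] = (1-b)*x[i] + b*f[i-1] and the limits 0.18 and 0.956 come from the description; the two update rules for b are not reproduced in this text, so the linear per-sample ramp used here (a step of (0.956-0.18)/p1 downward during voice and (0.956-0.18)/p2 upward otherwise) is an assumption inferred from the definitions of p1 and p2. The function name `filter_frame` is hypothetical.

```python
B_MIN, B_MAX = 0.18, 0.956


def filter_frame(x, vad, b, f_prev, p1, p2):
    """Filter one frame; returns (outputs, updated b, last output sample)."""
    out = []
    for sample in x:
        if vad:
            # ASSUMED linear ramp: b falls toward 0.18, the passband widens.
            b -= (B_MAX - B_MIN) / p1
        else:
            # ASSUMED linear ramp: b rises toward 0.956, the passband narrows.
            b += (B_MAX - B_MIN) / p2
        b = min(max(b, B_MIN), B_MAX)           # clamp as stated in the text
        f_prev = (1 - b) * sample + b * f_prev  # f[i] = (1-b)*x[i] + b*f[i-1]
        out.append(f_prev)
    return out, b, f_prev
```

Carrying `b` and `f_prev` from one frame to the next keeps the filter continuous across frame boundaries. With the example values p1 = 960 and p2 = 96000 at 48 kHz, the ramps would last 20 ms and 2 s respectively.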

Further,

When the current frame in a signal is in the voice-present state, step 3 specifically comprises the following steps:

Step 31: perform a DFT on the current frame of the signal;

Step 32: calculate the loudness value of the current frame;

Step 33: calculate the mean loudness of the signal within the current predetermined time period;

Further,

The DFT of the current frame in a signal is computed by the formula [formula not reproduced]; in the formula: [formula not reproduced], s denotes the discrete frequency index, x[i] denotes the i-th input sample of the current frame, X[s] denotes the result of applying the DFT to x[i], and j denotes the imaginary unit, j² = -1;

The loudness value of the current frame is calculated using the formula [formula not reproduced], where loudness denotes the loudness value of the current frame, X[s] denotes the result of applying the DFT to x[i], Equal[s] denotes the equal-loudness array with preset values, s₂₀ = ceil(20*N/F_s), s₂₀₀₀₀ = floor(20000*N/F_s), ceil(x) denotes the smallest integer greater than or equal to x, floor(x) denotes the largest integer less than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in one frame;
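The index limits s₂₀ and s₂₀₀₀₀ follow directly from the definitions above and can be checked numerically. The loudness formula itself is not reproduced in this text, so `frame_loudness` below is only an assumed form (an equal-loudness-weighted sum of DFT magnitudes over the bins from s₂₀ to s₂₀₀₀₀); both function names are hypothetical.

```python
import math


def loudness_bin_limits(n, fs):
    """Return (s20, s20000) as defined in the text: ceil(20*N/F_s), floor(20000*N/F_s)."""
    s20 = math.ceil(20 * n / fs)
    s20000 = math.floor(20000 * n / fs)
    return s20, s20000


def frame_loudness(mags, equal, s20, s20000):
    """ASSUMED form: sum of |X[s]| * Equal[s] for s20 <= s <= s20000."""
    return sum(mags[s] * equal[s] for s in range(s20, s20000 + 1))


print(loudness_bin_limits(2048, 48000))  # (1, 853)
```

For N = 2048 and F_s = 48 kHz, the limits restrict the sum to the bins covering roughly the audible band of 20 Hz to 20 kHz.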

The mean loudness of a signal within the current predetermined time period is calculated by the formula [formula not reproduced]. The current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame; in the formula: [formula not reproduced], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in one frame;
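One way to track the mean loudness over the current predetermined time period is sketched below. The averaging formula is not reproduced in this text, so the plain arithmetic mean over the most recent r+1 voice frames is an assumption, and the class name `MeanLoudness` is hypothetical. Holding the previous value while a frame is in the no-voice state follows the behaviour the patent specifies for the loudness computing unit.

```python
from collections import deque


class MeanLoudness:
    """Running mean loudness of one signal (hypothetical helper)."""

    def __init__(self, r):
        self.window = deque(maxlen=r + 1)  # loudness of recent voice frames
        self.value = 0.0

    def update(self, loudness, vad):
        if vad:
            # ASSUMPTION: arithmetic mean over the frames in the window.
            self.window.append(loudness)
            self.value = sum(self.window) / len(self.window)
        # In the no-voice state the last voice-based value is kept.
        return self.value
```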

Further, the current frame weight of the k-th signal is obtained by the formula [formula not reproduced], and the weight of the i-th sample of the current frame in the k-th signal is obtained by the formula weight_k[i] = weight_k, 0 ≤ i < N, where weight_k denotes the current frame weight of the k-th signal, LOUD_k denotes the mean loudness of the k-th signal within the current predetermined time period, M denotes the number of signals participating in the mixing, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame in the k-th signal;
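The weight formula is not reproduced in this text. As an illustration of the stated goal (equal mean loudness across the M signals), the sketch below assumes weight_k = mean(LOUD_1..LOUD_M) / LOUD_k; this choice, the zero-loudness guard and the function names are assumptions, not the patented formula. The per-sample rule weight_k[i] = weight_k comes from the description.

```python
def equal_loudness_weights(louds):
    """ASSUMED rule: scale every signal to the average of the mean loudness values."""
    target = sum(louds) / len(louds)
    # A signal with zero mean loudness gets weight 0 to avoid division by zero.
    return [target / loud if loud > 0 else 0.0 for loud in louds]


def per_sample_weights(weight_k, n):
    """weight_k[i] = weight_k for 0 <= i < N (no smoothing)."""
    return [weight_k] * n
```

Under this assumed rule, a quiet signal receives a weight above 1 and a loud signal a weight below 1, so that weight_k * LOUD_k is the same for every k.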

Further, the single current-frame signal that results from mixing the current frames of all signals is obtained by the formula [formula not reproduced], where Mix[i] denotes the i-th sample of the mixed current-frame signal, M denotes the number of signals participating in the mixing, f_k[i] denotes the i-th output sample of the current frame of the k-th signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame in the k-th signal, and k = 1, 2, ..., M;
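Given the definitions of Mix[i], f_k[i] and weight_k[i], the mixing step can be sketched as a weighted sum over the M signals. The formula image is not reproduced in this text, so the plain sum used here (with no extra normalisation by M) is an assumption, and `mix_frame` is a hypothetical name.

```python
def mix_frame(filtered, weights):
    """ASSUMED form: Mix[i] = sum over k of weight_k[i] * f_k[i].

    filtered: list of M frames f_k, each a list of N samples
    weights:  list of M per-sample weight lists weight_k, each of length N
    """
    n = len(filtered[0])
    return [
        sum(w[i] * f[i] for f, w in zip(filtered, weights))
        for i in range(n)
    ]
```

Because the weights are applied per sample, the same function works whether or not the frame weights have been smoothed.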

The method further comprises the following step: obtaining the maximum sample in the current frame of the mixed signal and calculating the current frame weight of the mixed signal;

Further, step 4 is followed by the following step: smoothing the weight of each frame in each signal participating in the mixing;

Further, the weight of each frame in a signal is smoothed as follows:

When the current frame is the first frame of the signal, the weight of the i-th sample of the current frame in the k-th signal is obtained by the formula weight_k[i] = weight_k, 0 ≤ i < N;

When the current frame is not the first frame of the signal, the weight of the i-th sample of the current frame in the k-th signal is obtained by the formula [formula not reproduced], where weight_k[i] denotes the weight of the i-th sample of the current frame in the k-th signal, N denotes the number of samples in one frame, and P denotes the number of samples over which weight_k[i] ramps from the weight of the previous frame to the weight of the current frame;
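The smoothing formula is not reproduced in this text. The sketch below assumes a linear ramp over the first P samples of the frame, from the previous frame weight to the current frame weight, followed by the constant current frame weight; the ramp shape and the name `smooth_weights` are assumptions. The first-frame rule weight_k[i] = weight_k comes from the description.

```python
def smooth_weights(prev_w, cur_w, n, p, first_frame=False):
    """Per-sample weights for one frame of N samples."""
    if first_frame:
        return [cur_w] * n  # weight_k[i] = weight_k, 0 <= i < N
    out = []
    for i in range(n):
        if i < p:
            # ASSUMED linear transition from prev_w to cur_w over P samples.
            out.append(prev_w + (cur_w - prev_w) * (i + 1) / p)
        else:
            out.append(cur_w)
    return out
```

The same ramp could be reused for the weight of the mixed signal, with Q in place of P.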

The maximum sample in the current frame of the mixed signal is obtained by the formula signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample in the current frame of the mixed signal, max{ } denotes the maximum of the values in the braces, and Mix[0], Mix[1] and Mix[N-1] denote output samples 0, 1 and N-1 of the current frame of the mixed signal;

When signal_max ≤ 32768, the current frame weight of the mixed signal is set to weight_mix = 1; when signal_max > 32768, the current frame weight of the mixed signal is calculated by [formula not reproduced];

When the current frame of the mixed signal is the first frame, the weight of the i-th sample of the current frame of the mixed signal is obtained by the formula weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, that weight is obtained by the formula [formula not reproduced], where weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal, and Q denotes the number of samples over which weight_mix[i] ramps from the previous frame weight of the mixed signal to the current frame weight of the mixed signal;

The i-th output sample y[i] of the current frame of the mixed signal is obtained by the formula [formula not reproduced], where final[i] = Mix[i]*weight_mix[i], 0 ≤ i < N.
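The overflow handling can be sketched as follows. The threshold 32768 and the relation final[i] = Mix[i]*weight_mix[i] come from the description; the formula for weight_mix when signal_max > 32768 and the formula for y[i] are not reproduced in this text, so the scaling 32767/signal_max and the clamp to the 16-bit range [-32768, 32767] are assumptions, as are the function names.

```python
def overflow_weight(mix):
    """Current frame weight of the mixed signal."""
    signal_max = max(abs(v) for v in mix)
    if signal_max <= 32768:
        return 1.0
    return 32767.0 / signal_max  # ASSUMED scaling into the 16-bit range


def saturate(final):
    """ASSUMED form of y[i]: clamp final[i] to [-32768, 32767]."""
    return [int(max(-32768, min(32767, round(v)))) for v in final]


def post_process(mix, weight_mix):
    """final[i] = Mix[i] * weight_mix[i], followed by saturation."""
    return saturate([m * w for m, w in zip(mix, weight_mix)])
```

Saturation is still needed after scaling because a smoothed weight_mix[i] may lag behind a sudden peak during the first Q samples of a frame.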

The upper limit of the passband width of the time-varying filter of the present invention is 20 kHz and the lower limit is 0.3 kHz; when the filter passband would exceed the upper limit or fall below the lower limit, it is held at the corresponding limit. Before time-varying low-pass filtering is performed, the time-varying filter is first initialized, specifically by setting f[-1] = 0 and b = 0.18, at which point the passband of the time-varying filter is 0 to 20 kHz. The current predetermined time period of the present invention may be taken as the current 4 s; the previous predetermined time period refers to the 4 s preceding the current 4 s. In the present invention, Equal[s] denotes the equal-loudness array with preset values; the preset values in the equal-loudness array are obtained from the equal-loudness contours, and the specific values are given in Table 1;

Table 1. Values of the equal-loudness array Equal[s].

The specific procedure for obtaining the value of Equal[s] is described below with reference to Table 1: 1. calculate the value of s*N/F_s; 2. look up in Table 1 the frequency range corresponding to that value; 3. read off the Equal[s] value corresponding to the frequency range found in Table 1. For example, when s*N/F_s = 1, the value falls in the frequency range (0.985 to 1.500) of Table 1, so Equal[s] = 1.5.

To make the mean loudness of every signal identical, the equal-loudness weight of each signal needs to be calculated. With the smoothing step, the weight transitions between frames over P samples, which ensures that the weight data change smoothly from frame to frame, so that the voice signal sounds smoother and voice quality is preserved. If the weight calculation unit does not apply the preferred smoothing to the frame weights of each signal, then the weight of every sample in the current frame of the k-th signal (weight_k[i], i = 0, 1, 2, ..., N-1) equals the current frame weight weight_k of the k-th signal, specifically, [formula not reproduced]. If the weight calculation unit applies the preferred smoothing to the frame weights of each signal, then when the current frame is the first frame of a signal, the weight calculation unit obtains the weight of the i-th sample of the current frame in the k-th signal by the formula weight_k[i] = weight_k, 0 ≤ i < N; when the current frame is not the first frame of a signal, the weight calculation unit obtains the weight of the i-th sample of the current frame in the k-th signal by the formula [formula not reproduced]. That is, after smoothing, the sample weights in the current frame of the k-th signal are not all equal to the current frame weight weight_k of the k-th signal. In addition, after the signals are mixed, post-processing may be applied to prevent the frequent overflow that can occur when multiple voice signals are summed; the post-processing also keeps the weight data of successive frames steady, which benefits the smoothness of the mixed voice. The present invention calculates the weights using the loudness of each voice signal as the criterion, so that the mean loudness of all voice signals is identical, and performs overflow handling last, so that after mixing the voice signals are perceived as nearly equally loud and overflow is infrequent.

The effectiveness of the present invention is further illustrated below by mixing three voice signals with different characteristics using the mixer of the present invention, where the sampling frequency F_s is 48 kHz, the number of samples N in one frame is 2048, the current predetermined time period is the current 4 seconds, the frame count r is 100, the number of samples p1 in the time span over which b ramps from 0.956 to 0.18 is 960, the number of samples p2 in the time span over which b ramps from 0.18 to 0.956 is 96000, the number of samples P over which weight_k[i] ramps from the weight of the previous frame to the weight of the current frame is 100, and the number of samples Q over which weight_mix[i] ramps from the previous frame weight of the mixed signal to the current frame weight of the mixed signal is 100;

Fig. 5 shows the waveforms of the three voice signals with different characteristics. As shown in Fig. 5, to verify the mixing effect for small signals, the amplitude range of the first voice signal (signal 1) is -3500 to 3500, much smaller than that of the other two voice signals; because loudness is proportional to amplitude, the loudness of the first voice signal is also much lower than that of the other two. To verify the effectiveness of the speech detection unit and the time-varying filter, the second voice signal (signal 2) alternates between the "voice" state and the "no voice" state and has uniform white noise added. The third voice signal (signal 3) has a small amplitude in its first part and a larger amplitude in its later part, so that the amplitudes of its earlier and later parts can be compared after mixing to analyze the change in its loudness. Fig. 6 shows the waveforms of the three voice signals after processing by the loudness computing unit and the weight calculation unit of the present invention. As shown in Fig. 6, after processing by the speech detection unit, the time-varying filter, the loudness computing unit and the weight calculation unit, the three voice signals have changed in different ways: the amplitude of the first voice signal (signal 1) has increased markedly, and its loudness with it; where the second voice signal (signal 2) remains in the "no voice" state, the uniform white noise in the signal has been attenuated and the signal amplitude has decreased; the amplitude of the first part of the third voice signal (signal 3) has increased, while that of its later part has decreased. Fig. 7 shows the waveform of the output signal of the mixer of the present invention. As shown in Fig. 7, the three voice signals are summed by the downmixing unit and then passed through the post-processing unit, so that overflow in the final output signal is infrequent and any overflow has been saturated. The above test results show that, after equal-loudness control by the mixer of the present invention, three voice signals of different loudness have nearly equal mean loudness; since the three voice signals have different characteristics, the results also demonstrate the good robustness and stability of the present invention.

The present invention provides a new voice activity detection approach, namely judging whether the current frame is a voice signal from the mean power of the voice signal. By introducing the time-varying filter, the present invention solves the problem in current mixing techniques that each signal participates in the mixing directly and introduces unnecessary noise, while also avoiding the problem that a participant who is "not speaking" loses all sense of presence when silence detection is used to reduce the number of signals being mixed. The present invention uses an equal-loudness control strategy: the weight of each signal is obtained by calculating its loudness, so that the mean loudness of the signals ends up nearly identical and they also sound similar. While improving the voice quality of small signals, the present invention also reflects fairness to every participant.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A mixer, comprising:

characterized in that the mixer further comprises:

a speech detection unit connected to the framing unit, the speech detection unit being used to detect whether each signal after framing contains a voice signal; the speech detection unit determines whether a signal contains a voice signal by detecting whether the current frame in that signal is in the voice-present state;

a time-varying filter connected to the speech detection unit, the time-varying filter being used to perform time-varying low-pass filtering on each signal after framing according to the detection result of the speech detection unit; when the current frame in a signal is in the voice-present state, the passband of the time-varying filter gradually widens, and when the current frame in a signal is in the no-voice state, the passband of the time-varying filter gradually narrows;

a loudness computing unit connected to the speech detection unit, the loudness computing unit being used to calculate, according to the detection result of the speech detection unit, the mean loudness of each signal after framing within the current predetermined time period;

a weight calculation unit connected to the loudness computing unit, the weight calculation unit being used to calculate, according to the calculation result of the loudness computing unit, the weight of each sample in the current frame of each signal participating in the mixing;

a downmixing unit connected to the time-varying filter and the weight calculation unit, the downmixing unit being used to obtain and output the single current-frame signal that results from mixing the current frames of all signals, according to the current-frame output of each signal after time-varying low-pass filtering and the weight of each sample in the current frame of each signal;

a post-processing unit connected to the downmixing unit, the post-processing unit being used to calculate, according to the single current-frame signal output by the downmixing unit, the weight of each sample and each output sample of the current frame of the mixed signal.

2. The mixer according to claim 1, characterized in that the speech detection unit comprises:

a minimum frame power determination module connected to the power computation module, the minimum frame power determination module being used to obtain, according to the calculation result of the power computation module, the minimum frame power of each signal after framing within the current predetermined time period;

a voice state determination module connected to the power computation module and the minimum frame power determination module, the voice state determination module being used to detect whether a signal contains a voice signal by comparing the power of the current frame in that signal with the minimum frame power.

3. The mixer according to claim 2, characterized in that

the power computation module calculates the power of the current frame in a signal by the formula [formula not reproduced], where pow denotes the power of the current frame, x[i] denotes the i-th input sample of the current frame, and N denotes the number of samples in one frame;

the current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame; the minimum frame power determination module obtains the minimum frame power of a signal within the current predetermined time period by the formula pow_min = min{power of the current frame, power of the 1st frame before the current frame, ..., power of the r-th frame before the current frame}, where min{ } denotes the minimum of all values in the braces; in the formula: [formula not reproduced], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in one frame;

the voice state determination module sets a flag VAD to indicate whether a signal contains a voice signal and initializes VAD to 1; when pow ≥ 32*pow_min and VAD = 0, the voice state determination module sets VAD to 1, indicating that the signal is in the voice-present state; when pow ≤ 4*pow_min and VAD = 1, the voice state determination module sets VAD to 0, indicating that the signal is in the no-voice state; for any other result of the comparison between pow and pow_min, the voice state determination module leaves VAD unchanged.

4. The mixer according to claim 1, characterized in that the time-varying filter obtains the i-th filtered output value of the current frame in a signal by the formula f[i] = (1-b)*x[i] + b*f[i-1], where f[i] denotes the i-th filtered output value of the current frame in the signal, x[i] denotes the i-th input sample of the current frame, N denotes the number of samples in one frame, 0 ≤ i < N, and b denotes the filter coefficient; when the current frame is in the voice-present state, b is given by [formula not reproduced]; when the current frame is in the no-voice state, b is given by [formula not reproduced]; when b < 0.18, b is set to 0.18; when b > 0.956, b is set to 0.956; p1 denotes the number of samples in the time span over which b ramps from 0.956 to 0.18, and p2 denotes the number of samples in the time span over which b ramps from 0.18 to 0.956.

5. The mixer according to claim 1, characterized in that

the loudness computing unit comprises: a DFT module, a loudness obtaining module connected to the DFT module, and a mean loudness obtaining module connected to the loudness obtaining module;

when the current frame in a signal is in the voice-present state, the DFT module performs a DFT on the current frame of the signal, the loudness obtaining module then calculates the loudness value of the current frame, and finally the mean loudness obtaining module calculates the mean loudness of the signal within the current predetermined time period;

when the current frame in a signal is in the no-voice state, the mean loudness of the signal within the current predetermined time period is equal to the mean loudness within the most recent predetermined time period before the current frame that contained a voice signal.

6. The mixer according to claim 5, characterized in that

the DFT module performs a DFT on the current frame in a signal by the formula [formula not reproduced]; in the formula: [formula not reproduced], s denotes the discrete frequency index, x[i] denotes the i-th input sample of the current frame, X[s] denotes the result of applying the DFT to x[i], and j denotes the imaginary unit, j² = -1;

the loudness obtaining module calculates the loudness value of the current frame using the formula [formula not reproduced], where loudness denotes the loudness value of the current frame, X[s] denotes the result of applying the DFT to x[i], Equal[s] denotes the equal-loudness array with preset values, s₂₀ = ceil(20*N/F_s), s₂₀₀₀₀ = floor(20000*N/F_s), ceil(x) denotes the smallest integer greater than or equal to x, floor(x) denotes the largest integer less than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in one frame;

the mean loudness obtaining module calculates the mean loudness of a signal within the current predetermined time period by the formula [formula not reproduced]; the current predetermined time period denotes the duration T from the r-th frame before the current frame to the current frame; in the formula: [formula not reproduced], ceil(x) denotes the smallest integer greater than or equal to x, F_s denotes the sampling frequency, and N denotes the number of samples in one frame.

7. The mixer according to claim 6, characterized in that

the weight calculation unit obtains the current frame weight of the k-th signal by the formula [formula not reproduced], and obtains the weight of the i-th sample of the current frame in the k-th signal by the formula weight_k[i] = weight_k, 0 ≤ i < N, where weight_k denotes the current frame weight of the k-th signal, LOUD_k denotes the mean loudness of the k-th signal within the current predetermined time period, M denotes the number of signals participating in the mixing, k = 1, 2, ..., M, and weight_k[i] denotes the weight of the i-th sample of the current frame in the k-th signal;

the downmixing unit obtains the single current-frame signal that results from mixing the current frames of all signals by the formula [formula not reproduced], where Mix[i] denotes the i-th sample of the mixed current-frame signal, M denotes the number of signals participating in the mixing, f_k[i] denotes the i-th output sample of the current frame of the k-th signal after time-varying low-pass filtering, weight_k[i] denotes the weight of the i-th sample of the current frame in the k-th signal, and k = 1, 2, ..., M;

the post-processing unit is also used to obtain the maximum sample in the current frame of the mixed signal and to calculate the current frame weight of the mixed signal.

8. The mixer according to claim 7, characterized in that

the weight calculation unit is also used to smooth the weight of each frame in each signal participating in the mixing.

9. The mixer according to claim 8, characterized in that

when the current frame is the first frame of a signal, the weight calculation unit obtains the weight of the i-th sample of the current frame in the k-th signal by the formula weight_k[i] = weight_k, 0 ≤ i < N;

the post-processing unit obtains the maximum sample in the current frame of the mixed signal by the formula signal_max = max{|Mix[0]|, |Mix[1]|, ..., |Mix[N-1]|}, where signal_max denotes the maximum sample in the current frame of the mixed signal, max{ } denotes the maximum of the values in the braces, and Mix[0], Mix[1] and Mix[N-1] denote output samples 0, 1 and N-1 of the current frame of the mixed signal;

when signal_max ≤ 32768, the post-processing unit sets the current frame weight of the mixed signal to weight_mix = 1; when signal_max > 32768, the post-processing unit calculates the current frame weight of the mixed signal by [formula not reproduced];

when the current frame of the mixed signal is the first frame, the post-processing unit obtains the weight of the i-th sample of the current frame of the mixed signal by the formula weight_mix[i] = weight_mix, 0 ≤ i < N; when the current frame of the mixed signal is not the first frame, the post-processing unit obtains that weight by the formula [formula not reproduced], where weight_mix[i] denotes the weight of the i-th sample of the current frame of the mixed signal, and Q denotes the number of samples over which weight_mix[i] ramps from the previous frame weight of the mixed signal to the current frame weight of the mixed signal;

the post-processing unit obtains the i-th output sample y[i] of the current frame of the mixed signal by the formula [formula not reproduced], where final[i] = Mix[i]*weight_mix[i], 0 ≤ i < N.

10. A sound mixing method, characterized in that the method comprises the following steps:

Step 2: detect whether each signal after framing contains a voice signal; whether a signal contains a voice signal is determined by detecting whether the current frame in that signal is in the voice-present state;

Step 3: according to the voice signal detection result, perform time-varying low-pass filtering on each signal after framing: when the current frame in a signal is in the voice-present state, the passband gradually widens, and when the current frame in a signal is in the no-voice state, the passband gradually narrows;

Step 4: according to the voice signal detection result, calculate the mean loudness of each signal after framing within the current predetermined time period;

Step 5: according to the mean loudness calculation result, calculate the weight of each sample in the current frame of each signal participating in the mixing;

Step 6: according to the current-frame output of each signal after time-varying low-pass filtering and the weight of each sample in the current frame of each signal, obtain and output the single current-frame signal that results from mixing the current frames of all signals;

Step 7: according to the single current-frame signal that results from mixing the current frames of all signals, calculate the weight of each sample and each output sample of the current frame of the mixed signal.