CN101414463B

CN101414463B - Method, apparatus and system for encoding mixed sound

Info

Publication number: CN101414463B
Application number: CN2007101813767A
Authority: CN
Inventors: 张清; 苗磊; 李伟; 许剑峰; 许丽净; 杜正中; 胡晨; 杨毅; 齐峰岩
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2007-10-19
Filing date: 2007-10-19
Publication date: 2011-08-10
Anticipated expiration: 2027-10-19
Also published as: CN101414463A

Abstract

The invention discloses a terminal-side encoding method. An audio mixing flag is set for sound information according to an audio mixing strategy, and the sound information is encoded according to the audio mixing flag to obtain core encoded data. If the audio mixing flag indicates that mixing is needed, dynamic side information is computed, and an audio code stream containing the audio mixing flag, the core encoded data and the dynamic side information is generated and output. If the audio mixing flag indicates that mixing is not needed, the terminal generates and outputs an audio code stream containing the audio mixing flag and the core encoded data. The invention further discloses a corresponding network-side audio mixing method, as well as a device and a system for audio mixing encoding. The method helps to solve the problems of signal overflow and error during audio mixing without reducing coding efficiency.

Description

Audio mixing encoding method, device and system

Technical field

The present invention relates to the field of multimedia communication technology, and in particular to an audio mixing encoding method, device and system.

Background technology

Real-time multimedia communication services are increasingly widely used. To meet growing service demands, for example for multimedia conference systems, the technologies related to multimedia conferencing have become very important.

In a multimedia conference, audio interaction is the most basic element. In a centralized conference, every terminal establishes a unicast connection with a multipoint control unit (Multipoint Control Unit, MCU), sends its audio code stream to the MCU in real time, and receives an audio code stream from the MCU. The inputs of the MCU are therefore audio code streams encoded with various coding schemes, and its outputs are audio code streams that have been mixed according to a synthesis strategy.

Fig. 1 is a schematic diagram of a multimedia conference system, in which the dashed box can be regarded as an MCU. The audio code streams input from terminal 1, terminal 2 and so on are decoded separately, the decoded audio is mixed by the audio mixing unit, and the mixed audio is then encoded separately and output to the corresponding terminals. In the system of Fig. 1, M terminals take part in the mixing. At a given time t, each terminal sends audio data to the MCU. The MCU first decodes the audio data, computes the mixing parameters for each signal, and finally mixes the decoded signals. The commonly used mixing algorithm sums the decoded data of all channels; the sum is then encoded by an encoder and sent to each terminal.

Time-domain superposition mixing of this kind often introduces noise. The audio signal that each terminal sends to the MCU lies within a range [min, max], where min is the lower limit and max is the upper limit. When the signals of all channels are summed directly, the result may fall outside [min, max]: because a digital audio signal has upper and lower quantization limits, the summation may overflow. The usual remedy is to detect overflow and then apply saturation arithmetic, so that a result above the upper limit is set to the upper limit and a result below the lower limit is set to the lower limit. This operation destroys the original time-domain characteristics of the speech signal and thereby introduces noise, which is why popping sounds and discontinuous speech occur in some systems.
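The overflow behaviour described above can be illustrated with a minimal Python sketch. It assumes 16-bit signed samples, so that [min, max] is [-32768, 32767]; the function name `mix_with_saturation` and the sample values are hypothetical and are not taken from the patent.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed quantization limits [min, max]


def mix_with_saturation(channels):
    """Sum equal-length channels sample by sample and clamp to [min, max].

    Returns the mixed samples and the number of samples that were clamped.
    """
    mixed = []
    clamped = 0
    for samples in zip(*channels):
        total = sum(samples)
        if total > INT16_MAX:
            total = INT16_MAX      # saturate at the upper limit
            clamped += 1
        elif total < INT16_MIN:
            total = INT16_MIN      # saturate at the lower limit
            clamped += 1
        mixed.append(total)
    return mixed, clamped


# Three hypothetical decoded channels, three samples each.
channels = [
    [20000, -15000, 100],
    [15000, -20000, 200],
    [1000, 1000, 300],
]
mixed, clamped = mix_with_saturation(channels)
```

Every clamped sample differs from the true sum, which is the distortion the passage attributes to saturation arithmetic; adding more channels makes clamping more likely.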

As the number of terminals taking part in the mixing grows, overflow becomes more frequent, so time-domain superposition mixing has an upper limit on the number of terminals, and that limit is very low. Experiments show that in many cases, when four terminals take part in the mixing, the result already contains so much noise and so many interruptions that the speech can no longer be made out.

Summary of the invention

In view of this, an embodiment of the invention proposes an audio mixing encoding method that can overcome the noise problem of time-domain audio mixing in the prior art. The method comprises the following steps:

An audio mixing flag is set for the sound information according to an audio mixing strategy, the sound information is encoded according to the audio mixing flag, and the result of the encoding is taken as core encoded data;

If the audio mixing flag indicates that mixing is needed, dynamic side information is computed, and an audio code stream containing the audio mixing flag, the core encoded data and the dynamic side information is generated and output; if the audio mixing flag indicates that mixing is not needed, an audio code stream containing the audio mixing flag and the core encoded data is generated and output;

The network side receives the audio code streams from the terminals and judges from the audio mixing flag in each stream whether that stream needs mixing. From the M' audio code streams that need mixing, it selects N streams according to their dynamic side information, mixes the core encoded data of the selected N streams, and outputs the mixed audio code streams, where N is less than or equal to M'.

An embodiment of the invention also proposes a terminal-side encoding method, comprising the following steps:

An audio mixing flag is set for the sound information according to an audio mixing strategy, and the sound information is encoded according to the audio mixing flag to obtain core encoded data;

If the audio mixing flag indicates that mixing is needed, dynamic side information is computed, and an audio code stream containing the audio mixing flag, the core encoded data and the dynamic side information is generated and output; if the audio mixing flag indicates that mixing is not needed, the terminal generates and outputs an audio code stream containing the audio mixing flag and the core encoded data.

An embodiment of the invention also proposes a network-side audio mixing encoding method, comprising the following steps:

M audio code streams are received, and it is judged from the audio mixing flag in each stream whether that stream needs mixing. From the M' audio code streams that need mixing, N streams are selected according to their dynamic side information, the core encoded data of the selected N streams are mixed, and the mixed audio code streams are output, where M, M' and N are positive integers, N is less than or equal to M', and M' is less than or equal to M.

An embodiment of the invention proposes a multimedia conference system comprising M terminals and a multipoint control unit, wherein:

Each terminal is configured to set an audio mixing flag for the collected sound information according to a local audio mixing strategy and to encode the sound information according to the audio mixing flag, the result of the encoding being the core encoded data; and to generate and output either an audio code stream containing the core encoded data, an audio mixing flag indicating that mixing is needed, and dynamic side information, or an audio code stream containing the core encoded data and an audio mixing flag indicating that mixing is not needed;

The multipoint control unit is configured to receive the audio code streams from the terminals and to judge from the value of the audio mixing flag in each stream whether that stream needs mixing; from the M' audio code streams that need mixing, to select N streams according to their dynamic side information; and to mix the core encoded data of the selected N streams and output the mixed audio code streams, where M, M' and N are positive integers, N is less than or equal to M', and M' is less than or equal to M.

An embodiment of the invention proposes a multimedia conference terminal, comprising:

A sound collecting module, configured to collect sound information;

An audio mixing policy module, configured to set an audio mixing flag for the sound information collected by the sound collecting module according to a preset audio mixing strategy;

A core encoder module, configured to encode the sound information and output core encoded data;

A framing module, configured to compute dynamic side information according to the audio mixing flag set by the audio mixing policy module and, depending on the value of the audio mixing flag, to generate either an encoded audio data frame containing the core encoded data, the audio mixing flag and the dynamic side information, or an encoded audio data frame containing the core encoded data and the audio mixing flag;

An output module, configured to output the encoded audio data frames generated by the framing module as an audio code stream.

An embodiment of the invention proposes a multipoint control unit, comprising:

A selection unit, configured to receive audio code streams from M terminals, to judge from the value of the audio mixing flag of each stream whether that stream needs mixing, and, from the M' audio code streams that need mixing, to select N streams according to their dynamic side information;

An audio mixing unit, configured to mix the core encoded data of the N audio code streams selected by the selection unit and obtain M' mixed audio code streams;

A sending unit, configured to send the audio code streams from the audio mixing unit to the corresponding destination terminals.

As can be seen from the above technical solutions, on the terminal side the audio mixing flag is marked in the encoded code stream and the corresponding dynamic side information is added; on the network side the audio code streams that need mixing are selected according to the audio mixing flag and the dynamic side information and are then mixed, which can solve the noise problem of audio mixing encoding.

Description of drawings

Fig. 1 is a schematic diagram of a multimedia conference system of the prior art;

Fig. 2 is a schematic diagram of the multimedia conference system of an embodiment of the invention;

Fig. 3 shows the structure of an encoded data frame in the audio code stream output by the encoder unit of a terminal in an embodiment of the invention;

Fig. 4 is a flowchart of terminal-side encoding in an embodiment of the invention;

Fig. 5 is a flowchart of audio mixing encoding on the MCU side in an embodiment of the invention;

Fig. 6 is a block diagram of a multimedia conference terminal proposed by an embodiment of the invention;

Fig. 7 is a block diagram of a multipoint control unit proposed by an embodiment of the invention.

Embodiment

An embodiment of the invention proposes an audio mixing encoding method based on an audio mixing flag. In addition to the core encoded stream that carries the speech, the data stream output by a terminal contains an audio mixing flag and dynamic side information, where the dynamic side information carries the information needed for audio mixing encoding. If the audio mixing flag is set to indicate that mixing is needed, the dynamic side information is set; if the flag is set to indicate that mixing is not needed, no dynamic side information is set. According to the audio mixing flag, the MCU selects the core encoded streams that need mixing and mixes them.

To make the purpose, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Fig. 2 is a schematic diagram of the multimedia conference system of an embodiment of the invention. The system comprises M terminals, namely terminal 1, terminal 2, ..., terminal M, and an MCU.

Taking terminal 1 as an example, the terminal comprises an encoder unit 201. Encoder unit 201 encodes the sound collected by the sound collection device of terminal 1, such as a microphone, and generates a core encoded stream carrying the sound information. Encoder unit 201 also sets the audio mixing flag according to the locally configured audio mixing strategy. The audio mixing strategy determines whether the encoded sound output by the terminal needs mixing, and different strategies can be configured as required. For example, different priorities can be assigned to different terminals so that audio code streams from high-priority terminals are mixed preferentially; an acoustic energy threshold can also be set so that the audio code stream of a terminal is mixed when the sound energy collected by that terminal exceeds the threshold. Several audio mixing strategies can be used at the same time.
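As a rough illustration of such a strategy, the Python sketch below combines the two example criteria, terminal priority and an energy threshold. The function name `set_mixing_flag`, its parameters, and the choice to require both criteria at once are assumptions made for illustration; the text does not specify how simultaneous strategies are combined.

```python
def set_mixing_flag(frame_energy, priority, energy_threshold, min_priority):
    """Return 1 if the frame should be mixed, 0 otherwise.

    Assumed combination rule: the terminal's priority must reach
    min_priority AND the collected energy must exceed energy_threshold.
    """
    passes_priority = priority >= min_priority
    passes_energy = frame_energy > energy_threshold
    return 1 if (passes_priority and passes_energy) else 0


# Hypothetical values: a loud frame from a high-priority terminal,
# a quiet frame from the same terminal, and a loud low-priority frame.
flag_loud = set_mixing_flag(5000.0, priority=3, energy_threshold=1000.0, min_priority=2)
flag_quiet = set_mixing_flag(200.0, priority=3, energy_threshold=1000.0, min_priority=2)
flag_low_priority = set_mixing_flag(5000.0, priority=1, energy_threshold=1000.0, min_priority=2)
```

A deployment could just as well combine the criteria with a logical OR or a weighted score; the AND rule here is only one possible reading.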

If the audio mixing flag indicates that mixing is needed, encoder unit 201 also generates dynamic side information and writes it into the audio code stream; if the flag indicates that mixing is not needed, the audio code stream output by encoder unit 201 contains only the core encoded data and the audio mixing flag.

Fig. 3 shows the structure of an encoded data frame in the audio code stream output by the encoder unit of a terminal in an embodiment of the invention. Let the total length of a data frame be n bits. When the audio mixing flag indicates that mixing is needed, the encoded data frame, shown in the upper part of Fig. 3, comprises a t-bit audio mixing flag, m bits of dynamic side information, and n-m-t bits of core encoded data. The audio mixing flag is placed at the head of the frame so that the MCU can identify it easily. When the audio mixing flag indicates that mixing is not needed, the encoded data frame, shown in the lower part of Fig. 3, comprises a t-bit audio mixing flag and n-t bits of core encoded data.

For coding of the G.711 narrowband enhancement layer (Low Band Enhancement, LBE), the fields in Fig. 3 may take the following values: t = 1, n = 80, m = 9.
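The two frame layouts can be sketched in Python with the example values t = 1, m = 9 and n = 80. The frame is modelled as a string of '0' and '1' characters for readability. The helper names `pack_frame` and `unpack_frame` are hypothetical, and placing the side information directly after the flag and before the core encoded data is an assumption based on the order in which the fields are listed.

```python
T_BITS = 1    # audio mixing flag
M_BITS = 9    # dynamic side information
N_BITS = 80   # whole frame


def pack_frame(mix_flag, core_bits, side_info=0):
    """Build an n-bit frame: flag, optional side information, core data."""
    if mix_flag not in (0, 1):
        raise ValueError("mix_flag must be 0 or 1")
    if mix_flag:
        if not 0 <= side_info < (1 << M_BITS):
            raise ValueError("side_info does not fit in m bits")
        expected_core = N_BITS - M_BITS - T_BITS   # n - m - t = 70
        body = format(side_info, "0{}b".format(M_BITS)) + core_bits
    else:
        expected_core = N_BITS - T_BITS            # n - t = 79
        body = core_bits
    if len(core_bits) != expected_core:
        raise ValueError("core data must be %d bits" % expected_core)
    return format(mix_flag, "0{}b".format(T_BITS)) + body


def unpack_frame(frame):
    """Read the flag at the frame head, then split the rest accordingly."""
    mix_flag = int(frame[:T_BITS], 2)
    if mix_flag:
        side_info = int(frame[T_BITS:T_BITS + M_BITS], 2)
        core_bits = frame[T_BITS + M_BITS:]
    else:
        side_info = None
        core_bits = frame[T_BITS:]
    return mix_flag, side_info, core_bits


mixed_frame = pack_frame(1, "0" * 70, side_info=0b101010011)
plain_frame = pack_frame(0, "1" * 79)
```

Because the flag sits at the head of the frame, a receiver can decide how to parse the remaining bits after reading only the first t bits.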

The side information comprises the frame energy (Frame Energy) and the voicing score (Voicing score). If the side information is 9 bits long, 6 bits carry the quantized frame energy and 3 bits carry the quantized voicing score.
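A minimal Python sketch of the 9-bit side information field follows. It assumes that the frame energy and the voicing score have already been quantized to integer indices, since the quantization rule itself is not given here; the helper names and the choice to put the energy index in the high-order bits are also assumptions.

```python
ENERGY_BITS = 6    # quantized frame energy
VOICING_BITS = 3   # quantized voicing score


def pack_side_info(energy_index, voicing_index):
    """Combine two already-quantized indices into one 9-bit value."""
    if not 0 <= energy_index < (1 << ENERGY_BITS):
        raise ValueError("energy index must fit in 6 bits")
    if not 0 <= voicing_index < (1 << VOICING_BITS):
        raise ValueError("voicing index must fit in 3 bits")
    # Assumed layout: energy in the high 6 bits, voicing in the low 3 bits.
    return (energy_index << VOICING_BITS) | voicing_index


def unpack_side_info(side_info):
    """Split a 9-bit value back into (energy_index, voicing_index)."""
    return side_info >> VOICING_BITS, side_info & ((1 << VOICING_BITS) - 1)


packed = pack_side_info(42, 5)
```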

The frame energy is calculated by formula (1):

\mathrm{Frame\_Energy} = \frac{\sum_{i=0}^{\mathrm{Frame\_Length}-1} S^{2}(i)}{\mathrm{Frame\_Length}} \qquad (1)

Here Frame_Length is the frame length, S(i) is the low-band signal after the quadrature mirror filter (Quadrature Mirror Filter, QMF), and i is the index of the sample within the frame.
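Formula (1) is the mean of the squared samples over one frame. A minimal Python sketch, assuming the QMF low-band samples S(i) are already available as a list (the function name `frame_energy` and the sample values are hypothetical):

```python
def frame_energy(samples):
    """Mean of the squared samples over one frame, as in formula (1)."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in samples) / len(samples)


# Hypothetical four-sample frame: (1 + 4 + 9 + 16) / 4
energy = frame_energy([1, -2, 3, -4])
```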

The voicing score is calculated by formula (2):

\mathrm{Voicing\_score} = \frac{\mathrm{Zero\_Crossing\_Rate}}{\mathrm{Scale\_Factor}} \qquad (2)

Here the zero-crossing rate (Zero_Crossing_Rate) is the number of times the time-domain waveform crosses zero within 10 ms. The scale factor (Scale_Factor) is a preset scaling constant with a value in [0, 1].
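Formula (2) can be sketched in Python as follows. The zero-crossing count is taken as the number of sign changes between consecutive samples, with a sample equal to zero treated as non-negative; that convention, the function names and the default scale factor are assumptions. The sketch also rejects a scale factor of 0, because the stated range [0, 1] would otherwise permit a division by zero.

```python
def zero_crossing_count(samples):
    """Count sign changes between consecutive samples (0 counts as >= 0)."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


def voicing_score(samples, scale_factor=0.5):
    """Zero-crossing count of the frame divided by a preset scale factor."""
    if not 0 < scale_factor <= 1:
        raise ValueError("scale_factor must be in (0, 1]")
    return zero_crossing_count(samples) / scale_factor


# Hypothetical 10 ms frame with four sign changes.
frame = [1, -1, 2, 3, -2, -1, 4]
crossings = zero_crossing_count(frame)
score = voicing_score(frame, scale_factor=0.5)
```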

Depending on actual conditions, the dynamic side information may also be set to other quantities that can serve as a basis for mixing decisions, for example voice activity detection (VAD).

After the audio code stream output by a terminal is sent to the MCU, it first enters selection unit 202. Selection unit 202 identifies the audio mixing flag in the received audio code stream and determines from its value whether this stream needs mixing. If no mixing is needed, selection unit 202 outputs the stream to the corresponding destination terminal. From all M' (M' less than or equal to M) audio code streams that need mixing, selection unit 202 selects N (N less than or equal to M') streams according to their dynamic side information and sends them to the corresponding decoders. After decoding, the streams are sent to audio mixing unit 203 for mixing, which yields M' mixed audio streams; these M' streams are then each encoded by an encoder and sent to the corresponding terminals.

The terminal-side encoding process of the embodiment of the invention, shown in Fig. 4, comprises the following steps:

Step 401: An audio mixing flag is set for the collected sound information according to the local audio mixing strategy, and the sound information is then encoded; the result of the encoding is the core encoded data.

Step 402: If the audio mixing flag indicates that mixing is needed, dynamic side information is computed; the frame energy and the voicing score can be computed according to formulas (1) and (2) above and used as the dynamic side information.

Step 403: The audio code stream is generated and output. Specifically, if the audio mixing flag indicates that mixing is needed, an encoded audio data frame containing the audio mixing flag, the core encoded data and the dynamic side information is generated; if the flag indicates that mixing is not needed, an encoded audio data frame containing the audio mixing flag and the core encoded data is generated. The audio mixing flag is placed at the head of the data frame and is preferably 1 bit long.
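Steps 401 to 403 can be summarized in a short Python sketch. It is an outline under stated assumptions, not the patented encoder: the local strategy is reduced to an energy threshold, the core codec is passed in as a stand-in callable `core_encode`, only the frame energy is used as side information, and the names `EncodedFrame` and `encode_terminal_frame` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EncodedFrame:
    mix_flag: int                      # 1 = mixing needed, 0 = not needed
    core_data: bytes                   # output of the core encoder
    side_info: Optional[dict] = None   # present only when mix_flag == 1


def encode_terminal_frame(samples, energy_threshold, core_encode):
    # Step 401: set the flag from the local strategy (assumed here to be
    # an energy threshold), then run the core encoder.
    energy = sum(s * s for s in samples) / len(samples)
    mix_flag = 1 if energy > energy_threshold else 0
    core_data = core_encode(samples, mix_flag)

    # Step 402: compute dynamic side information only if mixing is needed.
    side_info = {"frame_energy": energy} if mix_flag else None

    # Step 403: assemble the frame with or without side information.
    return EncodedFrame(mix_flag, core_data, side_info)


def fake_core_encode(samples, mix_flag):
    """Stand-in codec: one zero byte per sample, for illustration only."""
    return bytes(len(samples))


loud = encode_terminal_frame([1000, -1000], 500.0, fake_core_encode)
quiet = encode_terminal_frame([1, -1], 500.0, fake_core_encode)
```

The flag is passed to the core encoder because the bit budget for the core encoded data depends on whether side information has to be carried in the same frame.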

The audio mixing encoding process on the MCU side of the embodiment of the invention, shown in Fig. 5, comprises the following steps:

Step 501: The MCU receives an audio code stream from a terminal and judges from the value of its audio mixing flag whether the stream needs mixing. If so, step 503 is executed; otherwise step 502 is executed.

Step 502: The audio code stream is sent directly to the corresponding destination terminal, and processing of this stream ends.

Step 503: For the audio code streams received at the same time from M' terminals whose audio mixing flags indicate that mixing is needed, the MCU selects N streams according to the dynamic side information in these streams and discards the remaining M'-N streams, where N is less than or equal to M'.

The selection may be based on the energy value in the side information: a stream whose energy is greater than a threshold T is mixed, and a stream whose energy is less than T is not.
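The threshold rule can be sketched in Python as below. The function name `select_streams`, the tuple representation of a stream and the optional cap `max_streams` are assumptions added for illustration; the text only states the comparison against the threshold T.

```python
def select_streams(streams, threshold, max_streams=None):
    """Pick the terminals whose reported frame energy exceeds the threshold.

    streams: list of (terminal_id, frame_energy) pairs for the M' streams
    whose audio mixing flag indicates that mixing is needed.
    max_streams: optional cap on N (an assumption, not from the text).
    """
    selected = [s for s in streams if s[1] > threshold]
    # Loudest first, so that a cap keeps the most energetic streams.
    selected.sort(key=lambda s: s[1], reverse=True)
    if max_streams is not None:
        selected = selected[:max_streams]
    return [terminal_id for terminal_id, _ in selected]


# Hypothetical side information from four flagged terminals.
streams = [("A", 50.0), ("B", 900.0), ("C", 400.0), ("D", 10.0)]
chosen = select_streams(streams, threshold=100.0)
loudest_only = select_streams(streams, threshold=100.0, max_streams=1)
```

Because the decision uses only the side information, the MCU can discard the M'-N unselected streams without decoding them.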

Step 504: The core encoded data of the selected N audio code streams are decoded separately, and the decoded data are mixed to obtain M' mixed audio streams.

Step 505: The M' mixed audio streams are encoded separately, and the M' encoded mixed audio code streams are sent to the M' destination terminals respectively.
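The MCU-side flow of steps 501 to 505 can be outlined in Python. This is a sketch under several assumptions: frames are plain dictionaries, the codec is supplied as stand-in `decode` and `encode` callables, selection uses only an energy threshold, and each destination's mix leaves out that terminal's own signal. The last point is common conferencing practice but is not stated in the text. Saturation handling is omitted.

```python
def mcu_mix(frames, energy_threshold, decode, encode):
    """frames: dict of terminal_id -> {"mix_flag", "core", "frame_energy"}."""
    # Steps 501/502: streams that need no mixing are passed through as is.
    passthrough = {}
    candidates = {}
    for terminal_id, frame in frames.items():
        if frame["mix_flag"]:
            candidates[terminal_id] = frame
        else:
            passthrough[terminal_id] = frame

    # Step 503: select N of the M' candidates from the side information.
    selected = [tid for tid, f in candidates.items()
                if f["frame_energy"] > energy_threshold]

    # Step 504: decode only the selected streams, then build M' mixes.
    decoded = {tid: decode(candidates[tid]["core"]) for tid in selected}
    outputs = {}
    for dest in candidates:
        # Assumption: a terminal does not receive its own signal back.
        sources = [pcm for tid, pcm in decoded.items() if tid != dest]
        mixed = [sum(column) for column in zip(*sources)]
        # Step 505: encode each mix for its destination terminal.
        outputs[dest] = encode(mixed)
    return passthrough, outputs


frames = {
    "A": {"mix_flag": 1, "core": [1, 2], "frame_energy": 500.0},
    "B": {"mix_flag": 1, "core": [10, 20], "frame_energy": 800.0},
    "C": {"mix_flag": 1, "core": [100, 200], "frame_energy": 5.0},
    "D": {"mix_flag": 0, "core": [7, 7], "frame_energy": 0.0},
}
# Identity stand-ins for the codec, so the arithmetic is easy to follow.
passthrough, outputs = mcu_mix(frames, 100.0, decode=list, encode=list)
```

In this example terminal C is flagged for mixing but falls below the threshold, so its stream is dropped from the mix while C still receives the mix of A and B.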

Fig. 6 shows a multimedia conference terminal proposed by an embodiment of the invention, comprising:

A sound collecting module 601, configured to collect sound information;

An audio mixing policy module 602, configured to set an audio mixing flag for the sound information collected by sound collecting module 601 according to a preset audio mixing strategy;

A core encoder module 603, configured to encode the sound information and output core encoded data. If audio mixing policy module 602 sets the audio mixing flag to indicate that mixing is not needed, core encoder module 603 need not reserve bits for dynamic side information when encoding; if the flag indicates that mixing is needed, core encoder module 603 must take the bit allocation for dynamic side information into account. For example, if an encoded data frame has n bits in total, the audio mixing flag takes t bits and the dynamic side information takes m bits, then the core encoded data produced by core encoder module 603 is n-t bits long when no side information bits are needed and n-m-t bits long when they are.

A framing module 604, configured to compute dynamic side information according to the audio mixing flag set by audio mixing policy module 602 and, depending on the value of the audio mixing flag, to generate either an audio data frame containing the core encoded data, the audio mixing flag and the dynamic side information, or an audio data frame containing the core encoded data and the audio mixing flag;

An output module 605, configured to output the audio data frames generated by framing module 604 as an audio code stream.

Fig. 7 shows a multipoint control unit proposed by an embodiment of the invention, comprising:

A selection unit 701, configured to receive audio code streams from M terminals, to judge from the value of the audio mixing flag of each stream whether that stream needs mixing, and, from the M' audio code streams that need mixing, to select N streams according to their dynamic side information;

An audio mixing unit 702, configured to mix the core encoded data of the N audio code streams selected by the selection unit and obtain M' mixed audio streams;

A sending unit 703, configured to send the audio code streams from the audio mixing unit to the corresponding destination terminals.

Selection unit 701 sends the audio code streams that do not need mixing to sending unit 703, and sending unit 703 sends the audio code streams from the selection unit to the corresponding destination terminals.

The multipoint control unit further comprises: a decoder 704, configured to decode the core encoded data of the audio code streams selected by selection unit 701 and to send the decoded data to audio mixing unit 702;

An encoder 705, configured to encode the mixed audio streams from audio mixing unit 702 and to send the encoded audio code streams to sending unit 703.

The scheme of the embodiments of the invention marks the audio mixing flag in the encoded code stream, adds the corresponding dynamic side information, and allocates the side information bits dynamically according to the audio mixing flag. The MCU selects the audio code streams that need mixing according to the audio mixing flag and the dynamic side information and mixes them. This addresses signal overflow and the errors introduced when large signals are mixed, and reduces the computational complexity of the MCU. When no mixing is performed, the bits of the code stream can be fully used for the core encoding, which improves core encoding quality. The scheme can be used in mixing systems and also with the codecs of common coding and decoding systems; it facilitates intelligent control of the encoded code stream and enhances the interactivity of the MCU.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. An audio mixing encoding method, characterized by comprising the following steps:

2. The method according to claim 1, characterized in that the dynamic side information comprises frame energy, a voicing score and/or voice activity detection.

3. The method according to claim 2, characterized in that computing the dynamic side information comprises calculating the frame energy according to the formula

\mathrm{Frame\_Energy} = \frac{\sum_{i=0}^{\mathrm{Frame\_Length}-1} S^{2}(i)}{\mathrm{Frame\_Length}}

where Frame_Energy is the frame energy, Frame_Length is the frame length, S(i) is the low-band signal after the quadrature mirror filter (QMF), and i is the index of the sample within the frame.

4. The method according to claim 2, characterized in that computing the dynamic side information comprises calculating the voicing score according to the formula

\mathrm{Voicing\_score} = \frac{\mathrm{Zero\_Crossing\_Rate}}{\mathrm{Scale\_Factor}}

where Voicing_score is the voicing score; Zero_Crossing_Rate is the number of times the time-domain waveform of the sound information crosses zero within a predetermined time; and Scale_Factor is a preset scaling constant with a value in [0, 1].

5. The method according to claim 1, characterized in that, when the judgment made from the audio mixing flag is that the audio code stream does not need mixing, the method further comprises: outputting the audio code stream to the destination terminal.

6. The method according to any one of claims 1 to 5, characterized in that mixing the core encoded data of the selected N audio code streams and outputting the mixed audio code streams comprises: decoding the core encoded data in the selected N audio code streams separately; mixing the decoded N channels of data to obtain M' mixed audio streams; encoding the M' mixed audio streams separately; and sending the M' encoded mixed audio code streams to the M' destination terminals respectively.

7. A terminal-side encoding method, characterized by comprising the following steps:

8. A network-side audio mixing encoding method, characterized by comprising the following steps:

9. A multimedia conference system comprising M terminals and a multipoint control unit, characterized in that:

10. A multimedia conference terminal, characterized by comprising:

A sound collecting module, configured to collect sound information;

11. A multipoint control unit, characterized by comprising:

12. The multipoint control unit according to claim 11, characterized in that the selection unit sends the audio code streams that do not need mixing to the sending unit, and the sending unit sends the audio code streams from the selection unit to the corresponding destination terminals.

13. The multipoint control unit according to claim 11 or 12, characterized in that the multipoint control unit further comprises: a decoder, configured to decode the core encoded data of the audio code streams selected by the selection unit and to send the decoded data to the audio mixing unit;

An encoder, configured to encode the mixed audio streams from the audio mixing unit and to send the encoded audio code streams to the sending unit.