CN1845573A

CN1845573A - Simultaneous interpretation video conference system and method for supporting high capacity mixed sound

Info

Publication number: CN1845573A
Application number: CN200610040060.1A
Authority: CN
Inventors: 都思丹; 薛卫; 周余; 叶迎宪; 刘红星
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2006-04-30
Filing date: 2006-04-30
Publication date: 2006-10-11

Abstract

The disclosed simultaneous interpretation video conference system comprises: a silence detection method based on Mel-scale cepstral features and a support vector machine (SVM), which achieves a higher silence detection rate when separating silence from normal speech; a large-capacity audio mixing method that uses the short-time energy of the speech as the mixing weight; and a newly defined audio data packet header format for simultaneous interpretation.

Description

Simultaneous interpretation video conference system and method supporting high-capacity audio mixing

Technical field

The present invention relates to a simultaneous interpretation video conference system for the Internet, and specifically solves the communication problems of high-capacity audio mixing and simultaneous interpretation within a single meeting room.

Background technology

With the rapid development of domestic industries such as foreign affairs and foreign trade, a networked voice communication platform that supports both high-capacity audio mixing and multilingual exchange has good application prospects.

The audio mixing architectures in common use today are centralized and distributed. In the centralized structure, each conference terminal sends its own audio data to a central mixer, which performs the mixing and feeds the result back to all terminals. In the distributed structure, each conference terminal receives audio data from all other members and performs the mixing independently at its own site. Clearly, the distributed approach duplicates the mixing computation and generates very heavy network traffic, which easily leads to network congestion and high cost. Centralized processing reduces the client computation load, keeps network traffic low, and is simple and easy to implement. At present, smaller multimedia conference systems all adopt this approach, but as the conference grows its drawbacks become increasingly obvious. First, the mixing computation grows with the number of participating terminals, and the mixing delay necessarily grows with it. Second, voice quality declines: the mixing algorithms published so far (linear superposition, average adjusted weighting, strong alignment weighting, weak alignment weighting, etc.) all suffer from reduced volume, summation overflow and introduced random noise once the number of mixed voice channels reaches a certain level. The number of mixed channels is therefore generally limited and speaking rights are switched between participants, which is very inconvenient for users. One part of the present invention addresses this series of problems: an efficient silence detection method suppresses the transmission of silence at the speaking end, and a more effective mixing method is used in the mixer, so that real-time mixing of at least 20 channels can be achieved in use.

General multimedia conference systems process speech per meeting room, and each meeting room has only one mixer. This model cannot meet the requirements of international exchange activities, which include conferences, business exchanges, product promotion events and so on. Such conference environments require information to be released in multiple languages simultaneously and allow the organizer to communicate with participants from different countries. Some video conference systems currently on the market must open a separate meeting room for each language to ensure that multilingual audio can be mixed and sent to different recipients at the same time. This approach is clearly uneconomical and inconvenient to operate.

Summary of the invention

In order to improve mixing efficiency and solve the simultaneous interpretation problem, the invention provides a more efficient silence detection method, an audio mixing method and a simultaneous interpretation method. Together they achieve a higher silence detection rate, mix more channels than other mixing methods, and carry out multilingual simultaneous mixing within the same meeting room.

The object of the invention is achieved through the following technical solution:

The system adopts a centralized processing architecture and defines two main subsystems: the client terminal (Terminal) and the multipoint control unit (MCU). The client terminal comprises functional modules such as the video codec, audio codec, control unit, network transport layer and auxiliary office. The audio codec adopts the silence detection method proposed below: before compressing audio, it detects whether the current speech frame needs to be compressed at all. The multipoint control unit is generally installed on a server. The MCU comprises a multipoint control module and a multipoint processing module, and the multipoint processing module uses the short-time adaptive-weight mixing method proposed below.

The method supporting high-capacity audio mixing is realized by the following steps:

1. The audio codec module in the client terminal uses the silence detection method provided by the invention, based on Mel-scale cepstral features and a support vector machine, to reduce the amount of audio data transmitted. Mel-scale cepstral coefficients are used as the speech feature. They exploit the auditory masking effect of the human ear: in the frequency domain the speech is divided into a series of critical bands that form a bank of triangular filters, i.e. the Mel filter sequence. The silence detection process is:

1) Extract the Mel-scale cepstral coefficients of one frame of audio data. The Mel-scale cepstral coefficients (c_{MFCC}) are computed as follows:

c_{MFCC}(i) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L} \log m(l) \cos\left\{ \left( l - \frac{1}{2} \right) \frac{i\pi}{L} \right\} \qquad (1)

Wherein:

m(l) = \sum_{k=o(l)}^{h(l)} W_{l}(k) \, |X_{n}(k)|, \quad l = 1, 2, \cdots, L \qquad (2)

W_{l}(k) = \begin{cases} \frac{k - o(l)}{c(l) - o(l)}, & o(l) \leq k \leq c(l) \\ \frac{h(l) - k}{h(l) - c(l)}, & c(l) \leq k \leq h(l) \end{cases} \qquad (3)

In the formulas, o(l), c(l) and h(l) are respectively the lower-limit, center and upper-limit frequencies of the l-th triangular filter.
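Formulas (1) to (3) can be illustrated with a minimal Python sketch. It assumes that the magnitude spectrum |X_n(k)| of the frame has already been computed and that each triangular filter is given as a tuple of frequency-bin indices (o, c, h). The function names, the small floor applied before the logarithm and the example filter bank are assumptions made for illustration and are not part of the disclosure.

```python
import math

def triangular_weight(k, o, c, h):
    """Weight W_l(k) of formula (3) for a filter with lower, center, upper bins o, c, h."""
    if o <= k <= c:
        return (k - o) / (c - o) if c > o else 1.0
    if c < k <= h:
        return (h - k) / (h - c) if h > c else 1.0
    return 0.0  # outside the filter's support

def mel_cepstrum(magnitudes, filters, num_coeffs):
    """Return c_MFCC(1..num_coeffs) for one frame.

    magnitudes: list of |X_n(k)| values for the frame.
    filters:    list of (o, c, h) bin indices, one tuple per triangular filter.
    """
    L = len(filters)
    # Formula (2): filter-bank outputs m(l).
    m = []
    for o, c, h in filters:
        total = sum(triangular_weight(k, o, c, h) * magnitudes[k]
                    for k in range(o, h + 1))
        m.append(max(total, 1e-10))  # floor avoids log(0); an illustrative assumption
    # Formula (1): cosine transform of the log filter-bank outputs.
    scale = math.sqrt(2.0 / L)
    coeffs = []
    for i in range(1, num_coeffs + 1):
        acc = sum(math.log(m[l - 1]) * math.cos((l - 0.5) * i * math.pi / L)
                  for l in range(1, L + 1))
        coeffs.append(scale * acc)
    return coeffs
```

For a flat spectrum and identically shaped filters all m(l) are equal, so the cosine sums cancel and every coefficient is close to zero, which is a convenient sanity check for an implementation.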

2) Classify the Mel-scale cepstral coefficients of the audio with a two-class support vector machine, obtaining one of two results: normal speech or silence. Other classifiers may of course also be used; the invention is not limited in this respect.

2. Short-time adaptive-weight mixing method in the multipoint control unit

Define the mixing weight w[j]. First compute the average amplitude of each voice channel over k data frames:

Avg[j] = \frac{1}{kl} \sum_{i=0}^{kl-1} | data[j, i] | \qquad (4)

In the above formula, data[j, i] denotes the i-th sample of the j-th voice channel, and the letter l denotes the number of samples in one data frame. Then, from Avg[j], compute the weight w[j] that the j-th channel should receive, where n is the number of channels being mixed:

w[j] = Avg[j] / \sum_{p=0}^{n-1} Avg[p] \qquad (5)

Then mix the channels according to w[j]:

MixData[i] = \sum_{j=0}^{n-1} data[j, i] \cdot w[j] \qquad (6)
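As a minimal sketch of formulas (4) to (6), the following Python function mixes n channels that each hold the same number of samples (k frames of l samples). The function name and the handling of the all-silent case are assumptions made for illustration; the disclosure does not specify them.

```python
def mix_frames(data):
    """Mix n channels with short-time adaptive weights.

    data[j][i] is sample i of channel j; every channel holds k*l samples.
    Returns (mixed_samples, weights).
    """
    n = len(data)
    num_samples = len(data[0])
    # Formula (4): average absolute amplitude of each channel.
    avg = [sum(abs(s) for s in channel) / num_samples for channel in data]
    total = sum(avg)
    if total == 0:
        # All channels silent: nothing to weight (behaviour assumed here).
        return [0.0] * num_samples, [0.0] * n
    # Formula (5): each channel's share of the total average amplitude.
    w = [a / total for a in avg]
    # Formula (6): weighted sum of the channels, sample by sample.
    mixed = [sum(data[j][i] * w[j] for j in range(n))
             for i in range(num_samples)]
    return mixed, w
```

Because the weights are non-negative and sum to 1, each mixed sample is a convex combination of the input samples and cannot exceed the largest input magnitude, so the summation cannot overflow the sample range.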

The simultaneous interpretation method is implemented as follows. A new audio data packet header format is defined so that the header can indicate the language. When the MCU sets up a meeting room, it creates n language mixers for that room. When a speaker starts, it indicates the language of its speech, and a receiver indicates the language it accepts; alternatively, the speaking and receiving languages are configured in advance. When the MCU receives audio, it determines which meeting room and which language the audio belongs to and sends it to the corresponding mixer. The MCU then forwards the mixed data according to each receiver's request.
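The routing step described above, in which the MCU sends each incoming audio stream to the mixer for its meeting room and language, can be sketched as a lookup table keyed by room and language. The class name, method names and identifier types below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

class MixerTable:
    """Hypothetical per-room, per-language mixer inputs held by the MCU."""

    def __init__(self):
        # (room_id, language_id) -> {speaker_id: latest decoded frame}
        self.mixers = defaultdict(dict)

    def submit(self, room_id, language_id, speaker_id, frame):
        """Route one decoded frame to the mixer for its room and language."""
        self.mixers[(room_id, language_id)][speaker_id] = frame

    def frames_for(self, room_id, language_id):
        """Return the frames waiting to be mixed for the requested language."""
        return list(self.mixers.get((room_id, language_id), {}).values())
```

A receiver that has selected language 1 in room 1 would be served from `frames_for(1, 1)`, while speakers of language 0 in the same room feed a separate mixer.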

Description of drawings

Fig. 1 is a modular structure schematic diagram of the present invention;

Fig. 2 is a system flow chart of the present invention.

Embodiment

1. Figure 1 shows the block diagram of the system modules. At the sending client terminal, the video and audio signals obtained from the input devices are compressed by the encoders, packed in a defined format and sent over the network. At the multipoint control unit, the multipoint control module provides control functions for all conferences, and the multipoint processing module provides the data forwarding service. At the receiving client terminal, incoming packets are first unpacked; the resulting compressed video and audio data are decoded and sent to the output devices, and user data and control data are processed accordingly. The system comprises the following functions:

Video codec: performs redundancy-reducing compression coding of the video stream; it may be implemented with MPEG-4, H.264, etc.

Audio codec: performs silence detection and encoding and decoding of the speech signal, and optionally adds a buffering delay at the receiving end to ensure speech continuity; G.723, G.729, etc. may be used.

Control unit: provides end-to-end signaling to ensure proper communication between terminals. Four kinds of messages are defined: request, response, signaling and indication. Communication control is accomplished through operations such as capability negotiation between terminals, opening and closing logical channels, and sending commands or indications.

Network transport layer: formats video, audio, control and other data and sends them, and at the same time receives data from the network. It is also responsible for functions such as logical framing, sequence numbering and error detection.

Auxiliary office: implements specific functions such as the electronic whiteboard, text chat and file transfer.

Fig. 2 describes the flow of the audio and video data streams in the system of the invention. Properties of the audio and video streams such as their characteristics and sequence numbers can be carried by the Real-time Transport Protocol (RTP), with TCP or UDP used for transmission.

2. Implementation of the method supporting high-capacity audio mixing: in the silence detection, the Mel-scale cepstral coefficients are computed with L = 12, and the kernel (inner product) function of the support vector machine is the radial basis function (RBF). The support vector machine may be trained with the SMO method; the invention is not limited in this respect.
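Once trained, a two-class SVM with an RBF kernel classifies a feature vector x by the sign of the sum over support vectors of alpha_i * y_i * K(x_i, x), plus a bias. A minimal sketch of that decision step follows. The support vectors, coefficients, bias, the value of gamma and the convention that +1 means speech and -1 means silence are all assumptions for illustration; in practice they would come from training (for example with SMO) on labeled frames.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_decide(x, support_vectors, coeffs, bias, gamma):
    """Return +1 (assumed: speech) or -1 (assumed: silence) for feature vector x.

    coeffs[i] holds alpha_i * y_i for support_vectors[i].
    """
    score = sum(c * rbf_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, coeffs)) + bias
    return 1 if score >= 0 else -1
```

A frame classified as silence would simply not be compressed or transmitted by the client terminal, which is how the detection reduces the load on the mixer.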

The short-time adaptive-weight mixing method lends itself to a highly parallel computation structure. Note that in formula (4) the average amplitudes Avg[j] of the individual audio channels are computed independently of one another, so Avg[j] can be computed for all channels in parallel. In the mixing step the computation for each channel is likewise independent and is therefore equally suited to parallel computation. The program can also be optimized with the MMX, SSE and SSE2 instruction sets. Actual tests show that the method mixes well, introduces no new mixing noise, and preserves the detail of each original channel well while keeping the volume fair between channels.

3. When the simultaneous interpretation technique is used in practice, each client can freely choose the language to listen to from several different languages. Speaking rights require permission settings: a client with an ordinary identity can speak only in a single default language, and only a client whose identity is translator or senior client can select another language for speaking. Each client compresses its audio locally and uploads it to the MCU. The MCU decompresses the audio and mixes it in different mixers according to the speaking language selected by the client, then recompresses and transmits to each client the language that client has chosen to listen to. For a client that speaks and listens in the same language, the MCU must also first remove that client's own voice from the mixed sound, so that the client does not hear itself.
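Removing a client's own voice from the mix can be sketched as subtracting that client's weighted contribution from the mixed samples. This assumes the mix was formed as a weighted sum in which the client's channel carried weight own_weight; the function name is hypothetical, and whether the remaining weights are renormalized afterwards is not stated in the disclosure, so this sketch leaves them unchanged.

```python
def remove_own_voice(mixed, own_samples, own_weight):
    """Subtract one client's weighted contribution from the mixed samples."""
    return [m - s * own_weight for m, s in zip(mixed, own_samples)]
```

For example, if a mix of [250.0, 200.0] included the samples [100, -100] at weight 0.25, the client would instead receive [225.0, 225.0].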

So that the MCU and the clients can effectively indicate and distinguish the language of each datagram sent or received, a new audio data packet header format is defined in which several bits of the packet header identify the language. In general, 3 bits are sufficient for up to 8 languages.
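A header carrying a 3-bit language field might look like the following sketch. The actual field layout is not given in the disclosure; the layout here (one byte whose low 3 bits hold the language, followed by a 16-bit room identifier and a 16-bit sequence number) and all names are hypothetical.

```python
import struct

LANG_BITS = 3
LANG_MASK = (1 << LANG_BITS) - 1  # 0b111, so language ids 0..7

def pack_header(room_id, language_id, seq):
    """Pack a hypothetical 5-byte header in network byte order."""
    if not 0 <= language_id <= LANG_MASK:
        raise ValueError("language_id must fit in 3 bits")
    # Byte 0: language in the low 3 bits (upper bits unused in this sketch).
    return struct.pack("!BHH", language_id & LANG_MASK, room_id, seq)

def unpack_header(header):
    """Return (room_id, language_id, seq) from a packed header."""
    flags, room_id, seq = struct.unpack("!BHH", header)
    return room_id, flags & LANG_MASK, seq
```

On receipt, the MCU could read the room and language fields to select the mixer for the payload.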

Claims

1. A simultaneous interpretation video conference system and method supporting high-capacity audio mixing, characterized in that it comprises:

(1) a method of supporting high-capacity audio mixing, which suppresses the transmission of silence at the speaking end by a silence detection method based on Mel-scale cepstral features and a support vector machine, and uses a short-time adaptive-weight mixing method in the mixer of the multipoint control unit;

(2) multilingual simultaneous mixing within the same meeting room, wherein a new audio data packet header format is defined and multiple mixing processes are used for one meeting room.

2. The simultaneous interpretation video conference system and method supporting high-capacity audio mixing according to claim 1, characterized in that: item (1) proposes the silence detection method based on Mel-scale cepstral features and a support vector machine, and the short-time adaptive-weight mixing method.

3. The simultaneous interpretation video conference system and method supporting high-capacity audio mixing according to claim 1, characterized in that: in item (2), a new audio data packet header format is defined and multiple mixing processes are used for one meeting room.