CN1845573A - Simultaneous interpretation video conference system and method for supporting high capacity mixed sound - Google Patents


Publication number
CN1845573A
Authority
CN
China
Prior art keywords
high capacity
simultaneous interpretation
sound
conference system
video conference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200610040060.1A
Other languages
Chinese (zh)
Inventor
都思丹
薛卫
周余
叶迎宪
刘红星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN200610040060.1A
Publication of CN1845573A
Legal status: Pending

Landscapes

  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The disclosed simultaneous interpretation video conference system uses a silence detection method based on Mel-scale cepstral features and a support vector machine (SVM), which achieves a higher silence detection rate, to distinguish silence from normal speech; applies a high-capacity audio mixing method that uses short-time speech energy as the mixing weight; and defines a new audio packet header format to support simultaneous interpretation.

Description

Simultaneous interpretation video conference system and method supporting high-capacity audio mixing
Technical field
The present invention relates to a simultaneous interpretation video conference system for the Internet, and specifically addresses high-capacity audio mixing and simultaneous interpretation within a single meeting room.
Background technology
With the rapid development of sectors such as foreign affairs and foreign trade in China, a network voice communication platform that supports both high-capacity audio mixing and multilingual exchange has good application prospects.
The two common audio mixing architectures today are centralized and distributed. In the centralized architecture, each conference terminal sends its own audio data to a central mixer, which performs the mixing and returns the result to all terminals. In the distributed architecture, each conference terminal receives audio data from all other members and mixes it independently at its own site. Clearly, the distributed approach duplicates the mixing computation and generates heavy network traffic, which easily leads to network congestion and high cost. Centralized processing reduces the computation at the client, keeps network traffic low, and is simple and easy to implement.
Smaller multimedia conference systems currently all use centralized processing, but as the conference grows its drawbacks become increasingly apparent. First, the mixing computation grows with the number of participating terminals, so the mixing delay necessarily grows as well. Second, voice quality declines: the published mixing algorithms (linear superposition, average-adjusted weighting, strong-alignment weighting, weak-alignment weighting and so on) suffer from reduced volume, summation overflow and introduced random noise once the number of mixed voice channels reaches a certain level. The number of mixed channels is therefore usually limited by switching the right to speak, which is very inconvenient for users. Part of the present invention addresses this series of problems: an efficient silence detection method suppresses the transmission of silence at the speaking terminal, and a more effective mixing method is used in the mixer, so that at least 20 channels can be mixed in real time in use.
A typical multimedia conference system processes speech per meeting room, with only one mixer per room. This model cannot meet the needs of international exchange activities such as conferences, business exchanges and product promotion events. Such activities require information to be delivered in several languages at once and require the organizer to communicate with participants from different countries. Some video conference systems on the market must open a separate meeting room for each language to ensure that audio in several languages is mixed simultaneously and sent to the right recipients, which is clearly uneconomical and inconvenient to operate.
Summary of the invention
To improve mixing efficiency and solve the simultaneous interpretation problem, the invention provides a more efficient silence detection method, an audio mixing method and a simultaneous interpretation method. Together they achieve a higher silence detection rate, mix more channels than other mixing methods, and perform simultaneous multilingual mixing in a single meeting room.
The object of the invention is achieved through the following technical solution:
The system adopts a centralized processing architecture and defines two main subsystems: the client terminal (Terminal) and the multipoint control unit (MCU). The client terminal comprises functional modules for video coding and decoding, audio coding and decoding, control, network transport and auxiliary office functions. The audio codec module uses the silence detection method proposed below to decide, before compressing the audio, whether a given speech frame needs to be compressed at all. The multipoint control unit is generally installed on a server and comprises a multipoint control module and a multipoint processing module; the multipoint processing module uses the short-time adaptive weight mixing method proposed below.
The method supporting high-capacity audio mixing is implemented through the following steps:
1. The audio codec module in the client terminal uses the silence detection method provided by the invention, based on Mel-scale cepstral features and a support vector machine, to reduce the amount of audio data transmitted. Mel-frequency cepstral coefficients (MFCC) are used as the speech features. They exploit the auditory masking effect of the human ear: the speech spectrum is divided in the frequency domain into a series of critical bands that form a bank of triangular filters, i.e. the Mel filter bank. The silence detection process is:
1) Extract the Mel-frequency cepstral coefficients of one frame of audio data. The coefficients c_MFCC are computed as follows:
$$c_{MFCC}(i) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L} \log m(l) \cos\left\{\left(l - \frac{1}{2}\right)\frac{i\pi}{L}\right\} \tag{1}$$
Wherein:
$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\,|X_n(k)|, \quad l = 1, 2, \ldots, L \tag{2}$$
$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\[2ex] \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) \le k \le h(l) \end{cases} \tag{3}$$
In these formulas, o(l), c(l) and h(l) are the lower limit, center and upper limit frequencies of the l-th triangular filter, respectively.
2) A two-class support vector machine classifies the Mel-frequency cepstral coefficients of the audio into two classes: normal speech and silence. Other classifiers may of course be used; the invention is not limited in this respect.
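The feature extraction of step 1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the leading factor of formula (1) is taken to be sqrt(2/L), the filter edges o(l), c(l), h(l) are given directly as spectrum bin indices, the index i runs from 1, and a small floor is applied before the logarithm. The names `triangle_weight`, `filter_outputs`, `mfcc` and the toy input are hypothetical and do not come from the patent.

```python
import math


def triangle_weight(k, o, c, h):
    """Formula (3): weight W_l(k) of a triangular filter with edges o, c, h."""
    if o <= k <= c:
        return (k - o) / (c - o) if c > o else 1.0
    if c < k <= h:
        return (h - k) / (h - c)
    return 0.0


def filter_outputs(magnitude, filters):
    """Formula (2): m(l) = sum over k of W_l(k) * |X_n(k)|."""
    return [
        sum(triangle_weight(k, o, c, h) * magnitude[k] for k in range(o, h + 1))
        for (o, c, h) in filters
    ]


def mfcc(magnitude, filters, n_coeffs):
    """Formula (1): cosine transform of the log filter outputs."""
    L = len(filters)
    floor = 1e-12  # assumption: floor the outputs so log() never sees zero
    log_m = [math.log(max(m, floor)) for m in filter_outputs(magnitude, filters)]
    scale = math.sqrt(2.0 / L)  # assumption: leading factor of formula (1)
    return [
        scale * sum(log_m[l - 1] * math.cos((l - 0.5) * i * math.pi / L)
                    for l in range(1, L + 1))
        for i in range(1, n_coeffs + 1)
    ]


# Hypothetical toy input: a flat 9-bin magnitude spectrum and three filters.
flat_spectrum = [1.0] * 9
toy_filters = [(0, 2, 4), (2, 4, 6), (4, 6, 8)]
coeffs = mfcc(flat_spectrum, toy_filters, n_coeffs=2)
```

With a flat spectrum every filter produces the same output, so the cosine sums cancel and the coefficients are numerically zero. A real front end would first window the frame, take its FFT magnitude and place the filters on the Mel scale, none of which the text specifies in detail.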
2. Short-time adaptive weight mixing method in the multipoint control unit
Define the mixing weight w[j]. First compute the average amplitude of each voice channel over k data frames:
$$Avg[j] = \frac{1}{kl} \sum_{i=0}^{kl-1} \left|data[j,i]\right| \tag{4}$$
In the formula above, data[j, i] denotes the i-th sample of the j-th voice channel, and the letter l denotes the number of samples in one data frame. The weight w[j] that the j-th channel should receive is then computed from Avg[j]:
$$w[j] = Avg[j] \Big/ \sum_{p=0}^{n-1} Avg[p] \tag{5}$$
The channels are then mixed according to w[j]:
$$MixData[i] = \sum_{j=0}^{n-1} data[j,i] \cdot w[j] \tag{6}$$
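Formulas (4) to (6) translate almost directly into code. The following Python sketch assumes that each channel is a list holding the k·l samples of the current mixing window and that all channels have the same length; the equal-weight fallback for an all-silent window is an added assumption, and the function names are illustrative only.

```python
def average_amplitudes(channels):
    """Formula (4): mean absolute sample value of each channel."""
    return [sum(abs(s) for s in ch) / len(ch) for ch in channels]


def mixing_weights(avgs):
    """Formula (5): each channel's share of the total average amplitude."""
    total = sum(avgs)
    if total == 0:
        # Assumption: if every channel is silent, fall back to equal weights.
        return [1.0 / len(avgs)] * len(avgs)
    return [a / total for a in avgs]


def mix(channels):
    """Formula (6): weighted sum of the channels, sample by sample."""
    weights = mixing_weights(average_amplitudes(channels))
    n_samples = len(channels[0])
    return [
        sum(ch[i] * weights[j] for j, ch in enumerate(channels))
        for i in range(n_samples)
    ]


# Two hypothetical channels: a quiet talker and a loud talker.
quiet = [100, -100, 100, -100]
loud = [300, -300, 300, -300]
mixed = mix([quiet, loud])
```

Because the weights are non-negative and sum to 1, each mixed sample is a convex combination of the input samples and cannot exceed the largest input amplitude, which is why this scheme avoids the summation overflow of plain linear superposition.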
The simultaneous interpretation method is implemented as follows. A new audio packet header format is defined so that the header can indicate the language. When the MCU sets up a meeting room, it creates n language mixers for that room. A speaker indicates the language of the speech when starting to speak, and a receiver indicates the language to receive; alternatively, the speaking and receiving languages are configured in advance. When the MCU receives audio, it determines which meeting room and language the stream belongs to and feeds it into the corresponding mixer. The MCU then forwards the mixed data according to each receiver's request.
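The routing described above amounts to keeping one mixer per (meeting room, language) pair. The Python sketch below is a toy model of that bookkeeping only; the class name `Mcu`, its methods and the string identifiers are hypothetical, and real mixing, decoding and network handling are left out.

```python
class Mcu:
    """Toy MCU that keeps one mixer input table per (room, language)."""

    def __init__(self):
        # (room_id, language) -> {sender: latest frame from that sender}
        self.mixers = {}

    def create_room(self, room_id, languages):
        """Create one language mixer per language for a new meeting room."""
        for language in languages:
            self.mixers[(room_id, language)] = {}

    def receive(self, room_id, language, sender, frame):
        """Route an incoming frame to the mixer matching its room and language."""
        self.mixers[(room_id, language)][sender] = frame

    def inputs_for(self, room_id, language):
        """Frames a receiver of this language would get mixed together."""
        return self.mixers[(room_id, language)]


mcu = Mcu()
mcu.create_room("room-1", ["zh", "en"])
mcu.receive("room-1", "zh", "speaker-a", [1, 2, 3])
mcu.receive("room-1", "en", "interpreter-b", [4, 5, 6])
```

A receiver that asked for English in `room-1` would then be served from `mcu.inputs_for("room-1", "en")`, independently of the Chinese mixer in the same room.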
Description of drawings
Fig. 1 is a modular structure schematic diagram of the present invention;
Fig. 2 is a system flow chart of the present invention.
Embodiment
1. Fig. 1 shows the block diagram of the system modules. At the sending client terminal, the video and audio signals obtained from the input devices are compressed by the encoders, packed in a defined format and sent over the network. At the multipoint control unit, the multipoint control module provides control functions for all conferences and the multipoint processing module provides the data forwarding service. At the receiving client terminal, incoming packets are first unpacked; the resulting compressed video and audio data are decoded and sent to the output devices, and user data and control data are processed accordingly. The system comprises the following functions:
Video codec: performs redundancy-reducing compression coding of the video stream; it can be implemented with MPEG-4, H.264, etc.
Audio codec: performs silence detection and the encoding and decoding of the voice signal, and optionally adds a buffer delay at the receiving end to keep the speech continuous; G.723, G.729, etc. can be used.
Control unit: provides end-to-end signaling to ensure proper communication between terminals. Four kinds of messages are defined: request, response, command and indication. Communication is controlled through capability negotiation between terminals, opening and closing logical channels, and sending commands or indications.
Network transport layer: formats and sends video, audio and control data, and receives data from the network. It also handles functions such as logical framing, sequence numbering and error detection.
Auxiliary office: implements functions such as electronic whiteboard, text chat and file transfer.
Fig. 2 describes the audio and video data flow of the system. Payload characteristics, sequence numbers and the like can be carried by the Real-time Transport Protocol (RTP), with TCP or UDP used for transport.
2. Implementation of the method supporting high-capacity audio mixing: in silence detection, the number of Mel-frequency cepstral coefficients is L = 12, the kernel (inner product) function of the support vector machine is a radial basis function (RBF), and the support vector machine can be trained with the SMO method; the invention is not limited in this respect.
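For reference, the decision rule of a trained two-class SVM with an RBF kernel can be written out by hand. The sketch below shows only the classification step; training (for example with SMO) is not shown. The support vectors, coefficients, bias, `gamma` value and the convention that +1 means speech and -1 means silence are all made-up illustrations, and the vectors are two-dimensional for brevity rather than 12-dimensional.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def svm_decide(x, support_vectors, dual_coefs, bias, gamma):
    """Sign of sum_i coef_i * K(sv_i, x) + bias; coef_i stands for alpha_i * y_i."""
    score = sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias
    return 1 if score >= 0 else -1  # assumption: +1 = speech, -1 = silence


# Hypothetical model with one "silence" and one "speech" support vector.
support_vectors = [[0.0, 0.0], [3.0, 3.0]]
dual_coefs = [-1.0, 1.0]
near_silence = svm_decide([0.1, 0.0], support_vectors, dual_coefs, 0.0, 1.0)
near_speech = svm_decide([3.0, 2.9], support_vectors, dual_coefs, 0.0, 1.0)
```

A frame classified as silence would simply not be compressed or transmitted by the client terminal, which is where the bandwidth saving comes from.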
The short-time adaptive weight mixing method lends itself to a highly parallel computation structure. The average amplitudes Avg[j] of the channels in formula (4) are computed independently of one another, so Avg[j] can be computed for all channels in parallel. In the mixing step, the computation for each channel is likewise independent and therefore also suited to parallel computation. The program can also be optimized with the MMX, SSE and SSE2 instruction sets. Actual tests show that this method mixes well, introduces no new mixing noise, and preserves the detail of each original channel while keeping the volumes fair.
3. In use, each client can freely choose the language to listen to from several languages. The right to choose the speaking language is subject to permission settings: a client with ordinary status can speak only in a default language, and only a client whose status is interpreter or senior client can choose another language to speak in. Each client compresses its audio locally and uploads it to the MCU. The MCU decompresses the audio and mixes it in the mixer corresponding to the client's chosen speaking language, then recompresses the mix for the language each client has chosen to listen to and sends it to that client. For a client whose speaking and listening languages are the same, the MCU must first subtract that client's own voice from the mix so that the client does not hear itself.
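Removing a client's own voice from the mix it receives is a simple subtraction once the mixing weights are known. The Python sketch below assumes the weighted-sum mix of formula (6) and subtracts the listener's weighted contribution; whether the remaining weights are renormalized afterwards is not stated in the text and is not done here. The function name `mix_minus` and the sample values are illustrative.

```python
def mix_minus(mixed, own_samples, own_weight):
    """Subtract one talker's weighted contribution from the language mix."""
    return [m - s * own_weight for m, s in zip(mixed, own_samples)]


# Two hypothetical talkers in the same language mixer.
talker_a = [100.0, -100.0]
talker_b = [300.0, -300.0]
weights = [0.25, 0.75]  # as produced by formula (5) for these amplitudes

mixed = [
    talker_a[i] * weights[0] + talker_b[i] * weights[1]
    for i in range(len(talker_a))
]

# Talker B listens in the same language, so B's own voice is removed.
heard_by_b = mix_minus(mixed, talker_b, weights[1])
```

What talker B hears is then only talker A's weighted contribution, so the MCU needs one full mix per language plus one subtraction per active talker rather than a separate mix per listener.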
So that the MCU and clients can represent and distinguish the language of each packet sent or received, a new audio packet header format is defined in which several bits of the header encode the language; 3 bits generally suffice for 8 languages.
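The text fixes only the idea of a language field of a few bits, not the full header layout. As a hedged illustration, the Python sketch below invents a 5-byte header (one byte whose low 3 bits carry the language, a 16-bit room id and a 16-bit sequence number); every field other than the 3-bit language code is an assumption made for the example.

```python
import struct

LANG_BITS = 3                       # 3 bits cover 8 languages
LANG_MASK = (1 << LANG_BITS) - 1    # 0b111

# Hypothetical layout: flags byte, room id, sequence number (network order).
HEADER = struct.Struct("!BHH")


def pack_header(language_id, room_id, sequence):
    """Build the header; the language must fit in the 3-bit field."""
    if not 0 <= language_id <= LANG_MASK:
        raise ValueError("language id must fit in 3 bits")
    return HEADER.pack(language_id, room_id, sequence)


def unpack_header(data):
    """Return (language_id, room_id, sequence) from the start of a packet."""
    flags, room_id, sequence = HEADER.unpack(data[:HEADER.size])
    return flags & LANG_MASK, room_id, sequence


header = pack_header(language_id=5, room_id=1, sequence=42)
fields = unpack_header(header + b"audio payload")
```

On receipt, the MCU would read the language and room fields to pick the mixer, and a client would use the language field to confirm that the stream matches the language it requested.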

Claims (3)

1. A simultaneous interpretation video conference system and method supporting high-capacity audio mixing, characterized in that it comprises:
(1) a method supporting high-capacity audio mixing, which suppresses the transmission of silence at the speaking terminal using a silence detection method based on Mel-scale cepstral features and a support vector machine, and uses a short-time adaptive weight mixing method in the mixer of the multipoint control unit;
(2) simultaneous multilingual mixing in a single meeting room, for which a new audio packet header format is defined and several mixing processes are used in one meeting room.
2. The simultaneous interpretation video conference system and method supporting high-capacity audio mixing according to claim 1, characterized in that: in item (1), a silence detection method based on Mel-scale cepstral features and a support vector machine, and a short-time adaptive weight mixing method, are proposed.
3. The simultaneous interpretation video conference system and method supporting high-capacity audio mixing according to claim 1, characterized in that: in item (2), a new audio packet header format is defined, and several mixing processes are used in one meeting room.
CN200610040060.1A 2006-04-30 2006-04-30 Simultaneous interpretation video conference system and method for supporting high capacity mixed sound Pending CN1845573A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200610040060.1A CN1845573A (en) 2006-04-30 2006-04-30 Simultaneous interpretation video conference system and method for supporting high capacity mixed sound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200610040060.1A CN1845573A (en) 2006-04-30 2006-04-30 Simultaneous interpretation video conference system and method for supporting high capacity mixed sound

Publications (1)

Publication Number Publication Date
CN1845573A true CN1845573A (en) 2006-10-11

Family

ID=37064483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200610040060.1A Pending CN1845573A (en) 2006-04-30 2006-04-30 Simultaneous interpretation video conference system and method for supporting high capacity mixed sound

Country Status (1)

Country Link
CN (1) CN1845573A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008040258A1 (en) * 2006-09-30 2008-04-10 Huawei Technologies Co., Ltd. System and method for realizing multi-language conference
CN103327014A (en) * 2013-06-06 2013-09-25 腾讯科技(深圳)有限公司 Voice processing method, device and system
CN105304079A (en) * 2015-09-14 2016-02-03 上海可言信息技术有限公司 Multi-party call multi-mode speech synthesis method and system
CN106060707A (en) * 2016-05-27 2016-10-26 北京小米移动软件有限公司 Reverberation processing method and device
CN107046523A (en) * 2016-11-22 2017-08-15 深圳大学 A kind of simultaneous interpretation method and client based on individual mobile terminal
CN113257256A (en) * 2021-07-14 2021-08-13 广州朗国电子科技股份有限公司 Voice processing method, conference all-in-one machine, system and storage medium

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9031849B2 (en) 2006-09-30 2015-05-12 Huawei Technologies Co., Ltd. System, method and multipoint control unit for providing multi-language conference
WO2008040258A1 (en) * 2006-09-30 2008-04-10 Huawei Technologies Co., Ltd. System and method for realizing multi-language conference
US9311920B2 (en) * 2013-06-06 2016-04-12 Tencent Technology (Shenzhen) Company Limited Voice processing method, apparatus, and system
CN103327014A (en) * 2013-06-06 2013-09-25 腾讯科技(深圳)有限公司 Voice processing method, device and system
WO2014194728A1 (en) * 2013-06-06 2014-12-11 Tencent Technology (Shenzhen) Company Limited Voice processing method, apparatus, and system
US20150112668A1 (en) * 2013-06-06 2015-04-23 Tencent Technology (Shenzhen) Company Limited Voice processing method, apparatus, and system
CN103327014B (en) * 2013-06-06 2015-08-19 腾讯科技(深圳)有限公司 A kind of method of speech processing, Apparatus and system
CN105304079A (en) * 2015-09-14 2016-02-03 上海可言信息技术有限公司 Multi-party call multi-mode speech synthesis method and system
CN105304079B (en) * 2015-09-14 2019-05-07 上海可言信息技术有限公司 A kind of multi-mode phoneme synthesizing method of multi-party call and system and server
CN106060707A (en) * 2016-05-27 2016-10-26 北京小米移动软件有限公司 Reverberation processing method and device
CN106060707B (en) * 2016-05-27 2021-05-04 北京小米移动软件有限公司 Reverberation processing method and device
CN107046523A (en) * 2016-11-22 2017-08-15 深圳大学 A kind of simultaneous interpretation method and client based on individual mobile terminal
CN113257256A (en) * 2021-07-14 2021-08-13 广州朗国电子科技股份有限公司 Voice processing method, conference all-in-one machine, system and storage medium

Similar Documents

Publication Publication Date Title
CN102226944B (en) Audio mixing method and equipment thereof
CN112104836A (en) Audio mixing method, system, storage medium and equipment for audio server
CN101502089B (en) Method for carrying out an audio conference, audio conference device, and method for switching between encoders
US9456273B2 (en) Audio mixing method, apparatus and system
US9462224B2 (en) Guiding a desired outcome for an electronically hosted conference
Hardman et al. Reliable audio for use over the Internet
CN105304079A (en) Multi-party call multi-mode speech synthesis method and system
CN1845573A (en) Simultaneous interpretation video conference system and method for supporting high capacity mixed sound
CN103988486B (en) The method of active channel is selected in the audio mixing of multiparty teleconferencing
CN102741831B (en) Scalable audio frequency in multidrop environment
CN101218813A (en) Spatialization arrangement for conference call
CN101179693A (en) Mixed audio processing method of session television system
CN101513030A (en) Voice mixing method, multipoint conference server using the method, and program
CN113140225A (en) Voice signal processing method and device, electronic equipment and storage medium
CN104167210A (en) Lightweight class multi-side conference sound mixing method and device
CN107580155B (en) Network telephone quality determination method, network telephone quality determination device, computer equipment and storage medium
CN102915736B (en) Mixed audio processing method and stereo process system
WO2023202250A1 (en) Audio transmission method and apparatus, terminal, storage medium and program product
CN115662437A (en) Voice transcription method under scene of simultaneous use of multiple microphones
CN101502043A (en) Method for carrying out a voice conference, and voice conference system
Baskaran et al. Audio mixer with automatic gain controller for software based multipoint control unit
CN115831132A (en) Audio encoding and decoding method, device, medium and electronic equipment
Sethi et al. A new weighted audio mixing algorithm for a multipoint processor in a VoIP conferencing system
CN101123572A (en) Packet loss hiding method
CN113936669A (en) Data transmission method, system, device, computer readable storage medium and equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C57 Notification of unclear or unknown address
DD01 Delivery of document by public notice

Addressee: Xue Wei

Document name: Notification before expiration of term

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20061011