WO2012055291A1 - Method and system for transmitting audio data - Google Patents

Method and system for transmitting audio data

Info

Publication number
WO2012055291A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
telepresence video
telepresence
video conference
audio
Prior art date
Application number
PCT/CN2011/078938
Other languages
French (fr)
Chinese (zh)
Inventor
李靖
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2012055291A1 publication Critical patent/WO2012055291A1/en

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities, audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/567 Multimedia conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Definitions

  • Telepresence is a teleconferencing technology that has emerged in recent years and integrates video communication with the communication experience. It is characterized by life-size images, ultra-high definition and low latency, and it focuses on the effect of real face-to-face communication; its implementation involves the network, communications, the meeting environment, functional applications and other aspects, and what is finally presented to the conference participants is an integrated, realistic communication experience combined with transactional applications.
  • the telepresence system is a comprehensive product solution that, in a globalized economy, helps companies establish communication with customers more quickly and effectively, improving responsiveness and productivity while avoiding the delays and inconvenience caused by travel, thereby accelerating revenue growth and streamlining costs.
  • the telepresence system can be applied to executive meetings, telemedicine, distance education, collaborative office work, remote attendance, remote presentation and many other fields.
  • a telepresence system consists of three terminals, a camera, three microphones and three speakers, as well as three plasma displays and a dedicated conference table.
  • in the old implementation, each terminal is connected to one microphone for sound input and one speaker for sound output. The three positions are called the left, center and right seats: the left microphone, speaker and plasma display are connected to the left terminal, the middle microphone, speaker and plasma display are connected to the middle terminal, and the right microphone, speaker and plasma display are connected to the right terminal.
  • a primary object of the present invention is to provide an audio data transmission method and system to solve at least one of the above problems.
  • an audio data transmission method applied to a telepresence video conference system, comprising: receiving audio data from one terminal in a telepresence video conference system, wherein the audio data is input to the terminal in stereo; and encoding/decoding the audio data and sending it to another telepresence video conference system.
  • the terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and before the audio data is received the method further includes the following steps: the telepresence video conference system determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, the audio signal is output on the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, the audio signal is output on the right-screen side of the opposite end.
  • the mixing process includes the following steps: acquiring the decoded audio data that was input in stereo at the at least two telepresence video conference systems; and converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition.
  • when the audio data input in stereo at the telepresence video conference systems is acquired during the mixing process, the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, the predetermined number of systems being those whose output audio signals are the strongest.
  • an audio data transmission system including a telepresence video conference system and an MCU, wherein the telepresence video conference system includes a plurality of terminals, one of which is an audio transmission terminal configured to receive and output audio signals in stereo; the MCU is configured to receive audio data from the audio transmission terminal in the telepresence video conference system, encode/decode the audio data, and send it to another telepresence video conference system.
  • the audio transmission terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and is further configured to determine the difference between the volumes of the left and right channels of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end.
  • the MCU is further configured to perform mixing on the audio data from at least two telepresence video conference systems when there are at least two such systems.
  • the MCU is configured to obtain the decoded audio data that was input in stereo at the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition.
  • the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals are the strongest among the at least two systems.
  • FIG. 1 is a schematic structural diagram of a telepresence video system according to the prior art;
  • FIG. 2 is a flowchart of an audio data transmission method according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of an audio data transmission method according to a preferred embodiment of the present invention;
  • FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention;
  • FIG. 5 is a flowchart of the decoding in the audio data transmission method according to an example of the present invention;
  • FIG. 6 is a flowchart of the mixing in the audio data transmission method according to an example of the present invention;
  • FIG. 7 is a flowchart of the encoding in the audio data transmission method according to an example of the present invention;
  • FIG. 8 is a flowchart of the sending in the audio data transmission method according to an example of the present invention;
  • FIG. 9 is a structural block diagram of an audio data transmission system according to an embodiment of the present invention;
  • FIG. 10 is a schematic structural diagram of a telepresence video system according to a preferred embodiment of the present invention.
  • BEST MODE FOR CARRYING OUT THE INVENTION
  • the audio data transmission method includes the following processing (step S202 - step S204): Step S202, audio data is received from one terminal in the telepresence video conference system, where the audio data is input to the terminal in stereo. Step S204, the audio data is encoded/decoded and sent to another telepresence video conference system.
  • FIG. 3 is a flowchart of an audio data transmission method in accordance with a preferred embodiment of the present invention. As shown in FIG. 3,
  • the audio data transmission method includes the following processing (step S302 - step S310): Step S302, the telepresence video conference system first determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end for output.
  • in this way, locating the speaker by sound can truly be achieved, because the terminal that receives the audio data is located in the middle of the telepresence video conference system and the stereo audio format includes a left and a right channel. When a person on the left side of the terminal speaks, the volume picked up by the left channel is necessarily greater than that picked up by the right channel; likewise, when a person on the right side of the terminal speaks, the volume picked up by the right channel is necessarily greater than that picked up by the left channel; and when a person in the middle speaks, the volumes picked up by the left and right channels are roughly equal. Accordingly, the relationship between the volumes picked up by the two channels can be used to determine from which position a person is speaking.
  • however, in practice the speaker in the middle position is only approximately in the middle, so the volumes picked up by the left and right channels always differ somewhat, and simply comparing the left- and right-channel volumes to decide which side a speaker is on may lead to misjudgment.
  • to solve this problem, two thresholds are introduced: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, it is determined that a person on the left is speaking and the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, it is determined that a person on the right is speaking and the audio signal is transmitted to the right-screen side of the opposite end. In this way it is possible to determine more accurately where the speaker is located.
  • the values of the first threshold and the second threshold need to be determined according to factors such as the relative positions of the speakers and the microphones and the level of environmental noise, and different values are used in different situations. The above judgment is actually for the case of three speakers; when there are more speakers, the threshold-based determination still applies. A minimal sketch of the basic three-way decision is given below.
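As a concrete illustration of the threshold comparison described above, the following Python sketch shows one way the left/middle/right decision could be made from per-channel volume estimates. It is only a minimal sketch under assumed names (`frame_rms`, `locate_speaker`, and the threshold values are illustrative), not the patent's actual implementation.

```python
import numpy as np

def frame_rms(samples: np.ndarray) -> float:
    """Root-mean-square level of one channel of a PCM frame."""
    return float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))

def locate_speaker(left: np.ndarray, right: np.ndarray,
                   first_threshold: float, second_threshold: float) -> str:
    """Decide which screen side of the far end should output the signal.

    left/right: PCM samples of a stereo frame captured by the middle terminal.
    first_threshold / second_threshold: volume-difference thresholds; per the
    patent, their values depend on speaker/microphone geometry and noise level.
    """
    left_vol = frame_rms(left)
    right_vol = frame_rms(right)
    if left_vol - right_vol > first_threshold:
        return "left-screen side"    # a person on the left of the terminal is speaking
    if right_vol - left_vol > second_threshold:
        return "right-screen side"   # a person on the right is speaking
    return "middle screen"           # volumes roughly equal: middle speaker
```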
  • Step S304 receiving audio data from a terminal in the telepresence video conference system, wherein the audio data is input to the terminal in a stereo manner.
  • Step S306 performing decoding processing on the audio data.
  • Step S308, performing sound mixing processing on audio data from at least two telepresence video conference systems.
  • a telepresence video conference system participating in the conference needs to receive, at the same time, the sound of all the telepresence video conference systems participating in the conference other than the local end (that is, the local telepresence video conference system).
  • the mixing process in the above step S308 may further include the following processing:
  • the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals have the highest intensity among the at least two telepresence video conference systems.
  • if the sounds of all participating systems were simply mixed together, the resulting sound would be difficult to distinguish; the more telepresence video conference systems participate in the video conference, the more serious this phenomenon becomes, until in the end nothing can be heard clearly.
  • step S310, the mixed audio data is encoded and sent to another telepresence video conference system.
  • FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention. The above preferred embodiment will be described in detail below with reference to FIG. 4. As shown in FIG. 4,
  • the audio data transmission method includes the following processing: Step S402, a message from the upper end of the terminal is received, and the audio format of the terminal is obtained from it. Step S404, it is determined whether the format is stereo. Step S406, audio data is obtained from the network access board and stored in a temporary buffer, and whether a complete audio packet has been obtained is determined from the identifiers of the packet header and trailer; if a complete audio packet has been acquired, the data is copied from the temporary buffer into the pre-decoding data buffer. If the audio format of the terminal is stereo, one read of audio data from the network access board is stored in a temporary buffer, the first half of the temporary buffer is stored in the left-channel pre-decoding data buffer and the second half in the right-channel pre-decoding data buffer, as illustrated in the sketch below.
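A minimal sketch of the buffering step described in step S406, assuming the packet payload simply carries the left-channel bytes in its first half and the right-channel bytes in its second half; the helper names are illustrative, not taken from the patent.

```python
def split_stereo_payload(payload: bytes):
    """Split a complete stereo audio packet payload into pre-decode buffers.

    The first half of the temporary buffer goes to the left-channel
    pre-decoding buffer and the second half to the right-channel one.
    """
    half = len(payload) // 2
    left_predecode = bytearray(payload[:half])
    right_predecode = bytearray(payload[half:])
    return left_predecode, right_predecode

def split_mono_payload(payload: bytes) -> bytearray:
    """For a non-stereo terminal the whole payload feeds one channel's buffer."""
    return bytearray(payload)
```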
  • Step S408 if there is sufficient audio data in the data buffer before decoding, left channel decoding and right channel decoding are respectively performed, and the decoded data is respectively stored in the decoded data buffer of the left and right channels. Description will be made below with reference to FIG. 5. As shown in FIG. 5,
  • the decoding process may further include the following processing (step S502 - step S506): Step S502, it is determined whether the terminal's pre-decoding data buffer contains enough audio data for one decoding pass; if so, step S504 is performed, otherwise step S506 is performed. Step S504, one frame of data is taken from the pre-decoding buffer, a decoder is selected according to the audio format, and the decoded data is stored in the decoded buffer; if the audio format of the terminal is stereo, data is taken from the left- and right-channel pre-decoding data buffers respectively, left-channel decoding and right-channel decoding are performed, and the decoded data is stored in the decoded data buffers of the left and right channels.
  • if the audio format of the terminal is not stereo, data is taken from the left-channel pre-decoding data buffer, left-channel decoding is performed, and the decoded data is stored in the left channel's decoded data buffer.
  • Step S506, if there is not enough data for one decoding pass, one silence frame is written to the decoded data buffer.
  • in a specific implementation, if the audio format of the terminal is not stereo, data is taken from the left-channel (or right-channel) pre-decoding data buffer, left-channel (or right-channel) decoding is performed, and the decoded data is stored in the decoded data buffer of the left channel (or right channel).
  • Step S410, data is taken from the decoded buffer for mixing. This is described below with reference to FIG. 6. As shown in FIG. 6,
  • the mixing process mainly includes the following processing (step S602 - step S608): Step S602, it is determined whether there is one frame of data in the decoded buffer; if so, step S604 is performed, otherwise the process ends. Step S604, one frame of data is taken from the decoded buffer, the sampling rates of all the ends in the conference are converted to 32 kHz, and mixing is then performed.
  • the basic principles of mixing are: a. no end hears itself; b. a predetermined number (for example, three) of the loudest ends among all the ends are selected and mixed (superimposed), as illustrated in the sketch below.
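The following sketch illustrates the two mixing rules just listed (no end hears itself; only a predetermined number of the loudest ends are superimposed) together with the 32 kHz sample-rate alignment mentioned in step S604. Function and variable names are assumptions for illustration; a real system would use a proper resampler and saturation handling rather than this simplified version.

```python
import numpy as np

TARGET_RATE = 32000  # all conference ends are aligned to 32 kHz before mixing
MIX_COUNT = 3        # predetermined number of loudest ends that take part

def resample(frame: np.ndarray, src_rate: int) -> np.ndarray:
    """Crude linear-interpolation resample to TARGET_RATE (illustrative only)."""
    if src_rate == TARGET_RATE:
        return frame.astype(np.float64)
    n_out = int(round(len(frame) * TARGET_RATE / src_rate))
    x_old = np.linspace(0.0, 1.0, num=len(frame), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, frame.astype(np.float64))

def mix_for_end(listener: str, frames: dict, rates: dict) -> np.ndarray:
    """Mix one output frame for `listener`.

    frames: end name -> decoded frame of samples for that end
    rates:  end name -> sample rate of that frame
    """
    # Rule a: the listener never hears itself.
    others = {end: resample(f, rates[end])
              for end, f in frames.items() if end != listener}
    # Rule b: keep only the MIX_COUNT loudest ends.
    loudest = sorted(others,
                     key=lambda end: np.sqrt(np.mean(others[end] ** 2)),
                     reverse=True)[:MIX_COUNT]
    if not loudest:
        return np.zeros(0)
    length = min(len(others[end]) for end in loudest)
    mixed = np.zeros(length)
    for end in loudest:
        mixed += others[end][:length]     # superposition (overlay)
    return np.clip(mixed, -32768, 32767)  # keep within 16-bit PCM range
```

For instance, with four ends A, B, C and D and MIX_COUNT = 3, the frame mixed for A is the superposition of B, C and D, matching the example given later in the description.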
  • Step S606, the converted data is processed: the data of the left and right channels is processed differently depending on whether the mix is directed to the left, middle or right screen.
  • step S608, the mixing is performed and the data is stored in the pre-encoding buffer. If the ends include both mono and stereo terminals, the data of the stereo left and right channels is first processed as needed and then mixed; after mixing, a mono end obtains mono data and a stereo end obtains two channels of data, and the mixed data is stored in the pre-encoding data buffer.
  • in a specific implementation, if the audio format of the terminal is not stereo, data is taken from the left-channel (or right-channel) pre-encoding data buffer, left-channel (or right-channel) encoding is performed, and the encoded data is stored in the encoded data buffer of the left channel (or right channel).
  • Step S412, if the pre-encoding data buffer has enough data and the audio format of the terminal is stereo, left-channel encoding and right-channel encoding are performed separately, and the encoded data is stored in the encoded data buffers of the left and right channels respectively.
  • the encoding process mainly includes the following processing (step S702 - step S706): Step S702, it is determined whether there is one frame of data in the pre-encoding buffer. Step S704, if the terminal's pre-encoding data buffer contains enough data for one encoding pass (for example, one frame of data), an encoder is selected according to the audio format and the encoded data is stored in the encoded buffer; if the audio format of the terminal is stereo, data is taken from the left- and right-channel pre-encoding data buffers, left-channel encoding and right-channel encoding are performed, and the encoded data is stored in the encoded data buffers of the left and right channels respectively.
  • if the audio format of the terminal is not stereo, data is taken from the left-channel pre-encoding data buffer, left-channel encoding is performed, and the encoded data is stored in the left channel's encoded data buffer.
  • Step S706, if there is not enough data for one encoding pass, one silence frame is written to the encoded data buffer. A sketch of this per-channel decision is given below.
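A compact sketch of the per-channel encoding decision in steps S702-S706, under the assumption of a generic frame encoder and a fixed frame size; all names and sizes are illustrative and not taken from the patent.

```python
FRAME_SAMPLES = 640                            # assumed samples per frame
SILENCE_FRAME = b"\x00" * 2 * FRAME_SAMPLES    # one frame of 16-bit PCM silence

def encode_channel(pre_encode: bytearray, encode_frame, encoded: list) -> None:
    """Encode one channel if a full frame is available, else emit a silence frame.

    pre_encode:   the channel's pre-encoding data buffer (16-bit PCM bytes)
    encode_frame: encoder chosen according to the terminal's audio format
    encoded:      the channel's encoded data buffer
    """
    frame_bytes = 2 * FRAME_SAMPLES
    if len(pre_encode) >= frame_bytes:          # step S704: enough data to encode
        frame = bytes(pre_encode[:frame_bytes])
        del pre_encode[:frame_bytes]
        encoded.append(encode_frame(frame))
    else:                                       # step S706: write a silence frame
        encoded.append(encode_frame(SILENCE_FRAME))

# For a stereo terminal this is called once for the left channel and once for
# the right; for a mono terminal only the left-channel buffers are used.
```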
  • Step S802, it is determined whether there is one frame of data in the encoded buffer;
  • step S804, if the encoded data buffer contains one frame of data, the data is taken from the encoded data buffer and packed, and the packet header and trailer, the audio format and the packet length are written into the packet. If the audio format of the terminal is stereo, the data of the left and right channels is packed into one packet (the left-channel data in the first half and the right-channel data in the second half). The whole packet is then sent to the network access board and finally delivered by the network access board to the terminal for playback. A sketch of this packet layout follows.
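The packet layout described in steps S802-S804 (header, audio format, packet length, left-channel data in the first half, right-channel data in the second half, trailer) could look like the sketch below; the field sizes and marker bytes are assumptions for illustration only, not values specified by the patent.

```python
import struct

HEADER = b"\xAA\x55"   # assumed start-of-packet marker
TRAILER = b"\x55\xAA"  # assumed end-of-packet marker

def pack_stereo(left: bytes, right: bytes, audio_format: int) -> bytes:
    """Pack encoded left/right channel data into one packet.

    Left-channel data occupies the first half of the payload and right-channel
    data the second half, as described in step S804.
    """
    payload = left + right
    # header | format (1 byte) | payload length (2 bytes) | payload | trailer
    return HEADER + struct.pack("!BH", audio_format, len(payload)) + payload + TRAILER

def pack_mono(channel: bytes, audio_format: int) -> bytes:
    """For a non-stereo terminal only one channel's data is packed."""
    return HEADER + struct.pack("!BH", audio_format, len(channel)) + channel + TRAILER
```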
  • the audio data transmission system includes a telepresence video conference system 90 and an MCU 92; the telepresence video conference system is described below with reference to FIG. 10.
  • the telepresence video conference system 90 includes a plurality of terminals, only one of which is an audio transmission terminal, configured to receive and output audio signals in stereo.
  • the MCU 92 is configured to receive audio data from the audio transmission terminal in the telepresence video conference system 90, encode and decode the audio data, and send the audio data to another telepresence video conference system 94.
  • the telepresence video conference system 90 has the same structure as the telepresence video conference system 94.
  • as shown in FIG. 10, the above telepresence video conference system uses only one of its plurality of terminals as the audio transmission terminal, which removes the root cause of the accent problem and solves it completely.
  • the audio transmission terminal includes a left and a right microphone and a left and a right speaker, and uses the stereo format to receive and output sound, which lays a good foundation for achieving the effect of locating the speaker by sound.
  • the audio transmission terminal is located at an intermediate position among the plurality of terminals of the telepresence video conference system (either roughly in the middle or exactly in the middle), and may also be configured to determine the difference between the volumes of the left and right channels of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end. In this way, locating the speaker by sound can be achieved.
  • the MCU is further configured to perform mixing on the audio data from the at least two telepresence video conference systems when there are at least two such systems.
  • the MCU's mixing handles the case in which several telepresence video conference systems participating in the video conference produce audio data at the same time, ensuring that each participating system can receive the audio data of every other participating system simultaneously, which guarantees that the conference is real-time and accurate.
  • the MCU may be further configured to obtain the decoded audio data input from the at least two telepresence video conferencing systems in a stereo format, and after converting the sampling rate of the obtained audio data into a predetermined frequency, perform superposition processing. Since the present invention adopts audio data in stereo format, the MCU needs to ensure that the audio data format of each conference end is in stereo format.
  • the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals are the strongest among the at least two telepresence video conference systems.
  • a predetermined number of ends with the strongest audio signals is selected for mixing, and the specific number may be chosen according to the circumstances.
  • in this way, the accent problem in the telepresence high-definition video conference system is solved, the sound quality is improved considerably, and the effect of locating the speaker by sound is truly achieved.
  • the modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented by program code executable by the computing device, so that they may be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from that described here, or they may be made into individual integrated circuit modules separately, or multiple modules or steps among them may be made into a single integrated circuit module.
  • the invention is not limited to any specific combination of hardware and software.
  • the above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, various modifications and changes can be made to the present invention. Any modifications, equivalent substitutions, improvements and the like made within the spirit and scope of the present invention are intended to be included within the protection scope of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are a method and a system for transmitting audio data. The method for transmitting audio data comprises: receiving audio data from a terminal in a telepresence video conference system, wherein the audio data is input to the terminal in a stereo manner; and performing encoding and decoding on the audio data, and sending the audio data to another telepresence video conference system. Through technical solutions of the present invention, the sound overlapping problem in the telepresence video conference system in the prior art is solved, and effects of improving the sound and improving the user experience are achieved.

Description

Method and system for transmitting audio data. TECHNICAL FIELD The present invention relates to the field of communications, and in particular to an audio data transmission method and system. BACKGROUND OF THE INVENTION Telepresence is a teleconferencing technology that has emerged in recent years and integrates video communication with the communication experience. It is characterized by life-size images, ultra-high definition and low latency, and it focuses on the effect of real face-to-face communication; its implementation involves the network, communications, the meeting environment, functional applications and other aspects, and what is finally presented to the conference participants is an integrated, realistic communication experience combined with transactional applications. The telepresence system is a comprehensive product solution that, in a globalized economy, helps companies establish communication with customers more quickly and effectively, improving responsiveness and productivity while avoiding the delays and inconvenience caused by travel, thereby accelerating revenue growth and streamlining costs. The telepresence system can be applied to executive meetings, telemedicine, distance education, collaborative office work, remote attendance, remote presentation and many other fields. A telepresence system consists of three terminals, a camera, three microphones and three speakers, as well as three plasma displays and a dedicated conference table. In the old implementation, each terminal is connected to one microphone for sound input and one speaker for sound output. The three positions are called the left, center and right seats: the left microphone, speaker and plasma display are connected to the left terminal, the middle microphone, speaker and plasma display are connected to the middle terminal, and the right microphone, speaker and plasma display are connected to the right terminal. In this way, the sound and the image come out together from the left, the right or the middle, achieving the effect of locating the speaker by sound, as shown in Figure 1. However, this implementation has a fatal drawback: when one person speaks, the voice may be picked up by all three microphones simultaneously, enter the three terminals and, after mixing, be delivered to the left, middle and right terminals of the remote telepresence conference room and output from the left, center and right speakers. Because the network speeds of the three terminals differ, the sound arrives at different times, so the same sentence comes out of the three terminals' speakers three times, like an accent, which seriously degrades the listening experience and makes it very difficult to locate the speaker by sound. SUMMARY OF THE INVENTION A primary object of the present invention is to provide an audio data transmission method and system to solve at least one of the above problems.
According to an aspect of the present invention, an audio data transmission method applied to a telepresence video conference system is provided, comprising: receiving audio data from one terminal in a telepresence video conference system, wherein the audio data is input to the terminal in stereo; and encoding/decoding the audio data and sending it to another telepresence video conference system. In the above method, the terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and before the audio data is received the method further includes the following steps: the telepresence video conference system determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, the audio signal is transmitted to the right-screen side of the opposite end for output. In the above method, when there are at least two telepresence video conference systems, the following step is further included after the decoding processing and before the encoding processing: mixing the audio data from the at least two telepresence video conference systems. In the above method, the mixing includes the following steps: acquiring the decoded audio data that was input in stereo at the at least two telepresence video conference systems; and converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition. In the above method, when the audio data input in stereo at the telepresence video conference systems is acquired during the mixing, the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, the predetermined number of systems being those whose output audio signals are the strongest among the at least two systems. According to another aspect of the present invention, an audio data transmission system is provided, including a telepresence video conference system and an MCU, wherein the telepresence video conference system includes a plurality of terminals, one of which is an audio transmission terminal configured to receive and output audio signals in stereo; the MCU is configured to receive audio data from the audio transmission terminal in the telepresence video conference system, encode/decode the audio data, and send it to another telepresence video conference system. In the above system, the audio transmission terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and is further configured to determine the difference between the volumes of the left and right channels of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end.
In the above system, the MCU is further configured to perform mixing on the audio data from the at least two telepresence video conference systems when there are at least two such systems. In the above system, the MCU is configured to obtain the decoded audio data that was input in stereo at the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition. In the above system, the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals are the strongest among the at least two systems. Through the present invention, the scheme of transmitting audio signals using only one terminal of the telepresence video conference system solves the accent problem, thereby improving the sound quality and the user experience. BRIEF DESCRIPTION OF THE DRAWINGS The accompanying drawings described here are provided for a further understanding of the present invention and form a part of this application; the exemplary embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings: FIG. 1 is a schematic structural diagram of a telepresence video system according to the prior art; FIG. 2 is a flowchart of an audio data transmission method according to an embodiment of the present invention; FIG. 3 is a flowchart of an audio data transmission method according to a preferred embodiment of the present invention; FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention; FIG. 5 is a flowchart of the decoding in the audio data transmission method according to an example of the present invention; FIG. 6 is a flowchart of the mixing in the audio data transmission method according to an example of the present invention; FIG. 7 is a flowchart of the encoding in the audio data transmission method according to an example of the present invention; FIG. 8 is a flowchart of the sending in the audio data transmission method according to an example of the present invention; FIG. 9 is a structural block diagram of an audio data transmission system according to an embodiment of the present invention; FIG. 10 is a schematic structural diagram of a telepresence video system according to a preferred embodiment of the present invention. BEST MODE FOR CARRYING OUT THE INVENTION Hereinafter, the present invention will be described in detail with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that the embodiments in this application and the features in the embodiments may be combined with each other where they do not conflict. FIG. 2 is a flowchart of an audio data transmission method in accordance with an embodiment of the present invention. As shown in FIG. 2, the audio data transmission method according to the embodiment of the present invention includes the following processing (step S202 - step S204): Step S202, audio data is received from one terminal in the telepresence video conference system, where the audio data is input to the terminal in stereo. Step S204, the audio data is encoded/decoded and sent to another telepresence video conference system. Through the above method, only one terminal of the telepresence video conference system receives the audio data, thereby eliminating the root cause of the accent phenomenon.
At the same time, the application of the stereo input mode in the telepresence video conference system lays a good foundation for truly achieving the effect of locating the speaker by sound. FIG. 3 is a flowchart of an audio data transmission method in accordance with a preferred embodiment of the present invention. As shown in FIG. 3, the audio data transmission method includes the following processing (step S302 - step S310): Step S302, the telepresence video conference system first determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end for output. Through this method, locating the speaker by sound can truly be achieved, because the terminal that receives the audio data is located in the middle of the telepresence video conference system and the stereo audio format includes a left and a right channel. When a person on the left side of the terminal speaks, the volume picked up by the left channel is necessarily greater than that picked up by the right channel; likewise, when a person on the right side of the terminal speaks, the volume picked up by the right channel is necessarily greater than that picked up by the left channel; and when a person in the middle speaks, the volumes picked up by the left and right channels are roughly equal. Accordingly, the relationship between the volumes picked up by the two channels can be used to determine from which position a person is speaking. However, in practice the speaker in the middle position is only approximately in the middle, so the volumes picked up by the left and right channels always differ somewhat, and simply comparing the left- and right-channel volumes to decide which side a speaker is on may lead to misjudgment. To solve this problem, two thresholds are introduced: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, it is determined that a person on the left is speaking and the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, it is determined that a person on the right is speaking and the audio signal is transmitted to the right-screen side of the opposite end for output. In this way it is possible to determine more accurately where the speaker is located. The values of the first threshold and the second threshold need to be determined according to factors such as the relative positions of the speakers and the microphones and the level of environmental noise, and different values are used in different situations. The above judgment is actually for the case of three speakers; when there are more speakers, the threshold-based determination method is still applicable.
It is only necessary to introduce more thresholds so as to define several difference segments: for example, when the difference between the left-channel volume and the right-channel volume is greater than the first threshold and less than a third threshold, it is determined that the first person on the left is speaking, and when that difference is greater than the third threshold and less than a fifth threshold, it is determined that the second person on the left is speaking; the positions of speakers on the right can be judged in the same way. From the above analysis it can further be inferred that even if the terminal receiving the audio data is not in the middle position, the speaker's position can still be determined by adjusting the threshold settings, although this approach is more cumbersome, harder to implement and less reliable. Step S304, audio data is received from one terminal in the telepresence video conference system, where the audio data is input to the terminal in stereo. Step S306, the audio data is decoded. Step S308, the audio data from at least two telepresence video conference systems is mixed. Generally, in a telepresence video conference, a participating telepresence video conference system needs to receive, at the same time, the sound of all the participating systems other than the local end (that is, the local telepresence video conference system); therefore, when people speak at two or more telepresence video conference systems at the same time, these sounds must be mixed so that every system can hear them simultaneously. Preferably, the mixing in the above step S308 may further include the following processing:
(1) acquiring the decoded audio data that was input in stereo at the at least two telepresence video conference systems;
(2) converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition. Through the above steps the mixing process is completed, and at the same time the problem that some ends participating in the video conference (hereinafter simply referred to as conference ends) do not use the stereo format is solved: since the audio data used in the present invention is in stereo format, if a conference end's audio data format is not stereo, its audio data must first be converted to stereo before taking part in the mixing. Preferably, during the mixing, the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals have the highest intensity among the at least two systems. When there are many participants in the conference, mixing the sounds of all the participating telepresence video conference systems would make the sound difficult to distinguish; the more systems participate in the video conference, the more serious this phenomenon becomes, until in the end nothing can be heard clearly. Therefore, to avoid this situation, when the number of participating telepresence video conference systems is large, a predetermined number of ends is selected for mixing according to which audio data has the highest signal strength, rather than mixing all the ends; this number can be chosen according to the specific situation. Step S310, the mixed audio data is encoded and sent to another telepresence video conference system. FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention. The above preferred embodiment is described in detail below with reference to FIG. 4. As shown in FIG. 4, the audio data transmission method according to an example of the present invention includes the following processing: Step S402, a message from the upper end of the terminal is received, and the audio format of the terminal is obtained from that message. Step S404, it is determined whether the format is stereo. Step S406, audio data is obtained from the network access board and stored in a temporary buffer, and whether a complete audio packet has been obtained is determined from the identifiers of the packet header and trailer; if a complete audio packet has been acquired, the data is copied from the temporary buffer to the pre-decoding data buffer. If the audio format of the terminal is stereo, one read of audio data obtained from the network access board is stored in a temporary buffer, the first half of the temporary buffer is stored in the left-channel pre-decoding data buffer, and the second half is stored in the right-channel pre-decoding data buffer.
In a specific implementation, if the audio format of the terminal is not stereo, data is taken from the temporary buffer and stored only in the left-channel (or right-channel) pre-decoding data buffer. Step S408, if there is enough audio data in the pre-decoding data buffer, left-channel decoding and right-channel decoding are performed separately, and the decoded data is stored in the decoded data buffers of the left and right channels respectively. This is described below with reference to FIG. 5. As shown in FIG. 5, the decoding process may further include the following processing (step S502 - step S506): Step S502, it is determined whether the terminal's pre-decoding data buffer contains enough audio data for one decoding pass; if so, step S504 is performed, otherwise step S506 is performed. Step S504, one frame of data is taken from the pre-decoding buffer, a decoder is selected according to the audio format for decoding, and the decoded data is stored in the decoded buffer. If the audio format of the terminal is stereo, data is taken from the left- and right-channel pre-decoding data buffers respectively, left-channel decoding and right-channel decoding are performed, and the decoded data is stored in the decoded data buffers of the left and right channels. If the audio format of the terminal is not stereo, data is taken from the left-channel pre-decoding data buffer, left-channel decoding is performed, and the decoded data is stored in the left channel's decoded data buffer. Step S506, if there is not enough data for one decoding pass, one silence frame is written to the decoded data buffer. In a specific implementation, if the audio format of the terminal is not stereo, data is taken from the left-channel (or right-channel) pre-decoding data buffer, left-channel (or right-channel) decoding is performed, and the decoded data is stored in the decoded data buffer of the left channel (or right channel). Step S410, data is taken from the decoded buffer for mixing. This is described below with reference to FIG. 6. As shown in FIG. 6, the mixing process mainly includes the following processing (step S602 - step S608): Step S602, it is determined whether there is one frame of data in the decoded buffer; if so, step S604 is performed, otherwise the process ends. Step S604, one frame of data is taken from the decoded buffer, the sampling rates of all the ends in the conference are converted to 32 kHz, and mixing is then performed. The basic principles of mixing are: a. no end hears itself; b. a predetermined number (for example, three) of the loudest ends among all the ends are selected and mixed (superimposed). For example, if there are four ends A, B, C and D in the conference, then A hears B+C+D, B hears A+C+D, C hears A+B+D, and D hears A+B+C. Step S606, the converted data is processed: the data of the left and right channels is processed differently depending on whether the mix is directed to the left, middle or right screen. Step S608, the mixing is performed and the data is stored in the pre-encoding buffer. If the ends include both mono and stereo terminals, the data of the stereo left and right channels is first processed as needed and then mixed; after mixing, a mono end obtains mono data and a stereo end obtains two channels of data, and the mixed data is stored in the pre-encoding data buffer.
In the case of a normal stereo mix, the left channels of all the terminals are mixed together and the right channels of all the terminals are mixed together. If a stereo terminal is mixed to the left screen, the right channel of that terminal is set to 0 and left channel = (left channel + right channel) / 2. If a stereo terminal is mixed to the right screen, the left channel is set to 0 and right channel = (left channel + right channel) / 2. If a stereo terminal is mixed to the middle screen, then left channel = 0.707 × (left channel + right channel) / 2 and right channel = 0.707 × (left channel + right channel) / 2. After this processing, all left channels are mixed together and all right channels are mixed together, and the mixed data is stored in the pre-encode data buffer. If the terminals in the conference include both mono and stereo terminals, the left-channel and right-channel data of the stereo terminals is processed as needed and then mixed. If a mono terminal is mixed to the left screen, that end is first expanded to stereo with left channel = mono data and right channel = 0, and then the left channels are mixed together and the right channels are mixed together. If a mono terminal is mixed to the right screen, that end is first expanded to stereo with right channel = mono data and left channel = 0, and then the left channels are mixed together and the right channels are mixed together. If a mono terminal is mixed to the middle screen, that end is first expanded to stereo with left channel = 0.707 × (mono data) and right channel = 0.707 × (mono data), and then the left channels are mixed together and the right channels are mixed together. After mixing, a mono terminal obtains mono data and a stereo terminal obtains two channels of data. The mixed data is stored in the pre-encode data buffer (these panning rules are sketched in the code example after this section). In a specific implementation, if the audio format of the terminal is not the stereo format, data is taken from the left-channel (or right-channel) pre-encode data buffer, left-channel (or right-channel) encoding is performed, and the encoded data is stored in the left-channel (or right-channel) encoded data buffer. Step S412: If the pre-encode data buffer has sufficient data and the audio format of the terminal is the stereo format, left-channel encoding and right-channel encoding are performed respectively, and the encoded data is stored in the encoded data buffers of the left and right channels. This is described below with reference to FIG. 7. As shown in FIG. 7, the encoding process mainly includes the following processing (step S702 to step S706): Step S702: Determine whether there is one frame of data in the pre-encode buffer. Step S704: If the pre-encode data buffer of the terminal holds enough data for one encoding pass (for example, one frame of data), select an encoder according to the audio format and store the encoded data in the encoded buffer. If the audio format of the terminal is the stereo format, data is taken from the left-channel and right-channel pre-encode data buffers, left-channel encoding and right-channel encoding are performed, and the encoded data is stored in the encoded data buffers of the left and right channels. If the audio format of the terminal is not stereo, data is taken from the left-channel pre-encode data buffer, left-channel encoding is performed, and the encoded data is stored in the left-channel encoded data buffer. Step S706: If there is not enough data for one encoding pass, write one silence frame to the encoded data buffer.
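The screen-position formulas above (mix to the left, middle, or right screen) can be sketched as follows; the sample type, enum and function names are illustrative assumptions, and 0.707 is the equal-power (about -3 dB) factor stated above.

```c
/* Sketch of the per-terminal pre-processing applied before summation.      */
#include <stdint.h>

typedef enum { SCREEN_LEFT, SCREEN_MIDDLE, SCREEN_RIGHT } ScreenPos;

/* Map one stereo sample pair onto the screen position of its terminal. */
static void pan_stereo(int16_t *l, int16_t *r, ScreenPos pos)
{
    int32_t sum = (int32_t)*l + (int32_t)*r;
    switch (pos) {
    case SCREEN_LEFT:   *l = (int16_t)(sum / 2);           *r = 0; break;
    case SCREEN_RIGHT:  *r = (int16_t)(sum / 2);           *l = 0; break;
    case SCREEN_MIDDLE: *l = (int16_t)(0.707 * sum / 2);
                        *r = (int16_t)(0.707 * sum / 2);           break;
    }
}

/* Expand one mono sample onto the screen position of its terminal. */
static void pan_mono(int16_t mono, int16_t *l, int16_t *r, ScreenPos pos)
{
    switch (pos) {
    case SCREEN_LEFT:   *l = mono;                     *r = 0;     break;
    case SCREEN_RIGHT:  *r = mono;                     *l = 0;     break;
    case SCREEN_MIDDLE: *l = (int16_t)(0.707 * mono);
                        *r = (int16_t)(0.707 * mono);              break;
    }
}
```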
Step S414: If there is enough data in the encoded data buffer, the data of the left and right channels is packed into one packet (the left-channel data is placed in the first half and the right-channel data in the second half), the entire packet is sent to the network access board, and the network access board finally sends it to the terminal for playback. This is described below with reference to FIG. 8. As shown in FIG. 8, the packetizing and transmitting process mainly includes the following processing (step S802 to step S804): Step S802: Determine whether there is one frame of data in the encoded buffer. Step S804: If the encoded data buffer holds one frame of data, take the data from the encoded data buffer for packing, and write the packet header and packet tail, as well as the audio format and the packet length, into the packet. If the audio format of the terminal is stereo, the data of the left and right channels is packed into one packet (the left-channel data in the first half and the right-channel data in the second half); the entire packet is then sent to the network access board and finally forwarded by the network access board to the terminal for playing. If the audio format of the terminal is not the stereo format, only the left-channel (or right-channel) data is packed into a packet, which is then sent to the network access board and finally forwarded by the network access board to the terminal for playing (a minimal packing sketch is given after the description of FIG. 9 and FIG. 10 below). FIG. 9 is a block diagram showing the structure of an audio data transmission system according to an embodiment of the present invention. As shown in FIG. 9, the audio data transmission system includes a telepresence video conference system 90 and an MCU 92; the telepresence video conference system is described below with reference to FIG. 10. As shown in FIG. 10, the telepresence video conference system 90 includes a plurality of terminals, only one of which is an audio transmission terminal, configured to receive and output audio signals in stereo.
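The packing of step S804 (header carrying the audio format and packet length, left-channel data in the first half of the payload, right-channel data in the second half) can be sketched as follows; the header byte layout and identifier value are illustrative assumptions.

```c
/* Minimal packing sketch for step S804. For a mono terminal, right_len
 * would simply be 0 so that only one channel is packed.                    */
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Returns the total packet length written into 'out'. */
static size_t pack_stereo_frame(const uint8_t *left, size_t left_len,
                                const uint8_t *right, size_t right_len,
                                uint8_t audio_format, uint8_t *out)
{
    size_t pos = 0;
    uint16_t payload_len = (uint16_t)(left_len + right_len);

    /* Illustrative header: identifier, audio format, payload length. */
    out[pos++] = 0xA5;                        /* packet-header identifier  */
    out[pos++] = 0x5A;
    out[pos++] = audio_format;                /* e.g. 1 = stereo, 0 = mono */
    out[pos++] = (uint8_t)(payload_len >> 8);
    out[pos++] = (uint8_t)(payload_len & 0xFF);

    memcpy(out + pos, left, left_len);        /* left channel: first half  */
    pos += left_len;
    memcpy(out + pos, right, right_len);      /* right channel: second half */
    pos += right_len;
    return pos;
}
```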
The MCU 92 is configured to receive audio data from the audio transmission terminal in the telepresence video conference system 90, perform codec processing on the audio data, and send it to another telepresence video conference system 94. The telepresence video conference system 94 has the same structure as the telepresence video conference system 90; for details, refer to FIG. 10. As can be clearly seen from FIG. 10, the above telepresence video conference system uses only one of its plurality of terminals as the audio transmission terminal, which removes the root cause of the doubled-sound problem and solves it completely. At the same time, the audio transmission terminal includes left and right microphones and left and right loudspeakers, so it in fact receives and outputs sound in the stereo format, which lays a good foundation for genuinely locating a speaker by sound. Preferably, as shown in FIG. 10, the audio transmission terminal is located at an intermediate position among the plurality of terminals of the telepresence video conference system (at or near the exact middle). It may also be configured to determine the difference between the left-channel and right-channel volumes of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, the audio signal is transmitted to the left-screen-side output of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, the audio signal is transmitted to the right-screen-side output of the opposite end. In this way sound-source localization is truly achieved; the reason is described in detail in the method embodiment above and is not repeated here. Preferably, the MCU is further configured to perform mixing processing on the audio data from at least two telepresence video conference systems when there are at least two telepresence video conference systems.
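The threshold rule above (route the signal to the left- or right-screen side of the opposite end according to the left/right volume difference) can be sketched as follows; the level measure, function names and threshold values are illustrative assumptions.

```c
/* Sketch of the localisation decision: compare per-frame left and right
 * channel levels and decide which screen side of the far end should play
 * the signal.                                                              */
#include <stdint.h>
#include <stdlib.h>

typedef enum { RENDER_CENTER, RENDER_LEFT_SCREEN, RENDER_RIGHT_SCREEN } RenderSide;

static int32_t frame_level(const int16_t *s, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += abs(s[i]);
    return (int32_t)(acc / n);      /* mean absolute level, illustrative */
}

static RenderSide pick_side(const int16_t *left, const int16_t *right, int n,
                            int32_t thresh_left, int32_t thresh_right)
{
    int32_t lv = frame_level(left, n);
    int32_t rv = frame_level(right, n);
    if (lv - rv > thresh_left)  return RENDER_LEFT_SCREEN;
    if (rv - lv > thresh_right) return RENDER_RIGHT_SCREEN;
    return RENDER_CENTER;
}
```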
The MCU's mixing processing addresses the case in which several telepresence video conference systems participating in the video conference produce audio data at the same time; it ensures that every participating telepresence video conference system can receive the audio data of every other system at the same time, guaranteeing the real-time behaviour and accuracy of the conference. Preferably, the MCU may further be configured to acquire the decoded audio data input in the stereo format from the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition processing. Since the present invention uses audio data in the stereo format, the MCU needs to ensure that the audio data of every conference end is in the stereo format; when the audio data of a conference end is not in the stereo format, it is first converted into the stereo format and then takes part in the mixing, so that the audio data format stays consistent. Preferably, the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals have the highest intensity among the at least two telepresence video conference systems. When many telepresence video conference systems participate in the video conference, the number of audio streams taking part in the mixing must be limited so that the sound stays clear and easy to recognize. In the present invention a certain number of the loudest audio streams are selected for mixing, and how many are selected can be decided according to the specific situation (a selection sketch is given after this section). In practice many selection strategies are possible: different strategies can be formulated according to the priority of each conference end, the signal transmission quality and other factors, so that telepresence video conferences in different environments use the strategy that suits them. In summary, with the above embodiments provided by the present invention, the doubled-sound problem in a telepresence high-definition video conference system can be solved, the sound quality is greatly improved, and the effect of locating a speaker by sound is genuinely achieved.
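The "loudest N" selection described above can be sketched as follows; the level metric and the value of N are illustrative assumptions, and a real MCU would typically smooth levels over time to avoid rapid switching between ends.

```c
/* Sketch of loudest-N selection: of all conference ends, only the N with
 * the strongest audio signal take part in the mix.                         */
#include <stdint.h>

/* Marks the 'n_select' loudest ends in 'selected' (1 = include in mix). */
static void select_loudest(const int32_t level[], int ends,
                           int n_select, int selected[])
{
    for (int i = 0; i < ends; ++i)
        selected[i] = 0;

    for (int k = 0; k < n_select && k < ends; ++k) {
        int best = -1;
        for (int i = 0; i < ends; ++i)
            if (!selected[i] && (best < 0 || level[i] > level[best]))
                best = i;
        if (best >= 0)
            selected[best] = 1;
    }
}
```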
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by a plurality of computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from that described herein, or they may be fabricated into individual integrated circuit modules, or a plurality of the modules or steps may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software. The above description is only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. An audio data transmission method, applied to a telepresence video conference system, comprising the following steps:

receiving audio data from one terminal in the telepresence video conference system, wherein the audio data is input to the terminal in stereo; and

performing codec processing on the audio data and sending it to another telepresence video conference system.
2. The method according to claim 1, wherein the one terminal is located at an intermediate position among a plurality of terminals of the telepresence video conference system, and before the audio data is received, the method further comprises the following step:

the telepresence video conference system determining the difference between the left-channel and right-channel volumes of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, transmitting the audio signal to the left-screen-side output of the opposite end; and when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, transmitting the audio signal to the right-screen-side output of the opposite end.
3. The method according to claim 1, wherein, when there are at least two telepresence video conference systems, the following step is further included after the decoding processing and before the encoding processing:

performing mixing processing on the audio data from the at least two telepresence video conference systems.
4. The method according to claim 3, wherein the mixing processing comprises the following steps:

acquiring decoded audio data input in the stereo format from the at least two telepresence video conference systems; and

converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition processing.
5. The method according to claim 4, wherein, in the mixing processing, when the audio data input in the stereo format in the telepresence video conference systems is acquired,

the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, wherein the predetermined number of telepresence video conference systems output the audio signals with the highest intensity among the at least two telepresence video conference systems.
6. An audio data transmission system, comprising: a telepresence video conference system and an MCU;

wherein the telepresence video conference system comprises a plurality of terminals, one of the plurality of terminals being an audio transmission terminal configured to receive and output audio signals in stereo; and

the MCU is configured to receive the audio data from the audio transmission terminal in the telepresence video conference system, perform codec processing on the audio data, and send it to another telepresence video conference system.

7. The system according to claim 6, wherein the audio transmission terminal is located at an intermediate position among the plurality of terminals of the telepresence video conference system and is further configured to determine the difference between the left-channel and right-channel volumes of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, to transmit the audio signal to the left-screen-side output of the opposite end; and when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, to transmit the audio signal to the right-screen-side output of the opposite end.

8. The system according to claim 6, wherein the MCU is further configured to perform mixing processing on the audio data from the at least two telepresence video conference systems when there are at least two telepresence video conference systems.

9. The system according to claim 8, wherein the MCU is configured to acquire decoded audio data input in the stereo format from the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition processing.

10. The system according to claim 9, wherein the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, wherein the predetermined number of telepresence video conference systems output the audio signals with the highest intensity among the at least two telepresence video conference systems.
PCT/CN2011/078938 2010-10-26 2011-08-25 Method and system for transmitting audio data WO2012055291A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010519953.0 2010-10-26
CN201010519953.0A CN102457700B (en) 2010-10-26 2010-10-26 Audio data transmission method and system

Publications (1)

Publication Number Publication Date
WO2012055291A1 true WO2012055291A1 (en) 2012-05-03

Family

ID=45993150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/078938 WO2012055291A1 (en) 2010-10-26 2011-08-25 Method and system for transmitting audio data

Country Status (2)

Country Link
CN (1) CN102457700B (en)
WO (1) WO2012055291A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087656A (en) * 2017-06-14 2018-12-25 广东亿迅科技有限公司 A kind of multimedia conferencing sound mixing method and device based on MCU

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103634697B (en) * 2012-08-24 2017-09-26 中兴通讯股份有限公司 Net the implementation method of true technology and net true equipment
CN105704423A (en) * 2014-11-24 2016-06-22 中兴通讯股份有限公司 Voice output method and device
CN105469803B (en) * 2015-11-23 2019-04-05 北京华夏电通科技有限公司 Echo cancel method and system in tripartite's telecommunication based on left and right acoustic channels coding
CN112689118B (en) * 2020-12-29 2023-12-08 厦门亿联网络技术股份有限公司 Data transmission method and device for multi-screen network terminal
CN113453114B (en) * 2021-06-30 2023-04-07 Oppo广东移动通信有限公司 Encoding control method, encoding control device, wireless headset and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1779780A (en) * 2005-10-21 2006-05-31 北京中星微电子有限公司 Audio mixing method based on portable terminal
CN1929593A (en) * 2005-09-07 2007-03-14 宝利通公司 Spatially correlated audio in multipoint videoconferencing
CN101427574A (en) * 2006-04-20 2009-05-06 思科技术公司 System and method for providing location specific sound in a telepresence system
CN101426058A (en) * 2008-06-17 2009-05-06 上海华平信息技术股份有限公司 System and method for improving quality of multichannel audio call

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070250567A1 (en) * 2006-04-20 2007-10-25 Graham Philip R System and method for controlling a telepresence system
US8289362B2 (en) * 2007-09-26 2012-10-16 Cisco Technology, Inc. Audio directionality control for a multi-display switched video conferencing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1929593A (en) * 2005-09-07 2007-03-14 宝利通公司 Spatially correlated audio in multipoint videoconferencing
CN1779780A (en) * 2005-10-21 2006-05-31 北京中星微电子有限公司 Audio mixing method based on portable terminal
CN101427574A (en) * 2006-04-20 2009-05-06 思科技术公司 System and method for providing location specific sound in a telepresence system
CN101426058A (en) * 2008-06-17 2009-05-06 上海华平信息技术股份有限公司 System and method for improving quality of multichannel audio call

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087656A (en) * 2017-06-14 2018-12-25 广东亿迅科技有限公司 A kind of multimedia conferencing sound mixing method and device based on MCU
CN109087656B (en) * 2017-06-14 2020-11-17 广东亿迅科技有限公司 Multimedia conference sound mixing method and device based on MCU

Also Published As

Publication number Publication date
CN102457700B (en) 2015-08-12
CN102457700A (en) 2012-05-16

Similar Documents

Publication Publication Date Title
EP2439945B1 (en) Audio panning in a multi-participant video conference
CN101627576B (en) Multipoint conference video switching
JP5534813B2 (en) System, method, and multipoint control apparatus for realizing multilingual conference
US6850496B1 (en) Virtual conference room for voice conferencing
US9113034B2 (en) Method and apparatus for processing audio in video communication
US8600530B2 (en) Method for determining an audio data spatial encoding mode
US8773491B2 (en) Method, apparatus, and system for implementing audio mixing
US9025002B2 (en) Method and apparatus for playing audio of attendant at remote end and remote video conference system
WO2012055291A1 (en) Method and system for transmitting audio data
WO2007082433A1 (en) Apparatus, network device and method for transmitting video-audio signal
US8385234B2 (en) Media stream setup in a group communication system
US20090019112A1 (en) Audio and video conferencing using multicasting
US8442198B2 (en) Distributed multi-party conferencing system
WO2014094461A1 (en) Method, device and system for processing video/audio information in video conference
CN110267064A (en) Audio broadcast state processing method, device, equipment and storage medium
WO2014154065A2 (en) Data transmission method, media acquisition device, video conference terminal and storage medium
TWI451746B (en) Video conference system and video conference method thereof
WO2016082577A1 (en) Video conference processing method and device
US20100268529A1 (en) Voice communication apparatus
US11800017B1 (en) Encoding a subset of audio input for broadcasting conferenced communications
Akoumianakis et al. The MusiNet project: Towards unraveling the full potential of Networked Music Performance systems
JP2001036881A (en) Voice transmission system and voice reproduction device
JP2010093479A (en) Conference system and conference terminal device
CN113079267B (en) Audio conferencing in a room
US11425258B2 (en) Audio conferencing in a room

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11835565

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11835565

Country of ref document: EP

Kind code of ref document: A1