WO2012055291A1 - Method and system for transmitting audio data - Google Patents

Method and system for transmitting audio data

Info

Publication number
WO2012055291A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
telepresence video
telepresence
video conference
audio
Prior art date
Application number
PCT/CN2011/078938
Other languages
French (fr)
Chinese (zh)
Inventor
李靖
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2012055291A1 publication Critical patent/WO2012055291A1/en

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities, audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/567 Multimedia conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Definitions

  • Telepresence is a teleconferencing technology that has emerged in recent years and integrates video communication with the communication experience. It is characterized by life-size images, ultra-high definition and low latency, and it focuses on the effect of real face-to-face communication; its implementation involves the network, communications, the meeting environment, functional applications and other aspects, and what is finally presented to the conference participants is an integrated, realistic communication experience combined with transactional applications.
  • the telepresence system is a comprehensive product solution that, in a globalized economy, helps companies establish communication with customers more quickly and effectively, improving responsiveness and productivity while avoiding the delays and inconvenience caused by travel, thereby accelerating revenue growth and streamlining costs.
  • the telepresence system can be applied to executive meetings, telemedicine, distance education, collaborative office work, remote attendance, remote presentation and many other fields.
  • a telepresence system consists of three terminals, a camera, three microphones and three speakers, as well as three plasma displays and a dedicated conference table.
  • in the old implementation, each terminal is connected to one microphone for sound input and one speaker for sound output. The three positions are called the left, center and right seats: the left microphone, speaker and plasma display are connected to the left terminal, the middle microphone, speaker and plasma display are connected to the middle terminal, and the right microphone, speaker and plasma display are connected to the right terminal.
  • a primary object of the present invention is to provide an audio data transmission method and system to solve at least one of the above problems.
  • an audio data transmission method applied to a telepresence video conference system, comprising: receiving audio data from one terminal in a telepresence video conference system, wherein the audio data is input to the terminal in stereo; and encoding/decoding the audio data and sending it to another telepresence video conference system.
  • the terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and before the audio data is received the method further includes the following steps: the telepresence video conference system determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, the audio signal is output on the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, the audio signal is output on the right-screen side of the opposite end.
  • the mixing process includes the following steps: acquiring the decoded audio data that was input in stereo at the at least two telepresence video conference systems; and converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition.
  • when the audio data input in stereo at the telepresence video conference systems is acquired during the mixing process, the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, the predetermined number of systems being those whose output audio signals are the strongest.
  • an audio data transmission system including a telepresence video conference system and an MCU, wherein the telepresence video conference system includes a plurality of terminals, one of which is an audio transmission terminal configured to receive and output audio signals in stereo; the MCU is configured to receive audio data from the audio transmission terminal in the telepresence video conference system, encode/decode the audio data, and send it to another telepresence video conference system.
  • the audio transmission terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and is further configured to determine the difference between the volumes of the left and right channels of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end.
  • the MCU is further configured to perform mixing on the audio data from at least two telepresence video conference systems when there are at least two such systems.
  • the MCU is configured to obtain the decoded audio data that was input in stereo at the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition.
  • the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals are the strongest among the at least two systems.
  • FIG. 1 is a schematic structural diagram of a telepresence video system according to the prior art;
  • FIG. 2 is a flowchart of an audio data transmission method according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of an audio data transmission method according to a preferred embodiment of the present invention;
  • FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention;
  • FIG. 5 is a flowchart of the decoding in the audio data transmission method according to an example of the present invention;
  • FIG. 6 is a flowchart of the mixing in the audio data transmission method according to an example of the present invention;
  • FIG. 7 is a flowchart of the encoding in the audio data transmission method according to an example of the present invention;
  • FIG. 8 is a flowchart of the sending in the audio data transmission method according to an example of the present invention;
  • FIG. 9 is a structural block diagram of an audio data transmission system according to an embodiment of the present invention;
  • FIG. 10 is a schematic structural diagram of a telepresence video system according to a preferred embodiment of the present invention.
  • BEST MODE FOR CARRYING OUT THE INVENTION
  • the audio data transmission method includes the following processing (step S202 - step S204): Step S202, audio data is received from one terminal in the telepresence video conference system, where the audio data is input to the terminal in stereo. Step S204, the audio data is encoded/decoded and sent to another telepresence video conference system.
  • FIG. 3 is a flowchart of an audio data transmission method in accordance with a preferred embodiment of the present invention. As shown in FIG. 3,
  • the audio data transmission method includes the following processing (step S302 - step S310): Step S302, the telepresence video conference system first determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end for output.
  • in this way, locating the speaker by sound can truly be achieved, because the terminal that receives the audio data is located in the middle of the telepresence video conference system and the stereo audio format includes a left and a right channel. When a person on the left side of the terminal speaks, the volume picked up by the left channel is necessarily greater than that picked up by the right channel; likewise, when a person on the right side of the terminal speaks, the volume picked up by the right channel is necessarily greater than that picked up by the left channel; and when a person in the middle speaks, the volumes picked up by the left and right channels are roughly equal. Accordingly, the relationship between the volumes picked up by the two channels can be used to determine from which position a person is speaking.
  • however, in practice the speaker in the middle position is only approximately in the middle, so the volumes picked up by the left and right channels always differ somewhat, and simply comparing the left- and right-channel volumes to decide which side a speaker is on may lead to misjudgment.
  • to solve this problem, two thresholds are introduced: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, it is determined that a person on the left is speaking and the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, it is determined that a person on the right is speaking and the audio signal is transmitted to the right-screen side of the opposite end. In this way it is possible to determine more accurately where the speaker is located.
  • the values of the first threshold and the second threshold need to be determined according to factors such as the relative positions of the speakers and the microphones and the level of environmental noise, and different values are used in different situations. The above judgment is actually for the case of three speakers; when there are more speakers, the threshold-based determination still applies. A minimal sketch of the basic three-way decision is given below.
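As a concrete illustration of the threshold comparison described above, the following Python sketch shows one way the left/middle/right decision could be made from per-channel volume estimates. It is only a minimal sketch under assumed names (`frame_rms`, `locate_speaker`, and the threshold values are illustrative), not the patent's actual implementation.

```python
import numpy as np

def frame_rms(samples: np.ndarray) -> float:
    """Root-mean-square level of one channel of a PCM frame."""
    return float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))

def locate_speaker(left: np.ndarray, right: np.ndarray,
                   first_threshold: float, second_threshold: float) -> str:
    """Decide which screen side of the far end should output the signal.

    left/right: PCM samples of a stereo frame captured by the middle terminal.
    first_threshold / second_threshold: volume-difference thresholds; per the
    patent, their values depend on speaker/microphone geometry and noise level.
    """
    left_vol = frame_rms(left)
    right_vol = frame_rms(right)
    if left_vol - right_vol > first_threshold:
        return "left-screen side"    # a person on the left of the terminal is speaking
    if right_vol - left_vol > second_threshold:
        return "right-screen side"   # a person on the right is speaking
    return "middle screen"           # volumes roughly equal: middle speaker
```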
  • Step S304 receiving audio data from a terminal in the telepresence video conference system, wherein the audio data is input to the terminal in a stereo manner.
  • Step S306 performing decoding processing on the audio data.
  • Step S308, performing sound mixing processing on audio data from at least two telepresence video conference systems.
  • a telepresence video conference system participating in the conference needs to receive, at the same time, the sound of all the telepresence video conference systems participating in the conference other than the local end (that is, the local telepresence video conference system).
  • the mixing process in the above step S308 may further include the following processing:
  • the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals have the highest intensity among the at least two telepresence video conference systems.
  • if the sounds of all participating systems were simply mixed together, the resulting sound would be difficult to distinguish; the more telepresence video conference systems participate in the video conference, the more serious this phenomenon becomes, until in the end nothing can be heard clearly.
  • step S310, the mixed audio data is encoded and sent to another telepresence video conference system.
  • FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention. The above preferred embodiment will be described in detail below with reference to FIG. 4. As shown in FIG. 4,
  • the audio data transmission method includes the following processing: Step S402, a message from the upper end of the terminal is received, and the audio format of the terminal is obtained from it. Step S404, it is determined whether the format is stereo. Step S406, audio data is obtained from the network access board and stored in a temporary buffer, and whether a complete audio packet has been obtained is determined from the identifiers of the packet header and trailer; if a complete audio packet has been acquired, the data is copied from the temporary buffer into the pre-decoding data buffer. If the audio format of the terminal is stereo, one read of audio data from the network access board is stored in a temporary buffer, the first half of the temporary buffer is stored in the left-channel pre-decoding data buffer and the second half in the right-channel pre-decoding data buffer, as illustrated in the sketch below.
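A minimal sketch of the buffering step described in step S406, assuming the packet payload simply carries the left-channel bytes in its first half and the right-channel bytes in its second half; the helper names are illustrative, not taken from the patent.

```python
def split_stereo_payload(payload: bytes):
    """Split a complete stereo audio packet payload into pre-decode buffers.

    The first half of the temporary buffer goes to the left-channel
    pre-decoding buffer and the second half to the right-channel one.
    """
    half = len(payload) // 2
    left_predecode = bytearray(payload[:half])
    right_predecode = bytearray(payload[half:])
    return left_predecode, right_predecode

def split_mono_payload(payload: bytes) -> bytearray:
    """For a non-stereo terminal the whole payload feeds one channel's buffer."""
    return bytearray(payload)
```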
  • Step S408 if there is sufficient audio data in the data buffer before decoding, left channel decoding and right channel decoding are respectively performed, and the decoded data is respectively stored in the decoded data buffer of the left and right channels. Description will be made below with reference to FIG. 5. As shown in FIG. 5,
  • the decoding process may further include the following processing (step S502 - step S506): Step S502, it is determined whether the terminal's pre-decoding data buffer contains enough audio data for one decoding pass; if so, step S504 is performed, otherwise step S506 is performed. Step S504, one frame of data is taken from the pre-decoding buffer, a decoder is selected according to the audio format, and the decoded data is stored in the decoded buffer; if the audio format of the terminal is stereo, data is taken from the left- and right-channel pre-decoding data buffers respectively, left-channel decoding and right-channel decoding are performed, and the decoded data is stored in the decoded data buffers of the left and right channels.
  • if the audio format of the terminal is not stereo, data is taken from the left-channel pre-decoding data buffer, left-channel decoding is performed, and the decoded data is stored in the left channel's decoded data buffer.
  • Step S506, if there is not enough data for one decoding pass, one silence frame is written to the decoded data buffer.
  • in a specific implementation, if the audio format of the terminal is not stereo, data is taken from the left-channel (or right-channel) pre-decoding data buffer, left-channel (or right-channel) decoding is performed, and the decoded data is stored in the decoded data buffer of the left channel (or right channel).
  • Step S410, data is taken from the decoded buffer for mixing. This is described below with reference to FIG. 6. As shown in FIG. 6,
  • the mixing process mainly includes the following processing (step S602 - step S608): Step S602, it is determined whether there is one frame of data in the decoded buffer; if so, step S604 is performed, otherwise the process ends. Step S604, one frame of data is taken from the decoded buffer, the sampling rates of all the ends in the conference are converted to 32 kHz, and mixing is then performed.
  • the basic principles of mixing are: a. no end hears itself; b. a predetermined number (for example, three) of the loudest ends among all the ends are selected and mixed (superimposed), as illustrated in the sketch below.
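The following sketch illustrates the two mixing rules just listed (no end hears itself; only a predetermined number of the loudest ends are superimposed) together with the 32 kHz sample-rate alignment mentioned in step S604. Function and variable names are assumptions for illustration; a real system would use a proper resampler and saturation handling rather than this simplified version.

```python
import numpy as np

TARGET_RATE = 32000  # all conference ends are aligned to 32 kHz before mixing
MIX_COUNT = 3        # predetermined number of loudest ends that take part

def resample(frame: np.ndarray, src_rate: int) -> np.ndarray:
    """Crude linear-interpolation resample to TARGET_RATE (illustrative only)."""
    if src_rate == TARGET_RATE:
        return frame.astype(np.float64)
    n_out = int(round(len(frame) * TARGET_RATE / src_rate))
    x_old = np.linspace(0.0, 1.0, num=len(frame), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, frame.astype(np.float64))

def mix_for_end(listener: str, frames: dict, rates: dict) -> np.ndarray:
    """Mix one output frame for `listener`.

    frames: end name -> decoded frame of samples for that end
    rates:  end name -> sample rate of that frame
    """
    # Rule a: the listener never hears itself.
    others = {end: resample(f, rates[end])
              for end, f in frames.items() if end != listener}
    # Rule b: keep only the MIX_COUNT loudest ends.
    loudest = sorted(others,
                     key=lambda end: np.sqrt(np.mean(others[end] ** 2)),
                     reverse=True)[:MIX_COUNT]
    if not loudest:
        return np.zeros(0)
    length = min(len(others[end]) for end in loudest)
    mixed = np.zeros(length)
    for end in loudest:
        mixed += others[end][:length]     # superposition (overlay)
    return np.clip(mixed, -32768, 32767)  # keep within 16-bit PCM range
```

For instance, with four ends A, B, C and D and MIX_COUNT = 3, the frame mixed for A is the superposition of B, C and D, matching the example given later in the description.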
  • Step S606, the converted data is processed: the data of the left and right channels is processed differently depending on whether the mix is directed to the left, middle or right screen.
  • step S608, the mixing is performed and the data is stored in the pre-encoding buffer. If the ends include both mono and stereo terminals, the data of the stereo left and right channels is first processed as needed and then mixed; after mixing, a mono end obtains mono data and a stereo end obtains two channels of data, and the mixed data is stored in the pre-encoding data buffer.
  • in a specific implementation, if the audio format of the terminal is not stereo, data is taken from the left-channel (or right-channel) pre-encoding data buffer, left-channel (or right-channel) encoding is performed, and the encoded data is stored in the encoded data buffer of the left channel (or right channel).
  • Step S412, if the pre-encoding data buffer has enough data and the audio format of the terminal is stereo, left-channel encoding and right-channel encoding are performed separately, and the encoded data is stored in the encoded data buffers of the left and right channels respectively.
  • the encoding process mainly includes the following processing (step S702 - step S706): Step S702, it is determined whether there is one frame of data in the pre-encoding buffer. Step S704, if the terminal's pre-encoding data buffer contains enough data for one encoding pass (for example, one frame of data), an encoder is selected according to the audio format and the encoded data is stored in the encoded buffer; if the audio format of the terminal is stereo, data is taken from the left- and right-channel pre-encoding data buffers, left-channel encoding and right-channel encoding are performed, and the encoded data is stored in the encoded data buffers of the left and right channels respectively.
  • if the audio format of the terminal is not stereo, data is taken from the left-channel pre-encoding data buffer, left-channel encoding is performed, and the encoded data is stored in the left channel's encoded data buffer.
  • Step S706, if there is not enough data for one encoding pass, one silence frame is written to the encoded data buffer. A sketch of this per-channel decision is given below.
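A compact sketch of the per-channel encoding decision in steps S702-S706, under the assumption of a generic frame encoder and a fixed frame size; all names and sizes are illustrative and not taken from the patent.

```python
FRAME_SAMPLES = 640                            # assumed samples per frame
SILENCE_FRAME = b"\x00" * 2 * FRAME_SAMPLES    # one frame of 16-bit PCM silence

def encode_channel(pre_encode: bytearray, encode_frame, encoded: list) -> None:
    """Encode one channel if a full frame is available, else emit a silence frame.

    pre_encode:   the channel's pre-encoding data buffer (16-bit PCM bytes)
    encode_frame: encoder chosen according to the terminal's audio format
    encoded:      the channel's encoded data buffer
    """
    frame_bytes = 2 * FRAME_SAMPLES
    if len(pre_encode) >= frame_bytes:          # step S704: enough data to encode
        frame = bytes(pre_encode[:frame_bytes])
        del pre_encode[:frame_bytes]
        encoded.append(encode_frame(frame))
    else:                                       # step S706: write a silence frame
        encoded.append(encode_frame(SILENCE_FRAME))

# For a stereo terminal this is called once for the left channel and once for
# the right; for a mono terminal only the left-channel buffers are used.
```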
  • Step S802, it is determined whether there is one frame of data in the encoded buffer;
  • step S804, if the encoded data buffer contains one frame of data, the data is taken from the encoded data buffer and packed, and the packet header and trailer, the audio format and the packet length are written into the packet. If the audio format of the terminal is stereo, the data of the left and right channels is packed into one packet (the left-channel data in the first half and the right-channel data in the second half). The whole packet is then sent to the network access board and finally delivered by the network access board to the terminal for playback. A sketch of this packet layout follows.
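The packet layout described in steps S802-S804 (header, audio format, packet length, left-channel data in the first half, right-channel data in the second half, trailer) could look like the sketch below; the field sizes and marker bytes are assumptions for illustration only, not values specified by the patent.

```python
import struct

HEADER = b"\xAA\x55"   # assumed start-of-packet marker
TRAILER = b"\x55\xAA"  # assumed end-of-packet marker

def pack_stereo(left: bytes, right: bytes, audio_format: int) -> bytes:
    """Pack encoded left/right channel data into one packet.

    Left-channel data occupies the first half of the payload and right-channel
    data the second half, as described in step S804.
    """
    payload = left + right
    # header | format (1 byte) | payload length (2 bytes) | payload | trailer
    return HEADER + struct.pack("!BH", audio_format, len(payload)) + payload + TRAILER

def pack_mono(channel: bytes, audio_format: int) -> bytes:
    """For a non-stereo terminal only one channel's data is packed."""
    return HEADER + struct.pack("!BH", audio_format, len(channel)) + channel + TRAILER
```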
  • the audio data transmission system includes a telepresence video conference system 90 and an MCU 92; the telepresence video conference system is described below with reference to FIG. 10.
  • the telepresence video conference system 90 includes a plurality of terminals, only one of which is an audio transmission terminal, configured to receive and output audio signals in stereo.
  • the MCU 92 is configured to receive audio data from the audio transmission terminal in the telepresence video conference system 90, encode and decode the audio data, and send the audio data to another telepresence video conference system 94.
  • the telepresence video conference system 90 has the same structure as the telepresence video conference system 94.
  • as shown in FIG. 10, the above telepresence video conference system uses only one of its plurality of terminals as the audio transmission terminal, which removes the root cause of the accent problem and solves it completely.
  • the audio transmission terminal includes a left and a right microphone and a left and a right speaker, and uses the stereo format to receive and output sound, which lays a good foundation for achieving the effect of locating the speaker by sound.
  • the audio transmission terminal is located at an intermediate position among the plurality of terminals of the telepresence video conference system (either roughly in the middle or exactly in the middle), and may also be configured to determine the difference between the volumes of the left and right channels of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end. In this way, locating the speaker by sound can be achieved.
  • the MCU is further configured to perform mixing on the audio data from the at least two telepresence video conference systems when there are at least two such systems.
  • the MCU's mixing handles the case in which several telepresence video conference systems participating in the video conference produce audio data at the same time, ensuring that each participating system can receive the audio data of every other participating system simultaneously, which guarantees that the conference is real-time and accurate.
  • the MCU may be further configured to obtain the decoded audio data input from the at least two telepresence video conferencing systems in a stereo format, and after converting the sampling rate of the obtained audio data into a predetermined frequency, perform superposition processing. Since the present invention adopts audio data in stereo format, the MCU needs to ensure that the audio data format of each conference end is in stereo format.
  • the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals are the strongest among the at least two telepresence video conference systems.
  • a predetermined number of ends with the strongest audio signals is selected for mixing, and the specific number may be chosen according to the circumstances.
  • in this way, the accent problem in the telepresence high-definition video conference system is solved, the sound quality is improved considerably, and the effect of locating the speaker by sound is truly achieved.
  • the modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented by program code executable by the computing device, so that they may be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from that described here, or they may be made into individual integrated circuit modules separately, or multiple modules or steps among them may be made into a single integrated circuit module.
  • the invention is not limited to any specific combination of hardware and software.
  • the above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, various modifications and changes can be made to the present invention. Any modifications, equivalent substitutions, improvements and the like made within the spirit and scope of the present invention are intended to be included within the protection scope of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are a method and a system for transmitting audio data. The method for transmitting audio data comprises: receiving audio data from a terminal in a telepresence video conference system, wherein the audio data is input to the terminal in a stereo manner; and performing encoding and decoding on the audio data, and sending the audio data to another telepresence video conference system. Through technical solutions of the present invention, the sound overlapping problem in the telepresence video conference system in the prior art is solved, and effects of improving the sound and improving the user experience are achieved.

Description

Method and system for transmitting audio data. TECHNICAL FIELD The present invention relates to the field of communications, and in particular to an audio data transmission method and system. BACKGROUND OF THE INVENTION Telepresence is a teleconferencing technology that has emerged in recent years and integrates video communication with the communication experience. It is characterized by life-size images, ultra-high definition and low latency, and it focuses on the effect of real face-to-face communication; its implementation involves the network, communications, the meeting environment, functional applications and other aspects, and what is finally presented to the conference participants is an integrated, realistic communication experience combined with transactional applications. The telepresence system is a comprehensive product solution that, in a globalized economy, helps companies establish communication with customers more quickly and effectively, improving responsiveness and productivity while avoiding the delays and inconvenience caused by travel, thereby accelerating revenue growth and streamlining costs. The telepresence system can be applied to executive meetings, telemedicine, distance education, collaborative office work, remote attendance, remote presentation and many other fields. A telepresence system consists of three terminals, a camera, three microphones and three speakers, as well as three plasma displays and a dedicated conference table. In the old implementation, each terminal is connected to one microphone for sound input and one speaker for sound output. The three positions are called the left, center and right seats: the left microphone, speaker and plasma display are connected to the left terminal, the middle microphone, speaker and plasma display are connected to the middle terminal, and the right microphone, speaker and plasma display are connected to the right terminal. In this way, the sound and the image come out together from the left, the right or the middle, achieving the effect of locating the speaker by sound, as shown in Figure 1. However, this implementation has a fatal drawback: when one person speaks, the voice may be picked up by all three microphones simultaneously, enter the three terminals and, after mixing, be delivered to the left, middle and right terminals of the remote telepresence conference room and output from the left, center and right speakers. Because the network speeds of the three terminals differ, the sound arrives at different times, so the same sentence comes out of the three terminals' speakers three times, like an accent, which seriously degrades the listening experience and makes it very difficult to locate the speaker by sound. SUMMARY OF THE INVENTION A primary object of the present invention is to provide an audio data transmission method and system to solve at least one of the above problems.
According to an aspect of the present invention, an audio data transmission method applied to a telepresence video conference system is provided, comprising: receiving audio data from one terminal in a telepresence video conference system, wherein the audio data is input to the terminal in stereo; and encoding/decoding the audio data and sending it to another telepresence video conference system. In the above method, the terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and before the audio data is received the method further includes the following steps: the telepresence video conference system determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, the audio signal is transmitted to the right-screen side of the opposite end for output. In the above method, when there are at least two telepresence video conference systems, the following step is further included after the decoding processing and before the encoding processing: mixing the audio data from the at least two telepresence video conference systems. In the above method, the mixing includes the following steps: acquiring the decoded audio data that was input in stereo at the at least two telepresence video conference systems; and converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition. In the above method, when the audio data input in stereo at the telepresence video conference systems is acquired during the mixing, the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, the predetermined number of systems being those whose output audio signals are the strongest among the at least two systems. According to another aspect of the present invention, an audio data transmission system is provided, including a telepresence video conference system and an MCU, wherein the telepresence video conference system includes a plurality of terminals, one of which is an audio transmission terminal configured to receive and output audio signals in stereo; the MCU is configured to receive audio data from the audio transmission terminal in the telepresence video conference system, encode/decode the audio data, and send it to another telepresence video conference system. In the above system, the audio transmission terminal is located in the middle of the plurality of terminals of the telepresence video conference system, and is further configured to determine the difference between the volumes of the left and right channels of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end.
In the above system, the MCU is further configured to perform mixing on the audio data from the at least two telepresence video conference systems when there are at least two such systems. In the above system, the MCU is configured to obtain the decoded audio data that was input in stereo at the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition. In the above system, the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals are the strongest among the at least two systems. Through the present invention, the scheme of transmitting audio signals using only one terminal of the telepresence video conference system solves the accent problem, thereby improving the sound quality and the user experience. BRIEF DESCRIPTION OF THE DRAWINGS The accompanying drawings described here are provided for a further understanding of the present invention and form a part of this application; the exemplary embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings: FIG. 1 is a schematic structural diagram of a telepresence video system according to the prior art; FIG. 2 is a flowchart of an audio data transmission method according to an embodiment of the present invention; FIG. 3 is a flowchart of an audio data transmission method according to a preferred embodiment of the present invention; FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention; FIG. 5 is a flowchart of the decoding in the audio data transmission method according to an example of the present invention; FIG. 6 is a flowchart of the mixing in the audio data transmission method according to an example of the present invention; FIG. 7 is a flowchart of the encoding in the audio data transmission method according to an example of the present invention; FIG. 8 is a flowchart of the sending in the audio data transmission method according to an example of the present invention; FIG. 9 is a structural block diagram of an audio data transmission system according to an embodiment of the present invention; FIG. 10 is a schematic structural diagram of a telepresence video system according to a preferred embodiment of the present invention. BEST MODE FOR CARRYING OUT THE INVENTION Hereinafter, the present invention will be described in detail with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that the embodiments in this application and the features in the embodiments may be combined with each other where they do not conflict. FIG. 2 is a flowchart of an audio data transmission method in accordance with an embodiment of the present invention. As shown in FIG. 2, the audio data transmission method according to the embodiment of the present invention includes the following processing (step S202 - step S204): Step S202, audio data is received from one terminal in the telepresence video conference system, where the audio data is input to the terminal in stereo. Step S204, the audio data is encoded/decoded and sent to another telepresence video conference system. Through the above method, only one terminal of the telepresence video conference system receives the audio data, thereby eliminating the root cause of the accent phenomenon.
At the same time, the application of the stereo input mode in the telepresence video conference system lays a good foundation for truly achieving the effect of locating the speaker by sound. FIG. 3 is a flowchart of an audio data transmission method in accordance with a preferred embodiment of the present invention. As shown in FIG. 3, the audio data transmission method includes the following processing (step S302 - step S310): Step S302, the telepresence video conference system first determines the difference between the volumes of the left and right channels of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, the audio signal is transmitted to the right-screen side of the opposite end for output. Through this method, locating the speaker by sound can truly be achieved, because the terminal that receives the audio data is located in the middle of the telepresence video conference system and the stereo audio format includes a left and a right channel. When a person on the left side of the terminal speaks, the volume picked up by the left channel is necessarily greater than that picked up by the right channel; likewise, when a person on the right side of the terminal speaks, the volume picked up by the right channel is necessarily greater than that picked up by the left channel; and when a person in the middle speaks, the volumes picked up by the left and right channels are roughly equal. Accordingly, the relationship between the volumes picked up by the two channels can be used to determine from which position a person is speaking. However, in practice the speaker in the middle position is only approximately in the middle, so the volumes picked up by the left and right channels always differ somewhat, and simply comparing the left- and right-channel volumes to decide which side a speaker is on may lead to misjudgment. To solve this problem, two thresholds are introduced: when the difference between the left-channel volume and the right-channel volume is greater than the first threshold, it is determined that a person on the left is speaking and the audio signal is transmitted to the left-screen side of the opposite end for output; when the difference between the right-channel volume and the left-channel volume is greater than the second threshold, it is determined that a person on the right is speaking and the audio signal is transmitted to the right-screen side of the opposite end for output. In this way it is possible to determine more accurately where the speaker is located. The values of the first threshold and the second threshold need to be determined according to factors such as the relative positions of the speakers and the microphones and the level of environmental noise, and different values are used in different situations. The above judgment is actually for the case of three speakers; when there are more speakers, the threshold-based determination method is still applicable.
It is only necessary to introduce more thresholds so as to define several difference segments: for example, when the difference between the left-channel volume and the right-channel volume is greater than the first threshold and less than a third threshold, it is determined that the first person on the left is speaking, and when that difference is greater than the third threshold and less than a fifth threshold, it is determined that the second person on the left is speaking; the positions of speakers on the right can be judged in the same way. From the above analysis it can further be inferred that even if the terminal receiving the audio data is not in the middle position, the speaker's position can still be determined by adjusting the threshold settings, although this approach is more cumbersome, harder to implement and less reliable. Step S304, audio data is received from one terminal in the telepresence video conference system, where the audio data is input to the terminal in stereo. Step S306, the audio data is decoded. Step S308, the audio data from at least two telepresence video conference systems is mixed. Generally, in a telepresence video conference, a participating telepresence video conference system needs to receive, at the same time, the sound of all the participating systems other than the local end (that is, the local telepresence video conference system); therefore, when people speak at two or more telepresence video conference systems at the same time, these sounds must be mixed so that every system can hear them simultaneously. Preferably, the mixing in the above step S308 may further include the following processing:
(1) acquiring the decoded audio data that was input in stereo at the at least two telepresence video conference systems;
(2) converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition. Through the above steps the mixing process is completed, and at the same time the problem that some ends participating in the video conference (hereinafter simply referred to as conference ends) do not use the stereo format is solved: since the audio data used in the present invention is in stereo format, if a conference end's audio data format is not stereo, its audio data must first be converted to stereo before taking part in the mixing. Preferably, during the mixing, the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals have the highest intensity among the at least two systems. When there are many participants in the conference, mixing the sounds of all the participating telepresence video conference systems would make the sound difficult to distinguish; the more systems participate in the video conference, the more serious this phenomenon becomes, until in the end nothing can be heard clearly. Therefore, to avoid this situation, when the number of participating telepresence video conference systems is large, a predetermined number of ends is selected for mixing according to which audio data has the highest signal strength, rather than mixing all the ends; this number can be chosen according to the specific situation. Step S310, the mixed audio data is encoded and sent to another telepresence video conference system. FIG. 4 is a flowchart of an audio data transmission method according to an example of the present invention. The above preferred embodiment is described in detail below with reference to FIG. 4. As shown in FIG. 4, the audio data transmission method according to an example of the present invention includes the following processing: Step S402, a message from the upper end of the terminal is received, and the audio format of the terminal is obtained from that message. Step S404, it is determined whether the format is stereo. Step S406, audio data is obtained from the network access board and stored in a temporary buffer, and whether a complete audio packet has been obtained is determined from the identifiers of the packet header and trailer; if a complete audio packet has been acquired, the data is copied from the temporary buffer to the pre-decoding data buffer. If the audio format of the terminal is stereo, one read of audio data obtained from the network access board is stored in a temporary buffer, the first half of the temporary buffer is stored in the left-channel pre-decoding data buffer, and the second half is stored in the right-channel pre-decoding data buffer.
In a specific implementation, if the audio format of the terminal is not stereo, data is taken from the temporary buffer and stored only in the left-channel (or right-channel) pre-decoding data buffer. Step S408, if there is enough audio data in the pre-decoding data buffer, left-channel decoding and right-channel decoding are performed separately, and the decoded data is stored in the decoded data buffers of the left and right channels respectively. This is described below with reference to FIG. 5. As shown in FIG. 5, the decoding process may further include the following processing (step S502 - step S506): Step S502, it is determined whether the terminal's pre-decoding data buffer contains enough audio data for one decoding pass; if so, step S504 is performed, otherwise step S506 is performed. Step S504, one frame of data is taken from the pre-decoding buffer, a decoder is selected according to the audio format for decoding, and the decoded data is stored in the decoded buffer. If the audio format of the terminal is stereo, data is taken from the left- and right-channel pre-decoding data buffers respectively, left-channel decoding and right-channel decoding are performed, and the decoded data is stored in the decoded data buffers of the left and right channels. If the audio format of the terminal is not stereo, data is taken from the left-channel pre-decoding data buffer, left-channel decoding is performed, and the decoded data is stored in the left channel's decoded data buffer. Step S506, if there is not enough data for one decoding pass, one silence frame is written to the decoded data buffer. In a specific implementation, if the audio format of the terminal is not stereo, data is taken from the left-channel (or right-channel) pre-decoding data buffer, left-channel (or right-channel) decoding is performed, and the decoded data is stored in the decoded data buffer of the left channel (or right channel). Step S410, data is taken from the decoded buffer for mixing. This is described below with reference to FIG. 6. As shown in FIG. 6, the mixing process mainly includes the following processing (step S602 - step S608): Step S602, it is determined whether there is one frame of data in the decoded buffer; if so, step S604 is performed, otherwise the process ends. Step S604, one frame of data is taken from the decoded buffer, the sampling rates of all the ends in the conference are converted to 32 kHz, and mixing is then performed. The basic principles of mixing are: a. no end hears itself; b. a predetermined number (for example, three) of the loudest ends among all the ends are selected and mixed (superimposed). For example, if there are four ends A, B, C and D in the conference, then A hears B+C+D, B hears A+C+D, C hears A+B+D, and D hears A+B+C. Step S606, the converted data is processed: the data of the left and right channels is processed differently depending on whether the mix is directed to the left, middle or right screen. Step S608, the mixing is performed and the data is stored in the pre-encoding buffer. If the ends include both mono and stereo terminals, the data of the stereo left and right channels is first processed as needed and then mixed; after mixing, a mono end obtains mono data and a stereo end obtains two channels of data, and the mixed data is stored in the pre-encoding data buffer.
In the case of a normal stereo mix, the left channels of all the terminals are mixed together and the right channels of all the terminals are mixed together. If a stereo terminal is mixed to the left screen, the right channel of that terminal is set to 0 and left channel = (left channel + right channel) / 2. If a stereo terminal is mixed to the right screen, the left channel is set to 0 and right channel = (left channel + right channel) / 2. If a stereo terminal is mixed to the middle screen, then left channel = 0.707 × (left channel + right channel) / 2 and right channel = 0.707 × (left channel + right channel) / 2. After this processing, all left channels are mixed together and all right channels are mixed together, and the mixed data is stored in the pre-encode data buffer. If the terminals in the conference include both mono and stereo terminals, the left-channel and right-channel data of the stereo terminals is processed as needed and then mixed. If a mono terminal is mixed to the left screen, that end is first expanded to stereo with left channel = mono data and right channel = 0, and then the left channels are mixed together and the right channels are mixed together. If a mono terminal is mixed to the right screen, that end is first expanded to stereo with right channel = mono data and left channel = 0, and then the left channels are mixed together and the right channels are mixed together. If a mono terminal is mixed to the middle screen, that end is first expanded to stereo with left channel = 0.707 × (mono data) and right channel = 0.707 × (mono data), and then the left channels are mixed together and the right channels are mixed together. After mixing, a mono terminal obtains mono data and a stereo terminal obtains two channels of data. The mixed data is stored in the pre-encode data buffer (these panning rules are sketched in the code example after this section). In a specific implementation, if the audio format of the terminal is not the stereo format, data is taken from the left-channel (or right-channel) pre-encode data buffer, left-channel (or right-channel) encoding is performed, and the encoded data is stored in the left-channel (or right-channel) encoded data buffer. Step S412: If the pre-encode data buffer has sufficient data and the audio format of the terminal is the stereo format, left-channel encoding and right-channel encoding are performed respectively, and the encoded data is stored in the encoded data buffers of the left and right channels. This is described below with reference to FIG. 7. As shown in FIG. 7, the encoding process mainly includes the following processing (step S702 to step S706): Step S702: Determine whether there is one frame of data in the pre-encode buffer. Step S704: If the pre-encode data buffer of the terminal holds enough data for one encoding pass (for example, one frame of data), select an encoder according to the audio format and store the encoded data in the encoded buffer. If the audio format of the terminal is the stereo format, data is taken from the left-channel and right-channel pre-encode data buffers, left-channel encoding and right-channel encoding are performed, and the encoded data is stored in the encoded data buffers of the left and right channels. If the audio format of the terminal is not stereo, data is taken from the left-channel pre-encode data buffer, left-channel encoding is performed, and the encoded data is stored in the left-channel encoded data buffer. Step S706: If there is not enough data for one encoding pass, write one silence frame to the encoded data buffer.
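The screen-position formulas above (mix to the left, middle, or right screen) can be sketched as follows; the sample type, enum and function names are illustrative assumptions, and 0.707 is the equal-power (about -3 dB) factor stated above.

```c
/* Sketch of the per-terminal pre-processing applied before summation.      */
#include <stdint.h>

typedef enum { SCREEN_LEFT, SCREEN_MIDDLE, SCREEN_RIGHT } ScreenPos;

/* Map one stereo sample pair onto the screen position of its terminal. */
static void pan_stereo(int16_t *l, int16_t *r, ScreenPos pos)
{
    int32_t sum = (int32_t)*l + (int32_t)*r;
    switch (pos) {
    case SCREEN_LEFT:   *l = (int16_t)(sum / 2);           *r = 0; break;
    case SCREEN_RIGHT:  *r = (int16_t)(sum / 2);           *l = 0; break;
    case SCREEN_MIDDLE: *l = (int16_t)(0.707 * sum / 2);
                        *r = (int16_t)(0.707 * sum / 2);           break;
    }
}

/* Expand one mono sample onto the screen position of its terminal. */
static void pan_mono(int16_t mono, int16_t *l, int16_t *r, ScreenPos pos)
{
    switch (pos) {
    case SCREEN_LEFT:   *l = mono;                     *r = 0;     break;
    case SCREEN_RIGHT:  *r = mono;                     *l = 0;     break;
    case SCREEN_MIDDLE: *l = (int16_t)(0.707 * mono);
                        *r = (int16_t)(0.707 * mono);              break;
    }
}
```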
Step S414: If there is enough data in the encoded data buffer, the data of the left and right channels is packed into one packet (the left-channel data is placed in the first half and the right-channel data in the second half), the entire packet is sent to the network access board, and the network access board finally sends it to the terminal for playback. This is described below with reference to FIG. 8. As shown in FIG. 8, the packetizing and transmitting process mainly includes the following processing (step S802 to step S804): Step S802: Determine whether there is one frame of data in the encoded buffer. Step S804: If the encoded data buffer holds one frame of data, take the data from the encoded data buffer for packing, and write the packet header and packet tail, as well as the audio format and the packet length, into the packet. If the audio format of the terminal is stereo, the data of the left and right channels is packed into one packet (the left-channel data in the first half and the right-channel data in the second half); the entire packet is then sent to the network access board and finally forwarded by the network access board to the terminal for playing. If the audio format of the terminal is not the stereo format, only the left-channel (or right-channel) data is packed into a packet, which is then sent to the network access board and finally forwarded by the network access board to the terminal for playing (a minimal packing sketch is given after the description of FIG. 9 and FIG. 10 below). FIG. 9 is a block diagram showing the structure of an audio data transmission system according to an embodiment of the present invention. As shown in FIG. 9, the audio data transmission system includes a telepresence video conference system 90 and an MCU 92; the telepresence video conference system is described below with reference to FIG. 10. As shown in FIG. 10, the telepresence video conference system 90 includes a plurality of terminals, only one of which is an audio transmission terminal, configured to receive and output audio signals in stereo.
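The packing of step S804 (header carrying the audio format and packet length, left-channel data in the first half of the payload, right-channel data in the second half) can be sketched as follows; the header byte layout and identifier value are illustrative assumptions.

```c
/* Minimal packing sketch for step S804. For a mono terminal, right_len
 * would simply be 0 so that only one channel is packed.                    */
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Returns the total packet length written into 'out'. */
static size_t pack_stereo_frame(const uint8_t *left, size_t left_len,
                                const uint8_t *right, size_t right_len,
                                uint8_t audio_format, uint8_t *out)
{
    size_t pos = 0;
    uint16_t payload_len = (uint16_t)(left_len + right_len);

    /* Illustrative header: identifier, audio format, payload length. */
    out[pos++] = 0xA5;                        /* packet-header identifier  */
    out[pos++] = 0x5A;
    out[pos++] = audio_format;                /* e.g. 1 = stereo, 0 = mono */
    out[pos++] = (uint8_t)(payload_len >> 8);
    out[pos++] = (uint8_t)(payload_len & 0xFF);

    memcpy(out + pos, left, left_len);        /* left channel: first half  */
    pos += left_len;
    memcpy(out + pos, right, right_len);      /* right channel: second half */
    pos += right_len;
    return pos;
}
```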
The MCU 92 is configured to receive audio data from the audio transmission terminal in the telepresence video conference system 90, perform codec processing on the audio data, and send it to another telepresence video conference system 94. The telepresence video conference system 94 has the same structure as the telepresence video conference system 90; for details, refer to FIG. 10. As can be clearly seen from FIG. 10, the above telepresence video conference system uses only one of its plurality of terminals as the audio transmission terminal, which removes the root cause of the doubled-sound problem and solves it completely. At the same time, the audio transmission terminal includes left and right microphones and left and right loudspeakers, so it in fact receives and outputs sound in the stereo format, which lays a good foundation for genuinely locating a speaker by sound. Preferably, as shown in FIG. 10, the audio transmission terminal is located at an intermediate position among the plurality of terminals of the telepresence video conference system (at or near the exact middle). It may also be configured to determine the difference between the left-channel and right-channel volumes of the audio signal: when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, the audio signal is transmitted to the left-screen-side output of the opposite end; when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, the audio signal is transmitted to the right-screen-side output of the opposite end. In this way sound-source localization is truly achieved; the reason is described in detail in the method embodiment above and is not repeated here. Preferably, the MCU is further configured to perform mixing processing on the audio data from at least two telepresence video conference systems when there are at least two telepresence video conference systems.
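The threshold rule above (route the signal to the left- or right-screen side of the opposite end according to the left/right volume difference) can be sketched as follows; the level measure, function names and threshold values are illustrative assumptions.

```c
/* Sketch of the localisation decision: compare per-frame left and right
 * channel levels and decide which screen side of the far end should play
 * the signal.                                                              */
#include <stdint.h>
#include <stdlib.h>

typedef enum { RENDER_CENTER, RENDER_LEFT_SCREEN, RENDER_RIGHT_SCREEN } RenderSide;

static int32_t frame_level(const int16_t *s, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += abs(s[i]);
    return (int32_t)(acc / n);      /* mean absolute level, illustrative */
}

static RenderSide pick_side(const int16_t *left, const int16_t *right, int n,
                            int32_t thresh_left, int32_t thresh_right)
{
    int32_t lv = frame_level(left, n);
    int32_t rv = frame_level(right, n);
    if (lv - rv > thresh_left)  return RENDER_LEFT_SCREEN;
    if (rv - lv > thresh_right) return RENDER_RIGHT_SCREEN;
    return RENDER_CENTER;
}
```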
The MCU's mixing processing addresses the case in which several telepresence video conference systems participating in the video conference produce audio data at the same time; it ensures that every participating telepresence video conference system can receive the audio data of every other system at the same time, guaranteeing the real-time behaviour and accuracy of the conference. Preferably, the MCU may further be configured to acquire the decoded audio data input in the stereo format from the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition processing. Since the present invention uses audio data in the stereo format, the MCU needs to ensure that the audio data of every conference end is in the stereo format; when the audio data of a conference end is not in the stereo format, it is first converted into the stereo format and then takes part in the mixing, so that the audio data format stays consistent. Preferably, the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, where the predetermined number of systems are those whose output audio signals have the highest intensity among the at least two telepresence video conference systems. When many telepresence video conference systems participate in the video conference, the number of audio streams taking part in the mixing must be limited so that the sound stays clear and easy to recognize. In the present invention a certain number of the loudest audio streams are selected for mixing, and how many are selected can be decided according to the specific situation (a selection sketch is given after this section). In practice many selection strategies are possible: different strategies can be formulated according to the priority of each conference end, the signal transmission quality and other factors, so that telepresence video conferences in different environments use the strategy that suits them. In summary, with the above embodiments provided by the present invention, the doubled-sound problem in a telepresence high-definition video conference system can be solved, the sound quality is greatly improved, and the effect of locating a speaker by sound is genuinely achieved.
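The "loudest N" selection described above can be sketched as follows; the level metric and the value of N are illustrative assumptions, and a real MCU would typically smooth levels over time to avoid rapid switching between ends.

```c
/* Sketch of loudest-N selection: of all conference ends, only the N with
 * the strongest audio signal take part in the mix.                         */
#include <stdint.h>

/* Marks the 'n_select' loudest ends in 'selected' (1 = include in mix). */
static void select_loudest(const int32_t level[], int ends,
                           int n_select, int selected[])
{
    for (int i = 0; i < ends; ++i)
        selected[i] = 0;

    for (int k = 0; k < n_select && k < ends; ++k) {
        int best = -1;
        for (int i = 0; i < ends; ++i)
            if (!selected[i] && (best < 0 || level[i] > level[best]))
                best = i;
        if (best >= 0)
            selected[best] = 1;
    }
}
```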
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by a plurality of computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from that described herein, or they may be fabricated into individual integrated circuit modules, or a plurality of the modules or steps may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software. The above description is only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. An audio data transmission method, applied to a telepresence video conference system, comprising the following steps:

receiving audio data from one terminal in the telepresence video conference system, wherein the audio data is input to the terminal in stereo; and

performing codec processing on the audio data and sending it to another telepresence video conference system.
2. The method according to claim 1, wherein the one terminal is located at an intermediate position among a plurality of terminals of the telepresence video conference system, and before the audio data is received, the method further comprises the following step:

the telepresence video conference system determining the difference between the left-channel and right-channel volumes of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, transmitting the audio signal to the left-screen-side output of the opposite end; and when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, transmitting the audio signal to the right-screen-side output of the opposite end.
3. The method according to claim 1, wherein, when there are at least two telepresence video conference systems, the following step is further included after the decoding processing and before the encoding processing:

performing mixing processing on the audio data from the at least two telepresence video conference systems.
4. The method according to claim 3, wherein the mixing processing comprises the following steps:

acquiring decoded audio data input in the stereo format from the at least two telepresence video conference systems; and

converting the sampling rate of the acquired audio data to a predetermined frequency and then performing superposition processing.
5. The method according to claim 4, wherein, in the mixing processing, when the audio data input in the stereo format in the telepresence video conference systems is acquired,

the acquired audio data comes from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, wherein the predetermined number of telepresence video conference systems output the audio signals with the highest intensity among the at least two telepresence video conference systems.
6. An audio data transmission system, comprising: a telepresence video conference system and an MCU;

wherein the telepresence video conference system comprises a plurality of terminals, one of the plurality of terminals being an audio transmission terminal configured to receive and output audio signals in stereo; and

the MCU is configured to receive the audio data from the audio transmission terminal in the telepresence video conference system, perform codec processing on the audio data, and send it to another telepresence video conference system.

7. The system according to claim 6, wherein the audio transmission terminal is located at an intermediate position among the plurality of terminals of the telepresence video conference system and is further configured to determine the difference between the left-channel and right-channel volumes of the audio signal; when the difference between the left-channel volume and the right-channel volume is greater than a first threshold, to transmit the audio signal to the left-screen-side output of the opposite end; and when the difference between the right-channel volume and the left-channel volume is greater than a second threshold, to transmit the audio signal to the right-screen-side output of the opposite end.

8. The system according to claim 6, wherein the MCU is further configured to perform mixing processing on the audio data from the at least two telepresence video conference systems when there are at least two telepresence video conference systems.

9. The system according to claim 8, wherein the MCU is configured to acquire decoded audio data input in the stereo format from the at least two telepresence video conference systems, convert the sampling rate of the acquired audio data to a predetermined frequency, and then perform superposition processing.

10. The system according to claim 9, wherein the MCU is configured to acquire audio data from a predetermined number of telepresence video conference systems among the at least two telepresence video conference systems, wherein the predetermined number of telepresence video conference systems output the audio signals with the highest intensity among the at least two telepresence video conference systems.
PCT/CN2011/078938 2010-10-26 2011-08-25 Method and system for transmitting audio data WO2012055291A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010519953.0 2010-10-26
CN201010519953.0A CN102457700B (en) 2010-10-26 2010-10-26 Audio data transmission method and system

Publications (1)

Publication Number Publication Date
WO2012055291A1 true WO2012055291A1 (en) 2012-05-03

Family

ID=45993150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/078938 WO2012055291A1 (en) 2010-10-26 2011-08-25 Method and system for transmitting audio data

Country Status (2)

Country Link
CN (1) CN102457700B (en)
WO (1) WO2012055291A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087656A (en) * 2017-06-14 2018-12-25 广东亿迅科技有限公司 A kind of multimedia conferencing sound mixing method and device based on MCU

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103634697B (en) * 2012-08-24 2017-09-26 中兴通讯股份有限公司 Net the implementation method of true technology and net true equipment
CN105704423A (en) * 2014-11-24 2016-06-22 中兴通讯股份有限公司 Voice output method and device
CN105469803B (en) * 2015-11-23 2019-04-05 北京华夏电通科技有限公司 Echo cancel method and system in tripartite's telecommunication based on left and right acoustic channels coding
CN112689118B (en) * 2020-12-29 2023-12-08 厦门亿联网络技术股份有限公司 Data transmission method and device for multi-screen network terminal
CN113453114B (en) * 2021-06-30 2023-04-07 Oppo广东移动通信有限公司 Encoding control method, encoding control device, wireless headset and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1779780A (en) * 2005-10-21 2006-05-31 北京中星微电子有限公司 Audio mixing method based on portable terminal
CN1929593A (en) * 2005-09-07 2007-03-14 宝利通公司 Spatially correlated audio in multipoint videoconferencing
CN101427574A (en) * 2006-04-20 2009-05-06 思科技术公司 System and method for providing location specific sound in a telepresence system
CN101426058A (en) * 2008-06-17 2009-05-06 上海华平信息技术股份有限公司 System and method for improving quality of multichannel audio call

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070250567A1 (en) * 2006-04-20 2007-10-25 Graham Philip R System and method for controlling a telepresence system
US8289362B2 (en) * 2007-09-26 2012-10-16 Cisco Technology, Inc. Audio directionality control for a multi-display switched video conferencing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1929593A (en) * 2005-09-07 2007-03-14 宝利通公司 Spatially correlated audio in multipoint videoconferencing
CN1779780A (en) * 2005-10-21 2006-05-31 北京中星微电子有限公司 Audio mixing method based on portable terminal
CN101427574A (en) * 2006-04-20 2009-05-06 思科技术公司 System and method for providing location specific sound in a telepresence system
CN101426058A (en) * 2008-06-17 2009-05-06 上海华平信息技术股份有限公司 System and method for improving quality of multichannel audio call

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087656A (en) * 2017-06-14 2018-12-25 广东亿迅科技有限公司 A kind of multimedia conferencing sound mixing method and device based on MCU
CN109087656B (en) * 2017-06-14 2020-11-17 广东亿迅科技有限公司 Multimedia conference sound mixing method and device based on MCU

Also Published As

Publication number Publication date
CN102457700B (en) 2015-08-12
CN102457700A (en) 2012-05-16

Similar Documents

Publication Publication Date Title
EP2439945B1 (en) Audio panning in a multi-participant video conference
CN101627576B (en) Multipoint conference video switching
JP5534813B2 (en) System, method, and multipoint control apparatus for realizing multilingual conference
US6850496B1 (en) Virtual conference room for voice conferencing
US9113034B2 (en) Method and apparatus for processing audio in video communication
US8600530B2 (en) Method for determining an audio data spatial encoding mode
US8773491B2 (en) Method, apparatus, and system for implementing audio mixing
US9025002B2 (en) Method and apparatus for playing audio of attendant at remote end and remote video conference system
WO2012055291A1 (en) Method and system for transmitting audio data
WO2007082433A1 (en) Apparatus, network device and method for transmitting video-audio signal
US8385234B2 (en) Media stream setup in a group communication system
US20090019112A1 (en) Audio and video conferencing using multicasting
US8442198B2 (en) Distributed multi-party conferencing system
WO2014094461A1 (en) Method, device and system for processing video/audio information in video conference
CN110267064A (en) Audio broadcast state processing method, device, equipment and storage medium
WO2014154065A2 (en) Data transmission method, media acquisition device, video conference terminal and storage medium
TWI451746B (en) Video conference system and video conference method thereof
WO2016082577A1 (en) Video conference processing method and device
US20100268529A1 (en) Voice communication apparatus
US11800017B1 (en) Encoding a subset of audio input for broadcasting conferenced communications
Akoumianakis et al. The MusiNet project: Towards unraveling the full potential of Networked Music Performance systems
JP2001036881A (en) Voice transmission system and voice reproduction device
JP2010093479A (en) Conference system and conference terminal device
CN113079267B (en) Audio conferencing in a room
US11425258B2 (en) Audio conferencing in a room

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11835565

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11835565

Country of ref document: EP

Kind code of ref document: A1