WO2019128204A1 - Conference implementation method, device, apparatus and system, and computer readable storage medium - Google Patents


Info

Publication number: WO2019128204A1
Authority: WIPO (PCT)
Prior art keywords: channel, conference, audio stream, language type, terminal
Application number: PCT/CN2018/096532
Other languages: French (fr), Chinese (zh)
Inventor: 胡景欣
Original Assignee: ZTE Corporation (中兴通讯股份有限公司)
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2019128204A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/152Multipoint control units therefor

Definitions

  • the present application relates to communication technologies, and more particularly to a conference implementation method, apparatus, device and system, and a computer readable storage medium.
  • Video conferencing is a means of multimedia communication that uses television equipment and communication networks to hold meetings, enabling simultaneous interaction of images, voice, and data between two or more locations.
  • A video conferencing system generally consists of video terminals, a transmission network, and a Multipoint Control Unit (MCU).
  • The basic voice service of a video conference involves two parties, who can hear each other's voices. In a multi-party conference service, there may be more than two parties at the same time: the voice signal of any one party can be transmitted to all other parties, and each party can hear the voices of multiple other parties simultaneously.
  • When multiple languages are used in a multi-party video conference service, the speaker's speech content needs to be translated as needed, so that participants from different countries and regions who use different languages can communicate with each other.
  • In a related solution, each terminal translates the original voice of its site according to its own translation language type and outputs two channels of audio: one carries the source voice and the other carries the translated voice, so two transmission channels are needed.
  • The MCU judges the language type of the translated speech uploaded by each translation site.
  • Multiple mixers are needed to mix the translated speech according to the recognized language type and to mix the original speech.
  • The MCU then needs to send two channels of audio data to each venue: one is the mix of the original voices of the venues, and the other is the translation mix in the language type of the destination venue.
  • The network traffic of the above solution is relatively large, which is likely to cause network congestion.
  • At least one embodiment of the present application provides a conference implementation method, apparatus, device, and system, and a computer readable storage medium, which reduce network resource occupation and improve conference efficiency.
  • At least one embodiment of the present application provides a conference implementation method, including: acquiring the original voice of the site where a terminal is located and the translated voice corresponding to the original voice, and generating a multi-channel audio stream, where one channel corresponds to an audio stream of one language type; encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and transmitting the multi-channel encoded packet to a multipoint control unit.
  • At least one embodiment of the present application provides a conference implementation apparatus, including:
  • a code stream generating unit, configured to acquire the original voice of the site where the terminal is located and the translated voice corresponding to the original voice, and generate a multi-channel audio stream, where one channel corresponds to an audio stream of one language type;
  • a first encoding unit, configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet;
  • a first transmitting unit, configured to transmit the multi-channel encoded packet to the multipoint control unit.
  • At least one embodiment of the present application provides a conference implementation method, including: receiving multi-channel encoded packets sent by terminals participating in the conference, and parsing them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determining the terminals whose audio needs to be mixed, and mixing the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encoding the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and transmitting the multi-channel mixed encoded packet to the terminals participating in the conference.
  • At least one embodiment of the present application provides a conference implementation apparatus, including:
  • a second receiving unit, configured to receive a multi-channel encoded packet sent by a terminal participating in the conference;
  • a second parsing unit, configured to parse the multi-channel encoded packet to obtain a multi-channel audio stream, where each channel of the multi-channel audio stream corresponds to an audio stream of one language type;
  • a mixing unit, configured to determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;
  • a second encoding unit, configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet;
  • a second transmitting unit, configured to send the multi-channel mixed encoded packet to the terminals participating in the conference.
  • At least one embodiment of the present application provides a conference implementation method, including: receiving a multi-channel mixed encoded packet sent by a multipoint control unit; and parsing the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  • At least one embodiment of the present application provides a conference implementation apparatus, including:
  • a first receiving unit, configured to receive the multi-channel mixed encoded packet sent by the multipoint control unit;
  • a first parsing unit, configured to parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  • At least one embodiment of the present application provides a conference implementation device including a memory and a processor, the memory storing a program that, when read and executed by the processor, implements the conference implementation method according to any of the above embodiments.
  • An embodiment of the present application provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the conference implementation method according to any of the foregoing embodiments.
  • An embodiment of the present application provides a conference implementation system, including: a multipoint control unit and at least one terminal, where:
  • the terminal is configured to acquire the original voice of the site where the terminal is located and the translated voice corresponding to the original voice, and generate a multi-channel audio stream, where one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; transmit the multi-channel encoded packet to the multipoint control unit; receive the multi-channel mixed encoded packet transmitted by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream;
  • the multipoint control unit is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and transmit the multi-channel mixed encoded packet to the terminals participating in the conference.
  • The conference implementation method provided in at least one embodiment of the present application differs from the related art, in which the original voice and the translated voice are transmitted over separate channels.
  • Voices of different language types are combined into one multi-channel encoded packet, so only one channel is required, reducing the occupation of ports.
  • FIG. 1 is a schematic diagram of networking of a multilingual conference according to an embodiment of the present application.
  • FIG. 2 is a flowchart of a method for implementing a conference according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of processing on the terminal side of a method for implementing a conference according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a multilingual conference language negotiation according to an embodiment of the present application.
  • FIG. 5 is a flowchart of a terminal call based on the H323 protocol according to an embodiment of the present application.
  • FIG. 6 is a flowchart of a SIP-based terminal call according to an embodiment of the present application.
  • FIG. 7(a) is a schematic diagram of a format of an RTP message provided in the related art.
  • FIG. 7(b) is a schematic diagram of an extension of an RTP message according to an embodiment of the present application.
  • FIG. 8( a ) is a schematic diagram of a multi-session conference multi-channel encoding and packaging process according to an embodiment of the present application.
  • FIG. 8(b) is a schematic diagram of a multi-session conference multi-channel unpacking decoding process according to an embodiment of the present application.
  • FIG. 9 is a flowchart of an implementation of a conference implementation method on the MCU side according to the first embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of an MCU mixing process according to an embodiment of the present application.
  • FIG. 11 is a flowchart (terminal side) of a method for implementing a conference according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present disclosure.
  • FIG. 16 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.
  • FIG. 17 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.
  • FIG. 18 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.
  • FIG. 19 is a schematic diagram of a conference implementation system according to an embodiment of the present application.
  • In a related solution, the terminal connects the original voice channel and the translated voice channel directly to the MCU and uses separate channels for transmission: the original voice is transmitted through the original voice channel and the translated voice through the translated voice channel. Network traffic is increased, which easily causes network congestion, and the number of audio input and output interfaces occupied between the terminal and the MCU is doubled.
  • In another related solution, the MCU mixes the translated speech according to the language type of the translated speech and needs to create multiple mixing processes. As the number of speech types increases, the number of mixers increases linearly, resulting in repeated calculations in the mixing process and increased CPU usage.
  • FIG. 1 is a schematic diagram of networking of a multilingual conference according to an embodiment of the present application.
  • a system supporting multi-session conference television includes at least one terminal and an MCU connected to each terminal.
  • the terminal and the MCU are connected through a communication network.
  • the terminal is connected to the MCU through a standard interface to realize mixing and exchange of images and voice.
  • the MCU determines the language type supported by the conference according to the voice requirements and translation capabilities of each conference site.
  • the language type supported by the conference may be one or more.
  • The communication languages used by the participants at the various sites in the conference can be diverse.
  • Each venue can set up translators as needed to translate the voices sent and received.
  • Each venue can translate and convert the voice of the conference according to the type of conference language determined by the conference.
  • The conference mix received by each conference site can be listened to directly, or played at the venue after conversion into a language type that the venue can understand.
  • The method by which the terminal converts the original voice of the site into translated voice of a conference language type may be manual, or may adopt a non-manual intelligent manner implemented in software or hardware.
  • Current international conferences basically adopt manual oral translation (interpreting) of the speaker's speech.
  • In the embodiments below, manual translation is taken as an example, but the translation method or means is not limited thereto.
  • An embodiment of the present application provides a conference implementation method, as shown in FIG. 2, including:
  • Step 201 Acquire the original voice of the site where the terminal is located and the translated voice corresponding to the original voice, generate a multi-channel audio stream, and one channel corresponds to an audio stream of one language type;
  • Step 202 Encode the multi-channel audio stream to obtain a multi-channel encoded packet.
  • Step 203 Send the multi-channel encoded packet to the MCU.
  • The conference implementation method provided by this embodiment transmits audio of different language types to the MCU combined into a multi-channel encoded packet.
  • The multi-channel encoded packet can be transmitted through one audio channel; therefore, only one audio channel is required to transmit multiple audio streams of different language types. Compared with an implementation using multiple channels, network traffic is reduced, resource occupation is reduced, and communication efficiency is improved.
  • In an exemplary embodiment, before step 201, the method further includes:
  • Step 200: Obtain the language types supported by the conference;
  • the translated voice in the step 201 is a translated voice of a language type supported by the conference corresponding to the original voice.
  • the multi-point control unit negotiates with the terminal to determine the language type supported by the conference, and after determining the language type supported by the conference, the multi-point control unit sends the language type supported by the conference to the terminal. Therefore, one way for the terminal to obtain the language type supported by the conference is that the terminal receives the language type supported by the conference sent by the multipoint control unit.
  • the audio stream of each language type supported by the conference is included in the multi-channel audio stream.
  • For example, if the conference supports three language types, the multi-channel audio stream includes three channels, and each channel is an audio stream of one language type.
  • In one implementation, the terminal acquires the original voice of the site and the translated voice corresponding to the original voice, and generates the multi-channel audio stream as follows:
  • when the language type of the original voice is one of the language types supported by the conference, the audio stream corresponding to the original voice is used as the audio stream of one channel, and the translated voices of the other conference-supported language types are used as the audio streams of the other channels, to generate the multi-channel audio stream;
  • for example, if the original voice of the venue is Chinese and the conference supports Chinese and English, the Chinese voice is used as the audio stream of one channel, the English speech obtained by translating the Chinese speech is used as the audio stream of another channel, and a multi-channel audio stream composed of the Chinese speech and the English speech is obtained;
  • when the language type of the original voice is not supported by the conference, the translated voice of each language type supported by the conference is used as the audio stream of one channel, to generate the multi-channel audio stream;
  • for example, if the original voice of the venue is French and the conference supports Chinese and English, the Chinese voice translated from the original French voice is used as the audio stream of one channel, the English voice translated from the original French voice is used as the audio stream of another channel, and a multi-channel audio stream composed of the Chinese speech and the English speech is obtained.
  • The language type corresponding to each channel can be preset, or can be indicated by language type information. Specifically:
  • each channel of the multi-channel encoded packet corresponds to an audio stream of a preset language type; or, the multi-channel encoded packet carries language type information for each channel. For example, if the first channel corresponds to Chinese and the second channel corresponds to English, when the multi-channel encoded packet is formed, the audio stream of the Chinese voice is used as the audio stream of the first channel, and the audio stream of the English voice is used as the audio stream of the second channel.
  • each channel carries language type information indicating the language type of the audio stream for that channel.
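The channel-assignment rules above can be sketched as follows. This is a minimal illustration with hypothetical function and data names; the byte strings stand in for real audio streams:

```python
def build_channel_streams(original_lang, original_pcm, translations, conference_langs):
    """Assign language-typed audio streams to channels.

    translations: dict mapping language type -> translated audio stream.
    Returns a list of (language_type, stream), one entry per channel.
    """
    channels = []
    # If the original voice is in a conference-supported language,
    # it occupies one channel itself.
    if original_lang in conference_langs:
        channels.append((original_lang, original_pcm))
    # Translations into the remaining supported languages fill the other channels.
    for lang in conference_langs:
        if lang == original_lang:
            continue  # already covered by the original voice
        channels.append((lang, translations[lang]))
    return channels

# Chinese-speaking site, conference supports Chinese and English:
streams = build_channel_streams("zh", b"<zh pcm>", {"en": b"<en pcm>"}, ["zh", "en"])
# streams -> [("zh", b"<zh pcm>"), ("en", b"<en pcm>")]

# French-speaking site (French not supported): both channels carry translations.
streams2 = build_channel_streams("fr", b"<fr pcm>",
                                 {"zh": b"<zh pcm>", "en": b"<en pcm>"},
                                 ["zh", "en"])
# streams2 -> [("zh", b"<zh pcm>"), ("en", b"<en pcm>")]
```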
  • FIG. 3 is a schematic diagram of the processing performed by a terminal supporting a multilingual conference according to an embodiment of the present application.
  • The original voice of the conference speaker is picked up by the microphone of the conference site. If the original voice of the site is of only one language type and is consistent with a language type supported by the conference, the original voice can be used directly as the sound source of one channel of the terminal. If the original voice of the site is of multiple language types, or is inconsistent with the language types supported by the conference, translation must be performed through the translation mechanism set up at the site, converting the original voice of the site into translated voice of the language types supported by the conference.
  • Each type of translated speech corresponds to one channel input of the terminal; the channels are multi-channel encoded by the encoder of the terminal and packaged to form one audio output, which is sent to the MCU. Since each terminal has already translated the original voice of its site, the MCU does not need to perform translation: it directly performs multi-channel mixing on the translated voices of the terminals and then transmits the mixing result to each terminal (see the embodiments below for details). After receiving the multi-channel mixed encoded packet sent by the MCU, the terminal first unpacks it to obtain the encoded data of each channel, then decodes the encoded data of each channel, and then determines which channel's language type is the same as the language type supported by the local venue.
  • The conference mix can be translated and converted as needed: it can be output via the speaker or connected to different peripherals, or it can be converted to text by the terminal's internal voice-to-text conversion module and output as subtitles on the screen.
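The terminal-side step of selecting the channel whose language matches the local venue, described above, can be sketched as follows (hypothetical names; the byte strings stand in for decoded audio):

```python
def select_channel(decoded_channels, local_lang):
    """Pick the decoded channel whose language type matches the local venue.

    decoded_channels: list of (language_type, audio) obtained after
    unpacking and decoding the multi-channel mixed encoded packet.
    """
    for lang, audio in decoded_channels:
        if lang == local_lang:
            return audio
    # No matching language: the venue may translate/convert the mix as needed.
    return None

# A venue that listens in English picks the English channel of the mix:
out = select_channel([("zh", b"<zh mix>"), ("en", b"<en mix>")], "en")
# out -> b"<en mix>"
```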
  • In the negotiation process of a multilingual conference, the MCU first negotiates with the terminals to determine the language types supported by the conference. The number of channels in the MCU mix equals the number of language types supported by the conference, and the multi-language multi-channel mixing is completed in the same mixer.
  • FIG. 4 is a flowchart of a negotiation process of a language type supported by a conference according to an embodiment of the present disclosure. As shown in Figure 4, it includes:
  • Step 401: In the signaling control phase when the conference is held, the MCU starts conference language negotiation with each terminal and obtains the information of each site, including at least one of the following: the site priority, the listening/speaking attribute of the site (speaking end only, listening end only, or both speaking and listening), the language type supported by the site (the speaking language of the site), the translation capability of the site, and so on.
  • Each terminal can freely choose the language to listen to from the different languages. Speaking rights can be restricted: for example, a terminal with an ordinary identity can only speak in the default language, while a terminal with a translator or advanced identity can choose to speak in other languages.
  • Step 402 The MCU traverses all terminals participating in the conference
  • Step 403 it is determined whether all the terminals participating in the conference have been traversed, and if so, the language types L1 to Ln supported by the conference are obtained; otherwise, the process proceeds to step 404;
  • Step 404 determining whether the priority of the site where the terminal is located is the highest, if yes, go to step 407, otherwise, go to step 405;
  • Step 405: Determine whether the speaking language of the site where the terminal is located is the language type most used by all the sites in the conference; if yes, go to step 407; otherwise, go to step 406;
  • Step 406: Determine whether the speaking language used by the site can be translated to and from the other sites; if yes, go to step 407; otherwise, return to step 402.
  • English can be used as a language type supported by the conference.
  • Step 407: The speaking language of the venue is used as a language type supported by the conference, and the process returns to step 402.
  • The above manner of negotiating the language types supported by the conference is only an example; the language types supported by the conference may be determined in other ways, for example, by directly using a default language type.
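The negotiation of steps 401-407 can be sketched roughly as follows. The data model and function names are hypothetical, and ties between equally used languages are resolved arbitrarily:

```python
from collections import Counter

def negotiate_conference_languages(sites):
    """Collect the conference-supported language types.

    sites: list of dicts with keys 'priority', 'language', 'can_translate'.
    A site's speaking language is adopted if the site has the highest
    priority, or its language is the most widely used, or the site can
    translate to/from the other sites (steps 404-407 above).
    """
    usage = Counter(s["language"] for s in sites)
    most_used = usage.most_common(1)[0][0]
    top_priority = max(s["priority"] for s in sites)
    supported = []
    for s in sites:
        if (s["priority"] == top_priority
                or s["language"] == most_used
                or s["can_translate"]):
            if s["language"] not in supported:
                supported.append(s["language"])
    return supported

sites = [
    {"priority": 2, "language": "zh", "can_translate": False},  # highest priority
    {"priority": 1, "language": "en", "can_translate": True},   # has translation
    {"priority": 1, "language": "fr", "can_translate": False},  # excluded
]
# negotiate_conference_languages(sites) -> ["zh", "en"]
```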
  • The call procedures of the terminal with the newly added messages are shown in FIG. 5 and FIG. 6, respectively.
  • FIG. 5 shows the H.323 protocol call flow.
  • the call flow can be roughly divided into Q931 phase and H245 phase.
  • The following is a description of the flow of the MCU calling the terminal:
  • Step 501: First enter the Q931 phase, that is, the MCU sends a setup message;
  • Steps 502-504: After receiving the call, the terminal responds in sequence with call proceeding (callProceeding), alerting, and connect messages to the MCU. At this point, the Q931 phase is completed, and the H245 phase is entered.
  • The H245 phase mainly includes three parts: capability set interaction, master-slave determination, and opening logical channels.
  • The capability interaction includes mutual terminal capability set (terminalCapabilitySet) and terminal capability set acknowledgement (terminalCapabilitySetAck) messages.
  • The master-slave determination process includes mutual master-slave determination request (masterSlaveDetermination) and acknowledgement (masterSlaveDeterminationAck) messages; opening logical channels includes mutual open logical channel requests (openLogicalChannel) and open logical channel responses (openLogicalChannelAck).
  • The above is the standard call setup flow.
  • On this basis, a non-standard private message is added.
  • The language type supported by the conference described in this application is notified to the terminal by sending a nonStandard message. Specifically, the flow includes:
  • Step 505 The MCU sends a terminalCapabilitySet message to the terminal.
  • Step 506 the MCU sends a masterSlaveDetermination message to the terminal.
  • Step 507 the terminal sends a terminalCapabilitySet message to the MCU;
  • Step 508 the MCU sends a terminalCapabilitySetAck message to the terminal;
  • Step 509 the terminal sends a masterSlaveDetermination message to the MCU;
  • Step 510 The MCU sends a masterSlaveDeterminationAck message to the terminal.
  • Step 511 the terminal sends a terminalCapabilitySetAck message to the MCU;
  • Step 512 the terminal sends a masterSlaveDeterminationAck message to the MCU;
  • Step 513: The MCU sends a nonStandard message to the terminal, carrying the language types supported by the conference.
  • Step 514 the terminal sends an openLogicalChannel message to the MCU;
  • Step 515 the MCU sends an openLogicalChannelAck message to the terminal.
  • Step 516 the MCU sends an openLogicalChannel message to the terminal.
  • step 517 the terminal sends an openLogicalChannelAck message to the MCU.
  • the MCU notifies the terminal of the language type supported by the conference.
  • The manner shown in FIG. 5 is only an example; the language types supported by the conference may be carried by messages other than the nonStandard message.
  • The message carrying the language types supported by the conference is not limited to step 513; it may also be sent at other positions, for example before or after other steps.
  • FIG. 6 is a SIP protocol call flow chart. Compared with the H.323 protocol, the SIP call flow is simpler. As shown in FIG. 6, it includes:
  • Step 601 The MCU sends an INVITE message to the terminal.
  • Step 602 The terminal sends a 100 Trying message to the MCU.
  • Step 603 the terminal sends a 180 Ringing message to the MCU;
  • Step 604 the terminal sends a 200 OK message to the MCU;
  • Step 605: The MCU sends an ACK message to the terminal; the call setup flow is completed.
  • On this basis, a non-standard private message is added: an INFO message is sent to notify the terminal of the language types supported by the conference. The flow includes:
  • Step 606 The MCU sends an INFO message to the terminal, and carries the language type supported by the conference.
  • step 607 the terminal sends a 200 OK message to the MCU.
  • the MCU notifies the terminal of the language type supported by the conference.
  • Carrying the language types supported by the conference in the INFO message is only an example; other messages may also be used to carry the language types supported by the conference.
  • The packet format used for audio packets transmitted over the Internet complies with the Real-time Transport Protocol (RTP).
  • The RTP protocol is a network transmission protocol published by the IETF Audio/Video Transport working group as RFC 1889 in 1996.
  • the RTP packet includes: an RTP header and an RTP payload, wherein, as shown in FIG. 7(a), the RTP header includes:
  • V: the version number of the RTP protocol, occupying 2 bits; the current protocol version number is 2.
  • PT: the payload type, occupying 7 bits, used to describe the type of the payload in the RTP message, such as Global System for Mobile Communications (GSM) audio or Joint Photographic Experts Group (JPEG) images. It is mostly used to distinguish audio streams from video streams in streaming media so that the client can parse them.
  • Sequence number: 16 bits, used to identify the serial number of the RTP message sent by the sender; each time a message is sent, the sequence number is incremented by 1. This field can be used to check for packet loss when the underlying bearer protocol uses UDP and, under network jitter, to reorder the data. In the helix server, this field starts from 0, and the sequence numbers of audio packets and video packets are counted separately.
  • Timestamp: 32 bits; the timestamp reflects the sampling instant of the first octet of the RTP message.
  • the receiver uses the time stamp to calculate the delay and delay jitter and performs synchronous control.
  • Synchronization source (SSRC) identifier: 32 bits, used to identify the synchronization source. The identifier is chosen randomly, and two synchronization sources participating in the same video conference cannot have the same SSRC.
  • Contributing source (CSRC) identifiers: each is 32 bits, and there can be 0 to 15 of them. The CSRC identifiers identify all the contributing sources contained in the RTP message payload.
  • In an embodiment of the present application, the RTP message is extended: the extension flag (X) of the fixed header in the RTP protocol is set to "1", the header extension is enabled, and a variable-length header extension follows the fixed header.
  • The header extension carries the language type and encoded data length of each channel of the multi-channel speech; its format is shown in FIG. 7(b).
  • the header extension includes a plurality of extensions, specifically, including:
  • an identifier information field, used to identify the type of the current header extension; it can occupy 16 bits, and its specific implementation is defined by the upper-layer protocol;
  • a length field, used to indicate how many channels of encoded data the packet carries, that is, the number of subsequent frame information fields;
  • a plurality of frame information fields may follow the length field, each frame information field including a frame type field and a frame length field, where the frame type field indicates the frame type, that is, the language type of one channel, and the frame length field indicates the frame length, that is, the length of the encoded data of that channel.
  • Each frame information field can occupy 32 bits: the first 16 bits are the frame type field and the last 16 bits are the frame length field.
  • the length of each frame information field can be set as needed, which is not limited in this application.
  • In some implementations, the extension information may omit the identifier information field.
  • other fields may be extended as needed, which is not limited in this application.
  • the format of the foregoing extended header is only an example, and the extended headers of other formats may be defined to carry information about each channel according to requirements, which is not limited in this application.
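Assuming the field sizes described above (a 16-bit identifier, a 16-bit count of frame information fields, then one 32-bit frame information field per channel: 16-bit language type plus 16-bit encoded data length), packing and unpacking of the header extension and payload could look like this sketch. The identifier value and function names are hypothetical:

```python
import struct

EXT_ID = 0x0001  # hypothetical identifier; defined by the upper-layer protocol

def pack_extension(channels):
    """channels: list of (language_type, encoded_data) tuples, one per channel."""
    ext = struct.pack("!HH", EXT_ID, len(channels))  # identifier + field count
    payload = b""
    for lang_type, data in channels:
        ext += struct.pack("!HH", lang_type, len(data))  # frame type + frame length
        payload += data  # encoded data arranged in channel order
    return ext + payload

def unpack_extension(packet):
    """Recover (language_type, encoded_data) per channel from ext + payload."""
    _ext_id, count = struct.unpack_from("!HH", packet, 0)
    infos = [struct.unpack_from("!HH", packet, 4 + 4 * i) for i in range(count)]
    offset = 4 + 4 * count
    channels = []
    for lang_type, length in infos:
        channels.append((lang_type, packet[offset:offset + length]))
        offset += length
    return channels

# Round trip: two channels, e.g. type 1 = Chinese, type 2 = English (values assumed).
chans = [(1, b"zh-coded"), (2, b"en-coded")]
assert unpack_extension(pack_extension(chans)) == chans
```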
• The multi-channel encoded packet or the multi-channel mixing encoded packet may be transmitted using the extended RTP message described above.
• Messages in other formats may also be used to transmit the multi-channel encoded packet or the multi-channel mixing encoded packet, which is not limited in this application.
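The extension layout above — an identifier, a count of frame information fields, then one 32-bit frame information field (16-bit frame type, 16-bit frame length) per channel — can be packed and parsed as sketched below. This is a minimal illustration only: the function names, the 16-bit width of the length field, and network byte order are assumptions, since the application leaves these details open.

```python
import struct

def pack_header_extension(identifier, channels):
    """Pack the header extension described above.

    identifier: 16-bit value identifying the type of this header extension
                (its meaning is defined by the upper-layer protocol).
    channels:   list of (language_type, encoded_len) pairs; language_type
                fills the 16-bit frame type field and encoded_len the
                16-bit frame length field of one frame information field.
    """
    ext = struct.pack("!HH", identifier, len(channels))
    for language_type, encoded_len in channels:
        ext += struct.pack("!HH", language_type, encoded_len)
    return ext

def unpack_header_extension(ext):
    """Inverse of pack_header_extension: returns (identifier, channels)."""
    identifier, count = struct.unpack_from("!HH", ext, 0)
    channels = [tuple(struct.unpack_from("!HH", ext, 4 + 4 * i))
                for i in range(count)]
    return identifier, channels
```

A two-language packet thus needs only 4 + 2 × 4 = 12 bytes of extension to describe both channels.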
• FIG. 8(a) is a flowchart of multi-language conference multi-channel encoding and packing according to an embodiment of the present application.
• On the terminal side, the original voice of the venue speaker is first picked up by the microphone of the venue and transmitted to the translation room, where it is translated; the results are input to the different channels of the terminal.
• For example, if the conference is a bilingual conference, the left and right channels of the terminal respectively carry the voices of the two conference languages, and stereo coding is adopted.
• The terminal separately encodes each channel to obtain channel information and encoded data, as shown in FIG. 8(a).
• The channel information includes the language type and the encoded-data length.
• The language type of each channel is filled in as the frame type, and the encoded-data length as the frame length, into the corresponding fields of the RTP header extension.
• The encoded data of each channel is arranged in sequence as the RTP payload, in the same order as the language types in the header extension; finally, a multi-channel encoded packet including an RTP header, an RTP header extension, and an RTP payload is generated, wherein the extension flag bit in the RTP header indicates that a header extension is present, and the packet is transmitted over the network.
• On the MCU side, the multi-channel mixed data output by the mixer is sent to the encoder of the MCU; the encoder matches the mixer's multi-channel mixed data against the language information of each channel recorded when the mixer was created.
• Information such as the language identifier and the encoded-data length is filled into the RTP header extension preceding the encoded data.
• In another embodiment, the multi-channel encoded packet may also include only the RTP header and the RTP payload, where the RTP payload carries the data of each channel; the data of each channel includes the language identifier information of the channel and the encoded data of the channel, the language identifier information indicating the language type of the channel.
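The extension-free variant just described — each channel's encoded data preceded in the RTP payload by its language identifier — might be laid out as below. The 16-bit field widths and function names are assumptions for illustration, since the application does not fix this format.

```python
import struct

def pack_payload_only(channels):
    """Pack channel data into a bare RTP payload: for every channel, a
    16-bit language identifier and a 16-bit data length precede the
    channel's encoded data."""
    payload = b""
    for language_id, data in channels:
        payload += struct.pack("!HH", language_id, len(data)) + data
    return payload

def unpack_payload_only(payload):
    """Walk the payload and recover the (language_id, data) list."""
    channels, pos = [], 0
    while pos < len(payload):
        language_id, length = struct.unpack_from("!HH", payload, pos)
        pos += 4
        channels.append((language_id, payload[pos:pos + length]))
        pos += length
    return channels
```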
• FIG. 8(b) is a flowchart of multi-language conference multi-channel unpacking and decoding according to an embodiment of the present application.
• On the terminal side, the conference mixing encoded packet sent by the MCU, including the RTP header, the RTP header extension, and the RTP payload, is received; from the extension flag bit in the RTP header, the terminal learns that the encoded packet has a header extension.
• According to the header extension format, the data of each channel in the encoded packet is unpacked.
• The language type and encoded-data length of each channel in the header extension are parsed, the encoded data of each channel in the RTP payload is located, and the data is sent to a plurality of decoders to decode and output the conference sound of each channel.
• On the MCU side, the MCU receives the multi-channel encoded packet sent by the terminal.
• From the extension flag bit in the RTP header, the MCU learns that the encoded packet has a header extension, and according to the header extension format, the data of each channel in the multi-channel encoded packet is unpacked.
• The language type and encoded-data length of each channel in the header extension are parsed to obtain channel information 1 to channel information n, and the encoded data of each channel in the RTP payload is located accordingly to obtain encoded data 1 to encoded data n.
• The data is sent to a plurality of decoders, which decode and output the conference sound of each channel to the mixer module for multi-channel mixing.
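The unpacking flow of Figure 8(b) — check the extension flag bit, read the per-channel language type and encoded-data length from the header extension, then slice the payload into per-channel encoded data for the decoders — can be sketched as follows, reusing the 32-bit frame information layout described earlier. The offsets assume a fixed 12-byte RTP header with no CSRC list, and the names are illustrative.

```python
import struct

def unpack_multichannel_packet(packet):
    """Split a multi-channel encoded packet into per-channel encoded data.

    Returns a list of (language_type, encoded_data) in payload order, or
    an empty list if the extension flag (X) of the fixed 12-byte RTP
    header is not set.
    """
    if not packet[0] & 0x10:           # X bit of the first header byte
        return []
    # header extension: 16-bit identifier, 16-bit frame-info count
    _, count = struct.unpack_from("!HH", packet, 12)
    frames = [struct.unpack_from("!HH", packet, 16 + 4 * i)
              for i in range(count)]
    pos = 16 + 4 * count               # start of the RTP payload
    channels = []
    for language_type, length in frames:
        channels.append((language_type, packet[pos:pos + length]))
        pos += length
    return channels
```

Each returned `(language_type, encoded_data)` pair would then be handed to the decoder matching that language's codec.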
• FIG. 9 is a flowchart of a method for implementing a conference according to a first embodiment of the present application.
• The execution body of this embodiment is an MCU. As shown in Figure 9, the method includes:
• Step 901: The MCU receives the multi-channel encoded packets sent by the terminals participating in the conference, and unpacks and decodes each multi-channel encoded packet to obtain a multi-channel audio stream; each channel in the multi-channel audio stream corresponds to an audio stream of one language type;
• Since each terminal has already translated the original voice of the site in which it is located, the MCU does not need to perform translation; the decoder of the MCU unpacks and decodes the multi-channel encoded packets of each terminal according to the steps shown in FIG. 8(b), and then sends the audio stream of each channel to the mixer.
• Step 902: The MCU determines the terminals that need to be mixed, and mixes the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; one channel corresponds to the mixed audio stream of one language type;
• In an embodiment, the MCU creates a multi-channel mixer.
• The number of channels of the mixer is the same as the number of language types supported by the conference.
• The decoded audio stream of each channel obtained by the MCU is sent, according to its language type, to the corresponding channel of the mixer for mixing.
• For example, if the terminals that need to be mixed are the first terminal and the second terminal, the Chinese voice and the English voice of the first terminal are acquired, and the Chinese voice and the English voice of the second terminal are acquired; the two Chinese voices are mixed into a Chinese mixed audio stream, the two English voices are mixed into an English mixed audio stream, and the Chinese mixed audio stream and the English mixed audio stream together constitute the multi-channel mixed audio stream.
• Step 903: The MCU encodes the multi-channel mixed audio stream to obtain a multi-channel mixing encoded packet, and sends the multi-channel mixing encoded packet to the terminals.
• For a terminal in a conference, the MCU only needs to establish one audio channel for audio transmission with that terminal.
• In an embodiment, the multi-channel mixing encoded packet sent by the MCU to a terminal does not include the audio of the site where that terminal itself is located.
• The conference implementation method provided in this embodiment needs to create only one mixing process: the number of mixers is small, the computation of the mixing process is small, and CPU occupation is reduced; moreover, the original-language mix and the translated-language mix sent to the same venue carry consistent expression information.
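The same-language mixing of step 902 can be sketched as a minimal illustration on 16-bit PCM sample lists, assuming simple summation with clipping; per-venue exclusion of a terminal's own audio is omitted here for brevity (it is handled by the port-output rule of the mixing strategy). The function name and data layout are hypothetical, not taken from the application.

```python
def mix_same_language(terminal_streams):
    """terminal_streams: {terminal_id: {language_type: [pcm samples]}}.

    Returns {language_type: mixed samples}: for every language type, the
    samples of all terminals are summed and clipped to the 16-bit range,
    so one mixer pass yields one mixed channel per conference language.
    """
    mixed = {}
    for streams in terminal_streams.values():
        for language, samples in streams.items():
            acc = mixed.setdefault(language, [0] * len(samples))
            for i, s in enumerate(samples):
                acc[i] = max(-32768, min(32767, acc[i] + s))
    return mixed
```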
• In an embodiment, the mixing unit of the MCU includes: a VAD module, a sorting module, a selection module, and a mixing and exchange processing module.
• The VAD (Voice Activity Detection) module estimates the SNR (Signal to Noise Ratio) and energy of each conference speaker.
• The sorting module sorts all the terminals of the conference according to the results of the VAD module.
• The selection module picks out, from the sorted terminals, a number of terminals that match the mixing condition (such as the three with the highest volume) for mixing.
• The mixing and exchange processing module mixes the speech of the speakers participating in the mixing, and obtains the data to be sent to each port participating in the mixing.
• The MCU's mixing processing strategy should ensure that no venue hears its own voice; each venue hears only the sound of the other venues. As shown in Figure 10, the process includes:
• Step 1001: Acquire the SNR and energy of each terminal.
• Step 1002: Determine, according to the SNR and energy of the terminal, whether the terminal participates in sorting and mixing, and generate a terminal linked list.
• This step excludes the terminals that are not speaking; that is, only the terminals that are speaking participate in sorting and mixing.
• The terminal linked list includes the terminals that participate in sorting.
• Step 1003: Sort the terminal linked list of the conference.
• The terminal linked list can be sorted according to the energy and signal-to-noise ratio of each terminal.
• Step 1004: Select the terminals that satisfy the condition from the sorted terminal linked list to participate in the mixing.
• More or fewer terminals can be selected to participate in the mixing as needed.
• Mixing is handled in two cases: the number of mixing ports equal to 1, and the number of mixing ports greater than 1, wherein:
• The mixing ports are mixed with each other, and the output to a non-mixing port is the sum of all mixing ports.
• The meaning of A_out:B+C is: the audio streams sent by terminal B and terminal C to the MCU are mixed and sent to terminal A.
• Because the mixer supports terminals with different channel counts and different sampling rates, conversion of channel data and unification of sampling rates between mixing ports are required in the mixing process.
• The mixing of multi-language multi-channel speech can be performed using the same mixing processing strategy as in this embodiment, and details are not described herein again.
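A minimal sketch of the Figure 10 flow, treating each terminal's signal as a single energy scalar for brevity: silent terminals are excluded, the rest are sorted, the loudest become mixing ports, and every mixing port receives the sum of the other ports so that no venue hears its own voice (the A_out:B+C rule). The function name, the tuple layout, and the top-three default are illustrative assumptions; the application leaves the selection count configurable.

```python
def select_and_mix(terminals, max_mixed=3):
    """terminals: {terminal_id: (energy, speaking)} for one mixing period.

    Returns (mixing ports, per-terminal output): drop silent terminals,
    sort the rest by energy, keep the top max_mixed as mixing ports, and
    give each terminal the sum of the mixing ports minus its own
    contribution (non-mixing ports receive the sum of all mixing ports).
    """
    speaking = {tid: e for tid, (e, active) in terminals.items() if active}
    ports = sorted(speaking, key=speaking.get, reverse=True)[:max_mixed]
    total = sum(speaking[tid] for tid in ports)
    outputs = {}
    for tid in terminals:
        outputs[tid] = total - speaking[tid] if tid in ports else total
    return ports, outputs
```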
• FIG. 11 is a flowchart of a method for implementing a conference according to an embodiment of the present disclosure. As shown in FIG. 11, the method includes:
• Step 1101: The terminal receives the multi-channel mixing encoded packet sent by the MCU.
• Step 1102: The terminal parses the multi-channel mixing encoded packet to obtain a multi-channel mixed audio stream.
• In this way, the terminal can obtain the mixed audio stream of every channel through one audio channel, thereby reducing resource occupation.
• In an embodiment, the method further includes: the terminal determining the language type used by the audio stream of each channel according to the language type information carried in the multi-channel mixing encoded packet, or determining the language type corresponding to each channel according to a preset correspondence between channels and language types. That is, there are two ways to obtain the language type of the audio stream of each channel. One is by presetting, for example, presetting the first channel to Chinese, the second channel to English, the third channel to French, and so on; the MCU then carries the Chinese audio stream in the first channel of the multi-channel mixing encoded packet, the English audio stream in the second channel, and the French audio stream in the third channel, and the terminal acquires the Chinese audio stream from the first channel, the English audio stream from the second channel, and the French audio stream from the third channel.
• In an embodiment, the multi-channel encoded packet is a real-time transport protocol message.
• The terminal determining, according to the language type information carried in the multi-channel mixing encoded packet, the language type used by the audio stream of each channel includes:
• the terminal acquiring the language type information used by the audio stream of each channel from the frame type field in the real-time transport protocol header extension of the multi-channel encoded packet.
• In an embodiment, the method further includes: after the terminal obtains the audio stream of each channel, if the language type of the audio stream of one of the channels is the language type used by the site where the terminal is located, playing the audio stream of that channel.
• In the related art, the original voice mix received by the terminal is a mix of the several loudest speakers among all the speakers in the conference, while the translated voice mix received by the terminal contains only the translated voice corresponding to the language type supported by the terminal.
• The translated mix therefore cannot convey the valid information of all the translated speech of the venues; that is, the expression information of the original voice mix received by a venue is inconsistent with the expression information of the translated voice mix.
• In this embodiment, the expression information of the original voice mix received by the terminal is consistent with the expression information of the translated voice mix.
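The two ways of learning each channel's language type — language information carried in the packet versus a preset channel-to-language mapping — and the "play the channel matching the site's language" rule can be sketched as below. The function name and data shapes are assumptions for illustration.

```python
def pick_playback_channel(mixed_channels, site_language, preset_map=None):
    """Select the channel to play at a site.

    mixed_channels: {channel_index: (language_type, samples)} parsed from
    the mixing encoded packet, or {channel_index: samples} when the packet
    carries no language info and a preset channel-to-language map
    (e.g. {0: "zh", 1: "en", 2: "fr"}) is configured instead.

    Returns the samples of the channel whose language matches the site,
    or None if no channel matches.
    """
    for index, entry in mixed_channels.items():
        if preset_map is None:
            language, samples = entry          # language carried in-band
        else:
            language, samples = preset_map[index], entry
        if language == site_language:
            return samples
    return None
```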
• FIG. 12 is a schematic diagram of a conference television application scenario according to a second embodiment of the present application. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• In this embodiment, each conference site translates the original voice of the site according to the language types supported by the conference.
• For example, the languages negotiated for the conference are Chinese and English.
• The original voice of the Chinese venue is Chinese, so the sound source of the left channel of the Chinese venue is the original Chinese voice, and the sound source of the right channel is the English voice obtained by translating the original voice.
• The original voice of the English venue is English, so the sound source of the left channel of the English venue is the Chinese voice obtained by translating the original English voice, and the sound source of the right channel is the original voice of the venue, that is, the English voice.
• The terminal of each site obtains the original voice of the site where it is located.
• The original voice is input directly as the left channel, or directly as the right channel, or translated to serve as the sound source of the left or right channel; after stereo encoding by the terminal, it is sent to the MCU.
• For example, the terminal of the Chinese venue directly inputs the original voice (Chinese) of the venue as the left-channel sound source; after the original voice is translated into English, the translated voice is input as the right-channel sound source, and stereo encoding forms one channel of audio, which is sent to the MCU through one audio channel.
• The terminal of the English venue directly inputs the original voice (English) of the venue as the right-channel sound source; after the original voice is translated into Chinese, the translated voice is input as the left-channel sound source, and stereo encoding forms one channel of audio, which is sent to the MCU through one audio channel.
• The MCU performs mixing processing on the multi-channel audio streams sent by the terminals to generate a multi-channel mixing encoded packet, and transmits it to each terminal.
  • the MCU's mixing processing strategy should make it impossible for any venue to hear its own voice and only hear the sound of other venues.
• Each terminal receives the multi-channel mixing encoded packet (which may also be called a stereo mix in this embodiment) sent by the MCU, decodes the received multi-channel mixed audio stream (two channels in this embodiment), and selectively plays it through the left and right channels of the terminal.
• If the language of a channel matches the language of the site, the terminal of the site can directly play the sound of that channel in the venue.
• For example, the left channel of the stereo mix received by the Chinese venue is the Chinese conference mix, so the terminal of the Chinese venue can play the left-channel sound directly in the venue; likewise, the right channel of the stereo mix received by the English venue is the English conference mix, so the terminal of the English venue can play the right-channel sound directly in the venue.
• In other embodiments, each venue can process the mix according to its specific needs. For example, some venues need to translate the mix before playing it, while other venues can play the original mix without translation. In this embodiment, each terminal can directly listen to the sound of the left or right channel without translation. In another embodiment, the conference mix can also be processed by voice-to-text processing to form subtitles for display on the terminal screen.
• FIG. 13 is a schematic diagram of a conference television application scenario according to a third embodiment of the present disclosure. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• In this embodiment, each venue translates the original voice of its conference site.
• For example, the languages supported by the multi-language conference are Chinese and English.
• The original voice of the Chinese venue is Chinese: the sound source of the left channel of the Chinese venue is the original Chinese voice, and the sound source of the right channel is the English voice obtained by translating the original voice.
• The original voice of the English venue is English: the sound source of the left channel of the English venue is the Chinese voice obtained by translating the original English voice, and the sound source of the right channel is the original English voice.
• In addition, a venue may speak in multiple languages; that is, the original voice of the venue is multi-language.
• In this case, the left-channel sound source of the venue is the Chinese voice obtained by translating the original voice, and the right-channel sound source is the English voice obtained by translating the original voice.
• The translation language types of the terminal of each site are fixed; that is, the terminal translates the original voice of the site into voices of the fixed language types supported by the conference.
  • the working methods of the MCU and the terminals of each conference site are as follows:
• The terminal of each site obtains the original voice of the site where it is located.
• The original voice is input directly as the left channel, or directly as the right channel, or translated to serve as the sound source of the left or right channel; after stereo encoding by the terminal, it is sent to the MCU.
• For example, the terminal of the Chinese venue inputs the original voice (Chinese) of the venue as the left-channel sound source; after the original voice is translated into English, the translated voice is input as the right-channel sound source, and stereo encoding forms one channel of audio (that is, a multi-channel encoded packet), which is sent to the MCU through one audio channel.
• The terminal of the English venue inputs the original voice (English) of the venue as the right-channel sound source; after the original voice is translated into Chinese, the translated voice is input as the left-channel sound source, and stereo encoding forms one channel of audio (that is, a multi-channel encoded packet), which is sent to the MCU through one audio channel.
• The terminal of the mixed Chinese-English venue translates the original voice (mixed Chinese and English) of the venue into Chinese as the left-channel sound source input, and translates it into English as the right-channel sound source input; stereo encoding forms one channel of audio, which is sent to the MCU through one audio channel.
  • the MCU mixes the stereo multi-lingual speech transmitted by each terminal, and transmits a stereo mix (multi-channel mixed coded packet) to each terminal.
  • the MCU's mixing processing strategy should make it impossible for any venue to hear its own voice and only hear the sound of other venues.
• Each terminal receives the stereo mix sent by the MCU, decodes it, and selectively plays it through the left and right channels of the terminal.
• The language of a channel in the stereo mix sent by the MCU to a conference site may be the same as a language supported by the site. For example, the left channel of the stereo mix received by the Chinese venue is the Chinese conference mix, so the terminal of the Chinese venue can play the left-channel sound directly in the venue.
• The right channel of the stereo mix received by the English venue is the English conference mix, so the terminal of the English venue can play the right-channel sound directly in the venue.
• The left channel of the stereo mix received by the mixed Chinese-English venue is the Chinese conference mix, and the right channel is the English conference mix; therefore, the participants in the mixed Chinese-English venue can each listen to the conference mix of the channel matching the language they are familiar with.
• It can be seen that the conference implementation method proposed by the present application can effectively reduce the workload of the terminal-side translators and effectively satisfy the listening and speaking demands of the participants in a multi-language conference environment.
• FIG. 14 is a schematic diagram of a conference television application scenario according to a fourth embodiment of the present disclosure. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• In this embodiment, the conference supports three language types; based on the mutual translation capabilities of the venues, the conference is determined to be a Chinese-English-French trilingual conference.
• The original venue sound collected by the audio equipment of each venue undergoes two translation conversions. For example, in the Chinese venue, the original voice is used as the input of terminal channel 1, and the Chinese-to-English translated voice is used as the input of terminal channel 2.
• The Chinese-to-French translated voice is used as the input of terminal channel 3; the three channels are separately encoded and then packed.
• The specific implementation is shown in Fig. 8(a).
• The sound input processing steps of the English and French venues are the same.
  • the working methods of the MCU and the terminals of each conference site are as follows:
• The terminal of each venue obtains the original voice of the venue where it is located and the language types supported by the conference, determines the translation types, such as English-to-Chinese, Chinese-to-English, or others, and uses the original voice together with the two translated voices as the input sources of channel 1 to channel 3.
• The packed multi-channel encoded packet is sent to the MCU as one audio stream.
• The MCU unpacks and decodes the multi-channel encoded packet sent by each terminal, and sends the result to the multi-channel mixer according to the channel language types obtained after decoding, to obtain a multi-channel mixed audio stream, which is then encoded and packed to obtain a multi-channel mixing encoded packet; the specific implementation is shown in Figure 8(a). Finally, the multi-channel mixing encoded packet is sent to each terminal. In this embodiment, the mixing process of the MCU can be implemented quickly, and only one mixer is required to complete the multi-language mixing.
• Each terminal receives the multi-channel mixing encoded packet sent by the MCU, decodes and splits it into three channels of speech, each decoded into a linear code stream, and the terminal side listens to the conference mix as needed.
• In the trilingual conference, each terminal can freely choose which of the different languages to listen to.
• FIG. 15 is a schematic diagram of a conference television application scenario according to a fifth embodiment of the present disclosure. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• The difference between this embodiment and the fourth embodiment is that five venue languages exist in the conference, namely Chinese, English, German, Japanese, and French.
• The mixed German-Japanese-French venue has the capability of translating into Chinese and English, but the other venues do not have translation capabilities for German, Japanese, or French. Therefore, the language types supported by the multi-language conference are Chinese and English.
• The German-Japanese-French venue first translates the voice of the venue into Chinese and English, as the sound source input of channel 1 and channel 2, as shown in Figure 15.
• The voices sent by the Chinese venue are the original Chinese voice and the English voice translated from the original voice.
• The voices sent by the English venue are the Chinese voice translated from the original voice and the original English voice.
• The mixed Chinese-English venue translates its mixed Chinese-English speech into Chinese speech and English speech respectively.
• The mixed German-Japanese-French venue does the same, translating its mixed German-Japanese-French speech into Chinese speech and English speech respectively.
  • the working methods of the MCU and the terminals of each conference site are as follows:
• The terminal of each venue obtains the original voice of the venue where it is located and the language types supported by the multi-language conference, and determines the translation types, such as English-to-Chinese, Chinese-to-English, or the like.
• The two output languages are used as the sound sources of the left and right channels, and the multi-channel encoded packet is encoded and sent to the MCU as one audio stream.
• The MCU mixes the multi-channel encoded packets sent by the terminals to obtain a multi-channel mixed audio stream, and encodes and packs the multi-channel mixed audio stream and transmits it to each terminal.
• The MCU's mixing process can be implemented quickly, requiring only one mixer to complete the multi-language mixing.
  • Each terminal receives the multi-channel mixing and encoding packet sent by the MCU, decodes it, splits the left and right channels, and listens to the multi-lingual voice separately according to the requirements.
• The language types of the mixed audio streams in the mixing encoded packet sent by the MCU to a site match the language types supported by the site; for example, the multi-language audio received by the Chinese venue in the Chinese-English multi-language conference is Chinese and English.
• The Chinese terminals, English terminals, and mixed Chinese-English terminals in the conference can therefore directly listen to or play the conference voice by selecting different channels.
• Other venues can convert the translation according to their specific requirements; for example, the German-Japanese-French venue needs to translate the received Chinese mix or English mix before playing it.
• In this embodiment, each site has a translation function, and the original voice of the site can be translated and output.
• When the MCU mixes the multi-language voices output by the terminals, only one mixer and one processing flow are needed. Before mixing, there is no need to add a speech recognition module, because the language type can be identified through the frame type information. The participants of each venue in the conference can use one or more communication languages and communicate with each other without barriers. The solution is easy to implement and highly practical.
  • An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 16, including:
• the code stream generating unit 1601 is configured to acquire the original voice of the site and the translated voice corresponding to the original voice, and generate a multi-channel audio stream, where one channel corresponds to an audio stream of one language type;
• the first encoding unit 1602 is configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet;
• the first transmitting unit 1603 is configured to transmit the multi-channel encoded packet to the multi-point control unit.
  • the conference implementing apparatus further includes: a language type determining unit 1604, the language type determining unit 1604 being configured to determine a language type supported by the conference. For example, the language type determining unit 1604 acquires the language type supported by the conference from the multipoint control unit.
  • the multi-channel audio stream generated by the code stream generating unit 1601 includes an audio stream of each language type supported by the conference.
• In an embodiment, the code stream generating unit acquires the original voice of the site and the translated voice corresponding to the original voice, and generates the multi-channel audio stream, in one of the following ways or a combination thereof:
• using the audio stream corresponding to the original voice as the audio stream of one channel, when the language of the original voice is a language type supported by the conference; and
• using the translated voices, of each language type supported by the conference, corresponding to the original voice respectively as the audio streams of one channel each, to generate the multi-channel audio stream.
  • An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 17, including:
  • the first receiving unit 1701 is configured to receive the multi-channel mixing and encoding packet sent by the multi-point control unit;
  • the first parsing unit 1702 is configured to parse the multi-channel mixing and encoding packet to acquire a multi-channel mixed audio stream.
• In an embodiment, the first parsing unit 1702 is further configured to determine, according to the language type information carried in the multi-channel mixing encoded packet, the language type used by the audio stream of each channel, or to determine the language type corresponding to each channel according to the preset correspondence between channels and language types.
• The acquiring, by the first parsing unit 1702, of the encoded data of each channel includes:
• the first parsing unit 1702 acquiring the language type information used by the audio stream of each channel from the frame type field in the real-time transport protocol header extension of the multi-channel encoded packet, acquiring the encoded-data length of each channel from the frame length field in the real-time transport protocol header extension, and obtaining the encoded data of each channel from the real-time transport protocol payload of the multi-channel encoded packet according to the encoded-data length of each channel.
  • An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 18, including:
• the second receiving unit 1801 is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference;
• the second parsing unit 1802 is configured to parse the multi-channel encoded packet to obtain a multi-channel audio stream, where each channel of the multi-channel audio stream corresponds to an audio stream of one language type;
• the mixing unit 1803 is configured to determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream, where one channel corresponds to the mixed audio stream of one language type;
• the second encoding unit 1804 is configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixing encoded packet;
• the second transmitting unit 1805 is configured to transmit the multi-channel mixing encoded packet to the terminals participating in the conference.
  • An embodiment of the present application provides a conference implementation device, including a memory and a processor, where the memory stores a program, and when the program is read and executed by the processor, the conference implementation method of any of the foregoing is implemented.
  • An embodiment of the present application is a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the above The conference implementation method.
  • the computer readable storage medium includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
  • An embodiment of the present application provides a conference implementation system, as shown in FIG. 19, including: an MCU 1901 and at least one terminal 1902, where:
  • the terminal 1902 is configured to acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive the multi-channel mixed encoded packet sent by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream;
  • the MCU 1901 is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and transmit the multi-channel mixed encoded packet to the terminals participating in the conference.
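The per-language mixing performed by the MCU 1901 can be sketched as follows. The sample representation (lists of 16-bit PCM integers) and the clipping range are illustrative assumptions; the point is that a single mixer pass groups streams by language and produces the whole multi-channel mix.

```python
def mix_by_language(terminal_streams):
    """terminal_streams: one dict per terminal selected for mixing, each
    mapping language -> list of PCM samples. Samples of the same language
    are summed across terminals and clipped to the 16-bit range, so one
    mixer produces the entire multi-channel mix in a single pass."""
    mixed = {}
    for streams in terminal_streams:
        for language, samples in streams.items():
            acc = mixed.setdefault(language, [0] * len(samples))
            for i, s in enumerate(samples):
                # Sum, then clip to avoid 16-bit overflow.
                acc[i] = max(-32768, min(32767, acc[i] + s))
    return mixed
```

Because every terminal already uploads one stream per conference language, the MCU needs only this one mixer regardless of how many languages the conference supports, instead of one mixer per language as in the related art.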
  • For detailed implementations of the terminal and the MCU, refer to the foregoing embodiments; details are not repeated here.
  • Unlike the related art, in which the original speech and the translated speech are transmitted separately over different channels, speech of different language types is here combined into a single multi-channel encoded packet, so that only one channel is needed, reducing port occupation.


Abstract

A conference implementation method, device, apparatus and system, and a computer readable storage medium. The conference implementation method comprises: acquiring the original speech of the conference site at which a terminal is located and the translated speech corresponding to the original speech to generate a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type; encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and transmitting the multi-channel encoded packet to a multipoint control unit.

Description

Conference implementation method, apparatus, device and system, and computer readable storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 201711482042.3, filed on December 29, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to communication technologies, and in particular to a conference implementation method, apparatus, device and system, and a computer readable storage medium.
Background
Today, with the globalization of information exchange, video conferencing services are used ever more widely and cross-regional exchanges keep increasing. People in many countries and regions who speak different languages communicate together, making communication barrier-free, which has a huge impact on politics, economy, and culture.
Video conferencing is a means of multimedia communication that enables simultaneous interaction of images, voice, and data between two or more locations, using television equipment and communication networks to hold meetings. A video conferencing system generally consists of video terminals, a transmission network, and a multipoint control unit (MCU). The basic voice service of a video conference involves two parties who can hear each other's voices; in a multi-party conference, more than two parties may speak at the same time, the voice signal of any party can be transmitted to all other parties, and each party can simultaneously hear several other parties. When multiple languages are used in such a multi-party video conference, the speaker's remarks must be translated on demand so that participants from different countries and regions who speak different languages can communicate without barriers.
In the related art, multilingual video conferencing is implemented as follows: each terminal translates the original speech of its site according to its own translation language type and outputs two audio streams, one carrying the source speech and one carrying the translated speech, which require two channels for transmission to the MCU. The MCU determines the language type of the translated speech uploaded by each translating site. When mixing, the MCU needs multiple mixers: it mixes the translated speech by recognized language type and separately mixes the original speech. After mixing the speech of each language type, the MCU must send two audio streams to every site: one carrying the mixed original speech of all sites, and one carrying the translation mix in the language type of the destination site. This scheme generates considerable network traffic and is likely to cause network congestion.
Summary
At least one embodiment of the present application provides a conference implementation method, apparatus, device and system, and a computer readable storage medium, which reduce the occupation of network resources and improve conference efficiency.
At least one embodiment of the present application provides a conference implementation method, including:

acquiring the original speech of the site where a terminal is located and the translated speech corresponding to the original speech, and generating a multi-channel audio stream in which one channel corresponds to an audio stream of one language type;

encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and

sending the multi-channel encoded packet to a multipoint control unit.
At least one embodiment of the present application provides a conference implementation apparatus, including:

a code stream generating unit, configured to acquire the original speech of the site where a terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type;

a first encoding unit, configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet; and

a first sending unit, configured to send the multi-channel encoded packet to a multipoint control unit.
At least one embodiment of the present application provides a conference implementation method, including:

receiving the multi-channel encoded packets sent by the terminals participating in a conference and parsing them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;

determining the terminals that need to be mixed, and mixing the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;

encoding the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and

sending the multi-channel mixed encoded packet to the terminals participating in the conference.
At least one embodiment of the present application provides a conference implementation apparatus, including:

a second receiving unit, configured to receive the multi-channel encoded packets sent by the terminals participating in a conference;

a second parsing unit, configured to parse the multi-channel encoded packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;

a mixing unit, configured to determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;

a second encoding unit, configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and

a second sending unit, configured to send the multi-channel mixed encoded packet to the terminals participating in the conference.
At least one embodiment of the present application provides a conference implementation method, including:

receiving a multi-channel mixed encoded packet sent by a multipoint control unit; and

parsing the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
At least one embodiment of the present application provides a conference implementation apparatus, including:

a first receiving unit, configured to receive a multi-channel mixed encoded packet sent by a multipoint control unit; and

a first parsing unit, configured to parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
At least one embodiment of the present application provides a conference implementation device, including a memory and a processor, the memory storing a program which, when read and executed by the processor, implements the conference implementation method of any of the above embodiments.
An embodiment of the present application provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the conference implementation method of any of the above embodiments.
An embodiment of the present application provides a conference implementation system, including a multipoint control unit and at least one terminal, wherein:

the terminal is configured to acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive a multi-channel mixed encoded packet sent by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream;

the multipoint control unit is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and send the multi-channel mixed encoded packet to the terminals participating in the conference.
Unlike the related art, in which the original speech and the translated speech are transmitted separately over different channels, the conference implementation method provided by at least one embodiment of the present application combines speech of different language types into a multi-channel encoded packet before sending, so that only one channel is needed, reducing port occupation.
Other features and advantages of the present application will be set forth in the following description, and will in part become apparent from the description or be understood by implementing the present application. The objectives and other advantages of the present application can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings are used to provide a further understanding of the technical solutions of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not constitute a limitation thereof.
FIG. 1 is a schematic diagram of the networking of a multilingual conference according to an embodiment of the present application.

FIG. 2 is a flowchart of a conference implementation method according to an embodiment of the present application.

FIG. 3 is a schematic diagram of terminal-side processing in a conference implementation method according to an embodiment of the present application.

FIG. 4 is a flowchart of language negotiation for a multilingual conference according to an embodiment of the present application.

FIG. 5 is a flowchart of a terminal call based on the H.323 protocol according to an embodiment of the present application.

FIG. 6 is a flowchart of a terminal call based on the SIP protocol according to an embodiment of the present application.

FIG. 7(a) is a schematic diagram of the RTP packet format provided in the related art.

FIG. 7(b) is a schematic diagram of an RTP packet extension according to an embodiment of the present application.

FIG. 8(a) is a schematic diagram of the multi-channel encoding and packing process of a multilingual conference according to an embodiment of the present application.

FIG. 8(b) is a schematic diagram of the multi-channel unpacking and decoding process of a multilingual conference according to an embodiment of the present application.

FIG. 9 is a flowchart of the MCU-side implementation of a conference implementation method according to the first embodiment of the present application.

FIG. 10 is a schematic diagram of the MCU mixing process according to an embodiment of the present application.

FIG. 11 is a flowchart of a conference implementation method (terminal side) according to an embodiment of the present application.

FIG. 12 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 13 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 14 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 15 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 16 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.

FIG. 17 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.

FIG. 18 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.

FIG. 19 is a schematic diagram of a conference implementation system according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other arbitrarily.
The steps illustrated in the flowcharts of the figures may be executed in a computer system such as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described herein.
The video conference implementation solutions in the related art have the following disadvantages:
The terminal and the MCU are directly connected by an original speech channel and a translated speech channel, transmitting over these separate channels: the original speech over the original speech channel and the translated speech over the translated speech channel. This increases network traffic, easily causes network congestion, and doubles the number of audio input/output interfaces occupied between the terminal and the MCU.
A conference room mixes the translated speech according to its language type, which requires creating multiple mixing processes. As the number of speech languages increases, the number of mixers increases linearly, causing redundant computation in the mixing process and increasing its CPU load.
FIG. 1 is a schematic diagram of the networking of a multilingual conference according to an embodiment of the present application. As shown in FIG. 1, a system supporting multilingual conference television includes at least one terminal and an MCU connected to each terminal over a communication network. The terminal connects to the MCU through a standard interface to realize the mixing and exchange of images and voice. The MCU determines the language types supported by the conference according to the voice requirements and translation capabilities of each site; the conference may support one or more language types. The languages used by participants in each site may vary, and each site may arrange translators as needed to translate the speech it sends and receives. Each site may translate its own speech into the conference language types determined for the conference, and the conference mix received by each site may be listened to directly or first translated into a language understood at that site. The terminal may convert the original speech of its site into translated speech of a conference language type either manually or by non-manual intelligent means such as software and hardware. Current international conferences generally rely on human oral interpretation of the speaker's remarks. The present application describes the solution taking human translation as an example, but is not limited to this translation method or means.
An embodiment of the present application provides a conference implementation method, as shown in FIG. 2, including:
Step 201: acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type.

Step 202: encode the multi-channel audio stream to obtain a multi-channel encoded packet.

Step 203: send the multi-channel encoded packet to the MCU.
In the conference implementation method provided by this embodiment, audio streams of different language types are combined into a multi-channel encoded packet and transmitted to the MCU. In this embodiment, the multi-channel encoded packet can be sent over a single audio channel; thus only one audio channel is needed to carry audio of multiple language types, which, compared with a multi-connection implementation, reduces network traffic, lowers resource occupation, and improves communication efficiency.
In an embodiment, before step 201, the method further includes:

Step 200: acquire the language types supported by the conference.

The translated speech in step 201 is then the translated speech of the original speech in the language types supported by the conference.
Generally, the multipoint control unit negotiates with the terminals to determine the language types supported by the conference and, after they are determined, sends them to the terminals. One way for a terminal to acquire the language types supported by the conference is therefore to receive them from the multipoint control unit.
It should be noted that the language types supported by the conference may also be configured in advance.
The multi-channel audio stream includes an audio stream for each language type supported by the conference. For example, when the conference supports three language types, the multi-channel audio stream includes three channels, each carrying an audio stream of one language type.
In an embodiment, the terminal acquiring the original speech of its site and the translated speech corresponding to the original speech and generating a multi-channel audio stream includes:
when the language type of the original speech is one of the language types supported by the conference, using the audio stream of the original speech as the audio stream of one channel, and using the translated speech of the original speech in each conference-supported language type other than that of the original speech as the audio streams of the other channels, to generate the multi-channel audio stream. For example, if the original speech of the site is Chinese and the conference supports Chinese and English, the Chinese speech is used as the audio stream of one channel and the English speech translated from it as the audio stream of another channel, yielding a multi-channel audio stream composed of Chinese speech and English speech;

when the language type of the original speech is not a language type supported by the conference, using the translated speech of the original speech in each language type supported by the conference as the audio stream of one channel each, to generate the multi-channel audio stream. For example, if the original speech of the site is French and the conference supports Chinese and English, the Chinese speech translated from the original French speech is used as the audio stream of one channel and the English speech translated from it as the audio stream of another channel, yielding a multi-channel audio stream composed of Chinese speech and English speech.
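The two cases above can be sketched as follows. Here `translate` is a hypothetical helper standing in for the (human or automatic) translation step described in the text; it is an assumption for illustration, not part of the patent's interface.

```python
def build_channel_streams(original_language, original_audio, conference_languages, translate):
    """Return an ordered mapping language -> audio stream, one channel per
    conference-supported language, following the two cases described above."""
    channels = {}
    for language in conference_languages:
        if language == original_language:
            # Case 1: the original speech itself fills this channel.
            channels[language] = original_audio
        else:
            # Otherwise a translation of the original speech fills the channel
            # (this covers both the remaining channels of case 1 and all of case 2).
            channels[language] = translate(original_audio, language)
    return channels
```

With a Chinese site and a conference supporting Chinese and English, the Chinese channel carries the original audio and the English channel a translation; with a French site, both channels carry translations.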
Which language type each channel carries may be preset or may be indicated by language type information. Specifically:

each channel of the multi-channel encoded packet corresponds to an audio stream of a preset language type; or the multi-channel encoded packet carries language type information for each channel. For example, if it is preset that the first channel corresponds to Chinese and the second channel to English, then when the multi-channel encoded packet is assembled, the audio stream of the Chinese speech is placed in the first channel and the audio stream of the English speech in the second channel. Alternatively, each channel carries language type information indicating the language type of that channel's audio stream.
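When the language type information is carried in the packet rather than fixed by convention, the packing side can be sketched as follows. The per-channel layout (a 1-byte language code in the frame type field followed by a 2-byte frame length) and the concrete code values are illustrative assumptions, not the normative format.

```python
import struct

# Illustrative language codes (assumed; the patent does not fix concrete values).
CODES = {"Chinese": 1, "English": 2}

def pack_multichannel(channels):
    """channels: list of (language, encoded_bytes) pairs, one per channel.
    Returns the header-extension bytes (frame type + frame length per channel)
    and the concatenated payload holding each channel's encoded data."""
    extension = b""
    payload = b""
    for language, data in channels:
        # Frame type carries the language code; frame length, the data size.
        extension += struct.pack("!BH", CODES[language], len(data))
        payload += data
    return extension, payload
```

The receiving side walks the (frame type, frame length) pairs in order to recover each channel's language and its slice of the payload, so no separate signaling channel is needed for the language mapping.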
FIG. 3 is a schematic diagram of the processing performed by a terminal supporting a multilingual conference according to an embodiment of the present application. In practice, the microphone of the site picks up the original speech of the site's speaker. If the original speech of the site is of only one language type and matches a language type supported by the conference, it can directly serve as the sound source of one channel of the terminal. If the original speech of the site is of multiple language types or does not match the language types supported by the conference, the translation mechanism set up at the site translates the original speech of the site into translated speech of the language types supported by the conference. Each translated speech corresponds to one channel input of the terminal; the terminal's encoder performs multi-channel encoding and packs the channels into a single audio output that is sent to the MCU. Since each terminal has already translated the original speech of its own site, the MCU does not need to translate; it directly performs multi-channel mixing of the terminals' translated speech and sends the mixing result to each terminal (see the embodiments for the specific implementation).

After receiving the multi-channel mixed encoded packet sent by the MCU, the terminal first unpacks it to recover the encoded data of each channel, then decodes each channel's data and checks whether the language type of each channel matches the languages supported by its site. The conference mix can be translated on demand, output through loudspeakers, fed to different peripherals for listening, or converted into text by the terminal's internal speech-to-text module and displayed as subtitles on the terminal's screen.
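The receive-side selection described above (play the channel matching the site's language, translate otherwise) can be sketched as follows; `translate` is again a hypothetical helper standing in for the on-demand translation step.

```python
def select_playback(decoded_channels, site_language, translate):
    """decoded_channels: mapping language -> decoded audio, recovered from the
    multi-channel mixed encoded packet. Returns the audio to play at this site:
    the matching channel if one exists, otherwise a translation of one of the
    received mixes into the site's language."""
    if site_language in decoded_channels:
        return decoded_channels[site_language]
    # No channel matches the site's language: translate one of the received mixes.
    language, audio = next(iter(decoded_channels.items()))
    return translate(audio, site_language)
```

The same selected audio could equally be routed to a loudspeaker, a peripheral, or a speech-to-text module for subtitle output, as the text describes.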
During the negotiation for a multilingual conference, the MCU first negotiates with the terminals to determine the language types supported by the conference. The number of channels in the MCU mix equals the number of language types supported by the conference, and the multilingual multi-channel mixing is completed in a single mixer.
FIG. 4 is a flowchart of the negotiation process for the language types supported by a conference according to an embodiment of the present application. As shown in FIG. 4, the process includes:
Step 401: During the signaling control phase when the conference is convened, the MCU begins conference-language negotiation with each terminal and obtains information about each site, including at least one of the following: the site's priority; the site's listening/speaking attributes (speaking only, listening only, or both speaking and listening); the language types supported by the site (the site's speaking language); and the site's translation capability. Each terminal may freely choose its listening language from several languages. Speaking rights can be subject to permissions: for example, a terminal with an ordinary identity may speak only in the default language, whereas a terminal identified as an interpreter or as privileged may choose to speak in other languages.
Step 402: The MCU traverses all terminals participating in the conference.
Step 403: Determine whether all participating terminals have been traversed. If so, the process ends and the language types L1 to Ln supported by the conference are obtained; otherwise, go to step 404.
Step 404: Determine whether the site where the terminal is located has the highest priority. If so, go to step 407; otherwise, go to step 405.
Step 405: Determine whether the speaking language of the terminal's site is the language type used by the most sites in the conference. If so, go to step 407; otherwise, go to step 406.
Step 406: Determine whether the speaking language used by the site can be mutually translated with those of the other sites. If so, go to step 407; otherwise, return to step 402.
For example, if one site uses English and all other sites can translate English into their own languages, English can serve as one of the language types supported by the conference.
Step 407: Take the speaking language of that site as one of the language types supported by the conference, then go to step 402.
It should be noted that the above negotiation procedure is only an example; the language types supported by the conference may be determined in other ways, for example by directly using a default language type.
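As an illustration only, the FIG. 4 loop might be sketched as follows. The `Site` record and all of its field names are hypothetical, not part of the application; `can_translate_from` is assumed to hold the languages a site can translate into its own language (step 406's mutual-translation check).

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class Site:
    name: str
    language: str                        # the site's speaking language
    priority: int                        # higher value = higher priority
    can_translate_from: set = field(default_factory=set)  # languages this site can translate

def negotiate_languages(sites):
    """Sketch of FIG. 4: collect the language types L1..Ln supported by the conference."""
    supported = []
    max_priority = max(s.priority for s in sites)
    most_common = Counter(s.language for s in sites).most_common(1)[0][0]
    for site in sites:                                            # step 402 traversal
        if site.language in supported:
            continue
        if (site.priority == max_priority                         # step 404
                or site.language == most_common                   # step 405
                or all(site.language in other.can_translate_from  # step 406
                       for other in sites if other is not site)):
            supported.append(site.language)                       # step 407
    return supported                                              # step 403 result
```

The order of the three tests follows the flowchart; any site failing all three contributes no language.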
To enable terminals to obtain the conference's language types effectively, a new terminal join-conference protocol flow can be used, as shown in FIG. 5 and FIG. 6 respectively.
FIG. 5 shows the H.323 call flow. According to the H.323 standard, the call flow can be roughly divided into a Q.931 phase and an H.245 phase. The following description uses the case of the MCU calling a terminal:
Step 501: The flow first enters the Q.931 phase, in which the MCU sends a setup message.
Steps 502-504: Upon receipt, the terminal replies in turn with callProceeding, alerting, and connect messages to the MCU. The Q.931 phase is then complete, and the flow enters the H.245 phase.
The H.245 phase consists mainly of three parts: capability-set exchange, master/slave determination, and opening of logical channels. The capability exchange consists of the two sides sending each other terminalCapabilitySet and terminalCapabilitySetAck messages; master/slave determination consists of exchanging masterSlaveDetermination and masterSlaveDeterminationAck messages; and opening logical channels consists of exchanging openLogicalChannel and openLogicalChannelAck messages. Once the bidirectional audio logical channels are open, the terminal has successfully joined the conference. The above is the standard join flow; on this basis, a non-standard private message is added: after the capability exchange is complete, a nonStandard message is sent to notify the terminal of the language types supported by the conference as described in this application. Specifically, the flow includes:
Step 505: The MCU sends a terminalCapabilitySet message to the terminal.
Step 506: The MCU sends a masterSlaveDetermination message to the terminal.
Step 507: The terminal sends a terminalCapabilitySet message to the MCU.
Step 508: The MCU sends a terminalCapabilitySetAck message to the terminal.
Step 509: The terminal sends a masterSlaveDetermination message to the MCU.
Step 510: The MCU sends a masterSlaveDeterminationAck message to the terminal.
Step 511: The terminal sends a terminalCapabilitySetAck message to the MCU.
Step 512: The terminal sends a masterSlaveDeterminationAck message to the MCU.
Step 513: The MCU sends a nonStandard message to the terminal, carrying the language types supported by the conference.
Step 514: The terminal sends an openLogicalChannel message to the MCU.
Step 515: The MCU sends an openLogicalChannelAck message to the terminal.
Step 516: The MCU sends an openLogicalChannel message to the terminal.
Step 517: The terminal sends an openLogicalChannelAck message to the MCU.
In the above manner, the MCU notifies the terminal of the language types supported by the conference. It should be noted that the flow shown in FIG. 5 is only an example: the supported language types may also be carried in messages other than the nonStandard message, and the message carrying them is not limited to step 513 — it may appear at other positions, for example before or after other steps.
FIG. 6 is a flowchart of the SIP call flow. Compared with the H.323 protocol, the SIP call flow is relatively simple. As shown in FIG. 6, it includes:
Step 601: The MCU sends an INVITE message to the terminal.
Step 602: The terminal sends a 100 Trying message to the MCU.
Step 603: The terminal sends a 180 Ringing message to the MCU.
Step 604: The terminal sends a 200 OK message to the MCU.
Step 605: The MCU sends an ACK message to the terminal; the join flow is then complete.
In the present application, a non-standard private message is added on this basis: an INFO message is sent to notify the terminal of the language types supported by the conference, as follows:
Step 606: The MCU sends an INFO message to the terminal, carrying the language types supported by the conference.
Step 607: The terminal sends a 200 OK message to the MCU.
In the above manner, the MCU notifies the terminal of the language types supported by the conference. It should be noted that carrying the supported language types in an INFO message is only an example; other messages may also be used for this purpose.
Audio packets are transmitted over the Internet in a packet format that complies with the Real-time Transport Protocol (RTP). RTP is a network transport protocol published by the IETF's multimedia transport working group in RFC 1889 in 1996.
An RTP packet consists of an RTP header and an RTP payload. As shown in FIG. 7(a), the RTP header includes:
1. V: RTP version number, 2 bits; the current protocol version is 2.
2. P: padding flag, 1 bit. If P=1, one or more extra octets that are not part of the payload are appended at the end of the packet.
3. X: extension flag, 1 bit. If X=1, the RTP header is followed by a header extension.
4. CC: CSRC (contributing source) count, 4 bits, indicating the number of CSRC identifiers.
5. M: marker, 1 bit, whose meaning depends on the payload: for video it marks the end of a frame; for audio it marks the start of a session.
6. PT: payload type, 7 bits, describing the type of the payload in the RTP packet, such as GSM (Global System for Mobile Communications) audio or JPEG (Joint Photographic Experts Group) images. In streaming media it mostly serves to distinguish audio streams from video streams so that the client can parse them.
7. Sequence number: 16 bits, identifying the sequence number of the RTP packet sent by the sender; it increments by 1 with each packet sent. When the underlying transport is UDP, this field can be used to detect packet loss under poor network conditions and to reorder data when network jitter occurs. In the Helix server this field starts at 0, and audio packets and video packets are numbered separately.
8. Timestamp: 32 bits, reflecting the sampling instant of the first octet of the RTP packet. The receiver uses the timestamp to compute delay and jitter and to perform synchronization control.
9. SSRC (synchronization source) identifier: 32 bits, identifying the synchronization source. It is chosen at random; two synchronization sources in the same video conference must not share the same SSRC.
10. CSRC (contributing source) identifiers: 32 bits each, with 0 to 15 of them. Each CSRC identifies one of the contributing sources contained in the payload of the RTP packet.
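As an illustration only (not part of the embodiments), the fixed-header fields listed above can be extracted with straightforward bit operations; the function name and the returned dictionary below are hypothetical:

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header described above, plus any CSRC list."""
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    header = {
        "version":      first >> 6,           # V: 2 bits
        "padding":      (first >> 5) & 0x1,   # P: 1 bit
        "extension":    (first >> 4) & 0x1,   # X: 1 bit
        "cc":           first & 0x0F,         # CC: 4 bits
        "marker":       second >> 7,          # M: 1 bit
        "payload_type": second & 0x7F,        # PT: 7 bits
        "sequence":     seq,                  # 16 bits
        "timestamp":    ts,                   # 32 bits
        "ssrc":         ssrc,                 # 32 bits
    }
    # CC contributing-source identifiers follow, 32 bits each
    csrc_end = 12 + 4 * header["cc"]
    header["csrc"] = list(struct.unpack(f"!{header['cc']}I", packet[12:csrc_end]))
    return header
```

If `extension` is 1, the bytes after the CSRC list begin with the header extension, which is what the embodiment below builds upon.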
In an embodiment of the present application, the RTP packet is extended: the extension flag (X) in the fixed RTP header is set to "1" to enable the extension header, and a variable-length header extension immediately follows the fixed header. This header extension carries, for each channel of the multi-channel speech, information such as the channel's language and the length of its encoded data. Its format is shown in FIG. 7(b).
As shown in FIG. 7(b), the header extension includes multiple extension items, specifically:
An identification field, used to identify the type of the current header extension; it may occupy 16 bits. Its concrete interpretation is defined by the upper-layer protocol.
A length field, indicating how many kinds of channel-encoded data the packet carries, i.e., the number of frame-information fields that follow.
Multiple frame-information fields, which may follow the length field. Each frame-information field includes a frame-type field and a frame-length field: the frame-type field indicates a frame type, i.e., the language type of one channel, and the frame-length field indicates the frame length, i.e., the length of that channel's encoded data. Each frame-information field may occupy 32 bits, of which the first 16 bits are the frame-type field and the last 16 bits are the frame-length field. Of course, the length of each frame-information field can be set as needed; this is not limited in this application.
It should be noted that the extension items may also omit the identification field. In addition, other fields may be added as needed; this is not limited in this application.
It should also be noted that the above extension-header format is only an example: extension headers in other formats may be defined as needed to carry the per-channel information, and this application does not limit this. In one embodiment of the present application, the extended RTP packet described above may be used to transmit multi-channel encoded packets or multi-channel mixed-and-encoded packets. Of course, in other embodiments, packets in other formats may also be used to transmit the multi-channel mixed-and-encoded packets; this application does not limit this.
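As a sketch only, the header-extension layout above — a 16-bit identification field, a 16-bit count of frame-information fields, then one 32-bit frame-type/frame-length pair per channel — might be packed and parsed as follows. The numeric language codes and the default extension id `0x0001` are assumptions; the application does not fix concrete values.

```python
import struct

# Hypothetical mapping of language types to 16-bit frame-type codes
LANG_CODES = {"zh": 1, "en": 2, "fr": 3}

def pack_header_extension(channels, ext_id=0x0001) -> bytes:
    """channels: list of (language, encoded_data_length) pairs, one per channel."""
    ext = struct.pack("!HH", ext_id, len(channels))       # identification, length
    for lang, data_len in channels:
        ext += struct.pack("!HH", LANG_CODES[lang], data_len)  # frame type, frame length
    return ext

def parse_header_extension(ext: bytes):
    """Inverse of pack_header_extension."""
    ext_id, count = struct.unpack("!HH", ext[:4])
    codes = {v: k for k, v in LANG_CODES.items()}
    channels = []
    for i in range(count):
        ftype, flen = struct.unpack("!HH", ext[4 + 4 * i: 8 + 4 * i])
        channels.append((codes[ftype], flen))
    return ext_id, channels
```

The frame lengths recovered here are what lets the receiver slice the RTP payload back into per-channel encoded data.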
FIG. 8(a) is a flowchart of multi-channel encoding and packaging for a multilingual conference according to an embodiment of the present application. As shown in FIG. 8(a), on the terminal side, the original speech of the site's speaker is first picked up by the site's microphone and passed to the interpretation room on the terminal side; after translation by interpreters, the translated speech is fed into the different channels of the terminal. When the conference is bilingual, the left and right channels of the terminal carry the terminal speech corresponding to the two conference languages respectively, and stereo encoding is used. When the conference has more than two languages, the terminal encodes each channel separately to obtain channel information and encoded data — as shown in FIG. 8(a), channel information 1 to channel information n and encoded data 1 to encoded data n — where the channel information includes the language type and the length of the encoded data. The language type of each channel is written as the frame type, and the encoded-data length as the frame length, into the corresponding fields of the RTP header extension. Finally, the encoded data of the channels are arranged in the language order given by the header extension to form the RTP payload, producing a multi-channel encoded packet consisting of the RTP header, the RTP header extension, and the RTP payload, in which the extension flag in the RTP header indicates the presence of the extension header; the packet is then sent over the network. The MCU side works in the same way: the multi-channel mixed data output by the mixer is sent to the MCU's encoder, which encodes the mixer's multi-channel data according to the language information assigned to each channel when the mixer was created and then prepends the RTP header-extension information — language identifiers, encoded lengths, and so on — to the encoded data. In other embodiments, the multi-channel encoded packet may include only the RTP header and the RTP payload, with the payload carrying the data of each channel; each channel's data then includes that channel's language identifier, which indicates its language type, followed by its encoded data.
FIG. 8(b) is a flowchart of multi-channel unpacking and decoding for a multilingual conference according to an embodiment of the present application. As shown in FIG. 8(b), on the terminal side, the conference mixed-and-encoded packet sent by the MCU — consisting of the RTP header, the RTP header extension, and the RTP payload — is received. The extension flag in the RTP header indicates that the packet has a header extension, so the data of each channel is unpacked according to the header-extension format: the language type of each channel and the length of each channel's encoded data are first parsed from the header extension, the corresponding encoded data of each channel is located in the RTP payload, and the data is sent to multiple decoders, which output the conference sound of each channel. The MCU side works in the same way: the MCU receives the multi-channel encoded packet sent by a terminal, determines from the extension flag in the RTP header that the packet has a header extension, and unpacks each channel's data according to the header-extension format. It first parses the language type and encoded-data length of each channel from the header extension, obtaining channel information 1 to channel information n, then locates the corresponding encoded data of each channel in the RTP payload, obtaining encoded data 1 to encoded data n, and sends the data to multiple decoders, which output the conference sound of each channel to the mixer module for multi-channel mixing.
The present application is further described below through specific embodiments.
FIG. 9 is a flowchart of a conference implementation method according to a first embodiment of the present application. The method of this embodiment is executed by an MCU. As shown in FIG. 9, it includes:
Step 901: The MCU receives the multi-channel encoded packets sent by the terminals participating in the conference, and unpacks and decodes them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type.
Because each terminal has already translated the original speech of its own site, the MCU does not need to perform translation. The MCU's decoder unpacks and decodes the multi-channel encoded packet of each terminal according to the steps shown in FIG. 8(b), and then feeds the audio stream of each channel into the mixer.
Step 902: The MCU determines the terminals that need to be mixed, and mixes the audio streams of the same language type across the determined terminals' multi-channel audio streams, obtaining a multi-channel mixed audio stream in which each channel corresponds to the mixed audio stream of one language type.
The MCU creates a multi-channel mixer whose number of channels equals the number of language types supported by the conference. The MCU feeds the decoded audio stream of each channel into the corresponding channel of the mixer, according to its language type, for mixing.
Taking a conference whose supported language types are Chinese and English as an example, suppose the terminals that need to be mixed are determined to be a first terminal and a second terminal. The Chinese and English speech of the first terminal and the Chinese and English speech of the second terminal are obtained; the Chinese speech of the first terminal is mixed with the Chinese speech of the second terminal to produce a Chinese mixed audio stream, and the English speech of the first terminal is mixed with the English speech of the second terminal to produce an English mixed audio stream. The Chinese mixed audio stream and the English mixed audio stream together form the multi-channel mixed audio stream.
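A minimal sketch of this per-language mixing in step 902; all names are hypothetical, and clipping/normalization of the summed samples is ignored for brevity:

```python
def mix_by_language(terminal_streams, languages):
    """terminal_streams: dict mapping terminal -> {language: list of samples}.
    Returns one mixed sample list per supported language (one mixer channel each)."""
    mixed = {}
    for lang in languages:
        # gather this language's channel from every terminal that supplies it
        chans = [st[lang] for st in terminal_streams.values() if lang in st]
        # sample-wise sum across terminals = the mix for this language channel
        mixed[lang] = [sum(vals) for vals in zip(*chans)]
    return mixed
```

With two terminals each contributing Chinese and English channels, the result is exactly the two-channel (Chinese mix, English mix) stream described above.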
Step 903: The MCU encodes the multi-channel mixed audio stream to obtain a multi-channel mixed-and-encoded packet, and sends the packet to the terminals.
For a terminal in the conference, the MCU only needs to establish a single audio channel for audio transmission with that terminal. The multi-channel mixed-and-encoded packet that the MCU sends to a given terminal does not include the audio of that terminal's own site.
With the conference implementation method provided in this embodiment, only one mixing process needs to be created: the number of mixers is small, the computational cost of mixing is low, CPU usage is reduced, and the original-language and translated-language mixes sent to the same site convey consistent information.
Step 902 is further described below. The mixing unit of the MCU includes a VAD module, a sorting module, a selection module, and a mixing-and-switching module. The VAD module estimates the SNR (signal-to-noise ratio) and energy of the conference's speaking terminals. The sorting module sorts all terminals of the conference according to the VAD results. The selection module picks from the sorted terminals the several terminals that satisfy the mixing conditions (for example, the three loudest). The mixing-and-switching module mixes the selected speaking terminals and produces the data to be sent to each participating port. The MCU's mixing strategy should ensure that no site hears its own sound, only the sound of the other sites. As shown in FIG. 10, the process includes:
Step 1001: Obtain the SNR and energy of each terminal.
Step 1002: Determine from each terminal's SNR and energy whether the terminal participates in sorting and mixing, and generate a terminal list.
This step excludes terminals that are not speaking: only speaking terminals participate in sorting and mixing. A terminal participates in sorting under two conditions: first, the terminal is not muted; second, the terminal is judged to be a speaking terminal. The terminal list contains the terminals that participate in sorting.
Step 1003: Sort the conference's terminal list.
The terminal list may be sorted according to the terminals' energy and signal-to-noise ratio.
Step 1004: Select, from the sorted terminal list, the terminals that satisfy the conditions to participate in mixing.
For example, the three loudest terminals may be selected as the terminals participating in mixing. Of course, more or fewer terminals may be selected as needed.
After the mixing terminals are selected, mixing is performed in one of two ways depending on the number of selected mixing ports: the number of mixing ports equals 1, or the number of mixing ports is greater than 1.
When the number of mixing ports equals 1, what matters is selecting the output of the mixing port: the input of one non-muted input port among the non-participating ports is selected as the output of the mixing port.
When the number of mixing ports is greater than 1, the mixing ports are mixed with one another, and the output to each non-participating port is the sum of all mixing-port inputs. For example, with four terminals A, B, C, and D, suppose the SNR and energy analysis determines that terminals A, B, and C satisfy the sorting conditions and may participate in mixing. The mixing output is then: A_out: B+C; B_out: A+C; C_out: A+B; D_out: A+B+C, where A_out: B+C means that the audio streams sent to the MCU by terminals B and C are mixed and sent to terminal A.
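The output rule above can be sketched as follows, using simple sample-wise addition (clipping ignored); the function and variable names are hypothetical:

```python
def mix_outputs(inputs, mixing):
    """inputs: dict terminal -> list of samples sent to the MCU;
    mixing: the terminals selected in step 1004 to participate in mixing.
    Each mixing terminal receives the sum of the *other* mixing terminals,
    so no site hears its own sound; every other terminal receives the full sum."""
    total = [sum(vals) for vals in zip(*(inputs[t] for t in mixing))]
    outputs = {}
    for term in inputs:
        if term in mixing:
            # subtract the terminal's own contribution from the full sum
            outputs[term] = [s - x for s, x in zip(total, inputs[term])]
        else:
            outputs[term] = list(total)
    return outputs
```

Computing the full sum once and subtracting each participant's own input reproduces A_out = B+C, B_out = A+C, C_out = A+B, D_out = A+B+C with a single pass.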
Because the mixer supports terminals with different channel counts and different sampling rates, data conversion between channel layouts and unification of sampling rates between mixing ports are also required during mixing.
In the detailed embodiments below, the same mixing strategy as in this embodiment can be used for the multilingual multi-channel mixing; it is not described again.
FIG. 11 is a flowchart of a conference implementation method according to an embodiment of the present application. As shown in FIG. 11, it includes:
Step 1101: A terminal receives the multi-channel mixed-and-encoded packet sent by the MCU.
Step 1102: The terminal parses the multi-channel mixed-and-encoded packet to obtain the multi-channel mixed audio stream.
With the conference implementation method provided in this embodiment, a terminal can obtain the mixed audio streams of all channels through a single channel, reducing resource usage.
In an optional embodiment, the method further includes: the terminal determines the language type used by each channel's audio stream according to language-type information carried in the multi-channel mixed-and-encoded packet, or determines the language type corresponding to each channel according to a preset channel-to-language mapping. That is, there are two ways to obtain the language type of each channel's audio stream. One is presetting: for example, the first channel is preset to Chinese, the second to English, the third to French, and so on. The MCU then uses the first channel of the multi-channel mixed-and-encoded packet to carry the Chinese audio stream, the second channel to carry the English audio stream, and the third channel to carry the French audio stream, and the terminal obtains the Chinese audio stream from the first channel, the English audio stream from the second channel, and the French audio stream from the third channel.
In an optional embodiment, the multi-channel encoded packet is a real-time transport protocol (RTP) packet, and the terminal determining the language type used by each channel's audio stream according to the language-type information carried in the multi-channel mixed-and-encoded packet includes:
The terminal obtains the language-type information of each channel's audio stream from the frame-type fields in the RTP header extension of the multi-channel encoded packet.
In an optional embodiment, the method further includes: after the terminal obtains the audio stream of each channel, if the language type of one channel's audio stream is the language type used by the terminal's site, the terminal plays that channel's audio stream.
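A minimal sketch of this channel selection on the terminal side; the names are hypothetical:

```python
def select_playback_channel(channels, site_language):
    """channels: list of (language, decoded_frames) obtained from the mix packet.
    Return the frames of the channel matching the local site's language, or None
    if no channel matches (in which case translation or subtitles apply)."""
    for lang, frames in channels:
        if lang == site_language:
            return frames  # route to the loudspeaker or a peripheral
    return None
```

When no channel matches, the terminal falls back to the translation or speech-to-text paths described above.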
In the related art, the original-speech mix received by a terminal is a mix of the several loudest speakers in the conference, while the translated-speech mix it receives contains only the translated speech for the language types that terminal supports, so the terminal cannot obtain the complete information of the translated speech of all sites. In other words, the information expressed by the original-speech mix received at a site is inconsistent with the information expressed by the translated-speech mix. In this embodiment, the information expressed by the original speech received by the terminal is consistent with the information expressed by the translated-speech mix.
FIG. 12 is a schematic diagram of a videoconferencing application scenario according to a second embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
In the application scenario shown in FIG. 12, a Chinese site and an English site hold a videoconference, and each site translates its original speech according to the language types supported by the conference. In this scenario, the languages negotiated for the conference are Chinese and English. The original speech of the Chinese site is Chinese, so the Chinese site's left-channel source is the original Chinese speech and its right-channel source is the English speech obtained by translating the original speech. The original speech of the English site is English, so the English site's left-channel source is the Chinese speech obtained by translating the original English speech and its right-channel source is the site's original speech, namely the English speech. In this embodiment, each site's terminal supports two language types, and the interaction between the MCU and the terminals of the sites is as follows:
The terminal of each site acquires the original speech of its own site and, according to the language types supported by the multilingual conference, uses the original speech directly as the left-channel input, or directly as the right-channel input, or uses its translation as the left- or right-channel source; the result is stereo-encoded by the terminal and sent to the MCU. As shown in FIG. 12, the terminal of the Chinese site uses the site's original speech (Chinese) directly as the left-channel source and, after translating the original speech into English, uses the translation as the right-channel source; stereo encoding forms one audio stream, which is sent to the MCU over one audio channel. The terminal of the English site uses the site's original speech (English) directly as the right-channel source and, after translating the original speech into Chinese, uses the translation as the left-channel source; encoding forms one audio stream, which is sent to the MCU over one audio channel.
The MCU mixes the multi-channel audio streams sent by the terminals to generate a multi-channel mixed encoding packet and sends it to each terminal. The MCU's mixing policy should ensure that no site hears its own voice, only the voices of the other sites. In addition, when multiple sites are speaking, the sites participating in the mix may be limited to the several loudest sites, for example the three loudest.
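The mixing policy just described — exclude the receiving site's own voice and limit the mix to the loudest sites — can be sketched as follows. The data layout (per-site lists of per-channel sample lists) and all names are assumptions for illustration, not the MCU's actual implementation:

```python
def mix_for_site(receiver, site_channels, site_volume, top_n=3):
    """Build the multi-channel mix the MCU would send to `receiver`.

    site_channels: {site: [ch0_samples, ch1_samples, ...]}, where each
        channel holds one language and all sites share the channel order.
    site_volume: {site: loudness}, used to pick the loudest speakers.

    The receiver's own site is excluded, and only the top_n loudest
    other sites contribute to the mix.
    """
    others = [s for s in site_channels if s != receiver]
    loudest = sorted(others, key=lambda s: site_volume[s], reverse=True)[:top_n]

    any_site = next(iter(site_channels.values()))
    n_ch, n_samples = len(any_site), len(any_site[0])
    mixed = [[0] * n_samples for _ in range(n_ch)]
    for site in loudest:
        for ch in range(n_ch):
            for i, sample in enumerate(site_channels[site][ch]):
                mixed[ch][i] += sample  # simple additive mix per language channel
    return mixed
```

Because same-language streams all sit on the same channel index, one pass over the channels mixes every language at once, matching the later observation that a single mixer suffices.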
Each terminal receives the multi-channel mixed encoding packet (which in this embodiment may also be called a stereo mix) sent by the MCU, decodes the received multi-channel mixed audio stream (two channels in this embodiment), and selectively plays it through the terminal's left and right channels.
If one channel of the left/right stereo mix that the MCU sends to a site's terminal is in the same language type as that supported by the site, the site's terminal can play that channel's sound directly in the site. For example, the left channel of the stereo mix received by the Chinese site is the Chinese conference mix, so the terminal of the Chinese site can play the left-channel sound directly in its site; likewise, the right channel of the stereo mix received by the English site is the English conference mix, so the terminal of the English site can play the right-channel sound directly in its site.
A site can process the received stereo conference mix according to its specific needs: some sites need to translate it before playback, while others can play the original mix directly without translation. In this embodiment, every terminal can listen to the left- or right-channel sound directly without translation. In another embodiment, the conference mix can also be converted to text by speech-to-text processing to form subtitles displayed on the terminal screen.
FIG. 13 is a schematic diagram of a videoconferencing application scenario according to a third embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
The difference between this embodiment and the second embodiment is that a site in the conference may itself communicate in multiple languages. In the application scenario shown in FIG. 13, a Chinese site, an English site, and a mixed Chinese-English site hold a videoconference; participants in the same site may use several language types, as in the mixed Chinese-English site. Each site translates its original speech according to the language types supported by the conference. In this scenario, the languages supported by the multilingual conference are Chinese and English. The original speech of the Chinese site is Chinese, so the Chinese site's left-channel source is the original Chinese speech and its right-channel source is the English speech translated from the original speech; the original speech of the English site is English, so the English site's left-channel source is the Chinese speech translated from the original English speech and its right-channel source is the original English speech. A site may speak in multiple languages, as in the mixed Chinese-English site, whose original speech accordingly contains multiple language types; that site's left-channel source is the Chinese speech obtained by translating the original speech, and its right-channel source is the English speech obtained by translating the original speech.
In this embodiment, the translated language types of each site's terminal are fixed; that is, the terminal translates its site's original speech into speech of the fixed language types supported by the conference. In this embodiment, while the conference is in progress, the MCU and the terminals of the sites operate as follows:
The terminal of each site acquires the original speech of its own site and, according to the language types supported by the multilingual conference, uses the original speech directly as the left-channel input, or directly as the right-channel input, or uses its translation as the left- or right-channel source; the result is stereo-encoded by the terminal and sent to the MCU. For example, the terminal of the Chinese site uses the site's original speech (Chinese) as the left-channel source and, after translating the original speech into English, uses the translation as the right-channel source; stereo encoding forms one audio stream (that is, a multi-channel encoded packet), which is sent to the MCU over one audio channel. The terminal of the English site uses the site's original speech (English) as the right-channel source and, after translating the original speech into Chinese, uses the translation as the left-channel source; stereo encoding forms one audio stream (that is, a multi-channel encoded packet), which is sent to the MCU over one audio channel. The terminal of the mixed Chinese-English site translates the site's original speech (mixed Chinese and English) into Chinese as the left-channel source and into English as the right-channel source; stereo encoding forms one audio stream, which is sent to the MCU over one audio channel.
The MCU mixes the stereo multilingual speech sent by the terminals and sends the stereo mix (a multi-channel mixed encoding packet) to each terminal. The MCU's mixing policy should ensure that no site hears its own voice, only the voices of the other sites. In addition, when multiple sites are speaking, the sites participating in the mix may be set to the three loudest sites.
Each terminal receives the stereo mix sent by the MCU, decodes it, and selectively plays it through the terminal's left and right channels.
One channel of the left/right stereo mix that the MCU sends to a site's terminal may be in the same language type as that supported by the site. For example, the left channel of the stereo mix received by the Chinese site is the Chinese conference mix, so the terminal of the Chinese site can play the left-channel sound directly in its site; the right channel of the stereo mix received by the English site is the English conference mix, so the terminal of the English site can play the right-channel sound directly in its site. Likewise, the left channel of the stereo mix received by the mixed Chinese-English site is the Chinese conference mix and the right channel is the English conference mix, so the participants at the mixed Chinese-English site can each listen to the conference mix on the left or right channel according to the language they are familiar with.
As can be seen from the second and third embodiments, the conference implementation method proposed in the present application not only effectively reduces the workload of terminal-side interpreters, but also effectively satisfies the listening and speaking needs of the participants at each site in a multilingual conference environment.
FIG. 14 is a schematic diagram of a videoconferencing application scenario according to a fourth embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
The difference between this embodiment and the third embodiment is that the conference supports three language types and the sites can translate among all of them, making it a trilingual Chinese-English-French conference. In this example, after the original site sound is captured by the audio equipment of the Chinese, English, or French site, it undergoes two translation conversions. For example, at the Chinese site, the original speech serves as the input of terminal channel 1, the Chinese-to-English translated speech as the input of terminal channel 2, and the Chinese-to-French translated speech as the input of terminal channel 3; these are separately encoded and then packed, as shown in FIG. 8(a). The sound input processing steps of the English and French sites are analogous.
In this embodiment, while the conference is in progress, the MCU and the terminals of the sites operate as follows:
(1) The terminal of each site acquires the original speech of its own site and the language types supported by the conference, determines the kinds of translation needed (such as English-to-Chinese, Chinese-to-English, or others), uses the original speech together with the two translated speech streams as the input sources of channels 1 to 3, and packs them into a multi-channel encoded packet sent to the MCU as one audio stream.
(2) The MCU unpacks and decodes the multi-channel encoded packets sent by the terminals and, according to the channel language types obtained after decoding, feeds them into a multi-channel mixer to obtain a multi-channel mixed audio stream, which is then encoded and packed into a multi-channel mixed encoding packet (see FIG. 8(a) for the specific implementation). Finally, the multi-channel mixed encoding packet is sent to each terminal. In this embodiment, the MCU's mixing can be performed quickly: only one mixer is needed to complete the multi-channel mixing.
(3) Each terminal receives the multi-channel mixed encoding packet sent by the MCU, decodes it and splits it into the speech of three channels, decodes each into a linear code stream, and the terminal side listens to the conference mix as needed.
Since the multi-channel mixed encoding packet that the MCU sends to a site's terminal covers the language types supported by all sites, the Chinese, English, and French terminals in the conference can listen to or play the conference speech directly by selecting different channels. In this application scenario, every terminal can freely choose which of several languages to listen to.
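The per-channel packing used in the steps above — each channel contributing a frame type (language) and frame length followed by its encoded data, as in FIG. 8(a) — can be sketched with a hypothetical byte layout. The exact field widths and language ids below are assumptions for illustration, not taken from the specification:

```python
import struct

# Illustrative language ids; a real system would negotiate these.
LANG_IDS = {"zh": 1, "en": 2, "fr": 3}

def pack_channels(frames):
    """frames: list of (lang, encoded_bytes) -> one packet (bytes).

    Per channel: 1-byte language id, 2-byte big-endian frame length,
    then the channel's encoded frame.
    """
    out = bytearray()
    for lang, data in frames:
        out += struct.pack("!BH", LANG_IDS[lang], len(data))
        out += data
    return bytes(out)

def unpack_channels(packet):
    """Inverse of pack_channels: packet -> list of (lang, encoded_bytes)."""
    ids = {v: k for k, v in LANG_IDS.items()}
    frames, off = [], 0
    while off < len(packet):
        lang_id, length = struct.unpack_from("!BH", packet, off)
        off += 3
        frames.append((ids[lang_id], packet[off:off + length]))
        off += length
    return frames
```

A terminal would pack its original and translated frames this way before sending one audio stream to the MCU, and split a received packet back into per-language frames with the inverse routine.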
FIG. 15 is a schematic diagram of a videoconferencing application scenario according to a fifth embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
The difference between this embodiment and the fourth embodiment is that the sites in the conference use five language types: Chinese, English, German, Japanese, and French. During the signaling interaction phase of the conference, the mixed German-Japanese-French site can translate to and from Chinese and English, but the other sites cannot translate German, Japanese, or French; therefore, the language types supported by the multilingual conference are Chinese and English. When the German-Japanese-French site participates in the conference, it first translates the site's sound into Chinese and English, which serve as the source inputs of channel 1 and channel 2, as shown in FIG. 15. The speech sent by the Chinese site is the original Chinese speech and the English speech translated from the original speech; the speech sent by the English site is the Chinese speech translated from the original speech and the original English speech. The mixed Chinese-English site translates its original mixed Chinese-English speech into Chinese speech and English speech respectively; likewise, the mixed German-Japanese-French site translates its mixed German-Japanese-French speech into Chinese speech and English speech respectively.
In this embodiment, while the conference is in progress, the MCU and the terminals of the sites operate as follows:
(1) The terminal of each site acquires the original speech of its own site and the language types supported by the multilingual conference, and determines the kinds of translation needed (such as English-to-Chinese, Chinese-to-English, or others). The output speech of the two language types serves as the left- and right-channel sources and is encoded into a multi-channel encoded packet sent to the MCU as one audio stream.
(2) The MCU mixes the multi-channel encoded packets sent by the terminals to obtain a multi-channel mixed audio stream, encodes and packs it, and sends it to each terminal. The MCU's mixing can be performed quickly: only one mixer is needed to complete the multilingual mixing.
(3) Each terminal receives the multi-channel mixed encoding packet sent by the MCU, decodes it, splits it into the left and right channels, and listens to the multilingual speech as needed.
If the mixed audio streams in the mixed encoding packet that the MCU sends to a site's terminal are in the same language types as those supported by the site — for example, the multilingual audio received by the Chinese site in the Chinese-English multilingual conference is Chinese and English — then the Chinese, English, and mixed Chinese-English terminals in the conference can listen to or play the conference speech directly by selecting different channels.
A terminal for which the language types in the received mixed encoding packet do not match the site's language types can perform translation conversion according to its specific needs; for example, the German-Japanese-French site needs to translate the received Chinese mix or English mix before playback.
In this embodiment, every site has a translation function and can translate its original speech before output. After the MCU mixes the multilingual speech output by the terminals, only one mixer and one processing flow are needed; no speech recognition module needs to be added before mixing, since the language type can be identified from the frame type information. The participants at each site may use one or several languages for communication and can all communicate without barriers; the scheme is easy to implement and highly practical.
An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 16, including:
a code stream generating unit 1601, configured to acquire the original speech of the site where it is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type;
a first encoding unit 1602, configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet; and
a first sending unit 1603, configured to send the multi-channel encoded packet to a multipoint control unit.
The conference implementation apparatus further includes a language type determining unit 1604, configured to determine the language types supported by the conference. For example, the language type determining unit 1604 acquires the language types supported by the conference from the multipoint control unit.
The multi-channel audio stream generated by the code stream generating unit 1601 includes an audio stream of each language type supported by the conference.
In an embodiment, the code stream generating unit acquiring the original speech of the site where it is located and the translated speech corresponding to the original speech and generating the multi-channel audio stream includes one of the following, or a combination thereof:
when the language type of the original speech is one of the language types supported by the conference, using the audio stream corresponding to the original speech as the audio stream of one channel, and using the translated speech of the original speech in each of the conference-supported language types other than the language type of the original speech as the audio streams of the other channels, to generate the multi-channel audio stream;
when the language type of the original speech is not a language type supported by the conference, using the translated speech of the original speech in each language type supported by the conference as the audio stream of one channel respectively, to generate the multi-channel audio stream.
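A minimal sketch of these two generation rules, assuming a `translate` helper standing in for the site's translation function (all names are illustrative):

```python
def build_channel_streams(original_lang, original_stream, supported_langs, translate):
    """Return an ordered list of (lang, stream), one entry per channel."""
    if original_lang in supported_langs:
        # Case 1: the original language is supported by the conference.
        # Keep the original on one channel; translate into the others.
        channels = [(original_lang, original_stream)]
        channels += [(lang, translate(original_stream, lang))
                     for lang in supported_langs if lang != original_lang]
    else:
        # Case 2: the original language is not supported.
        # Every channel carries a translation into a supported language.
        channels = [(lang, translate(original_stream, lang))
                    for lang in supported_langs]
    return channels
```

Either way, the result is one audio stream per supported language, ready to be encoded into a single multi-channel encoded packet.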
An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 17, including:
a first receiving unit 1701, configured to receive a multi-channel mixed encoding packet sent by a multipoint control unit; and
a first parsing unit 1702, configured to parse the multi-channel mixed encoding packet to obtain a multi-channel mixed audio stream.
In an embodiment, the first parsing unit 1702 is further configured to determine the language type used by the audio stream of each channel according to language type information carried in the multi-channel mixed encoding packet, or determine the language type corresponding to each channel according to a preset correspondence between channels and language types.
In an embodiment, the first parsing unit 1702 obtaining the encoded data of each channel includes:
the first parsing unit 1702 obtaining, from a frame type field in the real-time transport protocol header extension of the multi-channel encoded packet, the language type information used by the audio stream of each channel; obtaining, from a frame length field in the real-time transport protocol header extension, the length of the encoded data of each channel; and obtaining the encoded data of each channel from the real-time transport protocol payload of the multi-channel encoded packet according to the length of the encoded data of each channel.
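A sketch of this parsing step, assuming a simplified header extension in which each channel contributes a one-byte frame type (language id) and a two-byte big-endian frame length, with the per-channel encoded data concatenated in the RTP payload in the same order. The field widths are assumptions for illustration:

```python
import struct

def parse_multichannel(ext_bytes, payload, n_channels):
    """Split an RTP payload into per-channel frames.

    ext_bytes: header-extension bytes holding (frame type, frame length)
        pairs, one pair per channel.
    payload: concatenated encoded data of all channels.
    Returns a list of (frame_type, encoded_data) per channel.
    """
    channels, ext_off, pay_off = [], 0, 0
    for _ in range(n_channels):
        frame_type, frame_len = struct.unpack_from("!BH", ext_bytes, ext_off)
        ext_off += 3
        channels.append((frame_type, payload[pay_off:pay_off + frame_len]))
        pay_off += frame_len
    return channels
```

The frame type identifies the language of each channel, and the frame length tells the parser where one channel's encoded data ends and the next begins.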
An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 18, including:
a second receiving unit 1801, configured to receive multi-channel encoded packets sent by terminals participating in the conference;
a second parsing unit 1802, configured to parse the multi-channel encoded packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;
a mixing unit 1803, configured to determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the determined terminals' multi-channel audio streams to obtain a multi-channel mixed audio stream, in which one channel corresponds to a mixed audio stream of one language type;
a second encoding unit 1804, configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoding packet; and
a second sending unit 1805, configured to send the multi-channel mixed encoding packet to the terminals participating in the conference.
An embodiment of the present application provides a conference implementation device, including a memory and a processor, the memory storing a program which, when read and executed by the processor, implements any of the conference implementation methods described above.
An embodiment of the present application provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the conference implementation methods described above.
The computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
An embodiment of the present application provides a conference implementation system, as shown in FIG. 19, including an MCU 1901 and at least one terminal 1902, wherein:
the terminal 1902 is configured to acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive a multi-channel mixed encoding packet sent by the multipoint control unit; and parse the multi-channel mixed encoding packet to obtain a multi-channel mixed audio stream; and
the MCU 1901 is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the determined terminals' multi-channel audio streams to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoding packet; and send the multi-channel mixed encoding packet to the terminals participating in the conference.
For the detailed implementation of the terminal and the MCU, refer to the foregoing embodiments; details are not repeated here.
Although the embodiments disclosed in the present application are as described above, they are adopted only to facilitate understanding of the present application and are not intended to limit it. Any person skilled in the art to which the present application pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present application; however, the scope of patent protection of the present application shall still be subject to the scope defined by the appended claims.
Industrial applicability
Unlike the related art, in which the original speech and the translated speech are transmitted separately over different channels, the embodiments of the present application generate a multi-channel encoded packet from speech of different language types before sending, so that only one channel is needed, thereby reducing port occupation.

Claims (22)

  1. A conference implementation method, comprising:
    acquiring original speech of a conference site where a terminal is located and translated speech corresponding to the original speech, and generating a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type;
    encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and
    sending the multi-channel encoded packet to a multipoint control unit.
  2. The conference implementation method according to claim 1, wherein the method further comprises: determining language types supported by the conference;
    wherein the multi-channel audio stream includes an audio stream of each language type supported by the conference.
  3. The conference implementation method according to claim 2, wherein determining the language types supported by the conference comprises:
    obtaining the language types supported by the conference from the multipoint control unit.
  4. The conference implementation method according to claim 2, wherein acquiring the original speech of the conference site where the terminal is located and the translated speech corresponding to the original speech, and generating the multi-channel audio stream, comprises one or a combination of the following:
    when the language type of the original speech is one of the language types supported by the conference, using the audio stream corresponding to the original speech as the audio stream of one channel, and using the translated speech of the original speech in each language type supported by the conference other than the language type of the original speech as the audio streams of the other channels, to generate the multi-channel audio stream; and
    when the language type of the original speech is not a language type supported by the conference, using the translated speech of the original speech in each language type supported by the conference as the audio stream of one channel respectively, to generate the multi-channel audio stream.
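The channel-assembly rule of this claim can be sketched as follows. This is an illustrative sketch only: the data representation (strings for language types, a dict from language type to audio stream) is an assumption, not the embodiments' actual data structures.

```python
def build_channel_streams(original_lang, original_stream, translations, supported_langs):
    """Map each conference-supported language type to one channel's audio stream.

    original_lang:   language type of the site's original speech
    original_stream: audio stream of the original speech
    translations:    dict mapping language type -> translated audio stream
    supported_langs: language types supported by the conference
    """
    channels = {}
    if original_lang in supported_langs:
        # Original speech occupies its own channel; translations fill the rest.
        channels[original_lang] = original_stream
        for lang in supported_langs:
            if lang != original_lang:
                channels[lang] = translations[lang]
    else:
        # Original language not supported: every channel carries a translation.
        for lang in supported_langs:
            channels[lang] = translations[lang]
    return channels
```

Each key of the returned dict then becomes one channel of the multi-channel audio stream described in claim 1.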
  5. The conference implementation method according to any one of claims 1 to 4, wherein each channel in the multi-channel encoded packet corresponds to an audio stream of a preset language type; or the multi-channel encoded packet carries language type information for each channel.
  6. The conference implementation method according to any one of claims 1 to 4, wherein the multi-channel encoded packet is a real-time transport protocol message comprising a real-time transport protocol header, a real-time transport protocol header extension, and a real-time transport protocol payload; the real-time transport protocol header extension includes a length field and one or more frame information fields, the length field carrying the number of frame information fields; each frame information field includes a frame type field and a frame length field, the frame type field carrying the language type of one channel and the frame length field carrying the length of the encoded data of that channel; and the real-time transport protocol payload carries the encoded data of each channel.
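The packet layout described in claim 6 can be sketched as follows. The sketch builds only the header-extension and payload portions; the fixed real-time transport protocol header, the 2-byte field widths, and the numeric language-type codes are assumptions chosen for illustration, not the claim's exact on-wire format.

```python
import struct

def pack_header_extension(channels):
    """channels: list of (language_type_code, encoded_data) tuples, one per channel.

    Header extension: a length field carrying the number of frame information
    fields, then one (frame type, frame length) pair per channel.
    Payload: the concatenated encoded data of every channel.
    """
    ext = struct.pack("!H", len(channels))  # length field: number of frame info fields
    payload = b""
    for lang_code, data in channels:
        ext += struct.pack("!HH", lang_code, len(data))  # frame type, frame length
        payload += data
    return ext, payload
```

A receiver can walk the frame information fields in order and slice the payload by the carried frame lengths to recover each channel's encoded data.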
  7. A conference implementation apparatus, comprising:
    a code stream generating unit configured to acquire original speech of a conference site where a terminal is located and translated speech corresponding to the original speech, and generate a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type;
    a first encoding unit configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet; and
    a first sending unit configured to send the multi-channel encoded packet to a multipoint control unit.
  8. A conference implementation method, comprising:
    receiving multi-channel encoded packets sent by terminals participating in a conference, and parsing the packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;
    determining terminals whose audio needs to be mixed, and mixing the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;
    encoding the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and
    sending the multi-channel mixed encoded packet to the terminals participating in the conference.
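The per-language mixing step can be sketched as below. Representing audio as lists of 16-bit PCM samples and mixing by saturating addition is an illustrative simplification; the embodiments do not specify a particular mixing algorithm.

```python
def mix_same_language(terminal_streams):
    """terminal_streams: list of dicts, one per terminal selected for mixing,
    each mapping language type -> list of PCM samples (ints).
    Frames are assumed to be of equal length per language.
    Returns one dict mapping language type -> mixed sample list,
    summing samples per language and clamping to the 16-bit range."""
    mixed = {}
    for streams in terminal_streams:
        for lang, samples in streams.items():
            acc = mixed.setdefault(lang, [0] * len(samples))
            for i, s in enumerate(samples):
                acc[i] = max(-32768, min(32767, acc[i] + s))
    return mixed
```

The result is one mixed stream per language type, which maps naturally onto the channels of the multi-channel mixed audio stream.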
  9. The conference implementation method according to claim 8, wherein:
    the multi-channel mixed encoded packet carries language type information indicating the language type used by the mixed audio stream of each channel; or each channel in the multi-channel mixed encoded packet corresponds to a mixed audio stream of a preset language type.
  10. The conference implementation method according to claim 8, wherein the multi-channel mixed encoded packet is a real-time transport protocol message comprising a real-time transport protocol header, a real-time transport protocol header extension, and a real-time transport protocol payload; the real-time transport protocol header extension includes a length field and one or more frame information fields, the length field indicating the number of frame information fields; each frame information field includes a frame type field and a frame length field, the frame type field carrying the language type of one channel and the frame length field carrying the length of the encoded data of that channel; and the real-time transport protocol payload carries the encoded data of each channel.
  11. The conference implementation method according to any one of claims 8 to 10, wherein the method further comprises: determining language types supported by the conference, and sending the language types supported by the conference to the terminals participating in the conference.
  12. The conference implementation method according to claim 11, wherein determining the language types supported by the conference comprises at least one of the following:
    when a conference site has the highest priority, taking the speaking language of that site as a language type supported by the conference;
    when the speaking language of a conference site is the language type used most among all sites in the conference, taking the speaking language of that site as a language type supported by the conference; and
    when the speaking language of a conference site can be mutually translated with those of the other sites, taking the speaking language of that site as a language type supported by the conference.
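The "most used language" rule from this claim can be sketched as follows. The data model (a flat list holding the speaking language of each site) and the tie-breaking by first appearance are illustrative assumptions.

```python
from collections import Counter

def most_used_language(site_languages):
    """site_languages: list of the speaking language of each conference site.
    Returns the language type used by the most sites, as one candidate
    language type to be supported by the conference.
    Ties are broken by first appearance in the list (an illustrative choice)."""
    counts = Counter(site_languages)
    # max() returns the first element reaching the maximum count.
    return max(site_languages, key=lambda lang: counts[lang])
```

The highest-priority-site and mutual-translation rules would similarly map site attributes to candidate language types; they are omitted here because they depend on conference policy data not specified in the claim.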
  13. The conference implementation method according to claim 11, wherein the multi-channel mixed audio stream includes a mixed audio stream of each language type supported by the conference.
  14. A conference implementation apparatus, comprising:
    a second receiving unit configured to receive multi-channel encoded packets sent by terminals participating in a conference;
    a second parsing unit configured to parse the multi-channel encoded packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;
    a mixing unit configured to determine terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;
    a second encoding unit configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and
    a second sending unit configured to send the multi-channel mixed encoded packet to the terminals participating in the conference.
  15. A conference implementation method, comprising:
    receiving a multi-channel mixed encoded packet sent by a multipoint control unit; and
    parsing the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  16. The conference implementation method according to claim 15, wherein the method further comprises: determining the language type of each channel according to language type information carried in the multi-channel mixed encoded packet, or determining the language type corresponding to each channel according to a preset correspondence between channels and language types.
  17. The conference implementation method according to claim 15 or 16, wherein parsing the multi-channel mixed encoded packet to obtain the multi-channel mixed audio stream comprises:
    when the multi-channel mixed encoded packet is a real-time transport protocol message, obtaining the language type information of the audio stream of each channel from the frame type field in the real-time transport protocol header extension of the packet, obtaining the length of the encoded data of each channel from the frame length field in the real-time transport protocol header extension, extracting the encoded data of each channel from the real-time transport protocol payload of the packet according to the length of the encoded data of each channel, and decoding the encoded data of each channel to obtain the mixed audio stream of that channel.
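The parsing procedure in this claim is the inverse of the packing in claim 6 and can be sketched as follows; the 2-byte field widths and numeric language-type codes are the same illustrative assumptions, not the claim's exact on-wire format.

```python
import struct

def parse_header_extension(ext, payload):
    """Recover (language_type_code, encoded_data) per channel.

    ext:     header extension bytes -- a 2-byte count of frame information
             fields, followed by one (frame type, frame length) pair each.
    payload: concatenated encoded data of all channels, in channel order.
    """
    count = struct.unpack_from("!H", ext, 0)[0]
    channels, ext_off, pay_off = [], 2, 0
    for _ in range(count):
        # Frame type carries the channel's language type; frame length
        # tells how many payload bytes belong to this channel.
        lang_code, length = struct.unpack_from("!HH", ext, ext_off)
        ext_off += 4
        channels.append((lang_code, payload[pay_off:pay_off + length]))
        pay_off += length
    return channels
```

Each recovered byte string would then be passed to the audio decoder to obtain that channel's mixed audio stream.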
  18. A conference implementation apparatus, comprising:
    a first receiving unit configured to receive a multi-channel mixed encoded packet sent by a multipoint control unit; and
    a first parsing unit configured to parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  19. A conference implementation device, comprising a memory and a processor, wherein the memory stores a program that, when read and executed by the processor, implements the conference implementation method according to any one of claims 1 to 6 and 15 to 17.
  20. A conference implementation device, comprising a memory and a processor, wherein the memory stores a program that, when read and executed by the processor, implements the conference implementation method according to any one of claims 8 to 13.
  21. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the conference implementation method according to any one of claims 1 to 6, 8 to 13, and 15 to 17.
  22. A conference implementation system, comprising a multipoint control unit and at least one terminal, wherein:
    the terminal is configured to: acquire original speech of a conference site where the terminal is located and translated speech corresponding to the original speech, and generate a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive a multi-channel mixed encoded packet sent by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream; and
    the multipoint control unit is configured to: receive multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and send the multi-channel mixed encoded packet to the terminals participating in the conference.
PCT/CN2018/096532 2017-12-29 2018-07-20 Conference implementation method, device, apparatus and system, and computer readable storage medium WO2019128204A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711482042.3A CN108076306B (en) 2017-12-29 2017-12-29 Conference implementation method, device, equipment and system and computer readable storage medium
CN201711482042.3 2017-12-29

Publications (1)

Publication Number Publication Date
WO2019128204A1 true WO2019128204A1 (en) 2019-07-04

Family

ID=62156131

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/096532 WO2019128204A1 (en) 2017-12-29 2018-07-20 Conference implementation method, device, apparatus and system, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108076306B (en)
WO (1) WO2019128204A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014998A (en) * 2021-02-03 2021-06-22 深圳创维-Rgb电子有限公司 Audio output method, device, television and computer readable storage medium
CN113472743A (en) * 2021-05-28 2021-10-01 引智科技(深圳)有限公司 Multilingual conference sharing and personalized editing method
CN114827101A (en) * 2022-04-13 2022-07-29 京东科技信息技术有限公司 Audio processing method and device, electronic equipment and storage medium

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108076306B (en) * 2017-12-29 2021-09-17 中兴通讯股份有限公司 Conference implementation method, device, equipment and system and computer readable storage medium
CN109379490B (en) * 2018-09-30 2021-02-05 Oppo广东移动通信有限公司 Audio playing method and device, electronic equipment and computer readable medium
CN109348306A (en) * 2018-11-05 2019-02-15 努比亚技术有限公司 Video broadcasting method, terminal and computer readable storage medium
CN111354356B (en) * 2018-12-24 2024-04-30 北京搜狗科技发展有限公司 Voice data processing method and device
CN110166729B (en) * 2019-05-30 2021-03-02 上海赛连信息科技有限公司 Cloud video conference method, device, system, medium and computing equipment
CN110324565A (en) * 2019-06-06 2019-10-11 浙江华创视讯科技有限公司 Audio-frequency inputting method, device, conference host, storage medium and electronic device
CN112218020B (en) * 2019-07-09 2023-03-21 海信视像科技股份有限公司 Audio data transmission method and device for multi-channel platform
WO2021004048A1 (en) * 2019-07-09 2021-01-14 海信视像科技股份有限公司 Display device and audio data transmission method
CN110677406A (en) * 2019-09-26 2020-01-10 上海译牛科技有限公司 Simultaneous interpretation method and system based on network
CN113743132A (en) * 2020-05-14 2021-12-03 大富科技(安徽)股份有限公司 Intelligent terminal, translation method thereof and storage medium
CN112188144B (en) * 2020-09-14 2023-03-24 浙江华创视讯科技有限公司 Audio transmission method and device, storage medium and electronic device
CN115314660A (en) * 2021-05-07 2022-11-08 阿里巴巴新加坡控股有限公司 Processing method and device for audio and video conference
CN114912469B (en) * 2022-05-26 2023-03-31 东北农业大学 Information communication method for converting Chinese and English languages and electronic equipment
CN117749947A (en) * 2023-12-22 2024-03-22 广东保伦电子股份有限公司 Multi-terminal protocol-based multi-party call processing method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1937664A (en) * 2006-09-30 2007-03-28 华为技术有限公司 System and method for realizing multi-language conference
US20090251529A1 (en) * 2008-04-02 2009-10-08 Freeport Technologies Network management server for managing multiple operating modes of a conferencing network with different sets of policies
CN101702762A (en) * 2006-09-30 2010-05-05 华为技术有限公司 Multipoint control unit for realizing multi-language conference and conference terminal
CN104853272A (en) * 2015-05-13 2015-08-19 视讯联合科技股份有限公司 Multifunctional DSP wireless headphone with interconnected interaction
CN108076306A (en) * 2017-12-29 2018-05-25 中兴通讯股份有限公司 Meeting implementation method, device, equipment and system, computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102226944B (en) * 2011-05-25 2014-11-12 贵阳朗玛信息技术股份有限公司 Audio mixing method and equipment thereof
KR20130101823A (en) * 2012-03-06 2013-09-16 한승묵 Digital device and video call performing method
CN102821259B (en) * 2012-07-20 2016-12-21 冠捷显示科技(厦门)有限公司 There is TV system and its implementation of multi-lingual voiced translation


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014998A (en) * 2021-02-03 2021-06-22 深圳创维-Rgb电子有限公司 Audio output method, device, television and computer readable storage medium
CN113014998B (en) * 2021-02-03 2023-08-04 深圳创维-Rgb电子有限公司 Audio output method, device, television and computer readable storage medium
CN113472743A (en) * 2021-05-28 2021-10-01 引智科技(深圳)有限公司 Multilingual conference sharing and personalized editing method
CN113472743B (en) * 2021-05-28 2023-05-26 引智科技(深圳)有限公司 Multilingual conference sharing and personalized editing method
CN114827101A (en) * 2022-04-13 2022-07-29 京东科技信息技术有限公司 Audio processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108076306A (en) 2018-05-25
CN108076306B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
WO2019128204A1 (en) Conference implementation method, device, apparatus and system, and computer readable storage medium
US8531994B2 (en) Audio processing method, system, and control server
KR100880150B1 (en) Multi-point video conference system and media processing method thereof
CN113746808B (en) Converged communication method, gateway, electronic equipment and storage medium for online conference
US10068581B2 (en) Method and arrangement for providing a backwards compatible payload format
US8228363B2 (en) Method and system for conducting continuous presence conferences
US8169937B2 (en) Managing a packet switched conference call
US8600530B2 (en) Method for determining an audio data spatial encoding mode
CN109640028B (en) Method and device for carrying out conference combining on multiple video networking terminals and multiple Internet terminals
US20050262260A1 (en) Two-way audio/video conference system
WO2012155660A1 (en) Telepresence method, terminal and system
WO2010083737A1 (en) Method and apparatus for processing voice signal, method and apparatus for transmitting voice signal
CN110062191B (en) Multi-party group meeting method and server based on video network
WO2012167638A1 (en) Media data control method and apparatus
CN110475094B (en) Video conference processing method and device and readable storage medium
CN111327580A (en) Message transmission method and device
WO2021073155A1 (en) Video conference method, apparatus and device, and storage medium
CN103188403A (en) Voice gateway online monitoring method
CN114638237A (en) Method, device and system for realizing simultaneous interpretation
CN109151559B (en) Multi-party conversation method and household internet protocol shutdown set top box all-in-one machine
CN109150917B (en) Video synthesis control method and system based on SIP (Session initiation protocol)
WO2024134010A1 (en) Complexity reduction in multi-stream audio
CN117859313A (en) Supporting quality of service for media communications
Hatanaka et al. RTP Payload Format for the Adaptive TRansform Acoustic Coding (ATRAC) Family
Zhang et al. Research on user applying mode for video conference system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18894454

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18894454

Country of ref document: EP

Kind code of ref document: A1