WO2019128204A1 - Conference implementation method, device, apparatus and system, and computer readable storage medium - Google Patents


Info

Publication number: WO2019128204A1
Authority: WIPO (PCT)
Prior art keywords: channel, conference, audio stream, language type, terminal
Application number: PCT/CN2018/096532
Other languages: French (fr), Chinese (zh)
Inventor: 胡景欣
Original Assignee: ZTE Corporation (中兴通讯股份有限公司)
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2019128204A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/152Multipoint control units therefor

Definitions

  • the present application relates to communication technologies, and more particularly to a conference implementation method, apparatus, device and system, and a computer readable storage medium.
  • Video conferencing is a means of multimedia communication that uses television equipment and communication networks to hold meetings, enabling simultaneous interaction of images, voice, and data between two or more locations.
  • A video conferencing system generally consists of video terminals, a transmission network, and a Multipoint Control Unit (MCU).
  • The basic voice service of a video conference involves two parties, who can hear each other's voices. In a multi-party conference service, there may be more than two parties at the same time: the voice signal of any one party can be transmitted to all other parties, and each party can hear the voices of multiple other parties simultaneously.
  • When multiple languages are used in a multi-party video conference service, the speaker's speech content needs to be translated as needed, so that participants from different countries and regions who use different languages can communicate with each other.
  • In a related solution, each terminal translates the original voice of its site according to its own translation language type and outputs two channels of audio: one carries the source voice and the other carries the translated voice, so two transmission channels are needed.
  • The MCU judges the language type of the translated speech uploaded by each translation site.
  • Multiple mixers are needed to mix the translated speech according to the recognized language type and to mix the original speech.
  • The MCU then needs to send two channels of audio data to each venue: one is the mix of the original voices of the venues, and the other is the translation mix in the language type of the destination venue.
  • The network traffic of the above solution is relatively large, which is likely to cause network congestion.
  • At least one embodiment of the present application provides a conference implementation method, apparatus, device, and system, and a computer readable storage medium, which reduce network resource occupation and improve conference efficiency.
  • At least one embodiment of the present application provides a conference implementation method, including: acquiring the original voice of the site where a terminal is located and the translated voice corresponding to the original voice, and generating a multi-channel audio stream, where one channel corresponds to an audio stream of one language type; encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and transmitting the multi-channel encoded packet to a multipoint control unit.
  • At least one embodiment of the present application provides a conference implementation apparatus, including:
  • a code stream generating unit, configured to acquire the original voice of the site where the terminal is located and the translated voice corresponding to the original voice, and generate a multi-channel audio stream, where one channel corresponds to an audio stream of one language type;
  • a first encoding unit, configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet;
  • a first transmitting unit, configured to transmit the multi-channel encoded packet to the multipoint control unit.
  • At least one embodiment of the present application provides a conference implementation method, including: receiving multi-channel encoded packets sent by terminals participating in the conference, and parsing them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determining the terminals whose audio needs to be mixed, and mixing the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encoding the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and transmitting the multi-channel mixed encoded packet to the terminals participating in the conference.
  • At least one embodiment of the present application provides a conference implementation apparatus, including:
  • a second receiving unit, configured to receive a multi-channel encoded packet sent by a terminal participating in the conference;
  • a second parsing unit, configured to parse the multi-channel encoded packet to obtain a multi-channel audio stream, where each channel of the multi-channel audio stream corresponds to an audio stream of one language type;
  • a mixing unit, configured to determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;
  • a second encoding unit, configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet;
  • a second transmitting unit, configured to send the multi-channel mixed encoded packet to the terminals participating in the conference.
  • At least one embodiment of the present application provides a conference implementation method, including: receiving a multi-channel mixed encoded packet sent by a multipoint control unit; and parsing the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  • At least one embodiment of the present application provides a conference implementation apparatus, including:
  • a first receiving unit, configured to receive the multi-channel mixed encoded packet sent by the multipoint control unit;
  • a first parsing unit, configured to parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  • At least one embodiment of the present application provides a conference implementation device including a memory and a processor, the memory storing a program that, when read and executed by the processor, implements the conference implementation method according to any of the above embodiments.
  • An embodiment of the present application provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the conference implementation method according to any of the foregoing embodiments.
  • An embodiment of the present application provides a conference implementation system, including: a multipoint control unit and at least one terminal, where:
  • the terminal is configured to acquire the original voice of the site where the terminal is located and the translated voice corresponding to the original voice, and generate a multi-channel audio stream, where one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; transmit the multi-channel encoded packet to the multipoint control unit; receive the multi-channel mixed encoded packet transmitted by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream;
  • the multipoint control unit is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and transmit the multi-channel mixed encoded packet to the terminals participating in the conference.
  • The conference implementation method provided in at least one embodiment of the present application differs from the related art, in which the original voice and the translated voice are transmitted over separate channels.
  • Voices of different language types are combined into one multi-channel encoded packet, so only one channel is required, reducing the occupation of ports.
  • FIG. 1 is a schematic diagram of networking of a multilingual conference according to an embodiment of the present application.
  • FIG. 2 is a flowchart of a method for implementing a conference according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of processing on the terminal side of a method for implementing a conference according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a multilingual conference language negotiation according to an embodiment of the present application.
  • FIG. 5 is a flowchart of a terminal call based on the H323 protocol according to an embodiment of the present application.
  • FIG. 6 is a flowchart of a SIP-based terminal call according to an embodiment of the present application.
  • FIG. 7(a) is a schematic diagram of a format of an RTP message provided in the related art.
  • FIG. 7(b) is a schematic diagram of an extension of an RTP message according to an embodiment of the present application.
  • FIG. 8( a ) is a schematic diagram of a multi-session conference multi-channel encoding and packaging process according to an embodiment of the present application.
  • FIG. 8(b) is a schematic diagram of a multi-session conference multi-channel unpacking decoding process according to an embodiment of the present application.
  • FIG. 9 is a flowchart of an implementation of a conference implementation method on the MCU side according to the first embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of an MCU mixing process according to an embodiment of the present application.
  • FIG. 11 is a flowchart (terminal side) of a method for implementing a conference according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a multi-session conference television application scenario according to an embodiment of the present disclosure.
  • FIG. 16 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.
  • FIG. 17 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.
  • FIG. 18 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.
  • FIG. 19 is a schematic diagram of a conference implementation system according to an embodiment of the present application.
  • In a related solution, the terminal connects the original voice channel and the translated voice channel directly to the MCU and uses separate channels for transmission: the original voice is transmitted through the original voice channel and the translated voice through the translated voice channel. Network traffic is increased, which easily causes network congestion, and the number of audio input and output interfaces occupied between the terminal and the MCU is doubled.
  • In another related solution, the MCU mixes the translated speech according to the language type of the translated speech and needs to create multiple mixing processes. As the number of speech types increases, the number of mixers increases linearly, resulting in repeated calculations in the mixing process and increased CPU usage.
  • FIG. 1 is a schematic diagram of networking of a multilingual conference according to an embodiment of the present application.
  • a system supporting multi-session conference television includes at least one terminal and an MCU connected to each terminal.
  • the terminal and the MCU are connected through a communication network.
  • the terminal is connected to the MCU through a standard interface to realize mixing and exchange of images and voice.
  • the MCU determines the language type supported by the conference according to the voice requirements and translation capabilities of each conference site.
  • the language type supported by the conference may be one or more.
  • The communication languages used by the participants at the various sites in the conference can be diverse.
  • Each venue can set up translators as needed to translate the voices sent and received.
  • Each venue can translate and convert the voice of the conference according to the type of conference language determined by the conference.
  • The conference mix received by each conference site can be listened to directly, or played at the venue after conversion into a language type that the venue can understand.
  • The method by which the terminal converts the original voice of the site into translated voice of a conference language type may be manual, or may adopt a non-manual intelligent manner implemented in software or hardware.
  • Current international conferences basically adopt manual oral translation (interpreting) of the speaker's speech.
  • In the embodiments below, manual translation is taken as an example, but the translation method or means is not limited thereto.
  • An embodiment of the present application provides a conference implementation method, as shown in FIG. 2, including:
  • Step 201 Acquire the original voice of the site where the terminal is located and the translated voice corresponding to the original voice, generate a multi-channel audio stream, and one channel corresponds to an audio stream of one language type;
  • Step 202 Encode the multi-channel audio stream to obtain a multi-channel encoded packet.
  • Step 203 Send the multi-channel encoded packet to the MCU.
  • The conference implementation method provided by this embodiment transmits audio of different language types to the MCU combined into a multi-channel encoded packet.
  • The multi-channel encoded packet can be transmitted through one audio channel; therefore, only one audio channel is required to transmit multiple audio streams of different language types. Compared with an implementation using multiple channels, network traffic is reduced, resource occupation is reduced, and communication efficiency is improved.
  • In an exemplary embodiment, before step 201, the method further includes:
  • Step 200: Obtain the language types supported by the conference;
  • the translated voice in the step 201 is a translated voice of a language type supported by the conference corresponding to the original voice.
  • the multi-point control unit negotiates with the terminal to determine the language type supported by the conference, and after determining the language type supported by the conference, the multi-point control unit sends the language type supported by the conference to the terminal. Therefore, one way for the terminal to obtain the language type supported by the conference is that the terminal receives the language type supported by the conference sent by the multipoint control unit.
  • the audio stream of each language type supported by the conference is included in the multi-channel audio stream.
  • For example, if the conference supports three language types, the multi-channel audio stream includes three channels, and each channel is an audio stream of one language type.
  • In one implementation, the terminal acquires the original voice of the site and the translated voice corresponding to the original voice, and generates the multi-channel audio stream as follows:
  • when the language type of the original voice is one of the language types supported by the conference, the audio stream corresponding to the original voice is used as the audio stream of one channel, and the translated voices of the other conference-supported language types are used as the audio streams of the other channels, to generate the multi-channel audio stream;
  • for example, if the original voice of the venue is Chinese and the conference supports Chinese and English, the Chinese voice is used as the audio stream of one channel, the English speech obtained by translating the Chinese speech is used as the audio stream of another channel, and a multi-channel audio stream composed of the Chinese speech and the English speech is obtained;
  • when the language type of the original voice is not supported by the conference, the translated voice of each language type supported by the conference is used as the audio stream of one channel, to generate the multi-channel audio stream;
  • for example, if the original voice of the venue is French and the conference supports Chinese and English, the Chinese voice translated from the original French voice is used as the audio stream of one channel, the English voice translated from the original French voice is used as the audio stream of another channel, and a multi-channel audio stream composed of the Chinese speech and the English speech is obtained.
  • The language type corresponding to each channel can be preset, or can be indicated by language type information. Specifically:
  • each channel of the multi-channel encoded packet corresponds to an audio stream of a preset language type; or, the multi-channel encoded packet carries language type information for each channel. For example, if the first channel corresponds to Chinese and the second channel corresponds to English, when the multi-channel encoded packet is formed, the audio stream of the Chinese voice is used as the audio stream of the first channel, and the audio stream of the English voice is used as the audio stream of the second channel.
  • each channel carries language type information indicating the language type of the audio stream for that channel.
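The channel-assignment rules above can be sketched as follows. This is a minimal illustration with hypothetical function and data names; the byte strings stand in for real audio streams:

```python
def build_channel_streams(original_lang, original_pcm, translations, conference_langs):
    """Assign language-typed audio streams to channels.

    translations: dict mapping language type -> translated audio stream.
    Returns a list of (language_type, stream), one entry per channel.
    """
    channels = []
    # If the original voice is in a conference-supported language,
    # it occupies one channel itself.
    if original_lang in conference_langs:
        channels.append((original_lang, original_pcm))
    # Translations into the remaining supported languages fill the other channels.
    for lang in conference_langs:
        if lang == original_lang:
            continue  # already covered by the original voice
        channels.append((lang, translations[lang]))
    return channels

# Chinese-speaking site, conference supports Chinese and English:
streams = build_channel_streams("zh", b"<zh pcm>", {"en": b"<en pcm>"}, ["zh", "en"])
# streams -> [("zh", b"<zh pcm>"), ("en", b"<en pcm>")]

# French-speaking site (French not supported): both channels carry translations.
streams2 = build_channel_streams("fr", b"<fr pcm>",
                                 {"zh": b"<zh pcm>", "en": b"<en pcm>"},
                                 ["zh", "en"])
# streams2 -> [("zh", b"<zh pcm>"), ("en", b"<en pcm>")]
```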
  • FIG. 3 is a schematic diagram of the processing performed by a terminal supporting a multilingual conference according to an embodiment of the present application.
  • The original voice of the conference speaker is picked up by the microphone of the conference site. If the original voice of the site is of only one language type and is consistent with a language type supported by the conference, the original voice can be used directly as the sound source of one channel of the terminal. If the original voice of the site is of multiple language types, or is inconsistent with the language types supported by the conference, translation must be performed through the translation mechanism set up at the site, converting the original voice of the site into translated voice of the language types supported by the conference.
  • Each type of translated speech corresponds to one channel input of the terminal; the channels are multi-channel encoded by the encoder of the terminal and packaged to form one audio output, which is sent to the MCU. Since each terminal has already translated the original voice of its site, the MCU does not need to perform translation: it directly performs multi-channel mixing on the translated voices of the terminals and then transmits the mixing result to each terminal (see the embodiments below for details). After receiving the multi-channel mixed encoded packet sent by the MCU, the terminal first unpacks it to obtain the encoded data of each channel, then decodes the encoded data of each channel, and then determines which channel's language type is the same as the language type supported by the local venue.
  • The conference mix can be translated and converted as needed: it can be output via the speaker or connected to different peripherals, or it can be converted to text by the terminal's internal voice-to-text conversion module and output as subtitles on the screen.
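The terminal-side step of selecting the channel whose language matches the local venue, described above, can be sketched as follows (hypothetical names; the byte strings stand in for decoded audio):

```python
def select_channel(decoded_channels, local_lang):
    """Pick the decoded channel whose language type matches the local venue.

    decoded_channels: list of (language_type, audio) obtained after
    unpacking and decoding the multi-channel mixed encoded packet.
    """
    for lang, audio in decoded_channels:
        if lang == local_lang:
            return audio
    # No matching language: the venue may translate/convert the mix as needed.
    return None

# A venue that listens in English picks the English channel of the mix:
out = select_channel([("zh", b"<zh mix>"), ("en", b"<en mix>")], "en")
# out -> b"<en mix>"
```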
  • In the negotiation process of a multilingual conference, the MCU first negotiates with the terminals to determine the language types supported by the conference. The number of channels in the MCU mix equals the number of language types supported by the conference, and the multi-language multi-channel mixing is completed in the same mixer.
  • FIG. 4 is a flowchart of a negotiation process of a language type supported by a conference according to an embodiment of the present disclosure. As shown in Figure 4, it includes:
  • Step 401: In the signaling control phase when the conference is held, the MCU starts conference language negotiation with each terminal and obtains the information of each site, including at least one of the following: the site priority, the listening/speaking attribute of the site (speaking end only, listening end only, or both speaking and listening), the language type supported by the site (the speaking language of the site), the translation capability of the site, and so on.
  • Each terminal can freely choose the language to listen to from the different languages. Speaking rights can be restricted: for example, a terminal with an ordinary identity can only speak in the default language, while a terminal with a translator or advanced identity can choose to speak in other languages.
  • Step 402 The MCU traverses all terminals participating in the conference
  • Step 403 it is determined whether all the terminals participating in the conference have been traversed, and if so, the language types L1 to Ln supported by the conference are obtained; otherwise, the process proceeds to step 404;
  • Step 404 determining whether the priority of the site where the terminal is located is the highest, if yes, go to step 407, otherwise, go to step 405;
  • Step 405: Determine whether the speaking language of the site where the terminal is located is the language type most used by all the sites in the conference; if yes, go to step 407; otherwise, go to step 406;
  • Step 406: Determine whether the speaking language used by the site can be translated to and from the other sites; if yes, go to step 407; otherwise, return to step 402.
  • English can be used as a language type supported by the conference.
  • Step 407: The speaking language of the venue is used as a language type supported by the conference, and the process returns to step 402.
  • The above manner of negotiating the language types supported by the conference is only an example; the language types supported by the conference may be determined in other ways, for example, by directly using a default language type.
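The negotiation of steps 401-407 can be sketched roughly as follows. The data model and function names are hypothetical, and ties between equally used languages are resolved arbitrarily:

```python
from collections import Counter

def negotiate_conference_languages(sites):
    """Collect the conference-supported language types.

    sites: list of dicts with keys 'priority', 'language', 'can_translate'.
    A site's speaking language is adopted if the site has the highest
    priority, or its language is the most widely used, or the site can
    translate to/from the other sites (steps 404-407 above).
    """
    usage = Counter(s["language"] for s in sites)
    most_used = usage.most_common(1)[0][0]
    top_priority = max(s["priority"] for s in sites)
    supported = []
    for s in sites:
        if (s["priority"] == top_priority
                or s["language"] == most_used
                or s["can_translate"]):
            if s["language"] not in supported:
                supported.append(s["language"])
    return supported

sites = [
    {"priority": 2, "language": "zh", "can_translate": False},  # highest priority
    {"priority": 1, "language": "en", "can_translate": True},   # has translation
    {"priority": 1, "language": "fr", "can_translate": False},  # excluded
]
# negotiate_conference_languages(sites) -> ["zh", "en"]
```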
  • The call procedures of the terminal with the newly added messages are shown in FIG. 5 and FIG. 6, respectively.
  • FIG. 5 shows the H.323 protocol call flow.
  • the call flow can be roughly divided into Q931 phase and H245 phase.
  • The following is a description of the flow of the MCU calling the terminal:
  • Step 501: First enter the Q931 phase, that is, the MCU sends a setup message;
  • Steps 502-504: After receiving the call, the terminal responds in sequence with call proceeding (callProceeding), alerting, and connect messages to the MCU. At this point, the Q931 phase is completed, and the H245 phase is entered.
  • The H245 phase mainly includes three parts: capability set interaction, master-slave determination, and opening logical channels.
  • The capability interaction includes mutual terminal capability set (terminalCapabilitySet) and terminal capability set acknowledgement (terminalCapabilitySetAck) messages.
  • The master-slave determination process includes mutual master-slave determination request (masterSlaveDetermination) and acknowledgement (masterSlaveDeterminationAck) messages; opening logical channels includes mutual open logical channel requests (openLogicalChannel) and open logical channel responses (openLogicalChannelAck).
  • The above is the standard call setup flow.
  • On this basis, a non-standard private message is added.
  • The language type supported by the conference described in this application is notified to the terminal by sending a nonStandard message. Specifically, the flow includes:
  • Step 505 The MCU sends a terminalCapabilitySet message to the terminal.
  • Step 506 the MCU sends a masterSlaveDetermination message to the terminal.
  • Step 507 the terminal sends a terminalCapabilitySet message to the MCU;
  • Step 508 the MCU sends a terminalCapabilitySetAck message to the terminal;
  • Step 509 the terminal sends a masterSlaveDetermination message to the MCU;
  • Step 510 The MCU sends a masterSlaveDeterminationAck message to the terminal.
  • Step 511 the terminal sends a terminalCapabilitySetAck message to the MCU;
  • Step 512 the terminal sends a masterSlaveDeterminationAck message to the MCU;
  • Step 513: The MCU sends a nonStandard message to the terminal, carrying the language types supported by the conference.
  • Step 514 the terminal sends an openLogicalChannel message to the MCU;
  • Step 515 the MCU sends an openLogicalChannelAck message to the terminal.
  • Step 516 the MCU sends an openLogicalChannel message to the terminal.
  • step 517 the terminal sends an openLogicalChannelAck message to the MCU.
  • the MCU notifies the terminal of the language type supported by the conference.
  • The manner shown in FIG. 5 is only an example; the language types supported by the conference may be carried by messages other than the nonStandard message.
  • The message carrying the language types supported by the conference is not limited to step 513; it may also be sent at other positions, for example before or after other steps.
  • FIG. 6 is a SIP protocol call flow chart. Compared with the H.323 protocol, the SIP call flow is simpler. As shown in FIG. 6, it includes:
  • Step 601 The MCU sends an INVITE message to the terminal.
  • Step 602 The terminal sends a 100 Trying message to the MCU.
  • Step 603 the terminal sends a 180 Ringing message to the MCU;
  • Step 604 the terminal sends a 200 OK message to the MCU;
  • Step 605: The MCU sends an ACK message to the terminal; the call setup flow is completed.
  • On this basis, a non-standard private message is added: an INFO message is sent to notify the terminal of the language types supported by the conference. The flow includes:
  • Step 606 The MCU sends an INFO message to the terminal, and carries the language type supported by the conference.
  • step 607 the terminal sends a 200 OK message to the MCU.
  • the MCU notifies the terminal of the language type supported by the conference.
  • Carrying the language types supported by the conference in the INFO message is only an example; other messages may also be used to carry the language types supported by the conference.
  • The packet format used for audio packets transmitted over the Internet complies with the Real-time Transport Protocol (RTP).
  • The RTP protocol is a network transmission protocol published by the IETF Audio/Video Transport working group as RFC 1889 in 1996.
  • the RTP packet includes: an RTP header and an RTP payload, wherein, as shown in FIG. 7(a), the RTP header includes:
  • V: the version number of the RTP protocol, occupying 2 bits; the current protocol version number is 2.
  • PT: the payload type, occupying 7 bits, used to describe the type of the payload in the RTP message, such as Global System for Mobile Communications (GSM) audio or Joint Photographic Experts Group (JPEG) images. It is mostly used to distinguish audio streams from video streams in streaming media so that the client can parse them.
  • Sequence number: 16 bits, used to identify the serial number of the RTP message sent by the sender; each time a message is sent, the sequence number is incremented by 1. This field can be used to check for packet loss when the underlying bearer protocol uses UDP and, under network jitter, to reorder the data. In the helix server, this field starts from 0, and the sequence numbers of audio packets and video packets are counted separately.
  • Timestamp: 32 bits; the timestamp reflects the sampling instant of the first octet of the RTP message.
  • the receiver uses the time stamp to calculate the delay and delay jitter and performs synchronous control.
  • Synchronization source (SSRC) identifier: 32 bits, used to identify the synchronization source. The identifier is chosen randomly, and two synchronization sources participating in the same video conference cannot have the same SSRC.
  • Contributing source (CSRC) identifiers: each is 32 bits, and there can be 0 to 15 of them. The CSRC identifiers identify all the contributing sources contained in the RTP message payload.
  • In an embodiment of the present application, the RTP message is extended: the extension flag (X) of the fixed header in the RTP protocol is set to "1", the header extension is enabled, and a variable-length header extension follows the fixed header.
  • The header extension carries the language type and encoded data length of each channel of the multi-channel speech; its format is shown in FIG. 7(b).
  • the header extension includes a plurality of extensions, specifically, including:
  • an identifier information field, used to identify the type of the current header extension; it can occupy 16 bits, and its specific implementation is defined by the upper-layer protocol;
  • a length field, used to indicate how many channels of encoded data the packet carries, that is, the number of subsequent frame information fields;
  • a plurality of frame information fields may follow the length field, each frame information field including a frame type field and a frame length field, where the frame type field indicates the frame type, that is, the language type of one channel, and the frame length field indicates the frame length, that is, the length of the encoded data of that channel.
  • Each frame information field can occupy 32 bits: the first 16 bits are the frame type field and the last 16 bits are the frame length field.
  • the length of each frame information field can be set as needed, which is not limited in this application.
  • In some implementations, the extension information may omit the identifier information field.
  • other fields may be extended as needed, which is not limited in this application.
  • the format of the foregoing extended header is only an example, and the extended headers of other formats may be defined to carry information about each channel according to requirements, which is not limited in this application.
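Assuming the field sizes described above (a 16-bit identifier, a 16-bit count of frame information fields, then one 32-bit frame information field per channel: 16-bit language type plus 16-bit encoded data length), packing and unpacking of the header extension and payload could look like this sketch. The identifier value and function names are hypothetical:

```python
import struct

EXT_ID = 0x0001  # hypothetical identifier; defined by the upper-layer protocol

def pack_extension(channels):
    """channels: list of (language_type, encoded_data) tuples, one per channel."""
    ext = struct.pack("!HH", EXT_ID, len(channels))  # identifier + field count
    payload = b""
    for lang_type, data in channels:
        ext += struct.pack("!HH", lang_type, len(data))  # frame type + frame length
        payload += data  # encoded data arranged in channel order
    return ext + payload

def unpack_extension(packet):
    """Recover (language_type, encoded_data) per channel from ext + payload."""
    _ext_id, count = struct.unpack_from("!HH", packet, 0)
    infos = [struct.unpack_from("!HH", packet, 4 + 4 * i) for i in range(count)]
    offset = 4 + 4 * count
    channels = []
    for lang_type, length in infos:
        channels.append((lang_type, packet[offset:offset + length]))
        offset += length
    return channels

# Round trip: two channels, e.g. type 1 = Chinese, type 2 = English (values assumed).
chans = [(1, b"zh-coded"), (2, b"en-coded")]
assert unpack_extension(pack_extension(chans)) == chans
```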
• The multi-channel encoded packet or the multi-channel mixing encoded packet may be transmitted using the extended RTP message described above.
• Messages in other formats may also be used to transmit the multi-channel encoded packet or the multi-channel mixing encoded packet, which is not limited in this application.
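The extension layout above — an identifier, a count of frame information fields, then one 32-bit frame information field (16-bit frame type, 16-bit frame length) per channel — can be packed and parsed as sketched below. This is a minimal illustration only: the function names, the 16-bit width of the length field, and network byte order are assumptions, since the application leaves these details open.

```python
import struct

def pack_header_extension(identifier, channels):
    """Pack the header extension described above.

    identifier: 16-bit value identifying the type of this header extension
                (its meaning is defined by the upper-layer protocol).
    channels:   list of (language_type, encoded_len) pairs; language_type
                fills the 16-bit frame type field and encoded_len the
                16-bit frame length field of one frame information field.
    """
    ext = struct.pack("!HH", identifier, len(channels))
    for language_type, encoded_len in channels:
        ext += struct.pack("!HH", language_type, encoded_len)
    return ext

def unpack_header_extension(ext):
    """Inverse of pack_header_extension: returns (identifier, channels)."""
    identifier, count = struct.unpack_from("!HH", ext, 0)
    channels = [tuple(struct.unpack_from("!HH", ext, 4 + 4 * i))
                for i in range(count)]
    return identifier, channels
```

A two-language packet thus needs only 4 + 2 × 4 = 12 bytes of extension to describe both channels.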
• FIG. 8(a) is a flowchart of multi-language conference multi-channel encoding and packing according to an embodiment of the present application.
• On the terminal side, the original voice of the venue speaker is first picked up by the microphone of the venue and transmitted to the translation room, where it is translated; the results are input to the different channels of the terminal.
• For example, if the conference is a bilingual conference, the left and right channels of the terminal respectively carry the voices of the two conference languages, and stereo coding is adopted.
• The terminal separately encodes each channel to obtain channel information and encoded data, as shown in FIG. 8(a).
• The channel information includes the language type and the encoded-data length.
• The language type of each channel is filled in as the frame type, and the encoded-data length as the frame length, into the corresponding fields of the RTP header extension.
• The encoded data of each channel is arranged in sequence as the RTP payload, in the same order as the language types in the header extension; finally, a multi-channel encoded packet including an RTP header, an RTP header extension, and an RTP payload is generated, wherein the extension flag bit in the RTP header indicates that a header extension is present, and the packet is transmitted over the network.
• On the MCU side, the multi-channel mixed data output by the mixer is sent to the encoder of the MCU; the encoder matches the mixer's multi-channel mixed data against the language information of each channel recorded when the mixer was created.
• Information such as the language identifier and the encoded-data length is filled into the RTP header extension preceding the encoded data.
• In another embodiment, the multi-channel encoded packet may also include only the RTP header and the RTP payload, where the RTP payload carries the data of each channel; the data of each channel includes the language identifier information of the channel and the encoded data of the channel, the language identifier information indicating the language type of the channel.
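The extension-free variant just described — each channel's encoded data preceded in the RTP payload by its language identifier — might be laid out as below. The 16-bit field widths and function names are assumptions for illustration, since the application does not fix this format.

```python
import struct

def pack_payload_only(channels):
    """Pack channel data into a bare RTP payload: for every channel, a
    16-bit language identifier and a 16-bit data length precede the
    channel's encoded data."""
    payload = b""
    for language_id, data in channels:
        payload += struct.pack("!HH", language_id, len(data)) + data
    return payload

def unpack_payload_only(payload):
    """Walk the payload and recover the (language_id, data) list."""
    channels, pos = [], 0
    while pos < len(payload):
        language_id, length = struct.unpack_from("!HH", payload, pos)
        pos += 4
        channels.append((language_id, payload[pos:pos + length]))
        pos += length
    return channels
```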
• FIG. 8(b) is a flowchart of multi-language conference multi-channel unpacking and decoding according to an embodiment of the present application.
• On the terminal side, the conference mixing encoded packet sent by the MCU, including the RTP header, the RTP header extension, and the RTP payload, is received; from the extension flag bit in the RTP header, the terminal learns that the encoded packet has a header extension.
• According to the header extension format, the data of each channel in the encoded packet is unpacked.
• The language type and encoded-data length of each channel in the header extension are parsed, the encoded data of each channel in the RTP payload is located, and the data is sent to a plurality of decoders to decode and output the conference sound of each channel.
• On the MCU side, the MCU receives the multi-channel encoded packet sent by the terminal.
• From the extension flag bit in the RTP header, the MCU learns that the encoded packet has a header extension, and according to the header extension format, the data of each channel in the multi-channel encoded packet is unpacked.
• The language type and encoded-data length of each channel in the header extension are parsed to obtain channel information 1 to channel information n, and the encoded data of each channel in the RTP payload is located accordingly to obtain encoded data 1 to encoded data n.
• The data is sent to a plurality of decoders, which decode and output the conference sound of each channel to the mixer module for multi-channel mixing.
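The unpacking flow of Figure 8(b) — check the extension flag bit, read the per-channel language type and encoded-data length from the header extension, then slice the payload into per-channel encoded data for the decoders — can be sketched as follows, reusing the 32-bit frame information layout described earlier. The offsets assume a fixed 12-byte RTP header with no CSRC list, and the names are illustrative.

```python
import struct

def unpack_multichannel_packet(packet):
    """Split a multi-channel encoded packet into per-channel encoded data.

    Returns a list of (language_type, encoded_data) in payload order, or
    an empty list if the extension flag (X) of the fixed 12-byte RTP
    header is not set.
    """
    if not packet[0] & 0x10:           # X bit of the first header byte
        return []
    # header extension: 16-bit identifier, 16-bit frame-info count
    _, count = struct.unpack_from("!HH", packet, 12)
    frames = [struct.unpack_from("!HH", packet, 16 + 4 * i)
              for i in range(count)]
    pos = 16 + 4 * count               # start of the RTP payload
    channels = []
    for language_type, length in frames:
        channels.append((language_type, packet[pos:pos + length]))
        pos += length
    return channels
```

Each returned `(language_type, encoded_data)` pair would then be handed to the decoder matching that language's codec.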
• FIG. 9 is a flowchart of a method for implementing a conference according to a first embodiment of the present application.
• The execution body of this embodiment is an MCU. As shown in Figure 9, the method includes:
• Step 901: The MCU receives the multi-channel encoded packets sent by the terminals participating in the conference, and unpacks and decodes each multi-channel encoded packet to obtain a multi-channel audio stream; each channel in the multi-channel audio stream corresponds to an audio stream of one language type;
• Since each terminal has already translated the original voice of the site in which it is located, the MCU does not need to perform translation; the decoder of the MCU unpacks and decodes the multi-channel encoded packets of each terminal according to the steps shown in FIG. 8(b), and then sends the audio stream of each channel to the mixer.
• Step 902: The MCU determines the terminals that need to be mixed, and mixes the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; one channel corresponds to the mixed audio stream of one language type;
• In an embodiment, the MCU creates a multi-channel mixer.
• The number of channels of the mixer is the same as the number of language types supported by the conference.
• The decoded audio stream of each channel obtained by the MCU is sent, according to its language type, to the corresponding channel of the mixer for mixing.
• For example, if the terminals that need to be mixed are the first terminal and the second terminal, the Chinese voice and the English voice of the first terminal are acquired, and the Chinese voice and the English voice of the second terminal are acquired; the two Chinese voices are mixed into a Chinese mixed audio stream, the two English voices are mixed into an English mixed audio stream, and the Chinese mixed audio stream and the English mixed audio stream together constitute the multi-channel mixed audio stream.
• Step 903: The MCU encodes the multi-channel mixed audio stream to obtain a multi-channel mixing encoded packet, and sends the multi-channel mixing encoded packet to the terminals.
• For a terminal in a conference, the MCU only needs to establish one audio channel for audio transmission with that terminal.
• In an embodiment, the multi-channel mixing encoded packet sent by the MCU to a terminal does not include the audio of the site where that terminal itself is located.
• The conference implementation method provided in this embodiment needs to create only one mixing process: the number of mixers is small, the computation of the mixing process is small, and CPU occupation is reduced; moreover, the original-language mix and the translated-language mix sent to the same venue carry consistent expression information.
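The same-language mixing of step 902 can be sketched as a minimal illustration on 16-bit PCM sample lists, assuming simple summation with clipping; per-venue exclusion of a terminal's own audio is omitted here for brevity (it is handled by the port-output rule of the mixing strategy). The function name and data layout are hypothetical, not taken from the application.

```python
def mix_same_language(terminal_streams):
    """terminal_streams: {terminal_id: {language_type: [pcm samples]}}.

    Returns {language_type: mixed samples}: for every language type, the
    samples of all terminals are summed and clipped to the 16-bit range,
    so one mixer pass yields one mixed channel per conference language.
    """
    mixed = {}
    for streams in terminal_streams.values():
        for language, samples in streams.items():
            acc = mixed.setdefault(language, [0] * len(samples))
            for i, s in enumerate(samples):
                acc[i] = max(-32768, min(32767, acc[i] + s))
    return mixed
```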
• In an embodiment, the mixing unit of the MCU includes: a VAD module, a sorting module, a selection module, and a mixing and exchange processing module.
• The VAD (Voice Activity Detection) module estimates the SNR (Signal to Noise Ratio) and energy of each conference speaker.
• The sorting module sorts all the terminals of the conference according to the results of the VAD module.
• The selection module picks out, from the sorted terminals, a number of terminals that match the mixing condition (such as the three with the highest volume) for mixing.
• The mixing and exchange processing module mixes the speech of the speakers participating in the mixing, and obtains the data to be sent to each port participating in the mixing.
• The MCU's mixing processing strategy should ensure that no venue hears its own voice; each venue hears only the sound of the other venues. As shown in Figure 10, the process includes:
• Step 1001: Acquire the SNR and energy of each terminal.
• Step 1002: Determine, according to the SNR and energy of the terminal, whether the terminal participates in sorting and mixing, and generate a terminal linked list.
• This step excludes the terminals that are not speaking; that is, only the terminals that are speaking participate in sorting and mixing.
• The terminal linked list includes the terminals that participate in sorting.
• Step 1003: Sort the terminal linked list of the conference.
• The terminal linked list can be sorted according to the energy and signal-to-noise ratio of each terminal.
• Step 1004: Select the terminals that satisfy the condition from the sorted terminal linked list to participate in the mixing.
• More or fewer terminals can be selected to participate in the mixing as needed.
• Mixing is handled in two cases: the number of mixing ports equal to 1, and the number of mixing ports greater than 1, wherein:
• The mixing ports are mixed with each other, and the output to a non-mixing port is the sum of all mixing ports.
• The meaning of A_out:B+C is: the audio streams sent by terminal B and terminal C to the MCU are mixed and sent to terminal A.
• Because the mixer supports terminals with different channel counts and different sampling rates, conversion of channel data and unification of sampling rates between mixing ports are required in the mixing process.
• The mixing of multi-language multi-channel speech can be performed using the same mixing processing strategy as in this embodiment, and details are not described herein again.
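A minimal sketch of the Figure 10 flow, treating each terminal's signal as a single energy scalar for brevity: silent terminals are excluded, the rest are sorted, the loudest become mixing ports, and every mixing port receives the sum of the other ports so that no venue hears its own voice (the A_out:B+C rule). The function name, the tuple layout, and the top-three default are illustrative assumptions; the application leaves the selection count configurable.

```python
def select_and_mix(terminals, max_mixed=3):
    """terminals: {terminal_id: (energy, speaking)} for one mixing period.

    Returns (mixing ports, per-terminal output): drop silent terminals,
    sort the rest by energy, keep the top max_mixed as mixing ports, and
    give each terminal the sum of the mixing ports minus its own
    contribution (non-mixing ports receive the sum of all mixing ports).
    """
    speaking = {tid: e for tid, (e, active) in terminals.items() if active}
    ports = sorted(speaking, key=speaking.get, reverse=True)[:max_mixed]
    total = sum(speaking[tid] for tid in ports)
    outputs = {}
    for tid in terminals:
        outputs[tid] = total - speaking[tid] if tid in ports else total
    return ports, outputs
```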
• FIG. 11 is a flowchart of a method for implementing a conference according to an embodiment of the present disclosure. As shown in FIG. 11, the method includes:
• Step 1101: The terminal receives the multi-channel mixing encoded packet sent by the MCU.
• Step 1102: The terminal parses the multi-channel mixing encoded packet to obtain a multi-channel mixed audio stream.
• In this way, the terminal can obtain the mixed audio stream of every channel through one audio channel, thereby reducing resource occupation.
• In an embodiment, the method further includes: the terminal determining the language type used by the audio stream of each channel according to the language type information carried in the multi-channel mixing encoded packet, or determining the language type corresponding to each channel according to a preset correspondence between channels and language types. That is, there are two ways to obtain the language type of the audio stream of each channel. One is by presetting, for example, presetting the first channel to Chinese, the second channel to English, the third channel to French, and so on; the MCU then carries the Chinese audio stream in the first channel of the multi-channel mixing encoded packet, the English audio stream in the second channel, and the French audio stream in the third channel, and the terminal acquires the Chinese audio stream from the first channel, the English audio stream from the second channel, and the French audio stream from the third channel.
• In an embodiment, the multi-channel encoded packet is a real-time transport protocol message.
• The terminal determining, according to the language type information carried in the multi-channel mixing encoded packet, the language type used by the audio stream of each channel includes:
• the terminal acquiring the language type information used by the audio stream of each channel from the frame type field in the real-time transport protocol header extension of the multi-channel encoded packet.
• In an embodiment, the method further includes: after the terminal obtains the audio stream of each channel, if the language type of the audio stream of one of the channels is the language type used by the site where the terminal is located, playing the audio stream of that channel.
• In the related art, the original voice mix received by the terminal is a mix of the several loudest speakers among all the speakers in the conference, while the translated voice mix received by the terminal contains only the translated voice corresponding to the language type supported by the terminal.
• The translated mix therefore cannot convey the valid information of all the translated speech of the venues; that is, the expression information of the original voice mix received by a venue is inconsistent with the expression information of the translated voice mix.
• In this embodiment, the expression information of the original voice mix received by the terminal is consistent with the expression information of the translated voice mix.
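The two ways of learning each channel's language type — language information carried in the packet versus a preset channel-to-language mapping — and the "play the channel matching the site's language" rule can be sketched as below. The function name and data shapes are assumptions for illustration.

```python
def pick_playback_channel(mixed_channels, site_language, preset_map=None):
    """Select the channel to play at a site.

    mixed_channels: {channel_index: (language_type, samples)} parsed from
    the mixing encoded packet, or {channel_index: samples} when the packet
    carries no language info and a preset channel-to-language map
    (e.g. {0: "zh", 1: "en", 2: "fr"}) is configured instead.

    Returns the samples of the channel whose language matches the site,
    or None if no channel matches.
    """
    for index, entry in mixed_channels.items():
        if preset_map is None:
            language, samples = entry          # language carried in-band
        else:
            language, samples = preset_map[index], entry
        if language == site_language:
            return samples
    return None
```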
• FIG. 12 is a schematic diagram of a conference television application scenario according to a second embodiment of the present application. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• In this embodiment, each conference site translates the original voice of the site according to the language types supported by the conference.
• For example, the languages negotiated for the conference are Chinese and English.
• The original voice of the Chinese venue is Chinese, so the sound source of the left channel of the Chinese venue is the original Chinese voice, and the sound source of the right channel is the English voice obtained by translating the original voice.
• The original voice of the English venue is English, so the sound source of the left channel of the English venue is the Chinese voice obtained by translating the original English voice, and the sound source of the right channel is the original voice of the venue, that is, the English voice.
• The terminal of each site obtains the original voice of the site where it is located.
• The original voice is input directly as the left channel, or directly as the right channel, or translated to serve as the sound source of the left or right channel; after stereo encoding by the terminal, it is sent to the MCU.
• For example, the terminal of the Chinese venue directly inputs the original voice (Chinese) of the venue as the left-channel sound source; after the original voice is translated into English, the translated voice is input as the right-channel sound source, and stereo encoding forms one channel of audio, which is sent to the MCU through one audio channel.
• The terminal of the English venue directly inputs the original voice (English) of the venue as the right-channel sound source; after the original voice is translated into Chinese, the translated voice is input as the left-channel sound source, and stereo encoding forms one channel of audio, which is sent to the MCU through one audio channel.
• The MCU performs mixing processing on the multi-channel audio streams sent by the terminals to generate a multi-channel mixing encoded packet, and transmits it to each terminal.
  • the MCU's mixing processing strategy should make it impossible for any venue to hear its own voice and only hear the sound of other venues.
• Each terminal receives the multi-channel mixing encoded packet (which may also be called a stereo mix in this embodiment) sent by the MCU, decodes the received multi-channel mixed audio stream (two channels in this embodiment), and selectively plays it through the left and right channels of the terminal.
• If the language of a channel matches the language of the site, the terminal of the site can directly play the sound of that channel in the venue.
• For example, the left channel of the stereo mix received by the Chinese venue is the Chinese conference mix, so the terminal of the Chinese venue can play the left-channel sound directly in the venue; likewise, the right channel of the stereo mix received by the English venue is the English conference mix, so the terminal of the English venue can play the right-channel sound directly in the venue.
• In other embodiments, each venue can process the mix according to its specific needs. For example, some venues need to translate the mix before playing it, while other venues can play the original mix without translation. In this embodiment, each terminal can directly listen to the sound of the left or right channel without translation. In another embodiment, the conference mix can also be processed by voice-to-text processing to form subtitles for display on the terminal screen.
• FIG. 13 is a schematic diagram of a conference television application scenario according to a third embodiment of the present disclosure. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• In this embodiment, each venue translates the original voice of its conference site.
• For example, the languages supported by the multi-language conference are Chinese and English.
• The original voice of the Chinese venue is Chinese: the sound source of the left channel of the Chinese venue is the original Chinese voice, and the sound source of the right channel is the English voice obtained by translating the original voice.
• The original voice of the English venue is English: the sound source of the left channel of the English venue is the Chinese voice obtained by translating the original English voice, and the sound source of the right channel is the original English voice.
• In addition, a venue may speak in multiple languages; that is, the original voice of the venue is multi-language.
• In this case, the left-channel sound source of the venue is the Chinese voice obtained by translating the original voice, and the right-channel sound source is the English voice obtained by translating the original voice.
• The translation language types of the terminal of each site are fixed; that is, the terminal translates the original voice of the site into voices of the fixed language types supported by the conference.
  • the working methods of the MCU and the terminals of each conference site are as follows:
• The terminal of each site obtains the original voice of the site where it is located.
• The original voice is input directly as the left channel, or directly as the right channel, or translated to serve as the sound source of the left or right channel; after stereo encoding by the terminal, it is sent to the MCU.
• For example, the terminal of the Chinese venue inputs the original voice (Chinese) of the venue as the left-channel sound source; after the original voice is translated into English, the translated voice is input as the right-channel sound source, and stereo encoding forms one channel of audio (that is, a multi-channel encoded packet), which is sent to the MCU through one audio channel.
• The terminal of the English venue inputs the original voice (English) of the venue as the right-channel sound source; after the original voice is translated into Chinese, the translated voice is input as the left-channel sound source, and stereo encoding forms one channel of audio (that is, a multi-channel encoded packet), which is sent to the MCU through one audio channel.
• The terminal of the mixed Chinese-English venue translates the original voice (mixed Chinese and English) of the venue into Chinese as the left-channel sound source input, and translates it into English as the right-channel sound source input; stereo encoding forms one channel of audio, which is sent to the MCU through one audio channel.
  • the MCU mixes the stereo multi-lingual speech transmitted by each terminal, and transmits a stereo mix (multi-channel mixed coded packet) to each terminal.
  • the MCU's mixing processing strategy should make it impossible for any venue to hear its own voice and only hear the sound of other venues.
• Each terminal receives the stereo mix sent by the MCU, decodes it, and selectively plays it through the left and right channels of the terminal.
• The language of a channel in the stereo mix sent by the MCU to a conference site may be the same as a language supported by the site. For example, the left channel of the stereo mix received by the Chinese venue is the Chinese conference mix, so the terminal of the Chinese venue can play the left-channel sound directly in the venue.
• The right channel of the stereo mix received by the English venue is the English conference mix, so the terminal of the English venue can play the right-channel sound directly in the venue.
• The left channel of the stereo mix received by the mixed Chinese-English venue is the Chinese conference mix, and the right channel is the English conference mix; therefore, the participants in the mixed Chinese-English venue can each listen to the conference mix of the channel matching the language they are familiar with.
• It can be seen that the conference implementation method proposed by the present application can effectively reduce the workload of the terminal-side translators and effectively satisfy the listening and speaking demands of the participants in a multi-language conference environment.
• FIG. 14 is a schematic diagram of a conference television application scenario according to a fourth embodiment of the present disclosure. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• In this embodiment, the conference supports three language types; based on the mutual translation capabilities of the venues, the conference is determined to be a Chinese-English-French trilingual conference.
• The original venue sound collected by the audio equipment of each venue undergoes two translation conversions. For example, in the Chinese venue, the original voice is used as the input of terminal channel 1, and the Chinese-to-English translated voice is used as the input of terminal channel 2.
• The Chinese-to-French translated voice is used as the input of terminal channel 3; the three channels are separately encoded and then packed.
• The specific implementation is shown in Fig. 8(a).
• The sound input processing steps of the English and French venues are the same.
  • the working methods of the MCU and the terminals of each conference site are as follows:
• The terminal of each venue obtains the original voice of the venue where it is located and the language types supported by the conference, determines the translation types, such as English-to-Chinese, Chinese-to-English, or others, and uses the original voice together with the two translated voices as the input sources of channel 1 to channel 3.
• The packed multi-channel encoded packet is sent to the MCU as one audio stream.
• The MCU unpacks and decodes the multi-channel encoded packet sent by each terminal, and sends the result to the multi-channel mixer according to the channel language types obtained after decoding, to obtain a multi-channel mixed audio stream, which is then encoded and packed to obtain a multi-channel mixing encoded packet; the specific implementation is shown in Figure 8(a). Finally, the multi-channel mixing encoded packet is sent to each terminal. In this embodiment, the mixing process of the MCU can be implemented quickly, and only one mixer is required to complete the multi-language mixing.
• Each terminal receives the multi-channel mixing encoded packet sent by the MCU, decodes and splits it into three channels of speech, each decoded into a linear code stream, and the terminal side listens to the conference mix as needed.
• In the trilingual conference, each terminal can freely choose which of the different languages to listen to.
• FIG. 15 is a schematic diagram of a conference television application scenario according to a fifth embodiment of the present disclosure. This embodiment describes the technical solution of the method for supporting multi-language conference television in detail with reference to a specific application scenario.
• The difference between this embodiment and the fourth embodiment is that five venue languages exist in the conference, namely Chinese, English, German, Japanese, and French.
• The mixed German-Japanese-French venue has the capability of translating into Chinese and English, but the other venues do not have translation capabilities for German, Japanese, or French. Therefore, the language types supported by the multi-language conference are Chinese and English.
• The German-Japanese-French venue first translates the voice of the venue into Chinese and English, as the sound source input of channel 1 and channel 2, as shown in Figure 15.
• The voices sent by the Chinese venue are the original Chinese voice and the English voice translated from the original voice.
• The voices sent by the English venue are the Chinese voice translated from the original voice and the original English voice.
• The mixed Chinese-English venue translates its mixed Chinese-English speech into Chinese speech and English speech respectively.
• The mixed German-Japanese-French venue does the same, translating its mixed German-Japanese-French speech into Chinese speech and English speech respectively.
  • the working methods of the MCU and the terminals of each conference site are as follows:
• The terminal of each venue obtains the original voice of the venue where it is located and the language types supported by the multi-language conference, and determines the translation types, such as English-to-Chinese, Chinese-to-English, or the like.
• The two output languages are used as the sound sources of the left and right channels, and the multi-channel encoded packet is encoded and sent to the MCU as one audio stream.
• The MCU mixes the multi-channel encoded packets sent by the terminals to obtain a multi-channel mixed audio stream, and encodes and packs the multi-channel mixed audio stream and transmits it to each terminal.
• The MCU's mixing process can be implemented quickly, requiring only one mixer to complete the multi-language mixing.
  • Each terminal receives the multi-channel mixing and encoding packet sent by the MCU, decodes it, splits the left and right channels, and listens to the multi-lingual voice separately according to the requirements.
• The language types of the mixed audio streams in the mixing encoded packet sent by the MCU to a site match the language types supported by the site; for example, the multi-language audio received by the Chinese venue in the Chinese-English multi-language conference is Chinese and English.
• The Chinese terminals, English terminals, and mixed Chinese-English terminals in the conference can therefore directly listen to or play the conference voice by selecting different channels.
• Other venues can convert the translation according to their specific requirements; for example, the German-Japanese-French venue needs to translate the received Chinese mix or English mix before playing it.
• In this embodiment, each site has a translation function, and the original voice of the site can be translated and output.
• When the MCU mixes the multi-language voices output by the terminals, only one mixer and one processing flow are needed. Before mixing, there is no need to add a speech recognition module, because the language type can be identified through the frame type information. The participants of each venue in the conference can use one or more communication languages and communicate with each other without barriers. The solution is easy to implement and highly practical.
  • An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 16, including:
• the code stream generating unit 1601 is configured to acquire the original voice of the site and the translated voice corresponding to the original voice, and generate a multi-channel audio stream, where one channel corresponds to an audio stream of one language type;
• the first encoding unit 1602 is configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet;
• the first transmitting unit 1603 is configured to transmit the multi-channel encoded packet to the multi-point control unit.
  • the conference implementing apparatus further includes: a language type determining unit 1604, the language type determining unit 1604 being configured to determine a language type supported by the conference. For example, the language type determining unit 1604 acquires the language type supported by the conference from the multipoint control unit.
  • the multi-channel audio stream generated by the code stream generating unit 1601 includes an audio stream of each language type supported by the conference.
• In an embodiment, the code stream generating unit acquires the original voice of the site and the translated voice corresponding to the original voice, and generates the multi-channel audio stream, in one of the following ways or a combination thereof:
• using the audio stream corresponding to the original voice as the audio stream of one channel, when the language of the original voice is a language type supported by the conference; and
• using the translated voices, of each language type supported by the conference, corresponding to the original voice respectively as the audio streams of one channel each, to generate the multi-channel audio stream.
  • An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 17, including:
  • the first receiving unit 1701 is configured to receive the multi-channel mixing and encoding packet sent by the multi-point control unit;
  • the first parsing unit 1702 is configured to parse the multi-channel mixing and encoding packet to acquire a multi-channel mixed audio stream.
• In an embodiment, the first parsing unit 1702 is further configured to determine, according to the language type information carried in the multi-channel mixing encoded packet, the language type used by the audio stream of each channel, or to determine the language type corresponding to each channel according to the preset correspondence between channels and language types.
• The acquiring, by the first parsing unit 1702, of the encoded data of each channel includes:
• the first parsing unit 1702 acquiring the language type information used by the audio stream of each channel from the frame type field in the real-time transport protocol header extension of the multi-channel encoded packet, acquiring the encoded-data length of each channel from the frame length field in the real-time transport protocol header extension, and obtaining the encoded data of each channel from the real-time transport protocol payload of the multi-channel encoded packet according to the encoded-data length of each channel.
  • An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 18, including:
• the second receiving unit 1801 is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference;
• the second parsing unit 1802 is configured to parse the multi-channel encoded packet to obtain a multi-channel audio stream, where each channel of the multi-channel audio stream corresponds to an audio stream of one language type;
• the mixing unit 1803 is configured to determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream, where one channel corresponds to the mixed audio stream of one language type;
• the second encoding unit 1804 is configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixing encoded packet;
• the second transmitting unit 1805 is configured to transmit the multi-channel mixing encoded packet to the terminals participating in the conference.
  • An embodiment of the present application provides a conference implementation device, including a memory and a processor, where the memory stores a program, and when the program is read and executed by the processor, the conference implementation method of any of the foregoing is implemented.
  • An embodiment of the present application is a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the above The conference implementation method.
  • the computer readable storage medium includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
  • An embodiment of the present application provides a conference implementation system, as shown in FIG. 19, including: an MCU 1901 and at least one terminal 1902, where:
  • the terminal 1902 is configured to acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive the multi-channel mixed encoded packet sent by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream;
  • the MCU 1901 is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and transmit the multi-channel mixed encoded packet to the terminals participating in the conference.
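The per-language mixing performed by the MCU 1901 can be sketched as follows. The sample representation (lists of 16-bit PCM integers) and the clipping range are illustrative assumptions; the point is that a single mixer pass groups streams by language and produces the whole multi-channel mix.

```python
def mix_by_language(terminal_streams):
    """terminal_streams: one dict per terminal selected for mixing, each
    mapping language -> list of PCM samples. Samples of the same language
    are summed across terminals and clipped to the 16-bit range, so one
    mixer produces the entire multi-channel mix in a single pass."""
    mixed = {}
    for streams in terminal_streams:
        for language, samples in streams.items():
            acc = mixed.setdefault(language, [0] * len(samples))
            for i, s in enumerate(samples):
                # Sum, then clip to avoid 16-bit overflow.
                acc[i] = max(-32768, min(32767, acc[i] + s))
    return mixed
```

Because every terminal already uploads one stream per conference language, the MCU needs only this one mixer regardless of how many languages the conference supports, instead of one mixer per language as in the related art.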
  • For detailed implementations of the terminal and the MCU, refer to the foregoing embodiments; details are not repeated here.
  • Unlike the related art, in which the original speech and the translated speech are transmitted separately over different channels, speech of different language types is here combined into a single multi-channel encoded packet, so that only one channel is needed, reducing port occupation.


Abstract

A conference implementation method, device, apparatus and system, and a computer readable storage medium. The conference implementation method comprises: acquiring the original speech of the conference site at which a terminal is located and the translated speech corresponding to the original speech to generate a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type; encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and transmitting the multi-channel encoded packet to a multipoint control unit.

Description

Conference implementation method, apparatus, device and system, and computer readable storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 201711482042.3, filed on December 29, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to communication technologies, and in particular to a conference implementation method, apparatus, device and system, and a computer readable storage medium.
Background
Today, with the globalization of information exchange, video conferencing services are used ever more widely and cross-regional exchanges keep increasing. People in many countries and regions who speak different languages communicate together, making communication barrier-free, which has a huge impact on politics, economy, and culture.
Video conferencing is a means of multimedia communication that enables simultaneous interaction of images, voice, and data between two or more locations, using television equipment and communication networks to hold meetings. A video conferencing system generally consists of video terminals, a transmission network, and a multipoint control unit (MCU). The basic voice service of a video conference involves two parties who can hear each other's voices; in a multi-party conference, more than two parties may speak at the same time, the voice signal of any party can be transmitted to all other parties, and each party can simultaneously hear several other parties. When multiple languages are used in such a multi-party video conference, the speaker's remarks must be translated on demand so that participants from different countries and regions who speak different languages can communicate without barriers.
In the related art, multilingual video conferencing is implemented as follows: each terminal translates the original speech of its site according to its own translation language type and outputs two audio streams, one carrying the source speech and one carrying the translated speech, which require two channels for transmission to the MCU. The MCU determines the language type of the translated speech uploaded by each translating site. When mixing, the MCU needs multiple mixers: it mixes the translated speech by recognized language type and separately mixes the original speech. After mixing the speech of each language type, the MCU must send two audio streams to every site: one carrying the mixed original speech of all sites, and one carrying the translation mix in the language type of the destination site. This scheme generates considerable network traffic and is likely to cause network congestion.
Summary
At least one embodiment of the present application provides a conference implementation method, apparatus, device and system, and a computer readable storage medium, which reduce the occupation of network resources and improve conference efficiency.
At least one embodiment of the present application provides a conference implementation method, including:

acquiring the original speech of the site where a terminal is located and the translated speech corresponding to the original speech, and generating a multi-channel audio stream in which one channel corresponds to an audio stream of one language type;

encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and

sending the multi-channel encoded packet to a multipoint control unit.
At least one embodiment of the present application provides a conference implementation apparatus, including:

a code stream generating unit, configured to acquire the original speech of the site where a terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type;

a first encoding unit, configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet; and

a first sending unit, configured to send the multi-channel encoded packet to a multipoint control unit.
At least one embodiment of the present application provides a conference implementation method, including:

receiving the multi-channel encoded packets sent by the terminals participating in a conference and parsing them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;

determining the terminals that need to be mixed, and mixing the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;

encoding the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and

sending the multi-channel mixed encoded packet to the terminals participating in the conference.
At least one embodiment of the present application provides a conference implementation apparatus, including:

a second receiving unit, configured to receive the multi-channel encoded packets sent by the terminals participating in a conference;

a second parsing unit, configured to parse the multi-channel encoded packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;

a mixing unit, configured to determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;

a second encoding unit, configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and

a second sending unit, configured to send the multi-channel mixed encoded packet to the terminals participating in the conference.
At least one embodiment of the present application provides a conference implementation method, including:

receiving a multi-channel mixed encoded packet sent by a multipoint control unit; and

parsing the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
At least one embodiment of the present application provides a conference implementation apparatus, including:

a first receiving unit, configured to receive a multi-channel mixed encoded packet sent by a multipoint control unit; and

a first parsing unit, configured to parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
At least one embodiment of the present application provides a conference implementation device, including a memory and a processor, the memory storing a program which, when read and executed by the processor, implements the conference implementation method of any of the above embodiments.
An embodiment of the present application provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the conference implementation method of any of the above embodiments.
An embodiment of the present application provides a conference implementation system, including a multipoint control unit and at least one terminal, wherein:

the terminal is configured to acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive a multi-channel mixed encoded packet sent by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream;

the multipoint control unit is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals that need to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and send the multi-channel mixed encoded packet to the terminals participating in the conference.
Unlike the related art, in which the original speech and the translated speech are transmitted separately over different channels, the conference implementation method provided by at least one embodiment of the present application combines speech of different language types into a multi-channel encoded packet before sending, so that only one channel is needed, reducing port occupation.
Other features and advantages of the present application will be set forth in the following description, and will in part become apparent from the description or be understood by implementing the present application. The objectives and other advantages of the present application can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings are used to provide a further understanding of the technical solutions of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not constitute a limitation thereof.
FIG. 1 is a schematic diagram of the networking of a multilingual conference according to an embodiment of the present application.

FIG. 2 is a flowchart of a conference implementation method according to an embodiment of the present application.

FIG. 3 is a schematic diagram of terminal-side processing in a conference implementation method according to an embodiment of the present application.

FIG. 4 is a flowchart of language negotiation for a multilingual conference according to an embodiment of the present application.

FIG. 5 is a flowchart of a terminal call based on the H.323 protocol according to an embodiment of the present application.

FIG. 6 is a flowchart of a terminal call based on the SIP protocol according to an embodiment of the present application.

FIG. 7(a) is a schematic diagram of the RTP packet format provided in the related art.

FIG. 7(b) is a schematic diagram of an RTP packet extension according to an embodiment of the present application.

FIG. 8(a) is a schematic diagram of the multi-channel encoding and packing process of a multilingual conference according to an embodiment of the present application.

FIG. 8(b) is a schematic diagram of the multi-channel unpacking and decoding process of a multilingual conference according to an embodiment of the present application.

FIG. 9 is a flowchart of the MCU-side implementation of a conference implementation method according to the first embodiment of the present application.

FIG. 10 is a schematic diagram of the MCU mixing process according to an embodiment of the present application.

FIG. 11 is a flowchart of a conference implementation method (terminal side) according to an embodiment of the present application.

FIG. 12 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 13 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 14 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 15 is a schematic diagram of a multilingual conference television application scenario according to an embodiment of the present application.

FIG. 16 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.

FIG. 17 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.

FIG. 18 is a block diagram of a conference implementation apparatus according to an embodiment of the present application.

FIG. 19 is a schematic diagram of a conference implementation system according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other arbitrarily.
The steps illustrated in the flowcharts of the figures may be executed in a computer system such as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described herein.
The video conference implementation solutions in the related art have the following disadvantages:
The terminal and the MCU are directly connected by an original speech channel and a translated speech channel, transmitting over these separate channels: the original speech over the original speech channel and the translated speech over the translated speech channel. This increases network traffic, easily causes network congestion, and doubles the number of audio input/output interfaces occupied between the terminal and the MCU.
A conference room mixes the translated speech according to its language type, which requires creating multiple mixing processes. As the number of speech languages increases, the number of mixers increases linearly, causing redundant computation in the mixing process and increasing its CPU load.
FIG. 1 is a schematic diagram of the networking of a multilingual conference according to an embodiment of the present application. As shown in FIG. 1, a system supporting multilingual conference television includes at least one terminal and an MCU connected to each terminal over a communication network. The terminal connects to the MCU through a standard interface to realize the mixing and exchange of images and voice. The MCU determines the language types supported by the conference according to the voice requirements and translation capabilities of each site; the conference may support one or more language types. The languages used by participants in each site may vary, and each site may arrange translators as needed to translate the speech it sends and receives. Each site may translate its own speech into the conference language types determined for the conference, and the conference mix received by each site may be listened to directly or first translated into a language understood at that site. The terminal may convert the original speech of its site into translated speech of a conference language type either manually or by non-manual intelligent means such as software and hardware. Current international conferences generally rely on human oral interpretation of the speaker's remarks. The present application describes the solution taking human translation as an example, but is not limited to this translation method or means.
An embodiment of the present application provides a conference implementation method, as shown in FIG. 2, including:
Step 201: acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type.

Step 202: encode the multi-channel audio stream to obtain a multi-channel encoded packet.

Step 203: send the multi-channel encoded packet to the MCU.
In the conference implementation method provided by this embodiment, audio streams of different language types are combined into a multi-channel encoded packet and transmitted to the MCU. In this embodiment, the multi-channel encoded packet can be sent over a single audio channel; thus only one audio channel is needed to carry audio of multiple language types, which, compared with a multi-connection implementation, reduces network traffic, lowers resource occupation, and improves communication efficiency.
In an embodiment, before step 201, the method further includes:

Step 200: acquire the language types supported by the conference.

The translated speech in step 201 is then the translated speech of the original speech in the language types supported by the conference.
Generally, the multipoint control unit negotiates with the terminals to determine the language types supported by the conference and, after they are determined, sends them to the terminals. One way for a terminal to acquire the language types supported by the conference is therefore to receive them from the multipoint control unit.
It should be noted that the language types supported by the conference may also be configured in advance.
The multi-channel audio stream includes an audio stream for each language type supported by the conference. For example, when the conference supports three language types, the multi-channel audio stream includes three channels, each carrying an audio stream of one language type.
In an embodiment, the terminal acquiring the original speech of its site and the translated speech corresponding to the original speech and generating a multi-channel audio stream includes:
when the language type of the original speech is one of the language types supported by the conference, using the audio stream of the original speech as the audio stream of one channel, and using the translated speech of the original speech in each conference-supported language type other than that of the original speech as the audio streams of the other channels, to generate the multi-channel audio stream. For example, if the original speech of the site is Chinese and the conference supports Chinese and English, the Chinese speech is used as the audio stream of one channel and the English speech translated from it as the audio stream of another channel, yielding a multi-channel audio stream composed of Chinese speech and English speech;

when the language type of the original speech is not a language type supported by the conference, using the translated speech of the original speech in each language type supported by the conference as the audio stream of one channel each, to generate the multi-channel audio stream. For example, if the original speech of the site is French and the conference supports Chinese and English, the Chinese speech translated from the original French speech is used as the audio stream of one channel and the English speech translated from it as the audio stream of another channel, yielding a multi-channel audio stream composed of Chinese speech and English speech.
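The two cases above can be sketched as follows. Here `translate` is a hypothetical helper standing in for the (human or automatic) translation step described in the text; it is an assumption for illustration, not part of the patent's interface.

```python
def build_channel_streams(original_language, original_audio, conference_languages, translate):
    """Return an ordered mapping language -> audio stream, one channel per
    conference-supported language, following the two cases described above."""
    channels = {}
    for language in conference_languages:
        if language == original_language:
            # Case 1: the original speech itself fills this channel.
            channels[language] = original_audio
        else:
            # Otherwise a translation of the original speech fills the channel
            # (this covers both the remaining channels of case 1 and all of case 2).
            channels[language] = translate(original_audio, language)
    return channels
```

With a Chinese site and a conference supporting Chinese and English, the Chinese channel carries the original audio and the English channel a translation; with a French site, both channels carry translations.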
Which language type each channel carries may be preset or may be indicated by language type information. Specifically:

each channel of the multi-channel encoded packet corresponds to an audio stream of a preset language type; or the multi-channel encoded packet carries language type information for each channel. For example, if it is preset that the first channel corresponds to Chinese and the second channel to English, then when the multi-channel encoded packet is assembled, the audio stream of the Chinese speech is placed in the first channel and the audio stream of the English speech in the second channel. Alternatively, each channel carries language type information indicating the language type of that channel's audio stream.
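When the language type information is carried in the packet rather than fixed by convention, the packing side can be sketched as follows. The per-channel layout (a 1-byte language code in the frame type field followed by a 2-byte frame length) and the concrete code values are illustrative assumptions, not the normative format.

```python
import struct

# Illustrative language codes (assumed; the patent does not fix concrete values).
CODES = {"Chinese": 1, "English": 2}

def pack_multichannel(channels):
    """channels: list of (language, encoded_bytes) pairs, one per channel.
    Returns the header-extension bytes (frame type + frame length per channel)
    and the concatenated payload holding each channel's encoded data."""
    extension = b""
    payload = b""
    for language, data in channels:
        # Frame type carries the language code; frame length, the data size.
        extension += struct.pack("!BH", CODES[language], len(data))
        payload += data
    return extension, payload
```

The receiving side walks the (frame type, frame length) pairs in order to recover each channel's language and its slice of the payload, so no separate signaling channel is needed for the language mapping.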
FIG. 3 is a schematic diagram of the processing performed by a terminal supporting a multilingual conference according to an embodiment of the present application. In practice, the microphone of the site picks up the original speech of the site's speaker. If the original speech of the site is of only one language type and matches a language type supported by the conference, it can directly serve as the sound source of one channel of the terminal. If the original speech of the site is of multiple language types or does not match the language types supported by the conference, the translation mechanism set up at the site translates the original speech of the site into translated speech of the language types supported by the conference. Each translated speech corresponds to one channel input of the terminal; the terminal's encoder performs multi-channel encoding and packs the channels into a single audio output that is sent to the MCU. Since each terminal has already translated the original speech of its own site, the MCU does not need to translate; it directly performs multi-channel mixing of the terminals' translated speech and sends the mixing result to each terminal (see the embodiments for the specific implementation).

After receiving the multi-channel mixed encoded packet sent by the MCU, the terminal first unpacks it to recover the encoded data of each channel, then decodes each channel's data and checks whether the language type of each channel matches the languages supported by its site. The conference mix can be translated on demand, output through loudspeakers, fed to different peripherals for listening, or converted into text by the terminal's internal speech-to-text module and displayed as subtitles on the terminal's screen.
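The receive-side selection described above (play the channel matching the site's language, translate otherwise) can be sketched as follows; `translate` is again a hypothetical helper standing in for the on-demand translation step.

```python
def select_playback(decoded_channels, site_language, translate):
    """decoded_channels: mapping language -> decoded audio, recovered from the
    multi-channel mixed encoded packet. Returns the audio to play at this site:
    the matching channel if one exists, otherwise a translation of one of the
    received mixes into the site's language."""
    if site_language in decoded_channels:
        return decoded_channels[site_language]
    # No channel matches the site's language: translate one of the received mixes.
    language, audio = next(iter(decoded_channels.items()))
    return translate(audio, site_language)
```

The same selected audio could equally be routed to a loudspeaker, a peripheral, or a speech-to-text module for subtitle output, as the text describes.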
During the negotiation for a multilingual conference, the MCU first negotiates with the terminals to determine the language types supported by the conference. The number of channels in the MCU mix equals the number of language types supported by the conference, and the multilingual multi-channel mixing is completed in a single mixer.
FIG. 4 is a flowchart of the negotiation process for the language types supported by a conference according to an embodiment of the present application. As shown in FIG. 4, the process includes:
Step 401: During the signaling control phase when the conference is convened, the MCU begins conference-language negotiation with each terminal and obtains information about each site, including at least one of the following: the site's priority; the site's listening/speaking attributes (speaking only, listening only, or both speaking and listening); the language types supported by the site (the site's speaking language); and the site's translation capability. Each terminal may freely choose its listening language from several languages. Speaking rights can be subject to permissions: for example, a terminal with an ordinary identity may speak only in the default language, whereas a terminal identified as an interpreter or as privileged may choose to speak in other languages.
Step 402: The MCU traverses all terminals participating in the conference.
Step 403: Determine whether all participating terminals have been traversed. If so, the process ends and the language types L1 to Ln supported by the conference are obtained; otherwise, go to step 404.
Step 404: Determine whether the site where the terminal is located has the highest priority. If so, go to step 407; otherwise, go to step 405.
Step 405: Determine whether the speaking language of the terminal's site is the language type used by the most sites in the conference. If so, go to step 407; otherwise, go to step 406.
Step 406: Determine whether the speaking language used by the site can be mutually translated with those of the other sites. If so, go to step 407; otherwise, return to step 402.
For example, if one site uses English and all other sites can translate English into their own languages, English can serve as one of the language types supported by the conference.
Step 407: Take the speaking language of that site as one of the language types supported by the conference, then go to step 402.
It should be noted that the above negotiation procedure is only an example; the language types supported by the conference may be determined in other ways, for example by directly using a default language type.
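As an illustration only, the FIG. 4 loop might be sketched as follows. The `Site` record and all of its field names are hypothetical, not part of the application; `can_translate_from` is assumed to hold the languages a site can translate into its own language (step 406's mutual-translation check).

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class Site:
    name: str
    language: str                        # the site's speaking language
    priority: int                        # higher value = higher priority
    can_translate_from: set = field(default_factory=set)  # languages this site can translate

def negotiate_languages(sites):
    """Sketch of FIG. 4: collect the language types L1..Ln supported by the conference."""
    supported = []
    max_priority = max(s.priority for s in sites)
    most_common = Counter(s.language for s in sites).most_common(1)[0][0]
    for site in sites:                                            # step 402 traversal
        if site.language in supported:
            continue
        if (site.priority == max_priority                         # step 404
                or site.language == most_common                   # step 405
                or all(site.language in other.can_translate_from  # step 406
                       for other in sites if other is not site)):
            supported.append(site.language)                       # step 407
    return supported                                              # step 403 result
```

The order of the three tests follows the flowchart; any site failing all three contributes no language.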
To enable terminals to obtain the conference's language types effectively, a new terminal join-conference protocol flow can be used, as shown in FIG. 5 and FIG. 6 respectively.
FIG. 5 shows the H.323 call flow. According to the H.323 standard, the call flow can be roughly divided into a Q.931 phase and an H.245 phase. The following description uses the case of the MCU calling a terminal:
Step 501: The flow first enters the Q.931 phase, in which the MCU sends a setup message.
Steps 502-504: Upon receipt, the terminal replies in turn with callProceeding, alerting, and connect messages to the MCU. The Q.931 phase is then complete, and the flow enters the H.245 phase.
The H.245 phase consists mainly of three parts: capability-set exchange, master/slave determination, and opening of logical channels. The capability exchange consists of the two sides sending each other terminalCapabilitySet and terminalCapabilitySetAck messages; master/slave determination consists of exchanging masterSlaveDetermination and masterSlaveDeterminationAck messages; and opening logical channels consists of exchanging openLogicalChannel and openLogicalChannelAck messages. Once the bidirectional audio logical channels are open, the terminal has successfully joined the conference. The above is the standard join flow; on this basis, a non-standard private message is added: after the capability exchange is complete, a nonStandard message is sent to notify the terminal of the language types supported by the conference as described in this application. Specifically, the flow includes:
Step 505: The MCU sends a terminalCapabilitySet message to the terminal.
Step 506: The MCU sends a masterSlaveDetermination message to the terminal.
Step 507: The terminal sends a terminalCapabilitySet message to the MCU.
Step 508: The MCU sends a terminalCapabilitySetAck message to the terminal.
Step 509: The terminal sends a masterSlaveDetermination message to the MCU.
Step 510: The MCU sends a masterSlaveDeterminationAck message to the terminal.
Step 511: The terminal sends a terminalCapabilitySetAck message to the MCU.
Step 512: The terminal sends a masterSlaveDeterminationAck message to the MCU.
Step 513: The MCU sends a nonStandard message to the terminal, carrying the language types supported by the conference.
Step 514: The terminal sends an openLogicalChannel message to the MCU.
Step 515: The MCU sends an openLogicalChannelAck message to the terminal.
Step 516: The MCU sends an openLogicalChannel message to the terminal.
Step 517: The terminal sends an openLogicalChannelAck message to the MCU.
In the above manner, the MCU notifies the terminal of the language types supported by the conference. It should be noted that the flow shown in FIG. 5 is only an example: the supported language types may also be carried in messages other than the nonStandard message, and the message carrying them is not limited to step 513 — it may appear at other positions, for example before or after other steps.
FIG. 6 is a flowchart of the SIP call flow. Compared with the H.323 protocol, the SIP call flow is relatively simple. As shown in FIG. 6, it includes:
Step 601: The MCU sends an INVITE message to the terminal.
Step 602: The terminal sends a 100 Trying message to the MCU.
Step 603: The terminal sends a 180 Ringing message to the MCU.
Step 604: The terminal sends a 200 OK message to the MCU.
Step 605: The MCU sends an ACK message to the terminal; the join flow is then complete.
In the present application, a non-standard private message is added on this basis: an INFO message is sent to notify the terminal of the language types supported by the conference, as follows:
Step 606: The MCU sends an INFO message to the terminal, carrying the language types supported by the conference.
Step 607: The terminal sends a 200 OK message to the MCU.
In the above manner, the MCU notifies the terminal of the language types supported by the conference. It should be noted that carrying the supported language types in an INFO message is only an example; other messages may also be used for this purpose.
Audio packets are transmitted over the Internet in a packet format that complies with the Real-time Transport Protocol (RTP). RTP is a network transport protocol published by the IETF's multimedia transport working group in RFC 1889 in 1996.
An RTP packet consists of an RTP header and an RTP payload. As shown in FIG. 7(a), the RTP header includes:
1. V: RTP version number, 2 bits; the current protocol version is 2.
2. P: padding flag, 1 bit. If P=1, one or more extra octets that are not part of the payload are appended at the end of the packet.
3. X: extension flag, 1 bit. If X=1, the RTP header is followed by a header extension.
4. CC: CSRC (contributing source) count, 4 bits, indicating the number of CSRC identifiers.
5. M: marker, 1 bit, whose meaning depends on the payload: for video it marks the end of a frame; for audio it marks the start of a session.
6. PT: payload type, 7 bits, describing the type of the payload in the RTP packet, such as GSM (Global System for Mobile Communications) audio or JPEG (Joint Photographic Experts Group) images. In streaming media it mostly serves to distinguish audio streams from video streams so that the client can parse them.
7. Sequence number: 16 bits, identifying the sequence number of the RTP packet sent by the sender; it increments by 1 with each packet sent. When the underlying transport is UDP, this field can be used to detect packet loss under poor network conditions and to reorder data when network jitter occurs. In the Helix server this field starts at 0, and audio packets and video packets are numbered separately.
8. Timestamp: 32 bits, reflecting the sampling instant of the first octet of the RTP packet. The receiver uses the timestamp to compute delay and jitter and to perform synchronization control.
9. SSRC (synchronization source) identifier: 32 bits, identifying the synchronization source. It is chosen at random; two synchronization sources in the same video conference must not share the same SSRC.
10. CSRC (contributing source) identifiers: 32 bits each, with 0 to 15 of them. Each CSRC identifies one of the contributing sources contained in the payload of the RTP packet.
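As an illustration only (not part of the embodiments), the fixed-header fields listed above can be extracted with straightforward bit operations; the function name and the returned dictionary below are hypothetical:

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header described above, plus any CSRC list."""
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    header = {
        "version":      first >> 6,           # V: 2 bits
        "padding":      (first >> 5) & 0x1,   # P: 1 bit
        "extension":    (first >> 4) & 0x1,   # X: 1 bit
        "cc":           first & 0x0F,         # CC: 4 bits
        "marker":       second >> 7,          # M: 1 bit
        "payload_type": second & 0x7F,        # PT: 7 bits
        "sequence":     seq,                  # 16 bits
        "timestamp":    ts,                   # 32 bits
        "ssrc":         ssrc,                 # 32 bits
    }
    # CC contributing-source identifiers follow, 32 bits each
    csrc_end = 12 + 4 * header["cc"]
    header["csrc"] = list(struct.unpack(f"!{header['cc']}I", packet[12:csrc_end]))
    return header
```

If `extension` is 1, the bytes after the CSRC list begin with the header extension, which is what the embodiment below builds upon.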
In an embodiment of the present application, the RTP packet is extended: the extension flag (X) in the fixed RTP header is set to "1" to enable the extension header, and a variable-length header extension immediately follows the fixed header. This header extension carries, for each channel of the multi-channel speech, information such as the channel's language and the length of its encoded data. Its format is shown in FIG. 7(b).
As shown in FIG. 7(b), the header extension includes multiple extension items, specifically:
An identification field, used to identify the type of the current header extension; it may occupy 16 bits. Its concrete interpretation is defined by the upper-layer protocol.
A length field, indicating how many kinds of channel-encoded data the packet carries, i.e., the number of frame-information fields that follow.
Multiple frame-information fields, which may follow the length field. Each frame-information field includes a frame-type field and a frame-length field: the frame-type field indicates a frame type, i.e., the language type of one channel, and the frame-length field indicates the frame length, i.e., the length of that channel's encoded data. Each frame-information field may occupy 32 bits, of which the first 16 bits are the frame-type field and the last 16 bits are the frame-length field. Of course, the length of each frame-information field can be set as needed; this is not limited in this application.
It should be noted that the extension items may also omit the identification field. In addition, other fields may be added as needed; this is not limited in this application.
It should also be noted that the above extension-header format is only an example: extension headers in other formats may be defined as needed to carry the per-channel information, and this application does not limit this. In one embodiment of the present application, the extended RTP packet described above may be used to transmit multi-channel encoded packets or multi-channel mixed-and-encoded packets. Of course, in other embodiments, packets in other formats may also be used to transmit the multi-channel mixed-and-encoded packets; this application does not limit this.
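As a sketch only, the header-extension layout above — a 16-bit identification field, a 16-bit count of frame-information fields, then one 32-bit frame-type/frame-length pair per channel — might be packed and parsed as follows. The numeric language codes and the default extension id `0x0001` are assumptions; the application does not fix concrete values.

```python
import struct

# Hypothetical mapping of language types to 16-bit frame-type codes
LANG_CODES = {"zh": 1, "en": 2, "fr": 3}

def pack_header_extension(channels, ext_id=0x0001) -> bytes:
    """channels: list of (language, encoded_data_length) pairs, one per channel."""
    ext = struct.pack("!HH", ext_id, len(channels))       # identification, length
    for lang, data_len in channels:
        ext += struct.pack("!HH", LANG_CODES[lang], data_len)  # frame type, frame length
    return ext

def parse_header_extension(ext: bytes):
    """Inverse of pack_header_extension."""
    ext_id, count = struct.unpack("!HH", ext[:4])
    codes = {v: k for k, v in LANG_CODES.items()}
    channels = []
    for i in range(count):
        ftype, flen = struct.unpack("!HH", ext[4 + 4 * i: 8 + 4 * i])
        channels.append((codes[ftype], flen))
    return ext_id, channels
```

The frame lengths recovered here are what lets the receiver slice the RTP payload back into per-channel encoded data.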
FIG. 8(a) is a flowchart of multi-channel encoding and packaging for a multilingual conference according to an embodiment of the present application. As shown in FIG. 8(a), on the terminal side, the original speech of the site's speaker is first picked up by the site's microphone and passed to the interpretation room on the terminal side; after translation by interpreters, the translated speech is fed into the different channels of the terminal. When the conference is bilingual, the left and right channels of the terminal carry the terminal speech corresponding to the two conference languages respectively, and stereo encoding is used. When the conference has more than two languages, the terminal encodes each channel separately to obtain channel information and encoded data — as shown in FIG. 8(a), channel information 1 to channel information n and encoded data 1 to encoded data n — where the channel information includes the language type and the length of the encoded data. The language type of each channel is written as the frame type, and the encoded-data length as the frame length, into the corresponding fields of the RTP header extension. Finally, the encoded data of the channels are arranged in the language order given by the header extension to form the RTP payload, producing a multi-channel encoded packet consisting of the RTP header, the RTP header extension, and the RTP payload, in which the extension flag in the RTP header indicates the presence of the extension header; the packet is then sent over the network. The MCU side works in the same way: the multi-channel mixed data output by the mixer is sent to the MCU's encoder, which encodes the mixer's multi-channel data according to the language information assigned to each channel when the mixer was created and then prepends the RTP header-extension information — language identifiers, encoded lengths, and so on — to the encoded data. In other embodiments, the multi-channel encoded packet may include only the RTP header and the RTP payload, with the payload carrying the data of each channel; each channel's data then includes that channel's language identifier, which indicates its language type, followed by its encoded data.
FIG. 8(b) is a flowchart of multi-channel unpacking and decoding for a multilingual conference according to an embodiment of the present application. As shown in FIG. 8(b), on the terminal side, the conference mixed-and-encoded packet sent by the MCU — consisting of the RTP header, the RTP header extension, and the RTP payload — is received. The extension flag in the RTP header indicates that the packet has a header extension, so the data of each channel is unpacked according to the header-extension format: the language type of each channel and the length of each channel's encoded data are first parsed from the header extension, the corresponding encoded data of each channel is located in the RTP payload, and the data is sent to multiple decoders, which output the conference sound of each channel. The MCU side works in the same way: the MCU receives the multi-channel encoded packet sent by a terminal, determines from the extension flag in the RTP header that the packet has a header extension, and unpacks each channel's data according to the header-extension format. It first parses the language type and encoded-data length of each channel from the header extension, obtaining channel information 1 to channel information n, then locates the corresponding encoded data of each channel in the RTP payload, obtaining encoded data 1 to encoded data n, and sends the data to multiple decoders, which output the conference sound of each channel to the mixer module for multi-channel mixing.
The present application is further described below through specific embodiments.
FIG. 9 is a flowchart of a conference implementation method according to a first embodiment of the present application. The method of this embodiment is executed by an MCU. As shown in FIG. 9, it includes:
Step 901: The MCU receives the multi-channel encoded packets sent by the terminals participating in the conference, and unpacks and decodes them to obtain multi-channel audio streams, where each channel of a multi-channel audio stream corresponds to an audio stream of one language type.
Because each terminal has already translated the original speech of its own site, the MCU does not need to perform translation. The MCU's decoder unpacks and decodes the multi-channel encoded packet of each terminal according to the steps shown in FIG. 8(b), and then feeds the audio stream of each channel into the mixer.
Step 902: The MCU determines the terminals that need to be mixed, and mixes the audio streams of the same language type across the determined terminals' multi-channel audio streams, obtaining a multi-channel mixed audio stream in which each channel corresponds to the mixed audio stream of one language type.
The MCU creates a multi-channel mixer whose number of channels equals the number of language types supported by the conference. The MCU feeds the decoded audio stream of each channel into the corresponding channel of the mixer, according to its language type, for mixing.
Taking a conference whose supported language types are Chinese and English as an example, suppose the terminals that need to be mixed are determined to be a first terminal and a second terminal. The Chinese and English speech of the first terminal and the Chinese and English speech of the second terminal are obtained; the Chinese speech of the first terminal is mixed with the Chinese speech of the second terminal to produce a Chinese mixed audio stream, and the English speech of the first terminal is mixed with the English speech of the second terminal to produce an English mixed audio stream. The Chinese mixed audio stream and the English mixed audio stream together form the multi-channel mixed audio stream.
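A minimal sketch of this per-language mixing in step 902; all names are hypothetical, and clipping/normalization of the summed samples is ignored for brevity:

```python
def mix_by_language(terminal_streams, languages):
    """terminal_streams: dict mapping terminal -> {language: list of samples}.
    Returns one mixed sample list per supported language (one mixer channel each)."""
    mixed = {}
    for lang in languages:
        # gather this language's channel from every terminal that supplies it
        chans = [st[lang] for st in terminal_streams.values() if lang in st]
        # sample-wise sum across terminals = the mix for this language channel
        mixed[lang] = [sum(vals) for vals in zip(*chans)]
    return mixed
```

With two terminals each contributing Chinese and English channels, the result is exactly the two-channel (Chinese mix, English mix) stream described above.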
Step 903: The MCU encodes the multi-channel mixed audio stream to obtain a multi-channel mixed-and-encoded packet, and sends the packet to the terminals.
For a terminal in the conference, the MCU only needs to establish a single audio channel for audio transmission with that terminal. The multi-channel mixed-and-encoded packet that the MCU sends to a given terminal does not include the audio of that terminal's own site.
With the conference implementation method provided in this embodiment, only one mixing process needs to be created: the number of mixers is small, the computational cost of mixing is low, CPU usage is reduced, and the original-language and translated-language mixes sent to the same site convey consistent information.
Step 902 is further described below. The mixing unit of the MCU includes a VAD module, a sorting module, a selection module, and a mixing-and-switching module. The VAD module estimates the SNR (signal-to-noise ratio) and energy of the conference's speaking terminals. The sorting module sorts all terminals of the conference according to the VAD results. The selection module picks from the sorted terminals the several terminals that satisfy the mixing conditions (for example, the three loudest). The mixing-and-switching module mixes the selected speaking terminals and produces the data to be sent to each participating port. The MCU's mixing strategy should ensure that no site hears its own sound, only the sound of the other sites. As shown in FIG. 10, the process includes:
Step 1001: Obtain the SNR and energy of each terminal.
Step 1002: Determine from each terminal's SNR and energy whether the terminal participates in sorting and mixing, and generate a terminal list.
This step excludes terminals that are not speaking: only speaking terminals participate in sorting and mixing. A terminal participates in sorting under two conditions: first, the terminal is not muted; second, the terminal is judged to be a speaking terminal. The terminal list contains the terminals that participate in sorting.
Step 1003: Sort the conference's terminal list.
The terminal list may be sorted according to the terminals' energy and signal-to-noise ratio.
Step 1004: Select, from the sorted terminal list, the terminals that satisfy the conditions to participate in mixing.
For example, the three loudest terminals may be selected as the terminals participating in mixing. Of course, more or fewer terminals may be selected as needed.
After the mixing terminals are selected, mixing is performed in one of two ways depending on the number of selected mixing ports: the number of mixing ports equals 1, or the number of mixing ports is greater than 1.
When the number of mixing ports equals 1, what matters is selecting the output of the mixing port: the input of one non-muted input port among the non-participating ports is selected as the output of the mixing port.
When the number of mixing ports is greater than 1, the mixing ports are mixed with one another, and the output to each non-participating port is the sum of all mixing-port inputs. For example, with four terminals A, B, C, and D, suppose the SNR and energy analysis determines that terminals A, B, and C satisfy the sorting conditions and may participate in mixing. The mixing output is then: A_out: B+C; B_out: A+C; C_out: A+B; D_out: A+B+C, where A_out: B+C means that the audio streams sent to the MCU by terminals B and C are mixed and sent to terminal A.
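The output rule above can be sketched as follows, using simple sample-wise addition (clipping ignored); the function and variable names are hypothetical:

```python
def mix_outputs(inputs, mixing):
    """inputs: dict terminal -> list of samples sent to the MCU;
    mixing: the terminals selected in step 1004 to participate in mixing.
    Each mixing terminal receives the sum of the *other* mixing terminals,
    so no site hears its own sound; every other terminal receives the full sum."""
    total = [sum(vals) for vals in zip(*(inputs[t] for t in mixing))]
    outputs = {}
    for term in inputs:
        if term in mixing:
            # subtract the terminal's own contribution from the full sum
            outputs[term] = [s - x for s, x in zip(total, inputs[term])]
        else:
            outputs[term] = list(total)
    return outputs
```

Computing the full sum once and subtracting each participant's own input reproduces A_out = B+C, B_out = A+C, C_out = A+B, D_out = A+B+C with a single pass.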
Because the mixer supports terminals with different channel counts and different sampling rates, data conversion between channel layouts and unification of sampling rates between mixing ports are also required during mixing.
In the detailed embodiments below, the same mixing strategy as in this embodiment can be used for the multilingual multi-channel mixing; it is not described again.
FIG. 11 is a flowchart of a conference implementation method according to an embodiment of the present application. As shown in FIG. 11, it includes:
Step 1101: A terminal receives the multi-channel mixed-and-encoded packet sent by the MCU.
Step 1102: The terminal parses the multi-channel mixed-and-encoded packet to obtain the multi-channel mixed audio stream.
With the conference implementation method provided in this embodiment, a terminal can obtain the mixed audio streams of all channels through a single channel, reducing resource usage.
In an optional embodiment, the method further includes: the terminal determines the language type used by each channel's audio stream according to language-type information carried in the multi-channel mixed-and-encoded packet, or determines the language type corresponding to each channel according to a preset channel-to-language mapping. That is, there are two ways to obtain the language type of each channel's audio stream. One is presetting: for example, the first channel is preset to Chinese, the second to English, the third to French, and so on. The MCU then uses the first channel of the multi-channel mixed-and-encoded packet to carry the Chinese audio stream, the second channel to carry the English audio stream, and the third channel to carry the French audio stream, and the terminal obtains the Chinese audio stream from the first channel, the English audio stream from the second channel, and the French audio stream from the third channel.
In an optional embodiment, the multi-channel encoded packet is a real-time transport protocol (RTP) packet, and the terminal determining the language type used by each channel's audio stream according to the language-type information carried in the multi-channel mixed-and-encoded packet includes:
The terminal obtains the language-type information of each channel's audio stream from the frame-type fields in the RTP header extension of the multi-channel encoded packet.
In an optional embodiment, the method further includes: after the terminal obtains the audio stream of each channel, if the language type of one channel's audio stream is the language type used by the terminal's site, the terminal plays that channel's audio stream.
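A minimal sketch of this channel selection on the terminal side; the names are hypothetical:

```python
def select_playback_channel(channels, site_language):
    """channels: list of (language, decoded_frames) obtained from the mix packet.
    Return the frames of the channel matching the local site's language, or None
    if no channel matches (in which case translation or subtitles apply)."""
    for lang, frames in channels:
        if lang == site_language:
            return frames  # route to the loudspeaker or a peripheral
    return None
```

When no channel matches, the terminal falls back to the translation or speech-to-text paths described above.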
In the related art, the original-speech mix received by a terminal is a mix of the several loudest speakers in the conference, while the translated-speech mix it receives contains only the translated speech for the language types that terminal supports, so the terminal cannot obtain the complete information of the translated speech of all sites. In other words, the information expressed by the original-speech mix received at a site is inconsistent with the information expressed by the translated-speech mix. In this embodiment, the information expressed by the original speech received by the terminal is consistent with the information expressed by the translated-speech mix.
FIG. 12 is a schematic diagram of a videoconferencing application scenario according to a second embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
In the application scenario shown in FIG. 12, a Chinese site and an English site hold a videoconference, and each site translates its original speech according to the language types supported by the conference. In this scenario, the languages negotiated for the conference are Chinese and English. The original speech of the Chinese site is Chinese, so the Chinese site's left-channel source is the original Chinese speech and its right-channel source is the English speech obtained by translating the original speech. The original speech of the English site is English, so the English site's left-channel source is the Chinese speech obtained by translating the original English speech and its right-channel source is the site's original speech, namely the English speech. In this embodiment, each site's terminal supports two language types, and the interaction between the MCU and the terminals of the sites is as follows:
The terminal of each site acquires the original speech of its own site and, according to the language types supported by the multilingual conference, uses the original speech directly as the left-channel input, or directly as the right-channel input, or uses its translation as the left- or right-channel source; the result is stereo-encoded by the terminal and sent to the MCU. As shown in FIG. 12, the terminal of the Chinese site uses the site's original speech (Chinese) directly as the left-channel source and, after translating the original speech into English, uses the translation as the right-channel source; stereo encoding forms one audio stream, which is sent to the MCU over one audio channel. The terminal of the English site uses the site's original speech (English) directly as the right-channel source and, after translating the original speech into Chinese, uses the translation as the left-channel source; encoding forms one audio stream, which is sent to the MCU over one audio channel.
The MCU mixes the multi-channel audio streams sent by the terminals to generate a multi-channel mixed encoding packet and sends it to each terminal. The MCU's mixing policy should ensure that no site hears its own voice, only the voices of the other sites. In addition, when multiple sites are speaking, the sites participating in the mix may be limited to the several loudest sites, for example the three loudest.
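The mixing policy just described — exclude the receiving site's own voice and limit the mix to the loudest sites — can be sketched as follows. The data layout (per-site lists of per-channel sample lists) and all names are assumptions for illustration, not the MCU's actual implementation:

```python
def mix_for_site(receiver, site_channels, site_volume, top_n=3):
    """Build the multi-channel mix the MCU would send to `receiver`.

    site_channels: {site: [ch0_samples, ch1_samples, ...]}, where each
        channel holds one language and all sites share the channel order.
    site_volume: {site: loudness}, used to pick the loudest speakers.

    The receiver's own site is excluded, and only the top_n loudest
    other sites contribute to the mix.
    """
    others = [s for s in site_channels if s != receiver]
    loudest = sorted(others, key=lambda s: site_volume[s], reverse=True)[:top_n]

    any_site = next(iter(site_channels.values()))
    n_ch, n_samples = len(any_site), len(any_site[0])
    mixed = [[0] * n_samples for _ in range(n_ch)]
    for site in loudest:
        for ch in range(n_ch):
            for i, sample in enumerate(site_channels[site][ch]):
                mixed[ch][i] += sample  # simple additive mix per language channel
    return mixed
```

Because same-language streams all sit on the same channel index, one pass over the channels mixes every language at once, matching the later observation that a single mixer suffices.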
Each terminal receives the multi-channel mixed encoding packet (which in this embodiment may also be called a stereo mix) sent by the MCU, decodes the received multi-channel mixed audio stream (two channels in this embodiment), and selectively plays it through the terminal's left and right channels.
If one channel of the left/right stereo mix that the MCU sends to a site's terminal is in the same language type as that supported by the site, the site's terminal can play that channel's sound directly in the site. For example, the left channel of the stereo mix received by the Chinese site is the Chinese conference mix, so the terminal of the Chinese site can play the left-channel sound directly in its site; likewise, the right channel of the stereo mix received by the English site is the English conference mix, so the terminal of the English site can play the right-channel sound directly in its site.
A site can process the received stereo conference mix according to its specific needs: some sites need to translate it before playback, while others can play the original mix directly without translation. In this embodiment, every terminal can listen to the left- or right-channel sound directly without translation. In another embodiment, the conference mix can also be converted to text by speech-to-text processing to form subtitles displayed on the terminal screen.
FIG. 13 is a schematic diagram of a videoconferencing application scenario according to a third embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
The difference between this embodiment and the second embodiment is that a site in the conference may itself communicate in multiple languages. In the application scenario shown in FIG. 13, a Chinese site, an English site, and a mixed Chinese-English site hold a videoconference; participants in the same site may use several language types, as in the mixed Chinese-English site. Each site translates its original speech according to the language types supported by the conference. In this scenario, the languages supported by the multilingual conference are Chinese and English. The original speech of the Chinese site is Chinese, so the Chinese site's left-channel source is the original Chinese speech and its right-channel source is the English speech translated from the original speech; the original speech of the English site is English, so the English site's left-channel source is the Chinese speech translated from the original English speech and its right-channel source is the original English speech. A site may speak in multiple languages, as in the mixed Chinese-English site, whose original speech accordingly contains multiple language types; that site's left-channel source is the Chinese speech obtained by translating the original speech, and its right-channel source is the English speech obtained by translating the original speech.
In this embodiment, the translated language types of each site's terminal are fixed; that is, the terminal translates its site's original speech into speech of the fixed language types supported by the conference. In this embodiment, while the conference is in progress, the MCU and the terminals of the sites operate as follows:
The terminal of each site acquires the original speech of its own site and, according to the language types supported by the multilingual conference, uses the original speech directly as the left-channel input, or directly as the right-channel input, or uses its translation as the left- or right-channel source; the result is stereo-encoded by the terminal and sent to the MCU. For example, the terminal of the Chinese site uses the site's original speech (Chinese) as the left-channel source and, after translating the original speech into English, uses the translation as the right-channel source; stereo encoding forms one audio stream (that is, a multi-channel encoded packet), which is sent to the MCU over one audio channel. The terminal of the English site uses the site's original speech (English) as the right-channel source and, after translating the original speech into Chinese, uses the translation as the left-channel source; stereo encoding forms one audio stream (that is, a multi-channel encoded packet), which is sent to the MCU over one audio channel. The terminal of the mixed Chinese-English site translates the site's original speech (mixed Chinese and English) into Chinese as the left-channel source and into English as the right-channel source; stereo encoding forms one audio stream, which is sent to the MCU over one audio channel.
The MCU mixes the stereo multilingual speech sent by the terminals and sends the stereo mix (a multi-channel mixed encoding packet) to each terminal. The MCU's mixing policy should ensure that no site hears its own voice, only the voices of the other sites. In addition, when multiple sites are speaking, the sites participating in the mix may be set to the three loudest sites.
Each terminal receives the stereo mix sent by the MCU, decodes it, and selectively plays it through the terminal's left and right channels.
One channel of the left/right stereo mix that the MCU sends to a site's terminal may be in the same language type as that supported by the site. For example, the left channel of the stereo mix received by the Chinese site is the Chinese conference mix, so the terminal of the Chinese site can play the left-channel sound directly in its site; the right channel of the stereo mix received by the English site is the English conference mix, so the terminal of the English site can play the right-channel sound directly in its site. Likewise, the left channel of the stereo mix received by the mixed Chinese-English site is the Chinese conference mix and the right channel is the English conference mix, so the participants at the mixed Chinese-English site can each listen to the conference mix on the left or right channel according to the language they are familiar with.
As can be seen from the second and third embodiments, the conference implementation method proposed in the present application not only effectively reduces the workload of terminal-side interpreters, but also effectively satisfies the listening and speaking needs of the participants at each site in a multilingual conference environment.
FIG. 14 is a schematic diagram of a videoconferencing application scenario according to a fourth embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
The difference between this embodiment and the third embodiment is that the conference supports three language types and the sites can translate among all of them, making it a trilingual Chinese-English-French conference. In this example, after the original site sound is captured by the audio equipment of the Chinese, English, or French site, it undergoes two translation conversions. For example, at the Chinese site, the original speech serves as the input of terminal channel 1, the Chinese-to-English translated speech as the input of terminal channel 2, and the Chinese-to-French translated speech as the input of terminal channel 3; these are separately encoded and then packed, as shown in FIG. 8(a). The sound input processing steps of the English and French sites are analogous.
In this embodiment, while the conference is in progress, the MCU and the terminals of the sites operate as follows:
(1) The terminal of each site acquires the original speech of its own site and the language types supported by the conference, determines the kinds of translation needed (such as English-to-Chinese, Chinese-to-English, or others), uses the original speech together with the two translated speech streams as the input sources of channels 1 to 3, and packs them into a multi-channel encoded packet sent to the MCU as one audio stream.
(2) The MCU unpacks and decodes the multi-channel encoded packets sent by the terminals and, according to the channel language types obtained after decoding, feeds them into a multi-channel mixer to obtain a multi-channel mixed audio stream, which is then encoded and packed into a multi-channel mixed encoding packet (see FIG. 8(a) for the specific implementation). Finally, the multi-channel mixed encoding packet is sent to each terminal. In this embodiment, the MCU's mixing can be performed quickly: only one mixer is needed to complete the multi-channel mixing.
(3) Each terminal receives the multi-channel mixed encoding packet sent by the MCU, decodes it and splits it into the speech of three channels, decodes each into a linear code stream, and the terminal side listens to the conference mix as needed.
Since the multi-channel mixed encoding packet that the MCU sends to a site's terminal covers the language types supported by all sites, the Chinese, English, and French terminals in the conference can listen to or play the conference speech directly by selecting different channels. In this application scenario, every terminal can freely choose which of several languages to listen to.
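The per-channel packing used in the steps above — each channel contributing a frame type (language) and frame length followed by its encoded data, as in FIG. 8(a) — can be sketched with a hypothetical byte layout. The exact field widths and language ids below are assumptions for illustration, not taken from the specification:

```python
import struct

# Illustrative language ids; a real system would negotiate these.
LANG_IDS = {"zh": 1, "en": 2, "fr": 3}

def pack_channels(frames):
    """frames: list of (lang, encoded_bytes) -> one packet (bytes).

    Per channel: 1-byte language id, 2-byte big-endian frame length,
    then the channel's encoded frame.
    """
    out = bytearray()
    for lang, data in frames:
        out += struct.pack("!BH", LANG_IDS[lang], len(data))
        out += data
    return bytes(out)

def unpack_channels(packet):
    """Inverse of pack_channels: packet -> list of (lang, encoded_bytes)."""
    ids = {v: k for k, v in LANG_IDS.items()}
    frames, off = [], 0
    while off < len(packet):
        lang_id, length = struct.unpack_from("!BH", packet, off)
        off += 3
        frames.append((ids[lang_id], packet[off:off + length]))
        off += length
    return frames
```

A terminal would pack its original and translated frames this way before sending one audio stream to the MCU, and split a received packet back into per-language frames with the inverse routine.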
FIG. 15 is a schematic diagram of a videoconferencing application scenario according to a fifth embodiment of the present application. This embodiment describes in detail the technical solution of the method embodiment supporting multilingual videoconferencing with reference to a specific application scenario.
The difference between this embodiment and the fourth embodiment is that the sites in the conference use five language types: Chinese, English, German, Japanese, and French. During the signaling interaction phase of the conference, the mixed German-Japanese-French site can translate to and from Chinese and English, but the other sites cannot translate German, Japanese, or French; therefore, the language types supported by the multilingual conference are Chinese and English. When the German-Japanese-French site participates in the conference, it first translates the site's sound into Chinese and English, which serve as the source inputs of channel 1 and channel 2, as shown in FIG. 15. The speech sent by the Chinese site is the original Chinese speech and the English speech translated from the original speech; the speech sent by the English site is the Chinese speech translated from the original speech and the original English speech. The mixed Chinese-English site translates its original mixed Chinese-English speech into Chinese speech and English speech respectively; likewise, the mixed German-Japanese-French site translates its mixed German-Japanese-French speech into Chinese speech and English speech respectively.
In this embodiment, while the conference is in progress, the MCU and the terminals of the sites operate as follows:
(1) The terminal of each site acquires the original speech of its own site and the language types supported by the multilingual conference, and determines the kinds of translation needed (such as English-to-Chinese, Chinese-to-English, or others). The output speech of the two language types serves as the left- and right-channel sources and is encoded into a multi-channel encoded packet sent to the MCU as one audio stream.
(2) The MCU mixes the multi-channel encoded packets sent by the terminals to obtain a multi-channel mixed audio stream, encodes and packs it, and sends it to each terminal. The MCU's mixing can be performed quickly: only one mixer is needed to complete the multilingual mixing.
(3) Each terminal receives the multi-channel mixed encoding packet sent by the MCU, decodes it, splits it into the left and right channels, and listens to the multilingual speech as needed.
If the mixed audio streams in the mixed encoding packet that the MCU sends to a site's terminal are in the same language types as those supported by the site — for example, the multilingual audio received by the Chinese site in the Chinese-English multilingual conference is Chinese and English — then the Chinese, English, and mixed Chinese-English terminals in the conference can listen to or play the conference speech directly by selecting different channels.
A terminal for which the language types in the received mixed encoding packet do not match the site's language types can perform translation conversion according to its specific needs; for example, the German-Japanese-French site needs to translate the received Chinese mix or English mix before playback.
In this embodiment, every site has a translation function and can translate its original speech before output. After the MCU mixes the multilingual speech output by the terminals, only one mixer and one processing flow are needed; no speech recognition module needs to be added before mixing, since the language type can be identified from the frame type information. The participants at each site may use one or several languages for communication and can all communicate without barriers; the scheme is easy to implement and highly practical.
An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 16, including:
a code stream generating unit 1601, configured to acquire the original speech of the site where it is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type;
a first encoding unit 1602, configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet; and
a first sending unit 1603, configured to send the multi-channel encoded packet to a multipoint control unit.
The conference implementation apparatus further includes a language type determining unit 1604, configured to determine the language types supported by the conference. For example, the language type determining unit 1604 acquires the language types supported by the conference from the multipoint control unit.
The multi-channel audio stream generated by the code stream generating unit 1601 includes an audio stream of each language type supported by the conference.
In an embodiment, the code stream generating unit acquiring the original speech of the site where it is located and the translated speech corresponding to the original speech and generating the multi-channel audio stream includes one of the following, or a combination thereof:
when the language type of the original speech is one of the language types supported by the conference, using the audio stream corresponding to the original speech as the audio stream of one channel, and using the translated speech of the original speech in each of the conference-supported language types other than the language type of the original speech as the audio streams of the other channels, to generate the multi-channel audio stream;
when the language type of the original speech is not a language type supported by the conference, using the translated speech of the original speech in each language type supported by the conference as the audio stream of one channel respectively, to generate the multi-channel audio stream.
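A minimal sketch of these two generation rules, assuming a `translate` helper standing in for the site's translation function (all names are illustrative):

```python
def build_channel_streams(original_lang, original_stream, supported_langs, translate):
    """Return an ordered list of (lang, stream), one entry per channel."""
    if original_lang in supported_langs:
        # Case 1: the original language is supported by the conference.
        # Keep the original on one channel; translate into the others.
        channels = [(original_lang, original_stream)]
        channels += [(lang, translate(original_stream, lang))
                     for lang in supported_langs if lang != original_lang]
    else:
        # Case 2: the original language is not supported.
        # Every channel carries a translation into a supported language.
        channels = [(lang, translate(original_stream, lang))
                    for lang in supported_langs]
    return channels
```

Either way, the result is one audio stream per supported language, ready to be encoded into a single multi-channel encoded packet.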
An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 17, including:
a first receiving unit 1701, configured to receive a multi-channel mixed encoding packet sent by a multipoint control unit; and
a first parsing unit 1702, configured to parse the multi-channel mixed encoding packet to obtain a multi-channel mixed audio stream.
In an embodiment, the first parsing unit 1702 is further configured to determine the language type used by the audio stream of each channel according to language type information carried in the multi-channel mixed encoding packet, or determine the language type corresponding to each channel according to a preset correspondence between channels and language types.
In an embodiment, the first parsing unit 1702 obtaining the encoded data of each channel includes:
the first parsing unit 1702 obtaining, from a frame type field in the real-time transport protocol header extension of the multi-channel encoded packet, the language type information used by the audio stream of each channel; obtaining, from a frame length field in the real-time transport protocol header extension, the length of the encoded data of each channel; and obtaining the encoded data of each channel from the real-time transport protocol payload of the multi-channel encoded packet according to the length of the encoded data of each channel.
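A sketch of this parsing step, assuming a simplified header extension in which each channel contributes a one-byte frame type (language id) and a two-byte big-endian frame length, with the per-channel encoded data concatenated in the RTP payload in the same order. The field widths are assumptions for illustration:

```python
import struct

def parse_multichannel(ext_bytes, payload, n_channels):
    """Split an RTP payload into per-channel frames.

    ext_bytes: header-extension bytes holding (frame type, frame length)
        pairs, one pair per channel.
    payload: concatenated encoded data of all channels.
    Returns a list of (frame_type, encoded_data) per channel.
    """
    channels, ext_off, pay_off = [], 0, 0
    for _ in range(n_channels):
        frame_type, frame_len = struct.unpack_from("!BH", ext_bytes, ext_off)
        ext_off += 3
        channels.append((frame_type, payload[pay_off:pay_off + frame_len]))
        pay_off += frame_len
    return channels
```

The frame type identifies the language of each channel, and the frame length tells the parser where one channel's encoded data ends and the next begins.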
An embodiment of the present application provides a conference implementation apparatus, as shown in FIG. 18, including:
a second receiving unit 1801, configured to receive multi-channel encoded packets sent by terminals participating in the conference;
a second parsing unit 1802, configured to parse the multi-channel encoded packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;
a mixing unit 1803, configured to determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the determined terminals' multi-channel audio streams to obtain a multi-channel mixed audio stream, in which one channel corresponds to a mixed audio stream of one language type;
a second encoding unit 1804, configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoding packet; and
a second sending unit 1805, configured to send the multi-channel mixed encoding packet to the terminals participating in the conference.
An embodiment of the present application provides a conference implementation device, including a memory and a processor, the memory storing a program which, when read and executed by the processor, implements any of the conference implementation methods described above.
An embodiment of the present application provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the conference implementation methods described above.
The computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
An embodiment of the present application provides a conference implementation system, as shown in FIG. 19, including an MCU 1901 and at least one terminal 1902, wherein:
the terminal 1902 is configured to acquire the original speech of the site where the terminal is located and the translated speech corresponding to the original speech, and generate a multi-channel audio stream in which one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive a multi-channel mixed encoding packet sent by the multipoint control unit; and parse the multi-channel mixed encoding packet to obtain a multi-channel mixed audio stream; and
the MCU 1901 is configured to receive the multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine the terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the determined terminals' multi-channel audio streams to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoding packet; and send the multi-channel mixed encoding packet to the terminals participating in the conference.
For the detailed implementation of the terminal and the MCU, refer to the foregoing embodiments; details are not repeated here.
Although the embodiments disclosed in the present application are as described above, they are adopted only to facilitate understanding of the present application and are not intended to limit it. Any person skilled in the art to which the present application pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present application; however, the scope of patent protection of the present application shall still be subject to the scope defined by the appended claims.
Industrial applicability
Unlike the related art, in which the original speech and the translated speech are transmitted separately over different channels, the embodiments of the present application generate a multi-channel encoded packet from speech of different language types before sending, so that only one channel is needed, thereby reducing port occupation.

Claims (22)

  1. A conference implementation method, comprising:
    acquiring original speech of a conference site where a terminal is located and translated speech corresponding to the original speech, and generating a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type;
    encoding the multi-channel audio stream to obtain a multi-channel encoded packet; and
    sending the multi-channel encoded packet to a multipoint control unit.
  2. The conference implementation method according to claim 1, wherein the method further comprises: determining language types supported by the conference;
    wherein the multi-channel audio stream includes an audio stream of each language type supported by the conference.
  3. The conference implementation method according to claim 2, wherein determining the language types supported by the conference comprises:
    obtaining the language types supported by the conference from the multipoint control unit.
  4. The conference implementation method according to claim 2, wherein acquiring the original speech of the conference site where the terminal is located and the translated speech corresponding to the original speech, and generating the multi-channel audio stream, comprises one or a combination of the following:
    when the language type of the original speech is one of the language types supported by the conference, using the audio stream corresponding to the original speech as the audio stream of one channel, and using the translated speech of the original speech in each language type supported by the conference other than the language type of the original speech as the audio streams of the other channels, to generate the multi-channel audio stream; and
    when the language type of the original speech is not a language type supported by the conference, using the translated speech of the original speech in each language type supported by the conference as the audio stream of one channel respectively, to generate the multi-channel audio stream.
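The channel-assembly rule of this claim can be sketched as follows. This is an illustrative sketch only: the data representation (strings for language types, a dict from language type to audio stream) is an assumption, not the embodiments' actual data structures.

```python
def build_channel_streams(original_lang, original_stream, translations, supported_langs):
    """Map each conference-supported language type to one channel's audio stream.

    original_lang:   language type of the site's original speech
    original_stream: audio stream of the original speech
    translations:    dict mapping language type -> translated audio stream
    supported_langs: language types supported by the conference
    """
    channels = {}
    if original_lang in supported_langs:
        # Original speech occupies its own channel; translations fill the rest.
        channels[original_lang] = original_stream
        for lang in supported_langs:
            if lang != original_lang:
                channels[lang] = translations[lang]
    else:
        # Original language not supported: every channel carries a translation.
        for lang in supported_langs:
            channels[lang] = translations[lang]
    return channels
```

Each key of the returned dict then becomes one channel of the multi-channel audio stream described in claim 1.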
  5. The conference implementation method according to any one of claims 1 to 4, wherein each channel in the multi-channel encoded packet corresponds to an audio stream of a preset language type; or the multi-channel encoded packet carries language type information for each channel.
  6. The conference implementation method according to any one of claims 1 to 4, wherein the multi-channel encoded packet is a real-time transport protocol message comprising a real-time transport protocol header, a real-time transport protocol header extension, and a real-time transport protocol payload; the real-time transport protocol header extension includes a length field and one or more frame information fields, the length field carrying the number of frame information fields; each frame information field includes a frame type field and a frame length field, the frame type field carrying the language type of one channel and the frame length field carrying the length of the encoded data of that channel; and the real-time transport protocol payload carries the encoded data of each channel.
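The packet layout described in claim 6 can be sketched as follows. The sketch builds only the header-extension and payload portions; the fixed real-time transport protocol header, the 2-byte field widths, and the numeric language-type codes are assumptions chosen for illustration, not the claim's exact on-wire format.

```python
import struct

def pack_header_extension(channels):
    """channels: list of (language_type_code, encoded_data) tuples, one per channel.

    Header extension: a length field carrying the number of frame information
    fields, then one (frame type, frame length) pair per channel.
    Payload: the concatenated encoded data of every channel.
    """
    ext = struct.pack("!H", len(channels))  # length field: number of frame info fields
    payload = b""
    for lang_code, data in channels:
        ext += struct.pack("!HH", lang_code, len(data))  # frame type, frame length
        payload += data
    return ext, payload
```

A receiver can walk the frame information fields in order and slice the payload by the carried frame lengths to recover each channel's encoded data.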
  7. A conference implementation apparatus, comprising:
    a code stream generating unit configured to acquire original speech of a conference site where a terminal is located and translated speech corresponding to the original speech, and generate a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type;
    a first encoding unit configured to encode the multi-channel audio stream to obtain a multi-channel encoded packet; and
    a first sending unit configured to send the multi-channel encoded packet to a multipoint control unit.
  8. A conference implementation method, comprising:
    receiving multi-channel encoded packets sent by terminals participating in a conference, and parsing the packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;
    determining terminals whose audio needs to be mixed, and mixing the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;
    encoding the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and
    sending the multi-channel mixed encoded packet to the terminals participating in the conference.
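The per-language mixing step can be sketched as below. Representing audio as lists of 16-bit PCM samples and mixing by saturating addition is an illustrative simplification; the embodiments do not specify a particular mixing algorithm.

```python
def mix_same_language(terminal_streams):
    """terminal_streams: list of dicts, one per terminal selected for mixing,
    each mapping language type -> list of PCM samples (ints).
    Frames are assumed to be of equal length per language.
    Returns one dict mapping language type -> mixed sample list,
    summing samples per language and clamping to the 16-bit range."""
    mixed = {}
    for streams in terminal_streams:
        for lang, samples in streams.items():
            acc = mixed.setdefault(lang, [0] * len(samples))
            for i, s in enumerate(samples):
                acc[i] = max(-32768, min(32767, acc[i] + s))
    return mixed
```

The result is one mixed stream per language type, which maps naturally onto the channels of the multi-channel mixed audio stream.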
  9. The conference implementation method according to claim 8, wherein:
    the multi-channel mixed encoded packet carries language type information indicating the language type used by the mixed audio stream of each channel; or each channel in the multi-channel mixed encoded packet corresponds to a mixed audio stream of a preset language type.
  10. The conference implementation method according to claim 8, wherein the multi-channel mixed encoded packet is a real-time transport protocol message comprising a real-time transport protocol header, a real-time transport protocol header extension, and a real-time transport protocol payload; the real-time transport protocol header extension includes a length field and one or more frame information fields, the length field indicating the number of frame information fields; each frame information field includes a frame type field and a frame length field, the frame type field carrying the language type of one channel and the frame length field carrying the length of the encoded data of that channel; and the real-time transport protocol payload carries the encoded data of each channel.
  11. The conference implementation method according to any one of claims 8 to 10, wherein the method further comprises: determining language types supported by the conference, and sending the language types supported by the conference to the terminals participating in the conference.
  12. The conference implementation method according to claim 11, wherein determining the language types supported by the conference comprises at least one of the following:
    when a conference site has the highest priority, taking the speaking language of that site as a language type supported by the conference;
    when the speaking language of a conference site is the language type used most among all sites in the conference, taking the speaking language of that site as a language type supported by the conference; and
    when the speaking language of a conference site can be mutually translated with those of the other sites, taking the speaking language of that site as a language type supported by the conference.
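The "most used language" rule from this claim can be sketched as follows. The data model (a flat list holding the speaking language of each site) and the tie-breaking by first appearance are illustrative assumptions.

```python
from collections import Counter

def most_used_language(site_languages):
    """site_languages: list of the speaking language of each conference site.
    Returns the language type used by the most sites, as one candidate
    language type to be supported by the conference.
    Ties are broken by first appearance in the list (an illustrative choice)."""
    counts = Counter(site_languages)
    # max() returns the first element reaching the maximum count.
    return max(site_languages, key=lambda lang: counts[lang])
```

The highest-priority-site and mutual-translation rules would similarly map site attributes to candidate language types; they are omitted here because they depend on conference policy data not specified in the claim.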
  13. The conference implementation method according to claim 11, wherein the multi-channel mixed audio stream includes a mixed audio stream of each language type supported by the conference.
  14. A conference implementation apparatus, comprising:
    a second receiving unit configured to receive multi-channel encoded packets sent by terminals participating in a conference;
    a second parsing unit configured to parse the multi-channel encoded packets to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type;
    a mixing unit configured to determine terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream;
    a second encoding unit configured to encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and
    a second sending unit configured to send the multi-channel mixed encoded packet to the terminals participating in the conference.
  15. A conference implementation method, comprising:
    receiving a multi-channel mixed encoded packet sent by a multipoint control unit; and
    parsing the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  16. The conference implementation method according to claim 15, wherein the method further comprises: determining the language type of each channel according to language type information carried in the multi-channel mixed encoded packet, or determining the language type corresponding to each channel according to a preset correspondence between channels and language types.
  17. The conference implementation method according to claim 15 or 16, wherein parsing the multi-channel mixed encoded packet to obtain the multi-channel mixed audio stream comprises:
    when the multi-channel mixed encoded packet is a real-time transport protocol message, obtaining the language type information of the audio stream of each channel from the frame type field in the real-time transport protocol header extension of the packet, obtaining the length of the encoded data of each channel from the frame length field in the real-time transport protocol header extension, extracting the encoded data of each channel from the real-time transport protocol payload of the packet according to the length of the encoded data of each channel, and decoding the encoded data of each channel to obtain the mixed audio stream of that channel.
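The parsing procedure in this claim is the inverse of the packing in claim 6 and can be sketched as follows; the 2-byte field widths and numeric language-type codes are the same illustrative assumptions, not the claim's exact on-wire format.

```python
import struct

def parse_header_extension(ext, payload):
    """Recover (language_type_code, encoded_data) per channel.

    ext:     header extension bytes -- a 2-byte count of frame information
             fields, followed by one (frame type, frame length) pair each.
    payload: concatenated encoded data of all channels, in channel order.
    """
    count = struct.unpack_from("!H", ext, 0)[0]
    channels, ext_off, pay_off = [], 2, 0
    for _ in range(count):
        # Frame type carries the channel's language type; frame length
        # tells how many payload bytes belong to this channel.
        lang_code, length = struct.unpack_from("!HH", ext, ext_off)
        ext_off += 4
        channels.append((lang_code, payload[pay_off:pay_off + length]))
        pay_off += length
    return channels
```

Each recovered byte string would then be passed to the audio decoder to obtain that channel's mixed audio stream.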
  18. A conference implementation apparatus, comprising:
    a first receiving unit configured to receive a multi-channel mixed encoded packet sent by a multipoint control unit; and
    a first parsing unit configured to parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream.
  19. A conference implementation device, comprising a memory and a processor, wherein the memory stores a program that, when read and executed by the processor, implements the conference implementation method according to any one of claims 1 to 6 and 15 to 17.
  20. A conference implementation device, comprising a memory and a processor, wherein the memory stores a program that, when read and executed by the processor, implements the conference implementation method according to any one of claims 8 to 13.
  21. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the conference implementation method according to any one of claims 1 to 6, 8 to 13, and 15 to 17.
  22. A conference implementation system, comprising a multipoint control unit and at least one terminal, wherein:
    the terminal is configured to: acquire original speech of a conference site where the terminal is located and translated speech corresponding to the original speech, and generate a multi-channel audio stream, wherein one channel corresponds to an audio stream of one language type; encode the multi-channel audio stream to obtain a multi-channel encoded packet; send the multi-channel encoded packet to the multipoint control unit; receive a multi-channel mixed encoded packet sent by the multipoint control unit; and parse the multi-channel mixed encoded packet to obtain a multi-channel mixed audio stream; and
    the multipoint control unit is configured to: receive multi-channel encoded packets sent by the terminals participating in the conference and parse them to obtain multi-channel audio streams, wherein each channel of a multi-channel audio stream corresponds to an audio stream of one language type; determine terminals whose audio needs to be mixed, and mix the audio streams of the same language type in the multi-channel audio streams of the determined terminals to obtain a multi-channel mixed audio stream; encode the multi-channel mixed audio stream to obtain a multi-channel mixed encoded packet; and send the multi-channel mixed encoded packet to the terminals participating in the conference.
PCT/CN2018/096532 2017-12-29 2018-07-20 Conference implementation method, device, apparatus and system, and computer readable storage medium WO2019128204A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711482042.3A CN108076306B (en) 2017-12-29 2017-12-29 Conference implementation method, device, equipment and system and computer readable storage medium
CN201711482042.3 2017-12-29

Publications (1)

Publication Number Publication Date
WO2019128204A1 true WO2019128204A1 (en) 2019-07-04

Family

ID=62156131

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/096532 WO2019128204A1 (en) 2017-12-29 2018-07-20 Conference implementation method, device, apparatus and system, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108076306B (en)
WO (1) WO2019128204A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014998A (en) * 2021-02-03 2021-06-22 深圳创维-Rgb电子有限公司 Audio output method, device, television and computer readable storage medium
CN113472743A (en) * 2021-05-28 2021-10-01 引智科技(深圳)有限公司 Multilingual conference sharing and personalized editing method
CN114827101A (en) * 2022-04-13 2022-07-29 京东科技信息技术有限公司 Audio processing method and device, electronic equipment and storage medium

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108076306B (en) * 2017-12-29 2021-09-17 中兴通讯股份有限公司 Conference implementation method, device, equipment and system and computer readable storage medium
CN109379490B (en) * 2018-09-30 2021-02-05 Oppo广东移动通信有限公司 Audio playing method and device, electronic equipment and computer readable medium
CN109348306A (en) * 2018-11-05 2019-02-15 努比亚技术有限公司 Video broadcasting method, terminal and computer readable storage medium
CN111354356B (en) * 2018-12-24 2024-04-30 北京搜狗科技发展有限公司 Voice data processing method and device
CN110166729B (en) * 2019-05-30 2021-03-02 上海赛连信息科技有限公司 Cloud video conference method, device, system, medium and computing equipment
CN110324565A (en) * 2019-06-06 2019-10-11 浙江华创视讯科技有限公司 Audio-frequency inputting method, device, conference host, storage medium and electronic device
CN112218020B (en) * 2019-07-09 2023-03-21 海信视像科技股份有限公司 Audio data transmission method and device for multi-channel platform
WO2021004048A1 (en) * 2019-07-09 2021-01-14 海信视像科技股份有限公司 Display device and audio data transmission method
CN110677406A (en) * 2019-09-26 2020-01-10 上海译牛科技有限公司 Simultaneous interpretation method and system based on network
CN113743132A (en) * 2020-05-14 2021-12-03 大富科技(安徽)股份有限公司 Intelligent terminal, translation method thereof and storage medium
CN112188144B (en) * 2020-09-14 2023-03-24 浙江华创视讯科技有限公司 Audio transmission method and device, storage medium and electronic device
CN115314660A (en) * 2021-05-07 2022-11-08 阿里巴巴新加坡控股有限公司 Processing method and device for audio and video conference
CN114912469B (en) * 2022-05-26 2023-03-31 东北农业大学 Information communication method for converting Chinese and English languages and electronic equipment
CN117749947A (en) * 2023-12-22 2024-03-22 广东保伦电子股份有限公司 Multi-terminal protocol-based multi-party call processing method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1937664A (en) * 2006-09-30 2007-03-28 华为技术有限公司 System and method for realizing multi-language conference
US20090251529A1 (en) * 2008-04-02 2009-10-08 Freeport Technologies Network management server for managing multiple operating modes of a conferencing network with different sets of policies
CN101702762A (en) * 2006-09-30 2010-05-05 华为技术有限公司 Multipoint control unit for realizing multi-language conference and conference terminal
CN104853272A (en) * 2015-05-13 2015-08-19 视讯联合科技股份有限公司 Multifunctional DSP wireless headphone with interconnected interaction
CN108076306A (en) * 2017-12-29 2018-05-25 中兴通讯股份有限公司 Meeting implementation method, device, equipment and system, computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102226944B (en) * 2011-05-25 2014-11-12 贵阳朗玛信息技术股份有限公司 Audio mixing method and equipment thereof
KR20130101823A (en) * 2012-03-06 2013-09-16 한승묵 Digital device and video call performing method
CN102821259B (en) * 2012-07-20 2016-12-21 冠捷显示科技(厦门)有限公司 There is TV system and its implementation of multi-lingual voiced translation


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014998A (en) * 2021-02-03 2021-06-22 深圳创维-Rgb电子有限公司 Audio output method, device, television and computer readable storage medium
CN113014998B (en) * 2021-02-03 2023-08-04 深圳创维-Rgb电子有限公司 Audio output method, device, television and computer readable storage medium
CN113472743A (en) * 2021-05-28 2021-10-01 引智科技(深圳)有限公司 Multilingual conference sharing and personalized editing method
CN113472743B (en) * 2021-05-28 2023-05-26 引智科技(深圳)有限公司 Multilingual conference sharing and personalized editing method
CN114827101A (en) * 2022-04-13 2022-07-29 京东科技信息技术有限公司 Audio processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108076306A (en) 2018-05-25
CN108076306B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
WO2019128204A1 (en) Conference implementation method, device, apparatus and system, and computer readable storage medium
US8531994B2 (en) Audio processing method, system, and control server
KR100880150B1 (en) Multi-point video conference system and media processing method thereof
CN113746808B (en) Converged communication method, gateway, electronic equipment and storage medium for online conference
US10068581B2 (en) Method and arrangement for providing a backwards compatible payload format
US8228363B2 (en) Method and system for conducting continuous presence conferences
US8169937B2 (en) Managing a packet switched conference call
US8600530B2 (en) Method for determining an audio data spatial encoding mode
CN109640028B (en) Method and device for carrying out conference combining on multiple video networking terminals and multiple Internet terminals
US20050262260A1 (en) Two-way audio/video conference system
WO2012155660A1 (en) Telepresence method, terminal and system
WO2010083737A1 (en) Method and apparatus for processing voice signal, method and apparatus for transmitting voice signal
CN110062191B (en) Multi-party group meeting method and server based on video network
WO2012167638A1 (en) Media data control method and apparatus
CN110475094B (en) Video conference processing method and device and readable storage medium
CN111327580A (en) Message transmission method and device
WO2021073155A1 (en) Video conference method, apparatus and device, and storage medium
CN103188403A (en) Voice gateway online monitoring method
CN114638237A (en) Method, device and system for realizing simultaneous interpretation
CN109151559B (en) Multi-party conversation method and household internet protocol shutdown set top box all-in-one machine
CN109150917B (en) Video synthesis control method and system based on SIP (Session initiation protocol)
WO2024134010A1 (en) Complexity reduction in multi-stream audio
CN117859313A (en) Supporting quality of service for media communications
Hatanaka et al. RTP Payload Format for the Adaptive TRansform Acoustic Coding (ATRAC) Family
Zhang et al. Research on user applying mode for video conference system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18894454

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18894454

Country of ref document: EP

Kind code of ref document: A1