WO2022127826A1 - Method, apparatus and system for realizing simultaneous interpretation - Google Patents

Method, apparatus and system for realizing simultaneous interpretation

Info

Publication number
WO2022127826A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio stream
language
terminal
media server
translation
Prior art date
Application number
PCT/CN2021/138353
Other languages
English (en)
French (fr)
Inventor
Xia Yu (夏禹)
Original Assignee
Huawei Cloud Computing Technologies Co., Ltd. (华为云计算技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co., Ltd.
Priority to EP21905752.8A (patent EP4246962A4)
Publication of WO2022127826A1
Priority to US18/333,877 (patent US20230326448A1)


Classifications

    • G — PHYSICS
        • G10L 15/005 — Language recognition (under G10L 15/00 Speech recognition)
        • G06F 40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
        • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
        • G10L 15/26 — Speech to text systems
        • G10L 15/28 — Constructional details of speech recognition systems
    • H — ELECTRICITY
        • H04M 3/568 — Arrangements for connecting several subscribers to a common circuit (conference facilities): audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
        • H04M 2203/2061 — Language aspects of supplementary services
        • H04M 2242/12 — Language recognition, selection or translation arrangements
        • H04N 7/152 — Conference systems: multipoint control units therefor

Definitions

  • the present application relates to the field of conference communication, and in particular, to a method, device and system for realizing simultaneous interpretation.
  • Simultaneous interpretation refers to a translation method in which the interpreter continuously renders the content to the audience without interrupting the speaker's speech.
  • Simultaneous interpretation for a large seminar or international conference is usually carried out by two to three translators.
  • The speakers at a conference change constantly, and speakers from different countries use different languages.
  • Therefore, the translator also needs to adjust the output language whenever the speaker changes. For example, suppose a translator's job is to translate between Chinese and English. When the speaker changes and the language of the speech switches from Chinese to English, the translator switches from Chinese-to-English translation to English-to-Chinese translation. Each time the translator changes output language, he must also manually set the output language (for example, from English to Chinese) on the relevant equipment, so that the machine can send the translated audio stream to the listeners of that language.
  • In addition, the conference site must be staffed with a dedicated conference administrator who sets the language of the current speaker, so that the media server can identify that language and return the appropriate translated audio stream to the large-screen terminal at the conference site.
  • the present application provides a method, device and system for realizing simultaneous interpretation, which reduces the degree of manual participation in simultaneous interpretation and improves the efficiency of simultaneous interpretation for conferences.
  • the present application provides a method for implementing simultaneous interpretation.
  • the media server receives a first audio stream and a second audio stream translated from the first audio stream; it then sends the second audio stream to an AI device to identify the language of the second audio stream, and sends the second audio stream to a first terminal according to that language, where the language of the second audio stream is the language of the audio stream that the first terminal expects to receive.
  • In this way, the media server uses the AI device to identify the language of the translated audio stream (the second audio stream); translators no longer need to manually set their output language on the translation terminal, which reduces the pressure on translators, lowers the error rate of the conference language system, and improves the efficiency of simultaneous interpretation.
  • In a possible implementation, the media server sends the first audio stream to the AI device to identify the language of the first audio stream, and then sends the first audio stream to a second terminal according to that language, where the language of the first audio stream is the language of the audio stream that the second terminal expects to receive.
  • Thus the media server uses the AI device to identify the language of the speaker's original audio stream (the first audio stream) as well, eliminating the need for a conference administrator to manually set the speaker's language through the conference room terminal, which further reduces manual participation in the simultaneous interpretation process and improves its efficiency.
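  • The routing behaviour just described can be sketched as follows. This is a minimal illustration only: the AI-device interface (identify_language), the terminal registry, and the send callback are assumed names, since the application does not prescribe concrete interfaces.

    # Minimal sketch of the media server's language-based routing.
    # identify_language, register_terminal, and send are illustrative
    # assumptions, not interfaces defined by the application.
    class MediaServer:
        def __init__(self, ai_device):
            self.ai = ai_device
            self.expected_language = {}  # terminal_id -> language it wants

        def register_terminal(self, terminal_id, language):
            # Record the language setting carried in the terminal's join request.
            self.expected_language[terminal_id] = language

        def route(self, audio_stream, send):
            # Ask the AI device for the language of the (original or
            # translated) stream, then forward it to every terminal that
            # expects exactly that language.
            language = self.ai.identify_language(audio_stream)  # e.g. "zh", "en"
            for terminal_id, wanted in self.expected_language.items():
                if wanted == language:
                    send(terminal_id, audio_stream)
            return language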
  • the media server determines the language of the second audio stream according to the language identification result of the second audio stream returned by the AI device.
  • In this case, the AI device directly returns the language recognition result; the media server needs no further processing and simply forwards the second audio stream to the first terminal according to that result.
  • the media server receives the text corresponding to the second audio stream returned by the AI device, and then determines the language of the second audio stream according to the text.
  • the AI device converts the audio stream into text and sends it to the media server, and the media server determines the language type of the second audio stream according to the text.
  • the media server can also forward the text to the corresponding terminal according to the settings of each terminal, so as to realize real-time subtitles.
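  • Where the AI device returns text rather than a language label, the media server must derive the language from the text itself. The sketch below uses a crude Unicode-range heuristic purely for illustration; a real deployment would use a proper text language-identification model.

    # Illustrative only: derive a language label from the recognized text.
    def language_from_text(text: str) -> str:
        for ch in text:
            if "\u4e00" <= ch <= "\u9fff":
                return "zh"  # CJK Unified Ideographs -> Chinese
            if "\u0400" <= ch <= "\u04ff":
                return "ru"  # Cyrillic -> Russian
        return "en"          # fall back to English for Latin script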
  • In a possible implementation, the media server sends the first audio stream to the translation terminals used by all translators, and then receives the second audio stream, where the second audio stream is one of the audio streams returned by those translation terminals.
  • That is, the media server adopts a send-to-all strategy when distributing the speaker's original audio stream to the translators, without considering each translator's translation capability, which reduces the occupation of the media server's computing resources and lowers the error rate of simultaneous interpretation.
  • In a possible implementation, the language of the first audio stream is a first language and the language of the second audio stream is a second language; the media server sends the first audio stream to a third terminal according to the AI device's language recognition result for the first audio stream and a first translation capability parameter, where the first translation capability parameter is used to indicate that the translation capability of the first translator using the third terminal includes translating the first language into the second language; the media server then receives the second audio stream from the third terminal.
  • In this way, the media server takes the translator's translation capability into account when forwarding the speaker's original audio stream, that is, it forwards the original audio stream only to translators whose services involve the language of the first audio stream, which reduces the transmission of redundant information and the occupation of network transmission resources.
  • the media server receives the first translation capability parameter sent by the third terminal.
  • the first translator feeds back its own translation capability parameters to the media server through the third terminal, such as Chinese-English bidirectional translation, English-French bidirectional translation, and the like.
  • Alternatively, the media server specifies the translation capability parameter corresponding to the third terminal before the conference starts, and the translator selects the third terminal according to his own translation capability, using it to receive the speaker's original audio stream and to send the translated audio stream.
  • In a possible implementation, the language of the first audio stream is a first language and the language of the second audio stream is a second language; the media server determines a fourth terminal and a fifth terminal according to the AI device's language recognition result for the first audio stream, a second translation capability parameter, and a third translation capability parameter, where the second translation capability parameter is used to indicate that the translation capability of the second translator using the fourth terminal includes translating the first language into a third language, and the third translation capability parameter is used to indicate that the translation capability of the third translator using the fifth terminal includes translating the third language into the second language. The media server sends the first audio stream to the fourth terminal; receives from the fourth terminal a third audio stream whose language is the third language; sends the third audio stream to the fifth terminal; and receives the second audio stream sent by the fifth terminal.
  • the media server determines a translation relay strategy according to the language identification result of the first audio stream and the translation capability parameter information of the translator, so as to ensure the normal operation of the conference translation service.
  • In a possible implementation, before sending the second audio stream to the first terminal, the media server stores the second audio stream; after a determined time, the media server sends the second audio stream to the first terminal starting from the portion stored before that time, where the determined time is the moment at which the media server determines that the language of the second audio stream is the language that the first terminal expects to receive.
  • That is, the second audio stream is buffered before being sent to the first terminal and is forwarded only after its language is confirmed, which reduces the probability of crosstalk at the venue and improves the user experience.
  • the media server receives first language setting information sent by the first terminal, where the first language setting information is used to indicate the language of the audio stream that the first terminal expects to receive; the media The server receives the second language setting information sent by the second terminal, where the second language setting information is used to indicate the language of the audio stream that the second terminal expects to receive.
  • the media server determines the language of the audio stream that each terminal expects to receive according to the language setting information of each terminal.
  • the AI device and the media server are deployed in the same server.
  • This reduces the communication delay between the AI device and the media server, lessening the impact of the network on the simultaneous interpretation service.
  • the simultaneous interpretation method performs language recognition on each audio stream through an AI device, so as to realize a high-efficiency simultaneous interpretation service for a conference.
  • For conference administrators, there is no need to manually change the language of the current speaker through the conference room terminal when the speaker switches languages, which reduces manual participation in the simultaneous interpretation process; for translators, there is no need to set the output language through the translation terminal before performing translation work, which relieves the translators' work pressure and reduces the probability of language errors in the conference.
  • the simultaneous interpretation method relieves the pressure of the staff and improves the efficiency of simultaneous interpretation of the meeting.
  • the present application provides an apparatus for implementing simultaneous interpretation, the apparatus including each module for implementing the method for implementing simultaneous interpretation in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a system for implementing simultaneous interpretation, the system including a media server and an AI device.
  • the media server is configured to receive a first audio stream and a second audio stream, the second audio stream being an audio stream translated from the first audio stream, and to send the second audio stream to the AI device;
  • the AI device is configured to receive the second audio stream and to send first language identification information to the media server;
  • the media server is further configured to determine the language of the second audio stream according to the first language identification information, and to send the second audio stream to the first terminal.
  • In a possible implementation, the media server is further configured to send the first audio stream to the AI device; the AI device is further configured to receive the first audio stream and to send second language identification information to the media server; the media server is further configured to determine the language of the first audio stream according to the second language identification information, and to send the first audio stream to the second terminal.
  • the first language identification information includes a language identification result of the second audio stream or text corresponding to the second audio stream.
  • In another aspect, the present application provides a simultaneous interpretation device. The device includes a processor, a memory, a communication interface, and a bus; the processor, the memory, and the communication interface are connected through the bus and communicate with one another; the memory is used for storing computer-executable instructions; and when the device runs, the processor executes the computer-executable instructions in the memory to use the device's hardware resources to perform the method of the first aspect or any possible implementation of the first aspect.
  • the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method described in the first aspect above.
  • the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method described in the first aspect above.
  • On the basis of the implementations provided by the above aspects, the present application may combine them further to provide more implementations.
  • FIG. 1 is a system architecture diagram of manually setting output languages to realize simultaneous conference interpretation provided by the present application.
  • FIG. 2 is a schematic diagram of a method for manually setting an output language to realize simultaneous interpretation in a conference provided by the present application.
  • FIG. 3 is a system architecture diagram for realizing conference simultaneous interpretation provided by the present application.
  • FIG. 4 is an overall flow chart for realizing conference simultaneous interpretation provided by the present application.
  • FIG. 5 is a schematic diagram of a method for realizing simultaneous interpretation of a conference provided by the present application.
  • FIG. 6 is a flow chart of one method for realizing conference simultaneous interpretation provided by the present application.
  • FIG. 7 is a flow chart of another method for implementing simultaneous conference interpretation provided by the present application.
  • FIG. 8 is a schematic structural diagram of a conference simultaneous interpretation device provided by the present application.
  • FIG. 9 is a schematic structural diagram of another conference simultaneous interpretation device provided by the present application.
  • MCU (Multipoint Control Unit): a media processing server based on a centralized architecture, which can decode, mix, and encode audio and video streams; it is used to connect multiple terminals for multipoint audio and video communication.
  • SFU: Selective Forwarding Unit.
  • PCM: Pulse Code Modulation.
  • AAC-LD: Advanced Audio Coding - Low Delay, a low-latency advanced audio coding format; audio streams are transmitted under protocols such as AAC-LD.
  • During simultaneous interpretation, the interpreter sits in a small room with good sound insulation (commonly known as a "box"), uses professional equipment, synchronously interprets what he hears in his headset into the target language, and outputs it through a microphone. Participants who need the simultaneous interpretation service can set the language they need on the device that receives the audio stream and thereby obtain the translated information.
  • a simultaneous interpretation system mainly includes a translation terminal 11, a conference room terminal 12, a user terminal 13, and a media server 14.
  • the translation terminal 11 is the device through which a translator receives the speaker's audio stream and outputs the audio stream of his own translation.
  • the device may be a handheld mobile terminal, such as a mobile phone, or the device may be a personal computer.
  • the translation business is generally two-way, such as two-way translation between Chinese and English. During the meeting, the translator will change the language of his output at any time according to the language of the speaker.
  • the conference room terminal 12 is generally located in the conference room where the speaker is located; it is connected to the speaker's microphone and collects the speaker's original voice. The conference room terminal 12 further includes a signal transceiver module for sending the compressed original audio stream to the media server and for receiving the translated audio stream. In addition, the conference room terminal 12 may also include a loudspeaker unit for broadcasting the speaker's original voice and the translated audio stream. Before sending the audio stream to the media server, the conference room terminal must first encode the analog signal collected by the microphone into a PCM code stream, then encode the PCM code stream into an audio stream under a protocol such as AAC-LD, and then send it to the media server.
  • the encoding and decoding of the audio stream in the conference system belongs to the prior art, and the present invention will not describe it in detail.
  • the various audio streams (original audio streams, translated audio streams) appearing in the following in this application mainly emphasize the language of the audio stream, and do not limit whether the audio stream is encoded or decoded.
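  • As a small numeric illustration of the packetization implied by the encoding pipeline above (the sample rate, sample width, and frame duration below are assumed values; the application does not fix them, and AAC-LD encoding itself is left to a codec library):

    SAMPLE_RATE = 48_000        # samples per second (assumed)
    BYTES_PER_SAMPLE = 2        # 16-bit mono PCM (assumed)
    FRAME_MS = 20               # one frame every 20 ms (assumed)

    samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000          # 960 samples
    pcm_bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE  # 1920 bytes

    def frames(pcm: bytes):
        # Split a raw PCM byte stream into fixed-duration frames, each of
        # which would then be fed to an AAC-LD encoder before transmission.
        for i in range(0, len(pcm) - pcm_bytes_per_frame + 1, pcm_bytes_per_frame):
            yield pcm[i:i + pcm_bytes_per_frame]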
  • the conference room terminal 12 has various forms, which may be a large-screen terminal or a computer connected to an audio system, and the specific form of the conference terminal is not limited in this application.
  • the user terminals 13 include terminals 1, 2, and 3, and the corresponding users are users 1, 2, and 3.
  • the languages of users 1, 2, and 3 may be the same or different. Users can choose to receive only one audio stream in one language, or they can choose to receive both the original audio stream and the audio stream in another language. Users 1, 2, 3 can be in the conference room or anywhere else.
  • the user terminal may be a handheld mobile terminal, or various devices that can output audio such as a personal computer.
  • the media server 14 is a media processing server, which may be an MCU or an SFU.
  • the media server can be deployed in the cloud or in the local computer room.
  • the media server is mainly used to forward the audio stream of the speaker and the audio stream output by the translator according to the language configured by each terminal (user terminal or conference room terminal).
  • the network includes wired and/or wireless transmission, where wired transmission includes data transmission over Ethernet, optical fiber, and the like.
  • Wireless transmission methods include broadband cellular network transmission methods such as 3G (Third generation), 4G (Fourth generation), or 5G (Fifth generation).
  • Step 1 When Chinese user A, English user B, and English user C join the conference, they set the languages of the audio streams they want to receive to Chinese, English, and English, respectively, so that the media server can identify each user's desired language and forward the corresponding audio stream to the corresponding terminal.
  • Step 2 Assuming the current speaker's language is Chinese, the conference administrator needs to manually set the language of the conference room to Chinese so that the media server can identify the speaker's language as Chinese. That is, the conference administrator's manual operation is what enables the media server to identify the language of the original audio stream sent by the conference room terminal.
  • Step 3 The conference room terminal sends the original audio stream to the media server, and the media server forwards the original audio stream to the translator after receiving the original audio stream.
  • the conference room terminal may also directly send the original audio stream to the translation terminal.
  • Step 4 The translation capability parameter of the Chinese-English translator is Chinese-English bidirectional translation. After receiving the original audio stream, the translator outputs the translated audio stream through a sound pickup device such as a microphone. Before translating, the translator must manually set his output language to English on the translation terminal so that the media server can identify the language of the translated audio stream.
  • Step 5 The media server sends the original audio stream and the translated audio stream to the corresponding user terminal according to the languages set by the users A, B, and C.
  • Step 6 When the speaker is replaced and the new speaker speaks English, the conference administrator needs to manually set the language of the conference room again, this time to English, so that the media server can identify the speaker's language. Likewise, after receiving the original English audio stream and before formally translating, the translator needs to change his output language to Chinese through the translation terminal again so that the media server can identify the language of the translated audio stream. The media server then re-forwards the corresponding audio streams according to the settings of the users and the conference room.
  • AI: Artificial Intelligence.
  • the AI device 15 can be deployed in the cloud or in a local computer room.
  • AI equipment is generally a heterogeneous server whose components can be combined in different ways according to the application scenario, such as CPU+GPU, CPU+TPU, or CPU+other accelerator cards.
  • AI equipment generally adopts the form of CPU+GPU.
  • GPU adopts a parallel computing mode and is good at handling data-intensive operations, such as graphics rendering and machine learning.
  • In one embodiment, the AI device 15 and the media server 14 are integrated and deployed together in the cloud.
  • In another embodiment, the media server 14 is deployed in the customer's computer room and the AI device 15 is deployed in the cloud.
  • In yet another embodiment, the AI device 15 and the media server 14 are integrated and deployed together in the customer's computer room.
  • the functions of the conference room terminal include sending the original audio stream of the speaker and receiving the translated audio stream.
  • the conference room terminal may only broadcast the original audio without broadcasting the translated audio stream.
  • the media server does not need to forward the translated audio stream to the conference room terminal, but only needs to send the translated audio stream to the user terminal. That is, the flow direction of the translated audio stream mainly depends on the settings of the conference administrator and the settings of the user.
  • the original voice of the speaker can also be directly transmitted to the translator for translation without forwarding through the media server.
  • the embodiment of the present application still takes the example that the original voice of the speaker needs to be forwarded to the translator through the media server, that is, the media server performs overall control and forwarding.
  • a conference often includes participants corresponding to more than two languages.
  • the two languages of the broadcast are preset by the conference administrator in advance.
  • a standard United Nations international conference generally includes six languages: Arabic, Chinese, English, French, Russian and Spanish.
  • the conference site also includes leaders, journalists, and related working personnel from these six countries. If every sentence of the speaker were broadcast in the remaining five languages, the languages at the site would become chaotic and the user experience would suffer. Therefore, generally only the two languages set by the conference administrator are broadcast on site.
  • For example, if the languages set by the conference administrator are Chinese and English, then when a Chinese speaker speaks, the speaker's original audio stream is broadcast first, and the large-screen terminal in the conference room also broadcasts the translated English.
  • Meanwhile, the live audiences of the other four languages (Arabic, French, Russian, and Spanish) need to wear headphones to receive the audio in the corresponding language; that is, these audiences are equivalent to users 1, 2, and 3 of user terminals 1, 2, and 3 in FIG. 3. It should be noted that the above description of the conference site is only intended to make the solution more complete and does not constitute any limitation on the present invention.
  • Step S41 The media server receives a conference joining request sent by the user and the conference administrator through their respective terminals, and the request includes language setting information.
  • the user and the conference administrator generate language setting information through their respective terminals and send it to the media server when joining the conference.
  • the language setting information is used to indicate the language of the audio stream that the user or the conference room expects to receive.
  • For example, a user can choose to receive only the audio stream of a certain language, or to receive both the original audio stream and the audio stream of another language.
  • The conference administrator can set the conference room to receive no translated audio stream (that is, only the original audio stream is broadcast in the conference room), or to receive audio streams in one or two languages.
  • Exemplarily, after receiving a conference joining request sent by a user or the conference administrator, the media server allocates a different UDP (User Datagram Protocol) port to each user terminal and conference room terminal.
  • the media server only needs to monitor the corresponding UDP port to send and receive audio streams to and from each terminal (user terminal, conference room terminal).
  • In addition, each terminal (user terminal, conference room terminal) negotiates its own SSRC (synchronization source identifier) with the media server through the conference joining request before it starts sending audio streams.
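  • The join handling of step S41 might look like the following sketch (the port-allocation scheme and field names are assumptions for illustration):

    import itertools

    class ConferenceSession:
        def __init__(self, base_port=20000):
            # Hand out one UDP port per joining terminal (assumed scheme).
            self._ports = itertools.count(base_port)
            self.terminals = {}  # terminal_id -> (udp_port, ssrc, language)

        def join(self, terminal_id, ssrc, expected_language):
            # Allocate a UDP port and record the SSRC and language setting
            # negotiated through the terminal's conference joining request.
            port = next(self._ports)
            self.terminals[terminal_id] = (port, ssrc, expected_language)
            return port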
  • Step S42 The media server receives the conference joining request sent by the translator through the translation terminal.
  • the translator can enter the same conference ID as the user and the conference administrator in step S41 to join the same conference.
  • the translator's meeting joining request further includes a translation capability parameter, where the translation capability parameter is used to indicate the translator's business scope, such as Chinese-English bidirectional translation or English-French bidirectional translation.
  • Similarly, after receiving the conference joining request sent by the translator through the translation terminal, the media server allocates a UDP port to the translation terminal and then sends and receives audio streams to and from it through that port.
  • Step S43 The media server receives the original audio stream sent by the conference room terminal. After the conference officially starts, the speaker takes the stage to speak, and the conference room terminal forwards the original audio stream to the media server.
  • Step S44 The media server forwards the original audio stream to the AI device to identify the language of the original audio stream.
  • the media server identifies the language of the original audio stream according to the language identification information returned by the AI device.
  • the language identification information can be directly the result of the language identification of the AI device, or it can be the text corresponding to the original audio stream generated by the AI device.
  • Step S45 The media server forwards the original audio stream to the translation terminal.
  • the media server forwards the audio stream to all translation terminals. That is, regardless of the translator's business scope, the original audio stream is forwarded to the translation terminals used by all translators.
  • the media server forwards the original audio stream to the corresponding translation terminal according to the translation capability parameter of the translator.
  • For example, if the translation capability parameter of translator 1 is Chinese-English bidirectional translation and the translation capability parameter of translator 2 is English-French bidirectional translation, a Chinese original audio stream needs to be forwarded only to the terminal used by translator 1.
  • step S44 must be executed before step S45, and at the same time, the media server must also obtain the translation capability parameters of the translator before the conference starts. There are various ways to obtain the translation capability parameters of the translators.
  • For example, the join request sent by the translation terminal in step S42 may carry the translation capability parameter; or the media server may set in advance, for each translation terminal, the translation capability parameter of the translator corresponding to that terminal, and the translator then selects the appropriate terminal for his translation work according to the settings of each terminal.
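  • For the selective variant of step S45, the forwarding decision reduces to a capability lookup, as in the sketch below (encoding each bidirectional capability as an unordered language pair is an assumption of this sketch):

    def translators_for(source_language, capabilities):
        # capabilities: terminal_id -> set of frozenset({lang_a, lang_b})
        # bidirectional pairs; return the terminals whose capability
        # covers the language of the original audio stream.
        return [
            terminal_id
            for terminal_id, pairs in capabilities.items()
            if any(source_language in pair for pair in pairs)
        ]

    # Translator 1 does Chinese-English, translator 2 does English-French:
    caps = {"t1": {frozenset({"zh", "en"})}, "t2": {frozenset({"en", "fr"})}}
    assert translators_for("zh", caps) == ["t1"]  # Chinese stream -> t1 only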
  • Step S46 The media server receives the translated audio stream sent by the translation terminal.
  • Step S47 The media server sends the translated audio stream to the AI device to identify the language of the translated audio stream.
  • the identification method is the same as that of step S44, which is not repeated here.
  • Step S48 The media server forwards the original audio stream and the translated audio stream to each terminal (user terminal or conference room terminal or translation terminal) according to the language of the original audio stream and the language of the translated audio stream.
  • Steps S41-S48 describe a relatively complete method flow for implementing simultaneous interpretation. It should be noted that the sequence numbers of the above steps do not represent the order of execution. For example, in some cases, step S46 may be executed before step S44. In addition, according to the deployment form of AI devices and media servers, some steps can also be omitted directly. For example, when the AI device and the media server are deployed together, steps such as forwarding the audio stream to be recognized to the AI device and receiving the returned information from the AI device can be omitted, and the media server directly recognizes the language of the audio stream.
  • the language settings of all users may include receiving the original audio stream, that is, each user expects to receive the speaker's original audio stream regardless of which language the speaker uses.
  • In that case, the media server does not need to send the original audio stream to the AI device for identification, that is, it does not need to perform step S44.
  • However, since the language of the original audio stream is then not identified, the media server must adopt the strategy of forwarding the original audio stream to all translators (step S45), that is, every translator receives the original audio stream.
  • The method for implementing simultaneous interpretation provided by the embodiments of the present application is described in detail below, taking FIG. 5 and FIG. 6 as examples.
  • the entire conference involves three languages: Chinese, English, and Russian.
  • When the Chinese user, the English user, and the Russian user send conference joining requests to the media server through their mobile terminals, they set the languages of the audio streams they expect to receive to Chinese, English, and Russian, respectively.
  • the request for joining the conference sent by the conference administrator through the conference room terminal also sets the language of the translated audio stream received by the conference room to be Chinese or English.
  • the translators at the venue include a Chinese-English translator and an English-Russian translator. The users at the user terminals, the conference administrator at the conference room terminal, and the translators at the translation terminals can join the same conference by entering the same conference ID (identification). The description below starts directly from the point at which the speaker begins to speak.
  • Step S51 Assuming that the English speaker speaks first, the conference room terminal collects the speaker's original audio stream through sound pickup equipment such as a microphone and sends it to the media server.
  • Step S52 The media server sends the original audio stream to the AI device to identify the language of the original audio stream.
  • the AI device can directly return the language recognition result, or return the text corresponding to the original audio stream, so that the media server can determine the language of the original audio stream according to the text. It should be noted that if the AI device and the media server are deployed on the same server cluster, this step can be omitted, that is, the media server can directly identify the language of the original audio stream.
  • Step S53 The media server sends the original audio stream of the speaker to the translator.
  • the translator receives the audio stream through a device such as a mobile terminal or a computer.
  • the media server can selectively forward the original audio stream to translators with the relevant translation capability parameters, or it can forward the original audio stream to all translators; this mainly depends on the conference settings or the translators' settings.
  • In the embodiment of the present application, the media server implements a send-to-all strategy when sending the original audio stream to the translators' translation terminals, that is, it forwards the original audio stream to all the translators (the Chinese-English and English-Russian translators shown in FIG. 5).
  • Step S54 The translator translates according to the original audio, and the translation terminal sends the translated audio stream to the media server.
  • the translator does not need to pay attention to the language output by himself, but only needs to translate what he hears into another language according to his professional instinct.
  • Suppose the audio stream translated by the Chinese-English translator is audio stream A, and the audio stream translated by the English-Russian translator is audio stream B.
  • Step S55 The media server sends the translated audio streams (audio streams A and B) received from the translation terminals to the AI device to identify their languages. As in step S52, the media server may receive either the AI device's language recognition result or the text corresponding to each audio stream in order to determine its language. In addition, if the AI device and the media server are deployed on the same server cluster, the actions of sending the audio stream and receiving the language recognition result can be omitted.
  • Step S56 The media server sends the translated audio stream according to the language set in the conference room.
  • a conference room can broadcast at most two types of languages.
  • In the embodiment of the present application, the conference administrator has set the languages of the received translated audio streams to English and Chinese, with English as the main broadcast language.
  • Thus if a Chinese speaker speaks, the speaker's original voice and the English translation are played at the venue; if a Russian speaker speaks, the original voice and the English translation are played; and if an English speaker speaks, the original voice and the Chinese translation are played.
  • In step S52 the media server determined, from the result returned by the AI device, that the language of the current conference room speaker is English, and in step S55 it determined that audio stream A is Chinese and audio stream B is Russian. According to the rule set by the conference administrator (the languages of the audio streams the conference room expects to receive include Chinese and English), the media server sends audio stream A, output by the Chinese-English translator, to the conference room terminal.
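  • The room-broadcast rule of steps S52-S56 reduces to choosing one translated stream to accompany the original voice. A sketch follows, assuming the administrator's two configured languages with English as the primary broadcast language:

    ROOM_LANGUAGES = ("en", "zh")  # (primary, secondary), as assumed above

    def room_translation_language(speaker_language: str) -> str:
        # Play the primary-language translation unless the speaker already
        # speaks it, in which case fall back to the secondary language.
        primary, secondary = ROOM_LANGUAGES
        return secondary if speaker_language == primary else primary

    assert room_translation_language("zh") == "en"  # Chinese speaker -> English
    assert room_translation_language("ru") == "en"  # Russian speaker -> English
    assert room_translation_language("en") == "zh"  # English speaker -> Chinese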
  • Step S57 The media server forwards the corresponding audio stream according to the language setting of the user.
  • the media server forwards the original audio stream to the English user, audio stream A to the Chinese user, and audio stream B to the Russian user.
  • the present application does not specifically limit the sequence of steps S52-S55.
  • After receiving the original audio stream, the media server may first send it to the AI device for identification and then send it to the translation terminals; alternatively, it may forward the original audio stream to the translation terminals first and then send the original audio stream and the translated audio streams to the AI device for recognition.
  • the original audio stream of the speaker needs to be continuously transmitted to the AI device for recognition, so that the change of the speaker's language can be quickly recognized, so as to achieve accurate forwarding.
  • the AI device can send the language recognition result to the media server when the recognized language changes.
  • Alternatively, the media server may send the original audio stream to the AI device intermittently to save network transmission resources, and the length of the interval may be set according to experience.
  • In the above embodiment, the number of languages involved at the conference site is relatively small.
  • When a conference involves many languages, the cost of the conference may make the number of translators insufficient, so it cannot be guaranteed that a speaker of each language has corresponding translators into all other languages.
  • For example, if the speaker's language is Russian, the languages supported by the conference include Chinese (that is, there may be Chinese audience members), and there is no Russian-Chinese translator on site, then a translation relay is required: one translator first translates Russian into English, and another translator then translates the English into Chinese.
  • In this case, the strategy by which the media server forwards the original audio stream to the translators is still the same as above: it can forward to all translators or only to the translators whose capabilities involve the relevant languages.
  • the media server also needs to forward the English stream output by the Russian-English translator to the English-Chinese translator to obtain the Chinese audio stream.
  • the media server can determine the optimal relay strategy according to the language of the speaker, the translation capability of the interpreter, and the language of the audio stream that the conference ultimately needs.
  • the language of the audio stream ultimately required by the conference may be uniformly set by the conference administrator, or may be determined according to the desired audio stream language reported by the user.
  • the media server needs to implement audio stream forwarding between translation terminals according to the calculated relay strategy.
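  • One way to compute such a relay strategy is a shortest-path search over the translators' language pairs, as in the sketch below (the graph encoding and function names are assumptions; the application only requires that some relay chain be found):

    from collections import deque

    def relay_chain(source, target, capabilities):
        # capabilities: iterable of (lang_a, lang_b) bidirectional pairs.
        # Returns the chain of languages to relay through, e.g.
        # ["ru", "en", "zh"], or None if no chain exists.
        graph = {}
        for a, b in capabilities:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
        queue, seen = deque([[source]]), {source}
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                return path
            for nxt in graph.get(path[-1], ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    # Russian speaker, Chinese audience, only Russian-English and
    # Chinese-English translators on site:
    assert relay_chain("ru", "zh", [("ru", "en"), ("zh", "en")]) == ["ru", "en", "zh"]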
  • Step S71 The speaker in Russian speaks, and the conference room terminal sends the original audio stream to the media server.
  • Step S72 The media server forwards the original audio stream to the AI device to identify the language of the original audio stream as Russian.
  • Step S73 The media server sends the original audio stream to the translation terminal. It is assumed that in the embodiment of the present application, the all-staff sending strategy is still implemented, that is, the translation terminals used by all translators (Chinese-English translators, English-Russian translators) will receive the original audio stream.
  • Step S74 After receiving the original audio stream, the English-Russian translator directly translates the heard Russian into English, and sends the translated audio stream 1 to the media server without setting the language for output on the terminal.
  • Step S75 The media server sends the audio stream output by the English-Russian translator to the AI device to identify the language of translated audio stream 1 as English.
  • Step S76 The media server determines that a translation relay is required, and calculates a translation relay strategy.
  • the media server determines that the current conference needs audio streams in Chinese, English and Russian according to the settings of the conference administrator or the access users. According to step S72, it can be determined that the original audio stream is in Russian, and according to step S75, it can be determined that the translated audio stream output by a translator is English, so that it can be determined that there is no Chinese audio stream at this time.
  • From the translation capability parameters provided by the translators, the media server determines that there is a Chinese-English translator on site, so the English translated audio stream can be forwarded to that translator to obtain a Chinese translated audio stream.
  • Step S77 The media server sends the translated audio stream 1 to the Chinese-English translator.
  • the media server has determined that the audio stream 1 is an English audio stream, and then forwards the translated audio stream 1 to the Chinese-English translator according to the relay strategy calculated in step S76.
  • Step S78 The media server receives the translated audio stream 2 sent by the Chinese-English translator.
  • the Chinese-English translator can directly translate the English into Chinese according to the professional instinct, and there is no need to manually set the language of the output audio stream on the terminal.
  • Step S79 The media server sends the received translated audio stream 2 to the AI device to identify the language of the audio stream as Chinese.
  • Step S710 The media server forwards the corresponding audio stream according to the settings of the conference administrator. Assuming that the conference administrator sets the languages of the translated audio streams received by the conference terminal as Chinese and English, the media server forwards both the translated audio streams 1 and 2 to the conference terminal.
  • Step S711 The media server forwards the corresponding audio stream according to the setting of the user. According to the user's settings, the media server forwards the original audio stream to the Russian user, the translated audio stream 1 to the English user, and the translated audio stream 2 to the Chinese user.
  • Regarding step S79: since this is a relay translation scenario, the media server has already determined the translator's translation capability, so after forwarding the English audio stream to the Chinese-English translator it should receive a Chinese audio stream; strictly speaking, the media server does not need to send audio stream 2 to the AI device to determine its language. In practice, however, to ensure that the conference translation is foolproof, the obtained audio stream is generally still sent to the AI device to identify its language.
  • In addition, to reduce crosstalk, the media server needs to buffer an audio stream before sending it to a user terminal or the conference room terminal when implementing this solution.
  • Audio stream transmission is organized into units along the time dimension; assume an audio packet is formed every 100 ms. That is, the conference room terminal or the translation terminal sends one audio packet to the media server every 100 ms, and each time the media server receives a packet it sends that packet to the AI device to identify its language.
  • Because language identification takes time, suppose the media server receives the language recognition result for the first audio packet only after it has received three packets, and only then forwards the first packet to the corresponding user terminal or conference room terminal. If the media server did not buffer, it would discover that the language of the first audio packet had changed only when receiving the third packet, and by then the first and second packets would already have been sent to the wrong users or conference room, causing crosstalk and hurting the user experience.
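  • The buffering just described can be sketched as a gate that holds packets until their language is confirmed (the 100 ms packet size and the callback shape are the assumptions of this example):

    from collections import deque

    class LanguageGatedBuffer:
        def __init__(self, forward):
            self.pending = deque()  # packets awaiting a language decision
            self.forward = forward  # callback: forward(packet, language)

        def on_packet(self, packet):
            # Called every 100 ms with a new audio packet; buffered, not sent.
            self.pending.append(packet)

        def on_language_result(self, language, packet_count):
            # Called once the AI device has confirmed the language of the
            # oldest packet_count buffered packets; only now are they released.
            for _ in range(min(packet_count, len(self.pending))):
                self.forward(self.pending.popleft(), language)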
  • In a possible implementation, the media server sends the audio stream to be identified to the AI device; the AI device divides the received audio stream according to preset rules and then feeds back to the media server the language identification information of each divided segment.
  • Exemplarily, the preset rules of the AI device include identifying the language according to the speaker's sentence breaks. That is, the AI device first detects the pauses in the audio stream, divides the received audio stream at each pause, and then returns the language identification information of each sentence to the media server.
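  • A crude stand-in for such sentence-based division is pause detection by an energy threshold, as sketched below (the threshold, frame granularity, and energy function are illustrative assumptions; a real AI device would use a trained segmentation model):

    def split_on_pauses(frames, energy, threshold=500, min_gap_frames=5):
        # frames: sequence of PCM frames; energy(frame) -> average amplitude.
        # A run of min_gap_frames quiet frames closes the current sentence.
        segments, current, quiet = [], [], 0
        for frame in frames:
            if energy(frame) < threshold:
                quiet += 1
                if quiet >= min_gap_frames and current:
                    segments.append(current)
                    current = []
            else:
                quiet = 0
                current.append(frame)
        if current:
            segments.append(current)
        return segments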
  • the embodiment of the present application does not specifically limit the unit, size, duration, etc. of the buffered and identified audio stream, and it depends on the situation.
  • the above-mentioned method for realizing simultaneous interpretation reduces manual participation and improves translation efficiency.
  • On the one hand, the method does not require a dedicated conference administrator to set the language of the conference room (the language of the current speaker), which reduces the occupation of manpower and the probability of errors.
  • Translators also do not need to set their own output language every time they switch languages, which reduces the pressure on translators.
  • In addition, the language of the speaker and the language output by the translators are uniformly identified by the AI equipment, which improves the accuracy of language switching and reduces the impact of human factors on simultaneous interpretation.
  • the translation work of translators can also be replaced by AI equipment, that is, the AI equipment can realize the simultaneous interpretation work of the whole meeting.
  • FIG. 8 shows a device 80 for implementing simultaneous interpretation provided by an embodiment of the present application; the device 80 may be implemented as part or all of a device through software, hardware, or a combination of the two.
  • the apparatus provided in the embodiment of the present application can implement the processes described in FIGS. 4-7 in the embodiment of the present application.
  • the apparatus 80 includes: a receiving module 81 and a sending module 82, wherein:
  • the receiving module 81 is configured to receive a first audio stream and a second audio stream, and the second audio stream is an audio stream translated according to the first audio stream;
  • the sending module 82 is configured to send the second audio stream to the artificial intelligence AI device to identify the language of the second audio stream; and is also configured to send the second audio to the first terminal according to the language of the second audio stream stream, wherein the language of the second audio stream is the language of the audio stream that the first terminal expects to receive.
  • Optionally, the sending module 82 is further configured to send the first audio stream to the AI device to identify the language of the first audio stream, and to send the first audio stream to the second terminal according to the language of the first audio stream, where the language of the first audio stream is the language of the audio stream that the second terminal expects to receive.
  • the apparatus 80 for implementing simultaneous interpretation further includes a processing module 83, which is configured to determine the language of the second audio stream according to the language recognition result of the second audio stream returned by the AI device.
  • Optionally, the receiving module 81 is further configured to receive the text corresponding to the second audio stream returned by the AI device; the processing module 83 is further configured to determine the language of the second audio stream according to the text.
  • Optionally, the sending module 82 is further configured to send the first audio stream to the translation terminals used by all translators; the receiving module 81 is further configured to receive the second audio stream, where the second audio stream is one of the audio streams returned by the translation terminals used by the translators.
  • Optionally, the language of the first audio stream is the first language and the language of the second audio stream is the second language; the sending module 82 is further configured to send the first audio stream to the third terminal according to the AI device's language recognition result for the first audio stream and the first translation capability parameter, where the first translation capability parameter is used to indicate that the translation capability of the first translator using the third terminal includes translating the first language into the second language; the receiving module 81 is further configured to receive the second audio stream sent by the third terminal.
  • the receiving module 81 is further configured to receive the first translation capability parameter sent by the third terminal.
  • Optionally, the language of the first audio stream is the first language and the language of the second audio stream is the second language; the processing module 83 is further configured to determine the fourth terminal and the fifth terminal according to the AI device's language recognition result for the first audio stream, the second translation capability parameter, and the third translation capability parameter, where the second translation capability parameter is used to indicate that the translation capability of the second translator using the fourth terminal includes translating the first language into the third language, and the third translation capability parameter is used to indicate that the translation capability of the third translator using the fifth terminal includes translating the third language into the second language. The sending module 82 is further configured to send the first audio stream to the fourth terminal; the receiving module 81 is further configured to receive the third audio stream sent by the fourth terminal, where the language of the third audio stream is the third language; the sending module 82 is further configured to send the third audio stream to the fifth terminal; and the receiving module 81 is further configured to receive the second audio stream sent by the fifth terminal.
  • Optionally, the device 80 for implementing simultaneous interpretation further includes a storage module 84 configured to store the second audio stream; the sending module 82 is further configured to send the second audio stream to the first terminal starting from the portion of the second audio stream stored before the determined time, where the determined time is the moment at which the media server determines that the language of the second audio stream is the language that the first terminal expects to receive.
  • the receiving module 81 is further configured to receive first language setting information sent by the first terminal, where the first language setting information is used to indicate the language of the audio stream that the first terminal expects to receive; Receive second language setting information sent by the second terminal, where the second language setting information is used to indicate the language of the audio stream that the second terminal expects to receive.
  • FIG. 9 is a device 90 for implementing simultaneous interpretation provided by an embodiment of the present application.
  • the device 90 includes a processor 91, a memory 92, and a communication interface 93.
  • the processor 91, the memory 92, and the communication interface 93 realize the communication connection by means such as wired or wireless transmission.
  • the memory 92 is used to store instructions, and the processor 91 is used to execute the instructions.
  • the memory 92 stores program codes, and the processor 91 can call the program codes stored in the memory 92 to perform the following operations:
  • receive a first audio stream and a second audio stream, where the second audio stream is an audio stream translated according to the first audio stream; send the second audio stream to the AI device to identify the language of the second audio stream; and send the second audio stream to the first terminal according to the language of the second audio stream, where the language of the second audio stream is the language of the audio stream that the first terminal expects to receive.
  • the processor 91 may be a CPU, or other general-purpose processors that can execute stored program codes.
  • the memory 92, which may include read-only memory and random access memory, provides instructions and data to the processor 91.
  • Memory 92 may also include non-volatile random access memory.
  • memory 92 may also store device type information.
  • the memory 92 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • the volatile memory may be random access memory (RAM). By way of example and not limitation, many forms of RAM are available, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • in addition to a data bus, the bus 94 connecting the processor 91, the memory 92, and the communication interface 93 may also include a power bus, a control bus, a status signal bus, and the like. However, for the sake of clarity, the various buses are all designated as bus 94 in the figure.
  • the present application also provides a system for realizing simultaneous interpretation.
  • the system includes a device 80 for realizing simultaneous interpretation and an AI device.
  • the apparatus 80 for implementing simultaneous interpretation and the AI device are deployed in the same server.
  • Each device in the system for implementing simultaneous interpretation executes the methods shown in FIGS. 4-7, which are not repeated here for brevity.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that contains one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This application discloses a method for implementing simultaneous interpretation, relating to the field of conference communications. A media server receives the audio stream of a conference speaker and an audio stream translated from it, sends the translated audio stream to an AI device to identify its language, and then forwards the translated audio stream to the corresponding terminals according to the identification result. This method of simultaneous interpretation reduces manual involvement and improves the efficiency of simultaneous interpretation.

Description

Method, Apparatus, and System for Implementing Simultaneous Interpretation

Technical Field
This application relates to the field of conference communications, and in particular, to a method, apparatus, and system for implementing simultaneous interpretation.
Background
With the acceleration of globalization, the number of international conferences has increased markedly, and the languages of different countries are rich and diverse, which gives rise to the need for simultaneous interpretation in conferences. Simultaneous interpretation is a mode of translation in which an interpreter renders the content to the audience continuously without interrupting the speaker. A large seminar or international conference is usually served by two or three interpreters taking turns.
In practical application scenarios, the speakers in a conference change continuously, and speakers from different countries use different languages. Accordingly, interpreters need to adjust their output language as speakers change. For example, if an interpreter's job is bidirectional Chinese-English translation, when the speaker changes and the speaking language switches from Chinese to English, the interpreter also switches from Chinese-to-English to English-to-Chinese. Meanwhile, when changing their output language, the interpreter must manually set, on the relevant device, that their output language has changed from English to Chinese, so that the machine can send the translated audio stream to listeners of the same language. In addition, the conference venue needs a dedicated conference administrator to set the language of the current speaker, so that the media server can identify the current speaker's language and return the translated audio stream to the large-screen terminal at the venue.
However, this mode of operation is extremely error-prone. For an interpreter working under high intensity, having to set the new output language on the relevant device while switching output languages easily leads to omissions and abnormal results. The conference administrator must concentrate on following language switches and identifying the speaker's language; untimely or incorrect switching also causes confusion. Overall, this approach is difficult to operate and provides a poor user experience.
Summary
This application provides a method, apparatus, and system for implementing simultaneous interpretation, which reduce manual involvement in simultaneous interpretation and improve the efficiency of simultaneous interpretation in conferences.
According to a first aspect, this application provides a method for implementing simultaneous interpretation. A media server receives a first audio stream and a second audio stream translated from the first audio stream; sends the second audio stream to an AI device to identify the language of the second audio stream; and then sends the second audio stream to a first terminal according to its language, where the language of the second audio stream is the language of the audio stream that the first terminal expects to receive. Because the media server uses the AI device to identify the language of the translated audio stream (the second audio stream), the interpreter no longer needs to manually set the language of their translation on the interpretation terminal, which relieves the interpreter's burden, reduces the error rate of the conference language system, and improves the efficiency of simultaneous interpretation.
In a possible implementation, the media server sends the first audio stream to the AI device to identify its language, and then sends the first audio stream to a second terminal according to that language, where the language of the first audio stream is the language of the audio stream that the second terminal expects to receive. The media server uses the AI device to identify the language of the speaker's original audio stream (the first audio stream), so the conference administrator no longer needs to manually set the speaker's language on the conference room terminal, reducing manual involvement in the whole simultaneous interpretation process and improving its efficiency.
In another possible implementation, the media server determines the language of the second audio stream according to the language identification result returned by the AI device. In this implementation, the AI device directly returns the language identification result, and the media server forwards the second audio stream to the first terminal according to that result without any further processing.
In another possible implementation, the media server receives text corresponding to the second audio stream returned by the AI device and determines the language of the second audio stream from that text. The AI device converts the audio stream into text and sends it to the media server, which determines the language type of the second audio stream from the text. In this implementation, after receiving the text returned by the AI device, the media server may also forward the text to the corresponding terminals according to each terminal's settings, thereby implementing real-time subtitles.
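To make the text-based variant concrete, here is a minimal sketch (not taken from the patent) of how a media server might guess a stream's language from the returned text and fan that text out as real-time subtitles; the Unicode-range heuristic, the language codes, and the terminal registry are illustrative assumptions.

```python
# Guess a language from AI-returned text, then forward it as subtitles.
# The script-based heuristic below is an illustrative stand-in for the
# AI device's actual language identification.

def guess_language(text: str) -> str:
    """Crude script-based guess: CJK characters imply Chinese,
    Cyrillic implies Russian, otherwise assume English."""
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    cyr = sum(1 for ch in text if '\u0400' <= ch <= '\u04ff')
    if cjk > len(text) * 0.3:
        return "zh"
    if cyr > len(text) * 0.3:
        return "ru"
    return "en"

def forward_subtitles(text: str, terminals: dict[str, str], send) -> None:
    """Send the subtitle text to every terminal whose expected
    language (terminal id -> language code) matches the guess."""
    lang = guess_language(text)
    for terminal_id, expected_lang in terminals.items():
        if expected_lang == lang:
            send(terminal_id, text)
```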
In another possible implementation, the media server sends the first audio stream to the interpretation terminals used by all interpreters, and then receives the second audio stream, which is one of the audio streams returned by those interpretation terminals. In this implementation, the media server adopts a broadcast-to-all policy when sending the speaker's original audio stream to interpreters, without needing to consider the interpreters' translation capabilities, which reduces the consumption of the media server's computing resources and lowers the error rate of simultaneous interpretation.
In another possible implementation, the language of the first audio stream is a first language and the language of the second audio stream is a second language. The media server sends the first audio stream to a third terminal according to the AI device's language identification result for the first audio stream and a first translation capability parameter, where the first translation capability parameter indicates that the translation capability of a first interpreter using the third terminal includes translating the first language into the second language; the media server then receives the second audio stream sent by the third terminal. In this implementation, the media server considers the interpreters' translation capabilities when forwarding the speaker's original audio stream, that is, it forwards the original audio stream only to interpreters whose work involves the language of the first audio stream, reducing the transmission of redundant information and the consumption of network transmission resources.
In another possible implementation, the media server receives the first translation capability parameter sent by the third terminal. The first interpreter reports their own translation capability parameter, such as bidirectional Chinese-English translation or bidirectional English-French translation, to the media server through the third terminal.
In another possible implementation, the media server specifies, before the conference starts, the translation capability parameter corresponding to the third terminal, and the interpreter selects, according to their own translation capability, the third terminal for receiving the speaker's original audio stream and sending the translated audio stream.
In another possible implementation, the language of the first audio stream is a first language and the language of the second audio stream is a second language. The media server determines a fourth terminal and a fifth terminal according to the AI device's language identification result for the first audio stream, a second translation capability parameter, and a third translation capability parameter, where the second translation capability parameter indicates that the translation capability of a second interpreter using the fourth terminal includes translating the first language into a third language, and the third translation capability parameter indicates that the translation capability of a third interpreter using the fifth terminal includes translating the third language into the second language. The media server sends the first audio stream to the fourth terminal; receives from the fourth terminal a third audio stream whose language is the third language; sends the third audio stream to the fifth terminal; and receives the second audio stream from the fifth terminal. The media server determines the interpretation relay policy according to the language identification result of the first audio stream and the interpreters' translation capability parameters, ensuring the normal operation of the conference translation service.
In another possible implementation, before sending the second audio stream to the first terminal, the media server also stores the second audio stream. After a determined moment, the media server sends the second audio stream to the first terminal starting from the second audio stream stored before the determined moment, where the determined moment is the moment at which the media server determines that the language of the second audio stream is the language that the first terminal expects to receive. Buffering the second audio stream before sending it to the first terminal, and forwarding only after the language information is confirmed, reduces the probability of crosstalk at the venue and improves the user experience.
In another possible implementation, the media server receives first language setting information sent by the first terminal, indicating the language of the audio stream that the first terminal expects to receive, and second language setting information sent by the second terminal, indicating the language of the audio stream that the second terminal expects to receive. The media server determines, from each terminal's language setting information, the language of the audio stream that each terminal expects to receive.
In another possible implementation, the AI device and the media server are deployed in the same server. When the AI device and the media server are deployed in the same server, the communication latency between them decreases, reducing the impact of the network on the simultaneous interpretation service.
As described above, the simultaneous interpretation method provided by this application uses an AI device to perform language identification on each audio stream to implement an efficient conference interpretation service. The conference administrator no longer needs to manually change the current speaker's language on the conference terminal when the speaker switches languages, reducing manual involvement in the simultaneous interpretation process; the interpreter no longer needs to set the language they are about to output on the interpretation terminal before translating, relieving the interpreter's workload and reducing the probability of conference language errors. In short, this simultaneous interpretation method relieves the staff's burden and improves the efficiency of simultaneous interpretation in conferences.
According to a second aspect, this application provides an apparatus for implementing simultaneous interpretation, the apparatus including modules for performing the method for implementing simultaneous interpretation in the first aspect or any possible implementation of the first aspect.
According to a third aspect, this application provides a system for implementing simultaneous interpretation, the system including a media server and an AI device. The media server is configured to receive a first audio stream and a second audio stream, where the second audio stream is an audio stream translated from the first audio stream, and to send the second audio stream to the AI device; the AI device is configured to receive the second audio stream and send first language identification information to the media server; the media server is further configured to determine the language of the second audio stream according to the first language identification information and send the second audio stream to a first terminal.
In another possible design, the media server is further configured to send the first audio stream to the AI device; the AI device is further configured to receive the first audio stream and send second language identification information to the media server; the media server is further configured to determine the language of the first audio stream according to the second language identification information and send the first audio stream to a second terminal.
In another possible design, the first language identification information includes the language identification result of the second audio stream or the text corresponding to the second audio stream.
For the technical effects achievable by any possible design of the third aspect, refer to the technical effects achievable by the first aspect; details are not repeated here.
According to a fourth aspect, this application provides a simultaneous interpretation device including a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface are connected by the bus and communicate with each other. The memory stores computer-executable instructions, and when the device runs, the processor executes the computer-executable instructions in the memory to use the hardware resources in the device to perform the operation steps performed by the media server in the method of the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method of the first aspect.
According to a sixth aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of the first aspect.
Based on the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Brief Description of the Drawings
FIG. 1 is an architecture diagram of a system in which output languages are set manually to implement conference simultaneous interpretation, according to this application.
FIG. 2 is a schematic diagram of a method in which output languages are set manually to implement conference simultaneous interpretation, according to this application.
FIG. 3 is an architecture diagram of a system for implementing conference simultaneous interpretation according to this application.
FIG. 4 is an overall flowchart of implementing conference simultaneous interpretation according to this application.
FIG. 5 is a schematic diagram of a method for implementing conference simultaneous interpretation according to this application.
FIG. 6 is a flowchart of a method for implementing conference simultaneous interpretation according to this application.
FIG. 7 is a flowchart of another method for implementing conference simultaneous interpretation according to this application.
FIG. 8 is a schematic structural diagram of a conference simultaneous interpretation device according to this application.
FIG. 9 is a schematic structural diagram of another conference simultaneous interpretation device according to this application.
Detailed Description
To improve the readability of this application, some terms are explained before the embodiments provided by this application are introduced:
Multimedia Control Unit (MCU): a media processing server based on a centralized architecture that can decode, mix, and encode audio/video streams. It is used to connect multiple terminals for multipoint audio/video communication.
Selective Forwarding Unit (SFU): a media processing server based on a centralized architecture that only forwards audio/video streams without decoding, mixing, or encoding them. It is used to connect multiple terminals for multipoint audio/video communication.
Pulse Code Modulation (PCM): one of the encoding methods for digital communication. The main process is to sample analog signals such as voice and images at regular intervals to discretize them, round and quantize the sampled values by hierarchical units, and represent the amplitude of the sampling pulses by a set of binary codes. In a conference system, the speaker's voice is generally first captured by a sound-pickup device such as a microphone to form an analog signal; the conference terminal then encodes the analog signal into a PCM digital stream, and then encodes the PCM stream into an audio stream under a protocol such as AAC-LD (Advanced Audio Coding-Low Delay) before sending it to the media server. That is, in general, the original audio stream sent out of the conference room is an audio stream obtained by encoding the analog signal twice.
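As an illustration of the sampling-and-quantization step just described, the following minimal Python sketch converts an "analog" signal into 16-bit linear PCM samples; the 16 kHz rate and the sine-wave source are assumptions made for the example, and a real conference terminal would go on to encode this PCM stream into AAC-LD.

```python
# Minimal PCM sketch: sample an analog signal at fixed intervals and
# quantize each sample to a signed 16-bit integer code.
import math

SAMPLE_RATE = 16_000          # samples per second (assumed)
FULL_SCALE = 2 ** 15 - 1      # max amplitude of signed 16-bit PCM

def sample_pcm(analog, duration_s: float) -> list[int]:
    """Sample `analog(t)` (returning values in [-1.0, 1.0]) every
    1/SAMPLE_RATE seconds and round each value to a 16-bit integer."""
    n = int(duration_s * SAMPLE_RATE)
    return [round(analog(i / SAMPLE_RATE) * FULL_SCALE) for i in range(n)]

# Example: 20 ms of a 440 Hz tone standing in for the microphone signal.
frames = sample_pcm(lambda t: math.sin(2 * math.pi * 440 * t), 0.02)
```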
As globalization advances, cooperation among countries has become unprecedentedly close, and the number of international conferences keeps increasing. The diversity of languages gives rise to the need for simultaneous interpretation services in conferences. As a mode of translation, the greatest feature of simultaneous interpretation is its efficiency: the average interval between the original speech and its translation is three to four seconds, so the speaker can speak coherently without their train of thought being affected or interrupted, which helps the audience understand the full speech.
During a conference, interpreters sit in a small, well-soundproofed room (commonly called a "booth"), use professional equipment to simultaneously interpret what they hear through headphones into the target language, and output it through a microphone. Attendees who need simultaneous interpretation can set the language they need on an audio-stream receiving device and obtain the translated information.
As shown in FIG. 1, a simultaneous interpretation system mainly includes an interpretation terminal 11, a conference room terminal 12, user terminals 13, and a media server 14.
The interpretation terminal 11 is the device with which an interpreter receives the speaker's audio stream or outputs their own translated audio stream; it may be a handheld mobile terminal, such as a mobile phone, or a personal computer. The translation work of an international conference interpreter is generally bidirectional, for example bidirectional translation between Chinese and English. During the conference, the interpreter changes their output language at any time as the speaker's language changes.
The conference room terminal 12 is generally located in the conference room where the speaker is, is connected to the speaker's microphone, and captures the speaker's original voice. The conference room terminal 12 also includes a signal transceiver module for sending the compressed original audio stream to the media server and receiving translated audio streams. In addition, the conference room terminal 12 may include a loudspeaker unit, such as a speaker or sound system, for broadcasting the speaker's original voice and translated audio streams. Before sending an audio stream to the media server, the conference terminal first encodes the analog signal captured by the microphone into a PCM stream, then encodes the PCM stream into an audio stream under a protocol such as AAC-LD, and then sends it to the media server. Audio encoding and decoding in conference systems is prior art and is not described in detail in this invention. For ease of description, the various audio streams appearing below in this application (original audio streams and translated audio streams) mainly emphasize the language of the stream and do not limit whether the stream is encoded or decoded. The conference room terminal 12 takes many forms; it may be a large-screen terminal or a computer connected to a sound system, and this application does not limit its specific form.
The user terminals 13 include terminals 1, 2, and 3, whose users are users 1, 2, and 3. The languages of users 1, 2, and 3 may be the same or different. A user may choose to receive an audio stream in only one language, or to receive the original audio stream plus an audio stream in another language. Users 1, 2, and 3 may be in the conference room or anywhere else. A user terminal may be a handheld mobile terminal, a personal computer, or any other device that can output audio.
The media server 14 is a media processing server, which may be an MCU or an SFU. The media server may be deployed in the cloud or in a local equipment room. In the embodiments of this invention, the media server is mainly used to forward the speaker's audio stream and the interpreters' output audio streams according to the languages configured by each terminal (user terminal or conference room terminal).
The terminals communicate with the media server 14 over a network. The network includes wired and/or wireless transmission, where wired transmission includes data transmission over Ethernet, optical fiber, and the like, and wireless transmission includes broadband cellular network transmission such as 3G (Third Generation), 4G (Fourth Generation), or 5G (Fifth Generation).
As shown in FIG. 2, assume the conference supports two languages, Chinese and English. The flow of implementing simultaneous interpretation by manually setting the output language includes the following steps:
Step 1: When joining the conference, Chinese-speaking user A, English-speaking user B, and English-speaking user C set the languages of the audio streams they want to receive through their terminals to Chinese, English, and English respectively, so that the media server can identify each user's expected language and forward the corresponding audio streams to the corresponding terminals.
Step 2: Assuming the current speaker's language is Chinese, the conference administrator needs to manually set the conference room's language to Chinese so that the media server can identify the speaker's language as Chinese. That is, the administrator's manual operation helps the media server identify the language of the original audio stream sent by the conference room terminal.
Step 3: The conference room terminal sends the original audio stream to the media server, which forwards it to the interpreter after receiving it. In another implementation, the conference room terminal may also send the original audio stream directly to the interpretation terminal.
Step 4: The Chinese-English interpreter's translation capability parameter is bidirectional Chinese-English translation. After receiving the original audio stream, the interpreter outputs their translated audio stream through a sound-pickup device such as a microphone. Before translating, the interpreter must manually set their output language to English on the interpretation terminal so that the media server can identify the language of the translated audio stream.
Step 5: The media server sends the original audio stream and the translated audio stream to the corresponding user terminals according to the languages set by users A, B, and C.
Step 6: When the speaker changes and the new speaker speaks English, the conference administrator needs to manually set the conference room's language to English again so that the media server can identify the speaker's language. At the same time, after receiving the English original audio stream and before formally translating, the interpreter must reset their output language to Chinese on the interpretation terminal so that the media server can identify the language of the translated audio stream. The media server then re-forwards the corresponding audio streams according to the users' and the conference room's settings.
The above simultaneous interpretation method is inefficient and error-prone. The conference administrator must follow the progress of the conference at all times and switch the conference room's output language promptly when the speaker changes languages; if the administrator is distracted or switches incorrectly, the media server will be unable to identify the language of the audio stream output by the conference room, causing the conference languages to become confused. The interpreter must first set their output language after receiving the speaker's original audio stream. However, the interpreter's workload is already heavy; if they must also set their output language before translating, they can easily forget to change the setting or switch incorrectly, which again confuses the conference languages, degrades the attendees' experience, and may even interrupt the conference.
To solve the above problems, this application provides a method for implementing simultaneous interpretation. The corresponding system architecture is shown in FIG. 3, which adds an AI (Artificial Intelligence) device 15 to the architecture of FIG. 1. The AI device 15 may be deployed in the cloud or in a local equipment room. In terms of server hardware architecture, an AI device generally adopts a heterogeneous server; different combinations can be adopted depending on the application scope, such as CPU+GPU, CPU+TPU, or CPU plus other accelerator cards. At present, AI devices commonly use the CPU+GPU form; unlike the CPU, the GPU adopts a parallel computing mode and is good at handling compute-intensive data operations such as graphics rendering and machine learning. To ensure communication efficiency, in one possible implementation the AI device 15 and the media server 14 are integrated and deployed in the cloud. In another possible implementation, the media server 14 is deployed in the customer's equipment room and the AI device is deployed in the cloud. In another possible implementation, the AI device 15 and the media server 14 are integrated and deployed in the customer's equipment room.
In the system architecture shown in FIG. 3, the functions of the conference room terminal include sending the speaker's original audio stream and receiving translated audio streams. In another possible implementation, the conference room terminal may broadcast only the original audio without broadcasting translated audio streams. In that case, the media server does not need to forward translated audio streams to the conference room terminal, only to the user terminals; that is, where translated audio streams flow depends mainly on the conference administrator's settings and the users' settings. In some cases, the speaker's original voice may also be transmitted directly to the interpreters for translation without being forwarded by the media server. In the following description, however, the embodiments of this application still take as an example the case where the speaker's original voice is forwarded to the interpreters through the media server, that is, the media server performs the overall regulation and forwarding.
Before the specific implementation methods of the embodiments of this application are introduced, some supplementary description of the simultaneous interpretation conference scenario is given. In a real international conference scenario, a conference often includes participants corresponding to more than two languages. However, for the sake of the on-site experience, generally only two languages are broadcast at the venue, preset in advance by the conference administrator. For example, a standard United Nations international conference generally involves six languages: Arabic, Chinese, English, French, Russian, and Spanish; accordingly, the venue includes the leaders of these six countries as well as journalists and related staff. If, for every sentence the speaker says, the corresponding other five languages were all broadcast, the venue would become a jumble of languages and the user experience would suffer. Therefore, generally only the two languages set by the conference administrator are broadcast at the venue. Suppose the administrator sets Chinese and English: when a Chinese speaker speaks, the venue first broadcasts the Chinese speaker's original audio stream, and at the same time the venue's large-screen terminal broadcasts the translated English. On-site audience members from the Arabic-, French-, Russian-, and Spanish-speaking countries need to wear headphones to receive audio in the corresponding languages; that is, these on-site audience members correspond to users 1, 2, and 3 of user terminals 1, 2, and 3 in FIG. 3. It should be noted that the above description of the conference venue is only intended to make the solution more complete and does not constitute any limitation on this invention.
The overall method flow provided by the embodiments of this application is described below with reference to FIG. 4.
Step S41: The media server receives conference join requests sent by users and the conference administrator through their respective terminals; the requests include language setting information. When joining the conference, users and the conference administrator generate language setting information through their respective terminals and send it to the media server. The language setting information indicates the language of the audio stream that the user or the conference room expects to receive.
For a user, the setting may be to receive an audio stream in only one language, or to receive the original audio stream plus an audio stream in another language. For the conference administrator, the setting may be to receive no translated audio streams, that is, the conference room broadcasts only the original audio stream, or to receive audio streams in one or two languages.
In one implementation, after receiving a conference join request from a user or the conference administrator, the media server allocates different UDP (User Datagram Protocol) ports to the user terminals and the conference terminal. The media server only needs to listen on the corresponding UDP ports to send and receive audio streams with each terminal (user terminal or conference room terminal). In another implementation, each terminal (user terminal or conference room terminal) negotiates its own SSRC (synchronization source identifier) with the media server through the conference join request; afterwards, each terminal carries its pre-negotiated SSRC in its audio stream packets so that the media server can distinguish the audio streams from the terminals.
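The two join-time bookkeeping options above can be sketched as follows; the port range, field names, and class layout are illustrative assumptions rather than the patent's implementation.

```python
# Join-time bookkeeping: either give each terminal its own UDP port,
# or record the SSRC negotiated in its join request.
import itertools

class JoinRegistry:
    def __init__(self, first_port: int = 40_000):
        self._ports = itertools.count(first_port)
        self.port_of = {}    # terminal id -> dedicated UDP port
        self.ssrc_of = {}    # negotiated SSRC -> terminal id

    def allocate_port(self, terminal_id: str) -> int:
        """Option 1: one UDP port per joining terminal; the server
        then knows the sender from the port it is listening on."""
        port = next(self._ports)
        self.port_of[terminal_id] = port
        return port

    def register_ssrc(self, terminal_id: str, ssrc: int) -> None:
        """Option 2: remember the SSRC from the join request so later
        audio packets can be attributed by their SSRC field."""
        self.ssrc_of[ssrc] = terminal_id
```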
Step S42: The media server receives conference join requests sent by interpreters through their terminals. For example, an interpreter can enter the same conference ID as the users and the conference administrator in step S41 to join the same conference. In one possible implementation, the interpreter's join request also includes a translation capability parameter, which indicates the interpreter's scope of work, such as bidirectional Chinese-English translation or bidirectional English-French translation. As in step S41, after receiving an interpreter's join request sent through the interpretation terminal, the media server allocates a UDP port to the interpretation terminal and subsequently sends and receives audio streams with it through that port.
Step S43: The media server receives the original audio stream sent by the conference room terminal. After the conference formally starts, the speaker takes the stage, and the conference room terminal forwards the original audio stream to the media server.
Step S44: The media server forwards the original audio stream to the AI device to identify its language. The media server identifies the language of the original audio stream according to the language identification information returned by the AI device. The language identification information may be the AI device's language identification result directly, or text generated by the AI device corresponding to the original audio stream.
Step S45: The media server forwards the original audio stream to the interpretation terminals.
In one implementation, the media server forwards the audio stream to all interpretation terminals. That is, regardless of the interpreters' scope of work, the original audio stream is forwarded to the interpretation terminals used by all interpreters.
In another possible implementation, the media server forwards the original audio stream to the corresponding interpretation terminals according to the interpreters' translation capability parameters. For example, if interpreter 1's translation capability parameter is bidirectional Chinese-English translation and interpreter 2's is bidirectional English-French translation, then when the original audio stream is identified as Chinese in step S44, it is forwarded only to interpreter 1, not to interpreter 2. In this implementation, step S44 must be performed before step S45, and the media server must also obtain the interpreters' translation capability parameters before the conference starts. There are several ways to obtain them: for example, the join request sent by the interpretation terminal in step S42 may carry the translation capability parameter; or the media server presets, for each interpretation terminal, the translation capability parameter of the corresponding interpreter, and interpreters then choose the corresponding terminals according to the terminals' settings to carry out their translation work.
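A minimal sketch of this capability-based forwarding in step S45, with assumed terminal identifiers and an assumed encoding of bidirectional language pairs:

```python
# Select the interpretation terminals whose language pairs involve the
# identified language of the original stream.

def capable_terminals(source_lang: str,
                      capabilities: dict[str, set[frozenset]]) -> list[str]:
    """Return the terminals whose bidirectional language pairs
    include the identified source language."""
    return [
        terminal for terminal, pairs in capabilities.items()
        if any(source_lang in pair for pair in pairs)
    ]

capabilities = {
    "interp-1": {frozenset({"zh", "en"})},   # Chinese-English interpreter
    "interp-2": {frozenset({"en", "fr"})},   # English-French interpreter
}
# If step S44 identified the original stream as Chinese:
assert capable_terminals("zh", capabilities) == ["interp-1"]
```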
Step S46: The media server receives the translated audio streams sent by the interpretation terminals.
Step S47: The media server sends the translated audio streams to the AI device to identify their languages. The identification method is the same as in step S44 and is not repeated here.
Step S48: The media server forwards the original audio stream and the translated audio streams to the terminals (user terminals, the conference room terminal, or interpretation terminals) according to the language of the original audio stream and the languages of the translated audio streams.
Steps S41-S48 describe a relatively complete method flow for implementing simultaneous interpretation. It should be noted that the numbering of the above steps does not indicate their order of execution; for example, in some cases step S46 may be performed before step S44. In addition, depending on how the AI device and the media server are deployed, some steps may be omitted directly. For example, when the AI device and the media server are deployed together, steps such as forwarding the audio stream to be identified to the AI device and receiving the returned information can be omitted, and the media server identifies the language of the audio stream directly.
In one possible implementation, every user's language settings include receiving the original audio stream; that is, no matter what language the speaker uses, users expect to receive the speaker's original audio stream. In that case, the media server does not need to send the original audio stream to the AI device for identification, that is, step S44 is not performed. Meanwhile, since the language of the original audio stream is not identified, the media server must also adopt the broadcast-to-all policy when forwarding the original audio stream to the interpreters (step S45), that is, every interpreter receives the original audio stream.
In the above method for implementing simultaneous interpretation, the conference administrator no longer needs to manually switch the language output by the conference room as speakers change, reducing the probability of language errors at the venue; the interpreter no longer needs to change their output language on the terminal each time they switch translation direction, relieving the interpreters' burden. The entire conference is regulated and forwarded by the media server, reducing manual participation and improving the efficiency of simultaneous interpretation at the venue.
The method for implementing simultaneous interpretation provided by the embodiments of this application is described in detail below taking FIG. 5 and FIG. 6 as examples. For convenience, assume the whole conference involves three languages: Chinese, English, and Russian. Assume that when the Chinese, British, and Russian users send conference join requests to the media server through their mobile terminals, they set the languages of the audio streams they expect to receive to Chinese, English, and Russian respectively. Meanwhile, in the join request sent through the conference room terminal, the conference administrator sets the languages of the translated audio streams received by the conference room to Chinese or English. In this embodiment, the interpreters at the venue are a Chinese-English interpreter and an English-Russian interpreter. The users, conference administrator, and interpreters corresponding to the user terminals, conference room terminal, and interpretation terminals can join the same conference by entering the same conference ID (Identification). The description below starts directly from the stage at which the speaker begins speaking.
Step S51: Suppose an English speaker takes the stage first. The conference room terminal captures the speaker's original audio stream through a sound-pickup device such as a microphone and sends it to the media server.
Step S52: The media server sends the original audio stream to the AI device to identify the language of the original audio stream. The AI device may directly return the language identification result, or return text corresponding to the original audio stream so that the media server can determine the language from the text. Note that if the AI device and the media server are deployed on the same server cluster, this step can be omitted and the media server can identify the language of the original audio stream directly.
Step S53: The media server sends the speaker's original audio stream to the interpreters, who receive it through devices such as mobile terminals or computers. In this step, the media server may selectively forward the original audio stream to interpreters with different translation capability parameters, or forward it to all interpreters, depending mainly on the conference settings or the interpreters' settings. Selective forwarding requires collecting the interpreters' translation capability parameters in advance. In this embodiment, the media server implements the broadcast-to-all policy when sending the original audio stream to the interpreters' terminals, that is, the original audio stream is forwarded to all interpreters (the Chinese-English and English-Russian interpreters shown in FIG. 5).
Step S54: The interpreters translate based on the original audio, and the interpretation terminals send the translated audio streams to the media server. In the embodiment provided by this application, interpreters no longer need to pay attention to the language they output; they simply translate what they hear into the other language by professional instinct. Suppose the Chinese-English interpreter's translated stream is audio stream A and the English-Russian interpreter's translated stream is audio stream B.
Step S55: The media server sends the translated audio streams (audio streams A and B) sent by the interpreters through their terminals to the AI device to identify the languages of the translated streams. As in step S52, the media server may receive the language identification results returned by the AI device or the text corresponding to the audio streams to determine their languages. Also, if the AI device and the media server are deployed on the same server cluster, the actions of sending the audio streams and receiving the identification results can be omitted.
Step S56: The media server sends the translated audio streams according to the languages set for the conference room. For the sake of user experience, a conference room broadcasts at most two languages. As mentioned above, the conference administrator has set the languages of the received translated audio streams to English or Chinese, with English taking priority when broadcasting. Under this rule, if a Chinese speaker speaks, the venue plays the speaker's original voice and the English translation; if a Russian speaker speaks, the venue plays the speaker's original voice and the English translation; and if an English speaker speaks, the venue plays the speaker's original voice and the Chinese translation. In step S52, the media server has determined from the AI device's result that the current speaker's language is English; in step S55 it has determined that audio stream A is Chinese and audio stream B is Russian. According to the rule set by the conference administrator (the expected languages include Chinese and English), the media server sends audio stream A, output by the Chinese-English interpreter, to the conference room terminal.
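The room's two-language rule in step S56 can be sketched as a small selection function; the stream labels and the English-first preference encoding below are illustrative assumptions.

```python
# Pick the streams to play in the room: the original voice plus at most
# one translated stream, preferring English unless the speaker is
# already speaking English, in which case Chinese is played.

def room_streams(speaker_lang: str, translated: dict[str, str]) -> list[str]:
    """`translated` maps language code -> stream id. Returns the
    stream ids to play in the room alongside the original voice."""
    preferred = "zh" if speaker_lang == "en" else "en"
    chosen = translated.get(preferred)
    return ["original"] + ([chosen] if chosen else [])

# English speaker; Chinese stream A and Russian stream B are available:
assert room_streams("en", {"zh": "A", "ru": "B"}) == ["original", "A"]
```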
Step S57: The media server forwards the corresponding audio streams according to the users' language settings. The media server forwards the original audio stream to the British user, audio stream A to the Chinese user, and audio stream B to the Russian user.
It should be noted that this application does not specifically limit the order of steps S52-S55. The original audio stream may be sent to the AI device for identification as soon as it is received and then sent to the interpretation terminals; or it may be forwarded to the interpretation terminals as soon as it is received, after which the original audio stream and the translated audio streams are sent to the AI device for identification.
It should be added that different policies may also be adopted for the frequency of language identification. In one implementation, the speaker's original audio stream is transmitted to the AI device continuously for identification, so that changes in the speaker's language can be identified quickly and forwarding remains accurate; the AI device may then send a language identification result to the media server only when the identified language changes. In another implementation, the media server may send the original audio stream intermittently to save network transmission resources, with the size of the interval set empirically.
In the above embodiment, the venue involves few languages. In practice, however, a conference involves more languages, and for cost reasons there may not be enough interpreters to cover, for each speaker language, interpreters into all the other languages. For example, suppose the speaker's language is Russian and the conference's supported languages include Chinese (that is, there may be Chinese-speaking listeners), but there is no Russian-Chinese interpreter on site; interpretation relay is then needed. That is, one interpreter first translates Russian into English, and another then translates English into Chinese. In such a case, the media server's policy for forwarding the original audio stream to interpreters remains as above: it may forward to everyone or only to interpreters involved with the relevant languages. The difference is that, during relay, the media server must additionally forward the English stream output by the Russian-English interpreter to the English-Chinese interpreter to obtain the Chinese audio stream. For the sake of translation quality, relay is generally performed only once. The media server can determine the optimal relay policy based on the speaker's language, the interpreters' translation capabilities, and the languages of the audio streams ultimately needed by the conference. The languages ultimately needed may be set uniformly by the conference administrator or determined from the expected languages reported by users. In the relay case, the media server must forward audio streams between interpretation terminals according to the computed relay policy.
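A minimal sketch of such a one-hop relay computation, under the assumption that each interpreter terminal advertises one bidirectional language pair (the terminal names and language codes are illustrative):

```python
# Find a direct interpreter for src -> dst, or a single relay through
# an intermediate language, matching the one-hop limit described above.

def plan_route(src: str, dst: str, caps: dict[str, set[str]]):
    """caps maps terminal -> its bidirectional language pair.
    Returns a list of (terminal, output_language) hops, or None."""
    for term, pair in caps.items():                      # direct route
        if src in pair and dst in pair:
            return [(term, dst)]
    for t1, p1 in caps.items():                          # one relay hop
        if src not in p1:
            continue
        mid = next(iter(p1 - {src}))                     # other language
        for t2, p2 in caps.items():
            if t2 != t1 and mid in p2 and dst in p2:
                return [(t1, mid), (t2, dst)]
    return None

caps = {"interp-A": {"ru", "en"}, "interp-B": {"zh", "en"}}
# Russian speaker, Chinese listeners, no ru-zh interpreter on site:
assert plan_route("ru", "zh", caps) == [("interp-A", "en"), ("interp-B", "zh")]
```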
The advantages of the embodiments of this application become even more apparent when the speaker changes. Suppose the speaker changes and the speaking language switches from English to Russian. Based on the translation capabilities of the interpreters on site, it can be determined that interpretation relay is needed. When relay is needed, the media server must obtain each interpreter's translation capability parameters in advance in order to forward audio streams between the interpretation terminals. With reference to FIG. 7, the simultaneous interpretation flow after the switch is as follows:
Step S71: The Russian speaker speaks, and the conference room terminal sends the original audio stream to the media server.
Step S72: The media server forwards the original audio stream to the AI device, which identifies its language as Russian.
Step S73: The media server sends the original audio stream to the interpretation terminals. Assume that in this embodiment the broadcast-to-all policy is still in effect, that is, the interpretation terminals used by all interpreters (the Chinese-English and English-Russian interpreters) receive the original audio stream.
Step S74: After receiving the original audio stream, the English-Russian interpreter directly translates the Russian they hear into English and sends translated audio stream 1 to the media server, without setting their output language on the terminal.
Step S75: The media server sends the audio stream output by the English-Russian interpreter to the AI device, which identifies the language of the translated stream as English.
Step S76: The media server determines that interpretation relay is needed and computes the relay policy. Based on the conference administrator's settings or the connected users, the media server determines that the current conference needs audio streams in Chinese, English, and Russian. From step S72 it can be determined that the original audio stream is Russian, and from step S75 that one interpreter's translated output is English; it can therefore be determined that a Chinese audio stream is missing. Based on the translation capability parameters provided by the interpreters, there is a Chinese-English interpreter on site, so the English translated audio stream can be forwarded to that interpreter to obtain a Chinese translated audio stream.
Step S77: The media server sends translated audio stream 1 to the Chinese-English interpreter. The media server has determined in step S75 that audio stream 1 is an English audio stream and, according to the relay policy computed in step S76, forwards translated audio stream 1 to the Chinese-English interpreter.
Step S78: The media server receives translated audio stream 2 sent by the Chinese-English interpreter. In this step, after receiving the English audio stream, the Chinese-English interpreter simply translates the English into Chinese by professional instinct, without manually setting the language of their output audio stream on the terminal.
Step S79: The media server sends the received translated audio stream 2 to the AI device, which identifies its language as Chinese.
Step S710: The media server forwards the corresponding audio streams according to the conference administrator's settings. Assuming the administrator has set the languages of the translated audio streams received by the conference terminal to Chinese and English, the media server forwards both translated audio streams 1 and 2 to the conference terminal.
Step S711: The media server forwards the corresponding audio streams according to the users' settings. According to those settings, the media server forwards the original audio stream to the Russian user, translated audio stream 1 to the English user, and translated audio stream 2 to the Chinese user.
It should be noted that the above step numbers do not necessarily indicate the order of execution, and in some cases some steps may be omitted. For example, in step S79, since this is a relay scenario and the media server has already determined the interpreters' translation capabilities, the stream received after forwarding the English audio stream to the Chinese-English interpreter should be a Chinese audio stream, so the media server need not send audio stream 2 to the AI service to determine its language. However, for the sake of accuracy and to ensure the conference translation is foolproof, all obtained audio streams are generally still sent to the AI device for language identification.
In addition to the method flows described above, when implementing this solution the media server must also buffer audio streams before sending them to user terminals or the conference room terminal in order to reduce crosstalk. In one implementation, audio streams are transmitted in time-based units; assume an audio packet is formed every 100 ms, that is, the conference terminal or interpretation terminal sends an audio stream packet to the media server every 100 ms. Each time the media server receives an audio packet, it sends the packet to the AI device to identify its language. Assuming the AI device needs 300 ms to identify the language of one packet, and ignoring the transmission delay between the media server and the AI device, the media server receives the language identification result of the first packet only after receiving three packets, and only then forwards the first packet to the corresponding user terminal or conference room terminal. Without buffering, the media server would discover only upon receiving the third packet that the first packet's language had changed, so the first and second packets would be misrouted to users or the conference room, causing crosstalk and degrading the user experience. In another implementation, the media server sends the audio stream to be identified to the AI device, which segments the received stream according to preset rules and then returns the language identification information of each segment to the media server. For example, the AI device's preset rules include identifying the language type based on the speaker's sentence breaks; that is, the AI device first detects sentence breaks in the audio stream, segments the received stream sentence by sentence, and returns the language identification information of each sentence to the media server. In short, the embodiments of this application do not specifically limit the unit, size, or duration of the audio streams buffered and identified; these depend on the situation.
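The packet-buffering behavior described in this paragraph can be sketched as follows; the queue layout and callback names are illustrative assumptions, with the ~100 ms packet unit taken from the example above.

```python
# Hold packets until the AI device's language verdict arrives; release
# only those whose confirmed language matches a receiver's expectation.
from collections import deque

class LanguageGatedBuffer:
    def __init__(self, expected_lang: str, send):
        self.expected_lang = expected_lang
        self.send = send          # callable(packet) delivering audio
        self.pending = deque()    # packets awaiting a language verdict

    def on_packet(self, packet) -> None:
        """Buffer each ~100 ms packet instead of forwarding at once."""
        self.pending.append(packet)

    def on_language_result(self, lang: str, count: int) -> None:
        """AI verdict covering the oldest `count` buffered packets:
        forward them only if the language matches, else drop them,
        so misrouted packets never cause crosstalk."""
        for _ in range(min(count, len(self.pending))):
            packet = self.pending.popleft()
            if lang == self.expected_lang:
                self.send(packet)
```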
The above method for implementing simultaneous interpretation reduces manual involvement and improves translation efficiency. It does not require a dedicated conference administrator to set the conference room's language (the current speaker's language), reducing labor and the chance of errors. Interpreters no longer need to set their output language each time they switch languages, relieving their burden. The AI device uniformly identifies the speaker's language and the interpreters' output languages, improving the accuracy of language switching and reducing the impact of human factors on simultaneous interpretation.
As technology advances, the interpreters' translation work can also be taken over by the AI device, that is, the AI device can perform the simultaneous interpretation for the whole conference.
FIG. 8 shows an apparatus 80 for implementing simultaneous interpretation provided by an embodiment of this application. The apparatus 80 may be implemented as part or all of a device by software, hardware, or a combination of the two. The apparatus provided by this embodiment can implement the flows described in FIGS. 4-7 of the embodiments of this application. The apparatus 80 includes a receiving module 81 and a sending module 82, where:
the receiving module 81 is configured to receive a first audio stream and a second audio stream, where the second audio stream is an audio stream translated from the first audio stream;
the sending module 82 is configured to send the second audio stream to an artificial intelligence (AI) device to identify the language of the second audio stream, and to send the second audio stream to a first terminal according to its language, where the language of the second audio stream is the language of the audio stream that the first terminal expects to receive.
Optionally, the sending module 82 is further configured to send the first audio stream to the AI device to identify its language, and to send the first audio stream to the second terminal according to that language, where the language of the first audio stream is the language of the audio stream that the second terminal expects to receive.
Optionally, the apparatus 80 further includes a processing module 83 configured to determine the language of the second audio stream according to the language identification result for the second audio stream returned by the AI device.
Optionally, the receiving module 81 is further configured to receive text corresponding to the second audio stream returned by the AI device, and the processing module 83 is further configured to determine the language of the second audio stream from the text.
Optionally, the sending module 82 is further configured to send the first audio stream to the interpretation terminals used by all interpreters, and the receiving module 81 is further configured to receive the second audio stream, which is one of the audio streams returned by those interpretation terminals.
Optionally, the language of the first audio stream is a first language and the language of the second audio stream is a second language. The sending module 82 is further configured to send the first audio stream to a third terminal according to the AI device's language identification result for the first audio stream and a first translation capability parameter, where the first translation capability parameter indicates that the translation capability of a first interpreter using the third terminal includes translating the first language into the second language; the receiving module 81 is further configured to receive the second audio stream sent by the third terminal.
Optionally, the receiving module 81 is further configured to receive the first translation capability parameter sent by the third terminal.
Optionally, the language of the first audio stream is a first language and the language of the second audio stream is a second language. The processing module 83 is further configured to determine a fourth terminal and a fifth terminal according to the AI device's language identification result for the first audio stream, a second translation capability parameter, and a third translation capability parameter, where the second translation capability parameter indicates that the translation capability of a second interpreter using the fourth terminal includes translating the first language into a third language, and the third translation capability parameter indicates that the translation capability of a third interpreter using the fifth terminal includes translating the third language into the second language. The sending module 82 is further configured to send the first audio stream to the fourth terminal; the receiving module 81 is further configured to receive a third audio stream sent by the fourth terminal, whose language is the third language; the sending module 82 is further configured to send the third audio stream to the fifth terminal; and the receiving module 81 is further configured to receive the second audio stream sent by the fifth terminal.
Optionally, the apparatus 80 further includes a storage module 84 configured to store the second audio stream; the sending module 82 is further configured to, after a determined moment, send the second audio stream to the first terminal starting from the second audio stream stored before the determined moment, where the determined moment is the moment at which the media server determines that the language of the second audio stream is the language that the first terminal expects to receive.
Optionally, the receiving module 81 is further configured to receive first language setting information sent by the first terminal, indicating the language of the audio stream that the first terminal expects to receive, and second language setting information sent by the second terminal, indicating the language of the audio stream that the second terminal expects to receive.
FIG. 9 shows a device 90 for implementing simultaneous interpretation provided by an embodiment of this application. As shown in the figure, the device 90 includes a processor 91, a memory 92, and a communication interface 93, which are communicatively connected by means such as wired or wireless transmission. The memory 92 is used to store instructions, and the processor 91 is used to execute them. The memory 92 stores program code, and the processor 91 can call the program code stored in the memory 92 to perform the following operations:
receive a first audio stream and a second audio stream, where the second audio stream is an audio stream translated from the first audio stream; send the second audio stream to the AI device to identify the language of the second audio stream; and send the second audio stream to a first terminal according to the language of the second audio stream, where the language of the second audio stream is the language of the audio stream that the first terminal expects to receive.
It should be understood that in this embodiment of this application the processor 91 may be a CPU or another general-purpose processor capable of executing stored program code.
The memory 92 may include read-only memory and random access memory and provides instructions and data to the processor 91. The memory 92 may also include non-volatile random access memory; for example, the memory 92 may also store device type information. The memory 92 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM). By way of example and not limitation, many forms of RAM are available, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In addition to a data bus, the bus 94 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity, the various buses are all labeled as bus 94 in the figure.
As a possible embodiment, this application also provides a system for implementing simultaneous interpretation. The system includes the apparatus 80 for implementing simultaneous interpretation and an AI device. In one possible implementation, the apparatus 80 and the AI device are deployed in the same server. The devices in the system perform the methods shown in FIGS. 4-7; for brevity, details are not repeated here.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, containing one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media. The semiconductor media may be solid state drives (SSDs).
The foregoing are merely specific implementations of this application. Any variation or replacement readily conceivable by a person skilled in the art based on the specific implementations provided by this application shall fall within the protection scope of this application.

Claims (27)

  1. A method for implementing simultaneous interpretation, wherein the method comprises:
    receiving, by a media server, a first audio stream and a second audio stream, wherein the second audio stream is an audio stream translated from the first audio stream;
    sending, by the media server, the second audio stream to an artificial intelligence (AI) device to identify a language of the second audio stream;
    sending, by the media server, the second audio stream to a first terminal according to the language of the second audio stream, wherein the language of the second audio stream is the language of the audio stream that the first terminal expects to receive.
  2. The method according to claim 1, wherein the method further comprises:
    sending, by the media server, the first audio stream to the AI device to identify a language of the first audio stream;
    sending, by the media server, the first audio stream to a second terminal according to the language of the first audio stream, wherein the language of the first audio stream is the language of the audio stream that the second terminal expects to receive.
  3. The method according to claim 1 or 2, wherein the sending, by the media server, the second audio stream to the AI device to identify the language of the second audio stream comprises:
    determining, by the media server, the language of the second audio stream according to a language identification result for the second audio stream returned by the AI device.
  4. The method according to claim 1 or 2, wherein the sending, by the media server, the second audio stream to the AI device to identify the language of the second audio stream comprises:
    receiving, by the media server, text corresponding to the second audio stream returned by the AI device;
    determining, by the media server, the language of the second audio stream according to the text.
  5. The method according to any one of claims 2-4, wherein the receiving, by the media server, the second audio stream comprises:
    sending, by the media server, the first audio stream to interpretation terminals used by all interpreters;
    receiving, by the media server, the second audio stream, wherein the second audio stream comes from one of the interpretation terminals used by all the interpreters.
  6. The method according to any one of claims 2-5, wherein the language of the first audio stream is a first language, the language of the second audio stream is a second language, and the receiving, by the media server, the second audio stream comprises:
    sending, by the media server, the first audio stream to a third terminal according to the AI device's language identification result for the first audio stream and a first translation capability parameter, wherein the first translation capability parameter is used to indicate that a translation capability of a first interpreter using the third terminal comprises translating the first language into the second language;
    receiving, by the media server, the second audio stream sent by the third terminal.
  7. The method according to claim 6, wherein before the media server sends the first audio stream to the third terminal, the method further comprises:
    receiving, by the media server, the first translation capability parameter sent by the third terminal.
  8. The method according to any one of claims 2-5, wherein the language of the first audio stream is a first language, the language of the second audio stream is a second language, and the receiving, by the media server, the second audio stream comprises:
    determining, by the media server, a fourth terminal and a fifth terminal according to the AI device's language identification result for the first audio stream, a second translation capability parameter, and a third translation capability parameter, wherein the second translation capability parameter is used to indicate that a translation capability of a second interpreter using the fourth terminal comprises translating the first language into a third language, and the third translation capability parameter is used to indicate that a translation capability of a third interpreter using the fifth terminal comprises translating the third language into the second language;
    sending, by the media server, the first audio stream to the fourth terminal;
    receiving, by the media server, a third audio stream sent by the fourth terminal, wherein the third audio stream is an audio stream translated from the first audio stream, and the language of the third audio stream is the third language;
    sending, by the media server, the third audio stream to the fifth terminal;
    receiving, by the media server, the second audio stream sent by the fifth terminal.
  9. The method according to any one of claims 1-8, wherein before the media server sends the second audio stream to the first terminal, the method further comprises:
    storing, by the media server, the second audio stream;
    after a determined moment, sending, by the media server, the second audio stream to the first terminal starting from the second audio stream stored before the determined moment, wherein the determined moment is the moment at which the media server determines that the language of the second audio stream is the language that the first terminal expects to receive.
  10. The method according to any one of claims 1-9, wherein the method further comprises:
    receiving, by the media server, first language setting information sent by the first terminal, wherein the first language setting information is used to indicate the language of the audio stream that the first terminal expects to receive;
    receiving, by the media server, second language setting information sent by the second terminal, wherein the second language setting information is used to indicate the language of the audio stream that the second terminal expects to receive.
  11. The method according to any one of claims 1-10, wherein the AI device and the media server are deployed in a same server.
  12. An apparatus for implementing simultaneous interpretation, wherein the apparatus comprises a receiving module and a sending module,
    the receiving module is configured to receive a first audio stream and a second audio stream, wherein the second audio stream is an audio stream translated from the first audio stream;
    the sending module is configured to send the second audio stream to an artificial intelligence (AI) device to identify a language of the second audio stream, and is further configured to send the second audio stream to a first terminal according to the language of the second audio stream, wherein the language of the second audio stream is the language of the audio stream that the first terminal expects to receive.
  13. The apparatus according to claim 12, wherein
    the sending module is further configured to send the first audio stream to the AI device to identify a language of the first audio stream, and is further configured to send the first audio stream to the second terminal according to the language of the first audio stream, wherein the language of the first audio stream is the language of the audio stream that the second terminal expects to receive.
  14. The apparatus according to claim 12 or 13, wherein the apparatus further comprises a processing module,
    the processing module is configured to determine the language of the second audio stream according to a language identification result for the second audio stream returned by the AI device.
  15. The apparatus according to claim 12 or 13, wherein the apparatus further comprises a processing module,
    the receiving module is further configured to receive text corresponding to the second audio stream returned by the AI device;
    the processing module is further configured to determine the language of the second audio stream according to the text.
  16. The apparatus according to any one of claims 13-15, wherein
    the sending module is further configured to send the first audio stream to interpretation terminals used by all interpreters;
    the receiving module is further configured to receive the second audio stream, wherein the second audio stream comes from one of the interpretation terminals used by all the interpreters.
  17. The apparatus according to any one of claims 13-16, wherein the language of the first audio stream is a first language and the language of the second audio stream is a second language,
    the sending module is further configured to send the first audio stream to a third terminal according to the AI device's language identification result for the first audio stream and a first translation capability parameter, wherein the first translation capability parameter is used to indicate that a translation capability of a first interpreter using the third terminal comprises translating the first language into the second language;
    the receiving module is further configured to receive the second audio stream sent by the third terminal.
  18. The apparatus according to claim 17, wherein the receiving module is further configured to receive the first translation capability parameter sent by the third terminal.
  19. The apparatus according to any one of claims 13-16, wherein the language of the first audio stream is a first language and the language of the second audio stream is a second language,
    the processing module is further configured to determine a fourth terminal and a fifth terminal according to the AI device's language identification result for the first audio stream, a second translation capability parameter, and a third translation capability parameter, wherein the second translation capability parameter is used to indicate that a translation capability of a second interpreter using the fourth terminal comprises translating the first language into a third language, and the third translation capability parameter is used to indicate that a translation capability of a third interpreter using the fifth terminal comprises translating the third language into the second language;
    the sending module is further configured to send the first audio stream to the fourth terminal;
    the receiving module is further configured to receive a third audio stream sent by the fourth terminal, wherein the third audio stream is an audio stream translated from the first audio stream, and the language of the third audio stream is the third language;
    the sending module is further configured to send the third audio stream to the fifth terminal;
    the receiving module is further configured to receive the second audio stream sent by the fifth terminal.
  20. The apparatus according to any one of claims 12-19, wherein the apparatus further comprises a storage module,
    the storage module is configured to store the second audio stream;
    the sending module is further configured to, after a determined moment, send the second audio stream to the first terminal starting from the second audio stream stored before the determined moment, wherein the determined moment is the moment at which the media server determines that the language of the second audio stream is the language that the first terminal expects to receive.
  21. The apparatus according to any one of claims 12-20, wherein
    the receiving module is further configured to receive first language setting information sent by the first terminal, wherein the first language setting information is used to indicate the language of the audio stream that the first terminal expects to receive, and is further configured to receive second language setting information sent by the second terminal, wherein the second language setting information is used to indicate the language of the audio stream that the second terminal expects to receive.
  22. The apparatus according to any one of claims 12-21, wherein the apparatus and the AI device are deployed in a same server.
  23. A system for implementing simultaneous interpretation, wherein the system comprises a media server and an AI device,
    the media server is configured to receive a first audio stream and a second audio stream, wherein the second audio stream is an audio stream translated from the first audio stream, and is further configured to send the second audio stream to the artificial intelligence (AI) device;
    the AI device is configured to receive the second audio stream and send first language identification information to the media server;
    the media server is further configured to determine the language of the second audio stream according to the first language identification information and send the second audio stream to a first terminal.
  24. The system according to claim 23, wherein:
    the media server is further configured to send the first audio stream to the AI device;
    the AI device is further configured to receive the first audio stream and send second language identification information to the media server;
    the media server is further configured to determine the language of the first audio stream according to the second language identification information and send the first audio stream to a second terminal.
  25. The system according to claim 23 or 24, wherein the first language identification information comprises a language identification result of the second audio stream or text corresponding to the second audio stream.
  26. A simultaneous interpretation device, wherein the simultaneous interpretation device comprises a processor and a memory, the memory stores computer instructions, and the processor executes the computer instructions in the memory to perform the method according to any one of claims 1-11.
  27. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1-11.
PCT/CN2021/138353 2020-12-15 2021-12-15 Method, apparatus, and system for implementing simultaneous interpretation WO2022127826A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21905752.8A EP4246962A4 (en) 2020-12-15 2021-12-15 METHOD, APPARATUS AND SYSTEM FOR SIMULTANEOUS INTERPRETATION
US18/333,877 US20230326448A1 (en) 2020-12-15 2023-06-13 Method, Apparatus, and System for Implementing Simultaneous Interpretation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011481105.5A 2020-12-15 2020-12-15 Method, apparatus, and system for implementing simultaneous interpretation
CN202011481105.5 2020-12-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/333,877 Continuation US20230326448A1 (en) 2020-12-15 2023-06-13 Method, Apparatus, and System for Implementing Simultaneous Interpretation

Publications (1)

Publication Number Publication Date
WO2022127826A1 true WO2022127826A1 (zh) 2022-06-23

Family

ID=81945476

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138353 2020-12-15 2021-12-15 Method, apparatus, and system for implementing simultaneous interpretation

Country Status (4)

Country Link
US (1) US20230326448A1 (zh)
EP (1) EP4246962A4 (zh)
CN (1) CN114638237A (zh)
WO (1) WO2022127826A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117057365B * 2023-08-11 2024-04-05 深圳市台电实业有限公司 Hybrid conference translation method and apparatus, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246172A1 (en) * 2010-03-30 2011-10-06 Polycom, Inc. Method and System for Adding Translation in a Videoconference
CN108650484A (zh) * 2018-06-29 2018-10-12 中译语通科技股份有限公司 一种基于音视频通讯的远程同声传译的方法及装置
CN208675397U (zh) * 2018-06-29 2019-03-29 中译语通科技股份有限公司 一种基于音视频通讯的远程同声传译的装置
CN109688367A (zh) * 2018-12-31 2019-04-26 深圳爱为移动科技有限公司 多终端多语言实时视频群聊的方法和系统
US20190220520A1 (en) * 2018-01-16 2019-07-18 Chih Hung Kao Simultaneous interpretation system, server system, simultaneous interpretation device, simultaneous interpretation method, and computer-readable recording medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BE1022611A9 (nl) * 2014-10-19 2016-10-06 Televic Conference Nv Apparatus for audio input/output
US20160170970A1 (en) * 2014-12-12 2016-06-16 Microsoft Technology Licensing, Llc Translation Control
EP3896687A4 (en) * 2018-12-11 2022-01-26 NEC Corporation TREATMENT SYSTEM, TREATMENT PROCEDURE AND PROGRAM

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246172A1 (en) * 2010-03-30 2011-10-06 Polycom, Inc. Method and System for Adding Translation in a Videoconference
US20190220520A1 (en) * 2018-01-16 2019-07-18 Chih Hung Kao Simultaneous interpretation system, server system, simultaneous interpretation device, simultaneous interpretation method, and computer-readable recording medium
CN108650484A (zh) * 2018-06-29 2018-10-12 中译语通科技股份有限公司 一种基于音视频通讯的远程同声传译的方法及装置
CN208675397U (zh) * 2018-06-29 2019-03-29 中译语通科技股份有限公司 一种基于音视频通讯的远程同声传译的装置
CN109688367A (zh) * 2018-12-31 2019-04-26 深圳爱为移动科技有限公司 多终端多语言实时视频群聊的方法和系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4246962A4 *

Also Published As

Publication number Publication date
EP4246962A4 (en) 2024-05-01
CN114638237A (zh) 2022-06-17
US20230326448A1 (en) 2023-10-12
EP4246962A1 (en) 2023-09-20

Similar Documents

Publication Publication Date Title
CN108076306B (zh) Conference implementation method, apparatus, device, system, and computer-readable storage medium
JP5534813B2 (ja) System, method, and multipoint control apparatus for realizing multilingual conferences
US10176366B1 (en) Video relay service, communication system, and related methods for performing artificial intelligence sign language translation services in a video relay service environment
AU2011200857B2 (en) Method and system for adding translation in a videoconference
TWI516080B (zh) Real-time VoIP communication method and system using n-way selective language processing
US8531994B2 (en) Audio processing method, system, and control server
US8868430B2 (en) Methods, devices, and computer program products for providing real-time language translation capabilities between communication terminals
CN102226944B (zh) Audio mixing method and device
US20160170970A1 (en) Translation Control
CN109618120B (zh) Video conference processing method and apparatus
CN109640028B (zh) Method and apparatus for holding a conference among multiple video networking terminals and multiple Internet terminals
US20180203850A1 (en) Method for Multilingual Translation in Network Voice Communications
CN110475094B (zh) Video conference processing method, apparatus, and readable storage medium
CN109005190B (zh) Method for implementing full-duplex voice dialogue and page control on a web page
TWI720600B (zh) Online translation method, system, and device based on remote conferencing, and computer-readable recording medium
WO2022127826A1 (zh) Method, apparatus, and system for implementing simultaneous interpretation
CN113114688B (zh) Multimedia conference management method and apparatus, storage medium, and electronic device
WO2015131750A1 (zh) Method, device, and system for establishing a multi-party call based on WebRTC
CN110049275B (zh) Information processing method and apparatus in a video conference, and storage medium
US20230096543A1 (en) Systems and methods for providing real-time automated language translations
US20220139417A1 (en) Performing artificial intelligence sign language translation services in a video relay service environment
CN110505432B (zh) Method and apparatus for displaying video conference operation results
CN111951821A (zh) Call method and apparatus
CN110475089B (zh) Multimedia data processing method and video networking terminal
CN112738446A (zh) Simultaneous interpretation method and system based on online conferences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21905752

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021905752

Country of ref document: EP

Effective date: 20230613

NENP Non-entry into the national phase

Ref country code: DE