CN108650484A

CN108650484A - Method and device for remote simultaneous interpretation based on audio/video communication

Info

Publication number: CN108650484A
Application number: CN201810694423.6A
Authority: CN
Inventors: 王语; 程国艮
Original assignee: Chinese Translation Language Through Polytron Technologies Inc
Current assignee: Chinese Translation Language Through Polytron Technologies Inc
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2018-10-12

Abstract

The present invention provides a method and device for remote simultaneous interpretation based on cloud audio/video communication. A venue microphone and camera capture the speaker's audio stream and video stream, which are stored on the venue PC. The venue PC uploads the stored audio and video streams to an audio/video cloud server over a one-way link for processing. An interpreter at the interpreter side selects an input-language live room and an output-language live room on the audio/video cloud server and translates the first language stored on the server into a second language; the video stream can be used to improve translation accuracy. Finally, audience members at the venue use their listening devices to select and receive the language they need from the audio/video cloud server. The method and device use the audio/video cloud server as the endpoint for the different languages, which meets the demand for remote simultaneous interpretation while letting the live audience receive the required target language promptly.

Description

Method and device for remote simultaneous interpretation based on audio/video communication

Technical field

The invention belongs to the field of remote simultaneous interpretation, and specifically to remote simultaneous interpretation based on audio/video communication.

Background technology

Traditional simultaneous interpretation requires the interpreter to sit in an interpretation booth at the venue and provide real-time translation through dedicated simultaneous interpretation equipment. The interpreter must be on site to provide the service. This greatly limits the flexibility of the translation service; in particular, when several interpreters are needed and they are located in different places, it both reduces the efficiency of the meeting and increases the cost of the translation service.

Existing remote simultaneous interpretation devices, such as CN201156746Y, provide a system that connects interpreters to the meeting site over the broadband internet. In this mode, if the audience at the meeting includes speakers of different languages, each interpreter must translate from the speaker's language into the language of a given audience group. For example, if the speaker speaks French and the audience includes Chinese, German, and English listeners, the corresponding interpreters must be French-Chinese, French-German, and French-English interpreters. This places higher demands on the interpreters and costs more. As another example, CN104427294A provides a cloud-server-based simultaneous interpretation method and device supporting video conferencing, in which the cloud server converts the captured audio data into text data and then generates and outputs audio data in the other required languages from that text data. This audio-to-text-to-audio conversion is overly cumbersome.

The present invention replaces traditional dedicated simultaneous interpretation equipment with a remote simultaneous interpretation audio/video cloud server. Interpreters need not travel to the meeting site; they provide simultaneous interpretation remotely over the internet, and multilingual translation needs are met.

Summary of the invention

The purpose of the present invention is to provide a method and device for remote simultaneous interpretation based on an audio/video cloud server. In use, it relies on conventional audio and video capture equipment, the interpreters' working equipment, and the audience's listening equipment, and it uses live rooms set up in the audio/video cloud server to carry data between the venue side and the interpreter side and between one interpreter side and another, thereby meeting multilingual translation needs.

The present invention also introduces the concept of a voice live room into the simultaneous interpretation scenario. The voice of the on-site speaker is stored in the cloud server and defined as the original-audio live room, by analogy with the "original channel" of conventional on-site simultaneous interpretation equipment. An interpreter whose source language is the same as the original audio selects the original-audio live room as "input", while an interpreter whose source language differs from the original audio selects the live room of their own source language as "input". This approach can significantly reduce the cost of finding suitable interpreters.

In addition to transmitting the venue speaker's audio to the interpreter remotely in real time, the present invention also transmits the speaker's video to the interpreter. The interpreter can observe the speaker's gestures and expressions in real time, which improves translation quality and efficiency.

Other features and advantages of the present invention will become apparent from the following detailed description, or may be learned in part through practice of the invention.

According to a first aspect of the present invention, a method for remote simultaneous interpretation based on audio/video communication is provided, characterized in that the method comprises:

Step 1: the venue speaker's audio stream and video stream are captured by a microphone and a camera respectively and transmitted to the venue PC system;

Step 2: the venue PC system transmits the audio stream and video stream one way over the network to the audio/video cloud server. In the cloud server the audio stream forms the original-audio live room, that is, the audio/video in the speaker's own language is automatically stored in the original-audio live room; the video stream is stored in the cloud server as a public video stream that interpreters at the interpreter side can retrieve;

Step 3: an interpreter at the interpreter side selects the original-audio live room as input, translates its audio stream into a first-language audio stream, and outputs that stream to the audio/video cloud server for storage, forming the first live room;

Step 4: an interpreter at the interpreter side selects the first live room as input, translates the first-language audio stream in it into a second-language audio stream, and outputs that stream to the audio/video cloud server for storage, forming the second live room;

Step 5: audience members at the venue select, as needed, the audio stream of the original-audio live room, the first live room, or the second live room to listen to.
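The five steps can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class names `LiveRoom` and `CloudServer`, the room keys `"original"`, `"first"`, and `"second"`, and the `translate` callback (which stands in for a human interpreter) are all hypothetical, and audio chunks are represented as plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class LiveRoom:
    """One-way audio room holding the stream for a single language."""
    language: str
    chunks: list = field(default_factory=list)


class CloudServer:
    """Hypothetical in-memory stand-in for the audio/video cloud server."""

    def __init__(self, source_language):
        # Step 2: the speaker's audio forms the original-audio live room.
        self.rooms = {"original": LiveRoom(source_language)}

    def publish(self, room_name, language, chunk):
        # Create the room on first use, then append the chunk to it.
        room = self.rooms.setdefault(room_name, LiveRoom(language))
        room.chunks.append(chunk)

    def read(self, room_name):
        # Return a copy so readers cannot modify the stored stream.
        return list(self.rooms[room_name].chunks)


def interpret(server, input_room, output_room, target_language, translate):
    """Steps 3 and 4: read one live room, write the translation to another."""
    for chunk in server.read(input_room):
        server.publish(output_room, target_language, translate(chunk))


server = CloudServer("fr")
server.publish("original", "fr", "bonjour")                          # steps 1-2
interpret(server, "original", "first", "zh", lambda c: f"zh({c})")   # step 3
interpret(server, "first", "second", "en", lambda c: f"en({c})")     # step 4
print(server.read("second"))  # step 5: prints ['en(zh(bonjour))']
```

The second interpreter never touches the original room, which is the relay property the method relies on.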

In other embodiments of the present invention, based on the foregoing scheme, each live room stores both an audio stream and a video stream, available for retrieval by interpreters who translate with that audio stream's language as their input language.

In some embodiments of the present invention, based on the foregoing scheme, the method further comprises, after step 4, an interpreter at the interpreter side retrieving the audio stream from the original-audio live room, the first live room, or the second live room and translating it into a third language different from the original, first, and second languages; the result is transmitted to the cloud server and stored, forming a third live room.

In some embodiments of the present invention, an audience member at the venue may choose not to listen through the cloud server and instead listen directly to the speaker in the venue. In that case the audience member is positioned so that the speaker can be heard, and the audience member's listening language is the same as the speaker's language, or the audience member can understand the speaker's language.

According to a second aspect of the present invention, a device for remote simultaneous interpretation based on audio/video communication is provided, characterized in that the device comprises:

an audio capture microphone for capturing the speaker's audio stream at the venue; the microphones may be arranged in an array so that the speaker's audio is captured clearly and accurately;

a video capture camera for capturing the speaker's video stream at the venue; the cameras may also be arranged in an array to capture the speaker's video stream from different angles;

a venue PC for storing the audio stream and video stream;

an audio/video cloud server in which the original-audio live room, the first live room, and the second live room are formed;

an interpreter-side device through which the interpreter retrieves the input audio stream and input video stream from the audio/video cloud server and, after processing, outputs the output audio stream and output video stream to the audio/video cloud server;

a venue audience listening device through which an audience member retrieves the required audio stream from a live room in the audio/video cloud server.

The venue PC and the interpreter side are connected to the audio/video cloud server over the network, and the venue audience listening device is connected to the audio/video cloud server over the network.

In some embodiments of the present invention, an interpreter retrieves the audio stream in the original-audio live room from the cloud server, translates it into a first language different from the original language, and outputs it to the cloud server, where it is stored in the first live room. Another interpreter can retrieve the first-language audio stream in the first live room from the cloud server, translate it into a second language different from both the original language and the first language, and output it to the cloud server, where it is stored in the second live room.

In other embodiments of the present invention, each live room in the audio/video cloud server stores both the audio stream of the corresponding language and the video stream.

In other embodiments of the present invention, each live room in the audio/video cloud server stores only the audio stream of the corresponding language, and the video stream is stored in a separate storage area of the cloud server.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present invention.

Description of the drawings

The accompanying drawings are incorporated into and form part of this specification; they show embodiments consistent with the present invention and, together with the specification, serve to explain its principles. The drawings described below are evidently only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

Fig. 1 shows a remote simultaneous interpretation workflow based on an audio/video cloud server;

Fig. 2 schematically shows the working principle of remote simultaneous interpretation according to an embodiment of the present invention;

Fig. 3 schematically shows the working system for remote simultaneous interpretation according to an embodiment of the present invention.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth here; rather, these embodiments are provided so that the present invention will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the present invention. Those skilled in the art will appreciate, however, that the technical solution of the present invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other cases, well-known methods, devices, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of the present invention.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are merely illustrative; they need not include every item and operation/step, nor must the steps be executed in the order described. For example, some operations/steps may be decomposed and others may be merged or partly merged, so the actual order of execution may change according to the actual situation.

Fig. 1 shows a method for remote simultaneous interpretation based on audio/video communication. A microphone array and a camera array capture the venue speaker's audio stream and video stream, which are input to the venue PC system for storage. The venue PC system transmits the audio stream and video stream one way over the network to the audio/video cloud server (the server side); the speaker's audio stream captured by the microphone array forms the original-audio live room in the cloud server, and the video stream forms a public video stream in the cloud server. An interpreter at the interpreter side uses the interpreter-side PC software to select the original-audio live room as input, translates the original audio stream in that room into a first-language audio stream, and outputs it to the audio/video cloud server, forming the first live room. Another interpreter at the interpreter side selects the first live room as input, translates the first-language audio stream in it into a second-language audio stream, and outputs it to the audio/video cloud server, forming the second live room. Throughout this process, interpreters can retrieve the video stream data stored in the cloud server at any time, which lets them observe the speaker's gestures and expressions and so provide a better translation service. Audience members at the venue use the audience app or other listening equipment to select, as needed, the audio stream of the original-audio live room, the first live room, or the second live room. Of course, if the speaker's sound is clear in the venue and an audience member understands the speaker's language, that audience member can listen to the speaker directly rather than through a device connected to the cloud server.

In this scheme, before the meeting starts, the meeting host creates an online meeting room; each meeting corresponds to one online meeting room. An online meeting room contains one video channel and multiple one-way voice live rooms. The video channel carries the speaker's video stream captured by the camera array; the venue host can use the PC software to point the camera at the current speaker, and the speaker's video image is transmitted to the interpreters through the remote simultaneous interpretation audio/video software system. Each language used in the meeting corresponds to one one-way voice live room, with the speaker's audio stream assigned by default to the original-audio live room. In the interpreter-side software, each interpreter selects the corresponding input live room and output live room according to their own source and target languages.
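A minimal sketch of this setup step follows, assuming a plain dictionary is enough to represent an online meeting room. The function names `create_meeting_room` and `select_rooms`, the field names, and the language codes are hypothetical and are not taken from the actual software.

```python
def create_meeting_room(speaker_language, languages):
    """Build a hypothetical meeting-room record: one video channel plus one
    one-way voice live room per meeting language."""
    if speaker_language not in languages:
        raise ValueError("speaker language must be one of the meeting languages")
    rooms = {}
    for lang in languages:
        # The room in the speaker's language is the original-audio live room.
        rooms[lang] = {"language": lang, "is_original": lang == speaker_language}
    return {"video": "speaker-video", "audio_rooms": rooms}


def select_rooms(meeting, source_language, target_language):
    """Return (input_room, output_room) for an interpreter's language pair."""
    rooms = meeting["audio_rooms"]
    if source_language == target_language:
        raise ValueError("source and target languages must differ")
    if source_language not in rooms or target_language not in rooms:
        raise KeyError("no live room for one of the requested languages")
    return rooms[source_language], rooms[target_language]


meeting = create_meeting_room("fr", ["fr", "zh", "en"])
# A French-Chinese interpreter reads the original room and writes to Chinese.
input_room, output_room = select_rooms(meeting, "fr", "zh")
```

Keeping exactly one room per language means an interpreter only has to state a source and a target language; the choice of rooms follows from that pair.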

In the foregoing scheme, the video stream can instead be packaged with an audio stream and stored in the different live rooms: for example, the speaker's audio stream is packaged with the speaker's video stream and stored in the original-audio live room, the first-language audio stream is packaged with the speaker's video stream and stored in the first live room, and so on. An interpreter at the interpreter side then retrieves the audio stream and video stream together from a live room of the cloud server. Audience members at the venue can choose different devices: a device with a display and earphones retrieves both the audio stream and the video stream from a live room of the cloud server, whereas a device with a listening function only retrieves just the audio stream.
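The two storage layouts (video kept separately versus video packaged into each live room) can be compared with a small sketch. The dictionary layout, the key names `rooms`, `audio`, `video`, and `shared_video`, and the `fetch` helper are assumptions made for illustration only.

```python
def fetch(server, room, want_video):
    """Return (audio, video) for a live room under either storage layout.

    Audio-only devices pass want_video=False and receive None for video.
    """
    entry = server["rooms"][room]
    audio = entry["audio"]
    if not want_video:
        return audio, None
    # Packaged layout: the video sits inside the room entry.
    # Separate layout: fall back to the server-wide video store.
    video = entry.get("video", server.get("shared_video"))
    return audio, video


# Layout 1: audio-only rooms, video stored in a separate area.
separate = {
    "shared_video": "speaker-video",
    "rooms": {"zh": {"audio": "zh-audio"}},
}
# Layout 2: each room packages its audio with the speaker's video.
packaged = {
    "rooms": {"zh": {"audio": "zh-audio", "video": "speaker-video"}},
}

print(fetch(separate, "zh", want_video=True))   # ('zh-audio', 'speaker-video')
print(fetch(packaged, "zh", want_video=False))  # ('zh-audio', None)
```

From the client's point of view the two layouts are interchangeable; the difference lies in whether the speaker's video is stored once or once per room.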

Fig. 2 shows the working principle of remote simultaneous interpretation using three languages (Chinese, English, and French) as an example; those skilled in the art will understand that the scheme extends in the same way to other languages or to more languages.

The venue speaker (French) inputs the audio stream to the venue PC system through the microphone array, and the video stream is input to the PC through the camera array. Over the Internet, the PC transmits the audio stream and video stream one way to the audio/video cloud server, where the audio stream is stored in the original-audio live room and the video stream is stored in a separate location. An interpreter whose source language is the original language (French) uses the interpreter-side software to choose the original-audio live room as the input live room and the Chinese live room as the output live room. The interpreter obtains the French audio stream from the original-audio live room and the video stream from the cloud server, uses both to translate the French into Chinese (the target language), and outputs the result to the Chinese live room for storage, completing the translation from French to Chinese.

A Chinese-English interpreter then selects the Chinese live room as the input live room and the English live room as the output live room.

It follows that if a meeting is held in China, a French speaker's speech only needs a French-Chinese interpreter to translate it into Chinese. Interpreters for other target languages, such as Chinese-English, Chinese-German, Chinese-Portuguese, and Chinese-Spanish interpreters, can take the Chinese live room as input and output their target language, so there is no need to find French-English, French-German, or French-Portuguese interpreters.

Audience members at the venue use their listening devices and select the corresponding live room to hear the corresponding language: Chinese listeners select the Chinese live room and English listeners select the English live room. French listeners can listen to the speaker's original voice directly on site, and those seated far from the speaker can also select the original-audio live room.
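The staffing argument can be made concrete by listing the interpreter language pairs each approach requires. The sketch below is illustrative only; `direct_pairs`, `relay_pairs`, and the language codes are hypothetical names, and the "pivot" is the local language whose live room the other interpreters take as input.

```python
def direct_pairs(speaker, audience):
    """Pairs needed when every interpreter works from the speaker's language."""
    return [(speaker, target) for target in audience if target != speaker]


def relay_pairs(speaker, pivot, audience):
    """Pairs needed when interpreters relay through the pivot-language room."""
    pairs = []
    if pivot != speaker:
        # One interpreter feeds the pivot room from the original-audio room.
        pairs.append((speaker, pivot))
    # Everyone else takes the pivot room as input.
    pairs += [(pivot, t) for t in audience if t not in (speaker, pivot)]
    return pairs


audience = ["zh", "en", "de"]
print(direct_pairs("fr", audience))
# [('fr', 'zh'), ('fr', 'en'), ('fr', 'de')]
print(relay_pairs("fr", "zh", audience))
# [('fr', 'zh'), ('zh', 'en'), ('zh', 'de')]
```

The number of interpreters is the same in both cases; what changes is that only one of them has to work from French, while the rest work from the local language.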

Fig. 3 shows a device for remote simultaneous interpretation based on audio/video communication, characterized in that the device comprises:

an audio capture device for capturing the speaker's audio stream at the venue; the audio capture device may be microphones arranged in an array so that the speaker's audio is captured clearly and accurately;

a video capture device for capturing the speaker's video stream at the venue; the video capture device may be cameras arranged in an array to capture the speaker's video stream from different angles;

an audio/video cloud server in which the original-audio live room, the first live room, and the second live room are formed, wherein the original-audio live room stores the audio stream of the speaker's speech, the first live room stores the first-language audio stream, and the second live room stores the second-language audio stream;

an interpreter-side device through which the interpreter retrieves the input audio stream and input video stream from the corresponding live room in the audio/video cloud server and, after processing, outputs the output audio stream and output video stream to the audio/video cloud server, where they are stored in the corresponding live room;

a venue audience listening device through which an audience member retrieves the required audio stream from the corresponding live room in the audio/video cloud server.

The venue PC and the interpreter side are connected to the audio/video cloud server over the network, and the venue audience listening device is connected to the audio/video cloud server over the network.

According to a specific embodiment, the speaker speaks French. The audio capture device captures the French audio stream as the original audio and transmits it to the venue PC, and the video capture device captures the speaker's video and transmits it to the venue PC. The audio capture device may be a microphone, or preferably a microphone array; the video capture device may be a camera, or preferably a camera array.

The venue PC transmits the original audio and the video over the network to the audio/video cloud server, where the original audio stream is stored in the original-audio live room (French) and the video stream is stored in the video storage area.

At the interpreter side, the French-Chinese interpreter uses the French-Chinese interpreter terminal to retrieve the French audio stream from the original-audio live room of the audio/video cloud server and the speaker's video stream from the video storage area as input, processes them, and transmits the translated Chinese audio stream as output to the Chinese live room in the audio/video cloud server. The Chinese-English interpreter at the interpreter side uses the Chinese-English interpreter terminal to retrieve, from the Chinese live room of the audio/video cloud server, the Chinese audio stream that the French-Chinese interpreter output to the server, together with the speaker's video stream stored in the video storage area of the cloud server, processes them, and transmits the translated English audio stream as output to the English live room of the audio/video cloud server.

Audience members at the venue use the corresponding listening device. For example, a Chinese listener uses a Chinese listening device to retrieve the Chinese audio stream from the Chinese live room in the audio/video cloud server, and an English listener uses an English listening device to retrieve the English audio stream from the English live room. A French listener can either listen directly to the speaker's live speech or use an original-audio or French listening device to retrieve the audio stream from the original-audio (French) live room in the audio/video cloud server.
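The listener's choice in this embodiment reduces to a small decision rule, sketched below under stated assumptions: `choose_audio_source` is a hypothetical helper, `rooms` is the set of languages that have a translated live room, and the return values `"venue"` and `"original"` are illustrative labels for listening on site and for the original-audio live room.

```python
def choose_audio_source(listener_language, speaker_language, rooms, can_hear_speaker):
    """Pick where a listener gets audio: the venue itself or a live room."""
    if listener_language == speaker_language:
        # Same language as the speaker: listen on site if the speaker is
        # audible, otherwise fall back to the original-audio live room.
        return "venue" if can_hear_speaker else "original"
    if listener_language in rooms:
        return listener_language
    raise LookupError(f"no live room for language {listener_language!r}")


rooms = {"zh", "en"}  # translated live rooms available in this embodiment
print(choose_audio_source("zh", "fr", rooms, can_hear_speaker=True))   # zh
print(choose_audio_source("fr", "fr", rooms, can_hear_speaker=True))   # venue
print(choose_audio_source("fr", "fr", rooms, can_hear_speaker=False))  # original
```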

In particular, each live room in the audio/video cloud server can store both the corresponding audio stream and the speaker's video stream, in which case an interpreter at the interpreter side retrieves the input audio stream and video stream together directly from a live room of the cloud server.

Audience members at the venue can choose different listening devices. For example, a listening device with a display screen retrieves both the required audio stream and the video stream from the audio/video cloud server.

It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more of the modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.

After considering the specification and practicing the invention disclosed here, those skilled in the art will readily conceive of other embodiments of the present invention. This application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only; the true scope and spirit of the present invention are indicated by the claims.

It should be understood that the present invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims

1. A method for remote simultaneous interpretation based on audio/video communication, characterized in that the method comprises:

Step 1: the venue speaker's audio stream is input to the venue PC system through an audio capture device, and the venue speaker's video stream is input to the venue PC system through a video capture device;

Step 2: the venue PC system transmits the audio stream and video stream one way over the network to an audio/video cloud server, where the audio stream forms an original-audio live room and the video stream forms a public video stream;

Step 3: an interpreter at the interpreter side selects the original-audio live room as input, translates the audio stream in it into a first-language audio stream, and outputs it to the audio/video cloud server, where it is stored in a first live room;

Step 4: an interpreter at the interpreter side selects the first live room as input, translates the first-language audio stream in it into a second-language audio stream, and outputs it to the audio/video cloud server, where it is stored in a second live room;

Step 5: audience members at the venue select, as needed, the audio stream of the original-audio live room, the first live room, or the second live room to listen to.

2. The method according to claim 1, characterized in that in step 2 the audio stream and the video stream are packaged and stored in the original-audio live room in the audio/video cloud server; in step 3 the first-language audio stream and the video stream are packaged and stored in the first live room in the audio/video cloud server; and in step 4 the second-language audio stream and the video stream are packaged and stored in the second live room in the audio/video cloud server.

3. The method according to claim 1 or 2, characterized in that the audio capture device is a microphone or a microphone array, and the video capture device is a camera or a camera array.

4. The method according to any one of claims 1-3, characterized in that it further comprises, after step 4, forming a third live room that stores a third language different from the first language and the second language.

5. The method according to any one of claims 1-4, characterized in that in step 5 an audience member may choose to listen directly to the venue speaker's speech without listening through the audio/video cloud server.

6. The method according to any one of claims 1-4, wherein audience members at the venue listen only to the audio stream of a live room in the cloud server.

7. A device for remote simultaneous interpretation based on audio/video communication, characterized in that the device comprises:

an audio capture device for capturing the speaker's audio stream at the venue;

a video capture device for capturing the speaker's video stream at the venue;

an audio/video cloud server in which an original-audio live room, a first live room, and a second live room are formed;

an interpreter-side device through which the interpreter retrieves an input audio stream and an input video stream from the audio/video cloud server and, after processing, outputs an output audio stream and an output video stream to the audio/video cloud server;

a venue audience listening device through which an audience member retrieves the required audio stream from a live room in the audio/video cloud server.

8. The device according to claim 7, characterized in that the original-audio live room, the first live room, and the second live room each contain the audio stream of the corresponding language and the speaker's video stream.

9. The device according to claim 7, characterized in that the original-audio live room, the first live room, and the second live room contain only the audio streams of the corresponding languages, and the video stream is stored separately in the audio/video cloud server.

10. The device according to any one of claims 7-9, wherein the audio/video cloud server is connected over the network to the interpreter side, the venue PC, and the venue audience listening device for data transmission.