CN113721704A - Simultaneous interpretation system of video stream and implementation method thereof - Google Patents

Simultaneous interpretation system of video stream and implementation method thereof

Info

Publication number
CN113721704A
CN113721704A (application CN202111001773.8A)
Authority
CN
China
Prior art keywords
audio
video
frame
buffer
translator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111001773.8A
Other languages
Chinese (zh)
Inventor
杨丰羽
刘盾
唐学怡
杨皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chinamcloud Technology Co ltd
Original Assignee
Chengdu Chinamcloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chinamcloud Technology Co ltd filed Critical Chengdu Chinamcloud Technology Co ltd
Priority to CN202111001773.8A
Publication of CN113721704A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/04Generating or distributing clock signals or signals derived directly therefrom
    • G06F1/12Synchronisation of different clock signals provided by a plurality of clock generators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/74Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a simultaneous interpretation system for video streams and a method of implementing it. The system comprises a decapsulator component for decapsulating the input video stream and splitting it into a pure video stream composed of a series of video frames arranged in time order and a pure audio stream composed of a series of audio frames arranged in time order; an audio processing component for decoding, translating, speed-adjusting and re-encoding the input audio frames of language A; an audio/video buffer component for buffering the video frames and audio frames separately and feeding the buffered durations back to the audio processing component in real time; and a clock-driven encapsulator component for taking audio and video of the required duration out of the audio/video buffer component, encapsulating them, and outputting the result. The invention adjusts the speech-rate parameter of the audio translator according to the sign of the quantity computed from the buffered video frame duration, the buffered audio frame duration, and the translation delay of the audio translator, so that the audio and video of the finally output video stream data remain synchronized.

Description

Simultaneous interpretation system of video stream and implementation method thereof
Technical Field
The invention relates to the technical field of language identification processing, in particular to a simultaneous interpretation system of video streams and an implementation method thereof.
Background
Thanks to advances in speech recognition, machine translation, and speech synthesis, it has become possible to replace human simultaneous interpretation with machines. Similar systems already exist on the market: some take an audio stream as input (for example, a speech in Chinese) and output the speech-recognition result (Chinese text) together with the corresponding translation (English text) in real time; others take audio and video as input and output the translated audio stream with a delay. However, because the translated audio stream can differ greatly in length from the original audio stream, a naive substitution leaves the video and audio completely out of synchronization, which limits the practical value of the output video stream. How to solve this problem is the question that must be addressed at the present stage.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a simultaneous interpretation system for video streams and a method of implementing it, thereby solving the prior-art problem that audio and video fall out of synchronization once the speech in a video has been translated.
The purpose of the invention is achieved by the following technical scheme: a simultaneous interpretation system for video streams comprises a decapsulator component, an audio processing component, an audio/video buffer component, and a clock-driven encapsulator component;
the decapsulator assembly: the system comprises a decoder, a decoder and a decoder, wherein the decoder is used for de-encapsulating an input video stream and splitting the input video stream into a pure video stream composed of a series of video frames arranged according to a time sequence and a pure audio stream composed of a series of audio frames arranged according to a time sequence;
the audio processing component: the system comprises a voice frequency decoder, a voice speed parameter adjusting module and a voice frequency processing module, wherein the voice frequency decoder is used for decoding an input audio frequency frame of A language to obtain audio frequency baseband data, translating the audio frequency baseband data and adjusting the voice speed parameter to obtain audio frequency baseband data of B language capable of controlling the voice speed of the output audio frequency, and encoding the audio frequency baseband data into an audio frequency frame;
the audio and video buffer component: the system is used for respectively caching the video frame and the audio frame, regularly checking the duration of the video frame and the duration of the audio frame and feeding back the duration to the audio processing component;
the clock driven wrapper component: the audio/video buffer component is used for taking out the video frames and the audio frames with required time length from the audio/video buffer component under the driving of the internal clock, and packaging and outputting the video frames and the audio frames.
The audio processing component comprises an audio decoder, an audio translator, and an audio encoder. The audio decoder decodes the input audio frames of language A to obtain audio baseband data. The audio translator translates the audio baseband data produced by the audio decoder to obtain audio baseband data in language B, computes the speech rate from the video and audio durations fed back by the audio/video buffer component, and synthesizes the speech. The audio encoder encodes the audio baseband data of the speech synthesized by the audio translator into audio frames.
The audio/video buffer component comprises an audio buffer and a video buffer. The audio buffer buffers the audio frames output by the audio encoder, periodically checks their total duration, and feeds it back to the audio translator. The video buffer buffers the video frames split off by the decapsulator component, periodically checks their total duration, and feeds it back to the audio translator.
Computing the speech rate from the video and audio durations fed back by the audio/video buffer component comprises the following: after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator. If the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster; if the result is positive, the speech-rate parameter value is decreased so that the output speech rate becomes slower.
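As an illustration, the sign check described above can be sketched as a small feedback function. This is only a schematic sketch of the rule, not the patented implementation; the function name, the adjustment step, and the clamping bounds are assumptions introduced for the example.

```python
# Sketch of the speech-rate feedback rule (names and constants are assumed).

RATE_STEP = 0.05               # assumed adjustment step per feedback cycle
RATE_MIN, RATE_MAX = 0.5, 2.0  # assumed clamping bounds for the rate parameter


def adjust_speech_rate(rate, video_buf_sec, audio_buf_sec, translator_delay_sec):
    """Return an updated speech-rate parameter from buffer feedback.

    result = buffered video duration - buffered audio duration - translator delay
    negative -> the translated audio is running long relative to video: speak faster
    positive -> the translated audio is running short: speak slower
    """
    result = video_buf_sec - audio_buf_sec - translator_delay_sec
    if result < 0:
        rate += RATE_STEP
    elif result > 0:
        rate -= RATE_STEP
    return max(RATE_MIN, min(RATE_MAX, rate))
```

For example, with 10 s of buffered video, 12 s of buffered audio, and a 0.5 s translator delay, the result is negative and the rate parameter is raised.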
The clock-driven encapsulator component has an internal clock that generates a trigger signal at regular intervals. Upon receiving a trigger signal, if the audio buffer is empty, the clock-driven encapsulator component takes the video frames from the video buffer as usual and generates silent audio frames for output.
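The silence-substitution behavior can be sketched as follows. The PCM format (16-bit mono at 48 kHz) and the function names are assumptions made for this illustration only, not details taken from the patent.

```python
# Sketch: substitute a silent audio frame when the audio buffer is empty
# at a clock tick (sample format and names are assumed, not from the patent).

SAMPLE_RATE = 48_000          # assumed sample rate in Hz
SILENT_SAMPLE = b"\x00\x00"   # one 16-bit mono PCM sample of silence


def make_silent_frame(duration_ms):
    """Return duration_ms of silence as raw 16-bit mono PCM bytes."""
    n_samples = SAMPLE_RATE * duration_ms // 1000
    return SILENT_SAMPLE * n_samples


def tick_output(video_frame, audio_buffer, frame_ms=40):
    """On a trigger: take the video frame normally; if the audio buffer
    is empty, generate a silent frame of matching duration instead."""
    audio = audio_buffer.pop(0) if audio_buffer else make_silent_frame(frame_ms)
    return video_frame, audio
```

This keeps the muxer producing synchronized output even through stretches where the translator has nothing to say.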
An implementation method of a simultaneous interpretation system of video streams, the implementation method comprising:
the decapsulator component decapsulates the acquired video stream data into audio coded frames and video coded frames, feeds the audio coded frames to the audio decoder, and feeds the video coded frames to the video buffer;
the audio decoder decodes the audio coded frames it receives into audio baseband data and feeds that data to the audio translator;
the audio translator translates the acquired audio baseband data, computes the speech rate from the audio and video durations fed back by the audio buffer and the video buffer, synthesizes audio baseband data, and feeds it to the audio encoder to be encoded into audio frames, which are buffered in the audio buffer;
when the clock-driven encapsulator component receives a trigger signal, it reads video frames and audio frames of a certain duration from the audio buffer and the video buffer, encapsulates them, and finally outputs video stream data whose audio and video are synchronized.
Computing the speech rate from the audio and video durations fed back by the audio buffer and the video buffer comprises the following:
after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator;
if the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster;
if the result is positive, the speech-rate parameter value in the audio translator is decreased so that the output speech rate becomes slower.
The implementation method further comprises: after the clock-driven encapsulator component receives a trigger signal, if the audio buffer is empty, it takes the video frames from the video buffer as usual and generates silent audio frames for output.
The invention has the following advantages: a simultaneous interpretation system for video streams and a method of implementing it are provided, which adjust the speech-rate parameter of the audio translator according to the sign of the quantity computed from the buffered video frame duration, the buffered audio frame duration, and the translation delay of the audio translator, thereby keeping the audio and video of the finally output video stream data synchronized.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic flow diagram of the operation of the decapsulator;
FIG. 3 is a flow chart illustrating the operation of an audio decoder;
FIG. 4 is a schematic flow diagram of the operation of the audio translator;
FIG. 5 is a flow chart illustrating the operation of an audio encoder;
FIG. 6 is a schematic flow chart of the audio/video buffer operation;
FIG. 7 is a flow chart illustrating the operation of the clock-driven encapsulator.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, one embodiment of the present invention relates to a simultaneous interpretation system for video streams, which includes a decapsulator component, an audio processing component, an audio/video buffer component, and a clock-driven encapsulator component;
the decapsulator: used for decapsulating the input video stream and splitting it into a pure video stream composed of a series of video frames arranged in time order and a pure audio stream composed of a series of audio frames arranged in time order;
the audio processing component: used for decoding an input audio frame of language A to obtain audio baseband data, translating the audio baseband data and adjusting the speech-rate parameter to obtain language-B audio baseband data whose output speech rate can be controlled, and encoding that data into audio frames;
the audio/video buffer: used for buffering the video frames and audio frames separately, periodically checking their durations, and feeding those durations back to the audio processing component;
the clock-driven encapsulator: used for taking video frames and audio frames of the required duration out of the audio/video buffer component under the drive of its internal clock, and encapsulating and outputting them.
The audio processing component comprises an audio decoder, an audio translator, and an audio encoder. The audio decoder decodes the input audio frames of language A to obtain audio baseband data. The audio translator translates the audio baseband data produced by the audio decoder to obtain audio baseband data in language B, computes the speech rate from the video and audio durations fed back by the audio/video buffer component, and synthesizes the speech. The audio encoder encodes the audio baseband data of the speech synthesized by the audio translator into audio frames.
Suppose the input is Chinese and the output is English, and suppose the input audio is "关关雎鸠，在河之洲。窈" (the first line of a classical poem, followed by "窈", the first syllable of the next line "窈窕淑女", "Fair lady") and lasts 2.2 seconds; then 2.2 seconds of audio baseband data serve as the input to the audio translator. Note that "Fair lady" has not yet been spoken in full: only its first syllable has arrived. If that syllable lasts 0.2 seconds, then the preceding complete sentence "关关雎鸠，在河之洲。" lasts 2 seconds.
The audio translator therefore outputs English audio baseband data for the complete sentence only: "Guan! Guan! Cry the fish hawks, on an islet in the river." This sentence is longer in English (its speech duration exceeds that of the Chinese); assume its speech duration is 4 seconds.
Until further Chinese speech is input, the 0.2 seconds of speech for the syllable "窈" is held inside the audio translator and is not represented in the output.
Thus although 2.2 seconds of Chinese audio baseband data are input to the audio translator, only 2 seconds of Chinese are actually translated, yielding 4 seconds of English audio baseband data as output; 0.2 seconds of Chinese audio remain inside the audio translator with the corresponding English not yet output. This 0.2-second span can be regarded as the delay of the audio translator.
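The delay bookkeeping in this example reduces to simple arithmetic; a minimal check using the figures from the example above:

```python
# Delay bookkeeping for the example above (all durations in seconds).

input_cn = 2.2        # Chinese audio fed into the audio translator
translated_cn = 2.0   # the complete sentence actually translated
output_en = 4.0       # English speech synthesized for that sentence

# Audio still held inside the translator, i.e. its current delay:
translator_delay = input_cn - translated_cn
print(round(translator_delay, 1), output_en)  # 0.2 4.0
```

The 0.2 seconds of held audio is exactly the "translation delay" term that later enters the speech-rate feedback formula.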
Besides this delay, the audio translator also has a speech-rate parameter; adjusting this parameter controls the speech rate of the audio the translator outputs.
Further, the audio/video buffer comprises an audio buffer and a video buffer. The audio buffer buffers the audio frames output by the audio encoder, periodically checks their duration, and feeds it back to the audio translator. The video buffer buffers the video frames split off by the decapsulator component and feeds their duration back to the audio translator at regular intervals (for example, every 0.2 s).
Computing the speech rate from the video and audio durations fed back by the audio/video buffer comprises the following: after receiving the real-time audio and video frame durations fed back by the audio/video buffer, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator. If the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster; if the result is positive, the speech-rate parameter value is decreased so that the output speech rate becomes slower.
When the clock-driven encapsulator component receives a trigger signal and the audio buffer is empty, it takes the video frames from the video buffer as usual and generates silent audio frames for output.
Each time it is triggered, the clock-driven encapsulator takes about 40 ms of video frames and about 40 ms of audio frames. Because the duration of each audio frame segment in the audio buffer is not necessarily exactly 40 ms (some audio frames may be 43 ms, others 41 ms), the duration taken on each fetch may be slightly more or less than 40 ms; 40 ms is the average. Assume also that the startup delay of the clock-driven encapsulator is set to 10 seconds.
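The frame-fetch behavior can be sketched as follows: whole audio frames are popped until at least the requested duration is covered, and the surplus or shortfall carries into the next tick, so fetches average out to about 40 ms. The queue model and names are assumptions for illustration.

```python
from collections import deque


def fetch_audio(audio_q, want_ms):
    """Pop whole audio frames (modeled here by their duration in ms)
    until their total duration covers want_ms; return (frames, total).
    The over- or undershoot is carried into the next tick, so the
    amount fetched averages out to want_ms over time."""
    taken, got_ms = [], 0.0
    while audio_q and got_ms < want_ms:
        taken.append(audio_q.popleft())
        got_ms += taken[-1]
    return taken, got_ms
```

For instance, with buffered frames of 43 ms, 41 ms, and 38 ms, a 40 ms request takes only the first frame (43 ms) and the 3 ms surplus is deducted from the next request.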
Specifically, in the acquired real-time video stream, 10 seconds of video and audio are received in the first 10 seconds. Because the encapsulator's delay is 10 seconds, it produces no output during this period, so 10 seconds of video frames accumulate in the video buffer. Suppose the audio buffer holds 12 seconds of audio frame data: with an audio-translator delay of 0.5 seconds, 9.5 of the 10 seconds of Chinese audio have been translated into 12 seconds of English speech and output to the audio buffer. The video buffer now feeds back a buffered video length of 10 seconds, and the audio buffer feeds back a buffered audio length of 12 seconds. On receiving this feedback, the audio translator computes "video frame duration in the video buffer minus audio frame duration in the audio buffer minus audio translator delay", i.e. 10 s − 12 s − 0.5 s = −2.5 s. The result is negative, so the audio translator increases the speech-rate parameter value to speed up the subsequently output English speech.
Between seconds 10 and 15, a further 5 seconds of video and audio are received. Because the English output was sped up as described above, assume the audio translator outputs only 2 seconds of English speech, its delay remaining 0.5 seconds. The clock-driven encapsulator is now running and has taken 5 seconds of video frames and 5 seconds of audio frame data out of the buffers for output. The video buffer therefore holds 10 s + 5 s − 5 s = 10 s of video frames and the audio buffer 12 s + 2 s − 5 s = 9 s of audio frames. The buffers feed back these lengths (video 10 s, audio 9 s), and the audio translator computes "video frame duration in the video buffer minus audio frame duration in the audio buffer minus audio translator delay", i.e. 10 s − 9 s − 0.5 s = 0.5 s. The result is positive, so the audio translator decreases the speech-rate parameter value to slow down the subsequently output English speech.
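As a check, the two feedback evaluations in the walkthrough reduce to the same formula (buffered video duration minus buffered audio duration minus translator delay); all figures below are taken from the example above:

```python
# Buffer arithmetic for the walkthrough above (durations in seconds).

DELAY = 0.5  # audio translator delay in both intervals

# First 10 seconds: 10 s of video buffered, 12 s of English audio produced.
r1 = 10 - 12 - DELAY                  # negative -> speed up the English output

# Seconds 10 to 15: +5 s of video and +2 s of audio arrive,
# while 5 s of each is taken out by the clock-driven encapsulator.
video_buf = 10 + 5 - 5                # 10 s of video frames remain
audio_buf = 12 + 2 - 5                # 9 s of audio frames remain
r2 = video_buf - audio_buf - DELAY    # positive -> slow down the English output

print(r1, r2)  # -2.5 0.5
```

The sign of the result is all the controller uses; its magnitude is not part of the rule described in the patent.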
Between seconds 15 and 30, 15 seconds of video and audio data are received. If nobody speaks during these 15 seconds, the audio translator produces no output. At second 30 the situation is then: the video buffer holds 10 s of video frames and the audio buffer holds 0 s of audio frames. At the next trigger, the clock-driven encapsulator therefore takes the video frames from the video buffer, encapsulates them together with generated silent audio frames, and outputs the result.
As shown in fig. 2 to 7, another embodiment of the present invention relates to an implementation method of a simultaneous interpretation system of video streams, the implementation method comprising:
the decapsulator component decapsulates the acquired video stream data into audio coded frames and video coded frames, feeds the audio coded frames to the audio decoder, and feeds the video coded frames to the video buffer;
the audio decoder decodes the audio coded frames it receives into audio baseband data and feeds that data to the audio translator;
the audio translator translates the acquired audio baseband data, computes the speech rate from the audio and video durations fed back by the audio buffer and the video buffer, synthesizes audio baseband data, and feeds it to the audio encoder to be encoded into audio frames, which are buffered in the audio buffer;
when the clock-driven encapsulator component receives a trigger signal, it reads video frames and audio frames of a certain duration from the audio buffer and the video buffer, encapsulates them, and finally outputs video stream data whose audio and video are synchronized.
Computing the speech rate from the audio and video durations fed back by the audio buffer and the video buffer comprises the following:
after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator;
if the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster;
if the result is positive, the speech-rate parameter value in the audio translator is decreased so that the output speech rate becomes slower.
Further, the implementation method also comprises: after the clock-driven encapsulator component receives a trigger signal, if the audio buffer is empty, it takes the video frames from the video buffer as usual and generates silent audio frames for output.
The foregoing describes preferred embodiments of the invention. It should be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications, and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A simultaneous interpretation system for video streams, characterized by comprising: a decapsulator component, an audio processing component, an audio/video buffer component, and a clock-driven encapsulator component;
the decapsulator component: used for decapsulating the input video stream and splitting it into a pure video stream composed of a series of video frames arranged in time order and a pure audio stream composed of a series of audio frames arranged in time order;
the audio processing component: used for decoding an input audio frame of language A to obtain audio baseband data, translating the audio baseband data and adjusting the speech-rate parameter to obtain language-B audio baseband data whose output speech rate can be controlled, and encoding that data into audio frames;
the audio/video buffer component: used for buffering the video frames and audio frames separately, periodically checking the buffered durations of the video frames and of the audio frames, and feeding those durations back to the audio processing component;
the clock-driven encapsulator component: used for taking video frames and audio frames of the required duration out of the audio/video buffer component under the drive of its internal clock, and encapsulating and outputting them.
2. The system of claim 1, wherein: the audio processing component comprises an audio decoder, an audio translator, and an audio encoder; the audio decoder decodes the input audio frames of language A to obtain audio baseband data; the audio translator translates the audio baseband data produced by the audio decoder to obtain audio baseband data in language B, computes the speech rate from the video and audio durations fed back by the audio/video buffer component, and synthesizes the speech; and the audio encoder encodes the audio baseband data of the speech synthesized by the audio translator into audio frames.
3. The system of claim 2, wherein: the audio/video buffer component comprises an audio buffer and a video buffer; the audio buffer buffers the audio frames output by the audio encoder, periodically checks their duration, and feeds it back to the audio translator; the video buffer buffers the video frames split off by the decapsulator component, periodically checks their duration, and feeds it back to the audio translator.
4. The system of claim 1, wherein: computing the speech rate from the video and audio durations fed back by the audio/video buffer component comprises the following: after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator; if the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster, and if the result is positive, the speech-rate parameter value is decreased so that the output speech rate becomes slower.
5. The system of claim 1, wherein: the clock-driven encapsulator component has an internal clock that generates a trigger signal at regular intervals; upon receiving a trigger signal, if the audio buffer is empty, the clock-driven encapsulator component takes the video frames from the video buffer as usual and generates silent audio frames for output.
6. A method for implementing a simultaneous interpretation system of video streams according to any of claims 1 to 5, characterized in that the implementation method comprises the following steps:
the decapsulator component decapsulates the acquired video stream data into audio coded frames and video coded frames, feeding the audio coded frames to the audio decoder and the video coded frames to the video buffer;
the audio decoder decodes the audio coded frames into audio baseband data and passes it to the audio translator;
the audio translator translates the acquired audio baseband data, calculates the speech rate from the audio and video durations fed back by the audio buffer and the video buffer, synthesizes speech, and passes the resulting audio baseband data to the audio encoder, which encodes it into audio frames buffered in the audio buffer;
when the clock-driven wrapper component receives a trigger signal, it reads video frames and audio frames of a certain duration from the video buffer and the audio buffer, wraps them, and finally outputs video stream data with synchronized audio and video.
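The five steps of claim 6 can be summarized as a single pass over demultiplexed packets. This is a structural sketch only; the five components are passed in as callables, and every name below is illustrative rather than taken from the patent:

```python
def simultaneous_interpret(stream, demux, adecode, translate, aencode, mux):
    """Sketch of the method in claim 6 (all names hypothetical).

    demux     -> yields typed packets from the input video stream
    adecode   -> audio coded frame (language A) -> audio baseband data
    translate -> baseband data in A -> synthesized baseband data in B
    aencode   -> baseband data -> encoded audio frame
    mux       -> wraps buffered video and audio frames into an output stream
    """
    audio_buf, video_buf = [], []
    for packet in demux(stream):              # step 1: de-encapsulate
        if packet["type"] == "video":
            video_buf.append(packet)          # video bypasses translation
        else:
            pcm_a = adecode(packet)           # step 2: decode language A audio
            pcm_b = translate(pcm_a)          # step 3: translate and synthesize
            audio_buf.append(aencode(pcm_b))  # step 4: re-encode audio
    return mux(video_buf, audio_buf)          # step 5: wrap for synchronized output
```

In the real system the wrapper is driven by clock triggers rather than running after the loop, but the data flow between components is the same.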
7. The method of claim 6, wherein calculating the speech rate from the audio and video durations fed back by the audio buffer and the video buffer comprises the following steps:
after the audio translator receives the real-time audio and video frame durations fed back by the audio/video buffer component, it computes the duration of the video frames in the video buffer minus the duration of the audio frames in the audio buffer minus the audio translator's translation delay, and checks whether the result is positive or negative;
if the result is negative, the speech-rate parameter in the audio translator is increased so that the output speech is faster;
if the result is positive, the speech-rate parameter in the audio translator is decreased so that the output speech is slower.
8. The method of claim 6, wherein the implementation method further comprises: after the clock-driven wrapper component receives a trigger signal, if the audio buffer is empty, the clock-driven wrapper component takes the video frame from the video buffer as usual and generates a silent audio frame for output.
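The silent-frame fallback of claims 5 and 8 keeps output continuous when translation lags. A minimal sketch, assuming 16-bit PCM audio; the function name, frame duration, sample rate, and channel count are all illustrative assumptions:

```python
from collections import deque

def on_clock_tick(video_buf, audio_buf, frame_ms=20, sample_rate=48000, channels=2):
    """Sketch of the clock-trigger handling in claims 5/8 (parameters hypothetical).

    On each trigger, take the next video frame as usual; if the audio buffer
    is empty, emit a silent PCM frame of matching duration instead of stalling.
    """
    video_frame = video_buf.popleft() if video_buf else None
    if audio_buf:
        audio_frame = audio_buf.popleft()
    else:
        # 16-bit PCM silence: all-zero bytes sized for one frame duration
        samples = sample_rate * frame_ms // 1000
        audio_frame = bytes(2 * channels * samples)
    return video_frame, audio_frame
```

Emitting zeroed samples rather than skipping the tick keeps the wrapped stream's audio track gapless, which is what lets the output remain audio/video synchronized.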
CN202111001773.8A 2021-08-30 2021-08-30 Simultaneous interpretation system of video stream and implementation method thereof Pending CN113721704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111001773.8A CN113721704A (en) 2021-08-30 2021-08-30 Simultaneous interpretation system of video stream and implementation method thereof


Publications (1)

Publication Number Publication Date
CN113721704A true CN113721704A (en) 2021-11-30

Family

ID=78678844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111001773.8A Pending CN113721704A (en) 2021-08-30 2021-08-30 Simultaneous interpretation system of video stream and implementation method thereof

Country Status (1)

Country Link
CN (1) CN113721704A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102724559A (en) * 2012-06-13 2012-10-10 天脉聚源(北京)传媒科技有限公司 Method and system for synchronizing encoding of videos and audios
US8874429B1 (en) * 2012-05-18 2014-10-28 Amazon Technologies, Inc. Delay in video for language translation
CN108063970A (en) * 2017-11-22 2018-05-22 北京奇艺世纪科技有限公司 A kind of method and apparatus for handling live TV stream
US20180336891A1 (en) * 2015-10-29 2018-11-22 Hitachi, Ltd. Synchronization method for visual information and auditory information and information processing device
CN111050201A (en) * 2019-12-10 2020-04-21 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium
CN112562721A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, device and storage medium
CN112562719A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Method, system, device and storage medium for matching synthesized voice and original video



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211130