CN113721704A - Simultaneous interpretation system of video stream and implementation method thereof - Google Patents

Simultaneous interpretation system of video stream and implementation method thereof

Info

Publication number
CN113721704A
CN113721704A (application CN202111001773.8A)
Authority
CN
China
Prior art keywords
audio
video
frame
buffer
translator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111001773.8A
Other languages
Chinese (zh)
Inventor
杨丰羽
刘盾
唐学怡
杨皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chinamcloud Technology Co ltd
Original Assignee
Chengdu Chinamcloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chinamcloud Technology Co ltd filed Critical Chengdu Chinamcloud Technology Co ltd
Priority to CN202111001773.8A
Publication of CN113721704A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/04Generating or distributing clock signals or signals derived directly therefrom
    • G06F1/12Synchronisation of different clock signals provided by a plurality of clock generators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/74Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a simultaneous interpretation system for video streams and a method of implementing it. The system comprises a decapsulator component for decapsulating the input video stream and splitting it into a pure video stream composed of a series of video frames arranged in time order and a pure audio stream composed of a series of audio frames arranged in time order; an audio processing component for decoding, translating, speed-adjusting and re-encoding the input audio frames of language A; an audio/video buffer component for buffering the video frames and audio frames separately and feeding the buffered durations back to the audio processing component in real time; and a clock-driven encapsulator component for taking audio and video of the required duration out of the audio/video buffer component, encapsulating them, and outputting the result. The invention adjusts the speech-rate parameter of the audio translator according to the sign of the quantity computed from the buffered video frame duration, the buffered audio frame duration, and the translation delay of the audio translator, so that the audio and video of the finally output video stream data remain synchronized.

Description

Simultaneous interpretation system of video stream and implementation method thereof
Technical Field
The invention relates to the technical field of language identification processing, in particular to a simultaneous interpretation system of video streams and an implementation method thereof.
Background
Thanks to advances in speech recognition, machine translation, and speech synthesis, it has become possible to replace human simultaneous interpretation with machines. Similar systems already exist on the market: some take an audio stream as input (for example, a speech in Chinese) and output the speech-recognition result (Chinese text) together with the corresponding translation (English text) in real time; others take audio and video as input and output the translated audio stream with a delay. However, because the translated audio stream can differ greatly in length from the original audio stream, a naive substitution leaves the video and audio completely out of synchronization, which limits the practical value of the output video stream. How to solve this problem is the question that must be addressed at the present stage.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a simultaneous interpretation system for video streams and a method of implementing it, thereby solving the prior-art problem that audio and video fall out of synchronization once the speech in a video has been translated.
The purpose of the invention is achieved by the following technical scheme: a simultaneous interpretation system for video streams comprises a decapsulator component, an audio processing component, an audio/video buffer component, and a clock-driven encapsulator component;
the decapsulator assembly: the system comprises a decoder, a decoder and a decoder, wherein the decoder is used for de-encapsulating an input video stream and splitting the input video stream into a pure video stream composed of a series of video frames arranged according to a time sequence and a pure audio stream composed of a series of audio frames arranged according to a time sequence;
the audio processing component: the system comprises a voice frequency decoder, a voice speed parameter adjusting module and a voice frequency processing module, wherein the voice frequency decoder is used for decoding an input audio frequency frame of A language to obtain audio frequency baseband data, translating the audio frequency baseband data and adjusting the voice speed parameter to obtain audio frequency baseband data of B language capable of controlling the voice speed of the output audio frequency, and encoding the audio frequency baseband data into an audio frequency frame;
the audio and video buffer component: the system is used for respectively caching the video frame and the audio frame, regularly checking the duration of the video frame and the duration of the audio frame and feeding back the duration to the audio processing component;
the clock driven wrapper component: the audio/video buffer component is used for taking out the video frames and the audio frames with required time length from the audio/video buffer component under the driving of the internal clock, and packaging and outputting the video frames and the audio frames.
The audio processing component comprises an audio decoder, an audio translator, and an audio encoder. The audio decoder decodes the input audio frames of language A to obtain audio baseband data. The audio translator translates the audio baseband data produced by the audio decoder to obtain audio baseband data in language B, computes the speech rate from the video and audio durations fed back by the audio/video buffer component, and synthesizes the speech. The audio encoder encodes the audio baseband data of the speech synthesized by the audio translator into audio frames.
The audio/video buffer component comprises an audio buffer and a video buffer. The audio buffer buffers the audio frames output by the audio encoder, periodically checks their total duration, and feeds it back to the audio translator. The video buffer buffers the video frames split off by the decapsulator component, periodically checks their total duration, and feeds it back to the audio translator.
Computing the speech rate from the video and audio durations fed back by the audio/video buffer component comprises the following: after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator. If the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster; if the result is positive, the speech-rate parameter value is decreased so that the output speech rate becomes slower.
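As an illustration, the sign check described above can be sketched as a small feedback function. This is only a schematic sketch of the rule, not the patented implementation; the function name, the adjustment step, and the clamping bounds are assumptions introduced for the example.

```python
# Sketch of the speech-rate feedback rule (names and constants are assumed).

RATE_STEP = 0.05               # assumed adjustment step per feedback cycle
RATE_MIN, RATE_MAX = 0.5, 2.0  # assumed clamping bounds for the rate parameter


def adjust_speech_rate(rate, video_buf_sec, audio_buf_sec, translator_delay_sec):
    """Return an updated speech-rate parameter from buffer feedback.

    result = buffered video duration - buffered audio duration - translator delay
    negative -> the translated audio is running long relative to video: speak faster
    positive -> the translated audio is running short: speak slower
    """
    result = video_buf_sec - audio_buf_sec - translator_delay_sec
    if result < 0:
        rate += RATE_STEP
    elif result > 0:
        rate -= RATE_STEP
    return max(RATE_MIN, min(RATE_MAX, rate))
```

For example, with 10 s of buffered video, 12 s of buffered audio, and a 0.5 s translator delay, the result is negative and the rate parameter is raised.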
The clock-driven encapsulator component has an internal clock that generates a trigger signal at regular intervals. Upon receiving a trigger signal, if the audio buffer is empty, the clock-driven encapsulator component takes the video frames from the video buffer as usual and generates silent audio frames for output.
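The silence-substitution behavior can be sketched as follows. The PCM format (16-bit mono at 48 kHz) and the function names are assumptions made for this illustration only, not details taken from the patent.

```python
# Sketch: substitute a silent audio frame when the audio buffer is empty
# at a clock tick (sample format and names are assumed, not from the patent).

SAMPLE_RATE = 48_000          # assumed sample rate in Hz
SILENT_SAMPLE = b"\x00\x00"   # one 16-bit mono PCM sample of silence


def make_silent_frame(duration_ms):
    """Return duration_ms of silence as raw 16-bit mono PCM bytes."""
    n_samples = SAMPLE_RATE * duration_ms // 1000
    return SILENT_SAMPLE * n_samples


def tick_output(video_frame, audio_buffer, frame_ms=40):
    """On a trigger: take the video frame normally; if the audio buffer
    is empty, generate a silent frame of matching duration instead."""
    audio = audio_buffer.pop(0) if audio_buffer else make_silent_frame(frame_ms)
    return video_frame, audio
```

This keeps the muxer producing synchronized output even through stretches where the translator has nothing to say.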
An implementation method of a simultaneous interpretation system of video streams, the implementation method comprising:
the decapsulator component decapsulates the acquired video stream data into audio coded frames and video coded frames, feeds the audio coded frames to the audio decoder, and feeds the video coded frames to the video buffer;
the audio decoder decodes the audio coded frames it receives into audio baseband data and feeds that data to the audio translator;
the audio translator translates the acquired audio baseband data, computes the speech rate from the audio and video durations fed back by the audio buffer and the video buffer, synthesizes audio baseband data, and feeds it to the audio encoder to be encoded into audio frames, which are buffered in the audio buffer;
when the clock-driven encapsulator component receives a trigger signal, it reads video frames and audio frames of a certain duration from the audio buffer and the video buffer, encapsulates them, and finally outputs video stream data whose audio and video are synchronized.
Computing the speech rate from the audio and video durations fed back by the audio buffer and the video buffer comprises the following:
after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator;
if the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster;
if the result is positive, the speech-rate parameter value in the audio translator is decreased so that the output speech rate becomes slower.
The implementation method further comprises: after the clock-driven encapsulator component receives a trigger signal, if the audio buffer is empty, it takes the video frames from the video buffer as usual and generates silent audio frames for output.
The invention has the following advantages: a simultaneous interpretation system for video streams and a method of implementing it are provided, which adjust the speech-rate parameter of the audio translator according to the sign of the quantity computed from the buffered video frame duration, the buffered audio frame duration, and the translation delay of the audio translator, thereby keeping the audio and video of the finally output video stream data synchronized.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic flow diagram of the operation of the decapsulator;
FIG. 3 is a flow chart illustrating the operation of an audio decoder;
FIG. 4 is a schematic flow diagram of the operation of the audio translator;
FIG. 5 is a flow chart illustrating the operation of an audio encoder;
FIG. 6 is a schematic flow chart of the audio/video buffer operation;
FIG. 7 is a flow chart illustrating the operation of the clock-driven encapsulator.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, one embodiment of the present invention relates to a simultaneous interpretation system for video streams, which includes a decapsulator component, an audio processing component, an audio/video buffer component, and a clock-driven encapsulator component;
the decapsulator: used for decapsulating the input video stream and splitting it into a pure video stream composed of a series of video frames arranged in time order and a pure audio stream composed of a series of audio frames arranged in time order;
the audio processing component: used for decoding an input audio frame of language A to obtain audio baseband data, translating the audio baseband data and adjusting the speech-rate parameter to obtain language-B audio baseband data whose output speech rate can be controlled, and encoding that data into audio frames;
the audio/video buffer: used for buffering the video frames and audio frames separately, periodically checking their durations, and feeding those durations back to the audio processing component;
the clock-driven encapsulator: used for taking video frames and audio frames of the required duration out of the audio/video buffer component under the drive of its internal clock, and encapsulating and outputting them.
The audio processing component comprises an audio decoder, an audio translator, and an audio encoder. The audio decoder decodes the input audio frames of language A to obtain audio baseband data. The audio translator translates the audio baseband data produced by the audio decoder to obtain audio baseband data in language B, computes the speech rate from the video and audio durations fed back by the audio/video buffer component, and synthesizes the speech. The audio encoder encodes the audio baseband data of the speech synthesized by the audio translator into audio frames.
Suppose the input is Chinese and the output is English, and suppose the input audio is "关关雎鸠，在河之洲。窈" (the first line of a classical poem, followed by "窈", the first syllable of the next line "窈窕淑女", "Fair lady") and lasts 2.2 seconds; then 2.2 seconds of audio baseband data serve as the input to the audio translator. Note that "Fair lady" has not yet been spoken in full: only its first syllable has arrived. If that syllable lasts 0.2 seconds, then the preceding complete sentence "关关雎鸠，在河之洲。" lasts 2 seconds.
The audio translator therefore outputs English audio baseband data for the complete sentence only: "Guan! Guan! Cry the fish hawks, on an islet in the river." This sentence is longer in English (its speech duration exceeds that of the Chinese); assume its speech duration is 4 seconds.
Until further Chinese speech is input, the 0.2 seconds of speech for the syllable "窈" is held inside the audio translator and is not represented in the output.
Thus although 2.2 seconds of Chinese audio baseband data are input to the audio translator, only 2 seconds of Chinese are actually translated, yielding 4 seconds of English audio baseband data as output; 0.2 seconds of Chinese audio remain inside the audio translator with the corresponding English not yet output. This 0.2-second span can be regarded as the delay of the audio translator.
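The delay bookkeeping in this example reduces to simple arithmetic; a minimal check using the figures from the example above:

```python
# Delay bookkeeping for the example above (all durations in seconds).

input_cn = 2.2        # Chinese audio fed into the audio translator
translated_cn = 2.0   # the complete sentence actually translated
output_en = 4.0       # English speech synthesized for that sentence

# Audio still held inside the translator, i.e. its current delay:
translator_delay = input_cn - translated_cn
print(round(translator_delay, 1), output_en)  # 0.2 4.0
```

The 0.2 seconds of held audio is exactly the "translation delay" term that later enters the speech-rate feedback formula.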
Besides this delay, the audio translator also has a speech-rate parameter; adjusting this parameter controls the speech rate of the audio the translator outputs.
Further, the audio/video buffer comprises an audio buffer and a video buffer. The audio buffer buffers the audio frames output by the audio encoder, periodically checks their duration, and feeds it back to the audio translator. The video buffer buffers the video frames split off by the decapsulator component and feeds their duration back to the audio translator at regular intervals (for example, every 0.2 s).
Computing the speech rate from the video and audio durations fed back by the audio/video buffer comprises the following: after receiving the real-time audio and video frame durations fed back by the audio/video buffer, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator. If the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster; if the result is positive, the speech-rate parameter value is decreased so that the output speech rate becomes slower.
When the clock-driven encapsulator component receives a trigger signal and the audio buffer is empty, it takes the video frames from the video buffer as usual and generates silent audio frames for output.
Each time it is triggered, the clock-driven encapsulator takes about 40 ms of video frames and about 40 ms of audio frames. Because the duration of each audio frame segment in the audio buffer is not necessarily exactly 40 ms (some audio frames may be 43 ms, others 41 ms), the duration taken on each fetch may be slightly more or less than 40 ms; 40 ms is the average. Assume also that the startup delay of the clock-driven encapsulator is set to 10 seconds.
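The frame-fetch behavior can be sketched as follows: whole audio frames are popped until at least the requested duration is covered, and the surplus or shortfall carries into the next tick, so fetches average out to about 40 ms. The queue model and names are assumptions for illustration.

```python
from collections import deque


def fetch_audio(audio_q, want_ms):
    """Pop whole audio frames (modeled here by their duration in ms)
    until their total duration covers want_ms; return (frames, total).
    The over- or undershoot is carried into the next tick, so the
    amount fetched averages out to want_ms over time."""
    taken, got_ms = [], 0.0
    while audio_q and got_ms < want_ms:
        taken.append(audio_q.popleft())
        got_ms += taken[-1]
    return taken, got_ms
```

For instance, with buffered frames of 43 ms, 41 ms, and 38 ms, a 40 ms request takes only the first frame (43 ms) and the 3 ms surplus is deducted from the next request.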
Specifically, in the acquired real-time video stream, 10 seconds of video and audio are received in the first 10 seconds. Because the encapsulator's delay is 10 seconds, it produces no output during this period, so 10 seconds of video frames accumulate in the video buffer. Suppose the audio buffer holds 12 seconds of audio frame data: with an audio-translator delay of 0.5 seconds, 9.5 of the 10 seconds of Chinese audio have been translated into 12 seconds of English speech and output to the audio buffer. The video buffer now feeds back a buffered video length of 10 seconds, and the audio buffer feeds back a buffered audio length of 12 seconds. On receiving this feedback, the audio translator computes "video frame duration in the video buffer minus audio frame duration in the audio buffer minus audio translator delay", i.e. 10 s − 12 s − 0.5 s = −2.5 s. The result is negative, so the audio translator increases the speech-rate parameter value to speed up the subsequently output English speech.
Between seconds 10 and 15, a further 5 seconds of video and audio are received. Because the English output was sped up as described above, assume the audio translator outputs only 2 seconds of English speech, its delay remaining 0.5 seconds. The clock-driven encapsulator is now running and has taken 5 seconds of video frames and 5 seconds of audio frame data out of the buffers for output. The video buffer therefore holds 10 s + 5 s − 5 s = 10 s of video frames and the audio buffer 12 s + 2 s − 5 s = 9 s of audio frames. The buffers feed back these lengths (video 10 s, audio 9 s), and the audio translator computes "video frame duration in the video buffer minus audio frame duration in the audio buffer minus audio translator delay", i.e. 10 s − 9 s − 0.5 s = 0.5 s. The result is positive, so the audio translator decreases the speech-rate parameter value to slow down the subsequently output English speech.
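As a check, the two feedback evaluations in the walkthrough reduce to the same formula (buffered video duration minus buffered audio duration minus translator delay); all figures below are taken from the example above:

```python
# Buffer arithmetic for the walkthrough above (durations in seconds).

DELAY = 0.5  # audio translator delay in both intervals

# First 10 seconds: 10 s of video buffered, 12 s of English audio produced.
r1 = 10 - 12 - DELAY                  # negative -> speed up the English output

# Seconds 10 to 15: +5 s of video and +2 s of audio arrive,
# while 5 s of each is taken out by the clock-driven encapsulator.
video_buf = 10 + 5 - 5                # 10 s of video frames remain
audio_buf = 12 + 2 - 5                # 9 s of audio frames remain
r2 = video_buf - audio_buf - DELAY    # positive -> slow down the English output

print(r1, r2)  # -2.5 0.5
```

The sign of the result is all the controller uses; its magnitude is not part of the rule described in the patent.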
Between seconds 15 and 30, 15 seconds of video and audio data are received. If nobody speaks during these 15 seconds, the audio translator produces no output. At second 30 the situation is then: the video buffer holds 10 s of video frames and the audio buffer holds 0 s of audio frames. At the next trigger, the clock-driven encapsulator therefore takes the video frames from the video buffer, encapsulates them together with generated silent audio frames, and outputs the result.
As shown in fig. 2 to 7, another embodiment of the present invention relates to an implementation method of a simultaneous interpretation system of video streams, the implementation method comprising:
the decapsulator component decapsulates the acquired video stream data into audio coded frames and video coded frames, feeds the audio coded frames to the audio decoder, and feeds the video coded frames to the video buffer;
the audio decoder decodes the audio coded frames it receives into audio baseband data and feeds that data to the audio translator;
the audio translator translates the acquired audio baseband data, computes the speech rate from the audio and video durations fed back by the audio buffer and the video buffer, synthesizes audio baseband data, and feeds it to the audio encoder to be encoded into audio frames, which are buffered in the audio buffer;
when the clock-driven encapsulator component receives a trigger signal, it reads video frames and audio frames of a certain duration from the audio buffer and the video buffer, encapsulates them, and finally outputs video stream data whose audio and video are synchronized.
Computing the speech rate from the audio and video durations fed back by the audio buffer and the video buffer comprises the following:
after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator;
if the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster;
if the result is positive, the speech-rate parameter value in the audio translator is decreased so that the output speech rate becomes slower.
Further, the implementation method also comprises: after the clock-driven encapsulator component receives a trigger signal, if the audio buffer is empty, it takes the video frames from the video buffer as usual and generates silent audio frames for output.
The foregoing describes preferred embodiments of the invention. It should be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications, and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A simultaneous interpretation system for video streams, characterized by comprising: a decapsulator component, an audio processing component, an audio/video buffer component, and a clock-driven encapsulator component;
the decapsulator component: used for decapsulating the input video stream and splitting it into a pure video stream composed of a series of video frames arranged in time order and a pure audio stream composed of a series of audio frames arranged in time order;
the audio processing component: used for decoding an input audio frame of language A to obtain audio baseband data, translating the audio baseband data and adjusting the speech-rate parameter to obtain language-B audio baseband data whose output speech rate can be controlled, and encoding that data into audio frames;
the audio/video buffer component: used for buffering the video frames and audio frames separately, periodically checking the buffered durations of the video frames and of the audio frames, and feeding those durations back to the audio processing component;
the clock-driven encapsulator component: used for taking video frames and audio frames of the required duration out of the audio/video buffer component under the drive of its internal clock, and encapsulating and outputting them.
2. The system of claim 1, wherein: the audio processing component comprises an audio decoder, an audio translator, and an audio encoder; the audio decoder decodes the input audio frames of language A to obtain audio baseband data; the audio translator translates the audio baseband data produced by the audio decoder to obtain audio baseband data in language B, computes the speech rate from the video and audio durations fed back by the audio/video buffer component, and synthesizes the speech; and the audio encoder encodes the audio baseband data of the speech synthesized by the audio translator into audio frames.
3. The system of claim 2, wherein: the audio/video buffer component comprises an audio buffer and a video buffer; the audio buffer buffers the audio frames output by the audio encoder, periodically checks their duration, and feeds it back to the audio translator; the video buffer buffers the video frames split off by the decapsulator component, periodically checks their duration, and feeds it back to the audio translator.
4. The system of claim 1, wherein: computing the speech rate from the video and audio durations fed back by the audio/video buffer component comprises the following: after receiving the real-time audio and video frame durations fed back by the audio/video buffer component, the audio translator evaluates the sign of the result of subtracting, from the duration of the video frames in the video buffer, the duration of the audio frames in the audio buffer and then the translation delay of the audio translator; if the result is negative, the speech-rate parameter value in the audio translator is increased so that the output speech rate becomes faster, and if the result is positive, the speech-rate parameter value is decreased so that the output speech rate becomes slower.
5. The system of claim 1, wherein: the clock-driven encapsulator component has an internal clock that generates a trigger signal at regular intervals; upon receiving a trigger signal, if the audio buffer is empty, the clock-driven encapsulator component takes the video frames from the video buffer as usual and generates silent audio frames for output.
6. A method for implementing a simultaneous interpretation system of video streams according to any of claims 1 to 5, characterized in that the implementation method comprises the following steps:
the decapsulator component decapsulates the acquired video stream data into audio coded frames and video coded frames, feeding the audio coded frames to the audio decoder and the video coded frames to the video buffer;
the audio decoder decodes the audio coded frames into audio baseband data and passes it to the audio translator;
the audio translator translates the acquired audio baseband data, calculates the speech rate from the audio and video durations fed back by the audio buffer and the video buffer, synthesizes speech, and passes the resulting audio baseband data to the audio encoder, which encodes it into audio frames buffered in the audio buffer;
when the clock-driven wrapper component receives a trigger signal, it reads video frames and audio frames of a certain duration from the video buffer and the audio buffer, wraps them, and finally outputs video stream data with synchronized audio and video.
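The five steps of claim 6 can be summarized as a single pass over demultiplexed packets. This is a structural sketch only; the five components are passed in as callables, and every name below is illustrative rather than taken from the patent:

```python
def simultaneous_interpret(stream, demux, adecode, translate, aencode, mux):
    """Sketch of the method in claim 6 (all names hypothetical).

    demux     -> yields typed packets from the input video stream
    adecode   -> audio coded frame (language A) -> audio baseband data
    translate -> baseband data in A -> synthesized baseband data in B
    aencode   -> baseband data -> encoded audio frame
    mux       -> wraps buffered video and audio frames into an output stream
    """
    audio_buf, video_buf = [], []
    for packet in demux(stream):              # step 1: de-encapsulate
        if packet["type"] == "video":
            video_buf.append(packet)          # video bypasses translation
        else:
            pcm_a = adecode(packet)           # step 2: decode language A audio
            pcm_b = translate(pcm_a)          # step 3: translate and synthesize
            audio_buf.append(aencode(pcm_b))  # step 4: re-encode audio
    return mux(video_buf, audio_buf)          # step 5: wrap for synchronized output
```

In the real system the wrapper is driven by clock triggers rather than running after the loop, but the data flow between components is the same.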
7. The method of claim 6, wherein calculating the speech rate from the audio and video durations fed back by the audio buffer and the video buffer comprises the following steps:
after the audio translator receives the real-time audio and video frame durations fed back by the audio/video buffer component, it computes the duration of the video frames in the video buffer minus the duration of the audio frames in the audio buffer minus the audio translator's translation delay, and checks whether the result is positive or negative;
if the result is negative, the speech-rate parameter in the audio translator is increased so that the output speech is faster;
if the result is positive, the speech-rate parameter in the audio translator is decreased so that the output speech is slower.
8. The method of claim 6, wherein the implementation method further comprises: after the clock-driven wrapper component receives a trigger signal, if the audio buffer is empty, the clock-driven wrapper component takes the video frame from the video buffer as usual and generates a silent audio frame for output.
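The silent-frame fallback of claims 5 and 8 keeps output continuous when translation lags. A minimal sketch, assuming 16-bit PCM audio; the function name, frame duration, sample rate, and channel count are all illustrative assumptions:

```python
from collections import deque

def on_clock_tick(video_buf, audio_buf, frame_ms=20, sample_rate=48000, channels=2):
    """Sketch of the clock-trigger handling in claims 5/8 (parameters hypothetical).

    On each trigger, take the next video frame as usual; if the audio buffer
    is empty, emit a silent PCM frame of matching duration instead of stalling.
    """
    video_frame = video_buf.popleft() if video_buf else None
    if audio_buf:
        audio_frame = audio_buf.popleft()
    else:
        # 16-bit PCM silence: all-zero bytes sized for one frame duration
        samples = sample_rate * frame_ms // 1000
        audio_frame = bytes(2 * channels * samples)
    return video_frame, audio_frame
```

Emitting zeroed samples rather than skipping the tick keeps the wrapped stream's audio track gapless, which is what lets the output remain audio/video synchronized.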
CN202111001773.8A 2021-08-30 2021-08-30 Simultaneous interpretation system of video stream and implementation method thereof Pending CN113721704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111001773.8A CN113721704A (en) 2021-08-30 2021-08-30 Simultaneous interpretation system of video stream and implementation method thereof


Publications (1)

Publication Number Publication Date
CN113721704A true CN113721704A (en) 2021-11-30

Family

ID=78678844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111001773.8A Pending CN113721704A (en) 2021-08-30 2021-08-30 Simultaneous interpretation system of video stream and implementation method thereof

Country Status (1)

Country Link
CN (1) CN113721704A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102724559A (en) * 2012-06-13 2012-10-10 天脉聚源(北京)传媒科技有限公司 Method and system for synchronizing encoding of videos and audios
US8874429B1 (en) * 2012-05-18 2014-10-28 Amazon Technologies, Inc. Delay in video for language translation
CN108063970A (en) * 2017-11-22 2018-05-22 北京奇艺世纪科技有限公司 A kind of method and apparatus for handling live TV stream
US20180336891A1 (en) * 2015-10-29 2018-11-22 Hitachi, Ltd. Synchronization method for visual information and auditory information and information processing device
CN111050201A (en) * 2019-12-10 2020-04-21 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium
CN112562721A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, device and storage medium
CN112562719A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Method, system, device and storage medium for matching synthesized voice and original video



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211130