CN112188241A - Method and system for real-time subtitle generation of live stream - Google Patents

Method and system for real-time subtitle generation of live stream

Info

Publication number
CN112188241A
CN112188241A (application CN202011072549.3A)
Authority
CN
China
Prior art keywords
engine, real time, websocket, frame, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011072549.3A
Other languages
Chinese (zh)
Inventor
唐杰
王遥远
李庆瑜
戴立言
Current Assignee (listing may be inaccurate)
SHANGHAI WONDERTEK SOFTWARE CO Ltd
Original Assignee
SHANGHAI WONDERTEK SOFTWARE CO Ltd
Priority date (assumed)
Filing date
Publication date
Application filed by SHANGHAI WONDERTEK SOFTWARE CO Ltd filed Critical SHANGHAI WONDERTEK SOFTWARE CO Ltd
Priority to CN202011072549.3A
Publication of CN112188241A

Classifications

    • H04N 21/235: Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N 21/2187: Live feed
    • H04N 21/233: Processing of audio elementary streams
    • H04N 21/23614: Multiplexing of additional data and video streams
    • H04N 21/242: Synchronization processes, e.g. processing of PCR [Program Clock References]
    (all under H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD])

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to the technical field of live audio and video streaming, and provides a method and system for generating subtitles for a live stream in real time. The method comprises: decoding a live source to obtain video frames and audio frames; establishing communication between an AI speech translation engine and a transcoding engine; the AI speech translation engine obtaining the audio frames from the transcoding engine, translating them in real time, and outputting the translated content; and the transcoding engine obtaining the translated content from the AI speech translation engine in real time and encapsulating it with the original video and audio frames, either by burning it into the video frames or by packaging it into subtitle frames, to output the live stream. Because the audio stream is translated into subtitles in real time during the broadcast, true synchronization of subtitles, audio, and video is achieved; subtitles in multiple languages can be output simultaneously; live streams with subtitles can be produced for different push protocols; and the player can freely select subtitles in different languages for display.

Description

Method and system for real-time subtitle generation of live stream
Technical Field
The invention relates to the technical field of live audio and video streaming, and in particular to a method and system for generating subtitles for a live stream in real time.
Background
With the development of the live streaming industry, problems such as image quality, latency, and audio-video synchronization have been optimized extensively, yet user requirements are still not fully met.
In some scenarios, such as sporting events, large conference reports, and online education and training, subtitles are generally produced in post-production: the live video is manually translated and the subtitles are burned into the video afterwards. The benefit that subtitles bring to users is therefore lost during the live broadcast itself.
Subtitles help hearing-impaired viewers understand the program content, and because many words are homophones, combining subtitle text with the audio makes the content clearer. In addition, subtitles can be used to translate foreign-language programs, so that viewers who do not understand the language can follow the content while hearing the original sound.
Real-time subtitle generation for live broadcast is not yet mature. In particular, synchronization of audio, video, and subtitles during live broadcast is problematic: subtitles that lead or lag the sound and picture result in a poor user experience.
Among existing push protocols, RTMP, for example, does not support external (plug-in) subtitles, so subtitles must be burned into the video. However, some users need to freely select subtitles in different languages during playback, which requires the subtitles to exist as a separate, plug-in stream.
Moreover, viewers may come from different countries, so subtitles in multiple languages need to be generated simultaneously to meet their requirements.
Disclosure of Invention
In view of the foregoing problems, an object of the present invention is to provide a method and system for generating subtitles for a live stream in real time. Addressing the shortcomings of existing live subtitle generation technology, the audio stream is translated in real time during the broadcast to generate subtitles, achieving true synchronization of subtitles, audio, and video; subtitles in multiple languages are output simultaneously; live streams with subtitles are produced for different push protocols; and the player can freely select subtitles in different languages for display.
The above object of the present invention is achieved by the following technical solutions:
A method for generating subtitles for a live stream in real time comprises the following steps:
S1: acquiring a live source, starting a decapsulation and decoding thread in a transcoding engine, and decoding the live source to obtain video frames and audio frames;
S2: establishing an AI speech translation engine and establishing communication between the AI speech translation engine and the transcoding engine;
S3: the AI speech translation engine obtaining the audio frames from the transcoding engine, translating them in real time, and outputting the translated content;
S4: the transcoding engine obtaining the translated content from the AI speech translation engine in real time and encapsulating it with the original video and audio frames to output a live stream, either by burning the translated content into the video frames or by packaging it into subtitle frames.
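The four steps S1 to S4 can be sketched as a minimal pipeline. Everything below is illustrative only: the `Frame` type, the stub `translate` function, and the tuple-based output are hypothetical stand-ins for the transcoding engine and the AI speech translation engine, not interfaces taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str      # "video" or "audio"
    pts: int       # presentation timestamp in milliseconds
    payload: bytes

def demux_decode(source):
    """S1: decapsulate and decode the live source into frames (stubbed)."""
    for frame in source:
        yield frame

def translate(audio_frame):
    """S3: stand-in for the AI speech translation engine."""
    return {"pts": audio_frame.pts, "text": f"[subtitle@{audio_frame.pts}ms]"}

def run_pipeline(source):
    """S2/S4: route audio frames to the translator and mux the results back."""
    out = []
    for frame in demux_decode(source):
        if frame.kind == "audio":
            cue = translate(frame)  # translated content reuses the audio PTS
            out.append(("subtitle", cue["pts"], cue["text"]))
        out.append((frame.kind, frame.pts, frame.payload))
    return out  # a real system re-encodes and pushes this as the live stream

src = [Frame("video", 0, b"v0"), Frame("audio", 0, b"a0"), Frame("video", 40, b"v1")]
stream = run_pipeline(src)
```

Because the subtitle tuple inherits the audio frame's timestamp, the muxed output carries all three synchronized tracks through a single transcoding pass, which is the core claim of the method.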
Further, communication between the AI speech translation engine and the transcoding engine is established through WebSocket, specifically:
establishing WebSocket server A and WebSocket client B on the transcoding engine;
establishing WebSocket client C and WebSocket server D on the AI speech translation engine;
WebSocket client C initiating an authentication request to WebSocket server A; once authentication succeeds, the connection is established and the AI speech translation engine obtains the audio frames from the transcoding engine in real time over the WebSocket;
WebSocket client B initiating an authentication request to WebSocket server D; once authentication succeeds, the connection is established and the transcoding engine obtains the translated content from the AI speech translation engine in real time over the WebSocket.
Further, an authentication request initiated by a WebSocket client (client B or client C) to a WebSocket server (server A or server D) specifically comprises the following steps:
the WebSocket client is preconfigured with an agreed key, and encrypts the agreed key with the MD5 algorithm to obtain a first MD5 digest;
the WebSocket client appends the first MD5 digest to the URL request as a parameter;
after receiving the client's request, the WebSocket server parses out the parameter-free URL and the first MD5 digest;
the WebSocket server encrypts the agreed key again with the MD5 algorithm to obtain a second MD5 digest;
the WebSocket server compares the first MD5 digest with the second; if they are equal, authentication succeeds, otherwise authentication fails.
Further, in step S4, outputting the live stream by burning the translated content into the video frames suits streaming media servers that do not support pushing a plug-in subtitle stream.
Further, in step S4, outputting the live stream by packaging the translated content into subtitle frames suits streaming media servers that support pushing a plug-in subtitle stream, so that the subtitles to display can be freely selected.
Further, the translated content, in one or more translation languages, is burned into the video frames or separately encapsulated into subtitle frames.
Further, step S1 further comprises:
the transcoding engine performing timestamp correction on the decoded video and audio frames, ensuring that the timestamps are aligned and increase monotonically.
Further, step S3 further comprises:
the AI speech translation engine attaching to the translated content the timestamp carried by the corresponding audio frame, so that after the translated content is encapsulated with the original video and audio frames, audio, video, and subtitles remain synchronized.
The invention also provides a system for executing the above method for generating subtitles for a live stream in real time, comprising:
a live stream decoding module, used to acquire a live source, start a decapsulation and decoding thread in the transcoding engine, and decode the live source into video frames and audio frames;
a communication establishing module, used to establish an AI speech translation engine and establish communication between the AI speech translation engine and the transcoding engine;
a translation module, used by the AI speech translation engine to obtain the audio frames from the transcoding engine, translate them in real time, and output the translated content;
an encapsulation module, used by the transcoding engine to obtain the translated content from the AI speech translation engine in real time and encapsulate it with the original video and audio frames to output a live stream, either by burning the translated content into the video frames or by packaging it into subtitle frames.
An electronic device comprising a processor and a memory, wherein the memory stores at least one instruction, program, code set, or instruction set which, when loaded and executed by the processor, implements the above method for generating subtitles for a live stream in real time.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the above-mentioned method for generating subtitles in real time from a live stream.
Compared with the prior art, the invention has at least one of the following beneficial effects:
(1) The method for generating subtitles for a live stream in real time comprises the following steps: S1: acquiring a live source, starting a decapsulation and decoding thread in a transcoding engine, and decoding the live source to obtain video frames and audio frames; S2: establishing an AI speech translation engine and establishing communication between the AI speech translation engine and the transcoding engine; S3: the AI speech translation engine obtaining the audio frames from the transcoding engine, translating them in real time, and outputting the translated content; S4: the transcoding engine obtaining the translated content from the AI speech translation engine in real time and encapsulating it with the original video and audio frames to output a live stream, either by burning the translated content into the video frames or by packaging it into subtitle frames. With this technical scheme, subtitles are generated in real time within a single transcoding task, and audio, video, and subtitles are kept synchronized.
(2) The translated content, in one or more translation languages, is burned into the video frames or separately encapsulated into subtitle frames. With this technical scheme, subtitles in multiple languages are output simultaneously by a single transcoding task, and the player can selectively display them.
(3) The live stream is output by encapsulating the translated content together with the original video and audio frames, either by burning it into the video frames or by packaging it into subtitle frames. This technical scheme is compatible with any streaming media server, whether or not it supports pushing a plug-in subtitle stream.
Drawings
Fig. 1 is an overall flowchart of the method for generating subtitles for a live stream in real time according to the present invention;
Fig. 2 is a schematic diagram of the transcoding engine and the AI speech translation engine establishing WebSocket communication according to the present invention;
Fig. 3 is a flowchart of a WebSocket client sending an authentication request to a WebSocket server according to the present invention;
Fig. 4 is a flowchart of producing, from the live source, a live stream with subtitles burned into the video according to the present invention;
Fig. 5 is a flowchart of producing, from the live source, a live stream with plug-in subtitles according to the present invention;
Fig. 6 is an overall structural diagram of the system for generating subtitles for a live stream in real time according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
First embodiment
As shown in fig. 1, the present embodiment provides a method for generating subtitles in real time in a live stream, including the following steps:
S1: Acquire a live source, start a decapsulation and decoding thread in the transcoding engine, and decode the live source to obtain video frames and audio frames.
Specifically, once the live source is acquired in real time it immediately enters the transcoding engine, where the decapsulation and decoding thread first decapsulates the live source and then decodes it to obtain video frames and audio frames.
Further, after the video and audio frames are decoded, their timestamps must be corrected to keep them consistent and to prevent audio and video from drifting out of sync during subsequent playback, which would hurt the user experience. Timestamp correction keeps the video and audio timestamps aligned and makes the timestamps increase monotonically from frame to frame.
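One way such a correction can work (the offset-and-clamp scheme below is an assumption; the patent does not specify the algorithm) is to rebase all streams to a common origin and force each stream's timestamps to be strictly increasing:

```python
class TimestampCorrector:
    """Rebase decoded frames to a common origin and keep each stream's
    PTS strictly increasing (a hypothetical correction scheme)."""

    def __init__(self):
        self.origin = None   # first PTS seen across all streams
        self.last = {}       # per-stream last emitted PTS

    def correct(self, stream, pts):
        if self.origin is None:
            self.origin = pts        # align both streams to the same zero point
        pts -= self.origin
        prev = self.last.get(stream)
        if prev is not None and pts <= prev:
            pts = prev + 1           # repair stalled or backwards timestamps
        self.last[stream] = pts
        return pts

c = TimestampCorrector()
video = [c.correct("video", p) for p in (1000, 1040, 1040, 1120)]
audio = [c.correct("audio", p) for p in (1000, 1020, 1040)]
```

The duplicated video PTS 1040 is nudged forward by one millisecond, while the audio track keeps its original spacing relative to the shared origin.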
S2: Establish an AI speech translation engine and establish communication between the AI speech translation engine and the transcoding engine.
Specifically, to avoid interfering with the transcoding process and occupying transcoding resources, the invention establishes a separate AI speech translation engine dedicated to speech translation. To translate the audio stream, communication between the AI speech translation engine and the transcoding engine must first be established.
In this embodiment, communication between the AI speech translation engine and the transcoding engine is established through WebSocket, as shown in Fig. 2, which specifically comprises the following steps:
S211: Establish WebSocket server A and WebSocket client B on the transcoding engine.
S212: Establish WebSocket client C and WebSocket server D on the AI speech translation engine.
S213: WebSocket client C sends an authentication request to WebSocket server A; once authentication succeeds, the connection is established and the AI speech translation engine obtains the audio frames from the transcoding engine in real time over the WebSocket.
S214: WebSocket client B initiates an authentication request to WebSocket server D; once authentication succeeds, the connection is established and the transcoding engine obtains the translated content from the AI speech translation engine in real time over the WebSocket.
Further, after communication between the transcoding engine and the AI speech translation engine is established and before any data is transmitted, the WebSocket clients (client B and client C) must authenticate against the WebSocket servers (server A and server D), as shown in Fig. 3, which specifically comprises the following steps:
S221: The WebSocket client is preconfigured with an agreed key and encrypts the agreed key with the MD5 algorithm to obtain a first MD5 digest.
S222: The WebSocket client appends the first MD5 digest to the URL request as a parameter.
S223: After receiving the client's request, the WebSocket server parses out the parameter-free URL and the first MD5 digest.
S224: The WebSocket server encrypts the agreed key again with the MD5 algorithm to obtain a second MD5 digest.
S225: The WebSocket server compares the first MD5 digest with the second; if they are equal, authentication succeeds, otherwise authentication fails.
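Steps S221 to S225 can be sketched with only the Python standard library. The parameter name `auth`, the URL shape, and the shared key are assumptions; note also that a static MD5 digest of a fixed key is replayable, so a production design would typically mix in a nonce or timestamp:

```python
import hashlib
from urllib.parse import urlsplit, parse_qs, urlencode

AGREED_KEY = "example-shared-secret"   # hypothetical pre-shared key

def client_build_url(base_url, key=AGREED_KEY):
    """S221/S222: hash the agreed key and append the digest as a URL parameter."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return f"{base_url}?{urlencode({'auth': digest})}"

def server_authenticate(url, key=AGREED_KEY):
    """S223 to S225: parse out the digest, recompute it, and compare."""
    parts = urlsplit(url)
    first = parse_qs(parts.query).get("auth", [""])[0]
    second = hashlib.md5(key.encode()).hexdigest()
    return first == second

url = client_build_url("ws://transcoder.example:9000/audio")
ok = server_authenticate(url)
bad = server_authenticate("ws://transcoder.example:9000/audio?auth=wrong")
```

Carrying the digest in the handshake URL lets the server reject an unauthorized peer before any audio or subtitle data flows over the WebSocket.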
S3: The AI speech translation engine obtains the audio frames from the transcoding engine, translates them in real time, and outputs the translated content.
Specifically, the translated content can be produced in one or more languages and then burned into the video frames or separately packaged into subtitle frames. During subsequent playback of the live broadcast, any one or more subtitle tracks can be selected for synchronized display, meeting the needs of different audiences.
Furthermore, the AI speech translation engine attaches to the translated content the timestamp carried by the corresponding audio frame, so that after the translated content is encapsulated with the original video and audio frames, audio, video, and subtitles remain synchronized.
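The timestamp reuse described above amounts to copying the audio frame's PTS onto the subtitle cue. A minimal sketch, in which the cue structure and the fixed two-second display duration are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class AudioFrame:
    pts_ms: int      # presentation timestamp of the decoded audio
    samples: bytes

@dataclass
class SubtitleCue:
    start_ms: int    # inherited from the audio frame's PTS
    end_ms: int
    text: str

def make_cue(frame: AudioFrame, translated_text: str,
             duration_ms: int = 2000) -> SubtitleCue:
    """Attach the audio frame's timestamp to the translated content so the
    subtitle lines up with the sound it was transcribed from."""
    return SubtitleCue(start_ms=frame.pts_ms,
                       end_ms=frame.pts_ms + duration_ms,
                       text=translated_text)

cue = make_cue(AudioFrame(pts_ms=12_000, samples=b""), "Hello, world")
```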
S4: The transcoding engine obtains the translated content from the AI speech translation engine in real time and encapsulates it with the original video and audio frames to output a live stream, either by burning the translated content into the video frames or by packaging it into subtitle frames.
Second embodiment
As shown in fig. 4, when the translated content is burned into the video frames and encapsulated with the original video and audio frames to output a live stream, the method suits streaming media servers that do not support pushing a plug-in subtitle stream. The overall steps are as follows:
(1) The live source is decapsulated and decoded by the transcoding engine into video frames and audio frames, whose timestamps are corrected so that they stay aligned and increase monotonically.
(2) The transcoding engine feeds the audio frames into the AI speech translation engine, which translates them and outputs the translated content in real time. The translated content reuses the timestamp carried by its audio frame and is aligned with the video frame's timestamp when burned in, keeping audio, video, and subtitles synchronized.
(3) The translated content is overlaid onto the video frames by the transcoding engine's burn-in module; one or more translation languages can be selected for burning in.
(4) The transcoding engine encodes and encapsulates the subtitle-overlaid video frames together with the audio frames, and outputs them as a live stream with subtitles burned into the video.
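As one concrete way to realize the burn-in step (an assumption for illustration; the patent does not name a tool), each timed cue could be rendered with an FFmpeg drawtext filter gated by an enable='between(t,start,end)' expression. The sketch below only constructs the filter string:

```python
def drawtext_filter(cues):
    """Build an FFmpeg -vf filter chain that burns timed cues into the video.
    cues: list of (start_s, end_s, text). The escaping here is simplified."""
    parts = []
    for start, end, text in cues:
        safe = text.replace("'", r"\'").replace(":", r"\:")
        parts.append(
            f"drawtext=text='{safe}':x=(w-text_w)/2:y=h-80:"
            f"fontsize=36:fontcolor=white:"
            f"enable='between(t,{start},{end})'"
        )
    return ",".join(parts)

vf = drawtext_filter([(0.0, 2.0, "Hello"), (2.0, 4.0, "World")])
# the string would then be passed on an assumed command line such as:
#   ffmpeg -i <live_in> -vf "<vf>" ... <push_url>
```

Because the enable window uses the same timeline as the corrected audio and video timestamps, each cue appears exactly over the speech it translates.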
Third embodiment
As shown in fig. 5, when the translated content is packaged into subtitle frames and encapsulated with the original video and audio frames to output a live stream, the method suits streaming media servers that support pushing a plug-in subtitle stream, so that the subtitles to display can be freely selected. The overall steps are as follows:
(1) The live source is decapsulated and decoded by the transcoding engine into video frames and audio frames, whose timestamps are corrected so that they stay aligned and increase monotonically.
(2) The transcoding engine feeds the audio frames into the AI speech translation engine, which translates them and outputs the translated content in real time. The translated content reuses the timestamp carried by its audio frame, so the subtitle frames it is packaged into are aligned with the corresponding audio and video frames, keeping audio, video, and subtitles synchronized.
(3) The translated content is packaged into subtitle frames by the transcoding engine; one or more translation languages can be selected, producing subtitle frames in one or more languages.
(4) The transcoding engine encodes and encapsulates the video frames, audio frames, and subtitle frames, and outputs them as a live stream with plug-in subtitles.
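For push targets that do carry separate subtitle tracks, packaging the translated content means serializing timed cues per language. The WebVTT format below is an assumed example (the patent does not name a subtitle format); the serializer reuses the audio-derived timestamps from step (2):

```python
def fmt_ts(ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp HH:MM:SS.mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_webvtt(cues):
    """cues: list of (start_ms, end_ms, text) sharing the audio frames' timestamps."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{fmt_ts(start)} --> {fmt_ts(end)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)

track = to_webvtt([(0, 2000, "Hello"), (2000, 4000, "World")])
```

One such track would be generated per selected translation language, letting the player switch between plug-in subtitle streams freely.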
Fourth embodiment
As shown in fig. 6, this embodiment provides a system corresponding to the method of the first embodiment for generating subtitles for a live stream in real time, comprising:
a live stream decoding module 1, used to acquire a live source, start a decapsulation and decoding thread in the transcoding engine, and decode the live source into video frames and audio frames;
a communication establishing module 2, used to establish an AI speech translation engine and establish communication between the AI speech translation engine and the transcoding engine;
a translation module 3, used by the AI speech translation engine to obtain the audio frames from the transcoding engine, translate them in real time, and output the translated content;
an encapsulation module 4, used by the transcoding engine to obtain the translated content from the AI speech translation engine in real time and encapsulate it with the original video and audio frames to output a live stream, either by burning the translated content into the video frames or by packaging it into subtitle frames.
Other specific execution steps are the same as in the first embodiment and are not repeated here.
The present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a method for live streaming real-time subtitle generation as in the first embodiment.
The above description is only a preferred embodiment of the present invention, and the protection scope of the invention is not limited to the above embodiments; all technical solutions within the inventive concept fall within its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the invention are also considered within its protection scope.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described, but any combination that contains no contradiction should be considered within the scope of this specification.
It should be noted that the above embodiments can be freely combined as necessary.

Claims (10)

1. A method for generating subtitles for a live stream in real time, characterized by comprising the following steps:
S1: acquiring a live broadcast source, starting a decapsulation and decoding thread through a transcoding engine, and decoding the live broadcast source to obtain video frames and audio frames;
S2: establishing an AI speech translation engine and establishing communication between the AI speech translation engine and the transcoding engine;
S3: the AI speech translation engine acquiring the audio frames from the transcoding engine, translating them in real time, and outputting translated content;
S4: the transcoding engine acquiring the translated content from the AI speech translation engine in real time and encapsulating it together with the original video frames and audio frames to output a live stream, in either of two modes: burning the translated content into the video frames, or filling and packaging the translated content into subtitle frames.
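Steps S1–S4 above can be sketched as a minimal pipeline. This is an illustrative sketch only, not the claimed implementation; the `Frame` class, `transcode_pipeline` function, and the toy translation callback are all hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str        # "audio", "video", or "subtitle"
    pts: int         # presentation timestamp
    payload: bytes

def transcode_pipeline(source_frames, translate):
    """Hypothetical sketch of claim 1: pass decoded frames through (S1),
    send audio frames to a translation engine (S3), and re-wrap the
    translated text as subtitle frames alongside the originals (S4)."""
    out = []
    for frame in source_frames:
        out.append(frame)                    # original video/audio kept
        if frame.kind == "audio":
            text = translate(frame.payload)  # S3: real-time translation
            # S4: pack the translation as a subtitle frame with the same pts
            out.append(Frame("subtitle", frame.pts, text.encode()))
    return out

frames = [Frame("video", 0, b"v0"), Frame("audio", 0, b"hello"),
          Frame("video", 40, b"v1"), Frame("audio", 40, b"world")]
# A trivial stand-in for the AI speech translation engine:
result = transcode_pipeline(frames, lambda pcm: pcm.decode().upper())
```

In a real system the translation step would be an asynchronous round trip to the engine over websocket rather than a synchronous callback.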
2. The method for generating subtitles for a live stream in real time according to claim 1, further comprising:
establishing the communication between the AI speech translation engine and the transcoding engine through websocket, specifically:
establishing a websocket server A and a websocket client B on the transcoding engine;
establishing a websocket client C and a websocket server D on the AI speech translation engine;
the websocket client C initiating an authentication request to the websocket server A; once authentication succeeds, the connection is established, and the AI speech translation engine acquires the audio frames from the transcoding engine in real time through websocket communication;
the websocket client B initiating an authentication request to the websocket server D; once authentication succeeds, the connection is established, and the transcoding engine acquires the translated content from the AI speech translation engine in real time through websocket communication.
3. The method for generating subtitles for a live stream in real time according to claim 2, wherein the authentication request initiated by a websocket client (the websocket client B or the websocket client C) to a websocket server (the websocket server A or the websocket server D) specifically comprises the following steps:
the websocket client is preconfigured with an agreed key, and encrypts the agreed key with the MD5 algorithm to obtain a first MD5 encryption key;
the websocket client appends the first MD5 encryption key to the URL request as a parameter;
after receiving the request from the websocket client, the websocket server parses out the parameter-free URL and the first MD5 encryption key;
the websocket server encrypts the agreed key with the MD5 algorithm to obtain a second MD5 encryption key;
the websocket server compares the first MD5 encryption key with the second MD5 encryption key; if they are equal, authentication succeeds, otherwise authentication fails.
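The handshake of claim 3 can be sketched with the standard library. The key value, URL, and function names here are hypothetical; MD5 is used only because the claim specifies it (it is not a recommended choice for new designs), and a real deployment would carry this over an actual websocket handshake:

```python
import hashlib
from urllib.parse import urlencode, urlparse, parse_qs

AGREED_KEY = "shared-secret"   # hypothetical pre-agreed key, known to both sides

def client_request(base_url: str) -> str:
    """Client side: append the MD5 digest of the agreed key as a URL parameter."""
    token = hashlib.md5(AGREED_KEY.encode()).hexdigest()   # first MD5 encryption key
    return base_url + "?" + urlencode({"token": token})

def server_authenticate(request_url: str) -> bool:
    """Server side: parse the token out of the request URL, recompute the
    MD5 digest of the agreed key, and compare the two digests."""
    received = parse_qs(urlparse(request_url).query).get("token", [""])[0]
    expected = hashlib.md5(AGREED_KEY.encode()).hexdigest()  # second MD5 encryption key
    return received == expected

url = client_request("ws://transcoder.local/ws")
ok = server_authenticate(url)        # succeeds: both digests match
```

The same check runs in both directions, since each engine hosts one server and one client.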
4. The method for generating subtitles for a live stream in real time according to claim 1, wherein in step S4, outputting the live stream by burning the translated content into the video frames before encapsulation with the original audio frames is suitable for streaming media servers that do not support pushing external subtitle streams.
5. The method for generating subtitles for a live stream in real time according to claim 1, wherein in step S4, outputting the live stream by filling and packaging the translated content into subtitle frames before encapsulation with the original video frames and audio frames allows free choice of any streaming media server capable of displaying subtitles from a pushed external subtitle stream.
6. The method for generating subtitles for a live stream in real time according to claim 1, further comprising:
burning the translated content into the video frames, or encapsulating it into independent subtitle frames, in one or more translation languages.
7. The method for generating subtitles for a live stream in real time according to claim 1, wherein step S1 further comprises:
the transcoding engine performing timestamp correction on the decoded video frames and audio frames, ensuring that the timestamps are aligned and continuously increasing.
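The timestamp correction of claim 7 can be sketched as follows. This is a minimal illustration under the assumption that frames arrive as (stream id, raw timestamp) pairs; the function name and rebasing policy are hypothetical:

```python
def correct_timestamps(frames):
    """Hypothetical sketch of claim 7: rebase each stream's timestamps to a
    common zero (alignment) and force them to be strictly increasing
    (continuity), so audio and video stay in step after decoding."""
    first_pts = {}   # per-stream timestamp of the first frame seen
    last_pts = {}    # per-stream last emitted timestamp
    corrected = []
    for stream, pts in frames:
        base = first_pts.setdefault(stream, pts)
        new_pts = pts - base                 # align streams to a common zero
        if stream in last_pts and new_pts <= last_pts[stream]:
            new_pts = last_pts[stream] + 1   # force continuous increase
        last_pts[stream] = new_pts
        corrected.append((stream, new_pts))
    return corrected

# A duplicated audio timestamp (940 twice) gets bumped to stay increasing:
raw = [("audio", 900), ("video", 900), ("audio", 940), ("audio", 940), ("video", 940)]
fixed = correct_timestamps(raw)
```

A production transcoder would work in the container's time base and also handle wraparound, but the alignment-plus-monotonicity invariant is the same.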
8. The method for generating subtitles for a live stream in real time according to claim 7, wherein step S3 further comprises:
the AI speech translation engine multiplexing the translated content with the timestamp carried by the corresponding audio frame, so that after the translated content is encapsulated with the original video frames and audio frames, audio, picture, and subtitles remain synchronized.
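The multiplexing of claim 8 amounts to reusing the source audio frame's timestamp on the translated text. A minimal sketch, with hypothetical names and a list-of-tuples frame representation:

```python
def attach_translation(audio_frames, translations):
    """Hypothetical sketch of claim 8: each piece of translated text is
    paired with the timestamp of the audio frame it was produced from,
    so subtitles line up with audio and picture after encapsulation."""
    subtitle_frames = []
    for (pts, _pcm), text in zip(audio_frames, translations):
        subtitle_frames.append((pts, text))   # reuse the audio frame's pts
    return subtitle_frames

audio = [(0, b"a0"), (40, b"a1"), (80, b"a2")]
subs = attach_translation(audio, ["hello", "world", "again"])
```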
9. A system for implementing the method for generating subtitles for a live stream in real time according to any one of claims 1 to 8, comprising:
a live stream decoding module, used for acquiring a live broadcast source, starting a decapsulation and decoding thread through a transcoding engine, and decoding the live broadcast source to obtain video frames and audio frames;
a communication establishing module, used for establishing an AI speech translation engine and establishing communication between the AI speech translation engine and the transcoding engine;
a translation module, used for the AI speech translation engine to acquire the audio frames from the transcoding engine, translate them in real time, and output translated content;
an encapsulation module, used for the transcoding engine to acquire the translated content from the AI speech translation engine in real time and encapsulate it together with the original video frames and audio frames to output a live stream, in either of two modes: burning the translated content into the video frames, or filling and packaging the translated content into subtitle frames.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for generating subtitles for a live stream in real time according to any one of claims 1 to 8.
CN202011072549.3A 2020-10-09 2020-10-09 Method and system for real-time subtitle generation of live stream Pending CN112188241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011072549.3A CN112188241A (en) 2020-10-09 2020-10-09 Method and system for real-time subtitle generation of live stream


Publications (1)

Publication Number Publication Date
CN112188241A true CN112188241A (en) 2021-01-05

Family

ID=73948261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011072549.3A Pending CN112188241A (en) 2020-10-09 2020-10-09 Method and system for real-time subtitle generation of live stream

Country Status (1)

Country Link
CN (1) CN112188241A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2106121A1 (en) * 2008-03-27 2009-09-30 Mundovision MGI 2000, S.A. Subtitle generation methods for live programming
CN103024503A (en) * 2012-07-05 2013-04-03 合一网络技术(北京)有限公司 System and method for achieving remote control through mobile communication equipment terminal
JP2015212731A (en) * 2014-05-01 2015-11-26 日本放送協会 Acoustic event recognition device and program
CN108063970A (en) * 2017-11-22 2018-05-22 北京奇艺世纪科技有限公司 A kind of method and apparatus for handling live TV stream
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN110381389A (en) * 2018-11-14 2019-10-25 腾讯科技(深圳)有限公司 A kind of method for generating captions and device based on artificial intelligence
CN111010614A (en) * 2019-12-26 2020-04-14 北京奇艺世纪科技有限公司 Method, device, server and medium for displaying live caption


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114040220A (en) * 2021-11-25 2022-02-11 京东科技信息技术有限公司 Live broadcasting method and device
WO2023093322A1 (en) * 2021-11-25 2023-06-01 京东科技信息技术有限公司 Live broadcast method and device
WO2023219556A1 (en) * 2022-05-13 2023-11-16 Song Peng A system and method to manage a plurality of language audio streams

Similar Documents

Publication Publication Date Title
WO2019205872A1 (en) Video stream processing method and apparatus, computer device and storage medium
CN105959772B (en) Streaming Media and the instant simultaneous display of subtitle, matched processing method, apparatus and system
JP5543504B2 (en) 3D still image service method and apparatus based on digital broadcasting
JP5903924B2 (en) Receiving apparatus and subtitle processing method
US9584837B2 (en) Receiving device and method of controlling the same, distribution device and distribution method, program, and distribution system
US20020154691A1 (en) System and process for compression, multiplexing, and real-time low-latency playback of networked audio/video bit streams
CN110708564B (en) Live transcoding method and system for dynamically switching video streams
US9860574B2 (en) Method and apparatus for transceiving broadcast signal
KR960032442A (en) Encoding / Decoding System of Image Information
CN106331853B (en) Multimedia de-encapsulation method and device
US20140208351A1 (en) Video processing apparatus, method and server
CN112188241A (en) Method and system for real-time subtitle generation of live stream
US20180109743A1 (en) Broadcasting signal transmission device, broadcasting signal reception device, broadcasting signal transmission method, and broadcasting signal reception method
CN110784730A (en) Live video data transmission method, device, equipment and storage medium
CN114040255A (en) Live caption generating method, system, equipment and storage medium
KR20130138213A (en) Methods for processing multimedia flows and corresponding devices
US7216288B2 (en) Dynamic scene description emulation for playback of audio/visual streams on a scene description based playback system
KR102518817B1 (en) Broadcast signal transmission apparatus, broadcast signal reception apparatus, broadcast signal transmission method, and broadcast signal reception method
CN113938470A (en) Method and device for playing RTSP data source by browser and streaming media server
WO2017092433A1 (en) Method and device for video real-time playback
CN116233490A (en) Video synthesis method, system, device, electronic equipment and storage medium
JP6455974B2 (en) Receiving machine
CN112055253B (en) Method and device for adding and multiplexing independent subtitle stream
JP7125692B2 (en) Broadcast service communication network distribution apparatus and method
JP4755717B2 (en) Broadcast receiving terminal device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105