CN112616062A - Subtitle display method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112616062A
CN112616062A
Authority
CN
China
Prior art keywords
audio
video data
subtitle
data stream
live
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011460160.6A
Other languages
Chinese (zh)
Other versions
CN112616062B (en)
Inventor
李秋平
刘坚
李磊
王明轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202011460160.6A
Publication of CN112616062A
Application granted
Publication of CN112616062B
Legal status: Active


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 - Server components or server architectures
    • H04N 21/218 - Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187 - Live feed
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/87 - Detection of discrete points within a voice signal
    • H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/235 - Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N 21/2355 - Processing of additional data involving reformatting operations of additional data, e.g. HTML pages
    • H04N 21/242 - Synchronization processes, e.g. processing of PCR [Program Clock References]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/4302 - Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307 - Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/431 - Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N 21/4312 - Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N 21/435 - Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N 21/4355 - Processing of additional data involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • H04N 21/47 - End-user applications
    • H04N 21/488 - Data services, e.g. news ticker
    • H04N 21/4884 - Data services for displaying subtitles
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 - Assembly of content; Generation of multimedia applications
    • H04N 21/854 - Content authoring
    • H04N 21/8547 - Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present disclosure disclose a subtitle display method and apparatus, an electronic device, and a storage medium. The method includes: collecting a live audio and video data stream in real time, and buffering the live audio and video data stream for a preset delay duration; determining, within the preset delay duration, subtitle data corresponding to the live audio and video data stream; superimposing the subtitle data onto the live audio and video data stream; and playing the live audio and video data stream carrying the subtitle data when the preset delay duration ends. The subtitle display method provided by the embodiments of the present disclosure solves the technical problems of the prior art, in which typewriter-style display leaves subtitles unstable and jittery in the live picture, makes it hard for viewers to focus, and easily causes visual fatigue. It effectively ensures both the accuracy of subtitle determination and the stability of subtitle display in the live audio and video data stream, greatly improving the user experience.

Description

Subtitle display method and device, electronic equipment and storage medium
Technical Field
The disclosed embodiments relate to the field of computer technologies, and in particular, to a subtitle display method and apparatus, an electronic device, and a storage medium.
Background
Currently, simultaneous interpretation is widely used in fields such as conferences, media events, and broadcast lectures. In particular, many cross-language live broadcasts provide simultaneous interpretation subtitles: the speaker's language is converted into the audience's language through speech recognition and machine translation technologies and displayed in the live picture in real time, so that audiences who do not understand the foreign language can still follow the live content.
In the related art, simultaneous interpretation subtitles are displayed in the live picture mainly in a typewriter fashion: as the speaker talks, the collected speech is recognized and machine-translated, and the resulting subtitles are written into the live picture at the speaker's pace. Because the sentence breaks and sentence structures produced by speech recognition are not fixed, the recognized text must be continually adjusted as more content arrives, and the machine-translated text is adjusted along with it. Displayed in this unstabilized typewriter fashion, the simultaneous interpretation subtitles jitter markedly in the live picture. The continual jitter makes viewers visually fatigued when reading the subtitles, makes the text hard to focus on, and lets the surrounding context interfere with comprehension. Moreover, each sentence of subtitle content stays on screen only briefly, so viewers often find the display has jumped to the next sentence before they finish reading the current one, resulting in a poor reading experience.
Disclosure of Invention
Embodiments of the present disclosure provide a subtitle display method and apparatus, an electronic device, and a storage medium, which can effectively ensure the stability of subtitle display in a live audio and video data stream and greatly improve the user experience.
In a first aspect, an embodiment of the present disclosure provides a subtitle display method, including:
collecting a live audio and video data stream in real time, and buffering the live audio and video data stream for a preset delay duration;
determining, within the preset delay duration, subtitle data corresponding to the live audio and video data stream;
superimposing the subtitle data onto the live audio and video data stream; and
playing the live audio and video data stream carrying the subtitle data when the preset delay duration ends.
In a second aspect, an embodiment of the present disclosure further provides a subtitle display apparatus, including:
an audio and video data buffering module, configured to collect a live audio and video data stream in real time and to buffer the live audio and video data stream for a preset delay duration;
a subtitle data determining module, configured to determine, within the preset delay duration, subtitle data corresponding to the live audio and video data stream;
a subtitle data superimposing module, configured to superimpose the subtitle data onto the live audio and video data stream; and
an audio and video data playing module, configured to play the live audio and video data stream carrying the subtitle data when the preset delay duration ends.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
one or more processing devices;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processing devices, the one or more processing devices are caused to implement the subtitle display method according to the embodiment of the present disclosure.
In a fourth aspect, the disclosed embodiments also provide a computer readable medium, on which a computer program is stored, which when executed by a processing device, implements a subtitle display method according to an embodiment of the present disclosure.
According to the embodiments of the present disclosure, a live audio and video data stream is collected in real time and buffered for a preset delay duration; within the preset delay duration, subtitle data corresponding to the live audio and video data stream is determined; the subtitle data is superimposed onto the live audio and video data stream; and when the preset delay duration ends, the live audio and video data stream carrying the subtitle data is played. By buffering the stream collected in real time, accurately determining its subtitle data within the preset delay duration, and playing the subtitled stream once the delay ends, the subtitle display method provided by the embodiments of the present disclosure solves the technical problems of the prior art, in which typewriter-style display leaves subtitles unstable and jittery in the live picture, makes it hard for viewers to focus, and easily causes visual fatigue. The method effectively ensures both the accuracy of subtitle determination and the stability of subtitle display in the live audio and video data stream, greatly improving the user experience.
Drawings
Fig. 1 is a flowchart of a subtitle display method in an embodiment of the present disclosure;
fig. 2 is a flowchart of a subtitle display method in another embodiment of the present disclosure;
fig. 3 is a flowchart of a subtitle display method in another embodiment of the present disclosure;
fig. 4 is a flowchart of a subtitle display method in another embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a subtitle display apparatus according to another embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device in another embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit its scope of protection.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
Note that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart of a subtitle display method according to an embodiment of the present disclosure. The method is applicable to displaying subtitles in a live audio and video data stream and may be executed by a subtitle display apparatus. The apparatus may be implemented in hardware and/or software and is typically integrated into a device with a subtitle display function, such as a server, a mobile terminal, or a server cluster. As shown in fig. 1, the method specifically includes the following steps:
and step 110, acquiring a live broadcast audio and video data stream in real time, and caching the live broadcast audio and video data stream based on preset delay time.
In the embodiments of the present disclosure, a live audio and video data stream is collected in real time; such streams may be collected in live scenarios such as conferences, media events, broadcast lectures, and speeches. For example, a live audio and video data stream collected in real time during a live speech contains both the audio data produced by the speaker and the video data of the speaker during the speech. The stream collected in real time is buffered for the preset delay duration and only then played, rather than being played while it is collected.
The preset delay duration may be set in advance according to actual requirements, for example to 1 min, and the user may adjust it at any time to achieve the desired effect. It may also be set according to the live scenario of the stream: different scenarios may call for different delays. The stronger the real-time requirement on the broadcast, the shorter the delay; conversely, the weaker the real-time requirement, the longer the delay. For example, if the live audio and video data stream is collected at a live broadcast of a major football match, the preset delay duration may be set to 10 s, whereas for a live academic conference it may be set to 5 min.
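The buffering step above can be sketched as a simple delay queue: each captured chunk is held until its delay window has elapsed. This is only an illustrative sketch, not the patent's implementation; the `DelayBuffer` name and its API are assumptions made for the example.

```python
import collections
import time

class DelayBuffer:
    """Holds live A/V chunks for a preset delay before releasing them.

    `delay_seconds` plays the role of the preset delay duration.
    """

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._queue = collections.deque()  # entries: (capture_time, chunk)

    def push(self, chunk, capture_time=None):
        # Record each chunk together with the moment it was captured.
        if capture_time is None:
            capture_time = time.monotonic()
        self._queue.append((capture_time, chunk))

    def pop_ready(self, now=None):
        # Release, in order, only the chunks whose delay has elapsed.
        if now is None:
            now = time.monotonic()
        ready = []
        while self._queue and now - self._queue[0][0] >= self.delay:
            ready.append(self._queue.popleft()[1])
        return ready

buf = DelayBuffer(delay_seconds=5)
buf.push("chunk-0", capture_time=0)
buf.push("chunk-1", capture_time=1)
print(buf.pop_ready(now=4))  # [] -- still inside the delay window
print(buf.pop_ready(now=5))  # ['chunk-0']
print(buf.pop_ready(now=6))  # ['chunk-1']
```

During the time a chunk sits in the queue, the subtitle pipeline of step 120 can run on it, which is what makes stable subtitles possible.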
Step 120: within the preset delay duration, determine subtitle data corresponding to the live audio and video data stream.
In the embodiments of the present disclosure, subtitle data corresponding to the live audio and video data stream is determined within the preset delay duration. The subtitle data may be the result of performing speech recognition on the voice information contained in the stream (a recognized text in the same language as the speech), a translated text in a target language obtained by translating the recognized text, or data composed of both the recognized text and the translated text. Optionally, the subtitle data may also be a corrected version of the recognized text and/or of the translated text.
Optionally, determining the subtitle data corresponding to the live audio and video data stream within the preset delay duration includes: extracting an audio data stream from the live audio and video data stream within the preset delay duration, and determining the subtitle data based on that audio data stream. Specifically, the audio data stream may be extracted directly from the whole stream, or the audio and video data streams may be extracted in parallel in a multi-threaded manner. For example, an audio data stream is extracted from the live audio and video data stream and recognized using Automatic Speech Recognition (ASR) technology, and the recognition result is used as the subtitle data. As another example, the audio data stream is fed into a simultaneous interpretation model, and the subtitle data is determined from the model's output. The simultaneous interpretation model may comprise a speech recognition model and a translation model: the speech recognition model recognizes the speech in the audio data to obtain a recognized text, the translation model translates the recognized text to obtain a translated text, and the recognized text and the translated text together serve as the subtitle data corresponding to the live audio and video data stream.
Optionally, determining the subtitle data based on the audio data stream includes: performing speech recognition on the audio data stream to generate subtitle data in a first language; translating the subtitle data in the first language into subtitle data in a second language; and determining both together as the subtitle data corresponding to the live audio and video data stream. For example, the audio data stream is speech-recognized based on ASR technology to generate subtitle data in the first language, where the first language is the language of the speech in the audio data. The first-language subtitle data is then translated, based on machine translation technology, into subtitle data in a second language; the second language differs from the first and may be one language or several. The first and second languages may be, for example, Chinese, English, French, German, or Korean. The first-language and second-language subtitle data are jointly determined as the subtitle data corresponding to the live audio and video data stream, i.e., its simultaneous interpretation subtitle data.
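The recognize-then-translate pipeline described above can be sketched as follows. The `recognize_speech` and `translate_text` functions are hypothetical stand-ins for a real ASR engine and machine-translation model (here implemented as lookup stubs); only the data flow, first-language text plus second-language text per audio chunk, reflects the method.

```python
def recognize_speech(audio_chunk):
    # Stand-in for an ASR engine: returns first-language text.
    return {"audio-0": "大家好"}.get(audio_chunk, "")

def translate_text(text):
    # Stand-in for a machine-translation model: first -> second language.
    return {"大家好": "Hello everyone"}.get(text, "")

def make_subtitles(audio_stream):
    # For each audio chunk: recognize (first language), then translate
    # (second language), and keep both as the subtitle payload.
    subtitles = []
    for chunk in audio_stream:
        first = recognize_speech(chunk)
        second = translate_text(first)
        subtitles.append({"first_lang": first, "second_lang": second})
    return subtitles

print(make_subtitles(["audio-0"]))
```

In a real deployment the stubs would be replaced by calls into the speech recognition and translation models of the simultaneous interpretation pipeline.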
Optionally, determining the first-language and second-language subtitle data as the subtitle data corresponding to the live audio and video data stream includes: correcting the first-language and second-language subtitle data according to received correction information, and determining the corrected first-language and second-language subtitle data as the subtitle data corresponding to the live audio and video data stream. The advantage of this arrangement is that accurate simultaneous interpretation subtitles for the live audio and video data stream can be obtained.
For example, when the speaker's pronunciation is non-standard or the speech contains technical terms, the first-language subtitle data obtained by speech recognition may be inaccurate, and the second-language subtitle data generated by translating it may be inaccurate in turn. The first-language and/or second-language subtitle data may therefore be corrected based on correction information entered by an interpreter. For instance, the first-language subtitle data may be corrected first and the corrected text re-translated into the second language by machine translation; if the interpreter still considers the second-language subtitle data inaccurate, it may be corrected based on received correction information targeting the second language. Alternatively, the second-language subtitle data may be corrected first according to received correction information, and the first-language subtitle data then revised based on the corrected second-language text.
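The correct-then-retranslate flow can be sketched as below. The `corrections` mapping and the `retranslate` callback are illustrative assumptions; an actual system would receive correction information from the interpreter's console and call back into the translation model.

```python
def apply_corrections(first_lang_text, corrections, retranslate):
    # Apply interpreter-supplied corrections to the first-language text,
    # then regenerate the second-language text by re-translating.
    # `corrections` maps misrecognized phrases to their corrected form;
    # `retranslate` stands in for a machine-translation call.
    for wrong, right in corrections.items():
        first_lang_text = first_lang_text.replace(wrong, right)
    return first_lang_text, retranslate(first_lang_text)

# Hypothetical example: a term was misrecognized in lowercase, and the
# "translation" here is just an uppercasing stub.
fixed, translated = apply_corrections(
    "ai models", {"ai": "AI"}, lambda t: t.upper())
print(fixed, "/", translated)  # AI models / AI MODELS
```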
Step 130: superimpose the subtitle data onto the live audio and video data stream.
In the embodiments of the present disclosure, the subtitle data is superimposed onto the corresponding live audio and video data stream so that it corresponds to, and is synchronized with, the voice information in the stream. Specifically, the subtitle data is superimposed so that, when the live audio and video data stream is played in the live broadcast picture, the subtitle data is displayed in a preset region of that picture.
Optionally, before superimposing the subtitle data onto the live audio and video data stream, the method further includes acquiring the timestamp of the subtitle data, and the superimposing then includes: superimposing the subtitle data onto the live audio and video data stream according to the timestamp, so as to synchronize the subtitle data with the stream. Specifically, the timestamp of a piece of subtitle data is its start and end time within the live audio and video data stream, and it can be acquired while the audio data stream is being recognized to determine the subtitle data. The subtitle data is added into the corresponding stream according to its own timestamp and the stream's timestamps, so that its display time is aligned, i.e. synchronized, with the stream's playback time.
For example, when the subtitle data corresponding to the live audio and video data stream comprises first-language and second-language subtitle data, a first timestamp of the first-language data and a second timestamp of the second-language data are acquired respectively, and both are synchronized to the corresponding live audio and video data stream according to the correspondence between the first timestamp, the second timestamp, and the stream's timestamps.
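The timestamp-based alignment can be sketched as an interval-overlap match: each subtitle cue is attached to the A/V chunks whose time range it overlaps, so its display interval lines up with the speech it transcribes. The tuple-based data model here is an assumption for illustration, not the patent's container format.

```python
def overlay_by_timestamp(av_chunks, subtitle_cues):
    # av_chunks: list of (start, end, payload);
    # subtitle_cues: list of (start, end, text), on the same clock.
    result = []
    for a_start, a_end, payload in av_chunks:
        # A cue belongs to a chunk if their time ranges overlap.
        texts = [text for s_start, s_end, text in subtitle_cues
                 if s_start < a_end and s_end > a_start]
        result.append((a_start, a_end, payload, texts))
    return result

chunks = [(0, 2, "av0"), (2, 4, "av1")]
cues = [(0.5, 1.5, "hello"), (2.5, 3.5, "world")]
print(overlay_by_timestamp(chunks, cues))
```

With first-language and second-language cue lists, the same matching would simply be run once per language against the stream's timestamps.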
Step 140: when the preset delay duration ends, play the live audio and video data stream carrying the subtitle data.
In the embodiments of the present disclosure, when the preset delay duration ends, the live audio and video data stream carrying the subtitle data is played. When watching the stream in the playback interface, the user not only sees the video picture and hears the voice information, but also sees the subtitle data displayed in synchrony with that voice information.
For example, when the preset delay duration ends, the subtitled stream is played at the playback speed or collection frequency of the live audio and video data stream; in effect, the stream as a whole, subtitles included, is played with the preset delay. Specifically, if the live audio and video data collected in real time spans 8:00-9:00 and the preset delay duration is 5 min, the subtitled live audio and video data is played during 8:05-9:05. For ease of understanding, the stream can be split into 10 pieces of live audio and video data, each corresponding to one piece of subtitle data. If the first piece is collected at 8:00 and lasts 1 min, and the second piece is collected at 8:01 and lasts 2 min, then with a 5 min delay the first subtitled piece is played at 8:05 for 1 min, the second at 8:06 for 2 min, and so on.
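The delayed-playback arithmetic in the example above amounts to shifting every capture time by the whole preset delay, which preserves the original pacing. A minimal sketch (the function name is an assumption):

```python
from datetime import datetime, timedelta

def playback_schedule(capture_times, delay):
    # Each segment plays at its real-time capture moment plus the preset
    # delay, so the original pacing (e.g. one segment per minute) is kept.
    return [t + delay for t in capture_times]

captures = [datetime(2020, 12, 1, 8, 0), datetime(2020, 12, 1, 8, 1)]
for t in playback_schedule(captures, timedelta(minutes=5)):
    print(t.strftime("%H:%M"))  # 08:05, then 08:06
```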
Specifically, the live audio and video data stream collected in real time is buffered for the preset delay duration, during which the subtitle data corresponding to the stream is accurately determined; after the preset delay duration ends, the live audio and video data stream with stable and accurate subtitle data superimposed is played. This effectively solves the technical problems in the prior art that subtitles displayed in a typewriter mode are unstable and jitter heavily in the live picture, making it hard for the user's eyes to focus and easily causing visual fatigue, and that subtitle content is retained only briefly, giving viewers a poor actual reading experience.
According to the embodiment of the present disclosure, a live audio and video data stream is collected in real time and buffered based on a preset delay duration; subtitle data corresponding to the stream is determined within the preset delay duration; the subtitle data is superimposed onto the stream; and when the preset delay duration ends, the live audio and video data stream carrying the subtitle data is played. In the subtitle display method provided by the embodiment of the present disclosure, the live stream collected in real time is buffered for the preset delay duration, the corresponding subtitle data is accurately determined within that duration, and the stream carrying the subtitle data is played when the delay ends; this effectively ensures the accuracy of subtitle determination and the stability of subtitles displayed in the live audio and video data stream, greatly improving the user experience.
In some embodiments, superimposing the subtitle data onto the live audio and video data stream comprises: superimposing the subtitle data belonging to the second language and the subtitle data belonging to the first language onto the live audio and video data stream such that the two are in vertical correspondence. The advantage of this arrangement is that the subtitle data corresponding to the audience's language (the translation) is displayed above, and the subtitle data corresponding to the speaker's language below, in corresponding positions; the key content is prominent, the display is concise and clear, and the user's reading experience is greatly improved.
Illustratively, the first language is the language of the voice information in the live audio and video data stream; that is, the subtitle data in the first language is the text corresponding to the speaker's language in the stream, while the subtitle data in the second language is the text in the language used by the user watching the simultaneously interpreted live stream. Accordingly, the subtitle data in the second language and the subtitle data in the first language can be superimposed onto the live audio and video data stream in vertical correspondence. It can be understood that, when the live audio and video data stream carrying the bilingual subtitle data is played in the live picture, the second-language subtitle data is presented above the first-language subtitle data, with the two in one-to-one correspondence.
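The vertical stacking of bilingual subtitle pairs can be sketched as a minimal Python snippet (not part of the original disclosure; the function name and cue texts are illustrative — the audience-language line is rendered above the speaker-language line):

```python
def stack_bilingual(cue_pairs):
    # Each pair is (first-language text, second-language text); the
    # second-language (audience) line is placed above the first-language
    # (speaker) line, in one-to-one correspondence.
    return ["{}\n{}".format(second, first) for first, second in cue_pairs]

rendered = stack_bilingual([("你好，世界", "Hello, world")])
```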
In some embodiments, before performing speech recognition on the audio data stream to generate subtitle data belonging to the first language, the method further includes: determining a target language of speech recognition according to a language switching instruction input by a user. Performing speech recognition on the audio data stream to generate subtitle data belonging to the first language then includes: performing speech recognition on the audio data stream based on the target language to generate subtitle data belonging to the first language, where the first language is the same as the target language. The advantage of this is that the recognition language can be switched with a single action as the language of the speaker in the live broadcast changes, which improves the speed and accuracy of subtitle determination, further ensures audio-picture synchronization in the live broadcast, and lets the user experience real-time, smooth subtitles throughout.
Illustratively, consider a live scene in which a Chinese speaker A interviews an English speaker B. While A speaks, the audio data stream contained in the live audio and video data stream collected in real time is Chinese speech; while B speaks, it is English speech. Therefore, the speech recognition language can be switched at the moment the speech turns from A to B. Specifically, a language switching instruction input by a user is received, and the switched-to language is used as the target language of speech recognition; that is, speech recognition on the audio data stream in the live audio and video data stream is performed based on the target language. Illustratively, when the Chinese speech of A switches to the English speech of B, according to the language switching instruction input by the user, namely an instruction to switch from Chinese to English, English is taken as the current target language for speech recognition. This effectively avoids erroneous recognition results caused by continuing to recognize English speech content on the basis of Chinese.
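The single-action language switch for speech recognition can be sketched as a minimal Python snippet (not part of the original disclosure; the class name is hypothetical and the per-language recognizers are stubs standing in for real recognition engines):

```python
class SwitchableRecognizer:
    # Holds the current target language; a user's language switching
    # instruction replaces it, and recognition is dispatched accordingly.
    def __init__(self, recognizers, target_language):
        self.recognizers = recognizers  # language code -> recognize callable
        self.target_language = target_language

    def switch(self, language):
        # language switching instruction input by the user
        self.target_language = language

    def recognize(self, audio_chunk):
        return self.recognizers[self.target_language](audio_chunk)

recognizers = {"zh": lambda chunk: "zh:" + chunk,
               "en": lambda chunk: "en:" + chunk}
rec = SwitchableRecognizer(recognizers, "zh")
```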
In some embodiments, before superimposing the subtitle data onto the live audio and video data stream, the method further comprises: segmenting the subtitle data in a preset manner to generate at least one piece of sub-subtitle data; and determining the start playing time and the end playing time of each piece of sub-subtitle data. Superimposing the subtitle data onto the live audio and video data stream then comprises: superimposing each piece of sub-subtitle data onto the corresponding live audio and video data stream based on its start playing time and end playing time. The advantage of this is that the subtitle data can be displayed sentence by sentence in the live picture, with moderate length, realizing a "cinema-level" subtitle presentation and greatly improving the user's experience of watching live audio and video.
Specifically, the subtitle data corresponding to the live audio and video data stream determined within the preset delay duration may be very long; if, when the stream is played, that subtitle data were displayed as one whole block in the live picture, the long subtitle would easily interfere with the user and give a poor actual reading experience. Therefore, in the embodiment of the present disclosure, in the process of determining the subtitle data corresponding to the live audio and video data stream within the preset delay duration, the subtitle data is segmented in a preset manner to generate at least one piece of sub-subtitle data. Optionally, segmenting the subtitle data in the preset manner includes: in the process of determining the subtitle data corresponding to the live audio and video data stream, segmenting the subtitle data based on Voice Activity Detection (VAD); and/or segmenting the subtitle data based on a knowledge graph; or segmenting the subtitle data based on a preset character count, so that each piece of sub-subtitle data contains that number of characters. Specifically, segmenting based on VAD splits the audio data in the live audio and video data stream according to the pauses between utterances; that is, the positions where sentences should break are judged, via speech recognition, from the waveform of the audio in the audio data stream, so that the subtitle data corresponding to the whole stream is segmented into at least one piece of sub-subtitle data. Segmenting based on the knowledge graph keeps proper nouns within the same piece of sub-subtitle data as far as possible while keeping the length of each generated piece moderate.
Segmenting the subtitle data based on the preset character count means that each piece of segmented sub-subtitle data contains the preset number of characters, so that the segmented pieces have a fixed, moderate length. In the embodiment of the present disclosure, the subtitle data can also be segmented based on the VAD mode and the knowledge graph at the same time: specifically, the subtitle data can first be segmented based on VAD, and the segmentation points then adjusted based on the knowledge graph, so that the generated sub-subtitle data is moderate in length, reasonably segmented, and consistent with the way the speaker talks.
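The character-count segmentation, with a knowledge-graph-style adjustment of the split points, can be sketched as a minimal Python snippet (not part of the original disclosure; a plain set of known proper nouns stands in for a real knowledge graph, and the function names are illustrative):

```python
def segment_by_char_count(text, n):
    # Split subtitle text into pieces of at most n characters each.
    return [text[i:i + n] for i in range(0, len(text), n)]

def adjust_for_terms(pieces, terms):
    # Merge adjacent pieces when a known proper noun straddles the boundary,
    # so the term ends up inside a single piece of sub-subtitle data.
    out = list(pieces)
    i = 0
    while i < len(out) - 1:
        joined = out[i] + out[i + 1]
        straddles = any(t in joined and t not in out[i] and t not in out[i + 1]
                        for t in terms)
        if straddles:
            out[i:i + 2] = [joined]
        else:
            i += 1
    return out
```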
In the embodiment of the present disclosure, the start playing time and the end playing time of each piece of sub-subtitle data are determined from the timestamps of the subtitle data: for example, the start time of the timestamp of the first character of a piece may be used as its start playing time, and the end time of the timestamp of its last character as its end playing time. Each piece of sub-subtitle data is then superimposed onto the corresponding live audio and video data stream according to its start and end playing times. The advantage of this is that, when the live audio and video data stream is played, the subtitle data presented in it is guaranteed to be of moderate length, making it easy for the user to watch, digest and understand, and giving the user a cinema-level subtitle experience similar to watching a film in a cinema.
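The derivation of a piece's playing window from its character timestamps can be sketched as a minimal Python snippet (not part of the original disclosure; the function name and the seconds-based timestamps are illustrative):

```python
def piece_play_window(char_timestamps):
    # char_timestamps: (start, end) timestamps, in seconds, for each character
    # of one piece of sub-subtitle data. The start of the first character's
    # timestamp is the piece's start playing time; the end of the last
    # character's timestamp is its end playing time.
    return char_timestamps[0][0], char_timestamps[-1][1]

window = piece_play_window([(12.0, 12.2), (12.2, 12.5), (12.5, 12.9)])
```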
In some embodiments, in the process of determining the subtitle data corresponding to the live audio and video data stream within the preset delay time, the method further includes: judging whether an abnormal audio-video data stream exists in the live audio-video data stream; when the abnormal audio and video data stream exists in the live audio and video data stream, pausing the operation of determining the subtitle data corresponding to the live audio and video data stream, and acquiring the video advertisement with the same playing time length as the abnormal audio and video data stream; and replacing abnormal audio and video data streams in the live audio and video data streams based on the video advertisements. The method has the advantages that when abnormal audio and video data streams exist in the live audio and video data streams, the playing of the abnormal audio and video data streams can be effectively avoided in a video advertisement insertion mode, the live sound and picture synchronization can be continuously guaranteed, a user can feel real-time and smooth subtitles in the whole process, and the live watching experience of the user is improved.
Illustratively, in the process of collecting the live audio and video data stream in real time, the collected stream may become abnormal because the camera moves or the power fails briefly; for example, a sudden transient power failure may leave the collected stream containing an abnormal audio and video data stream (one whose picture is completely black) for part of the period. Therefore, when it is determined that an abnormal audio and video data stream exists in the live stream, the operation of determining the corresponding subtitle data is suspended. This effectively prevents the timestamps of the subtitle data determined for the normal live audio and video data stream from becoming misaligned with the timestamps of the normal stream itself, and improves the accuracy of timestamp calibration when the subtitle data is superimposed onto the stream. The duration of the suspension equals the playing duration of the abnormal audio and video data stream. A video advertisement with the same playing duration as the abnormal stream is then obtained, and the abnormal audio and video data stream in the live stream is covered or replaced by the advertisement, so that the abnormal stream is never played and the user experience is improved.
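The replacement of abnormal segments by same-duration advertisements can be sketched as a minimal Python snippet (not part of the original disclosure; the dict-based segment representation, the "abnormal" flag, and the duration-keyed ad lookup are illustrative assumptions):

```python
def patch_abnormal_segments(segments, ads_by_duration):
    # segments: dicts with "duration" (seconds), "abnormal" flag (e.g. an
    # all-black picture after a transient power failure), and "content".
    # Each abnormal segment is replaced by an advertisement with exactly
    # the same playing duration, so the stream's timeline is preserved.
    patched = []
    for seg in segments:
        if seg["abnormal"]:
            patched.append({"duration": seg["duration"],
                            "abnormal": False,
                            "content": ads_by_duration[seg["duration"]]})
        else:
            patched.append(seg)
    return patched

stream = [{"duration": 60, "abnormal": False, "content": "live"},
          {"duration": 30, "abnormal": True, "content": "black frames"}]
patched = patch_abnormal_segments(stream, {30: "ad-30s"})
```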
In some embodiments, the method further comprises: determining the speed at which the user modifies the subtitle data, and adjusting the preset delay duration based on that modification speed; or determining live scene information of the live audio and video data stream, and adjusting the preset delay duration based on the live scene information. The advantage of this arrangement is that the preset delay duration can be flexibly adjusted as required, further ensuring audio-picture synchronization in the live broadcast.
Illustratively, when the determined subtitle data corresponding to the live audio and video data stream is not accurate enough, the subtitle data can be modified and corrected manually within the preset delay duration to ensure its accuracy. If the preset delay duration is short and the user modifies the subtitle data slowly, the live audio and video data stream is liable to be played before accurate subtitle data has been obtained, so that the subtitle data presented in the live picture cannot accurately reflect the real voice information expressed by the speaker. Therefore, the speed at which the user modifies subtitle data can be determined: for example, the user's modification speeds over a historical period are obtained from the server and averaged, and that average taken as the user's current modification speed. The preset delay duration is adjusted based on this modification speed: for example, when the modification speed is greater than a preset speed threshold, the preset delay duration may be appropriately reduced, and when it is less than the threshold, the preset delay duration may be appropriately increased.
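The history-averaged modification speed and the threshold comparison described above can be sketched as a minimal Python snippet (not part of the original disclosure; the threshold, step size, and units are hypothetical):

```python
def adjusted_delay(historical_speeds, current_delay, speed_threshold, step):
    # Average the user's historical modification speeds as the current speed;
    # a fast corrector permits a shorter preset delay, a slow one a longer.
    average = sum(historical_speeds) / len(historical_speeds)
    if average > speed_threshold:
        return max(current_delay - step, step)  # shrink, but stay positive
    if average < speed_threshold:
        return current_delay + step
    return current_delay

# Speeds in e.g. characters corrected per second; delay and step in seconds.
delay = adjusted_delay([4.0, 6.0], current_delay=300, speed_threshold=3.0, step=60)
```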
For example, the shorter the preset delay duration, the better the real-time performance of the live audio and video data stream; the longer the preset delay duration, the worse that real-time performance. Different live scenes place different requirements on real-time performance: for example, live broadcasts of football matches and media events have higher real-time requirements, while live broadcasts of academic conferences and speech contests have relatively lower ones. Therefore, the live scene information of the live audio and video data stream can be determined, and the preset delay duration adjusted according to it. Specifically, a pull-down menu containing various live scenes can be provided, and the live scene information of the current stream determined from the user's click operation; the live scene information may also be determined from the audio data and/or video data in the stream. The preset delay duration is then adjusted according to the live scene information: when the scene has a higher real-time requirement, the preset delay duration can be reduced; when it has a lower one, the preset delay duration can be increased. Optionally, a target delay duration corresponding to the current live scene information may be determined from a correspondence table between live scenes and preset delay durations, and the preset delay duration adjusted to that target.
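The scene-to-delay correspondence table can be sketched as a minimal Python snippet (not part of the original disclosure; the scene names, delay values, and default are hypothetical):

```python
# Correspondence table between live scenes and preset delay durations (seconds).
SCENE_DELAY_SECONDS = {
    "football_match": 60,        # high real-time requirement -> short delay
    "media_event": 60,
    "academic_conference": 300,  # lower real-time requirement -> longer delay
    "speech_contest": 300,
}

def delay_for_scene(scene, default=180):
    # Look up the target delay duration for the current live scene; fall
    # back to a middle-of-the-road default for unlisted scenes.
    return SCENE_DELAY_SECONDS.get(scene, default)
```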
Fig. 2 is a flowchart of a subtitle display method according to another embodiment of the present disclosure, and as shown in fig. 2, the method includes the following steps:
and step 210, acquiring the live broadcast audio and video data stream in real time, and caching the live broadcast audio and video data stream based on preset delay time.
Optionally, the subtitle display method further includes: determining the modification speed of the subtitle data by the user, and adjusting the preset delay time length based on the modification speed; or determining live scene information of the live audio and video data stream, and adjusting the preset delay time length based on the live scene information.
Step 220, extracting the audio data stream from the live audio and video data stream within the preset delay time.
Step 230, performing voice recognition on the audio data stream to generate subtitle data belonging to the first language.
Step 240, the caption data belonging to the first language is translated into the caption data belonging to the second language.
Optionally, the second language is a different language from the first language, and the second language may be one or more languages.
And step 250, determining the caption data belonging to the first language and the caption data belonging to the second language as the caption data corresponding to the live broadcast audio-video data stream.
And step 260, overlapping the subtitle data belonging to the second language and the subtitle data belonging to the first language on the live broadcast audio-video data stream according to the mode that the subtitle data belonging to the second language and the subtitle data belonging to the first language are in the up-down corresponding relation.
Optionally, a first timestamp of the subtitle data belonging to the first language and a second timestamp of the subtitle data belonging to the second language may be obtained respectively, based on the first timestamp and the second timestamp, the subtitle data belonging to the second language and the subtitle data belonging to the first language are correspondingly superimposed on the live audio-video data stream in a manner that the subtitle data belonging to the second language and the subtitle data belonging to the first language are in a vertical correspondence, and the upper subtitle data and the live audio-video data stream are synchronized.
And 270, when the preset delay time length is over, playing the live broadcast audio-video data stream with the subtitle data.
According to the subtitle display method provided by the embodiment of the present disclosure, the live audio and video data stream collected in real time is buffered based on the preset delay duration, the simultaneous interpretation subtitle data (the subtitle data in the first language and in the second language) corresponding to the stream is accurately determined within that duration, and the stream carrying the subtitle data is played when the delay ends. This solves the technical problems in the prior art that subtitles displayed in a typewriter mode are unstable and jitter heavily in the live picture, making the user's eyes hard to focus and easily causing visual fatigue; it effectively ensures the accuracy of subtitle determination and the stability of subtitles displayed in the live audio and video data stream, greatly improving the user experience. In addition, the subtitle data corresponding to the audience's language (the second-language subtitle data) can be displayed above the subtitle data corresponding to the speaker's language (the first-language subtitle data), in corresponding positions; the key content is prominent, the display is concise and clear, and the user's reading experience is greatly improved.
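The optional timestamp-based correspondence of step 260, where each second-language cue is stacked above the first-language cue sharing its timestamp so both stay synchronized with the stream, can be sketched as a minimal Python snippet (not part of the original disclosure; the dict-based cue representation and function name are illustrative):

```python
def pair_cues_by_timestamp(first_lang_cues, second_lang_cues):
    # Cues are dicts with "ts" (timestamp, seconds) and "text". A
    # second-language cue is placed above ("top") the first-language cue
    # ("bottom") that carries the same timestamp.
    translations = {cue["ts"]: cue["text"] for cue in second_lang_cues}
    return [{"ts": cue["ts"],
             "top": translations.get(cue["ts"], ""),
             "bottom": cue["text"]}
            for cue in first_lang_cues]

paired = pair_cues_by_timestamp(
    [{"ts": 10.0, "text": "原文"}],
    [{"ts": 10.0, "text": "translation"}],
)
```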
Fig. 3 is a flowchart of a subtitle display method according to another embodiment of the present disclosure, and as shown in fig. 3, the method includes the following steps:
and 310, acquiring a live audio and video data stream in real time, and caching the live audio and video data stream based on preset delay time.
And step 320, extracting the audio data stream from the live audio and video data stream within the preset delay time.
Step 330, determining the target language of the voice recognition according to the language switching instruction input by the user.
Step 340, performing voice recognition on the audio data stream based on the target language to generate subtitle data belonging to the first language; wherein the first language is the same as the target language.
Step 350, translating the caption data belonging to the first language into the caption data belonging to the second language.
And step 360, modifying the subtitle data belonging to the first language and the subtitle data belonging to the second language according to the received modification information.
Step 370, determining the modified subtitle data belonging to the first language and the modified subtitle data belonging to the second language as the subtitle data corresponding to the live broadcast audio-video data stream.
Step 380, obtaining the time stamp of the caption data corresponding to the live audio and video data stream.
And 390, superimposing the subtitle data on the live audio and video data stream according to the timestamp so as to synchronize the subtitle data with the live audio and video data stream.
And 3100, when the preset delay time length is over, playing the live audio and video data stream with the subtitle data.
According to the subtitle display method provided by the embodiment of the present disclosure, the live audio and video data stream collected in real time is buffered based on the preset delay duration; within that duration, the simultaneous interpretation subtitle data corresponding to the stream is obtained through speech recognition and machine translation, and sufficient time is left for manual correction and proofreading of the simultaneous interpretation subtitles, fully guaranteeing the readability and accuracy of the finally displayed subtitles and reaching the level of "manual subtitles". In addition, the recognition language can be switched with a single action as the language of the speaker in the live broadcast changes, which improves the speed and accuracy of subtitle determination, further ensures audio-picture synchronization, and lets the user experience real-time, smooth subtitles throughout.
Fig. 4 is a flowchart of a subtitle display method according to another embodiment of the present disclosure, and as shown in fig. 4, the method includes the following steps:
and step 410, acquiring the live broadcast audio and video data stream in real time, and caching the live broadcast audio and video data stream based on preset delay time.
And step 420, determining subtitle data corresponding to the live audio and video data stream within the preset delay time.
And 430, segmenting the subtitle data according to a preset mode to generate at least one piece of sub subtitle data.
In step 440, the start playing time and the end playing time of each piece of subtitle data are determined.
And step 450, based on the starting playing time and the ending playing time, superposing each piece of sub subtitle data to the corresponding live broadcast audio and video data stream.
Step 460, when the preset delay time length is over, playing the live broadcast audio-video data stream with the caption data.
According to the subtitle display method provided by the embodiment of the present disclosure, the delayed playing strategy effectively ensures the accuracy of subtitle determination and the stability of subtitles displayed in the live audio and video data stream, and the segmentation of the subtitle data allows subtitles to be presented sentence by sentence in the live picture with moderate length, realizing a "cinema-level" subtitle presentation and greatly improving the user's experience of watching live audio and video.
Fig. 5 is a schematic structural diagram of a subtitle display apparatus according to another embodiment of the present disclosure. As shown in fig. 5, the apparatus includes: an audio/video data buffer module 510, a caption data determination module 520, a caption data overlay module 530, and an audio/video data play module 540.
The audio and video data caching module 510 is configured to collect a live audio and video data stream in real time and cache the live audio and video data stream based on a preset delay time;
a caption data determining module 520, configured to determine caption data corresponding to the live audio and video data stream within the preset delay time;
a caption data overlaying module 530, configured to overlay the caption data onto the live audio and video data stream;
and the audio/video data playing module 540 is configured to play the live audio/video data stream with the subtitle data when the preset delay time duration is over.
According to the embodiment of the disclosure, live audio and video data streams are collected in real time, and the live audio and video data streams are cached based on preset delay time; determining subtitle data corresponding to the live audio and video data stream within a preset delay time; superposing subtitle data to a live audio-video data stream; and when the preset delay time length is over, playing the live broadcast audio-video data stream of the caption data. According to the subtitle display scheme provided by the embodiment of the disclosure, the live audio and video data stream collected in real time is cached based on the preset delay time, the subtitle data corresponding to the live audio and video data stream is accurately determined within the preset delay time, and the live audio and video data stream with the subtitle data is played when the delay time is over, so that the technical problems that in the prior art, the subtitle is displayed in a typewriter mode, the subtitle has large jitter in a live broadcast picture due to instability, the vision of a user is difficult to focus, and the visual fatigue is easy to occur can be solved, the accuracy of subtitle determination and the stability of displaying the subtitle in the live audio and video data stream can be effectively guaranteed, and the user experience is greatly improved.
Optionally, the subtitle data determining module includes:
the audio data extraction unit is used for extracting an audio data stream from the live audio and video data stream within the preset delay time;
and the subtitle data determining unit is used for determining subtitle data corresponding to the live audio and video data stream based on the audio data stream.
Optionally, the subtitle data determining unit includes:
the voice recognition subunit is used for performing voice recognition on the audio data stream to generate subtitle data belonging to a first language;
a translation subunit, configured to translate the subtitle data in the first language into subtitle data in a second language;
and the subtitle data determining subunit is used for determining the subtitle data belonging to the first language and the subtitle data belonging to the second language as the subtitle data corresponding to the live audio and video data stream.
Optionally, the subtitle data determining subunit is configured to:
correcting the subtitle data belonging to the first language and the subtitle data belonging to the second language according to the received correction information;
and determining the modified subtitle data belonging to the first language and the modified subtitle data belonging to the second language as the subtitle data corresponding to the live broadcast audio-video data stream.
Optionally, the subtitle data overlaying module includes:
and superposing the subtitle data belonging to the second language and the subtitle data belonging to the first language on the live broadcast audio-video data stream according to the mode that the subtitle data belonging to the second language and the subtitle data belonging to the first language are in the up-down corresponding relation.
Optionally, the apparatus further comprises:
the language switching module is used for determining a target language of voice recognition according to a language switching instruction input by a user before performing voice recognition on the audio data stream and generating subtitle data belonging to a first language;
the speech recognition subunit is configured to:
performing voice recognition on the audio data stream based on the target language to generate subtitle data belonging to a first language; wherein the first language is the same as the target language.
Optionally, the apparatus further comprises:
the time stamp obtaining module is used for obtaining the time stamp of the caption data corresponding to the live audio and video data stream before the caption data is superposed on the live audio and video data stream;
the caption data superposition module is used for:
and superposing the subtitle data to the live audio and video data stream according to the timestamp so as to synchronize the subtitle data with the live audio and video data stream.
Optionally, in the process of determining the subtitle data corresponding to the live audio and video data stream within the preset delay time, the method further includes:
judging whether an abnormal audio-video data stream exists in the live audio-video data stream;
when the abnormal audio and video data stream exists in the live audio and video data stream, pausing the operation of determining the subtitle data corresponding to the live audio and video data stream, and acquiring the video advertisement with the same playing time length as the abnormal audio and video data stream;
and replacing abnormal audio and video data streams in the live audio and video data streams based on the video advertisements.
Optionally, the apparatus further comprises:
the first adjusting module is used for determining the speed at which a user modifies the subtitle data and adjusting the preset delay duration based on the modification speed; or,
and the second adjusting module is used for determining live scene information of the live audio and video data stream and adjusting the preset delay time length based on the live scene information.
Optionally, the apparatus further comprises:
the subtitle segmentation module is used for segmenting the subtitle data according to a preset mode before the subtitle data is superposed on the live audio-video data stream to generate at least one piece of sub subtitle data;
the time determining module is used for determining the starting playing time and the ending playing time of each piece of sub subtitle data;
the caption data superposition module is used for:
and based on the starting playing time and the ending playing time, superposing each piece of sub-subtitle data to the corresponding live broadcast audio-video data stream.
The apparatus can execute the methods provided by all of the foregoing embodiments of the present disclosure and has the corresponding functional modules and beneficial effects for executing those methods. For technical details not described in detail in this embodiment, reference may be made to the methods provided in all of the foregoing embodiments of the present disclosure.
Referring now to FIG. 6, a block diagram of an electronic device 300 suitable for implementing embodiments of the present disclosure is shown. The electronic device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), or a vehicle terminal (e.g., a car navigation terminal), a fixed terminal such as a digital TV or a desktop computer, or various forms of servers such as a stand-alone server or a server cluster. The electronic device shown in fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 300 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data necessary for the operation of the electronic device 300. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 6 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 309, installed from the storage device 308, or installed from the ROM 302. When executed by the processing device 301, the computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: collecting live broadcast audio and video data streams in real time, and caching the live broadcast audio and video data streams based on preset delay time; determining subtitle data corresponding to the live audio and video data stream within the preset delay time; superposing the subtitle data to the live audio and video data stream; and when the preset delay time length is over, playing the live broadcast audio-video data stream with the subtitle data.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, an embodiment of the present disclosure provides a subtitle display method, including:
collecting live broadcast audio and video data streams in real time, and caching the live broadcast audio and video data streams based on preset delay time;
determining subtitle data corresponding to the live audio and video data stream within the preset delay time;
superposing the subtitle data to the live audio and video data stream;
and when the preset delay time length is over, playing the live broadcast audio-video data stream with the subtitle data.
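The four operations above can be sketched as a minimal simulation (Python is used purely for illustration; the `frames` source and `make_subtitle` hook are hypothetical stand-ins for the live capture and subtitle pipeline described in the disclosure, not part of it):

```python
from collections import deque

def play_with_delay(frames, delay_s, make_subtitle):
    """Simulate the four steps: collect frames in real time, buffer
    them for a preset delay, attach subtitle data within the delay
    window, then play each frame once its delay has elapsed."""
    buffer = deque()
    played = []
    for t, frame in enumerate(frames):
        buffer.append((t, frame))                 # collect in real time
        subtitle = make_subtitle(frame)           # determine subtitle within delay
        buffer[-1] = (t, (frame, subtitle))       # superpose subtitle on frame
        # play every buffered frame whose preset delay has expired
        while buffer and t - buffer[0][0] >= delay_s:
            played.append(buffer.popleft()[1])
    while buffer:                                 # flush remaining frames at end
        played.append(buffer.popleft()[1])
    return played
```

With a one-tick delay, each frame is emitted one step after capture, already carrying its subtitle.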
Further, determining subtitle data corresponding to the live audio and video data stream within the preset delay time duration includes:
extracting an audio data stream from the live audio and video data stream within the preset delay time;
and determining subtitle data corresponding to the live audio and video data stream based on the audio data stream.
Further, determining subtitle data corresponding to the live audio and video data stream based on the audio data stream comprises:
performing voice recognition on the audio data stream to generate subtitle data belonging to a first language;
translating the subtitle data belonging to the first language into subtitle data belonging to a second language;
and determining the subtitle data belonging to the first language and the subtitle data belonging to the second language as the subtitle data corresponding to the live audio-video data stream.
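The recognize-then-translate flow can be sketched as follows (the `recognize` and `translate` callables are placeholders for real ASR and machine-translation services, and the dictionary shape is an assumption, not part of the disclosure):

```python
def bilingual_subtitles(audio_chunks, recognize, translate):
    """For each audio chunk: speech-recognize a first-language
    caption, translate it into a second language, and keep both
    as the subtitle data for the corresponding stream segment."""
    subtitles = []
    for chunk in audio_chunks:
        first = recognize(chunk)      # first-language caption via ASR
        second = translate(first)     # second-language caption via MT
        subtitles.append({"first": first, "second": second})
    return subtitles
```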
Further, determining the subtitle data belonging to the first language and the subtitle data belonging to the second language as the subtitle data corresponding to the live audio-video data stream, including:
correcting the subtitle data belonging to the first language and the subtitle data belonging to the second language according to the received correction information;
and determining the modified subtitle data belonging to the first language and the modified subtitle data belonging to the second language as the subtitle data corresponding to the live broadcast audio-video data stream.
Further, superimposing the subtitle data onto the live audio-video data stream includes:
and superposing the subtitle data belonging to the second language and the subtitle data belonging to the first language on the live audio-video data stream such that the two are displayed in vertical correspondence, one above the other.
Further, before performing speech recognition on the audio data stream and generating subtitle data belonging to the first language, the method further includes:
determining a target language of voice recognition according to a language switching instruction input by a user;
performing voice recognition on the audio data stream to generate subtitle data belonging to a first language, including:
performing voice recognition on the audio data stream based on the target language to generate subtitle data belonging to a first language; wherein the first language is the same as the target language.
Further, before superimposing the subtitle data on the live audio-video data stream, the method further includes:
segmenting the subtitle data according to a preset mode to generate at least one piece of subtitle data;
determining the starting playing time and the ending playing time of each piece of sub subtitle data;
and superposing the subtitle data to the live audio and video data stream, wherein the method comprises the following steps:
and based on the starting playing time and the ending playing time, superposing each piece of sub-subtitle data to the corresponding live broadcast audio-video data stream.
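One plausible reading of the segmentation step, in Python: the "preset mode" is assumed here to be a maximum caption length, and each sub-caption's start and end play times are apportioned by its share of characters. Both policies are assumptions; the disclosure fixes neither.

```python
def segment_subtitle(text, start, end, max_len=20):
    """Split one caption into sub-captions of at most max_len
    characters, assigning each a start/end play time proportional
    to its length within the original caption's duration."""
    pieces = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    total = sum(len(p) for p in pieces)
    out, t = [], start
    for p in pieces:
        dur = (end - start) * len(p) / total   # duration share by length
        out.append({"text": p, "start": t, "end": t + dur})
        t += dur
    return out
```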
Further, before superimposing the subtitle data on the live audio-video data stream, the method further includes:
acquiring a time stamp of subtitle data corresponding to the live audio and video data stream;
and superposing the subtitle data to the live audio and video data stream, wherein the method comprises the following steps:
and superposing the subtitle data to the live audio and video data stream according to the timestamp so as to synchronize the subtitle data with the live audio and video data stream.
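Timestamp-based synchronization can be sketched as follows (the `(timestamp, payload)` tuple shapes are assumptions made for illustration):

```python
import bisect

def attach_by_timestamp(frames, subtitles):
    """Attach each subtitle to the latest frame at or before its
    timestamp, keeping subtitles in sync with the stream.
    frames: sorted list of (ts, frame); subtitles: list of (ts, text)."""
    frame_ts = [ts for ts, _ in frames]
    result = [(ts, frame, None) for ts, frame in frames]
    for ts, text in subtitles:
        i = bisect.bisect_right(frame_ts, ts) - 1   # latest frame <= ts
        if i >= 0:
            ts_f, frame, _ = result[i]
            result[i] = (ts_f, frame, text)
    return result
```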
Further, in the process of determining the subtitle data corresponding to the live audio and video data stream within the preset delay time, the method further includes:
judging whether an abnormal audio-video data stream exists in the live audio-video data stream;
when the abnormal audio and video data stream exists in the live audio and video data stream, pausing the operation of determining the subtitle data corresponding to the live audio and video data stream, and acquiring the video advertisement with the same playing time length as the abnormal audio and video data stream;
and replacing abnormal audio and video data streams in the live audio and video data streams based on the video advertisements.
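A minimal sketch of the replacement logic (the `is_abnormal` and `pick_ad` hooks are hypothetical; the disclosure only requires that the substituted advertisement match the abnormal segment's play length, and that subtitle determination pause for it):

```python
def replace_abnormal(segments, is_abnormal, pick_ad):
    """Replace each abnormal segment with an advertisement of the
    same play duration; normal segments pass through unchanged."""
    out = []
    for seg in segments:
        if is_abnormal(seg):
            out.append(pick_ad(seg["duration"]))   # ad with equal duration
        else:
            out.append(seg)
    return out
```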
Further, the method further includes:
determining the speed at which the user modifies the subtitle data, and adjusting the preset delay duration based on the modification speed; or,
and determining live scene information of the live audio and video data stream, and adjusting the preset delay time length based on the live scene information.
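One assumed policy for the first adjusting option: scale the delay inversely with the corrector's speed, clamped to a safe range. The formula and constants below are illustrative only; the disclosure does not specify them.

```python
def adjust_delay(base_delay, correction_rate, min_delay=2.0, max_delay=60.0):
    """Lengthen the preset delay when the human corrector is slow
    (low corrected-captions-per-second rate), shorten it when fast,
    clamped between min_delay and max_delay seconds."""
    if correction_rate <= 0:
        return max_delay                # no corrections observed: be safe
    return max(min_delay, min(max_delay, base_delay / correction_rate))
```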
It is to be noted that the foregoing is merely illustrative of the preferred embodiments of the present disclosure and the technical principles employed. Those skilled in the art will appreciate that the present disclosure is not limited to the particular embodiments described herein, and that various obvious changes, adaptations, and substitutions may be made without departing from the scope of the present disclosure. Therefore, although the present disclosure has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from the concept of the present disclosure; its scope is determined by the appended claims.

Claims (12)

1. A subtitle display method, comprising:
collecting live broadcast audio and video data streams in real time, and caching the live broadcast audio and video data streams based on preset delay time;
determining subtitle data corresponding to the live audio and video data stream within the preset delay time;
superposing the subtitle data to the live audio and video data stream;
and when the preset delay time length is over, playing the live broadcast audio-video data stream with the subtitle data.
2. The method of claim 1, wherein determining the subtitle data corresponding to the live audio and video data stream within the preset delay time duration comprises:
extracting an audio data stream from the live audio and video data stream within the preset delay time;
and determining subtitle data corresponding to the live audio and video data stream based on the audio data stream.
3. The method of claim 2, wherein determining subtitle data corresponding to the live audio video data stream based on the audio data stream comprises:
performing voice recognition on the audio data stream to generate subtitle data belonging to a first language;
translating the subtitle data belonging to the first language into subtitle data belonging to a second language;
and determining the subtitle data belonging to the first language and the subtitle data belonging to the second language as the subtitle data corresponding to the live audio-video data stream.
4. The method of claim 3, wherein determining the caption data belonging to the first language and the caption data belonging to the second language as the caption data corresponding to the live audio-video data stream comprises:
correcting the subtitle data belonging to the first language and the subtitle data belonging to the second language according to the received correction information;
and determining the modified subtitle data belonging to the first language and the modified subtitle data belonging to the second language as the subtitle data corresponding to the live broadcast audio-video data stream.
5. The method of claim 3, wherein superimposing the subtitle data onto the live audio video data stream comprises:
and superposing the subtitle data belonging to the second language and the subtitle data belonging to the first language on the live audio-video data stream such that the two are displayed in vertical correspondence, one above the other.
6. The method of claim 3, further comprising, prior to performing speech recognition on the audio data stream to generate caption data in the first language:
determining a target language of voice recognition according to a language switching instruction input by a user;
performing voice recognition on the audio data stream to generate subtitle data belonging to a first language, including:
performing voice recognition on the audio data stream based on the target language to generate subtitle data belonging to a first language; wherein the first language is the same as the target language.
7. The method of claim 1, further comprising, prior to superimposing the subtitle data onto the live audio video data stream:
acquiring a time stamp of subtitle data corresponding to the live audio and video data stream;
and superposing the subtitle data to the live audio and video data stream, wherein the method comprises the following steps:
and superposing the subtitle data to the live audio and video data stream according to the timestamp so as to synchronize the subtitle data with the live audio and video data stream.
8. The method of claim 1, wherein in the process of determining the subtitle data corresponding to the live audio and video data stream within the preset delay time, the method further comprises:
judging whether an abnormal audio-video data stream exists in the live audio-video data stream;
when the abnormal audio and video data stream exists in the live audio and video data stream, pausing the operation of determining the subtitle data corresponding to the live audio and video data stream, and acquiring the video advertisement with the same playing time length as the abnormal audio and video data stream;
and replacing abnormal audio and video data streams in the live audio and video data streams based on the video advertisements.
9. The method of any of claims 1-8, further comprising:
determining the speed at which the user modifies the subtitle data, and adjusting the preset delay duration based on the modification speed; or,
and determining live scene information of the live audio and video data stream, and adjusting the preset delay time length based on the live scene information.
10. A subtitle display apparatus, comprising:
the audio and video data caching module is used for acquiring a live broadcast audio and video data stream in real time and caching the live broadcast audio and video data stream based on preset delay time;
the caption data determining module is used for determining the caption data corresponding to the live audio and video data stream within the preset delay time;
the caption data superposition module is used for superposing the caption data on the live audio and video data stream;
and the audio and video data playing module is used for playing the live audio and video data stream with the subtitle data when the preset delay time duration is over.
11. An electronic device, characterized in that the electronic device comprises:
one or more processing devices;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the subtitle display method of any of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which program, when being executed by processing means, is adapted to carry out a method of displaying subtitles according to any one of claims 1 to 9.
CN202011460160.6A 2020-12-11 2020-12-11 Subtitle display method and device, electronic equipment and storage medium Active CN112616062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460160.6A CN112616062B (en) 2020-12-11 2020-12-11 Subtitle display method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011460160.6A CN112616062B (en) 2020-12-11 2020-12-11 Subtitle display method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112616062A true CN112616062A (en) 2021-04-06
CN112616062B CN112616062B (en) 2023-03-10

Family

ID=75233486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460160.6A Active CN112616062B (en) 2020-12-11 2020-12-11 Subtitle display method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112616062B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105704579A (en) * 2014-11-27 2016-06-22 南京苏宁软件技术有限公司 Real-time automatic caption translation method during media playing and system
CN105959772A (en) * 2015-12-22 2016-09-21 合网络技术(北京)有限公司 Streaming media and caption instant synchronization display and matching processing method, device and system
CN106792145A (en) * 2017-02-22 2017-05-31 杭州当虹科技有限公司 A kind of method and apparatus of the automatic overlapping text of audio frequency and video
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN111479124A (en) * 2020-04-20 2020-07-31 北京捷通华声科技股份有限公司 Real-time playing method and device
CN111901615A (en) * 2020-06-28 2020-11-06 北京百度网讯科技有限公司 Live video playing method and device
US20200359104A1 (en) * 2018-04-25 2020-11-12 Tencent Technology (Shenzhen) Company Limited Method and apparatus for pushing subtitle data, subtitle display method and apparatus, device and medium


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301428A (en) * 2021-05-14 2021-08-24 上海樱帆望文化传媒有限公司 Live caption device for electric competition events
CN115474066A (en) * 2021-06-11 2022-12-13 北京有竹居网络技术有限公司 Subtitle processing method and device, electronic equipment and storage medium
CN113689862A (en) * 2021-08-23 2021-11-23 南京优飞保科信息技术有限公司 Quality inspection method and system for customer service seat voice data
CN113689862B (en) * 2021-08-23 2024-03-22 南京优飞保科信息技术有限公司 Quality inspection method and system for customer service agent voice data
CN113891108A (en) * 2021-10-19 2022-01-04 北京有竹居网络技术有限公司 Subtitle optimization method and device, electronic equipment and storage medium
CN114095671A (en) * 2021-11-11 2022-02-25 北京有竹居网络技术有限公司 Cloud conference live broadcast system, method, device, equipment and medium
CN115209211A (en) * 2022-09-13 2022-10-18 北京达佳互联信息技术有限公司 Subtitle display method, subtitle display apparatus, electronic device, storage medium, and program product

Also Published As

Publication number Publication date
CN112616062B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
CN112601101B (en) Subtitle display method and device, electronic equipment and storage medium
CN112616062B (en) Subtitle display method and device, electronic equipment and storage medium
US11463779B2 (en) Video stream processing method and apparatus, computer device, and storage medium
US11252444B2 (en) Video stream processing method, computer device, and storage medium
CN108600773B (en) Subtitle data pushing method, subtitle display method, device, equipment and medium
US10244291B2 (en) Authoring system for IPTV network
US8931024B2 (en) Receiving apparatus and subtitle processing method
US8045054B2 (en) Closed captioning language translation
US20210014574A1 (en) Using Text Data in Content Presentation and Content Search
CN112601102A (en) Method and device for determining simultaneous interpretation of subtitles, electronic equipment and storage medium
US11595731B2 (en) Implementation method and system of real-time subtitle in live broadcast and device
KR101899588B1 (en) System for automatically generating a sign language animation data, broadcasting system using the same and broadcasting method
WO2014155377A1 (en) Method and system for automatically adding subtitles to streaming media content
KR101192207B1 (en) System for providing real-time subtitles service of many languages for online live broadcasting and method thereof
US20140208351A1 (en) Video processing apparatus, method and server
US20140003792A1 (en) Systems, methods, and media for synchronizing and merging subtitles and media content
CN102055941A (en) Video player and video playing method
JP2012231383A (en) Information display control apparatus employing iptv services, display information providing server, information display control method, information distribution method, information display control program, and information distribution program
CN113035199A (en) Audio processing method, device, equipment and readable storage medium
CN115623264A (en) Live stream subtitle processing method and device and live stream playing method and device
JP2013046198A (en) Television viewing apparatus
JP2021090172A (en) Caption data generation device, content distribution system, video reproduction device, program, and caption data generation method
CN113891168A (en) Subtitle processing method, subtitle processing device, electronic equipment and storage medium
CN113923530B (en) Interactive information display method and device, electronic equipment and storage medium
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant