WO2024087732A1

WO2024087732A1 - Livestreaming data processing method and system

Info

Publication number: WO2024087732A1
Application number: PCT/CN2023/106150
Authority: WO
Inventors: 汤然; 姜军; 郑龙; 刘永明
Original assignee: 上海哔哩哔哩科技有限公司
Priority date: 2022-10-25
Filing date: 2023-07-06
Publication date: 2024-05-02
Also published as: CN115643424A

Abstract

Embodiments of the present application provide a livestreaming data processing method and system. The livestreaming data processing method comprises: decoding a received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the receiving time of the audio stream; using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning said live stream to a client.

Description

Live data processing method and system

This application claims the priority of Chinese patent application numbered 202211311544.0, filed on October 25, 2022, and entitled “Live Data Processing Method and System”. The entire content of the Chinese patent application is incorporated into this application by reference.

Technical Field

The embodiments of the present application relate to the field of computer technology, and in particular to a live data processing method. One or more embodiments of the present application also relate to a live data processing system, a computing device, and a computer-readable storage medium.

Background technique

With the rapid development of the live audio and video industry, the requirements for high-definition image quality, low latency, and audio and video synchronization have been optimized to the extreme using data streaming technology. However, users are not satisfied with this.

In some special scenarios, such as large-scale sports events, large-scale conference reports, online education and training, etc., it is necessary to translate the live broadcast in real time and add language subtitles. Because the subtitles need to record the live stream first, then extract the audio stream, and burn it into the video after manual or machine translation, the subtitles can be displayed when re-reporting. However, this processing method cannot bring live broadcast effects to audiences who do not understand the language or have hearing impairments. The inventors realize that although the technology of real-time subtitle generation of live broadcasts has been developed, such as live barrage, the technology has some defects. For example, the subtitles and sound are not synchronized, sometimes ahead and sometimes delayed, and the audience experience is very poor and cannot meet their needs. Therefore, there is an urgent need for an effective method to solve such problems.

Summary of the invention

In view of this, the embodiments of the present application provide a live broadcast data processing method. One or more embodiments of the present application also relate to a live broadcast data processing device, a live broadcast data processing system, a computing device, and a computer-readable storage medium to solve the technical defects of high cost, low efficiency and delayed subtitle generation in the related art.

According to a first aspect of an embodiment of the present application, a live broadcast data processing method is provided, including:

Decoding the received initial live stream to generate an audio stream and a first video stream;

Performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream;

Using the recognized text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream;

The second video stream and the audio stream are encoded to generate a live stream to be pushed, and the live stream to be pushed is returned to the client.

According to a second aspect of an embodiment of the present application, a live broadcast data processing device is provided, including:

A decoding module, configured to decode the received initial live stream to generate an audio stream and a first video stream;

A recognition module, configured to perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream;

an adding module, configured to use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream;

The encoding module is configured to encode the second video stream and the audio stream, generate a live stream to be pushed, and return the live stream to be pushed to the client.

According to a third aspect of an embodiment of the present application, another live broadcast data processing method is provided, including:

Receive and cache the live stream to be pushed returned by the live server;

Decoding the live broadcast stream to be pushed, generating a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live broadcast server according to the generation time of the subtitle information and the reception time of the audio stream;

Determining a display time of the subtitle information according to the time interval information;

When it is determined that the playback condition of the live stream to be pushed is met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.

According to a fourth aspect of an embodiment of the present application, another live broadcast data processing device is provided, including:

A receiving module is configured to receive and cache the live streaming stream to be pushed returned by the live streaming server;

A decoding module is configured to decode the live broadcast stream to be pushed, generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live broadcast server according to the generation time of the subtitle information and the reception time of the audio stream;

A determination module, configured to determine a display time of the subtitle information according to the time interval information;

The display module is configured to synchronously play the video stream and the audio stream when it is determined that the playback conditions of the live stream to be pushed are met, and to display the subtitle information based on the display time.

According to a fifth aspect of an embodiment of the present application, a live broadcast data processing system is provided, including:

Live streaming server and client;

The live broadcast server is used to decode the received initial live broadcast stream, generate an audio stream and a first video stream, perform speech recognition on the audio stream, generate corresponding recognition text, determine the time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, add the subtitle information and the time interval information to the first video stream, generate a second video stream, encode the second video stream and the audio stream, generate a live broadcast stream to be pushed, and return the live broadcast stream to be pushed to the client;

The client is used to receive and cache the live stream to be pushed, decode the live stream to be pushed, obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously play the second video stream and the audio stream, and display the subtitle information based on the display time.

According to a sixth aspect of an embodiment of the present application, there is provided a computing device, including:

Memory and processor;

The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, wherein the processor implements the steps of the live data processing method when executing the computer-executable instructions.

According to a seventh aspect of an embodiment of the present application, a computer-readable storage medium is provided, which stores computer-executable instructions, and when the instructions are executed by a processor, the steps of the live data processing method are implemented.

An embodiment of the present application implements a live broadcast data processing method and system, wherein the live broadcast data processing method includes decoding a received initial live broadcast stream, generating an audio stream and a first video stream, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream, generating a second video stream, encoding the second video stream and the audio stream, generating a live broadcast stream to be pushed, and returning the live broadcast stream to be pushed to the client.

In an embodiment of the present application, the live broadcast server performs speech recognition on the audio stream, generates a corresponding recognition text, and records the time interval between the generation time of the recognition text and the time interval between the time of receiving the audio stream. Since the time interval can be used to characterize the time consumed by the live broadcast server for performing speech recognition on the audio stream in the initial live broadcast stream after receiving the initial live broadcast stream, after adding the recognition text and the time interval information to the video stream and returning it to the client, the client can pre-parse and obtain the subtitle information carried in the live broadcast stream to be pushed, and determine the display time of the subtitle information according to the time interval information between the generation time of the subtitle information and the time when the live broadcast server receives the audio stream, that is, determine the display time of the complete subtitle corresponding to the live broadcast stream to be pushed, so as to display the complete subtitle in advance based on the display time, which is beneficial to reducing the cost of generating subtitles and improving the efficiency of subtitle generation, and is beneficial to avoiding the asynchrony between subtitles and video images or audio, thereby beneficial to satisfying the needs of users to watch live subtitles during live viewing, and beneficial to improving the user's live viewing experience.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG1 is an architecture diagram of a live data processing system provided by an embodiment of the present application;

FIG2 is a flow chart of a live broadcast data processing method provided by an embodiment of the present application;

FIG3 is a flow chart of another live data processing method provided by an embodiment of the present application;

FIG4 is an interactive schematic diagram of a live broadcast data processing method provided by an embodiment of the present application applied to the live broadcast field;

FIG5 is a schematic diagram of the structure of a live data processing device provided by an embodiment of the present application;

FIG6 is a schematic diagram of the structure of another live data processing device provided by an embodiment of the present application;

FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present application.

Detailed ways

Many specific details are described in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without violating the connotation of the present application, so the present application is not limited by the specific implementation disclosed below.

The terms used in one or more embodiments of the present application are only for the purpose of describing specific embodiments, and are not intended to limit one or more embodiments of the present application. The singular forms of "a", "said" and "the" used in one or more embodiments of the present application and the appended claims are also intended to include plural forms, unless the context clearly indicates other meanings. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any or all possible combinations of one or more associated listed items.

It should be understood that, although the terms first, second, etc. may be used to describe various information in one or more embodiments of the present application, these information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other. For example, without departing from the scope of one or more embodiments of the present application, the first may also be referred to as the second, and similarly, the second may also be referred to as the first. Depending on the context, the word "if" as used herein may be interpreted as "at the time of" or "when" or "in response to determining".

First, the terms involved in one or more embodiments of the present application are explained.

Live broadcast: Live broadcast in English is Live broadcast. Broadly speaking, live broadcast also includes TV live broadcast. Here we generally refer to online video live broadcast. Live audio and video will be pushed to the server in the form of media stream (pushing stream). If there are viewers watching the live broadcast, after receiving the user's request, the server will transmit the video to the website, APP, and client player to play the video in real time.

H264 encoding: H264 generally refers to H.264. H.264 is a highly compressed digital video codec standard proposed by the Joint Video Team (JVT), which is jointly composed of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).

H265 encoding: H.265 is a new video coding standard developed by ITU-T VCEG after H.264. The H.265 standard revolves around the video coding standard H.264, retaining some of the original technologies while improving some related technologies.

SEI: SEI stands for Supplemental Enhancement Information, which belongs to the bitstream category. It provides a method to add additional information to the video bitstream and is one of the features of video compression standards such as H.264/H.265.

Speech recognition technology: a technology that converts speech signals into corresponding text or commands through the process of recognition and understanding by a machine.

GRPC: A type of RPC (abbreviation of Remote Procedure Call) framework. It is a high-performance, open source and general RPC framework developed based on the ProtoBuf (Protocol Buffers) serialization protocol and supports many development languages.

Transcoding: Video transcoding technology converts video signals from one format to another.

In the present application, a live data processing method is provided. One or more embodiments of the present application simultaneously relate to a live data processing apparatus, a live data processing system, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.

In specific implementation, the subtitle information of the embodiment of the present application can be presented on large video playback devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III, dynamic image expert compression standard audio layer 3) players, MP4 (Moving Picture Experts Group Audio Layer IV, Moving Picture Experts Compression Standard Audio Layer 4) Clients such as players, laptops, e-book readers and other display terminals.

In addition, the subtitle information of the embodiments of the present application can be applied to any video or audio that can present subtitles, for example, subtitles can be presented in live or recorded videos, and subtitles can be presented in audio of online or offline songs or books.

Referring to FIG. 1 , FIG. 1 shows an architecture diagram of a live broadcast data processing system provided according to an embodiment of the present application, including:

Live broadcast server 102 and client 104;

The live broadcast server 102 is used to decode the received initial live broadcast stream, generate an audio stream and a first video stream, perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream, encode the second video stream and the audio stream, generate a live broadcast stream to be pushed, and return the live broadcast stream to be pushed to the client 104;

The client 104 is used to receive and cache the live stream to be pushed, decode the live stream to be pushed, obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously play the second video stream and the audio stream, and display the subtitle information based on the display time.

Specifically, in Figure 1, user U1 broadcasts live through a smart terminal and pushes the generated initial live stream to the live server 102. The live server 102 decodes the received initial live stream to generate an audio stream and a first video stream; then performs speech recognition on the audio stream to generate a corresponding recognition text, and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; then uses the recognition text as subtitle information, and adds the subtitle information and the time interval information to the first video stream to generate a second video stream; then encodes the second video stream and the audio stream to generate a live stream to be pushed. When user U2 and user U3 watch the live broadcast of user U1, the live server can push the live stream to be pushed to the client 104 of user U2 and user U3.

When playing the live stream for the user, the client 104 can pull the live stream to be pushed of a certain length from the live server in advance and cache it, so that the client 104 can decode the cached live stream to be pushed in advance and obtain the subtitle information contained in the live stream to be pushed. Then, according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the reception time of the audio stream by the live stream server 102, the display time of the subtitle information can be determined. When it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed based on the display time.

In the embodiment of the present application, the above-mentioned processing method is conducive to enabling the client to pre-parse and obtain the subtitle information carried in the live stream to be pushed, and determine the display time of the subtitle information according to the time interval information between the generation time of the subtitle information and the time when the live server receives the audio stream, that is, determine the display time of the complete subtitles corresponding to the live stream to be pushed, so as to display the complete subtitles in advance based on the display time, which is conducive to reducing the cost of generating subtitles and improving the efficiency of subtitle generation, and is conducive to avoiding the asynchrony between subtitles and video images or audio, thereby meeting the user's needs for watching live subtitles during live viewing and improving the user's live viewing experience.

The above is a schematic scheme of a live data processing system of this embodiment. It should be noted that the technical scheme of the live data processing system and the technical scheme of the following live data processing method belong to the same concept, and the details not described in detail in the technical scheme of the live data processing system can be referred to the description of the technical scheme of the following live data processing method.

Referring to FIG. 2 , FIG. 2 shows a flow chart of a live broadcast data processing method provided according to an embodiment of the present application, comprising the following steps:

Step 202: decode the received initial live stream to generate an audio stream and a first video stream.

Specifically, the live broadcast data processing method provided in the embodiment of the present application is applied to a live broadcast server. The initial live broadcast stream is the live broadcast stream pushed to the live broadcast server by the anchor during the live broadcast process.

When the host is broadcasting live through his smart terminal, he can push the live stream generated during the live broadcast to the live broadcast server through the smart terminal, so that when other users need to watch the host's live broadcast, the live broadcast server can push the live stream pushed by the host to the user terminals (clients) of other users.

Currently, most live broadcasts are without subtitles, but in some special scenarios, such as large-scale sports events, large-scale conference reports, online education and training, etc., the live broadcast needs to be translated in real time and subtitled. Since the subtitles need to record the live stream first, then extract the audio stream, and burn it into the video after manual or machine translation, the subtitles can be displayed when the broadcast is repeated. However, this processing method cannot bring live broadcast effects to audiences who do not understand the language or have hearing impairments.

In addition, even though technology has been developed to generate subtitles in real time during live broadcasts, such as live bullet screens, this technology often has the problem of subtitles being out of sync with the video image or sound, making the user experience of watching live broadcasts extremely poor and failing to meet their needs.

Based on this, in an embodiment of the present application, after receiving the initial live stream pushed by the anchor, the live server can decode the initial live stream to obtain an audio stream and a first video stream, and can perform voice recognition on the audio stream to obtain the corresponding recognized text, and then add the recognized text as subtitle information to the first video stream to generate a second video stream, so that after the encoding results of the audio stream and the second video stream are pushed to the user's client, the client can decode and obtain the subtitle information, and can display the subtitle information while synchronously playing the audio stream and the second video stream for the user, thereby avoiding the problem of the live subtitles being out of sync with the live video screen or audio during the user's real-time viewing of the live broadcast, so as to meet the user's needs for viewing live subtitles during the live viewing process and improve the user's live viewing experience.

In specific implementation, decoding the received initial live stream can be implemented in the following ways:

Determine the live stream to be played cached by the client, and determine the generation time corresponding to the live stream to be played;

According to the live stream identifier corresponding to the live stream to be played and the generation time, an initial live stream corresponding to the live stream identifier within a preset time interval is obtained, and the initial live stream is decoded, wherein the preset time interval is later than the generation time.

In addition, the client decodes the live stream to be played, generates a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;

When it is determined that the playback condition of the live stream to be played is met, the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.

Specifically, when playing a live stream, the client can pre-cache the live stream to be played within a period of time after the current playback time, and parse this part of the live stream to be played in advance to obtain the video stream to be played, the audio stream to be played, the subtitles to be displayed, and the display time corresponding to the subtitles to be displayed contained in the live stream to be played. Then, when it is determined that the playback conditions of the live stream to be played are met, the decoded video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.

For example, if the duration of the live stream to be played pre-cached by the client is 5s and the current playback time is t, the live stream to be played within t to t+5s is pre-cached and parsed in advance to determine whether the subtitles to be displayed need to be displayed in advance based on the display time of the subtitles to be displayed in the analysis results, thereby reducing the delay between the live subtitles and the live video screen or audio during the user's real-time viewing of the live broadcast.

Furthermore, since the duration of the live stream to be played pre-cached by the client is limited, after this part of the live stream to be played is played, it is necessary to further cache a new live stream to be played. For example, the client pre-caches the live stream to be played within t to t+5s, and after the live stream to be played within t to t+3s is played, it is necessary to cache the live stream to be played within t+5s to t+8s, that is, it is necessary to obtain the live stream to be played within t+5s to t+8s from the live broadcast server.

Therefore, the live broadcast server can pre-determine the live broadcast stream to be played that has been cached by the client, and determine the generation time (playback time) corresponding to the cached live broadcast stream to be played, and then obtain the initial live broadcast stream corresponding to the live broadcast stream identifier within a period of time after the generation time based on the live broadcast stream identifier and generation time corresponding to the cached live broadcast stream to be played, and process the initial live broadcast stream to generate a live broadcast stream to be pushed containing subtitle information, and push it to the client.

Based on this, when the user is watching the live broadcast in real time through the client, the client pre-caches the live stream to be played within a period of time after the current playback time, and parses this part of the live stream to be played in advance. Similarly, the live broadcast server can also pre-determine the live stream to be played that the client has cached, and determine the initial Live stream and parse it; although the live server parses the initial live stream, and the client parses the live stream to be played, both take a certain amount of time and cause a certain delay in the live broadcast, but in the embodiment of the present application, the live server parses the initial live stream and the client parses the live stream to be played in parallel, and the client can determine whether to display the subtitles in advance according to the display time of the subtitles to be displayed in the analysis result, thereby reducing the delay between the live subtitles and the live video screen or audio during the user's real-time viewing of the live broadcast.

Step 204: Perform speech recognition on the audio stream to generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream.

Specifically, after the live broadcast server decodes the initial live broadcast stream and obtains the audio stream and the first video stream, it can perform speech recognition on the audio stream to generate corresponding recognition text, and then add the recognition text as subtitle information to the first video stream to generate a second video stream, so that after the client for watching the live broadcast obtains the second video stream, it can display the subtitle information to the user during the playback of the second video stream.

However, in actual applications, after the live broadcast server decodes and obtains the audio stream, it often takes a certain amount of time to perform speech recognition on the audio stream. In this case, there will be a time difference between the generation time of the recognized text and the time of receiving the audio stream, that is, the time of receiving the initial live broadcast stream. If this time difference is not taken into account and only the recognized text and the initial live broadcast stream are pushed to the client, then when the client displays the recognized text, the recognized text may be out of sync with the video screen or sound.

After the embodiment of the present application obtains the complete recognition text, in order to avoid the asynchrony between the recognition text and the video image or sound, it is necessary to determine the time consumed to generate the recognition text, that is, the time interval between the generation time of the recognition text and the time when the live broadcast server receives the audio stream, so that the client can determine how long in advance to display the recognition text after obtaining it according to the time interval.

In a specific implementation, after the live broadcast server decodes and obtains the audio stream, the audio stream may be divided according to the spectrum information corresponding to the audio stream to generate at least two audio segments;

Correspondingly, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream includes:

Performing speech recognition on a target audio segment to generate a corresponding recognition text, wherein the target audio segment is one of the at least two audio segments;

The generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the target audio segment is determined.

Specifically, when performing speech recognition on an audio stream, if the recognized audio stream is a complete sentence, the accuracy of the recognition result can be guaranteed. Based on this, the embodiment of the present application can first divide the audio stream according to the spectrum information corresponding to the audio stream to generate at least two audio segments. For example, according to the spectrum information, the audio stream between any two adjacent points with a spectrum value of 0 (indicating a pause) is regarded as an audio segment. Then, speech recognition is performed on each audio segment to generate a corresponding recognition text, determine the generation time of the recognition text, and determine the time interval information between the generation time and the reception time of each audio segment (the reception time of the audio stream or the initial live stream).

Alternatively, performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, including:

Splitting the audio stream according to a preset recognition window to generate at least one audio segment;

Performing speech recognition on a target audio segment to generate a corresponding recognition text, wherein the target audio segment is one of the at least one audio segment;

The generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.

Specifically, in the process of speech recognition of an audio stream, a preset recognition window is usually used, and the window length of the preset recognition window can be 0.5s-1s. By using the preset recognition window to perform speech recognition of the audio stream, a single word in the audio stream can be recognized; or the window length of the preset recognition window can be 1s-5s. By using the preset recognition window to perform speech recognition of the audio stream, a complete sentence in the audio stream can be recognized. The specific window length can be determined according to actual needs and is not limited here.

The audio stream is subjected to speech recognition according to a preset recognition window. Specifically, the audio stream is split according to the preset recognition window to generate at least one audio segment, speech recognition is performed on each audio segment to generate a corresponding recognition text, and then the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.

In addition, in the embodiment of the present application, the live broadcast server includes a transcoding module and a speech recognition service module. Therefore, the received initial live broadcast stream is decoded to generate an audio stream and a first video stream. Specifically, the received initial live broadcast stream is decoded by the transcoding module to generate an audio stream and a first video stream; speech recognition is performed on the audio stream to generate a corresponding recognition text. Specifically, speech recognition is performed on the audio stream by the speech recognition service module to generate a corresponding recognition text.

Wherein, the transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.

Specifically, the data transmission channel may be GRPC.

Further, performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text includes:

Splitting the audio stream according to a preset recognition window by a speech recognition service module to generate at least one audio segment;

Perform speech recognition on a first audio segment to generate a corresponding first recognition text, and return the first recognition text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.

Specifically, as mentioned above, in the process of performing speech recognition on an audio stream, a preset recognition window is usually used. Similarly, when performing speech recognition on an audio stream using a speech recognition service module, a preset recognition window can also be used. Speech recognition is performed on the audio stream according to the preset recognition window. Specifically, the audio stream is split according to the preset recognition window to generate at least one audio segment, speech recognition is performed on each audio segment, corresponding recognition text is generated, and then the generation time of the recognition text is determined, and the time interval information between the generation time and the receiving time of the audio stream is determined.

The window length of the preset recognition window may be 0.5s-1s, and the preset recognition window may be used to perform speech recognition of the audio stream, so that a single word in the audio stream can be recognized; or the window length of the preset recognition window may be 1s-5s, and the preset recognition window may be used to perform speech recognition of the audio stream, so that a complete sentence in the audio stream can be recognized. The specific window length may be determined according to actual needs and is not limited here.

Step 206: Use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream.

Specifically, after generating the recognition text and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, the recognition text can be used as subtitle information, and the subtitle information and the time interval information can be added to the first video stream to generate the second video stream.

The subtitle information may be written to the first video stream in the form of SEI to generate the second video stream.

In a specific implementation, the text type of the recognized text may also be determined according to the text length and/or text semantics of the recognized text;

Correspondingly, the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:

Determine a target video frame in the first video stream according to the generation time;

The recognized text is used as subtitle information, and the subtitle information, the time interval information and the text type are used as video frame information of the target video frame, and added to the first video stream.

Specifically, as mentioned above, in the process of performing speech recognition on an audio stream, a preset recognition window is usually used. The window length of the preset recognition window can be 0.5s-1s. By using the preset recognition window to perform speech recognition on the audio stream, single words in the audio stream can be recognized; or, the window length of the preset recognition window can be 1s-5s. By using the preset recognition window to perform speech recognition on the audio stream, complete sentences in the audio stream can be recognized.

Therefore, after generating the recognized text, the embodiment of the present application can also determine the text type of the recognized text according to the text length and/or text semantics of the recognized text. In practical applications, the text type includes but is not limited to characters, words, sentences, etc. Text semantics is used to determine whether the recognized text can express complete semantics. If so, the text type of the recognized text can be determined as Sentence type; if not, then when the text length of the recognized text is greater than or equal to two characters, the text type of the recognized text is the word type; if the text length is equal to 1, the text type is the character type.

After determining the text type of the recognized text, the target video frame in the first video stream can be determined according to the generation time of the recognized text, the recognized text can be used as subtitle information, and the subtitle information, time interval information and text type can be used as video frame information of the target video frame and added to the first video stream.

In practical applications, the last video frame in the video segment corresponding to the target audio segment can usually be used as the target video frame, and the subtitle information, time interval information and text type can be used as its video frame information and added to the first video stream to generate a second video stream. After obtaining the second video stream, the client can determine the subtitle information to be displayed according to the text type, and usually gives priority to sentence-type subtitle information for display to ensure the subtitle viewing effect of the live broadcast.

In addition, in the embodiment of the present application, after the voice recognition service module splits the audio stream according to a preset recognition window to generate at least one audio segment, and performs voice recognition on the first audio segment to generate a corresponding first recognition text, the recognition text is used as subtitle information, and the subtitle information and the time interval information are added to the first video stream, including:

The transcoding module determines a first target video frame in the first video stream according to a generation time of the first recognition text;

The first recognition text is used as the first subtitle information, and the time interval information between the generation time of the first subtitle information and the first recognition text and the receiving time of the audio stream is added to the first video stream as the video frame information of the first target video frame.

Specifically, when the speech recognition service module splits the audio stream into one or at least two audio segments, it can perform speech recognition on each audio segment in turn, and after generating the recognition text corresponding to any audio segment, it can return the recognition text to the transcoding module. The transcoding module determines the target video frame in the first video stream (usually the last video frame of the video segment corresponding to any audio segment) according to the generation time of the recognition text, and uses the recognition text as subtitle information, and the time interval information between the subtitle information, the generation time of the recognition text and the receiving time of the audio stream as the video frame information of the target video frame, and adds it to the first video stream.

Perform speech recognition on a second audio segment adjacent to the first audio segment among the at least two audio segments, generate a corresponding second recognized text, and return the first recognized text and the second recognized text to the transcoding module.

The transcoding module determines a second target video frame in the first video stream according to a generation time of the second recognition text;

The first recognition text and the second recognition text are used as second subtitle information, and the time interval information between the generation time of the second subtitle information and the second recognition text and the receiving time of the audio stream is used as video frame information of the second target video frame and added to the first video stream.

Specifically, as mentioned above, when the speech recognition service module splits the audio stream into at least two audio segments, it can first perform speech recognition on the first audio segment of the at least two audio segments to generate a corresponding first recognition text, and the transcoding module uses the first recognition text as subtitle information, and uses the subtitle information, the time interval information between the generation time of the first recognition text and the receiving time of the audio stream as the video frame information of the first target video frame (usually the last video frame of the video segment corresponding to the first audio segment), and adds it to the first video stream.

Then, speech recognition can be performed on the second audio segment adjacent to the first audio segment in the at least two audio segments to generate a corresponding second recognition text. The transcoding module uses the first recognition text and the second recognition text as subtitle information, and uses the subtitle information, the time interval information between the generation time of the second recognition text and the receiving time of the audio stream as the video frame information of the second target video frame (usually the last video frame of the video segment corresponding to the second audio segment) to add it to the first video stream, and so on.

After the speech recognition service module obtains the first recognition text through speech recognition, it can temporarily store it. After obtaining the second recognition text through recognition, since the first audio segment is adjacent to the second audio segment, the first recognition text and the second recognition text can be returned together as subtitle information of the video stream, thereby allowing the speech recognition service module to reuse the cache to improve the accuracy of the recognition results of the subtitle information.

Step 208: encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.

Specifically, after generating the second video stream, the live broadcast server can encode the second video stream and the audio stream to generate a live broadcast stream to be pushed, and if the user has a need to watch live broadcast, the live broadcast stream to be pushed can be pushed to the user's client.

In a specific implementation, the client decodes the live stream to be pushed, generates a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;

In the case where it is determined that the text type is the target type, determining the display time of the subtitle information according to the playback time of the target video frame and the time interval information;

Determining, according to the display time, at least two video frames in the video stream for displaying the subtitle information, wherein the playback time of the at least two video frames is earlier than the playback time of the target video frame;

When it is determined that the playback conditions of the live stream to be pushed are met, the video stream and the audio stream are played synchronously, and based on the display time, the subtitle information is displayed in the at least two video frames and the target video frame.

Specifically, as described above, when playing a live stream, the client may pull the live stream to be pushed for a certain length of time after the current playback time from the live server in advance and cache it, and then decode the cached live stream to be pushed in advance to obtain the subtitle information corresponding to the target video frame in the live stream to be pushed, the text type of the subtitle information, and the time interval information between the generation time of the subtitle information and the reception time of the audio stream by the live stream server. When it is determined according to the text type that the subtitle information belongs to the target text type, that is, the sentence type, the display time of the subtitle information may be determined according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the reception time of the audio stream by the live stream server, combined with the playback time of the target video frame, and then determine other video frames in the live stream to be pushed that are located before the target video frame and are used to display the subtitle information according to the display time, and when it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed in the determined video frame and the target video frame based on the display time.

For example, the current time point is t, and the client pre-caches the live stream to be pushed from t to t+5s during the process of playing the live stream, and then decodes to obtain the subtitle information carried in this live stream to be pushed. If the decoding result contains the recognition text corresponding to the video frames at the five time points of t+1s, t+2s, t+3s, t+4s, and t+5s, and the recognition text corresponding to the time point t+5s is of sentence type, then the recognition text can be displayed first. In this case, the time interval information corresponding to the recognition text can be determined. If the generation time of the recognition text and the reception time of the audio stream are determined, The time interval between is 4s, and the time interval between the generation time of the recognition text and the video frame at the time point t+5s is 1s, which means that the subtitle information (recognition text) needs to be displayed 3s in advance. It can also mean that the host expressed a complete sentence from t+3s to t+5s. Therefore, when the live stream to be pushed at t+3s is started to be played, the subtitle information can be displayed at the same time, and the display can be ended at t+5s to realize the early display of the complete subtitles and avoid delays between the subtitles and the video screen or sound. Until it is detected that other subtitle information needs to be displayed, the subtitle information is stopped.

In an embodiment of the present application, the live broadcast server performs speech recognition on the audio stream, generates corresponding recognition text, and records the time interval between the generation time of the recognition text and the time interval between the time of receiving the audio stream. Since the time interval can be used to characterize the time consumed by the live broadcast server for performing speech recognition on the audio stream in the initial live broadcast stream after receiving the initial live broadcast stream, after adding the recognition text and the time interval information to the video stream and returning it to the client, the client can pre-parse and obtain the subtitle information carried in the live broadcast stream to be pushed, and determine the display time of the subtitle information according to the time interval information between the generation time of the subtitle information and the time when the live broadcast server receives the audio stream, that is, determine the display time of the complete subtitle corresponding to the live broadcast stream to be pushed, so as to display the complete subtitle in advance based on the display time, which is beneficial to reducing the cost of generating subtitles and improving the efficiency of subtitle generation, and is beneficial to avoiding the asynchrony between subtitles and video images or audio, thereby beneficial to satisfying the needs of users to watch live subtitles during live viewing, and beneficial to improving the user's live viewing experience.

Referring to FIG. 3 , FIG. 3 shows a flow chart of another live broadcast data processing method provided according to an embodiment of the present application, comprising the following steps:

Step 302: Receive and cache the live streaming stream to be pushed returned by the live streaming server.

Step 304, decode the live stream to be pushed, generate corresponding audio stream, video stream, subtitle information and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live server according to the generation time of the subtitle information and the reception time of the audio stream.

Step 306: Determine the display time of the subtitle information according to the time interval information.

Step 308: When it is determined that the playback condition of the live stream to be pushed is met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.

Specifically, the anchor broadcasts live through the intelligent terminal and pushes the generated initial live stream to the live server, which decodes the received initial live stream to generate an audio stream and a first video stream; then performs speech recognition on the audio stream to generate corresponding recognition text, and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; then uses the recognition text as subtitle information, and adds the subtitle information and the time interval information to the first video stream to generate a second video stream; then encodes the second video stream and the audio stream to generate a live stream to be pushed. When the user watches the anchor's live broadcast, the live server can push the live stream to be pushed to the user's client.

When the client is playing the live stream for the user, it can pull the live stream to be pushed of a certain length from the live server in advance and cache it, so that the client can decode the cached live stream to be pushed in advance and obtain the subtitle information contained in the live stream to be pushed. Then, according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the reception time of the audio stream by the live stream server, the display time of the subtitle information can be determined. When it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed based on the display time.

The above is a schematic scheme of another live data processing method of this embodiment. It should be noted that the technical scheme of the live data processing method and the technical scheme of the above-mentioned live data processing method belong to the same concept, and the details of the technical scheme of the live data processing method that are not described in detail can all be referred to the description of the technical scheme of the above-mentioned live data processing method.

Referring to Figure 4, the live broadcast data processing method provided by an embodiment of the present application is applied in the live broadcast field as an example to further illustrate the live broadcast data processing method. Figure 4 shows an interactive schematic diagram of a live broadcast data processing method provided by an embodiment of the present application applied in the live broadcast field, specifically comprising the following steps:

Step 402: The transcoding module receives the anchor's initial live stream.

Step 404: The transcoding module decodes the initial live stream to generate an audio stream and a first video stream.

Step 406: The transcoding module transmits the audio stream to the speech recognition service module via GRPC.

Step 408: The speech recognition service module performs speech recognition on the audio stream and generates corresponding recognition text.

Step 410, the speech recognition service module determines the generation time of the recognized text, and determines the time interval information between the generation time and the reception time of the audio stream, and determines the text type of the recognized text according to the text length and/or text semantics of the recognized text.

Step 412: The speech recognition service module transmits the recognized text, text type, and time interval information to the transcoding module via GRPC.

In step 414, the transcoding module uses the recognized text as subtitle information, and adds the subtitle information, time interval information, and text type to the first video stream to generate a second video stream.

Step 416: The transcoding module encodes the second video stream and the audio stream to generate a live stream to be pushed.

Step 418: The client pulls the live stream to be pushed from the live server.

The live broadcast server includes a transcoding module and a speech recognition service module.

In step 420, the client decodes the live stream to be pushed, generates a corresponding audio stream, a second video stream, subtitle information and time interval information, determines the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously plays the second video stream and the audio stream, and displays the subtitle information based on the display time.

Corresponding to the above method embodiment, the present application also provides a live data processing device embodiment, and FIG5 shows a schematic diagram of the structure of a live data processing device provided by an embodiment of the present application. As shown in FIG5, the device includes:

The decoding module 502 is configured to decode the received initial live stream to generate an audio stream and a first video stream;

The recognition module 504 is configured to perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream;

An adding module 506 is configured to use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream;

The encoding module 508 is configured to encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.

Optionally, the decoding module 502 is further configured to:

Optionally, the client decodes the live stream to be played, generates a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;

Optionally, the live broadcast data processing device further includes a determination module configured to:

Determining the text type of the recognized text according to the text length and/or text semantics of the recognized text;

Accordingly, the adding module 506 is further configured to:

Optionally, the client decodes the live stream to be pushed, generates a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;

Optionally, the live broadcast data processing device further includes a division module configured to:

Dividing the audio stream according to the frequency spectrum information corresponding to the audio stream to generate at least two audio segments;

Accordingly, the identification module 504 is further configured to:

Optionally, the identification module 504 is further configured to:

Optionally, the decoding module 502 is further configured to:

The received initial live stream is decoded by a transcoding module to generate an audio stream and a first video stream;

Accordingly, the identification module 504 is further configured to:

The audio stream is subjected to speech recognition by a speech recognition service module to generate corresponding recognition text.

Optionally, the identification module 504 is further configured to:

Optionally, the adding module 506 is further configured to:

Optionally, the identification module 504 is further configured to:

Optionally, the adding module 506 is further configured to:

Optionally, the live broadcast data processing device further includes a transmission module configured to:

The transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.

The above is a schematic scheme of a live data processing device of this embodiment. It should be noted that the technical scheme of the live data processing device and the technical scheme of the above-mentioned live data processing method belong to the same concept, and the details not described in detail in the technical scheme of the live data processing device can be referred to the description of the technical scheme of the above-mentioned live data processing method.

Corresponding to the above method embodiment, the present application also provides a live data processing device embodiment, and FIG6 shows a schematic diagram of the structure of another live data processing device provided by an embodiment of the present application. As shown in FIG6, the device includes:

The receiving module 602 is configured to receive and cache the live streaming stream to be pushed returned by the live streaming server;

The decoding module 604 is configured to decode the live broadcast stream to be pushed, generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live broadcast server according to the generation time of the subtitle information and the reception time of the audio stream;

A determination module 606 is configured to determine a display time of the subtitle information according to the time interval information;

The display module 608 is configured to synchronously play the video stream and the audio stream when it is determined that the playback condition of the live stream to be pushed is met, and to display the subtitle information based on the display time.

The above is a schematic scheme of another live data processing device of this embodiment. It should be noted that the technical scheme of the live data processing device and the technical scheme of the above another live data processing method belong to the same concept, and the details of the technical scheme of the live data processing device that are not described in detail can all be referred to the description of the technical scheme of the above another live data processing method.

Fig. 7 shows a block diagram of a computing device 700 according to an embodiment of the present application. The components of the computing device 700 include but are not limited to a memory 710 and a processor 720. The processor 720 is connected to the memory 710 via a bus 730, and the database 750 is used to store data.

The computing device 700 also includes an access device 740 that enables the computing device 700 to communicate via one or more networks 760. Examples of these networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more of any type of network interface (e.g., a network interface card (NIC)) whether wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a World Wide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and the like.

In one embodiment of the present application, the above components of the computing device 700 and other components not shown in FIG. 7 may also be connected to each other, for example, through a bus. It should be understood that the computing device structure block diagram shown in FIG. 7 is only for illustrative purposes and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as needed.

The computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or other types of mobile devices, or a stationary computing device such as a desktop computer or PC. The computing device 700 may also be a mobile or stationary server.

Among them, the processor 720 is used to execute the following computer-executable instructions, and the processor is used to execute the computer-executable instructions, wherein the processor implements the steps of the live data processing method when executing the computer-executable instructions.

The above is a schematic scheme of a computing device of this embodiment. It should be noted that the technical scheme of the computing device and the technical scheme of the live data processing method described above are of the same concept, and the details of the technical scheme of the computing device that are not described in detail can be found in the description of the technical scheme of the live data processing method described above.

An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the following steps:

Encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client;

Or implement the following steps:

Receive and cache the live stream to be pushed returned by the live server;

Determining the display time of the subtitle information according to the time interval information;

The above is a schematic scheme of a computer-readable storage medium of this embodiment. It should be noted that the technical scheme of the storage medium and the technical scheme of the live data processing method described above are of the same concept, and the details not described in detail in the technical scheme of the storage medium can be found in the description of the technical scheme of the live data processing method described above.

The above describes specific embodiments of the present application. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the specific order or continuous order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program codes, which may be in source code form, object code form, executable files or some intermediate forms, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, USB flash drive, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electric carrier signal, telecommunication signal and software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.

It should be noted that, for the aforementioned method embodiments, for the sake of simplicity of description, they are all expressed as a series of action combinations, but those skilled in the art should be aware that the embodiments of the present application are not limited by the described action sequence, because according to the embodiments of the present application, certain steps can be performed in other sequences or simultaneously. Secondly, those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the embodiments of the present application.

In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference can be made to the relevant descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are only used to help explain the present application. The optional embodiments do not describe all the details in detail, nor do they limit the invention to the specific implementation methods described. Obviously, many modifications and changes can be made according to the content of the embodiments of the present application. The present application selects and specifically describes these embodiments in order to better explain the principles and practical applications of the embodiments of the present application, so that those skilled in the art can understand and use the present application well. The present application is only limited by the claims and their full scope and equivalents.

Claims

A live broadcast data processing method, comprising:

Decoding the received initial live stream to generate an audio stream and a first video stream;

Performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream;

Using the recognized text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream;

The second video stream and the audio stream are encoded to generate a live stream to be pushed, and the live stream to be pushed is returned to the client.
According to the live broadcast data processing method according to claim 1, the decoding of the received initial live broadcast stream comprises:

Determine the live stream to be played cached by the client, and determine the generation time corresponding to the live stream to be played;

According to the live stream identifier corresponding to the live stream to be played and the generation time, an initial live stream corresponding to the live stream identifier within a preset time interval is obtained, and the initial live stream is decoded, wherein the preset time interval is later than the generation time.
According to the live broadcast data processing method of claim 2, the client decodes the live broadcast stream to be played, generates a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;

When it is determined that the playback condition of the live stream to be played is met, the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
The live broadcast data processing method according to claim 1 further comprises:

Determining the text type of the recognized text according to the text length and/or text semantics of the recognized text;

Correspondingly, the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:

Determine a target video frame in the first video stream according to the generation time;

The recognized text is used as subtitle information, and the subtitle information, the time interval information and the text type are used as video frame information of the target video frame and added to the first video stream.
According to the live broadcast data processing method of claim 4, the client decodes the live broadcast stream to be pushed, generates a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information and the text type;

In the case where it is determined that the text type is the target type, determining the display time of the subtitle information according to the playback time of the target video frame and the time interval information;

Determining, according to the display time, at least two video frames in the video stream for displaying the subtitle information, wherein the playback time of the at least two video frames is earlier than the playback time of the target video frame;

When it is determined that the playback conditions of the live stream to be pushed are met, the video stream and the audio stream are played synchronously, and based on the display time, the subtitle information is displayed in the at least two video frames and the target video frame.
The live broadcast data processing method according to claim 1 further comprises:

Dividing the audio stream according to the frequency spectrum information corresponding to the audio stream to generate at least two audio segments;

Correspondingly, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream includes:

Performing speech recognition on a target audio segment to generate a corresponding recognition text, wherein the target audio segment is one of the at least two audio segments;

The generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the target audio segment is determined.
According to the live broadcast data processing method of claim 1, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, comprising:

Splitting the audio stream according to a preset recognition window to generate at least one audio segment;

Performing speech recognition on a target audio segment to generate a corresponding recognition text, wherein the target audio segment is one of the at least one audio segment;

The generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.
According to the live broadcast data processing method of claim 1, the decoding of the received initial live broadcast stream to generate the audio stream and the first video stream comprises:

The received initial live stream is decoded by a transcoding module to generate an audio stream and a first video stream;

Accordingly, performing speech recognition on the audio stream to generate corresponding recognition text includes:

The audio stream is subjected to speech recognition by a speech recognition service module to generate corresponding recognition text.
According to the live broadcast data processing method of claim 8, the step of performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text comprises:

Splitting the audio stream according to a preset recognition window by a speech recognition service module to generate at least one audio segment;

Perform speech recognition on a first audio segment to generate a corresponding first recognition text, and return the first recognition text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.
According to the live broadcast data processing method of claim 9, the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream comprises:

The transcoding module determines a first target video frame in the first video stream according to a generation time of the first recognition text;

The first recognition text is used as the first subtitle information, and the time interval information between the generation time of the first subtitle information and the first recognition text and the receiving time of the audio stream is added to the first video stream as the video frame information of the first target video frame.
According to the live broadcast data processing method of claim 10, the step of performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text comprises:

Perform speech recognition on a second audio segment adjacent to the first audio segment among the at least two audio segments, generate a corresponding second recognized text, and return the first recognized text and the second recognized text to the transcoding module.
According to the live broadcast data processing method of claim 11, the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream comprises:

The transcoding module determines a second target video frame in the first video stream according to a generation time of the second recognition text;

The first recognition text and the second recognition text are used as second subtitle information, and the time interval information between the generation time of the second subtitle information and the second recognition text and the receiving time of the audio stream is used as video frame information of the second target video frame and added to the first video stream.
The live broadcast data processing method according to claim 8 further comprises:

The transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
A live broadcast data processing method, comprising:

Receive and cache the live stream to be pushed returned by the live server;

Decoding the live broadcast stream to be pushed, generating a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live broadcast server according to the generation time of the subtitle information and the reception time of the audio stream;

Determining the display time of the subtitle information according to the time interval information;

When it is determined that the playback condition of the live stream to be pushed is met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.
A live broadcast data processing system, comprising:

Live streaming server and client;

The live broadcast server is used to decode the received initial live broadcast stream, generate an audio stream and a first video stream, perform speech recognition on the audio stream, generate corresponding recognition text, determine the time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, add the subtitle information and the time interval information to the first video stream, generate a second video stream, encode the second video stream and the audio stream, generate a live broadcast stream to be pushed, and return the live broadcast stream to be pushed to the client;

The client is used to receive and cache the live stream to be pushed, decode the live stream to be pushed, obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously play the second video stream and the audio stream, and display the subtitle information based on the display time.
According to the live broadcast data processing system of claim 15, the decoding of the received initial live broadcast stream comprises:

Determine the live stream to be played cached by the client, and determine the generation time corresponding to the live stream to be played;

According to the live stream identifier corresponding to the live stream to be played and the generation time, an initial live stream corresponding to the live stream identifier within a preset time interval is obtained, and the initial live stream is decoded, wherein the preset time interval is later than the generation time.
According to the live broadcast data processing system of claim 16, the client decodes the live broadcast stream to be played, generates a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;

When it is determined that the playback condition of the live stream to be played is met, the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
According to the live broadcast data processing system of claim 15, the live broadcast server is further used for:

Determining the text type of the recognized text according to the text length and/or text semantics of the recognized text;

Correspondingly, the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:

Determine a target video frame in the first video stream according to the generation time;

The recognized text is used as subtitle information, and the subtitle information, the time interval information and the text type are used as video frame information of the target video frame, and added to the first video stream.
A computing device comprising:

Memory and processor;

The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, wherein the processor implements the steps of the live broadcast data processing method described in any one of claims 1-14 when executing the computer-executable instructions.
A computer-readable storage medium storing computer instructions, which, when executed by a processor, implement the steps of the live broadcast data processing method described in any one of claims 1 to 14.