CN115643424A - Live broadcast data processing method and system

Info

Publication number
CN115643424A
CN115643424A
Authority
CN
China
Prior art keywords
stream, live, text, audio, information
Prior art date
Legal status
Pending
Application number
CN202211311544.0A
Other languages
Chinese (zh)
Inventor
汤然
姜军
郑龙
刘永明
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202211311544.0A priority Critical patent/CN115643424A/en
Publication of CN115643424A publication Critical patent/CN115643424A/en
Priority to PCT/CN2023/106150 priority patent/WO2024087732A1/en
Pending legal-status Critical Current

Classifications

    • G10L 15/26 Speech to text systems
    • H04L 65/60 Network streaming of media packets
    • H04L 65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H04N 21/2187 Live feed
    • H04N 21/262 Content or additional data distribution scheduling, e.g. delaying a video stream transmission or generating play-lists
    • H04N 21/43 Processing of content or additional data; elementary client operations, e.g. synchronising decoder's clock
    • H04N 21/431 Generation of visual interfaces for content selection or interaction
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/488 Data services, e.g. news ticker

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the application provide a live data processing method and system. The live data processing method includes: decoding a received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate a corresponding recognition text; determining time interval information between the generation time of the recognition text and the reception time of the audio stream; using the recognition text as subtitle information and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed, which is returned to the client.

Description

Live broadcast data processing method and system
Technical Field
Embodiments of the application relate to the field of computer technology, and in particular to a live data processing method. One or more embodiments of the application also relate to a live data processing system, a computing device, and a computer-readable storage medium.
Background
With the rapid development of the live audio and video industry, existing data streaming technology has been pushed close to its limits on high-definition image quality, low latency, and audio-video synchronization, yet user requirements are still not fully met.
In some special scenarios, such as large sports events, large conference reports, and online education and training, live broadcasts need to be translated in real time and subtitles added. Conventionally, the live stream must first be recorded, the audio stream extracted, and the audio translated manually or by machine before being burned into the video, so subtitles are only available on replay. This approach offers no live experience to viewers with language barriers or hearing impairments. Although technologies that generate subtitles in real time during live broadcast have appeared, such as live bullet comments, they still have defects: the subtitles are out of sync with the sound and arrive late, making the viewing experience extremely poor and failing to meet user needs. An effective method is therefore needed to solve these problems.
Disclosure of Invention
In view of this, the present application provides a live data processing method. One or more embodiments of the present application further relate to a live data processing apparatus, a live data processing system, a computing device, and a computer-readable storage medium, so as to overcome the high cost, low efficiency, and subtitle delay of live subtitle generation in the prior art.
According to a first aspect of an embodiment of the present application, a live data processing method is provided, including:
decoding the received initial live stream to generate an audio stream and a first video stream;
performing speech recognition on the audio stream to generate a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the reception time of the audio stream;
using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream;
and encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning the live stream to be pushed to the client.
According to a second aspect of the embodiments of the present application, there is provided a live data processing apparatus, including:
a decoding module configured to decode the received initial live stream to generate an audio stream and a first video stream;
a recognition module configured to perform speech recognition on the audio stream, generate a corresponding recognition text, and determine time interval information between the generation time of the recognition text and the reception time of the audio stream;
an adding module configured to use the recognition text as subtitle information, add the subtitle information and the time interval information to the first video stream, and generate a second video stream;
and an encoding module configured to encode the second video stream and the audio stream, generate a live stream to be pushed, and return the live stream to be pushed to the client.
According to a third aspect of the embodiments of the present application, there is provided another live data processing method, including:
receiving and caching a to-be-pushed live stream returned by a live broadcast server;
decoding the live stream to be pushed to generate a corresponding audio stream, a video stream, subtitle information and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live server according to the generation time of the subtitle information and the receiving time of the audio stream;
determining the display time of the subtitle information according to the time interval information;
and under the condition that the playing condition of the live stream to be pushed is determined to be met, synchronously playing the video stream and the audio stream, and displaying the subtitle information based on the display time.
According to a fourth aspect of the embodiments of the present application, there is provided another live data processing apparatus, including:
the receiving module is configured to receive and cache a live stream to be pushed returned by the live server;
the decoding module is configured to decode the live stream to be pushed, and generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, where the time interval information is determined by the live server according to the generation time of the subtitle information and the receiving time of the audio stream;
a determining module configured to determine a presentation time of the subtitle information according to the time interval information;
and the display module is configured to synchronously play the video stream and the audio stream under the condition that the playing condition of the live stream to be pushed is determined to be met, and display the subtitle information based on the display time.
According to a fifth aspect of embodiments of the present application, there is provided a live data processing system, including:
a live broadcast server and a client;
the live broadcast server is configured to decode the received initial live stream to generate an audio stream and a first video stream, perform speech recognition on the audio stream to generate a corresponding recognition text, determine time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, add the subtitle information and the time interval information to the first video stream to generate a second video stream, encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client;
the client is configured to receive and cache the live stream to be pushed, decode it to obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, play the second video stream and the audio stream synchronously once the playing condition of the live stream to be pushed is met, and display the subtitle information based on the display time.
According to a sixth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, wherein the processor, when executing the computer-executable instructions, implements the steps of the live data processing method.
According to a seventh aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the live data processing method.
Embodiments of the application provide a live data processing method and system, wherein the live data processing method includes: decoding a received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate a corresponding recognition text and determining time interval information between the generation time of the recognition text and the reception time of the audio stream; using the recognition text as subtitle information and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed, which is returned to the client.
In the embodiments of the application, the live broadcast server performs speech recognition on an audio stream to generate a corresponding recognition text, and records the time interval between the generation time of the recognition text and the time at which the audio stream was received. This interval represents how long the server spent recognizing the audio in the initial live stream after receiving it. Once the recognition text and the time interval information are added to the video stream and returned to a client, the client can parse ahead of playback to obtain the subtitle information carried in the live stream to be pushed, and determine the display time of the subtitle information from the time interval between its generation time and the server's reception of the audio stream, i.e., the display time of the complete subtitle corresponding to the live stream to be pushed. The complete subtitle can then be displayed in advance based on that display time, which reduces the cost of generating subtitles, improves subtitle generation efficiency, and avoids desynchronization between subtitles and the video picture or audio, thereby meeting user needs during live viewing.
Drawings
FIG. 1 is an architecture diagram of a live data processing system provided by one embodiment of the present application;
FIG. 2 is a flowchart of a live data processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of another live data processing method according to an embodiment of the present application;
FIG. 4 is an interaction diagram of the live data processing method applied to the live broadcast field according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a live data processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another live data processing apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application, however, can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from its spirit and scope; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second and, similarly, a second may be referred to as a first, without departing from the scope of one or more embodiments of the present application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the noun terms referred to in one or more embodiments of the present application are explained.
Live broadcast: in the broad sense, live broadcast also covers television broadcasting; here it generally refers to live network video streaming. Live audio and video are pushed to the server in the form of a media stream (stream pushing). When a viewer watches the live broadcast, the server, upon receiving the user's request, delivers the video to the client's website, app, or player, where it plays in real time.
H.264 encoding: H.264 is a highly compressed digital video codec standard proposed by the Joint Video Team (JVT), formed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
H.265 encoding: H.265 is a video coding standard established by ITU-T VCEG as the successor to H.264. The H.265 standard builds on the existing H.264 standard, retaining some of its techniques while improving others.
SEI: Supplemental Enhancement Information, part of the bitstream; it provides a way to add extra information to a video bitstream and is one of the features of the H.264/H.265 video compression standards.
Speech recognition technology: a technique by which a machine converts speech signals into corresponding text or commands through a process of recognition and understanding.
gRPC: an RPC (Remote Procedure Call) framework; a high-performance, open-source, general-purpose framework developed on top of the Protocol Buffers serialization protocol and supporting numerous development languages.
Transcoding: video transcoding techniques convert video signals from one format to another.
In the application, a live data processing method is provided. One or more embodiments of the present application relate to a live data processing apparatus, a live data processing system, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
In specific implementations, the subtitle information of the embodiments of the present application may be presented on clients such as large video playback devices, game consoles, desktop computers, smartphones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, e-book readers, and other display terminals.
In addition, the subtitle information of the embodiments of the application can be applied to any video or audio capable of presenting subtitles; for example, subtitles can be presented in live and recorded video, and in online or offline audio such as music or audiobook playback.
Referring to fig. 1, fig. 1 shows an architecture diagram of a live data processing system according to an embodiment of the present application, including:
a live server 102 and a client 104;
the live broadcast server 102 is configured to decode the received initial live stream to generate an audio stream and a first video stream, perform speech recognition on the audio stream to generate a corresponding recognition text, determine time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, add the subtitle information and the time interval information to the first video stream to generate a second video stream, encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client 104;
the client 104 is configured to receive and cache the live stream to be pushed, decode it to obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, play the second video stream and the audio stream synchronously once the playing condition of the live stream to be pushed is met, and display the subtitle information based on the display time.
Specifically, in fig. 1, a user U1 broadcasts live through a smart terminal and pushes the generated initial live stream to the live server 102. The live server 102 decodes the received initial live stream to generate an audio stream and a first video stream; it then performs speech recognition on the audio stream to generate a corresponding recognition text and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; next, it uses the recognition text as subtitle information, adds the subtitle information and the time interval information to the first video stream to generate a second video stream, and encodes the second video stream and the audio stream to generate a live stream to be pushed. When users U2 and U3 watch U1's live broadcast, the live server pushes the live stream to be pushed to their clients 104.
While playing the live stream for the user, the client 104 may pull a certain duration of the live stream to be pushed from the live server in advance and cache it. The client 104 can thus decode the cached stream ahead of playback to obtain the subtitle information it contains, determine the display time of the subtitle information from the time interval between the subtitle's generation time and the live server 102's reception time of the audio stream, play the decoded video stream and audio stream synchronously once the playing condition of the stream is met, and display the subtitle information based on the display time.
In the embodiments of the application, with the above processing, the client can parse ahead of playback to obtain the subtitle information carried in the live stream to be pushed and determine its display time from the time interval between the subtitle's generation time and the server's reception of the audio stream, i.e., the display time of the complete subtitle corresponding to the stream. Displaying the complete subtitle in advance based on that time reduces the cost of subtitle generation, improves subtitle generation efficiency, and avoids desynchronization between subtitles and the video picture or audio, thereby meeting users' needs while watching live broadcasts and improving their viewing experience.
The above is an illustrative scheme of a live data processing system of the embodiment. It should be noted that the technical solution of the live data processing system and the technical solution of the live data processing method described below belong to the same concept, and details of the technical solution of the live data processing system, which are not described in detail, can be referred to in the following description of the technical solution of the live data processing method.
Referring to fig. 2, fig. 2 is a flowchart illustrating a live data processing method according to an embodiment of the present application, including the following steps:
step 202, decoding the received initial live stream to generate an audio stream and a first video stream.
Specifically, the live data processing method provided by the embodiments of the application is applied to a live server. The initial live stream is the live stream pushed to the live server by the anchor during broadcasting.
When other users want to watch the anchor's live broadcast, the live server pushes the live stream received from the anchor to those users' terminals (clients).
At present, most live broadcasts carry no subtitles, but in some special scenarios, such as large sports events, large conference reports, and online education and training, live broadcasts need to be translated in real time and subtitles added. Conventionally, the live stream must first be recorded, the audio stream extracted, and the audio translated manually or by machine before being burned into the video, so subtitles are only available on replay. This approach offers no live experience to viewers with language barriers or hearing impairments.
In addition, even the existing technologies that generate subtitles in real time during live broadcast, such as live bullet comments, often suffer from subtitles being out of sync with the video picture or sound, making the live viewing experience extremely poor and failing to meet user needs.
Based on this, in the embodiments of the application, after receiving the initial live stream pushed by the anchor, the live server may decode it to obtain an audio stream and a first video stream, perform speech recognition on the audio stream to obtain a corresponding recognition text, and add the recognition text to the first video stream as subtitle information to generate a second video stream. After the encoded audio stream and second video stream are pushed to a user's client, the client can decode them to obtain the subtitle information and display it while playing the audio stream and second video stream synchronously for the user. This avoids desynchronization between live subtitles and the live video picture or audio during real-time viewing, meets the user's need to see live subtitles while watching, and improves the live viewing experience.
In specific implementation, decoding the received initial live stream may specifically be implemented in the following manner:
determining a to-be-played live stream cached by the client, and determining the generation time corresponding to the to-be-played live stream;
and acquiring an initial live stream corresponding to the live stream identifier within a preset time interval according to the live stream identifier corresponding to the to-be-played live stream and the generation time, and decoding the initial live stream, wherein the preset time interval is later than the generation time.
In addition, the client decodes the live stream to be played to generate a corresponding audio stream to be played, a video stream to be played, a subtitle to be displayed, and the display time corresponding to the subtitle to be displayed;
and, once the playing condition of the live stream to be played is determined to be met, plays the video stream to be played and the audio stream to be played synchronously, and displays the subtitle to be displayed based on the display time.
Specifically, when playing the live stream, the client may pre-cache the live stream to be played for a period after the current playback time and parse that portion in advance to obtain the video stream to be played, the audio stream to be played, the subtitle to be displayed, and the subtitle's display time; then, once the playing condition of the live stream to be played is met, it plays the decoded video and audio streams synchronously and displays the subtitle based on the display time.
For example, if the client pre-caches 5s of the live stream to be played and the current playback time is t, the stream covering t to t+5s is pre-cached and parsed in advance, so the client can decide from the display time in the parse result whether to show the subtitle ahead of time, thereby reducing the delay between live subtitles and the live video picture or audio during real-time viewing.
Further, since the client's pre-cached duration is limited, a new portion of the stream to be played must be cached after the current portion plays out. For example, if the client pre-caches the stream from t to t+5s, then after the portion from t to t+3s has played it needs to cache the stream from t+5s to t+8s, i.e., fetch that portion from the live server.
Therefore, the live server can determine in advance the live stream to be played that the client has cached, determine the generation time (playback time) corresponding to it, then, according to the stream identifier and that generation time, obtain the initial live stream for the period after the generation time, process it to generate the live stream to be pushed containing the subtitle information, and push it to the client.
Based on the above, while a user watches a live broadcast in real time, the client pre-caches and parses ahead the live stream to be played for a period after the current playback time; likewise, the live server can determine in advance the stream the client has cached, and determine and parse the corresponding initial live stream. Although both parsing processes take time and thus introduce some live delay, in the embodiments of the application they run in parallel, and the client can decide from the display time in the parse result whether the subtitle needs to be shown ahead of time, reducing the delay between live subtitles and the live video picture or audio during real-time viewing.
Step 204, performing speech recognition on the audio stream, generating a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the receiving time of the audio stream.
Specifically, the live server decodes the initial live stream to obtain an audio stream and a first video stream, performs speech recognition on the audio stream to generate a corresponding recognition text, and then adds the recognition text to the first video stream as subtitle information to generate a second video stream, so that a client watching the live broadcast can, after obtaining the second video stream, display the subtitle information to the user while playing it.
In practice, however, speech recognition takes a certain amount of time after the live server decodes the audio stream. In that situation there is a time difference between the generation time of the recognition text and the time the audio stream, i.e., the initial live stream, was received; if this difference is ignored and only the recognition text and the initial live stream are pushed to the client, the displayed text may be out of sync with the video picture or sound.
Therefore, after the complete recognition text is obtained, in order to keep it in sync with the video picture and sound, the time spent generating it must be determined, i.e., the time interval between the generation time of the recognition text and the time the live server received the audio stream, so that the client, once it has the recognition text, can determine from this interval how far in advance the text needs to be displayed.
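To make this concrete, the following minimal Python sketch (illustrative only; the function and parameter names are hypothetical and not taken from the patent) shows one way a server could record this interval, by timestamping the audio stream on arrival and the recognition text on completion:

    import time

    # Minimal sketch: stamp the audio when it arrives and the text when
    # recognition finishes; the gap is the time interval information that
    # later travels with the subtitle.
    def recognize_with_interval(audio_chunk, recognize):
        received_at = time.monotonic()    # reception time of the audio stream
        text = recognize(audio_chunk)     # pluggable speech recognition call
        interval = time.monotonic() - received_at
        return text, interval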
In a specific implementation, after the live server decodes the initial live stream to obtain the audio stream, it may divide the audio stream according to the spectrum information corresponding to the audio stream to generate at least two audio segments;
correspondingly, performing speech recognition on the audio stream, generating a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the reception time of the audio stream includes:
performing speech recognition on a target audio segment to generate a corresponding recognition text, wherein the target audio segment is one of the at least two audio segments;
determining the generation time of the recognition text, and determining time interval information between the generation time and the reception time of the target audio segment.
Specifically, when performing speech recognition on an audio stream, the accuracy of the result is better ensured if the recognized audio is a complete sentence. Based on this, the embodiments of the application may first divide the audio stream according to its spectrum information to generate at least two audio segments: for example, the audio between any two adjacent points whose spectrum value is 0 (indicating a pause) is taken as one segment. Speech recognition is then performed on each segment to generate the corresponding recognition text, the generation time of the text is determined, and the time interval between that generation time and the reception time of each audio segment (the reception time of the audio stream or the initial live stream) is determined.
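A minimal sketch of this pause-based splitting is given below, assuming 16 kHz mono PCM samples in a NumPy array; the energy threshold stands in for the "spectrum value 0" test, and all names are illustrative:

    import numpy as np

    # Treat frames with (near-)zero energy as pauses and return the audio
    # between two pauses as one segment.
    def split_on_pauses(samples, rate=16000, frame_ms=20, eps=1e-4):
        frame = rate * frame_ms // 1000
        segments, start = [], None
        for i in range(0, len(samples) - frame, frame):
            silent = np.abs(samples[i:i + frame]).mean() < eps
            if not silent and start is None:
                start = i                          # a spoken segment begins
            elif silent and start is not None:
                segments.append(samples[start:i])  # a pause closes the segment
                start = None
        if start is not None:
            segments.append(samples[start:])       # trailing speech
        return segments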
Alternatively, performing speech recognition on the audio stream, generating a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the reception time of the audio stream includes:
splitting the audio stream according to a preset recognition window to generate at least one audio segment;
performing speech recognition on a target audio segment to generate a corresponding recognition text, wherein the target audio segment is one of the at least one audio segment;
determining the generation time of the recognition text, and determining time interval information between the generation time and the reception time of the audio stream.
Specifically, speech recognition on an audio stream usually uses a preset recognition window. The window length may be 0.5s-1s, in which case individual characters in the audio stream can be recognized; or it may be 1s-5s, in which case complete sentences in the audio stream can be recognized. The specific window length may be chosen according to actual requirements and is not limited here.
Performing speech recognition on the audio stream with a preset recognition window specifically means splitting the audio stream according to the preset recognition window to generate at least one audio segment, performing speech recognition on each segment to generate the corresponding recognition text, then determining the generation time of the recognition text and the time interval information between that generation time and the reception time of the audio stream.
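For comparison with the pause-based approach, a fixed-window split could look like the sketch below (illustrative names; window_s would be chosen in the 0.5s-1s or 1s-5s range discussed above):

    # Split an audio stream into fixed-length recognition windows.
    def split_by_window(samples, rate=16000, window_s=1.0):
        size = int(rate * window_s)
        return [samples[i:i + size] for i in range(0, len(samples), size)]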
In addition, in the embodiments of the application the live server includes a transcoding module and a speech recognition service module. Decoding the received initial live stream to generate an audio stream and a first video stream is specifically performed by the transcoding module, and performing speech recognition on the audio stream to generate the corresponding recognition text is specifically performed by the speech recognition service module.
The transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
Specifically, the data transmission channel may be gRPC.
Further, performing speech recognition on the audio stream through the speech recognition service module to generate a corresponding recognition text includes:
splitting the audio stream, by the speech recognition service module, according to a preset recognition window to generate at least one audio segment;
and performing speech recognition on a first audio segment to generate a corresponding first recognition text, and returning the first recognition text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.
Specifically, as mentioned above, speech recognition on an audio stream usually uses a preset recognition window, and the speech recognition service module works the same way: it splits the audio stream according to the preset recognition window to generate at least one audio segment, performs speech recognition on each segment to generate the corresponding recognition text, then determines the generation time of the recognition text and the time interval information between that generation time and the reception time of the audio stream.
The window length of the preset recognition window may be 0.5s-1s, in which case individual characters in the audio stream can be recognized; or it may be 1s-5s, in which case complete sentences can be recognized. The specific window length may be chosen according to actual requirements and is not limited here.
Step 206, using the recognized text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream.
Specifically, after the recognition text is generated and the time interval information between the generation time of the recognition text and the receiving time of the audio stream is determined, the recognition text may be used as the subtitle information, and the subtitle information and the time interval information are added to the first video stream to generate the second video stream.
The subtitle information may be written into the first video stream in the form of an SEI message to generate the second video stream.
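As an illustration of the SEI route, the sketch below packs the subtitle text, the time interval, and the text type into an H.264 user_data_unregistered SEI message (payload type 5). The UUID and the JSON layout are arbitrary choices for this example, and emulation-prevention bytes, which a real muxer must insert, are omitted for brevity:

    import json
    import uuid

    SUBTITLE_UUID = uuid.UUID("00000000-0000-0000-0000-000000000001").bytes

    # Build a user_data_unregistered SEI NAL unit carrying the subtitle payload.
    def subtitle_sei(text, interval_ms, text_type):
        payload = SUBTITLE_UUID + json.dumps({
            "subtitle": text,            # recognition text used as the subtitle
            "interval_ms": interval_ms,  # generation time minus reception time
            "type": text_type,           # e.g. "word" or "sentence"
        }).encode("utf-8")
        sei = bytearray(b"\x00\x00\x00\x01\x06\x05")  # start code, SEI NAL, type 5
        size = len(payload)
        while size >= 255:               # payload size, coded in 0xFF chunks
            sei.append(0xFF)
            size -= 255
        sei.append(size)
        sei += payload + b"\x80"         # payload plus RBSP stop bit
        return bytes(sei)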
In specific implementation, the text type of the recognized text can be determined according to the text length and/or the text semantics of the recognized text;
accordingly, the adding the recognized text as subtitle information and the time interval information to the first video stream includes:
determining a target video frame in the first video stream according to the generation time;
and using the recognized text as subtitle information, and adding the subtitle information, the time interval information, and the text type to the first video stream as video frame information of the target video frame.
Specifically, as mentioned above, speech recognition on an audio stream usually uses a preset recognition window: with a window length of 0.5s-1s, individual characters in the audio stream can be recognized, and with a window length of 1s-5s, complete sentences can be recognized.
Therefore, after the recognition text is generated, its text type can be determined according to the text length and/or text semantics of the recognition text; in practice, text types include, but are not limited to, words and sentences. The text semantics are used to determine whether the recognition text expresses complete semantics: if so, its text type is the sentence type; if not, it is a word type, i.e., a phrase when the text length is two or more characters, or a single character when the text length equals 1.
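A toy version of this decision (hedged: real semantic completeness would come from an NLP model, so sentence-final punctuation is used as a stand-in here) might be:

    # Classify a recognition text as sentence, word, or single character.
    def classify_text(text):
        if text and text[-1] in "。！？.!?":  # proxy for complete semantics
            return "sentence"
        return "char" if len(text) == 1 else "word"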
After the text type of the recognized text is determined, a target video frame in the first video stream can be determined according to the generation time of the recognized text, the recognized text is used as subtitle information, and the subtitle information, the time interval information and the text type are used as video frame information of the target video frame and are added to the first video stream.
In practice, the last video frame of the video segment corresponding to the target audio segment is usually taken as the target video frame, and the subtitle information, time interval information, and text type are added to the first video stream as that frame's video frame information to generate the second video stream. After obtaining the second video stream, the client can then determine which subtitle information to display according to the text type; sentence-type subtitle information is usually preferred, to ensure the viewing quality of live subtitles.
In addition, in the embodiments of the application, after the speech recognition service module splits the audio stream according to a preset recognition window to generate at least one audio segment and performs speech recognition on a first audio segment to generate a corresponding first recognition text, using the recognition text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
determining, by the transcoding module, a first target video frame in the first video stream according to the generation time of the first recognition text;
and using the first recognition text as first subtitle information, and adding the first subtitle information and the time interval information between the generation time of the first recognition text and the reception time of the audio stream to the first video stream as video frame information of the first target video frame.
Specifically, whether the audio stream is split into one or at least two audio segments, the speech recognition service module may recognize each segment in turn; after the recognition text for any segment is generated, it is returned to the transcoding module. The transcoding module determines the target video frame in the first video stream (usually the last video frame of the video segment corresponding to that audio segment) according to the generation time of the recognition text, uses the recognition text as subtitle information, and adds the subtitle information and the time interval information between the text's generation time and the audio stream's reception time to the first video stream as the target frame's video frame information.
Further, performing speech recognition on the audio stream through the speech recognition service module to generate a corresponding recognition text includes:
performing speech recognition on a second audio segment adjacent to the first audio segment among the at least two audio segments to generate a corresponding second recognition text, and returning the first recognition text and the second recognition text to the transcoding module.
Accordingly, using the recognition text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
determining, by the transcoding module, a second target video frame in the first video stream according to the generation time of the second recognition text;
and using the first recognition text and the second recognition text as second subtitle information, and adding the second subtitle information and the time interval information between the generation time of the second recognition text and the reception time of the audio stream to the first video stream as video frame information of the second target video frame.
Specifically, as described above, when the audio stream is split into at least two audio segments, the speech recognition service module may first perform speech recognition on the first of them to generate a corresponding first recognition text, and the transcoding module uses the first recognition text as subtitle information and adds it, together with the time interval information between its generation time and the audio stream's reception time, to the first video stream as the video frame information of the first target video frame (usually the last video frame of the video segment corresponding to the first audio segment).
Then speech recognition may be performed on the second audio segment adjacent to the first to generate a corresponding second recognition text; the transcoding module uses the first recognition text and the second recognition text together as subtitle information and adds them, with the time interval information between the second text's generation time and the audio stream's reception time, to the first video stream as the video frame information of the second target video frame (usually the last video frame of the video segment corresponding to the second audio segment), and so on.
After the speech recognition service module obtains the first recognition text, it can cache it temporarily; once the second recognition text is recognized, since the first audio segment is adjacent to the second, the two recognition texts can be returned together as the subtitle information of the video stream. The speech recognition service module thus reuses the cached texts, improving the accuracy of the subtitle recognition result.
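The caching behaviour described here can be pictured with the small sketch below (class and method names are invented for illustration): each newly recognized segment is appended to the cached texts, and the concatenation is returned as the subtitle for the current target frame:

    # Accumulate recognition texts for adjacent audio segments.
    class RecognitionCache:
        def __init__(self):
            self._parts = []

        def add(self, recognized):
            self._parts.append(recognized)
            return "".join(self._parts)  # first + second (+ ...) texts

        def reset(self):
            self._parts.clear()          # call once a sentence is complete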
Step 208, encoding the second video stream and the audio stream, generating a live stream to be pushed, and returning the live stream to be pushed to the client.
Specifically, after generating the second video stream, the live server may encode the second video stream and the audio stream to generate the live stream to be pushed, and push it to the user's client when the user wants to watch the live broadcast.
In specific implementation, the client decodes the live stream to be pushed to generate a corresponding audio stream, a corresponding video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;
under the condition that the text type is determined to be the target type, determining the display time of the subtitle information according to the playing time of the target video frame and the time interval information;
determining at least two frames of video frames used for displaying the subtitle information in the video stream according to the display time, wherein the playing time of the at least two frames of video frames is earlier than that of the target video frame;
and under the condition that the playing condition of the live stream to be pushed is determined to be met, synchronously playing the video stream and the audio stream, and displaying the subtitle information in the at least two frames of video frames and the target video frame based on the display time.
Specifically, as described above, when playing a live stream the client may pull a certain duration of the live stream to be pushed from the live server in advance and cache it, then decode it ahead of playback to obtain the subtitle information corresponding to a target video frame, the text type of that subtitle information, and the time interval between the subtitle's generation time and the live server's reception time of the audio stream. If the text type indicates the subtitle belongs to the target type, i.e., the sentence type, the client may determine the subtitle's display time from that time interval combined with the playback time of the target video frame, then determine the other video frames before the target frame that should also display the subtitle. Once the playing condition of the live stream to be pushed is met, the client plays the decoded video and audio streams synchronously and displays the subtitle information in those frames and the target video frame based on the display time.
For example, suppose the current time point is t and, while playing the live stream, the client pre-buffers the to-be-pushed live stream covering t to t+5s, then decodes it to obtain the subtitle information it carries. Suppose the decoding result includes recognition texts corresponding to the video frames at the five time points t+1s, t+2s, t+3s, t+4s and t+5s, and the text type of the recognition text at t+5s is the sentence type, so that text is displayed preferentially. In this case the client examines the time interval information for that recognition text: if the interval between the generation time of the recognition text and the receiving time of the audio stream is 4s, while the gap between the generation time and the video frame at t+5s is 1s, then the subtitle information needs to be displayed 3s in advance. In other words, the sentence spoken by the anchor corresponds roughly to playback from t+2s to t+5s, so the client starts displaying the subtitle information from the video frame at t+2s and keeps it on screen through the target video frame at t+5s, by which time the complete sentence has been both spoken and displayed.
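Using the figures from this example, the client-side timing computation can be sketched as follows (function and parameter names are illustrative, not taken from the patent):

def subtitle_display_window(target_frame_time_s: float,
                            interval_s: float,
                            generation_gap_s: float):
    """Return (start, end) times, relative to the buffering origin t, for
    showing a subtitle attached to the frame at target_frame_time_s.

    interval_s: generation time of the subtitle minus the server's receiving
        time of the audio stream (carried in the pushed live stream).
    generation_gap_s: generation time of the subtitle minus the playing time
        of the target video frame (derived by the client while decoding).
    """
    advance_s = interval_s - generation_gap_s  # how far ahead display begins
    return target_frame_time_s - advance_s, target_frame_time_s

# Figures from the example above: frame at t+5s, interval 4s, gap 1s
print(subtitle_display_window(5.0, 4.0, 1.0))  # (2.0, 5.0) -> show from t+2s to t+5s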
An embodiment of the application thus provides a live data processing method and system. The live data processing method includes: decoding the received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate a corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the receiving time of the audio stream; taking the recognition text as subtitle information and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a to-be-pushed live stream, which is returned to the client.
In the embodiment of the application, the live broadcast server performs speech recognition on the audio stream to generate the corresponding recognition text and records the time interval between the generation time of that text and the time at which the audio stream was received. Because this interval represents the time the live broadcast server spent performing speech recognition on the audio stream after receiving the initial live stream, once the recognition text and the time interval information have been added to the video stream and returned, the client can parse the subtitle information carried in the to-be-pushed live stream in advance and determine its display time from that interval, i.e., the display time of the complete subtitle corresponding to the to-be-pushed live stream. Displaying the complete subtitle in advance based on that display time not only reduces the cost of generating subtitles and improves subtitle generation efficiency, but also avoids desynchronization between the subtitle and the video picture or audio, thereby helping to satisfy the user's subtitle needs while watching the live broadcast.
Referring to fig. 3, which shows a flowchart of another live data processing method provided in an embodiment of the present application, the method includes the following steps:
Step 302, receiving and caching the to-be-pushed live stream returned by the live server.
Step 304, decoding the live stream to be pushed to generate a corresponding audio stream, a video stream, subtitle information and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live server according to the generation time of the subtitle information and the receiving time of the audio stream.
Step 306, determining the display time of the subtitle information according to the time interval information.
Step 308, under the condition that it is determined that the playing condition of the live stream to be pushed is met, synchronously playing the video stream and the audio stream, and displaying the subtitle information based on the display time.
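Steps 302 to 308 can be summarized in the sketch below; the DecodedStream container and the timing rule are illustrative assumptions, since the patent leaves the client's internal representation open:

from dataclasses import dataclass

@dataclass
class DecodedStream:
    """Hypothetical container for what step 304 yields from the cached stream."""
    audio: bytes
    video: bytes
    subtitle: str
    interval_s: float  # generation time of the subtitle minus audio receipt time

def play_when_ready(stream: DecodedStream, target_frame_time_s: float,
                    playing_condition_met: bool) -> None:
    # Step 306: one plausible reading is to begin showing the subtitle
    # interval_s ahead of the frame it is attached to.
    display_start_s = target_frame_time_s - stream.interval_s
    # Step 308: play audio and video in sync and show the subtitle on time.
    if playing_condition_met:
        print(f"play A/V in sync; show {stream.subtitle!r} "
              f"from t={display_start_s}s to t={target_frame_time_s}s")

play_when_ready(DecodedStream(b"", b"", "hello everyone", 3.0),
                target_frame_time_s=5.0, playing_condition_met=True)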
Specifically, an anchor live-streams through a smart terminal and pushes the generated initial live stream to the live broadcast server, which decodes the received initial live stream to generate an audio stream and a first video stream; the server then performs speech recognition on the audio stream to generate a corresponding recognition text and determines the time interval information between the generation time of the recognition text and the receiving time of the audio stream; finally, the recognition text is taken as subtitle information, the subtitle information and the time interval information are added to the first video stream to generate a second video stream, and the second video stream and the audio stream are encoded into the live stream to be pushed. When a user watches this anchor's live broadcast, the live broadcast server pushes the live stream to be pushed to that user's client.
While playing the live stream for the user, the client can pull a certain duration of the to-be-pushed live stream from the live broadcast server in advance and cache it. It can therefore decode the cached stream ahead of time to obtain the subtitle information it contains, determine the display time of that subtitle information from the time interval information between the generation time of the subtitle information and the time at which the live broadcast server received the audio stream, and then, once the playing conditions of the to-be-pushed live stream are met, play the decoded video stream and audio stream synchronously while displaying the subtitle information based on the display time.
In the embodiment of the application, this processing allows the client to parse the subtitle information carried in the to-be-pushed live stream in advance and to determine its display time, i.e., the display time of the complete subtitle corresponding to the stream, from the time interval information between the generation time of the subtitle information and the time at which the live server received the audio stream. Displaying the complete subtitle in advance based on that display time reduces subtitle generation cost, improves subtitle generation efficiency, avoids desynchronization between the subtitle and the video picture or audio, satisfies the user's need for live subtitles while watching, and thereby improves the user's live viewing experience.
The above is an illustrative scheme of another live data processing method of this embodiment. It should be noted that the technical solution of this method and the technical solution of the live data processing method described earlier belong to the same concept; for details not described here, refer to the description of that earlier technical solution.
Referring to fig. 4, the live data processing method provided in the embodiment of the present application is further described below, taking its application in the live broadcast field as an example. Fig. 4 shows an interaction diagram of the live data processing method applied in the live broadcast field, which specifically includes the following steps:
In step 402, the transcoding module receives an initial live stream from an anchor.
In step 404, the transcoding module decodes the initial live stream to generate an audio stream and a first video stream.
In step 406, the transcoding module transmits the audio stream to the speech recognition service module via gRPC.
Step 408, the speech recognition service module performs speech recognition on the audio stream to generate a corresponding recognition text.
In step 410, the speech recognition service module determines the generation time of the recognized text, determines the time interval information between the generation time and the receiving time of the audio stream, and determines the text type of the recognized text according to the text length and/or text semantics of the recognized text (one possible heuristic for this decision is sketched after these steps).
In step 412, the speech recognition service module transmits the recognized text, the text type and the time interval information to the transcoding module via gRPC.
In step 414, the transcoding module takes the recognized text as the caption information, and adds the caption information, the time interval information, and the text type to the first video stream to generate the second video stream.
In step 416, the transcoding module encodes the second video stream and the audio stream to generate a live stream to be pushed.
Step 418, the client pulls the live stream to be pushed from the live server.
The live broadcast server comprises a transcoding module and a voice recognition service module.
Step 420, the client decodes the live stream to be pushed to generate a corresponding audio stream, second video stream, subtitle information and time interval information, determines the display time of the subtitle information according to the time interval information, and, under the condition that it determines that the playing conditions of the live stream to be pushed are met, synchronously plays the second video stream and the audio stream and displays the subtitle information based on the display time.
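As noted at step 410, the patent does not fix how the text type is derived from text length and/or text semantics; one plausible heuristic, with assumed thresholds, treats sufficiently long text that ends in sentence punctuation as the sentence type:

SENTENCE_ENDINGS = ("。", "！", "？", ".", "!", "?")
MIN_SENTENCE_LEN = 4  # assumed threshold, not from the patent

def classify_text_type(recognition_text: str) -> str:
    """Return 'sentence' (the target type) or 'fragment', judged by simple
    length and punctuation cues; a real system might also use semantics."""
    text = recognition_text.strip()
    if len(text) >= MIN_SENTENCE_LEN and text.endswith(SENTENCE_ENDINGS):
        return "sentence"
    return "fragment"

print(classify_text_type("Hello everyone, welcome to the stream!"))  # sentence
print(classify_text_type("Hello every"))                             # fragment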
In the embodiment of the application, this processing enables the client to parse in advance the subtitle information carried in the live stream to be pushed and to determine its display time, i.e., the display time of the complete subtitle corresponding to the stream, from the time interval information between the generation time of the subtitle information and the time at which the live server received the audio stream. Displaying the complete subtitle in advance on that basis reduces the cost of generating subtitles, improves subtitle generation efficiency, and avoids desynchronization between the subtitle and the video picture or audio, thereby satisfying the user's subtitle needs during live viewing and improving the user's live viewing experience.
Corresponding to the above method embodiment, the present application further provides an embodiment of a live data processing apparatus, and fig. 5 shows a schematic structural diagram of a live data processing apparatus provided in an embodiment of the present application. As shown in fig. 5, the apparatus includes:
a decoding module 502 configured to decode the received initial live stream to generate an audio stream and a first video stream;
a recognition module 504 configured to perform voice recognition on the audio stream, generate a corresponding recognition text, and determine time interval information between a generation time of the recognition text and a reception time of the audio stream;
an adding module 506, configured to take the recognition text as subtitle information and add the subtitle information and the time interval information to the first video stream to generate a second video stream;
an encoding module 508 configured to encode the second video stream and the audio stream, generate a to-be-pushed live stream, and return the to-be-pushed live stream to the client.
Optionally, the decoding module 502 is further configured to:
determining a to-be-played live stream cached by the client, and determining the generation time corresponding to the to-be-played live stream;
and acquiring an initial live stream corresponding to the live stream identifier within a preset time interval according to the live stream identifier corresponding to the to-be-played live stream and the generation time, and decoding the initial live stream, wherein the preset time interval is later than the generation time.
Optionally, the client decodes the live stream to be played to generate a corresponding audio stream to be played, a video stream to be played, a subtitle to be displayed, and display time corresponding to the subtitle to be displayed;
and under the condition that the playing condition of the live stream to be played is determined to be met, synchronously playing the video stream to be played and the audio stream to be played, and displaying the subtitles to be displayed based on the display time.
Optionally, the live data processing apparatus further includes a determining module configured to:
determining the text type of the recognized text according to the text length and/or text semantics of the recognized text;
accordingly, the adding module 506 is further configured to:
determining a target video frame in the first video stream according to the generation time;
and taking the recognition text as subtitle information, and adding the subtitle information, the time interval information and the text type to the first video stream as video frame information of the target video frame.
Optionally, the client decodes the live stream to be pushed to generate a corresponding audio stream, a corresponding video stream, and video frame information of a target video frame in the video stream, where the video frame information includes the subtitle information, the time interval information, and the text type;
under the condition that the text type is determined to be the target type, determining the display time of the subtitle information according to the playing time of the target video frame and the time interval information;
determining at least two frames of video frames used for displaying the subtitle information in the video stream according to the display time, wherein the playing time of the at least two frames of video frames is earlier than that of the target video frame;
and under the condition that the playing condition of the live stream to be pushed is determined to be met, synchronously playing the video stream and the audio stream, and displaying the subtitle information in the at least two frames of video frames and the target video frame based on the display time.
Optionally, the live data processing apparatus further includes a dividing module configured to:
dividing the audio stream according to the frequency spectrum information corresponding to the audio stream to generate at least two audio segments;
accordingly, the identification module 504 is further configured to:
performing voice recognition on a target audio clip to generate a corresponding recognition text, wherein the target audio clip is one of the at least two audio clips;
determining a generation time of the recognized text, and determining time interval information between the generation time and a reception time of the target audio piece.
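As an illustration of spectrum-based division, the sketch below uses short-frame energy as a crude spectral cue and cuts the stream at silent frames; the frame length and threshold are assumed values, and a production system would use genuine spectral features:

import numpy as np

def split_on_silence(samples, frame_len=1600, energy_threshold=1e-4):
    """Split an audio sample array into segments at low-energy (silent) frames.
    frame_len and energy_threshold are illustrative, not from the patent."""
    segments, current = [], []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if np.mean(frame ** 2) < energy_threshold:  # silent frame: close a segment
            if current:
                segments.append(np.concatenate(current))
                current = []
        else:
            current.append(frame)
    if current:
        segments.append(np.concatenate(current))
    return segments

# Two bursts of noise separated by 0.2 s of silence (at 16 kHz) -> two segments
audio = np.concatenate([0.1 * np.random.randn(4000),
                        np.zeros(3200),
                        0.1 * np.random.randn(4000)])
print(len(split_on_silence(audio)))  # 2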
Optionally, the identifying module 504 is further configured to:
splitting the audio stream according to a preset identification window to generate at least one audio clip;
performing voice recognition on a target audio clip to generate a corresponding recognition text, wherein the target audio clip is one of the at least one audio clip;
determining a generation time of the recognized text, and determining time interval information between the generation time and a reception time of the audio stream.
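The fixed-window alternative is simpler; a minimal sketch assuming a 16 kHz sample rate and a 2-second recognition window (both values illustrative, since the patent only says "preset"):

def split_by_window(samples, sample_rate=16000, window_s=2.0):
    """Cut the audio stream into consecutive fixed-length recognition windows."""
    window = int(sample_rate * window_s)
    return [samples[i:i + window] for i in range(0, len(samples), window)]

chunks = split_by_window([0.0] * 80000)  # 5 seconds of audio at 16 kHz
print([len(c) for c in chunks])          # [32000, 32000, 16000]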
Optionally, the decoding module 502 is further configured to:
decoding the received initial live stream through a transcoding module to generate an audio stream and a first video stream;
accordingly, the identification module 504 is further configured to:
and performing voice recognition on the audio stream through a voice recognition service module to generate a corresponding recognition text.
Optionally, the identifying module 504 is further configured to:
splitting the audio stream through a voice recognition service module according to a preset recognition window to generate at least one audio clip;
and performing voice recognition on a first audio segment to generate a corresponding first recognition text, and returning the first recognition text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.
Optionally, the adding module 506 is further configured to:
the transcoding module determines a first target video frame in the first video stream according to the generation time of the first identification text;
and taking the first recognition text as first subtitle information, and adding the first subtitle information, together with the time interval information between the generation time of the first recognition text and the receiving time of the audio stream, to the first video stream as video frame information of the first target video frame.
Optionally, the identifying module 504 is further configured to:
and performing voice recognition on a second audio segment adjacent to the first audio segment in the at least two audio segments to generate a corresponding second recognition text, and returning the first recognition text and the second recognition text to the transcoding module.
Optionally, the adding module 506 is further configured to:
the transcoding module determines a second target video frame in the first video stream according to the generation time of the second identification text;
and taking the first recognition text and the second recognition text together as second subtitle information, and adding the second subtitle information, together with the time interval information between the generation time of the second recognition text and the receiving time of the audio stream, to the first video stream as video frame information of the second target video frame.
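How the transcoding module maps a generation time to a target video frame is not spelled out in the patent; one natural reading, sketched here, picks the latest frame whose timestamp does not exceed the recognition text's generation time:

def find_target_frame(frame_times_s, generation_time_s):
    """Index of the latest frame at or before generation_time_s; falls back
    to the first frame if the text predates every frame (assumed rule)."""
    candidates = [i for i, t in enumerate(frame_times_s) if t <= generation_time_s]
    return candidates[-1] if candidates else 0

# Frames at 0 ms, 40 ms, 80 ms, 120 ms; text generated at 90 ms -> index 2
print(find_target_frame([0.0, 0.04, 0.08, 0.12], 0.09))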
Optionally, the live data processing apparatus further includes a transmission module configured to:
and the transcoding module transmits the audio stream to the voice recognition service module through a data transmission channel.
The above is a schematic scheme of a live data processing apparatus of this embodiment. It should be noted that the technical solution of the live data processing apparatus and the technical solution of the live data processing method described above belong to the same concept, and details of the technical solution of the live data processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the live data processing method described above.
Corresponding to the above method embodiment, the present application further provides an embodiment of a live data processing apparatus, and fig. 6 shows a schematic structural diagram of another live data processing apparatus provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a receiving module 602 configured to receive and cache a to-be-pushed live stream returned by a live server;
a decoding module 604, configured to decode the live stream to be pushed, and generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, where the time interval information is determined by the live server according to the generation time of the subtitle information and the receiving time of the audio stream;
a determining module 606 configured to determine a presentation time of the subtitle information according to the time interval information;
a display module 608 configured to, in a case that it is determined that a playing condition of the live stream to be pushed is met, play the video stream and the audio stream synchronously, and display the subtitle information based on the display time.
The above is a schematic scheme of another live data processing apparatus of the present embodiment. It should be noted that the technical solution of this live data processing apparatus and the technical solution of the other live data processing method described above belong to the same concept; for details not described here, refer to the description of that method's technical solution.
FIG. 7 illustrates a block diagram of a computing device 700 provided according to an embodiment of the present application. The components of the computing device 700 include, but are not limited to, memory 710 and a processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.
Computing device 700 also includes an access device 740, which enables computing device 700 to communicate via one or more networks 760. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 740 may include one or more of any type of network interface, wired or wireless, e.g., a Network Interface Card (NIC), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above components of the computing device 700 and other components not shown in fig. 7 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 7 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.
The processor 720 is configured to execute computer-executable instructions, and the steps of the live data processing method are implemented when the processor executes those instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the live data processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can all be referred to in the description of the technical solution of the live data processing method.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the live data processing method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the live data processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the live data processing method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art should understand that the embodiments of the application are not limited by the described order of actions, because some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the embodiments of the application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments of the application and its practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (17)

1. A live data processing method is characterized by comprising the following steps:
decoding the received initial live stream to generate an audio stream and a first video stream;
performing voice recognition on the audio stream, generating a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the receiving time of the audio stream;
taking the identification text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream;
and coding the second video stream and the audio stream to generate a live stream to be pushed, and returning the live stream to be pushed to the client.
2. The live data processing method according to claim 1, wherein said decoding the received initial live stream comprises:
determining a to-be-played live stream cached by the client, and determining the generation time corresponding to the to-be-played live stream;
and acquiring an initial live stream corresponding to the live stream identifier within a preset time interval according to the live stream identifier corresponding to the to-be-played live stream and the generation time, and decoding the initial live stream, wherein the preset time interval is later than the generation time.
3. The live broadcast data processing method according to claim 2, wherein the client decodes the live broadcast stream to be played to generate a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and display time corresponding to the subtitles to be displayed;
and under the condition that the playing condition of the live stream to be played is determined to be met, synchronously playing the video stream to be played and the audio stream to be played, and displaying the subtitle to be displayed based on the display time.
4. The live data processing method according to claim 1, further comprising:
determining the text type of the recognized text according to the text length and/or text semantics of the recognized text;
accordingly, the adding the recognized text as subtitle information and the time interval information to the first video stream includes:
determining a target video frame in the first video stream according to the generation time;
and taking the identified text as subtitle information, and taking the subtitle information, the time interval information and the text type as video frame information of the target video frame, and adding the subtitle information, the time interval information and the text type to the first video stream.
5. The live data processing method according to claim 4, wherein the client decodes the live stream to be pushed to generate a corresponding audio stream, a corresponding video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;
under the condition that the text type is determined to be the target type, determining the display time of the subtitle information according to the playing time of the target video frame and the time interval information;
determining at least two frames of video frames used for displaying the subtitle information in the video stream according to the display time, wherein the playing time of the at least two frames of video frames is earlier than that of the target video frame;
and under the condition that the playing condition of the live stream to be pushed is determined to be met, synchronously playing the video stream and the audio stream, and displaying the subtitle information in the at least two frames of video frames and the target video frame based on the display time.
6. The live data processing method according to claim 1, further comprising:
dividing the audio stream according to the frequency spectrum information corresponding to the audio stream to generate at least two audio segments;
correspondingly, the performing speech recognition on the audio stream, generating a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the receiving time of the audio stream includes:
performing voice recognition on a target audio clip to generate a corresponding recognition text, wherein the target audio clip is one of the at least two audio clips;
determining a generation time of the recognized text, and determining time interval information between the generation time and a reception time of the target audio piece.
7. The live data processing method according to claim 1, wherein performing speech recognition on the audio stream, generating a corresponding recognition text, and determining time interval information between a generation time of the recognition text and a reception time of the audio stream includes:
splitting the audio stream according to a preset identification window to generate at least one audio clip;
performing voice recognition on a target audio clip to generate a corresponding recognition text, wherein the target audio clip is one of the at least one audio clip;
determining a generation time of the recognized text, and determining time interval information between the generation time and a reception time of the audio stream.
8. The live data processing method of claim 1, wherein decoding the received initial live stream to generate an audio stream and a first video stream comprises:
decoding the received initial live stream through a transcoding module to generate an audio stream and a first video stream;
correspondingly, the performing speech recognition on the audio stream to generate a corresponding recognition text includes:
and performing voice recognition on the audio stream through a voice recognition service module to generate a corresponding recognition text.
9. The live data processing method of claim 8, wherein the performing voice recognition on the audio stream through a voice recognition service module to generate a corresponding recognition text comprises:
splitting the audio stream through a voice recognition service module according to a preset recognition window to generate at least one audio clip;
and performing voice recognition on a first audio segment to generate a corresponding first recognition text, and returning the first recognition text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.
10. The live data processing method according to claim 9, wherein the adding the identification text as subtitle information and the time interval information to the first video stream includes:
the transcoding module determines a first target video frame in the first video stream according to the generation time of the first identification text;
and taking the first identification text as first caption information, and adding the first caption information, time interval information between the generation time of the first identification text and the receiving time of the audio stream as video frame information of the first target video frame to the first video stream.
11. The live data processing method of claim 10, wherein the performing voice recognition on the audio stream through a voice recognition service module to generate a corresponding recognition text comprises:
and performing voice recognition on a second audio segment adjacent to the first audio segment in the at least two audio segments to generate a corresponding second recognition text, and returning the first recognition text and the second recognition text to the transcoding module.
12. The live data processing method according to claim 11, wherein the adding the identification text as subtitle information and the time interval information to the first video stream includes:
the transcoding module determines a second target video frame in the first video stream according to the generation time of the second identification text;
and taking the first identification text and the second identification text as second subtitle information, and adding the second subtitle information, time interval information between the generation time of the second identification text and the receiving time of the audio stream as video frame information of the second target video frame to the first video stream.
13. The live data processing method of claim 8, further comprising:
and the transcoding module transmits the audio stream to the voice recognition service module through a data transmission channel.
14. A live data processing method is characterized by comprising the following steps:
receiving and caching a to-be-pushed live stream returned by a live server;
decoding the live stream to be pushed to generate a corresponding audio stream, a video stream, subtitle information and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live server according to the generation time of the subtitle information and the receiving time of the audio stream;
determining the display time of the subtitle information according to the time interval information;
and under the condition that the playing condition of the live stream to be pushed is determined to be met, synchronously playing the video stream and the audio stream, and displaying the subtitle information based on the display time.
15. A live data processing system, comprising:
a live broadcast server and a client;
the live broadcast server is used for decoding the received initial live broadcast stream, generating an audio stream and a first video stream, performing voice recognition on the audio stream, generating a corresponding recognition text, determining time interval information between the generation time of the recognition text and the receiving time of the audio stream, using the recognition text as subtitle information, adding the subtitle information and the time interval information to the first video stream, generating a second video stream, encoding the second video stream and the audio stream, generating a to-be-pushed live broadcast stream, and returning the to-be-pushed live broadcast stream to the client;
the client is used for receiving and caching the to-be-pushed live stream, decoding the to-be-pushed live stream, obtaining the audio stream, the second video stream, the subtitle information and the time interval information, determining the display time of the subtitle information according to the time interval information, synchronously playing the second video stream and the audio stream under the condition that the playing condition of the to-be-pushed live stream is determined to be met, and displaying the subtitle information based on the display time.
16. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor when executing the computer-executable instructions performs the steps of the live data processing method of any one of claims 1-14.
17. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, implement the steps of the live data processing method of any of claims 1-14.