WO2024087732A1 - Livestreaming data processing method and system - Google Patents

Livestreaming data processing method and system

Info

Publication number: WO2024087732A1
Application number: PCT/CN2023/106150 (CN2023106150W)
Authority: WO (WIPO/PCT)
Prior art keywords: stream, live, text, audio, time
Other languages: French (fr), Chinese (zh)
Inventors: Tang Ran (汤然), Jiang Jun (姜军), Zheng Long (郑龙), Liu Yongming (刘永明)
Original assignee: Shanghai Bilibili Technology Co., Ltd. (上海哔哩哔哩科技有限公司)
Application filed by Shanghai Bilibili Technology Co., Ltd.
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date
Publication date
Publication of WO2024087732A1 (patent/WO2024087732A1/en)


Classifications

    • G: PHYSICS
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00: Speech recognition
                    • G10L 15/26: Speech to text systems
    • H: ELECTRICITY
        • H04: ELECTRIC COMMUNICATION TECHNIQUE
            • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
                • H04L 65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
                    • H04L 65/60: Network streaming of media packets
                        • H04L 65/61: Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
            • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
                    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; operations thereof
                        • H04N 21/21: Server components or server architectures
                            • H04N 21/218: Source of audio or video content, e.g. local disk arrays
                                • H04N 21/2187: Live feed
                        • H04N 21/25: Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
                            • H04N 21/262: Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
                    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
                        • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
                            • H04N 21/431: Generation of visual interfaces for content selection or interaction; content or additional data rendering
                            • H04N 21/439: Processing of audio elementary streams
                            • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                                • H04N 21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
                        • H04N 21/47: End-user applications
                            • H04N 21/488: Data services, e.g. news ticker

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular to a live data processing method.
  • One or more embodiments of the present application also relate to a live data processing system, a computing device, and a computer-readable storage medium.
  • the embodiments of the present application provide a live broadcast data processing method.
  • One or more embodiments of the present application also relate to a live broadcast data processing device, a live broadcast data processing system, a computing device, and a computer-readable storage medium to solve the technical defects of high cost, low efficiency and delayed subtitle generation in the related art.
  • a live broadcast data processing method, including: decoding the received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream; using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning the live stream to be pushed to the client.
  • a live broadcast data processing device including:
  • a decoding module configured to decode the received initial live stream to generate an audio stream and a first video stream
  • a recognition module configured to perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream;
  • an adding module configured to use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream;
  • the encoding module is configured to encode the second video stream and the audio stream, generate a live stream to be pushed, and return the live stream to be pushed to the client.
  • another live broadcast data processing method, including: receiving and caching the live stream to be pushed returned by the live broadcast server; decoding the live stream to be pushed to generate a corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information; determining the display time of the subtitle information according to the time interval information; and, when it is determined that the playback conditions of the live stream to be pushed are met, playing the video stream and the audio stream synchronously and displaying the subtitle information based on the display time.
  • another live broadcast data processing device including:
  • a receiving module configured to receive and cache the live stream to be pushed returned by the live broadcast server;
  • a decoding module is configured to decode the live broadcast stream to be pushed, generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live broadcast server according to the generation time of the subtitle information and the reception time of the audio stream;
  • a determination module configured to determine a display time of the subtitle information according to the time interval information
  • the display module is configured to synchronously play the video stream and the audio stream when it is determined that the playback conditions of the live stream to be pushed are met, and to display the subtitle information based on the display time.
  • a live broadcast data processing system including:
  • the live broadcast server is used to decode the received initial live broadcast stream, generate an audio stream and a first video stream, perform speech recognition on the audio stream, generate corresponding recognition text, determine the time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, add the subtitle information and the time interval information to the first video stream, generate a second video stream, encode the second video stream and the audio stream, generate a live broadcast stream to be pushed, and return the live broadcast stream to be pushed to the client;
  • the client is used to receive and cache the live stream to be pushed, decode the live stream to be pushed, obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously play the second video stream and the audio stream, and display the subtitle information based on the display time.
  • a computing device, including a memory and a processor, wherein the memory is used to store computer-executable instructions and the processor is used to execute the computer-executable instructions; the processor implements the steps of the live data processing method when executing the computer-executable instructions.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, the steps of the live data processing method are implemented.
  • An embodiment of the present application implements a live broadcast data processing method and system, wherein the live broadcast data processing method includes decoding a received initial live broadcast stream, generating an audio stream and a first video stream, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream, generating a second video stream, encoding the second video stream and the audio stream, generating a live broadcast stream to be pushed, and returning the live broadcast stream to be pushed to the client.
  • the live broadcast server performs speech recognition on the audio stream, generates the corresponding recognition text, and records the time interval between the generation time of the recognition text and the time of receiving the audio stream. Since this time interval can characterize the time the live broadcast server spends performing speech recognition on the audio stream of the initial live stream after receiving it, once the recognition text and the time interval information are added to the video stream and returned to the client, the client can parse in advance the subtitle information carried in the live stream to be pushed and determine its display time according to the time interval information between the generation time of the subtitle information and the time when the live broadcast server received the audio stream, that is, determine the display time of the complete subtitles corresponding to the live stream to be pushed, so as to display the complete subtitles in advance based on that display time. This helps reduce the cost of generating subtitles, improves the efficiency of subtitle generation, and helps avoid asynchrony between the subtitles and the video picture or audio.
  • FIG1 is an architecture diagram of a live data processing system provided by an embodiment of the present application.
  • FIG2 is a flow chart of a live broadcast data processing method provided by an embodiment of the present application.
  • FIG3 is a flow chart of another live data processing method provided by an embodiment of the present application.
  • FIG4 is an interactive schematic diagram of a live broadcast data processing method provided by an embodiment of the present application applied to the live broadcast field;
  • FIG5 is a schematic diagram of the structure of a live data processing device provided by an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of another live data processing device provided by an embodiment of the present application.
  • FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present application.
  • first, second, etc. may be used to describe various information in one or more embodiments of the present application, these information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other.
  • the first may also be referred to as the second, and similarly, the second may also be referred to as the first.
  • the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • Live streaming: broadly speaking, live broadcasting also includes live television; here it generally refers to online live video. Live audio and video are pushed to a server in the form of a media stream (stream pushing). If viewers are watching the live broadcast, the server, after receiving a user's request, transmits the video to the website, app, or client player, where it is played in real time.
  • H264 generally refers to H.264.
  • H.264 is a highly compressed digital video codec standard proposed by the Joint Video Team (JVT), which is jointly composed of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
  • JVT: Joint Video Team
  • VCEG: ITU-T Video Coding Experts Group
  • MPEG: ISO/IEC Moving Picture Experts Group
  • H.265 is a new video coding standard developed by ITU-T VCEG after H.264.
  • the H.265 standard builds on the video coding standard H.264, retaining some of the original technologies while improving related technologies.
  • SEI: Supplemental Enhancement Information, which is part of the bitstream. It provides a way to add additional information into the video bitstream and is one of the features of video compression standards such as H.264/H.265.
  • Speech recognition technology: a technology that converts speech signals into corresponding text or commands through machine recognition and understanding.
  • gRPC: an RPC (Remote Procedure Call) framework; a high-performance, open-source, general-purpose RPC framework built on the ProtoBuf (Protocol Buffers) serialization protocol, with support for many development languages.
  • Video transcoding technology: converts a video signal from one format into another.
  • a live data processing method is provided.
  • One or more embodiments of the present application simultaneously relate to a live data processing apparatus, a live data processing system, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
  • the subtitle information of the embodiments of the present application can be presented on clients such as large video playback devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, e-book readers, and other display terminals.
  • subtitle information of the embodiments of the present application can be applied to any video or audio that can present subtitles, for example, subtitles can be presented in live or recorded videos, and subtitles can be presented in audio of online or offline songs or books.
  • FIG. 1 shows an architecture diagram of a live broadcast data processing system provided according to an embodiment of the present application, including:
  • a live broadcast server 102 and a client 104;
  • the live broadcast server 102 is used to decode the received initial live broadcast stream, generate an audio stream and a first video stream, perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream, encode the second video stream and the audio stream, generate a live broadcast stream to be pushed, and return the live broadcast stream to be pushed to the client 104;
  • the client 104 is used to receive and cache the live stream to be pushed, decode the live stream to be pushed, obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously play the second video stream and the audio stream, and display the subtitle information based on the display time.
  • user U1 broadcasts live through a smart terminal and pushes the generated initial live stream to the live server 102.
  • the live server 102 decodes the received initial live stream to generate an audio stream and a first video stream; then performs speech recognition on the audio stream to generate a corresponding recognition text, and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; then uses the recognition text as subtitle information, and adds the subtitle information and the time interval information to the first video stream to generate a second video stream; then encodes the second video stream and the audio stream to generate a live stream to be pushed.
  • the live server can push the live stream to be pushed to the client 104 of user U2 and user U3.
  • the client 104 can pull the live stream to be pushed of a certain length from the live server in advance and cache it, so that the client 104 can decode the cached live stream to be pushed in advance and obtain the subtitle information contained in the live stream to be pushed. Then, according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the reception time of the audio stream by the live stream server 102, the display time of the subtitle information can be determined. When it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • the above-mentioned processing method is conducive to enabling the client to pre-parse and obtain the subtitle information carried in the live stream to be pushed, and determine the display time of the subtitle information according to the time interval information between the generation time of the subtitle information and the time when the live server receives the audio stream, that is, determine the display time of the complete subtitles corresponding to the live stream to be pushed, so as to display the complete subtitles in advance based on the display time, which is conducive to reducing the cost of generating subtitles and improving the efficiency of subtitle generation, and is conducive to avoiding the asynchrony between subtitles and video images or audio, thereby meeting the user's needs for watching live subtitles during live viewing and improving the user's live viewing experience.
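To make the client-side flow above concrete, the following is a minimal sketch, assuming decoded frames expose the SEI-carried subtitle text and the time interval in milliseconds. The `Frame` fields and function name are illustrative assumptions, not the patent's interface, and using the full interval as the display lead is a simplification of the description's bookkeeping.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Frame:
    pts_ms: int                     # playback timestamp of the video frame
    subtitle: Optional[str] = None  # SEI-carried recognition text, if any
    interval_ms: int = 0            # text generation time minus audio reception time

def schedule_subtitles(buffered: List[Frame]) -> List[Tuple[int, int, str]]:
    """Return (display_start_ms, display_end_ms, text) for each buffered subtitle."""
    schedule = []
    for f in buffered:
        if f.subtitle is None:
            continue
        # The recognition delay indicates how long before the target frame the
        # speech was actually uttered, so surface the text that much earlier.
        start = max(0, f.pts_ms - f.interval_ms)
        schedule.append((start, f.pts_ms, f.subtitle))
    return schedule
```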
  • the above is a schematic scheme of a live data processing system of this embodiment. It should be noted that the technical scheme of the live data processing system and the technical scheme of the following live data processing method belong to the same concept, and the details not described in detail in the technical scheme of the live data processing system can be referred to the description of the technical scheme of the following live data processing method.
  • FIG. 2 shows a flow chart of a live broadcast data processing method provided according to an embodiment of the present application, comprising the following steps:
  • Step 202 decode the received initial live stream to generate an audio stream and a first video stream.
  • the live broadcast data processing method provided in the embodiment of the present application is applied to a live broadcast server.
  • the initial live broadcast stream is the live broadcast stream pushed to the live broadcast server by the anchor during the live broadcast process.
  • when the host is broadcasting live through a smart terminal, the live stream generated during the broadcast can be pushed to the live broadcast server through the terminal, so that when other users need to watch the host's live broadcast, the live broadcast server can push that live stream to those users' terminals (clients).
  • the live server can decode the initial live stream to obtain an audio stream and a first video stream, and can perform voice recognition on the audio stream to obtain the corresponding recognized text, and then add the recognized text as subtitle information to the first video stream to generate a second video stream, so that after the encoding results of the audio stream and the second video stream are pushed to the user's client, the client can decode and obtain the subtitle information, and can display the subtitle information while synchronously playing the audio stream and the second video stream for the user, thereby avoiding the problem of the live subtitles being out of sync with the live video screen or audio during the user's real-time viewing of the live broadcast, so as to meet the user's needs for viewing live subtitles during the live viewing process and improve the user's live viewing experience.
  • decoding the received initial live stream can be implemented in the following ways:
  • an initial live stream corresponding to the live stream identifier within a preset time interval is obtained, and the initial live stream is decoded, wherein the preset time interval is later than the generation time.
  • the client decodes the live stream to be played, generates a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;
  • the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
  • when playing a live stream, the client can pre-cache the live stream to be played within a period of time after the current playback time, and parse this part of the stream in advance to obtain the video stream to be played, the audio stream to be played, the subtitles to be displayed, and the display time corresponding to those subtitles; then, when it is determined that the playback conditions of the live stream to be played are met, the decoded video and audio streams are played synchronously, and the subtitles are displayed based on the display time.
  • for example, the live stream to be played within t to t+5s is pre-cached and parsed in advance, so that the display time of the subtitles in the parsing result can be used to decide whether the subtitles need to be displayed ahead of time, thereby reducing the delay between the live subtitles and the live video picture or audio while the user watches in real time.
  • if the client pre-caches the live stream to be played within t to t+5s, then after the portion within t to t+3s has been played, it needs to cache the live stream within t+5s to t+8s, that is, it needs to obtain the live stream to be played within t+5s to t+8s from the live broadcast server.
  • the live broadcast server can pre-determine the live broadcast stream to be played that has been cached by the client, and determine the generation time (playback time) corresponding to the cached live broadcast stream to be played, and then obtain the initial live broadcast stream corresponding to the live broadcast stream identifier within a period of time after the generation time based on the live broadcast stream identifier and generation time corresponding to the cached live broadcast stream to be played, and process the initial live broadcast stream to generate a live broadcast stream to be pushed containing subtitle information, and push it to the client.
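A small sketch of the sliding prefetch just described, assuming the client requests fixed-length spans from the server; the numbers mirror the t/t+3s/t+5s example above and the function name is hypothetical.

```python
def next_fetch_span(play_pos_s: float, buffered_until_s: float,
                    ahead_s: float = 5.0, chunk_s: float = 3.0):
    """Return the next (start, end) span to request, or None if enough is cached.

    E.g. with t..t+5s cached and playback at t+3s, only 2s remain buffered,
    so the client requests t+5s..t+8s, as in the example above."""
    if buffered_until_s - play_pos_s < ahead_s:
        return (buffered_until_s, buffered_until_s + chunk_s)
    return None
```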
  • when the user is watching the live broadcast in real time through the client, the client pre-caches the live stream to be played within a period of time after the current playback time and parses this part of the stream in advance.
  • the live broadcast server can likewise pre-determine the live stream to be played that the client has cached, and determine and parse the corresponding initial live stream. Although the server's parsing of the initial live stream and the client's parsing of the live stream to be played both take a certain amount of time and introduce some delay into the live broadcast, in the embodiment of the present application the two are performed in parallel, and the client can decide whether to display subtitles in advance according to the display time of the subtitles in the parsing result, thereby reducing the delay between the live subtitles and the live video picture or audio while the user watches in real time.
  • Step 204 Perform speech recognition on the audio stream to generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream.
  • the live broadcast server decodes the initial live broadcast stream and obtains the audio stream and the first video stream, it can perform speech recognition on the audio stream to generate corresponding recognition text, and then add the recognition text as subtitle information to the first video stream to generate a second video stream, so that after the client for watching the live broadcast obtains the second video stream, it can display the subtitle information to the user during the playback of the second video stream.
  • after the live broadcast server decodes and obtains the audio stream, performing speech recognition on the audio stream often takes a certain amount of time.
  • since the embodiment of the present application works with the complete recognition text, in order to avoid asynchrony between the recognition text and the video picture or sound, the time consumed to generate the recognition text must be determined, that is, the time interval between the generation time of the recognition text and the time when the live broadcast server received the audio stream, so that after obtaining the recognition text the client can determine from this time interval how far in advance to display it.
  • the audio stream may be divided according to the spectrum information corresponding to the audio stream to generate at least two audio segments;
  • performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream includes:
  • the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the target audio segment is determined.
  • the embodiment of the present application can first divide the audio stream according to the spectrum information corresponding to the audio stream to generate at least two audio segments. For example, according to the spectrum information, the audio stream between any two adjacent points with a spectrum value of 0 (indicating a pause) is regarded as an audio segment. Then, speech recognition is performed on each audio segment to generate a corresponding recognition text, determine the generation time of the recognition text, and determine the time interval information between the generation time and the reception time of each audio segment (the reception time of the audio stream or the initial live stream).
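As a rough illustration of this pause-based division, the sketch below splits a mono PCM buffer wherever short-frame energy falls to near zero; a production system would work on spectral features with smoothing, and the frame length and threshold here are arbitrary assumptions.

```python
import numpy as np

def split_on_pauses(samples: np.ndarray, sr: int,
                    frame_ms: int = 20, thresh: float = 1e-3):
    """Split mono PCM into speech segments separated by near-silent frames."""
    hop = max(1, int(sr * frame_ms / 1000))
    n_frames = len(samples) // hop
    energy = np.array([float(np.mean(samples[i * hop:(i + 1) * hop] ** 2))
                       for i in range(n_frames)])
    silent = energy < thresh
    segments, start = [], None
    for i, is_silent in enumerate(silent):
        if not is_silent and start is None:
            start = i * hop                      # segment opens on the first voiced frame
        elif is_silent and start is not None:
            segments.append(samples[start:i * hop])
            start = None
    if start is not None:                        # trailing segment with no closing pause
        segments.append(samples[start:])
    return segments
```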
  • performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream including:
  • the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.
  • a preset recognition window is usually used; the window length of the preset recognition window can be 0.5s-1s, or it can be 1s-5s.
  • the audio stream is subjected to speech recognition according to a preset recognition window. Specifically, the audio stream is split according to the preset recognition window to generate at least one audio segment, speech recognition is performed on each audio segment to generate a corresponding recognition text, and then the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.
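The fixed-window variant is simpler still; a minimal sketch, with the window length chosen from the 0.5s-1s or 1s-5s ranges given in this description:

```python
def split_by_window(samples, sr: int, window_s: float = 1.0):
    """Split audio into fixed-length recognition windows (0.5-1s tends to yield
    word-level results, 1-5s sentence-level results, per the description)."""
    step = int(sr * window_s)
    return [samples[i:i + step] for i in range(0, len(samples), step)]
```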
  • the live broadcast server includes a transcoding module and a speech recognition service module. Therefore, the received initial live broadcast stream is decoded to generate an audio stream and a first video stream. Specifically, the received initial live broadcast stream is decoded by the transcoding module to generate an audio stream and a first video stream; speech recognition is performed on the audio stream to generate a corresponding recognition text. Specifically, speech recognition is performed on the audio stream by the speech recognition service module to generate a corresponding recognition text.
  • the transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
  • the data transmission channel may be GRPC.
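For the gRPC channel between the transcoding module and the speech recognition service, the client side could look roughly as follows. The service, message, and field names (`Recognizer`, `AudioSegment`, and so on) are invented for illustration; in practice they would come from the project's own .proto definitions compiled by protoc.

```python
import grpc

# asr_pb2 / asr_pb2_grpc stand for modules generated by protoc from a
# hypothetical recognizer.proto; they are assumptions, not a real package.
import asr_pb2
import asr_pb2_grpc

def recognize_segment(pcm: bytes, reception_time_ms: int):
    with grpc.insecure_channel("speech-recognition:50051") as channel:
        stub = asr_pb2_grpc.RecognizerStub(channel)
        reply = stub.Recognize(asr_pb2.AudioSegment(
            pcm=pcm, reception_time_ms=reception_time_ms))
        # Generation time minus reception time yields the time interval
        # information described above.
        return reply.text, reply.generation_time_ms - reception_time_ms
```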
  • performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text includes:
  • a preset recognition window is usually used.
  • a preset recognition window can also be used. Speech recognition is performed on the audio stream according to the preset recognition window. Specifically, the audio stream is split according to the preset recognition window to generate at least one audio segment, speech recognition is performed on each audio segment, corresponding recognition text is generated, and then the generation time of the recognition text is determined, and the time interval information between the generation time and the receiving time of the audio stream is determined.
  • the window length of the preset recognition window may be 0.5s-1s, and the preset recognition window may be used to perform speech recognition of the audio stream, so that a single word in the audio stream can be recognized; or the window length of the preset recognition window may be 1s-5s, and the preset recognition window may be used to perform speech recognition of the audio stream, so that a complete sentence in the audio stream can be recognized.
  • the specific window length may be determined according to actual needs and is not limited here.
  • Step 206 Use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream.
  • the recognition text can be used as subtitle information, and the subtitle information and the time interval information can be added to the first video stream to generate the second video stream.
  • the subtitle information may be written to the first video stream in the form of SEI to generate the second video stream.
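A minimal sketch of packaging the subtitle metadata as an H.264 SEI message (payload type 5, user_data_unregistered). The JSON payload layout and the UUID value are assumptions, and emulation-prevention bytes (0x03 insertion) are omitted for brevity.

```python
import json

# user_data_unregistered requires a 16-byte application UUID (assumed value).
SEI_UUID = bytes.fromhex("11223344556677889900aabbccddeeff")

def build_subtitle_sei(subtitle: str, interval_ms: int, text_type: str) -> bytes:
    body = SEI_UUID + json.dumps({
        "subtitle": subtitle,
        "interval_ms": interval_ms,
        "text_type": text_type,
    }).encode("utf-8")
    sei = bytearray([5])            # payload_type 5: user_data_unregistered
    size = len(body)
    while size >= 255:              # payload_size uses 0xFF continuation bytes
        sei.append(255)
        size -= 255
    sei.append(size)
    sei += body
    sei.append(0x80)                # rbsp_trailing_bits
    # Annex B start code + NAL unit header 0x06 (SEI), then the payload.
    return b"\x00\x00\x00\x01\x06" + bytes(sei)
```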
  • the text type of the recognized text may also be determined according to the text length and/or text semantics of the recognized text;
  • the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
  • the recognized text is used as subtitle information
  • the subtitle information, the time interval information and the text type are used as video frame information of the target video frame, and added to the first video stream.
  • the embodiment of the present application can also determine the text type of the recognized text according to the text length and/or text semantics of the recognized text.
  • the text type includes but is not limited to characters, words, sentences, etc.
  • text semantics are used to determine whether the recognized text expresses complete semantics. If so, the text type of the recognized text can be determined as the sentence type; if not, then when the text length of the recognized text is greater than or equal to two characters, the text type is the word type, and when the text length is equal to one character, the text type is the character type.
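Restated as code, the rule reads as below; the semantic-completeness check is stubbed as a boolean argument, since the description does not specify how it is computed.

```python
def classify_recognized_text(text: str, expresses_complete_semantics: bool) -> str:
    """Sentence if semantically complete; otherwise word (>= 2 chars) or character."""
    if expresses_complete_semantics:
        return "sentence"
    return "word" if len(text) >= 2 else "character"
```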
  • the target video frame in the first video stream can be determined according to the generation time of the recognized text, the recognized text can be used as subtitle information, and the subtitle information, time interval information and text type can be used as video frame information of the target video frame and added to the first video stream.
  • the last video frame in the video segment corresponding to the target audio segment can usually be used as the target video frame, and the subtitle information, time interval information and text type can be used as its video frame information and added to the first video stream to generate a second video stream.
  • the client can determine the subtitle information to be displayed according to the text type, and usually gives priority to sentence-type subtitle information for display to ensure the subtitle viewing effect of the live broadcast.
  • the voice recognition service module splits the audio stream according to a preset recognition window to generate at least one audio segment, and performs voice recognition on the first audio segment to generate a corresponding first recognition text
  • the recognition text is used as subtitle information
  • the subtitle information and the time interval information are added to the first video stream, including:
  • the transcoding module determines a first target video frame in the first video stream according to a generation time of the first recognition text
  • the first recognition text is used as the first subtitle information, and the first subtitle information, together with the time interval information between the generation time of the first recognition text and the receiving time of the audio stream, is added to the first video stream as the video frame information of the first target video frame.
  • when the speech recognition service module splits the audio stream into one or at least two audio segments, it can perform speech recognition on each audio segment in turn, and after generating the recognition text corresponding to any audio segment, return that recognition text to the transcoding module.
  • the transcoding module determines the target video frame in the first video stream (usually the last video frame of the video segment corresponding to that audio segment) according to the generation time of the recognition text, uses the recognition text as subtitle information, and adds the subtitle information, together with the time interval information between the generation time of the recognition text and the receiving time of the audio stream, to the first video stream as the video frame information of the target video frame.
  • performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text includes:
  • the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
  • the transcoding module determines a second target video frame in the first video stream according to a generation time of the second recognition text
  • the first recognition text and the second recognition text are used as the second subtitle information, and the second subtitle information, together with the time interval information between the generation time of the second recognition text and the receiving time of the audio stream, is added to the first video stream as the video frame information of the second target video frame.
  • when the speech recognition service module splits the audio stream into at least two audio segments, it can first perform speech recognition on the first of the at least two audio segments to generate corresponding first recognition text; the transcoding module then uses the first recognition text as subtitle information, and adds the subtitle information, together with the time interval information between the generation time of the first recognition text and the receiving time of the audio stream, to the first video stream as the video frame information of the first target video frame (usually the last video frame of the video segment corresponding to the first audio segment).
  • speech recognition can be performed on the second audio segment adjacent to the first audio segment in the at least two audio segments to generate a corresponding second recognition text.
  • the transcoding module then uses the first recognition text and the second recognition text together as subtitle information, and adds the subtitle information, together with the time interval information between the generation time of the second recognition text and the receiving time of the audio stream, to the first video stream as the video frame information of the second target video frame (usually the last video frame of the video segment corresponding to the second audio segment), and so on.
  • after the speech recognition service module obtains the first recognition text through speech recognition, it can temporarily store it; after the second recognition text is obtained, since the first audio segment is adjacent to the second audio segment, the first recognition text and the second recognition text can be returned together as subtitle information of the video stream, allowing the speech recognition service module to reuse its cache and improve the accuracy of the subtitle recognition results.
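A sketch of this cache reuse for adjacent segments; the class and method names are illustrative.

```python
class RecognitionCache:
    """Temporarily stores earlier segment texts so that adjacent segments can be
    returned together as one subtitle, as in the first/second text example."""

    def __init__(self):
        self._parts = []

    def add(self, recognized_text: str) -> str:
        self._parts.append(recognized_text)
        return "".join(self._parts)   # e.g. first + second recognition text

    def flush(self):                  # e.g. once a sentence-type subtitle is emitted
        self._parts.clear()
```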
  • Step 208 encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.
  • the live broadcast server can encode the second video stream and the audio stream to generate a live broadcast stream to be pushed, and if the user has a need to watch live broadcast, the live broadcast stream to be pushed can be pushed to the user's client.
  • the client decodes the live stream to be pushed, generates a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;
  • the video stream and the audio stream are played synchronously, and based on the display time, the subtitle information is displayed both in the determined video frames preceding the target video frame and in the target video frame.
  • when playing a live stream, the client may pull the live stream to be pushed for a certain length of time after the current playback time from the live server in advance and cache it, and then decode the cached stream in advance to obtain the subtitle information corresponding to the target video frame in the live stream to be pushed, the text type of that subtitle information, and the time interval information between the generation time of the subtitle information and the live server's reception time of the audio stream.
  • the display time of the subtitle information may be determined according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the reception time of the audio stream by the live stream server, combined with the playback time of the target video frame, and then determine other video frames in the live stream to be pushed that are located before the target video frame and are used to display the subtitle information according to the display time, and when it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed in the determined video frame and the target video frame based on the display time.
  • the client pre-caches the live stream to be pushed from t to t+5s during the process of playing the live stream, and then decodes to obtain the subtitle information carried in this live stream to be pushed.
  • if the decoding result contains recognition text corresponding to the video frames at the five time points t+1s, t+2s, t+3s, t+4s, and t+5s, and the recognition text corresponding to t+5s is of the sentence type, that recognition text can be displayed with priority. In this case, suppose the time interval indicated by the time interval information corresponding to this recognition text is 4s, and the interval between the generation time of the recognition text and the video frame at t+5s is 1s; this means the subtitle information (recognition text) needs to be displayed 3s in advance, and can also be understood as the host having expressed a complete sentence from t+3s to t+5s. Therefore, when playback of the live stream to be pushed reaches t+3s, the subtitle information can be displayed, and the display can end at t+5s, realizing early display of the complete subtitles and avoiding delay between the subtitles and the video picture or sound; the subtitle information is not removed until it is detected that other subtitle information needs to be displayed.
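Restating the numbers of this example in code (one interpretation of the description's bookkeeping; times are in seconds after t):

```python
frame_pts = 5.0   # the sentence-type subtitle is attached to the frame at t+5s
interval  = 4.0   # SEI time interval: generation time minus audio reception time
gen_gap   = 1.0   # generation time minus the t+5s frame's timestamp

generation_time = frame_pts + gen_gap         # t+6s
advance         = interval - gen_gap          # 3s: display the subtitle 3s early
display_start   = generation_time - advance   # t+3s, when the sentence began
display_end     = frame_pts                   # t+5s, when the sentence ended
```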
  • An embodiment of the present application implements a live broadcast data processing method and system, wherein the live broadcast data processing method includes decoding a received initial live broadcast stream, generating an audio stream and a first video stream, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream, generating a second video stream, encoding the second video stream and the audio stream, generating a live broadcast stream to be pushed, and returning the live broadcast stream to be pushed to the client.
  • the live broadcast server performs speech recognition on the audio stream, generates the corresponding recognition text, and records the time interval between the generation time of the recognition text and the time of receiving the audio stream. Since this time interval can characterize the time the live broadcast server spends performing speech recognition on the audio stream of the initial live stream after receiving it, once the recognition text and the time interval information are added to the video stream and returned to the client, the client can parse in advance the subtitle information carried in the live stream to be pushed and determine its display time according to the time interval information between the generation time of the subtitle information and the time when the live broadcast server received the audio stream, that is, determine the display time of the complete subtitles corresponding to the live stream to be pushed, so as to display the complete subtitles in advance based on that display time. This helps reduce the cost of generating subtitles, improves the efficiency of subtitle generation, and helps avoid asynchrony between the subtitles and the video picture or audio.
  • FIG. 3 shows a flow chart of another live broadcast data processing method provided according to an embodiment of the present application, comprising the following steps:
  • Step 302 Receive and cache the live stream to be pushed returned by the live broadcast server.
  • Step 304 decode the live stream to be pushed, generate corresponding audio stream, video stream, subtitle information and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live server according to the generation time of the subtitle information and the reception time of the audio stream.
  • Step 306 Determine the display time of the subtitle information according to the time interval information.
  • Step 308 When it is determined that the playback condition of the live stream to be pushed is met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • the anchor broadcasts live through the intelligent terminal and pushes the generated initial live stream to the live server, which decodes the received initial live stream to generate an audio stream and a first video stream; then performs speech recognition on the audio stream to generate corresponding recognition text, and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; then uses the recognition text as subtitle information, and adds the subtitle information and the time interval information to the first video stream to generate a second video stream; then encodes the second video stream and the audio stream to generate a live stream to be pushed.
  • the live server can push the live stream to be pushed to the user's client.
  • when the client is playing the live stream for the user, it can pull the live stream to be pushed of a certain length from the live server in advance and cache it, so that it can decode the cached live stream to be pushed in advance and obtain the subtitle information contained in it. Then, according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the live server's reception time of the audio stream, the display time of the subtitle information can be determined. When it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • the above-mentioned processing method is conducive to enabling the client to pre-parse and obtain the subtitle information carried in the live stream to be pushed, and determine the display time of the subtitle information according to the time interval information between the generation time of the subtitle information and the time when the live server receives the audio stream, that is, determine the display time of the complete subtitles corresponding to the live stream to be pushed, so as to display the complete subtitles in advance based on the display time, which is conducive to reducing the cost of generating subtitles and improving the efficiency of subtitle generation, and is conducive to avoiding the asynchrony between subtitles and video images or audio, thereby meeting the user's needs for watching live subtitles during live viewing and improving the user's live viewing experience.
  • the above is a schematic scheme of another live data processing method of this embodiment. It should be noted that the technical scheme of the live data processing method and the technical scheme of the above-mentioned live data processing method belong to the same concept, and the details of the technical scheme of the live data processing method that are not described in detail can all be referred to the description of the technical scheme of the above-mentioned live data processing method.
  • the following takes the application of the live broadcast data processing method provided by an embodiment of the present application in the live broadcast field as an example to further illustrate the method.
  • Figure 4 shows an interactive schematic diagram of a live broadcast data processing method provided by an embodiment of the present application applied in the live broadcast field, specifically comprising the following steps:
  • Step 402 The transcoding module receives the anchor's initial live stream.
  • Step 404 The transcoding module decodes the initial live stream to generate an audio stream and a first video stream.
  • Step 406 The transcoding module transmits the audio stream to the speech recognition service module via GRPC.
  • Step 408 The speech recognition service module performs speech recognition on the audio stream and generates corresponding recognition text.
  • Step 410 the speech recognition service module determines the generation time of the recognized text, and determines the time interval information between the generation time and the reception time of the audio stream, and determines the text type of the recognized text according to the text length and/or text semantics of the recognized text.
  • Step 412 The speech recognition service module transmits the recognized text, text type, and time interval information to the transcoding module via GRPC.
  • Step 414 The transcoding module uses the recognized text as subtitle information, and adds the subtitle information, time interval information, and text type to the first video stream to generate a second video stream.
  • Step 416 The transcoding module encodes the second video stream and the audio stream to generate a live stream to be pushed.
  • Step 418 The client pulls the live stream to be pushed from the live server.
  • the live broadcast server includes a transcoding module and a speech recognition service module.
  • the client decodes the live stream to be pushed, generates a corresponding audio stream, a second video stream, subtitle information and time interval information, determines the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously plays the second video stream and the audio stream, and displays the subtitle information based on the display time.
  • the above-mentioned processing method is conducive to enabling the client to pre-parse and obtain the subtitle information carried in the live stream to be pushed, and determine the display time of the subtitle information according to the time interval information between the generation time of the subtitle information and the time when the live server receives the audio stream, that is, determine the display time of the complete subtitles corresponding to the live stream to be pushed, so as to display the complete subtitles in advance based on the display time, which is conducive to reducing the cost of generating subtitles and improving the efficiency of subtitle generation, and is conducive to avoiding the asynchrony between subtitles and video images or audio, thereby meeting the user's needs for watching live subtitles during live viewing and improving the user's live viewing experience.
  • FIG5 shows a schematic diagram of the structure of a live data processing device provided by an embodiment of the present application.
  • the device includes:
  • the decoding module 502 is configured to decode the received initial live stream to generate an audio stream and a first video stream;
  • the recognition module 504 is configured to perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream;
  • An adding module 506 is configured to use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream;
  • the encoding module 508 is configured to encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.
  • The decoding module 502 is further configured to:
  • determine the live stream to be played cached by the client and its corresponding generation time, obtain an initial live stream corresponding to the live stream identifier within a preset time interval, and decode the initial live stream, where the preset time interval is later than the generation time.
  • The client decodes the live stream to be played to generate a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;
  • the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
  • The live data processing device further includes a determination module configured to:
  • The adding module 506 is further configured to:
  • use the recognized text as subtitle information; and
  • add the subtitle information, the time interval information, and the text type to the first video stream as video frame information of the target video frame.
  • The client decodes the live stream to be pushed and generates a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, where the video frame information includes the subtitle information, the time interval information, and the text type;
  • the video stream and the audio stream are played synchronously, and, based on the display time, the subtitle information is displayed on the target video frame and the at least two video frames.
  • The live data processing device further includes a division module configured to:
  • The recognition module 504 is further configured to:
  • determine the generation time of the recognition text, and determine the time interval information between the generation time and the reception time of the target audio segment.
  • The recognition module 504 is further configured to:
  • determine the generation time of the recognition text, and determine the time interval information between the generation time and the reception time of the audio stream.
  • The decoding module 502 is further configured to:
  • decode the received initial live stream through a transcoding module to generate an audio stream and a first video stream.
  • The recognition module 504 is further configured to:
  • perform speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text.
  • The recognition module 504 is further configured to:
  • The adding module 506 is further configured to:
  • the transcoding module determines a first target video frame in the first video stream according to the generation time of the first recognition text;
  • the first recognition text is used as first subtitle information, and the first subtitle information, together with the time interval information between the generation time of the first recognition text and the reception time of the audio stream, is added to the first video stream as video frame information of the first target video frame.
  • The recognition module 504 is further configured to:
  • The adding module 506 is further configured to:
  • the transcoding module determines a second target video frame in the first video stream according to the generation time of the second recognition text;
  • the first recognition text and the second recognition text are used together as second subtitle information, and the second subtitle information, together with the time interval information between the generation time of the second recognition text and the reception time of the audio stream, is added to the first video stream as video frame information of the second target video frame.
  • The live data processing device further includes a transmission module configured to:
  • the transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
  • The above is a schematic solution of the live data processing device of this embodiment. It should be noted that the technical solution of the live data processing device and the technical solution of the live data processing method described above belong to the same concept; for details not described in the technical solution of the device, refer to the description of the technical solution of that method.
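As a rough illustration of how the four modules above (decoding module 502, recognition module 504, adding module 506, encoding module 508) could be chained on the server, here is a minimal Python sketch. The decode/recognition/encode primitives are placeholder stand-ins, not a real codec or ASR API, and serve only to make the control flow and the interval measurement runnable.

```python
import time
from typing import Tuple

# Placeholder primitives: a real implementation would wrap an actual codec
# and ASR engine; these stubs only make the control flow runnable.
def decode(stream: bytes) -> Tuple[bytes, bytes]:
    return stream[:1], stream[1:]          # (audio_stream, first_video_stream)

def speech_recognize(audio: bytes) -> str:
    return "recognized text"

def add_subtitle(video: bytes, subtitle: str, interval: float) -> bytes:
    return f"{subtitle}|{interval:.3f}|".encode() + video

def encode(video: bytes, audio: bytes) -> bytes:
    return video + audio

def process_initial_stream(initial_stream: bytes) -> bytes:
    recv_time = time.monotonic()                              # reception time of the stream
    audio, first_video = decode(initial_stream)               # decoding module 502
    text = speech_recognize(audio)                            # recognition module 504
    interval = time.monotonic() - recv_time                   # generation time - reception time
    second_video = add_subtitle(first_video, text, interval)  # adding module 506
    return encode(second_video, audio)                        # encoding module 508

live_stream_to_push = process_initial_stream(b"\x00raw-initial-stream")
```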
  • FIG. 6 shows a schematic structural diagram of another live data processing device provided by an embodiment of the present application.
  • The device includes:
  • a receiving module 602, configured to receive and cache the live stream to be pushed returned by the live server;
  • a decoding module 604, configured to decode the live stream to be pushed and generate a corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information, where the time interval information is determined by the live server according to the generation time of the subtitle information and the reception time of the audio stream;
  • a determination module 606, configured to determine a display time of the subtitle information according to the time interval information;
  • a display module 608, configured to play the video stream and the audio stream synchronously when it is determined that the playback conditions of the live stream to be pushed are met, and to display the subtitle information based on the display time.
  • The above is a schematic solution of another live data processing device of this embodiment. It should be noted that the technical solution of this live data processing device and the technical solution of the other live data processing method above belong to the same concept; for details not described in the technical solution of the device, refer to the description of the technical solution of that method.
  • FIG. 7 shows a block diagram of a computing device 700 according to an embodiment of the present application.
  • The components of the computing device 700 include, but are not limited to, a memory 710 and a processor 720.
  • The processor 720 is connected to the memory 710 via a bus 730, and a database 750 is used to store data.
  • The computing device 700 also includes an access device 740 that enables the computing device 700 to communicate via one or more networks 760.
  • Examples of the networks 760 include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • The access device 740 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC) interface.
  • The above components of the computing device 700, as well as other components not shown in FIG. 7, may also be connected to each other, for example through a bus. It should be understood that the block diagram of the computing device shown in FIG. 7 is for illustrative purposes only and is not intended to limit the scope of the present application; those skilled in the art may add or replace components as needed.
  • The computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or another type of mobile device, or a stationary computing device such as a desktop computer or PC.
  • The computing device 700 may also be a mobile or stationary server.
  • The memory 710 is used to store computer-executable instructions, and the processor 720 is used to execute the computer-executable instructions, where the processor implements the steps of the live data processing method when executing the computer-executable instructions.
  • The above is a schematic solution of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the live data processing method described above belong to the same concept; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the live data processing method above.
  • An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the following steps:
  • the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be added to or removed from as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the present application provide a livestreaming data processing method and system. The livestreaming data processing method comprises: decoding a received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate a corresponding recognition text, and determining time interval information between the generation time of the recognition text and the receiving time of the audio stream; using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning said live stream to a client.

Description

Live data processing method and system
This application claims priority to Chinese patent application No. 202211311544.0, filed on October 25, 2022 and entitled "Live Data Processing Method and System", the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to a live data processing method. One or more embodiments of the present application also relate to a live data processing system, a computing device, and a computer-readable storage medium.
Background
With the rapid development of the live audio and video industry, data streaming technology has pushed high-definition image quality, low latency, audio-video synchronization, and similar requirements close to their limits; user demand, however, does not stop there.
In some special scenarios, such as large-scale sports events, large conference presentations, and online education and training, the live broadcast needs to be translated in real time and subtitled. Traditionally, the live stream must first be recorded, the audio stream extracted, and the subtitles burned into the video after manual or machine translation, so the subtitles can only be shown during a replay. This approach cannot deliver a live experience to audiences who do not understand the language or who have hearing impairments. The inventors realized that although technologies for generating subtitles in real time during a live broadcast have been developed, such as live bullet comments, they have defects: the subtitles and the sound are out of sync, running sometimes ahead and sometimes behind, so the audience experience is very poor and their needs are not met. An effective method for solving such problems is therefore urgently needed.
Summary of the Invention
In view of this, the embodiments of the present application provide a live data processing method. One or more embodiments of the present application also relate to a live data processing apparatus, a live data processing system, a computing device, and a computer-readable storage medium, so as to overcome the technical defects in the related art of high cost and low efficiency in generating live subtitles, and of subtitle delay.
According to a first aspect of the embodiments of the present application, a live data processing method is provided, including:
decoding a received initial live stream to generate an audio stream and a first video stream;
performing speech recognition on the audio stream to generate corresponding recognition text, and determining time interval information between the generation time of the recognition text and the reception time of the audio stream;
using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and
encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning the live stream to be pushed to a client.
According to a second aspect of the embodiments of the present application, a live data processing apparatus is provided, including:
a decoding module, configured to decode a received initial live stream to generate an audio stream and a first video stream;
a recognition module, configured to perform speech recognition on the audio stream, generate corresponding recognition text, and determine time interval information between the generation time of the recognition text and the reception time of the audio stream;
an adding module, configured to use the recognition text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream; and
an encoding module, configured to encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to a client.
According to a third aspect of the embodiments of the present application, another live data processing method is provided, including:
receiving and caching a live stream to be pushed returned by a live server;
decoding the live stream to be pushed to generate a corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information, where the time interval information is determined by the live server according to the generation time of the subtitle information and the reception time of the audio stream;
determining a display time of the subtitle information according to the time interval information; and
when it is determined that the playback conditions of the live stream to be pushed are met, playing the video stream and the audio stream synchronously, and displaying the subtitle information based on the display time.
According to a fourth aspect of the embodiments of the present application, another live data processing apparatus is provided, including:
a receiving module, configured to receive and cache a live stream to be pushed returned by a live server;
a decoding module, configured to decode the live stream to be pushed and generate a corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information, where the time interval information is determined by the live server according to the generation time of the subtitle information and the reception time of the audio stream;
a determination module, configured to determine a display time of the subtitle information according to the time interval information; and
a display module, configured to play the video stream and the audio stream synchronously when it is determined that the playback conditions of the live stream to be pushed are met, and to display the subtitle information based on the display time.
According to a fifth aspect of the embodiments of the present application, a live data processing system is provided, including:
a live server and a client;
the live server is configured to decode a received initial live stream to generate an audio stream and a first video stream; perform speech recognition on the audio stream to generate corresponding recognition text and determine time interval information between the generation time of the recognition text and the reception time of the audio stream; use the recognition text as subtitle information and add the subtitle information and the time interval information to the first video stream to generate a second video stream; encode the second video stream and the audio stream to generate a live stream to be pushed; and return the live stream to be pushed to the client;
the client is configured to receive and cache the live stream to be pushed; decode the live stream to be pushed to obtain the audio stream, the second video stream, the subtitle information, and the time interval information; determine a display time of the subtitle information according to the time interval information; and, when it is determined that the playback conditions of the live stream to be pushed are met, play the second video stream and the audio stream synchronously and display the subtitle information based on the display time.
According to a sixth aspect of the embodiments of the present application, a computing device is provided, including:
a memory and a processor;
the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, where the processor implements the steps of the live data processing method when executing the computer-executable instructions.
According to a seventh aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer-executable instructions that, when executed by a processor, implement the steps of the live data processing method.
An embodiment of the present application implements a live data processing method and system. The live data processing method includes decoding a received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate corresponding recognition text and determining time interval information between the generation time of the recognition text and the reception time of the audio stream; using the recognition text as subtitle information and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed and returning the live stream to be pushed to a client.
In the embodiments of the present application, the live server performs speech recognition on the audio stream, generates corresponding recognition text, and records the time interval between the generation time of the recognition text and the time at which the audio stream was received. Because this interval characterizes how long the live server spent on speech recognition of the audio stream after receiving the initial live stream, once the recognition text and the time interval information have been added to the video stream and returned to the client, the client can parse out the subtitle information carried in the live stream to be pushed in advance and determine the display time of the subtitle information, that is, the display time of the complete subtitles corresponding to the live stream to be pushed, according to the time interval between the generation time of the subtitle information and the time at which the live server received the audio stream. Displaying the complete subtitles ahead of time based on this display time helps to reduce the cost of generating subtitles and improve subtitle generation efficiency, and also helps to avoid desynchronization between the subtitles and the video images or audio, thereby meeting users' need to see live subtitles while watching a live broadcast and improving their viewing experience.
Brief Description of the Drawings
FIG. 1 is an architecture diagram of a live data processing system provided by an embodiment of the present application;
FIG. 2 is a flowchart of a live data processing method provided by an embodiment of the present application;
FIG. 3 is a flowchart of another live data processing method provided by an embodiment of the present application;
FIG. 4 is an interaction diagram of a live data processing method applied to the live broadcast field, provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a live data processing device provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another live data processing device provided by an embodiment of the present application;
FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present application.
Detailed Description
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its substance; the present application is therefore not limited by the specific implementations disclosed below.
The terms used in one or more embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the one or more embodiments. The singular forms "a", "said", and "the" used in one or more embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, and the like may be used in one or more embodiments of the present application to describe various pieces of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, "first" may also be referred to as "second", and similarly "second" may be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terms involved in one or more embodiments of the present application are explained:
Live broadcast: in a broad sense, live broadcast also covers television broadcasting; here it generally refers to online live video streaming. The live audio and video are pushed to a server in the form of a media stream (stream pushing). When viewers watch the live broadcast, the server, upon receiving a user's request, transmits the video to the player of a website, app, or client, which plays it in real time.
H264 encoding: H264 generally refers to H.264, a highly compressed digital video codec standard proposed by the Joint Video Team (JVT), formed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
H265 encoding: H.265 is the video coding standard developed by ITU-T VCEG after H.264. The H.265 standard is built around H.264, retaining some of the original techniques while improving related ones.
SEI: Supplemental Enhancement Information, part of the bitstream, provides a way to add extra information into a video bitstream and is one of the features of video compression standards such as H.264/H.265.
Speech recognition technology: technology by which a machine converts speech signals into the corresponding text or commands through a process of recognition and understanding.
GRPC: a kind of RPC (Remote Procedure Call) framework; it is a high-performance, open-source, general-purpose RPC framework developed on top of the ProtoBuf (Protocol Buffers) serialization protocol, and it supports many development languages.
Transcoding: video transcoding technology converts a video signal from one format into another.
In the present application, a live data processing method is provided. One or more embodiments of the present application also relate to a live data processing apparatus, a live data processing system, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
In specific implementations, the subtitle information of the embodiments of the present application can be presented on clients such as large video playback devices, game consoles, desktop computers, smartphones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, e-book readers, and other display terminals.
In addition, the subtitle information of the embodiments of the present application can be applied to any video or audio in which subtitles can be presented; for example, subtitles can be presented in live or recorded videos, and in the audio of online or offline music, audiobooks, and the like.
Referring to FIG. 1, FIG. 1 shows an architecture diagram of a live data processing system provided according to an embodiment of the present application, including:
a live server 102 and a client 104;
the live server 102 is configured to decode a received initial live stream to generate an audio stream and a first video stream; perform speech recognition on the audio stream to generate corresponding recognition text and determine time interval information between the generation time of the recognition text and the reception time of the audio stream; use the recognition text as subtitle information and add the subtitle information and the time interval information to the first video stream to generate a second video stream; encode the second video stream and the audio stream to generate a live stream to be pushed; and return the live stream to be pushed to the client 104;
the client 104 is configured to receive and cache the live stream to be pushed; decode it to obtain the audio stream, the second video stream, the subtitle information, and the time interval information; determine a display time of the subtitle information according to the time interval information; and, when it is determined that the playback conditions of the live stream to be pushed are met, play the second video stream and the audio stream synchronously and display the subtitle information based on the display time.
Specifically, in FIG. 1, user U1 streams live through a smart terminal and pushes the generated initial live stream to the live server 102. The live server 102 decodes the received initial live stream to generate an audio stream and a first video stream; it then performs speech recognition on the audio stream to generate corresponding recognition text and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; next, it uses the recognition text as subtitle information, adds the subtitle information and the time interval information to the first video stream to generate a second video stream, and encodes the second video stream and the audio stream to generate a live stream to be pushed. When user U2 and user U3 watch user U1's live broadcast, the live server pushes the live stream to be pushed to the clients 104 of user U2 and user U3.
While playing the live stream for the user, the client 104 can pull a certain duration of the live stream to be pushed from the live server in advance and cache it, so that the client 104 can decode the cached live stream to be pushed ahead of time and obtain the subtitle information it contains. It can then determine the display time of the subtitle information according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the time at which the live server 102 received the audio stream, and, when it determines that the playback conditions of the live stream to be pushed are met, play the decoded video stream and audio stream synchronously and display the subtitle information based on that display time.
In the embodiments of the present application, this processing enables the client to parse out the subtitle information carried in the live stream to be pushed in advance and to determine the display time of the subtitle information, that is, the display time of the complete subtitles corresponding to the live stream to be pushed, according to the time interval between the generation time of the subtitle information and the time at which the live server received the audio stream, so that the complete subtitles can be displayed ahead of time. This helps to reduce the cost of generating subtitles and improve subtitle generation efficiency, and also helps to avoid desynchronization between the subtitles and the video images or audio, thereby meeting users' need to see live subtitles while watching a live broadcast and improving their viewing experience.
The above is a schematic solution of a live data processing system of this embodiment. It should be noted that the technical solution of the live data processing system and the technical solution of the live data processing method described below belong to the same concept; for details not described in the technical solution of the system, refer to the description of the technical solution of the method below.
Referring to FIG. 2, FIG. 2 shows a flowchart of a live data processing method provided according to an embodiment of the present application, including the following steps.
Step 202: decode a received initial live stream to generate an audio stream and a first video stream.
Specifically, the live data processing method provided in the embodiments of the present application is applied to a live server. The initial live stream is the live stream that the streamer pushes to the live server during a live broadcast.
While streaming live through a smart terminal, the streamer can use the terminal to push the live stream generated during the broadcast to the live server, so that when other users want to watch the broadcast, the live server can push the streamer's live stream to those users' terminals (clients).
At present, most live broadcasts carry no subtitles, but some special scenarios, such as large-scale sports events, large conference presentations, and online education and training, require the live broadcast to be translated in real time and subtitled. Traditionally, the live stream must first be recorded, the audio stream extracted, and the subtitles burned into the video after manual or machine translation, so the subtitles can only be shown during a replay. This approach cannot deliver a live experience to audiences who do not understand the language or who have hearing impairments.
In addition, even though technologies for generating subtitles in real time during a live broadcast have been developed, such as live bullet comments, they often suffer from subtitles being out of sync with the video images or sound, making the viewing experience very poor and failing to meet users' needs.
Based on this, in the embodiments of the present application, after receiving the initial live stream pushed by the streamer, the live server can decode the initial live stream to obtain an audio stream and a first video stream, perform speech recognition on the audio stream to obtain corresponding recognition text, and then add the recognition text to the first video stream as subtitle information to generate a second video stream. After the encoded audio stream and second video stream are pushed to a user's client, the client can decode them to obtain the subtitle information and display it while playing the audio stream and the second video stream synchronously, thereby avoiding desynchronization between the live subtitles and the live video images or audio while the user watches the broadcast in real time, meeting the user's need to see live subtitles, and improving the viewing experience.
In specific implementations, decoding the received initial live stream can be achieved as follows:
determining the live stream to be played cached by the client, and determining the generation time corresponding to the live stream to be played; and
according to the live stream identifier corresponding to the live stream to be played and the generation time, obtaining an initial live stream corresponding to the live stream identifier within a preset time interval, and decoding the initial live stream, where the preset time interval is later than the generation time.
In addition, the client decodes the live stream to be played to generate a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed; and
when it is determined that the playback conditions of the live stream to be played are met, the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
Specifically, when playing a live stream, the client can cache in advance the live stream to be played for a period after the current playback time and parse this portion ahead of time to obtain the video stream to be played, the audio stream to be played, the subtitles to be displayed, and the display time corresponding to those subtitles. Then, when it determines that the playback conditions of the live stream to be played are met, it plays the decoded video and audio streams synchronously and displays the subtitles based on that display time.
For example, if the client pre-caches 5 s of the live stream to be played and the current playback time is t, it caches the stream for t to t+5 s and parses it in advance, using the display times of the subtitles in the parsing result to decide whether the subtitles need to be shown early, thereby reducing the delay between the live subtitles and the live video images or audio while the user watches in real time.
Further, because the client can only pre-cache a limited duration of the stream, once that portion has been played it must cache new content. For example, if the client pre-caches the stream for t to t+5 s, then after the portion for t to t+3 s has been played, it must cache the stream for t+5 s to t+8 s, that is, fetch the stream for t+5 s to t+8 s from the live server.
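The window arithmetic in this example can be sketched in a few lines of Python. The function name and the exact policy (keep a fixed look-ahead of parsed stream buffered ahead of the playhead) are illustrative assumptions:

```python
# Rolling pre-cache window: with a 5 s look-ahead, once the first 3 s of the
# cached t..t+5 s segment have played, the client requests t+5 s..t+8 s so
# that a full look-ahead of parsed stream stays buffered.
def next_fetch_window(t: float, played: float, lookahead: float = 5.0) -> tuple:
    """Return (start, end) of the next segment to request from the live server."""
    return (t + lookahead, t + lookahead + played)

print(next_fetch_window(t=0.0, played=3.0))  # (5.0, 8.0)
```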
Accordingly, the live server can determine in advance which live stream to be played the client has already cached, determine the generation time (playback time) corresponding to that cached stream, and then, according to the live stream identifier and the generation time, obtain the initial live stream corresponding to that identifier for the period after the generation time, process it to generate a live stream to be pushed that contains subtitle information, and push it to the client.
On this basis, while a user watches a broadcast in real time through the client, the client pre-caches and pre-parses the live stream to be played for a period after the current playback time, and the live server likewise determines in advance which stream the client has cached and uses it to select and parse the corresponding initial live stream. Although both the server's parsing of the initial live stream and the client's parsing of the stream to be played take time and introduce some broadcast delay, in the embodiments of the present application the two run in parallel, and the client can use the display times of the subtitles in the parsing result to decide whether to show them early, thereby reducing the delay between the live subtitles and the live video images or audio during real-time viewing.
Step 204: perform speech recognition on the audio stream to generate corresponding recognition text, and determine time interval information between the generation time of the recognition text and the reception time of the audio stream.
Specifically, after decoding the initial live stream into an audio stream and a first video stream, the live server can perform speech recognition on the audio stream to generate corresponding recognition text and then add that text to the first video stream as subtitle information to generate a second video stream, so that a client watching the broadcast can display the subtitle information while playing the second video stream.
In practice, however, speech recognition on the decoded audio stream takes a certain amount of time, so there is a time difference between the generation time of the recognition text and the time at which the audio stream, that is, the initial live stream, was received. If this difference were ignored and only the recognition text and the initial live stream were pushed to the client, the displayed text might be out of sync with the video images or sound.
For this reason, after the complete recognition text is obtained, the embodiments of the present application determine how long it took to generate it, namely the time interval between the generation time of the recognition text and the time at which the live server received the audio stream, so that the client can use this interval to determine how far in advance to display the text once it has been obtained.
In specific implementations, after the live server decodes the audio stream, it may divide the audio stream according to its spectrum information to generate at least two audio segments.
Correspondingly, performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, include:
performing speech recognition on a target audio segment to generate corresponding recognition text, where the target audio segment is one of the at least two audio segments; and
determining the generation time of the recognition text, and determining the time interval information between the generation time and the reception time of the target audio segment.
Specifically, when speech recognition is performed on an audio stream, the accuracy of the result is easier to guarantee if the recognized audio forms a complete sentence. On this basis, the embodiments of the present application may first divide the audio stream according to its spectrum information into at least two audio segments, for example by treating the audio between any two adjacent points whose spectrum value is 0 (indicating a pause) as one segment. Speech recognition is then performed on each segment to generate the corresponding recognition text, the generation time of the recognition text is determined, and the time interval information between that generation time and the reception time of each segment (the reception time of the audio stream or of the initial live stream) is determined.
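A minimal sketch of this pause-based segmentation is given below. It uses per-frame signal energy as a cheap stand-in for "spectrum value 0", and the frame size, threshold, and function name are illustrative assumptions:

```python
from typing import List

def split_on_silence(samples: List[float], frame: int = 1024,
                     silence_eps: float = 1e-4) -> List[List[float]]:
    """Cut mono PCM samples into segments at near-zero-energy (pause) points."""
    segments: List[List[float]] = []
    current: List[float] = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        energy = sum(s * s for s in chunk) / max(len(chunk), 1)
        if energy < silence_eps:      # energy ~ 0 stands in for "spectrum value 0"
            if current:               # a pause closes the current sentence-like segment
                segments.append(current)
                current = []
        else:
            current.extend(chunk)
    if current:
        segments.append(current)
    return segments
```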
Alternatively, performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, include:
splitting the audio stream according to a preset recognition window to generate at least one audio segment;
performing speech recognition on a target audio segment to generate corresponding recognition text, where the target audio segment is one of the at least one audio segment; and
determining the generation time of the recognition text, and determining the time interval information between the generation time and the reception time of the audio stream.
Specifically, speech recognition on an audio stream usually uses a preset recognition window. The window length may be 0.5 s to 1 s, in which case recognition over the window identifies individual words in the audio stream; or it may be 1 s to 5 s, in which case recognition over the window identifies complete sentences. The specific window length can be set according to actual needs and is not limited here.
Performing speech recognition on the audio stream according to the preset recognition window means splitting the audio stream by the preset recognition window into at least one audio segment, performing speech recognition on each segment to generate the corresponding recognition text, and then determining the generation time of the recognition text and the time interval information between that generation time and the reception time of the audio stream.
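The fixed-window variant is even simpler. The following sketch slices the audio samples into equal windows, with the window length chosen for word-level (0.5-1 s) or sentence-level (1-5 s) recognition; the names and parameters are illustrative assumptions:

```python
from typing import List

def split_by_window(samples: List[float], sample_rate: int,
                    window_s: float = 1.0) -> List[List[float]]:
    """Cut the PCM sample list into consecutive windows of window_s seconds."""
    step = int(window_s * sample_rate)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 3 s of audio at 16 kHz: six 0.5 s word-level pieces, or one 3 s sentence piece.
word_level = split_by_window([0.0] * 48000, sample_rate=16000, window_s=0.5)
sentence_level = split_by_window([0.0] * 48000, sample_rate=16000, window_s=3.0)
print(len(word_level), len(sentence_level))  # 6 1
```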
In addition, in the embodiments of the present application, the live server contains a transcoding module and a speech recognition service module. Decoding the received initial live stream to generate an audio stream and a first video stream is therefore performed by the transcoding module, and performing speech recognition on the audio stream to generate corresponding recognition text is performed by the speech recognition service module.
The transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
Specifically, the data transmission channel may be GRPC.
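As an illustration only: if the channel is GRPC, the transcoding-module side might stream audio chunks and read back recognition results roughly as below. The service definition, the generated asr_pb2/asr_pb2_grpc modules, and all message fields are hypothetical assumptions; the patent only states that GRPC may be used as the channel.

```python
import grpc
import asr_pb2, asr_pb2_grpc  # hypothetical modules generated from an assumed asr.proto

def recognize_segments(segments, target="asr-service:50051"):
    """Stream audio segments to the ASR service and yield its results."""
    with grpc.insecure_channel(target) as channel:
        stub = asr_pb2_grpc.AsrServiceStub(channel)  # hypothetical stub name
        requests = (asr_pb2.AudioChunk(pcm=seg) for seg in segments)
        # Each response is assumed to carry the recognized text, its text type,
        # and the interval between text generation and audio reception.
        for resp in stub.Recognize(requests):
            yield resp.text, resp.text_type, resp.interval_ms
```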
进一步的,所述通过语音识别服务模块对所述音频流进行语音识别,生成对应的识别文本,包括:Further, performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text includes:
通过语音识别服务模块按照预设识别窗口对所述音频流进行拆分,生成至少一个音频片段;Splitting the audio stream according to a preset recognition window by a speech recognition service module to generate at least one audio segment;
对第一音频片段进行语音识别,生成对应的第一识别文本,并将所述第一识别文本返回至所述转码模块,其中,所述第一音频片段为所述至少一个音频片段之一。Perform speech recognition on a first audio segment to generate a corresponding first recognition text, and return the first recognition text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.
Specifically, as described above, a preset recognition window is typically used when performing speech recognition on an audio stream, and the speech recognition service module may likewise use one. Performing speech recognition according to the preset recognition window means splitting the audio stream according to the window to generate at least one audio segment, performing speech recognition on each audio segment to generate corresponding recognized text, determining the generation time of the recognized text, and determining the time interval information between that generation time and the reception time of the audio stream.
As before, the window length of the preset recognition window may be 0.5 s to 1 s, in which case individual characters in the audio stream can be recognized, or 1 s to 5 s, in which case complete sentences can be recognized. The specific window length can be determined according to actual needs and is not limited here.
Step 206: use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream.
Specifically, after the recognized text is generated and the time interval information between its generation time and the reception time of the audio stream is determined, the recognized text can be used as subtitle information, and the subtitle information and the time interval information can be added to the first video stream to generate the second video stream.
The subtitle information may be written into the first video stream in the form of an SEI (supplemental enhancement information) message to generate the second video stream.
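One common way to carry such side data is an H.264 `user_data_unregistered` SEI message (payload type 5). The sketch below is a simplified illustration; the UUID and JSON field names are assumptions, and emulation prevention is omitted for brevity.

```python
import json
import uuid

# Placeholder 16-byte UUID identifying the subtitle payload; an assumption.
SUBTITLE_UUID = uuid.uuid5(uuid.NAMESPACE_DNS, "subtitle.example").bytes

def build_subtitle_sei(text: str, interval_ms: int, text_type: str) -> bytes:
    """Pack subtitle info into an H.264 SEI NAL unit of payload type 5
    (user_data_unregistered). Emulation prevention is omitted for brevity."""
    user_data = SUBTITLE_UUID + json.dumps(
        {"subtitle": text, "interval_ms": interval_ms, "type": text_type},
        ensure_ascii=False).encode("utf-8")
    sei = bytes([5])                      # payload_type = 5
    size = len(user_data)
    while size >= 255:                    # payload_size, 0xFF continuation coding
        sei += b"\xff"
        size -= 255
    sei += bytes([size]) + user_data
    # Annex B start code + NAL header (type 6 = SEI) + rbsp trailing bits
    return b"\x00\x00\x00\x01\x06" + sei + b"\x80"
```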
In a specific implementation, the text type of the recognized text may also be determined according to the text length and/or text semantics of the recognized text.
Correspondingly, using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
determining a target video frame in the first video stream according to the generation time;
using the recognized text as subtitle information, and adding the subtitle information, the time interval information, and the text type to the first video stream as video frame information of the target video frame.
Specifically, as described above, a preset recognition window is typically used when performing speech recognition on an audio stream: with a window length of 0.5 s to 1 s, individual characters in the audio stream can be recognized; with a window length of 1 s to 5 s, complete sentences can be recognized.
Therefore, after the recognized text is generated, this embodiment of the present application may further determine its text type according to its text length and/or text semantics. In practical applications, the text type includes but is not limited to character, word, and sentence. Text semantics are used to determine whether the recognized text expresses complete semantics: if so, the text type of the recognized text is determined to be the sentence type; if not, the text type is the word type when the text length is at least two characters, and the character type when the text length equals 1.
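The rule above can be summarized in a few lines; `is_complete_sentence()` is a hypothetical placeholder for whatever semantic-completeness check the recognizer provides, not an API named in the original.

```python
def is_complete_sentence(text: str) -> bool:
    """Hypothetical semantic-completeness check provided by the recognizer."""
    raise NotImplementedError

def classify_text(text: str) -> str:
    """Return character / word / sentence per the length-and-semantics rule."""
    if is_complete_sentence(text):
        return "sentence"
    return "word" if len(text) >= 2 else "character"
```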
After the text type of the recognized text is determined, the target video frame in the first video stream can be determined according to the generation time of the recognized text; the recognized text is used as subtitle information, and the subtitle information, the time interval information, and the text type are added to the first video stream as video frame information of that target video frame.
In practical applications, the last video frame of the video segment corresponding to the target audio segment is usually taken as the target video frame, and the subtitle information, time interval information, and text type are added to the first video stream as its video frame information to generate the second video stream. After obtaining the second video stream, the client can determine which subtitle information to display according to the text type, usually giving priority to sentence-type subtitle information so as to ensure the subtitle viewing effect of the live broadcast.
In addition, in this embodiment of the present application, after the speech recognition service module splits the audio stream according to the preset recognition window to generate at least one audio segment and performs speech recognition on the first audio segment to generate the corresponding first recognized text, using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
determining, by the transcoding module, a first target video frame in the first video stream according to the generation time of the first recognized text;
using the first recognized text as first subtitle information, and adding the first subtitle information, together with the time interval information between the generation time of the first recognized text and the reception time of the audio stream, to the first video stream as video frame information of the first target video frame.
Specifically, when the speech recognition service module splits the audio stream into one or at least two audio segments, it can perform speech recognition on each audio segment in turn. After the recognized text corresponding to any audio segment is generated, that recognized text is returned to the transcoding module. The transcoding module determines the target video frame in the first video stream according to the generation time of that recognized text (usually the last video frame of the video segment corresponding to that audio segment), uses the recognized text as subtitle information, and adds the subtitle information, together with the time interval information between the generation time of the recognized text and the reception time of the audio stream, to the first video stream as video frame information of that target video frame.
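A sketch of this transcoder-side bookkeeping, under the assumption that frames carry a presentation timestamp and a free-form info dictionary (both illustrative types, not structures named in the original):

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    pts_ms: int                           # presentation timestamp (ms)
    info: dict = field(default_factory=dict)

def attach_subtitle(frames: list[Frame], window_end_ms: int,
                    text: str, interval_ms: int, text_type: str) -> None:
    """Attach the subtitle payload to the last frame at or before the end of
    the recognition window that produced the text."""
    candidates = [f for f in frames if f.pts_ms <= window_end_ms]
    if not candidates:
        return
    target = max(candidates, key=lambda f: f.pts_ms)
    target.info.update(subtitle=text, interval_ms=interval_ms, type=text_type)
```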
Further, performing speech recognition on the audio stream through the speech recognition service module to generate corresponding recognized text includes:
performing speech recognition on a second audio segment, adjacent to the first audio segment, among the at least two audio segments, generating corresponding second recognized text, and returning the first recognized text and the second recognized text to the transcoding module.
Correspondingly, using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
determining, by the transcoding module, a second target video frame in the first video stream according to the generation time of the second recognized text;
using the first recognized text and the second recognized text as second subtitle information, and adding the second subtitle information, together with the time interval information between the generation time of the second recognized text and the reception time of the audio stream, to the first video stream as video frame information of the second target video frame.
Specifically, as described above, when the speech recognition service module splits the audio stream into at least two audio segments, it may first perform speech recognition on the first of those segments to generate the corresponding first recognized text. The transcoding module then uses the first recognized text as subtitle information and adds the subtitle information, together with the time interval information between the generation time of the first recognized text and the reception time of the audio stream, to the first video stream as video frame information of the first target video frame (usually the last video frame of the video segment corresponding to the first audio segment).
Speech recognition may then be performed on the second audio segment adjacent to the first audio segment to generate corresponding second recognized text. The transcoding module uses the first recognized text and the second recognized text together as subtitle information and adds the subtitle information, together with the time interval information between the generation time of the second recognized text and the reception time of the audio stream, to the first video stream as video frame information of the second target video frame (usually the last video frame of the video segment corresponding to the second audio segment), and so on.
Because the speech recognition service module can temporarily store the first recognized text after obtaining it, and the first audio segment is adjacent to the second audio segment, the first recognized text and the second recognized text can be returned together as subtitle information of the video stream once the second recognized text is obtained. This lets the speech recognition service module reuse its cache, improving the accuracy of the subtitle recognition results.
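A minimal sketch of this cache reuse, with `recognize()` again the hypothetical ASR call stubbed earlier; how far back the cache reaches (only the adjacent segment here, versus accumulating until a sentence completes) is an assumption, since the original only describes the first and second texts being returned together.

```python
class SegmentRecognizer:
    """Recognize adjacent segments in order; each reply concatenates the cached
    text of the preceding segment with the new result, so the transcoder
    receives the longer, more accurate hypothesis (first + second text)."""

    def __init__(self):
        self._prev_text = ""

    def feed(self, segment: bytes) -> str:
        text = recognize(segment)         # hypothetical ASR call, as stubbed above
        combined = self._prev_text + text
        self._prev_text = text            # cache only the adjacent segment's text
        return combined
```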
Step 208: encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.
Specifically, after generating the second video stream, the live streaming server can encode the second video stream and the audio stream to generate the live stream to be pushed, and push that stream to a user's client when the user wants to watch the live broadcast.
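Tying the sketches above together, the server-side flow culminating in steps 206 and 208 might be orchestrated as follows. This is a sketch only: `decode_stream()` and `encode_stream()` are illustrative placeholders, and the helpers (`split_audio`, `SegmentRecognizer`, `classify_text`, `attach_subtitle`) come from the earlier sketches.

```python
import time

def decode_stream(initial_stream: bytes):
    """Illustrative placeholder for the transcoding module's decoder."""
    raise NotImplementedError

def encode_stream(frames, audio: bytes) -> bytes:
    """Illustrative placeholder for the transcoding module's encoder."""
    raise NotImplementedError

def process_initial_stream(initial_stream: bytes, received_at: float) -> bytes:
    audio, frames = decode_stream(initial_stream)            # decode
    recognizer = SegmentRecognizer()
    for seq, segment in enumerate(split_audio(audio, 1.0)):  # recognize per window
        text = recognizer.feed(segment)
        interval_ms = int((time.time() - received_at) * 1000)
        attach_subtitle(frames, window_end_ms=(seq + 1) * 1000,
                        text=text, interval_ms=interval_ms,
                        text_type=classify_text(text))       # step 206
    return encode_stream(frames, audio)                      # step 208
```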
In a specific implementation, the client decodes the live stream to be pushed to generate a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;
when it is determined that the text type is a target type, the display time of the subtitle information is determined according to the playback time of the target video frame and the time interval information;
at least two video frames in the video stream for displaying the subtitle information are determined according to the display time, wherein the playback time of the at least two video frames is earlier than the playback time of the target video frame;
when it is determined that the playback conditions of the live stream to be pushed are met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed in the at least two video frames and the target video frame based on the display time.
Specifically, as described above, when playing a live stream, the client can pull from the live streaming server, in advance, the live stream to be pushed for a certain duration after the current playback time and cache it. It can then decode the cached stream ahead of time to obtain the subtitle information corresponding to the target video frame, the text type of the subtitle information, and the time interval information between the generation time of the subtitle information and the time at which the live streaming server received the audio stream. If the text type indicates that the subtitle information belongs to the target text type, i.e., the sentence type, the client determines the display time of the subtitle information from that time interval information combined with the playback time of the target video frame, and then determines, according to the display time, the other video frames before the target video frame that will display the subtitle information. When the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed in the determined video frames and the target video frame based on the display time.
For example, suppose the current time point is t and, while playing the live stream, the client pre-caches the live stream to be pushed for t to t+5s, then decodes it to obtain the subtitle information it carries. If the decoding result contains recognized text corresponding to the video frames at the five time points t+1s, t+2s, t+3s, t+4s, and t+5s, and the recognized text at t+5s is of the sentence type, that recognized text can be displayed with priority. In this case, the time interval information corresponding to that recognized text is examined: if the time interval between the generation time of the recognized text and the reception time of the audio stream is 4s, and the time interval between the generation time of the recognized text and the video frame at t+5s is 1s, the subtitle information (recognized text) needs to be displayed 3s in advance; equivalently, the streamer spoke one complete sentence from t+3s to t+5s. The subtitle information can therefore be displayed starting when the live stream at t+3s begins to play, and removed at t+5s, so that the complete subtitle is displayed in advance and no delay arises between the subtitle and the video picture or sound; the subtitle remains displayed until it is detected that other subtitle information needs to be displayed.
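The frame selection implied by this example can be sketched as follows, reusing the illustrative `Frame` type from above: with a display start of t+3s and a target frame at t+5s, the frames at t+3s, t+4s, and t+5s carry the subtitle.

```python
def frames_for_subtitle(frames: list[Frame], display_start_ms: int,
                        target_pts_ms: int) -> list[Frame]:
    """Select the buffered frames, up to and including the target frame, in
    which the subtitle is displayed from its computed start time onward."""
    return [f for f in frames if display_start_ms <= f.pts_ms <= target_pts_ms]

# Frames decoded ahead of playback at t+1s .. t+5s (taking t as 0):
buffered = [Frame(pts_ms=1000 * k) for k in range(1, 6)]
selected = frames_for_subtitle(buffered, display_start_ms=3000, target_pts_ms=5000)
assert [f.pts_ms for f in selected] == [3000, 4000, 5000]
```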
An embodiment of the present application implements a livestreaming data processing method and system, wherein the method includes: decoding a received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate corresponding recognized text, and determining time interval information between the generation time of the recognized text and the reception time of the audio stream; using the recognized text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning the live stream to be pushed to the client.
In this embodiment of the present application, the live streaming server performs speech recognition on the audio stream, generates the corresponding recognized text, and records the time interval between the generation time of the recognized text and the time at which the audio stream was received. Since this interval characterizes how long the live streaming server spent on speech recognition of the audio stream after receiving the initial live stream, once the recognized text and the time interval information are added to the video stream and returned to the client, the client can parse in advance the subtitle information carried in the live stream to be pushed and determine its display time from the time interval information between the generation time of the subtitle information and the time at which the live streaming server received the audio stream, i.e., determine the display time of the complete subtitle corresponding to that stream, so that the complete subtitle can be displayed in advance. This reduces the cost of generating subtitles and improves subtitle generation efficiency, while also avoiding desynchronization between the subtitles and the video picture or audio, thereby meeting users' need to read live subtitles while watching and improving their live viewing experience.
Referring to FIG. 3, FIG. 3 shows a flowchart of another livestreaming data processing method provided according to an embodiment of the present application, including the following steps:
Step 302: receive and cache the live stream to be pushed returned by the live streaming server.
Step 304: decode the live stream to be pushed to generate the corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live streaming server according to the generation time of the subtitle information and the reception time of the audio stream.
Step 306: determine the display time of the subtitle information according to the time interval information.
Step 308: when it is determined that the playback conditions of the live stream to be pushed are met, play the video stream and the audio stream synchronously, and display the subtitle information based on the display time.
Specifically, the streamer broadcasts live through a smart terminal and pushes the generated initial live stream to the live streaming server. The live streaming server decodes the received initial live stream to generate an audio stream and a first video stream; performs speech recognition on the audio stream to generate corresponding recognized text and determines the time interval information between the generation time of the recognized text and the reception time of the audio stream; uses the recognized text as subtitle information and adds the subtitle information and the time interval information to the first video stream to generate a second video stream; and encodes the second video stream and the audio stream to generate a live stream to be pushed. When a user watches the streamer's broadcast, the live streaming server pushes the live stream to be pushed to the user's client.
While playing the live stream for the user, the client can pull a certain duration of the live stream to be pushed from the live streaming server in advance and cache it, so that it can decode the cached stream ahead of time to obtain the subtitle information it carries, determine the display time of the subtitle information from the time interval information between the generation time of the subtitle information and the time at which the live streaming server received the audio stream, and, when the playback conditions of the live stream to be pushed are met, play the decoded video stream and audio stream synchronously and display the subtitle information based on that display time.
In this embodiment of the present application, the above processing enables the client to parse in advance the subtitle information carried in the live stream to be pushed and determine its display time, i.e., the display time of the complete subtitle corresponding to that stream, from the time interval information between the generation time of the subtitle information and the time at which the live streaming server received the audio stream, so that the complete subtitle can be displayed in advance. This reduces the cost of generating subtitles and improves subtitle generation efficiency, while also avoiding desynchronization between the subtitles and the video picture or audio, thereby meeting users' need to read live subtitles while watching and improving their live viewing experience.
The above is a schematic solution of another livestreaming data processing method of this embodiment. It should be noted that this technical solution belongs to the same concept as the technical solution of the livestreaming data processing method described above; for details not described here, reference may be made to the description of that method.
Referring to FIG. 4, the livestreaming data processing method provided by the embodiments of the present application is further described by taking its application in the livestreaming field as an example. FIG. 4 shows an interaction diagram of a livestreaming data processing method applied to the livestreaming field according to an embodiment of the present application, specifically including the following steps:
Step 402: the transcoding module receives the streamer's initial live stream.
Step 404: the transcoding module decodes the initial live stream to generate an audio stream and a first video stream.
Step 406: the transcoding module transmits the audio stream to the speech recognition service module via gRPC.
Step 408: the speech recognition service module performs speech recognition on the audio stream to generate corresponding recognized text.
Step 410: the speech recognition service module determines the generation time of the recognized text, determines the time interval information between the generation time and the reception time of the audio stream, and determines the text type of the recognized text according to its text length and/or text semantics.
Step 412: the speech recognition service module transmits the recognized text, text type, and time interval information to the transcoding module via gRPC.
Step 414: the transcoding module uses the recognized text as subtitle information, and adds the subtitle information, the time interval information, and the text type to the first video stream to generate a second video stream.
Step 416: the transcoding module encodes the second video stream and the audio stream to generate the live stream to be pushed.
Step 418: the client pulls the live stream to be pushed from the live streaming server. (The live streaming server here includes the transcoding module and the speech recognition service module.)
Step 420: the client decodes the live stream to be pushed to generate the corresponding audio stream, second video stream, subtitle information, and time interval information; determines the display time of the subtitle information according to the time interval information; and, when it is determined that the playback conditions of the live stream to be pushed are met, plays the second video stream and the audio stream synchronously and displays the subtitle information based on the display time.
In this embodiment of the present application, the above processing likewise enables the client to parse in advance the subtitle information carried in the live stream to be pushed and determine its display time, i.e., the display time of the complete subtitle corresponding to that stream, from the time interval information between the generation time of the subtitle information and the time at which the live streaming server received the audio stream, so that the complete subtitle can be displayed in advance. This reduces the cost of generating subtitles and improves subtitle generation efficiency, while also avoiding desynchronization between the subtitles and the video picture or audio, thereby meeting users' need to read live subtitles while watching and improving their live viewing experience.
Corresponding to the above method embodiments, the present application further provides embodiments of a livestreaming data processing apparatus. FIG. 5 shows a schematic structural diagram of a livestreaming data processing apparatus provided by an embodiment of the present application. As shown in FIG. 5, the apparatus includes:
a decoding module 502, configured to decode a received initial live stream to generate an audio stream and a first video stream;
a recognition module 504, configured to perform speech recognition on the audio stream, generate corresponding recognized text, and determine time interval information between the generation time of the recognized text and the reception time of the audio stream;
an adding module 506, configured to use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream;
an encoding module 508, configured to encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.
Optionally, the decoding module 502 is further configured to:
determine the live stream to be played cached by the client, and determine the generation time corresponding to the live stream to be played;
obtain, according to the live stream identifier corresponding to the live stream to be played and the generation time, an initial live stream corresponding to the live stream identifier within a preset time interval, and decode the initial live stream, wherein the preset time interval is later than the generation time.
Optionally, the client decodes the live stream to be played to generate a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;
when it is determined that the playback conditions of the live stream to be played are met, the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
Optionally, the livestreaming data processing apparatus further includes a determination module configured to:
determine the text type of the recognized text according to the text length and/or text semantics of the recognized text.
Correspondingly, the adding module 506 is further configured to:
determine a target video frame in the first video stream according to the generation time;
use the recognized text as subtitle information, and add the subtitle information, the time interval information, and the text type to the first video stream as video frame information of the target video frame.
Optionally, the client decodes the live stream to be pushed to generate a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;
when it is determined that the text type is a target type, the display time of the subtitle information is determined according to the playback time of the target video frame and the time interval information;
at least two video frames in the video stream for displaying the subtitle information are determined according to the display time, wherein the playback time of the at least two video frames is earlier than the playback time of the target video frame;
when it is determined that the playback conditions of the live stream to be pushed are met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed in the at least two video frames and the target video frame based on the display time.
Optionally, the livestreaming data processing apparatus further includes a division module configured to:
divide the audio stream according to the spectrum information corresponding to the audio stream to generate at least two audio segments.
Correspondingly, the recognition module 504 is further configured to:
perform speech recognition on a target audio segment to generate corresponding recognized text, wherein the target audio segment is one of the at least two audio segments;
determine the generation time of the recognized text, and determine time interval information between the generation time and the reception time of the target audio segment.
Optionally, the recognition module 504 is further configured to:
split the audio stream according to a preset recognition window to generate at least one audio segment;
perform speech recognition on a target audio segment to generate corresponding recognized text, wherein the target audio segment is one of the at least one audio segment;
determine the generation time of the recognized text, and determine time interval information between the generation time and the reception time of the audio stream.
Optionally, the decoding module 502 is further configured to:
decode the received initial live stream through the transcoding module to generate an audio stream and a first video stream.
Correspondingly, the recognition module 504 is further configured to:
perform speech recognition on the audio stream through the speech recognition service module to generate corresponding recognized text.
Optionally, the recognition module 504 is further configured to:
split, by the speech recognition service module, the audio stream according to a preset recognition window to generate at least one audio segment;
perform speech recognition on a first audio segment to generate corresponding first recognized text, and return the first recognized text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.
Optionally, the adding module 506 is further configured such that:
the transcoding module determines a first target video frame in the first video stream according to the generation time of the first recognized text;
the first recognized text is used as first subtitle information, and the first subtitle information, together with the time interval information between the generation time of the first recognized text and the reception time of the audio stream, is added to the first video stream as video frame information of the first target video frame.
Optionally, the recognition module 504 is further configured to:
perform speech recognition on a second audio segment, adjacent to the first audio segment, among the at least two audio segments, generate corresponding second recognized text, and return the first recognized text and the second recognized text to the transcoding module.
Optionally, the adding module 506 is further configured such that:
the transcoding module determines a second target video frame in the first video stream according to the generation time of the second recognized text;
the first recognized text and the second recognized text are used as second subtitle information, and the second subtitle information, together with the time interval information between the generation time of the second recognized text and the reception time of the audio stream, is added to the first video stream as video frame information of the second target video frame.
Optionally, the livestreaming data processing apparatus further includes a transmission module configured such that:
the transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
The above is a schematic solution of a livestreaming data processing apparatus of this embodiment. It should be noted that the technical solution of the apparatus belongs to the same concept as the technical solution of the livestreaming data processing method described above; for details not described in the technical solution of the apparatus, reference may be made to the description of that method.
Corresponding to the above method embodiments, the present application further provides embodiments of another livestreaming data processing apparatus. FIG. 6 shows a schematic structural diagram of another livestreaming data processing apparatus provided by an embodiment of the present application. As shown in FIG. 6, the apparatus includes:
a receiving module 602, configured to receive and cache a live stream to be pushed returned by the live streaming server;
a decoding module 604, configured to decode the live stream to be pushed to generate a corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live streaming server according to the generation time of the subtitle information and the reception time of the audio stream;
a determination module 606, configured to determine a display time of the subtitle information according to the time interval information;
a display module 608, configured to play the video stream and the audio stream synchronously when it is determined that the playback conditions of the live stream to be pushed are met, and display the subtitle information based on the display time.
The above is a schematic solution of another livestreaming data processing apparatus of this embodiment. It should be noted that the technical solution of this apparatus belongs to the same concept as the technical solution of the other livestreaming data processing method described above; for details not described in the technical solution of the apparatus, reference may be made to the description of that method.
FIG. 7 shows a structural block diagram of a computing device 700 provided according to an embodiment of the present application. Components of the computing device 700 include but are not limited to a memory 710 and a processor 720. The processor 720 is connected to the memory 710 via a bus 730, and a database 750 is used to store data.
The computing device 700 further includes an access device 740 that enables the computing device 700 to communicate via one or more networks 760. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so on.
In one embodiment of the present application, the above components of the computing device 700, as well as other components not shown in FIG. 7, may also be connected to each other, for example via a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 7 is for illustrative purposes only and does not limit the scope of the present application. Those skilled in the art may add or replace components as needed.
The computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smart watch, smart glasses, etc.) or other type of mobile device, or a stationary computing device such as a desktop computer or a PC. The computing device 700 may also be a mobile or stationary server.
The processor 720 is configured to execute computer-executable instructions, wherein the processor implements the steps of the livestreaming data processing method when executing the computer-executable instructions.
The above is a schematic solution of a computing device of this embodiment. It should be noted that the technical solution of the computing device belongs to the same concept as the technical solution of the livestreaming data processing method described above; for details not described in the technical solution of the computing device, reference may be made to the description of that method.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the following steps:
decoding a received initial live stream to generate an audio stream and a first video stream;
performing speech recognition on the audio stream to generate corresponding recognized text, and determining time interval information between the generation time of the recognized text and the reception time of the audio stream;
using the recognized text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream;
encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning the live stream to be pushed to the client;
or implement the following steps:
receiving and caching a live stream to be pushed returned by the live streaming server;
decoding the live stream to be pushed to generate a corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live streaming server according to the generation time of the subtitle information and the reception time of the audio stream;
determining a display time of the subtitle information according to the time interval information;
when it is determined that the playback conditions of the live stream to be pushed are met, playing the video stream and the audio stream synchronously, and displaying the subtitle information based on the display time.
The above is a schematic solution of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the livestreaming data processing method described above; for details not described in the technical solution of the storage medium, reference may be made to the description of that method.
Specific embodiments of the present application have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the specific order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application, some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the embodiments of the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to help explain the present application. The optional embodiments neither describe all details exhaustively nor limit the invention to the specific implementations described. Obviously, many modifications and variations can be made in light of the content of the embodiments of the present application. These embodiments were selected and specifically described in order to better explain the principles and practical applications of the embodiments of the present application, so that those skilled in the art can well understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims (20)

1. A livestreaming data processing method, comprising:
    decoding a received initial live stream to generate an audio stream and a first video stream;
    performing speech recognition on the audio stream to generate corresponding recognized text, and determining time interval information between a generation time of the recognized text and a reception time of the audio stream;
    using the recognized text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream;
    encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning the live stream to be pushed to a client.
2. The livestreaming data processing method according to claim 1, wherein decoding the received initial live stream comprises:
    determining a live stream to be played cached by the client, and determining a generation time corresponding to the live stream to be played;
    obtaining, according to a live stream identifier corresponding to the live stream to be played and the generation time, an initial live stream corresponding to the live stream identifier within a preset time interval, and decoding the initial live stream, wherein the preset time interval is later than the generation time.
3. The livestreaming data processing method according to claim 2, wherein the client decodes the live stream to be played to generate a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;
    when it is determined that a playback condition of the live stream to be played is met, the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
4. The livestreaming data processing method according to claim 1, further comprising:
    determining a text type of the recognized text according to a text length and/or text semantics of the recognized text;
    correspondingly, using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream comprises:
    determining a target video frame in the first video stream according to the generation time;
    using the recognized text as subtitle information, and adding the subtitle information, the time interval information, and the text type to the first video stream as video frame information of the target video frame.
  5. The livestreaming data processing method according to claim 4, wherein the client decodes the live stream to be pushed to generate a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information comprises the subtitle information, the time interval information, and the text type;
    when it is determined that the text type is a target type, a display time of the subtitle information is determined according to a playback time of the target video frame and the time interval information;
    at least two video frames for displaying the subtitle information are determined in the video stream according to the display time, wherein a playback time of the at least two video frames is earlier than the playback time of the target video frame;
    and when it is determined that a playback condition of the live stream to be pushed is met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed in the at least two video frames and the target video frame based on the display time.
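
A sketch of the client-side placement logic in claim 5: the subtitle's display time is the target frame's playback time pulled back by the recognition latency, and every buffered frame from that instant up to the target frame carries the subtitle. Frame rate and timings are assumed values.

```python
# Illustrative sketch only: back-date the subtitle by the ASR latency and
# select the (at least two) earlier frames that should display it.
def frames_for_subtitle(frame_pts: list[float], target_pts: float,
                        interval: float) -> tuple[float, list[float]]:
    display_time = target_pts - interval                       # claimed back-off
    chosen = [pts for pts in frame_pts if display_time <= pts < target_pts]
    return display_time, chosen

pts = [i / 30.0 for i in range(90)]                            # 3 s at 30 fps
display_time, chosen = frames_for_subtitle(pts, target_pts=2.0, interval=0.5)
print(display_time, len(chosen))                               # 1.5 s, 15 frames
```
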
  6. The livestreaming data processing method according to claim 1, further comprising:
    dividing the audio stream according to spectrum information corresponding to the audio stream to generate at least two audio segments;
    correspondingly, performing speech recognition on the audio stream to generate the corresponding recognized text and determining the time interval information between the generation time of the recognized text and the reception time of the audio stream comprises:
    performing speech recognition on a target audio segment to generate corresponding recognized text, wherein the target audio segment is one of the at least two audio segments;
    determining the generation time of the recognized text, and determining time interval information between the generation time and a reception time of the target audio segment.
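
One plausible reading of the spectrum-based division in claim 6 is a silence-gated splitter: cut wherever the signal energy stays below a floor, so each segment is a natural speech unit. Short-time energy stands in here for full spectral analysis, and all constants are assumptions.

```python
# Illustrative sketch only: split audio samples at low-energy stretches.
def split_on_silence(samples: list[float], frame: int = 160,
                     threshold: float = 0.01) -> list[list[float]]:
    segments, current = [], []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        energy = sum(s * s for s in chunk) / max(len(chunk), 1)
        if energy < threshold and current:
            segments.append(current)     # a quiet frame closes the open segment
            current = []
        elif energy >= threshold:
            current.extend(chunk)        # a voiced frame extends the segment
    if current:
        segments.append(current)
    return segments

audio = [0.5] * 800 + [0.0] * 320 + [0.4] * 800    # speech, pause, speech
print(len(split_on_silence(audio)))                 # 2 segments
```
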
  7. The livestreaming data processing method according to claim 1, wherein performing speech recognition on the audio stream to generate the corresponding recognized text and determining the time interval information between the generation time of the recognized text and the reception time of the audio stream comprises:
    splitting the audio stream according to a preset recognition window to generate at least one audio segment;
    performing speech recognition on a target audio segment to generate corresponding recognized text, wherein the target audio segment is one of the at least one audio segment;
    determining the generation time of the recognized text, and determining the time interval information between the generation time and the reception time of the audio stream.
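
By contrast, claim 7 splits on a fixed-length recognition window; a two-second window at 16 kHz is assumed below purely for illustration.

```python
# Illustrative sketch only: fixed recognition-window split of claim 7.
def split_by_window(samples: list[float], sample_rate: int = 16000,
                    window_s: float = 2.0) -> list[list[float]]:
    step = int(sample_rate * window_s)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

print(len(split_by_window([0.0] * 80000)))   # 5 s of audio -> 3 windows
```
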
  8. The livestreaming data processing method according to claim 1, wherein decoding the received initial live stream to generate the audio stream and the first video stream comprises:
    decoding the received initial live stream through a transcoding module to generate the audio stream and the first video stream;
    correspondingly, performing speech recognition on the audio stream to generate the corresponding recognized text comprises:
    performing speech recognition on the audio stream through a speech recognition service module to generate the corresponding recognized text.
  9. The livestreaming data processing method according to claim 8, wherein performing speech recognition on the audio stream through the speech recognition service module to generate the corresponding recognized text comprises:
    splitting, by the speech recognition service module, the audio stream according to a preset recognition window to generate at least one audio segment;
    performing speech recognition on a first audio segment to generate corresponding first recognized text, and returning the first recognized text to the transcoding module, wherein the first audio segment is one of the at least one audio segment.
  10. The livestreaming data processing method according to claim 9, wherein using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream comprises:
    determining, by the transcoding module, a first target video frame in the first video stream according to a generation time of the first recognized text;
    using the first recognized text as first subtitle information, and adding the first subtitle information, together with time interval information between the generation time of the first recognized text and the reception time of the audio stream, to the first video stream as video frame information of the first target video frame.
  11. The livestreaming data processing method according to claim 10, wherein performing speech recognition on the audio stream through the speech recognition service module to generate the corresponding recognized text comprises:
    performing speech recognition on a second audio segment that is adjacent to the first audio segment among the at least two audio segments to generate corresponding second recognized text, and returning the first recognized text and the second recognized text to the transcoding module.
  12. The livestreaming data processing method according to claim 11, wherein using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream comprises:
    determining, by the transcoding module, a second target video frame in the first video stream according to a generation time of the second recognized text;
    using the first recognized text and the second recognized text as second subtitle information, and adding the second subtitle information, together with time interval information between the generation time of the second recognized text and the reception time of the audio stream, to the first video stream as video frame information of the second target video frame.
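
Claims 9 through 12 describe an incremental flow: each new adjacent window re-emits the running text so far, and the transcoder attaches that accumulated text to the frame matching the latest recognition's generation time. The sketch below models both modules with plain classes; all names are illustrative, not the claimed modules themselves.

```python
# Illustrative sketch only: accumulating subtitles across adjacent segments.
class AsrService:
    def __init__(self) -> None:
        self.texts: list[str] = []

    def recognize_segment(self, segment: str) -> str:
        self.texts.append(segment.upper())   # stand-in for real recognition
        return " ".join(self.texts)          # first text, then first+second, ...

class Transcoder:
    def __init__(self) -> None:
        self.frame_info: dict[int, dict] = {}   # target frame index -> side data

    def attach(self, frame_idx: int, subtitle: str, interval: float) -> None:
        self.frame_info[frame_idx] = {"subtitle": subtitle, "interval": interval}

asr, transcoder = AsrService(), Transcoder()
transcoder.attach(30, asr.recognize_segment("hello"), interval=0.4)   # first text
transcoder.attach(60, asr.recognize_segment("world"), interval=0.5)   # first + second
print(transcoder.frame_info[60]["subtitle"])                          # HELLO WORLD
```
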
  13. The livestreaming data processing method according to claim 8, further comprising:
    transmitting, by the transcoding module, the audio stream to the speech recognition service module through a data transmission channel.
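
Claims 8 and 13 keep transcoding and recognition in separate modules joined by a data transmission channel; the queue and end-of-stream marker below are assumed stand-ins for whatever IPC or RPC a real deployment would use.

```python
# Illustrative sketch only: transcoder feeding an ASR service over a channel.
import queue
import threading

channel: "queue.Queue[bytes]" = queue.Queue()    # the data transmission channel

def transcoding_module(audio_chunks: list[bytes]) -> None:
    for chunk in audio_chunks:
        channel.put(chunk)                       # forward decoded audio
    channel.put(b"")                             # assumed end-of-stream marker

def asr_service_module(results: list[str]) -> None:
    while (chunk := channel.get()) != b"":
        results.append(f"text for {len(chunk)}-byte chunk")   # stand-in ASR

results: list[str] = []
worker = threading.Thread(target=asr_service_module, args=(results,))
worker.start()
transcoding_module([b"aa", b"bbbb"])
worker.join()
print(results)
```
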
  14. A livestreaming data processing method, comprising:
    receiving and caching a live stream to be pushed that is returned by a live server;
    decoding the live stream to be pushed to generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live server according to a generation time of the subtitle information and a reception time of the audio stream;
    determining a display time of the subtitle information according to the time interval information;
    when it is determined that a playback condition of the live stream to be pushed is met, playing the video stream and the audio stream synchronously, and displaying the subtitle information based on the display time.
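
On the client side of claim 14, playback starts only once the buffering condition holds, and the subtitle's display time is recovered from the server-computed interval; the buffer threshold and decoded structure below are assumptions.

```python
# Illustrative sketch only: client-side display-time derivation of claim 14.
from dataclasses import dataclass

@dataclass
class DecodedStream:
    target_pts: float     # playback time of the frame carrying the subtitle
    subtitle: str
    interval: float       # server-side recognition latency for this subtitle

def play(stream: DecodedStream, buffered_s: float,
         min_buffer_s: float = 1.0) -> str:
    if buffered_s < min_buffer_s:
        return "buffering"                       # playback condition not met
    display_time = stream.target_pts - stream.interval
    return f"show {stream.subtitle!r} from t={display_time:.2f}s"

print(play(DecodedStream(2.0, "hello world", 0.5), buffered_s=1.5))
```
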
  15. A livestreaming data processing system, comprising:
    a live server and a client;
    wherein the live server is configured to: decode a received initial live stream to generate an audio stream and a first video stream; perform speech recognition on the audio stream to generate corresponding recognized text, and determine time interval information between a generation time of the recognized text and a reception time of the audio stream; use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream; and encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client;
    and the client is configured to: receive and cache the live stream to be pushed; decode the live stream to be pushed to obtain the audio stream, the second video stream, the subtitle information, and the time interval information; determine a display time of the subtitle information according to the time interval information; and, when it is determined that a playback condition of the live stream to be pushed is met, play the second video stream and the audio stream synchronously and display the subtitle information based on the display time.
  16. The livestreaming data processing system according to claim 15, wherein decoding the received initial live stream comprises:
    determining a live stream to be played that is cached by the client, and determining a generation time corresponding to the live stream to be played;
    according to a live stream identifier corresponding to the live stream to be played and the generation time, obtaining the initial live stream corresponding to the live stream identifier within a preset time interval, and decoding the initial live stream, wherein the preset time interval is later than the generation time.
  17. The livestreaming data processing system according to claim 16, wherein the client decodes the live stream to be played to generate a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;
    and when it is determined that a playback condition of the live stream to be played is met, the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
  18. The livestreaming data processing system according to claim 15, wherein the live server is further configured to:
    determine a text type of the recognized text according to a text length and/or text semantics of the recognized text;
    correspondingly, using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream comprises:
    determining a target video frame in the first video stream according to the generation time;
    using the recognized text as subtitle information, and adding the subtitle information, the time interval information, and the text type to the first video stream as video frame information of the target video frame.
  19. A computing device, comprising:
    a memory and a processor;
    wherein the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, and the processor, when executing the computer-executable instructions, implements the steps of the livestreaming data processing method according to any one of claims 1 to 14.
  20. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the livestreaming data processing method according to any one of claims 1 to 14.
PCT/CN2023/106150 2022-10-25 2023-07-06 Livestreaming data processing method and system WO2024087732A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211311544.0A CN115643424A (en) 2022-10-25 2022-10-25 Live broadcast data processing method and system
CN202211311544.0 2022-10-25

Publications (1)

Publication Number Publication Date
WO2024087732A1 (en)

Family

ID=84946678

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106150 WO2024087732A1 (en) 2022-10-25 2023-07-06 Livestreaming data processing method and system

Country Status (2)

Country Link
CN (1) CN115643424A (en)
WO (1) WO2024087732A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115643424A (en) * 2022-10-25 2023-01-24 上海哔哩哔哩科技有限公司 Live broadcast data processing method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112272323A (en) * 2014-06-30 2021-01-26 苹果公司 Real-time digital assistant knowledge update
CN108063970A (en) * 2017-11-22 2018-05-22 北京奇艺世纪科技有限公司 A kind of method and apparatus for handling live TV stream
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN111010614A (en) * 2019-12-26 2020-04-14 北京奇艺世纪科技有限公司 Method, device, server and medium for displaying live caption
CN115086753A (en) * 2021-03-16 2022-09-20 北京有竹居网络技术有限公司 Live video stream processing method and device, electronic equipment and storage medium
CN115643424A (en) * 2022-10-25 2023-01-24 上海哔哩哔哩科技有限公司 Live broadcast data processing method and system

Also Published As

Publication number Publication date
CN115643424A (en) 2023-01-24

Similar Documents

Publication Title
US11463779B2 (en) Video stream processing method and apparatus, computer device, and storage medium
US11252444B2 (en) Video stream processing method, computer device, and storage medium
CN109168078B (en) Video definition switching method and device
US10244291B2 (en) Authoring system for IPTV network
US9478256B1 (en) Video editing processor for video cloud server
US10679675B2 (en) Multimedia file joining method and apparatus
WO2024087732A1 (en) Livestreaming data processing method and system
US11758245B2 (en) Interactive media events
CN112637670A (en) Video generation method and device
CN113923502B (en) Live video playing method and device
CN112616062A (en) Subtitle display method and device, electronic equipment and storage medium
CN114040255A (en) Live caption generating method, system, equipment and storage medium
CN103635938B (en) For processing the method for video streaming data, streaming client in streaming client
CN113301359A (en) Audio and video processing method and device and electronic equipment
WO2018142946A1 (en) Information processing device and method
KR100651566B1 (en) Multimedia Player Using Output Buffering in Mobile Terminal and Its Control Method
WO2013166785A1 (en) Media service providing method and device, and media service displaying method and device
CN113873296A (en) Video stream processing method and device
KR20210021630A (en) Method for transmiting contents and terminal apparatus using the same
CN113766342B (en) Subtitle synthesizing method and related device, electronic equipment and storage medium
CN118338093B (en) Soft solution method for playing H.265 video stream based on web front end
TWI819580B (en) Media playback method for improving playback response based on pre-parsing operation and related media playback device
CN118632044A (en) Audio/video transcoding processing and playing method, device, storage medium and program product
JP2024100180A (en) Reproduction controller, receiver and program
CN117376593A (en) Subtitle processing method and device for live stream, storage medium and computer equipment

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23881320

Country of ref document: EP

Kind code of ref document: A1