WO2024087732A1 - Method and system for processing live streaming data - Google Patents

Method and system for processing live streaming data

Info

Publication number
WO2024087732A1
Authority
WO
WIPO (PCT)
Prior art keywords
stream
live
text
audio
time
Prior art date
Application number
PCT/CN2023/106150
Other languages
English (en)
Chinese (zh)
Inventor
汤然
姜军
郑龙
刘永明
Original Assignee
上海哔哩哔哩科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海哔哩哔哩科技有限公司
Publication of WO2024087732A1

Links

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/26 Speech to text systems
    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
          • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
            • H04L65/60 Network streaming of media packets
              • H04L65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
        • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
            • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
              • H04N21/21 Server components or server architectures
                • H04N21/218 Source of audio or video content, e.g. local disk arrays
                  • H04N21/2187 Live feed
              • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
                • H04N21/262 Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
            • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
              • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                • H04N21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
                • H04N21/439 Processing of audio elementary streams
                • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                  • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
              • H04N21/47 End-user applications
                • H04N21/488 Data services, e.g. news ticker

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular to a live data processing method.
  • One or more embodiments of the present application also relate to a live data processing system, a computing device, and a computer-readable storage medium.
  • the embodiments of the present application provide a live broadcast data processing method.
  • One or more embodiments of the present application also relate to a live broadcast data processing device, a live broadcast data processing system, a computing device, and a computer-readable storage medium to solve the technical defects of high cost, low efficiency and delayed subtitle generation in the related art.
  • a live broadcast data processing method including:
  • the second video stream and the audio stream are encoded to generate a live stream to be pushed, and the live stream to be pushed is returned to the client.
  • a live broadcast data processing device including:
  • a decoding module configured to decode the received initial live stream to generate an audio stream and a first video stream
  • a recognition module configured to perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream;
  • an adding module configured to use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream;
  • the encoding module is configured to encode the second video stream and the audio stream, generate a live stream to be pushed, and return the live stream to be pushed to the client.
  • another live broadcast data processing method including:
  • the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • another live broadcast data processing device including:
  • a receiving module is configured to receive and cache the live stream to be pushed returned by the live broadcast server;
  • a decoding module is configured to decode the live broadcast stream to be pushed, generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live broadcast server according to the generation time of the subtitle information and the reception time of the audio stream;
  • a determination module configured to determine a display time of the subtitle information according to the time interval information
  • the display module is configured to synchronously play the video stream and the audio stream when it is determined that the playback conditions of the live stream to be pushed are met, and to display the subtitle information based on the display time.
  • a live broadcast data processing system including:
  • the live broadcast server is used to decode the received initial live broadcast stream, generate an audio stream and a first video stream, perform speech recognition on the audio stream, generate corresponding recognition text, determine the time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, add the subtitle information and the time interval information to the first video stream, generate a second video stream, encode the second video stream and the audio stream, generate a live broadcast stream to be pushed, and return the live broadcast stream to be pushed to the client;
  • the client is used to receive and cache the live stream to be pushed, decode the live stream to be pushed, obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously play the second video stream and the audio stream, and display the subtitle information based on the display time.
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions, wherein the processor implements the steps of the live data processing method when executing the computer-executable instructions.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, the steps of the live data processing method are implemented.
  • An embodiment of the present application implements a live broadcast data processing method and system, wherein the live broadcast data processing method includes decoding a received initial live broadcast stream, generating an audio stream and a first video stream, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream, generating a second video stream, encoding the second video stream and the audio stream, generating a live broadcast stream to be pushed, and returning the live broadcast stream to be pushed to the client.
  • In the embodiment of the present application, the live broadcast server performs speech recognition on the audio stream, generates the corresponding recognition text, and records the time interval between the generation time of the recognition text and the reception time of the audio stream. Since this time interval characterizes the time the live broadcast server spends performing speech recognition on the audio stream of the initial live stream after receiving it, once the recognition text and the time interval information are added to the video stream and returned to the client, the client can parse in advance the subtitle information carried in the live stream to be pushed and determine its display time from the time interval information between the generation time of the subtitle information and the time when the live broadcast server received the audio stream; that is, it can determine the display time of the complete subtitles corresponding to the live stream to be pushed and display the complete subtitles ahead of time based on that display time. This helps reduce the cost of generating subtitles, improves the efficiency of subtitle generation, and helps avoid asynchrony between subtitles and video images.
  • FIG. 1 is an architecture diagram of a live data processing system provided by an embodiment of the present application.
  • FIG. 2 is a flow chart of a live broadcast data processing method provided by an embodiment of the present application.
  • FIG. 3 is a flow chart of another live data processing method provided by an embodiment of the present application.
  • FIG. 4 is an interactive schematic diagram of a live broadcast data processing method provided by an embodiment of the present application applied to the live broadcast field.
  • FIG. 5 is a schematic diagram of the structure of a live data processing device provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the structure of another live data processing device provided by an embodiment of the present application.
  • FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present application.
  • Terms such as "first" and "second" may be used to describe various information in one or more embodiments of the present application, but that information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another.
  • Accordingly, the first may also be referred to as the second, and similarly, the second may also be referred to as the first.
  • word "if” as used herein may be interpreted as "at the time of” or "when” or "in response to determining”.
  • Live broadcast: broadly speaking, live broadcasting also includes television broadcasts, but here it generally refers to online live video. The live audio and video are pushed to a server in the form of a media stream (stream pushing). If viewers are watching the live broadcast, the server, after receiving a user's request, transmits the video to the website, app, or client player, which plays it in real time.
  • H264 generally refers to H.264.
  • H.264 is a highly compressed digital video codec standard proposed by the Joint Video Team (JVT), which is jointly composed of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
  • JVT: Joint Video Team
  • VCEG: ITU-T Video Coding Experts Group
  • MPEG: ISO/IEC Moving Picture Experts Group
  • H.265 is a new video coding standard developed by ITU-T VCEG after H.264.
  • The H.265 standard builds on the video coding standard H.264, retaining some of the original technologies while improving related techniques.
  • SEI stands for Supplemental Enhancement Information, which belongs to the bitstream layer. It provides a way to add additional information into the video bitstream and is one of the features of video compression standards such as H.264/H.265.
  • Speech recognition technology: a technology by which a machine converts speech signals into corresponding text or commands through a process of recognition and understanding.
  • GRPC: a type of RPC (Remote Procedure Call) framework. It is a high-performance, open-source, general-purpose RPC framework built on the ProtoBuf (Protocol Buffers) serialization protocol, with support for many development languages.
  • Video transcoding technology: converting a video signal from one format into another.
  • a live data processing method is provided.
  • One or more embodiments of the present application simultaneously relate to a live data processing apparatus, a live data processing system, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
  • The subtitle information of the embodiments of the present application can be presented on clients such as large video playback devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, e-book readers, and other display terminals.
  • The subtitle information of the embodiments of the present application can be applied to any video or audio that can present subtitles; for example, subtitles can be presented in live or recorded videos, and in the audio of online or offline songs or books.
  • FIG. 1 shows an architecture diagram of a live broadcast data processing system provided according to an embodiment of the present application, including:
  • a live broadcast server 102 and a client 104;
  • the live broadcast server 102 is used to decode the received initial live broadcast stream, generate an audio stream and a first video stream, perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream, use the recognition text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream, encode the second video stream and the audio stream, generate a live broadcast stream to be pushed, and return the live broadcast stream to be pushed to the client 104;
  • the client 104 is used to receive and cache the live stream to be pushed, decode the live stream to be pushed, obtain the audio stream, the second video stream, the subtitle information and the time interval information, determine the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously play the second video stream and the audio stream, and display the subtitle information based on the display time.
  • user U1 broadcasts live through a smart terminal and pushes the generated initial live stream to the live server 102.
  • the live server 102 decodes the received initial live stream to generate an audio stream and a first video stream; then performs speech recognition on the audio stream to generate a corresponding recognition text, and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; then uses the recognition text as subtitle information, and adds the subtitle information and the time interval information to the first video stream to generate a second video stream; then encodes the second video stream and the audio stream to generate a live stream to be pushed.
  • the live server can push the live stream to be pushed to the client 104 of user U2 and user U3.
  • The client 104 can pull a certain length of the live stream to be pushed from the live server in advance and cache it, so that it can decode the cached live stream to be pushed ahead of time and obtain the subtitle information it contains. Then, according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the time when the live broadcast server 102 received the audio stream, the display time of the subtitle information can be determined. When it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • The above processing method enables the client to parse in advance the subtitle information carried in the live stream to be pushed and to determine its display time from the time interval information between the generation time of the subtitle information and the time when the live server received the audio stream; that is, it can determine the display time of the complete subtitles corresponding to the live stream to be pushed and display them ahead of time based on that display time. This helps reduce the cost of generating subtitles, improves the efficiency of subtitle generation, and helps avoid asynchrony between subtitles and the video picture or audio, thereby meeting the user's need to see live subtitles while watching and improving the user's live viewing experience.
  • the above is a schematic scheme of a live data processing system of this embodiment. It should be noted that the technical scheme of the live data processing system and the technical scheme of the following live data processing method belong to the same concept, and the details not described in detail in the technical scheme of the live data processing system can be referred to the description of the technical scheme of the following live data processing method.
  • FIG. 2 shows a flow chart of a live broadcast data processing method provided according to an embodiment of the present application, comprising the following steps:
  • Step 202 Decode the received initial live stream to generate an audio stream and a first video stream.
  • the live broadcast data processing method provided in the embodiment of the present application is applied to a live broadcast server.
  • the initial live broadcast stream is the live broadcast stream pushed to the live broadcast server by the anchor during the live broadcast process.
  • When the host is broadcasting live through a smart terminal, the live stream generated during the broadcast can be pushed to the live broadcast server, so that when other users need to watch the host's live broadcast, the live broadcast server can push that live stream to the user terminals (clients) of those users.
  • the live server can decode the initial live stream to obtain an audio stream and a first video stream, and can perform voice recognition on the audio stream to obtain the corresponding recognized text, and then add the recognized text as subtitle information to the first video stream to generate a second video stream, so that after the encoding results of the audio stream and the second video stream are pushed to the user's client, the client can decode and obtain the subtitle information, and can display the subtitle information while synchronously playing the audio stream and the second video stream for the user, thereby avoiding the problem of the live subtitles being out of sync with the live video screen or audio during the user's real-time viewing of the live broadcast, so as to meet the user's needs for viewing live subtitles during the live viewing process and improve the user's live viewing experience.
  • decoding the received initial live stream can be implemented in the following ways:
  • an initial live stream corresponding to the live stream identifier within a preset time interval is obtained, and the initial live stream is decoded, wherein the preset time interval is later than the generation time.
  • the client decodes the live stream to be played, generates a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;
  • the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
  • When playing a live stream, the client can pre-cache the live stream to be played within a period of time after the current playback time, and parse this part of the live stream to be played in advance to obtain the video stream to be played, the audio stream to be played, the subtitles to be displayed, and the display time corresponding to the subtitles to be displayed. Then, when it is determined that the playback conditions of the live stream to be played are met, the decoded video stream to be played and audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
  • For example, the live stream to be played within t to t+5s is pre-cached and parsed in advance, and whether the subtitles to be displayed need to be shown ahead of time is determined based on the display time of those subtitles in the parsing results, thereby reducing the delay between the live subtitles and the live video picture or audio during the user's real-time viewing of the live broadcast.
  • the client pre-caches the live stream to be played within t to t+5s, and after the live stream to be played within t to t+3s is played, it is necessary to cache the live stream to be played within t+5s to t+8s, that is, it is necessary to obtain the live stream to be played within t+5s to t+8s from the live broadcast server.
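  • A minimal sketch of this cache-refill arithmetic (the function and parameter names are assumptions, not part of this publication):

```python
def next_fetch_window(cache_start_s: float,
                      cache_len_s: float = 5.0,
                      played_s: float = 3.0) -> tuple[float, float]:
    """Compute the next cache-refill window.

    With t..t+5s cached and t..t+3s already played, the next fetch is
    t+5s..t+8s: a window starting where the current cache ends, long
    enough to replace what has been played.
    """
    start = cache_start_s + cache_len_s
    return start, start + played_s

print(next_fetch_window(0.0))  # -> (5.0, 8.0)
```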
  • the live broadcast server can pre-determine the live broadcast stream to be played that has been cached by the client, and determine the generation time (playback time) corresponding to the cached live broadcast stream to be played, and then obtain the initial live broadcast stream corresponding to the live broadcast stream identifier within a period of time after the generation time based on the live broadcast stream identifier and generation time corresponding to the cached live broadcast stream to be played, and process the initial live broadcast stream to generate a live broadcast stream to be pushed containing subtitle information, and push it to the client.
  • When the user is watching the live broadcast in real time through the client, the client pre-caches the live stream to be played within a period of time after the current playback time and parses this part of the live stream to be played in advance.
  • The live broadcast server can also determine in advance the live stream to be played that the client has cached, and determine and parse the corresponding initial live stream. Although the live server's parsing of the initial live stream and the client's parsing of the live stream to be played both take a certain amount of time and introduce some delay into the live broadcast, in the embodiment of the present application the two are performed in parallel, and the client can decide, according to the display time of the subtitles to be displayed in the parsing result, whether to display the subtitles in advance, thereby reducing the delay between the live subtitles and the live video picture or audio during the user's real-time viewing of the live broadcast.
  • Step 204 Perform speech recognition on the audio stream to generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream.
  • After the live broadcast server decodes the initial live broadcast stream and obtains the audio stream and the first video stream, it can perform speech recognition on the audio stream to generate the corresponding recognition text, and then add the recognition text as subtitle information to the first video stream to generate a second video stream, so that after the client watching the live broadcast obtains the second video stream, it can display the subtitle information to the user during playback of the second video stream.
  • After the live broadcast server decodes and obtains the audio stream, performing speech recognition on the audio stream often takes a certain amount of time.
  • After the embodiment of the present application obtains the complete recognition text, in order to avoid asynchrony between the recognition text and the video image or sound, it is necessary to determine the time consumed in generating the recognition text, that is, the time interval between the generation time of the recognition text and the time when the live broadcast server received the audio stream, so that after obtaining the recognition text the client can determine from this interval how far in advance to display it.
  • the audio stream may be divided according to the spectrum information corresponding to the audio stream to generate at least two audio segments;
  • performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream includes:
  • the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the target audio segment is determined.
  • the embodiment of the present application can first divide the audio stream according to the spectrum information corresponding to the audio stream to generate at least two audio segments. For example, according to the spectrum information, the audio stream between any two adjacent points with a spectrum value of 0 (indicating a pause) is regarded as an audio segment. Then, speech recognition is performed on each audio segment to generate a corresponding recognition text, determine the generation time of the recognition text, and determine the time interval information between the generation time and the reception time of each audio segment (the reception time of the audio stream or the initial live stream).
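  • A minimal sketch of this pause-based splitting, assuming mono PCM input and using short-term frame energy as a stand-in for the "spectrum value of 0" pause test (all names and thresholds are illustrative):

```python
import numpy as np

def split_on_pauses(samples: np.ndarray, rate: int,
                    frame_ms: int = 20, threshold: float = 1e-4):
    """Split a mono PCM signal into segments separated by silent frames.

    A frame whose mean energy falls below `threshold` plays the role of
    the 'spectrum value of 0' pause described above. Returns a list of
    (start, end) sample indices, one pair per non-silent segment.
    """
    x = samples.astype(np.float64)
    frame_len = rate * frame_ms // 1000
    n_frames = len(x) // frame_len
    energy = np.array([(x[i * frame_len:(i + 1) * frame_len] ** 2).mean()
                       for i in range(n_frames)])
    silent = energy < threshold

    segments, start = [], None
    for i, is_silent in enumerate(silent):
        if not is_silent and start is None:
            start = i * frame_len                    # segment begins
        elif is_silent and start is not None:
            segments.append((start, i * frame_len))  # pause ends segment
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```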
  • performing speech recognition on the audio stream to generate corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream including:
  • the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.
  • When performing speech recognition, a preset recognition window is usually used; the window length of the preset recognition window can be 0.5s-1s, or alternatively 1s-5s.
  • the audio stream is subjected to speech recognition according to a preset recognition window. Specifically, the audio stream is split according to the preset recognition window to generate at least one audio segment, speech recognition is performed on each audio segment to generate a corresponding recognition text, and then the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.
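  • The fixed-window variant described above can be sketched as follows (names are illustrative):

```python
def split_by_window(samples, rate: int, window_s: float = 1.0):
    """Split audio into fixed-length recognition windows.

    Per the window-length ranges above, roughly 0.5s-1s per window tends
    to capture single words, while 1s-5s can capture a complete sentence.
    """
    win = int(rate * window_s)
    return [samples[i:i + win] for i in range(0, len(samples), win)]
```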
  • the live broadcast server includes a transcoding module and a speech recognition service module. Therefore, the received initial live broadcast stream is decoded to generate an audio stream and a first video stream. Specifically, the received initial live broadcast stream is decoded by the transcoding module to generate an audio stream and a first video stream; speech recognition is performed on the audio stream to generate a corresponding recognition text. Specifically, speech recognition is performed on the audio stream by the speech recognition service module to generate a corresponding recognition text.
  • the transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
  • the data transmission channel may be GRPC.
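  • A hedged sketch of such a GRPC channel between the transcoding module and the speech recognition service module follows; the generated modules, service name, and message fields are hypothetical, not defined by this publication:

```python
import grpc

# `asr_pb2` and `asr_pb2_grpc` stand for modules generated from a
# hypothetical speech-recognition .proto; the service and field names
# below are assumptions for illustration only.
import asr_pb2
import asr_pb2_grpc

def stream_audio_to_asr(pcm_chunks, server_addr: str = "localhost:50051"):
    """Stream decoded PCM chunks to the speech recognition service and
    yield (recognized_text, interval_ms) pairs as they come back."""
    channel = grpc.insecure_channel(server_addr)
    stub = asr_pb2_grpc.SpeechRecognitionStub(channel)

    def requests():
        for chunk in pcm_chunks:                 # raw PCM bytes per chunk
            yield asr_pb2.AudioChunk(data=chunk)

    # Bidirectional streaming: each response carries recognition text
    # plus the generation-time / reception-time interval described above.
    for resp in stub.Recognize(requests()):
        yield resp.text, resp.interval_ms
```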
  • performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text includes:
  • Here too, a preset recognition window can be used: the audio stream is split according to the preset recognition window to generate at least one audio segment, speech recognition is performed on each audio segment to generate the corresponding recognition text, and then the generation time of the recognition text is determined, along with the time interval information between the generation time and the reception time of the audio stream.
  • the window length of the preset recognition window may be 0.5s-1s, and the preset recognition window may be used to perform speech recognition of the audio stream, so that a single word in the audio stream can be recognized; or the window length of the preset recognition window may be 1s-5s, and the preset recognition window may be used to perform speech recognition of the audio stream, so that a complete sentence in the audio stream can be recognized.
  • the specific window length may be determined according to actual needs and is not limited here.
  • Step 206 Use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream.
  • the recognition text can be used as subtitle information, and the subtitle information and the time interval information can be added to the first video stream to generate the second video stream.
  • the subtitle information may be written to the first video stream in the form of SEI to generate the second video stream.
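  • To make the SEI mechanism concrete, the following sketch packs subtitle information into an H.264 user_data_unregistered SEI message (payload type 5); the UUID and JSON field layout are assumptions, and emulation-prevention handling is noted but omitted:

```python
import json

# Hypothetical 16-byte identifier for this subtitle payload; a real
# system would pick its own UUID.
SUBTITLE_UUID = bytes(16)

def build_subtitle_sei(text: str, interval_ms: int, text_type: str) -> bytes:
    """Pack subtitle information into an H.264 SEI NAL unit
    (user_data_unregistered, payload type 5)."""
    payload = SUBTITLE_UUID + json.dumps({
        "subtitle": text,
        "interval_ms": interval_ms,  # generation time minus reception time
        "type": text_type,           # character / word / sentence
    }).encode("utf-8")

    sei = bytearray([0x06, 0x05])    # NAL type 6 (SEI), payload type 5
    size = len(payload)
    while size >= 255:               # payload size uses 0xFF chaining
        sei.append(0xFF)
        size -= 255
    sei.append(size)
    sei += payload
    sei.append(0x80)                 # RBSP stop bit
    # NOTE: a real muxer must also insert emulation-prevention bytes and
    # prepend the Annex B start code b"\x00\x00\x00\x01".
    return bytes(sei)
```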
  • the text type of the recognized text may also be determined according to the text length and/or text semantics of the recognized text;
  • the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
  • the recognized text is used as subtitle information
  • the subtitle information, the time interval information and the text type are used as video frame information of the target video frame, and added to the first video stream.
  • As described above, a preset recognition window is usually used for speech recognition, with a window length of 0.5s-1s or 1s-5s.
  • the embodiment of the present application can also determine the text type of the recognized text according to the text length and/or text semantics of the recognized text.
  • the text type includes but is not limited to characters, words, sentences, etc.
  • Text semantics is used to determine whether the recognized text can express complete semantics. If so, the text type of the recognized text can be determined as the sentence type; if not, then when the text length of the recognized text is greater than or equal to two characters, the text type is the word type, and if the text length is equal to 1, the text type is the character type.
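  • A minimal sketch of this classification, using end-of-sentence punctuation as a naive stand-in for the semantic completeness check:

```python
SENTENCE_ENDINGS = ("。", "！", "？", ".", "!", "?")

def classify_text(text: str) -> str:
    """Classify recognized text as 'sentence', 'word', or 'character'."""
    if text.endswith(SENTENCE_ENDINGS):  # naive proxy for complete semantics
        return "sentence"
    if len(text) >= 2:                   # two or more characters
        return "word"
    return "character"
```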
  • the target video frame in the first video stream can be determined according to the generation time of the recognized text, the recognized text can be used as subtitle information, and the subtitle information, time interval information and text type can be used as video frame information of the target video frame and added to the first video stream.
  • the last video frame in the video segment corresponding to the target audio segment can usually be used as the target video frame, and the subtitle information, time interval information and text type can be used as its video frame information and added to the first video stream to generate a second video stream.
  • the client can determine the subtitle information to be displayed according to the text type, and usually gives priority to sentence-type subtitle information for display to ensure the subtitle viewing effect of the live broadcast.
  • the speech recognition service module splits the audio stream according to a preset recognition window to generate at least one audio segment, and performs speech recognition on the first audio segment to generate a corresponding first recognition text
  • the recognition text is used as subtitle information
  • the subtitle information and the time interval information are added to the first video stream, including:
  • the transcoding module determines a first target video frame in the first video stream according to a generation time of the first recognition text
  • the first recognition text is used as the first subtitle information, and the first subtitle information, together with the time interval information between the generation time of the first recognition text and the reception time of the audio stream, is added to the first video stream as the video frame information of the first target video frame.
  • When the speech recognition service module splits the audio stream into one or at least two audio segments, it can perform speech recognition on each audio segment in turn, and after generating the recognition text corresponding to any audio segment, it can return that recognition text to the transcoding module.
  • The transcoding module determines the target video frame in the first video stream (usually the last video frame of the video segment corresponding to that audio segment) according to the generation time of the recognition text, uses the recognition text as subtitle information, and adds the subtitle information, together with the time interval information between the generation time of the recognition text and the reception time of the audio stream, to the first video stream as the video frame information of the target video frame.
  • performing speech recognition on the audio stream through a speech recognition service module to generate corresponding recognition text includes:
  • the step of using the recognized text as subtitle information and adding the subtitle information and the time interval information to the first video stream includes:
  • the transcoding module determines a second target video frame in the first video stream according to a generation time of the second recognition text
  • the first recognition text and the second recognition text are used as second subtitle information, and the second subtitle information, together with the time interval information between the generation time of the second recognition text and the reception time of the audio stream, is added to the first video stream as the video frame information of the second target video frame.
  • When the speech recognition service module splits the audio stream into at least two audio segments, it can first perform speech recognition on the first of the at least two audio segments to generate a corresponding first recognition text, and the transcoding module uses the first recognition text as subtitle information and adds the subtitle information, together with the time interval information between the generation time of the first recognition text and the reception time of the audio stream, to the first video stream as the video frame information of the first target video frame (usually the last video frame of the video segment corresponding to the first audio segment).
  • speech recognition can be performed on the second audio segment adjacent to the first audio segment in the at least two audio segments to generate a corresponding second recognition text.
  • The transcoding module then uses the first recognition text and the second recognition text as subtitle information, and adds the subtitle information, together with the time interval information between the generation time of the second recognition text and the reception time of the audio stream, to the first video stream as the video frame information of the second target video frame (usually the last video frame of the video segment corresponding to the second audio segment), and so on.
  • After the speech recognition service module obtains the first recognition text through speech recognition, it can temporarily store it. After obtaining the second recognition text, since the first audio segment is adjacent to the second audio segment, the first recognition text and the second recognition text can be returned together as subtitle information of the video stream, allowing the speech recognition service module to reuse its cache and improve the accuracy of the recognized subtitle information.
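  • The cache-reuse behavior can be sketched as follows (class and method names are illustrative; recognize stands in for the actual speech recognition call):

```python
class IncrementalRecognizer:
    """Cache-reuse sketch: text recognized from earlier segments is kept
    and returned together with the text of each newly recognized adjacent
    segment, as the subtitle for that segment's target frame."""

    def __init__(self, recognize):
        self.recognize = recognize   # stand-in for the actual ASR call
        self.cached = []

    def feed(self, segment) -> str:
        self.cached.append(self.recognize(segment))
        return "".join(self.cached)  # combined subtitle so far
```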
  • Step 208 Encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.
  • the live broadcast server can encode the second video stream and the audio stream to generate a live broadcast stream to be pushed, and if the user has a need to watch live broadcast, the live broadcast stream to be pushed can be pushed to the user's client.
  • the client decodes the live stream to be pushed, generates a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;
  • the video stream and the audio stream are played synchronously, and based on the display time, the subtitle information is displayed in the at least two video frames and the target video frame.
  • When playing a live stream, the client may pull a certain length of the live stream to be pushed after the current playback time from the live server in advance and cache it, and then decode the cached live stream to be pushed ahead of time to obtain the subtitle information corresponding to the target video frame in the live stream to be pushed, the text type of the subtitle information, and the time interval information between the generation time of the subtitle information and the time when the live stream server received the audio stream.
  • the display time of the subtitle information may be determined according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the reception time of the audio stream by the live stream server, combined with the playback time of the target video frame, and then determine other video frames in the live stream to be pushed that are located before the target video frame and are used to display the subtitle information according to the display time, and when it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed in the determined video frame and the target video frame based on the display time.
  • the client pre-caches the live stream to be pushed from t to t+5s during the process of playing the live stream, and then decodes to obtain the subtitle information carried in this live stream to be pushed.
  • Assume the decoding result contains recognition text corresponding to the video frames at the five time points t+1s, t+2s, t+3s, t+4s, and t+5s, and the recognition text corresponding to the time point t+5s is of the sentence type; that recognition text can then be displayed with priority. In this case, the time interval information corresponding to the recognition text can be examined: the interval between the generation time of the recognition text and the reception time of the audio stream is 4s, while the interval between the generation time of the recognition text and the video frame at the time point t+5s is 1s, which means the subtitle information (recognition text) needs to be displayed 3s in advance. It also means that the host expressed a complete sentence from t+3s to t+5s. Therefore, when playback of the live stream to be pushed reaches t+3s, the subtitle information can be displayed, and the display can end at t+5s, realizing early display of the complete subtitles and avoiding delay between the subtitles and the video picture or sound. The subtitle information remains displayed until it is detected that other subtitle information needs to be shown.
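  • One consistent reading of the timing arithmetic in this example, expressed as a short Python sketch (the function and variable names are illustrative, not from this publication):

```python
def subtitle_window(frame_pts_s: float,
                    interval_total_s: float,
                    interval_to_frame_s: float) -> tuple[float, float]:
    """Derive the subtitle display window from the example's numbers.

    interval_total_s: generation time minus audio reception time (4s).
    interval_to_frame_s: generation time minus target-frame time (1s).
    The subtitle must therefore appear 4s - 1s = 3s early, i.e. from
    t+3s until the target frame at t+5s.
    """
    advance = interval_total_s - interval_to_frame_s       # 3s early
    generation_time = frame_pts_s + interval_to_frame_s    # t+6s
    start = generation_time - advance                      # t+3s
    return start, frame_pts_s                              # (t+3s, t+5s)

print(subtitle_window(5.0, 4.0, 1.0))  # -> (3.0, 5.0)
```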
  • An embodiment of the present application implements a live broadcast data processing method and system, wherein the live broadcast data processing method includes decoding a received initial live broadcast stream, generating an audio stream and a first video stream, performing speech recognition on the audio stream, generating corresponding recognition text, and determining the time interval information between the generation time of the recognition text and the reception time of the audio stream, using the recognition text as subtitle information, and adding the subtitle information and the time interval information to the first video stream, generating a second video stream, encoding the second video stream and the audio stream, generating a live broadcast stream to be pushed, and returning the live broadcast stream to be pushed to the client.
  • In the embodiment of the present application, the live broadcast server performs speech recognition on the audio stream, generates the corresponding recognition text, and records the time interval between the generation time of the recognition text and the reception time of the audio stream. Since this time interval characterizes the time the live broadcast server spends performing speech recognition on the audio stream of the initial live stream after receiving it, once the recognition text and the time interval information are added to the video stream and returned to the client, the client can parse in advance the subtitle information carried in the live stream to be pushed and determine its display time from the time interval information between the generation time of the subtitle information and the time when the live broadcast server received the audio stream; that is, it can determine the display time of the complete subtitles corresponding to the live stream to be pushed and display the complete subtitles ahead of time based on that display time. This helps reduce the cost of generating subtitles, improves the efficiency of subtitle generation, and helps avoid asynchrony between subtitles and video images or audio.
  • FIG. 3 shows a flow chart of another live broadcast data processing method provided according to an embodiment of the present application, comprising the following steps:
  • Step 302 Receive and cache the live stream to be pushed returned by the live broadcast server.
  • Step 304 Decode the live stream to be pushed to generate the corresponding audio stream, video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live server according to the generation time of the subtitle information and the reception time of the audio stream.
  • Step 306 Determine the display time of the subtitle information according to the time interval information.
  • Step 308 When it is determined that the playback condition of the live stream to be pushed is met, the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • the anchor broadcasts live through the intelligent terminal and pushes the generated initial live stream to the live server, which decodes the received initial live stream to generate an audio stream and a first video stream; then performs speech recognition on the audio stream to generate corresponding recognition text, and determines the time interval information between the generation time of the recognition text and the reception time of the audio stream; then uses the recognition text as subtitle information, and adds the subtitle information and the time interval information to the first video stream to generate a second video stream; then encodes the second video stream and the audio stream to generate a live stream to be pushed.
  • the live server can push the live stream to be pushed to the user's client.
  • When the client is playing the live stream for the user, it can pull a certain length of the live stream to be pushed from the live server in advance and cache it, so that it can decode the cached live stream to be pushed ahead of time and obtain the subtitle information it contains. Then, according to the time interval information between the generation time of the subtitle information carried in the live stream to be pushed and the time when the live stream server received the audio stream, the display time of the subtitle information can be determined. When it is determined that the playback conditions of the live stream to be pushed are met, the decoded video stream and audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • The above processing method enables the client to parse in advance the subtitle information carried in the live stream to be pushed and to determine its display time from the time interval information between the generation time of the subtitle information and the time when the live server received the audio stream; that is, it can determine the display time of the complete subtitles corresponding to the live stream to be pushed and display them ahead of time based on that display time. This helps reduce the cost of generating subtitles, improves the efficiency of subtitle generation, and helps avoid asynchrony between subtitles and the video picture or audio, thereby meeting the user's need to see live subtitles while watching and improving the user's live viewing experience.
  • the above is a schematic scheme of another live data processing method of this embodiment. It should be noted that the technical scheme of the live data processing method and the technical scheme of the above-mentioned live data processing method belong to the same concept, and the details of the technical scheme of the live data processing method that are not described in detail can all be referred to the description of the technical scheme of the above-mentioned live data processing method.
  • The following takes the application of the live broadcast data processing method provided by an embodiment of the present application in the live broadcast field as an example to further illustrate the method.
  • Figure 4 shows an interactive schematic diagram of a live broadcast data processing method provided by an embodiment of the present application applied in the live broadcast field, specifically comprising the following steps:
  • Step 402 The transcoding module receives the anchor's initial live stream.
  • Step 404 The transcoding module decodes the initial live stream to generate an audio stream and a first video stream.
  • Step 406 The transcoding module transmits the audio stream to the speech recognition service module via GRPC.
  • Step 408 The speech recognition service module performs speech recognition on the audio stream and generates corresponding recognition text.
  • Step 410 the speech recognition service module determines the generation time of the recognized text, and determines the time interval information between the generation time and the reception time of the audio stream, and determines the text type of the recognized text according to the text length and/or text semantics of the recognized text.
  • Step 412 The speech recognition service module transmits the recognized text, text type, and time interval information to the transcoding module via GRPC.
  • Step 414 The transcoding module uses the recognized text as subtitle information, and adds the subtitle information, time interval information, and text type to the first video stream to generate a second video stream.
  • Step 416 The transcoding module encodes the second video stream and the audio stream to generate a live stream to be pushed.
  • Step 418 The client pulls the live stream to be pushed from the live server.
  • the live broadcast server includes a transcoding module and a speech recognition service module.
  • the client decodes the live stream to be pushed, generates a corresponding audio stream, a second video stream, subtitle information and time interval information, determines the display time of the subtitle information according to the time interval information, and when it is determined that the playback conditions of the live stream to be pushed are met, synchronously plays the second video stream and the audio stream, and displays the subtitle information based on the display time.
  • The above processing method enables the client to parse in advance the subtitle information carried in the live stream to be pushed and to determine its display time from the time interval information between the generation time of the subtitle information and the time when the live server received the audio stream; that is, it can determine the display time of the complete subtitles corresponding to the live stream to be pushed and display them ahead of time based on that display time. This helps reduce the cost of generating subtitles, improves the efficiency of subtitle generation, and helps avoid asynchrony between subtitles and the video picture or audio, thereby meeting the user's need to see live subtitles while watching and improving the user's live viewing experience.
  • FIG5 shows a schematic diagram of the structure of a live data processing device provided by an embodiment of the present application.
  • the device includes:
  • the decoding module 502 is configured to decode the received initial live stream to generate an audio stream and a first video stream;
  • the recognition module 504 is configured to perform speech recognition on the audio stream, generate corresponding recognition text, and determine the time interval information between the generation time of the recognition text and the reception time of the audio stream;
  • An adding module 506 is configured to use the recognized text as subtitle information, and add the subtitle information and the time interval information to the first video stream to generate a second video stream;
  • the encoding module 508 is configured to encode the second video stream and the audio stream to generate a live stream to be pushed, and return the live stream to be pushed to the client.
  • the decoding module 502 is further configured to:
  • an initial live stream corresponding to the live stream identifier within a preset time interval is obtained, and the initial live stream is decoded, wherein the preset time interval is later than the generation time.
  • the client decodes the live stream to be played, generates a corresponding audio stream to be played, a video stream to be played, subtitles to be displayed, and a display time corresponding to the subtitles to be displayed;
  • the video stream to be played and the audio stream to be played are played synchronously, and the subtitles to be displayed are displayed based on the display time.
  • the live broadcast data processing device further includes a determination module configured to:
  • the adding module 506 is further configured to:
  • the recognized text is used as subtitle information
  • the subtitle information, the time interval information and the text type are used as video frame information of the target video frame, and added to the first video stream.
  • the client decodes the live stream to be pushed, generates a corresponding audio stream, a video stream, and video frame information of a target video frame in the video stream, wherein the video frame information includes the subtitle information, the time interval information, and the text type;
  • the video stream and the audio stream are played synchronously, and based on the display time, the subtitle information is displayed in the at least two video frames and the target video frame.
  • the live broadcast data processing device further includes a division module configured to:
  • the identification module 504 is further configured to:
  • the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the target audio segment is determined.
  • the identification module 504 is further configured to:
  • the generation time of the recognition text is determined, and the time interval information between the generation time and the reception time of the audio stream is determined.
  • the decoding module 502 is further configured to:
  • the received initial live stream is decoded by a transcoding module to generate an audio stream and a first video stream;
  • the identification module 504 is further configured to:
  • the audio stream is subjected to speech recognition by a speech recognition service module to generate corresponding recognition text.
  • the identification module 504 is further configured to:
  • the adding module 506 is further configured to:
  • the transcoding module determines a first target video frame in the first video stream according to a generation time of the first recognition text
  • the first recognition text is used as first subtitle information, and the first subtitle information and the time interval information between the generation time of the first recognition text and the reception time of the audio stream are added to the first video stream as the video frame information of the first target video frame.
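  • A sketch of one way the first target video frame might be selected from the generation time is given below; the nearest-timestamp rule is an assumption, since the disclosure only states that the frame is determined according to that time:

```python
def find_target_frame(frames, generation_pts_ms: int):
    """Return the frame closest in time to the recognition text's generation time.

    frames: iterable of objects exposing a pts_ms attribute (e.g. the Frame
    sketch shown earlier), all on the same stream timeline as generation_pts_ms.
    """
    return min(frames, key=lambda frame: abs(frame.pts_ms - generation_pts_ms))
```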
  • the identification module 504 is further configured to:
  • the adding module 506 is further configured to:
  • the transcoding module determines a second target video frame in the first video stream according to a generation time of the second recognition text
  • the first recognition text and the second recognition text are together used as second subtitle information, and the second subtitle information and the time interval information between the generation time of the second recognition text and the reception time of the audio stream are added to the first video stream as the video frame information of the second target video frame.
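  • The cumulative-subtitle behaviour described above can be illustrated with the following sketch; the function name and sample strings are hypothetical:

```python
def accumulate_subtitles(first_text: str, second_text: str) -> str:
    """Second subtitle information = first + second recognition text."""
    return first_text + second_text

first_text = "Hello everyone, "            # first recognition text
second_text = "welcome to the live room."  # second recognition text
second_subtitle = accumulate_subtitles(first_text, second_text)
# second_subtitle -> "Hello everyone, welcome to the live room."
```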
  • the live broadcast data processing device further includes a transmission module configured to:
  • the transcoding module transmits the audio stream to the speech recognition service module through a data transmission channel.
  • the above is a schematic description of the live data processing device of this embodiment. It should be noted that the technical solution of this live data processing device and the technical solution of the live data processing method described above belong to the same concept; for details not described in the device solution, refer to the description of the corresponding method solution.
  • FIG. 6 shows a schematic structural diagram of another live data processing device provided by an embodiment of the present application.
  • the device includes:
  • the receiving module 602 is configured to receive and cache the live stream to be pushed returned by the live server;
  • the decoding module 604 is configured to decode the live broadcast stream to be pushed, generate a corresponding audio stream, a video stream, subtitle information, and time interval information corresponding to the subtitle information, wherein the time interval information is determined by the live broadcast server according to the generation time of the subtitle information and the reception time of the audio stream;
  • the determination module 606 is configured to determine a display time of the subtitle information according to the time interval information;
  • the display module 608 is configured to synchronously play the video stream and the audio stream when it is determined that the playback condition of the live stream to be pushed is met, and to display the subtitle information based on the display time.
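  • A structural sketch of these four client-side modules, with hypothetical signatures and a plain deque standing in for the stream cache, might look as follows:

```python
from collections import deque

class LiveClient:
    def __init__(self):
        self.buffer: deque[bytes] = deque()   # receiving module 602

    def receive(self, chunk: bytes) -> None:
        self.buffer.append(chunk)             # cache the pushed live stream

    def decode(self, chunk: bytes):           # decoding module 604
        raise NotImplementedError  # -> (audio, video, subtitle, interval_ms)

    def display_time_ms(self, receive_pts_ms: int, interval_ms: int) -> int:
        return receive_pts_ms + interval_ms   # determination module 606

    def play(self) -> None:                   # display module 608
        raise NotImplementedError  # synchronize audio/video, show subtitles
```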
  • the above is a schematic description of another live data processing device of this embodiment. It should be noted that the technical solution of this live data processing device and the technical solution of the other live data processing method described above belong to the same concept; for details not described in the device solution, refer to the description of that method solution.
  • FIG. 7 shows a structural block diagram of a computing device 700 according to an embodiment of the present application.
  • the components of the computing device 700 include but are not limited to a memory 710 and a processor 720.
  • the processor 720 is connected to the memory 710 via a bus 730, and the database 750 is used to store data.
  • the computing device 700 also includes an access device 740 that enables the computing device 700 to communicate via one or more networks 760.
  • examples of the networks 760 include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • the access device 740 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and the like.
  • the above components of the computing device 700 and other components not shown in FIG. 7 may also be connected to each other, for example, through a bus. It should be understood that the computing device structure block diagram shown in FIG. 7 is only for illustrative purposes and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as needed.
  • the computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or other types of mobile devices, or a stationary computing device such as a desktop computer or PC.
  • the computing device 700 may also be a mobile or stationary server.
  • the processor 720 is configured to execute computer-executable instructions, wherein the processor implements the steps of the live data processing method when executing the computer-executable instructions.
  • the above is a schematic scheme of a computing device of this embodiment. It should be noted that the technical scheme of the computing device and the technical scheme of the live data processing method described above are of the same concept, and the details of the technical scheme of the computing device that are not described in detail can be found in the description of the technical scheme of the live data processing method described above.
  • An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the following steps:
  • the video stream and the audio stream are played synchronously, and the subtitle information is displayed based on the display time.
  • the computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, USB flash drive, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electric carrier signal, telecommunication signal and software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

According to embodiments, the present application relates to a live streaming data processing method and system. The live streaming data processing method comprises: decoding a received initial live stream to generate an audio stream and a first video stream; performing speech recognition on the audio stream to generate corresponding recognized text, and determining time interval information between the generation time of the recognized text and the reception time of the audio stream; using the recognized text as subtitle information, and adding the subtitle information and the time interval information to the first video stream to generate a second video stream; and encoding the second video stream and the audio stream to generate a live stream to be pushed, and returning said live stream to a client.
PCT/CN2023/106150 2022-10-25 2023-07-06 Procédé et système de traitement de données de diffusion en direct WO2024087732A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211311544.0A CN115643424A (zh) 2022-10-25 2022-10-25 直播数据处理方法及系统
CN202211311544.0 2022-10-25

Publications (1)

Publication Number Publication Date
WO2024087732A1 (fr)

Family

ID=84946678

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106150 WO2024087732A1 (fr) 2022-10-25 2023-07-06 Procédé et système de traitement de données de diffusion en direct

Country Status (2)

Country Link
CN (1) CN115643424A (fr)
WO (1) WO2024087732A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115643424A (zh) * 2022-10-25 2023-01-24 上海哔哩哔哩科技有限公司 直播数据处理方法及系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112272323A (zh) * 2014-06-30 2021-01-26 苹果公司 实时数字助理知识更新
CN108063970A (zh) * 2017-11-22 2018-05-22 北京奇艺世纪科技有限公司 一种处理直播流的方法和装置
CN108401192A (zh) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 视频流处理方法、装置、计算机设备及存储介质
CN111010614A (zh) * 2019-12-26 2020-04-14 北京奇艺世纪科技有限公司 一种显示直播字幕的方法、装置、服务器及介质
CN115086753A (zh) * 2021-03-16 2022-09-20 北京有竹居网络技术有限公司 直播视频流的处理方法、装置、电子设备和存储介质
CN115643424A (zh) * 2022-10-25 2023-01-24 上海哔哩哔哩科技有限公司 直播数据处理方法及系统

Also Published As

Publication number Publication date
CN115643424A (zh) 2023-01-24

Similar Documents

Publication Publication Date Title
US11463779B2 (en) Video stream processing method and apparatus, computer device, and storage medium
US11252444B2 (en) Video stream processing method, computer device, and storage medium
CN109168078B (zh) 一种视频清晰度切换方法及装置
US10244291B2 (en) Authoring system for IPTV network
US9478256B1 (en) Video editing processor for video cloud server
US10679675B2 (en) Multimedia file joining method and apparatus
KR102469142B1 (ko) 미디어 스트림 재생들 사이를 트랜지션하는 동안 트랜지션 프레임들의 동적 재생
CN112616062B (zh) 一种字幕显示方法、装置、电子设备及存储介质
WO2024087732A1 (fr) Procédé et système de traitement de données de diffusion en direct
US20220188357A1 (en) Video generating method and device
US11758245B2 (en) Interactive media events
CN114040255A (zh) 直播字幕生成方法、系统、设备及存储介质
CN103635938B (zh) 在流传输客户端中用于处理视频流传输数据的方法、流传输客户端
CN113301359A (zh) 音视频处理方法、装置及电子设备
WO2018142946A1 (fr) Dispositif et procédé de traitement d'informations
CN113923502B (zh) 直播视频播放方法及装置
KR100651566B1 (ko) 이동통신 단말기에서 출력 버퍼링을 이용한 멀티미디어재생 장치 및 그 제어 방법
KR102248097B1 (ko) 컨텐츠전송방법 및 이를 이용한 컨텐츠 생성 단말
CN113873296A (zh) 视频流处理方法及装置
WO2013166785A1 (fr) Procédé et dispositif de fourniture de service multimédia, et procédé et dispositif d'affichage de service multimédia
CN113766342B (zh) 字幕合成方法及相关装置、电子设备、存储介质
TWI819580B (zh) 基於預解析操作以改善播放響應的媒體播放方法與相關媒體播放裝置
CN117376593A (zh) 直播流的字幕处理方法、装置、存储介质及计算机设备
KR101384740B1 (ko) 인식 기술을 이용한 영상물의 자막 처리 시스템 및 그 방법
Zhang et al. Research on Audio and Video Synchronization Algorithm Based on AVI Format

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23881320

Country of ref document: EP

Kind code of ref document: A1