WO2019205872A1 - Video stream processing method, apparatus, computer device, and storage medium - Google Patents

Video stream processing method, apparatus, computer device, and storage medium

Info

Publication number
WO2019205872A1
WO2019205872A1 PCT/CN2019/079830 CN2019079830W WO2019205872A1 WO 2019205872 A1 WO2019205872 A1 WO 2019205872A1 CN 2019079830 W CN2019079830 W CN 2019079830W WO 2019205872 A1 WO2019205872 A1 WO 2019205872A1
Authority
WO
WIPO (PCT)
Prior art keywords
stream data
picture frame
text
subtitle
video stream
Prior art date
Application number
PCT/CN2019/079830
Other languages
English (en)
French (fr)
Inventor
胡小华
罗梓恒
朱秀明
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to EP19792095.2A (published as EP3787300A4)
Publication of WO2019205872A1
Priority to US16/922,904 (published as US11463779B2)

Classifications

    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
            • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
              • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
                • H04N21/233 Processing of audio elementary streams
                • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
            • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
              • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                • H04N21/4302 Content synchronisation processes, e.g. decoder synchronisation
                  • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
                • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
                • H04N21/439 Processing of audio elementary streams
                  • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
                • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                  • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
                    • H04N21/440236 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
              • H04N21/47 End-user applications
                • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
                  • H04N21/47202 End-user interface for requesting content, additional data or services for requesting content on demand, e.g. video on demand
                • H04N21/488 Data services, e.g. news ticker
                  • H04N21/4884 Data services, e.g. news ticker, for displaying subtitles
            • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
              • H04N21/85 Assembly of content; Generation of multimedia applications
                • H04N21/854 Content authoring
                  • H04N21/8547 Content authoring involving timestamps for synchronizing content
    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F40/00 Handling natural language data
            • G06F40/20 Natural language analysis
              • G06F40/279 Recognition of textual entities
            • G06F40/40 Processing or translation of natural language
              • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V20/00 Scenes; Scene-specific elements
            • G06V20/40 Scenes; Scene-specific elements in video content
              • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/26 Speech to text systems

Definitions

  • the present application relates to the field of Internet application technologies, and in particular, to a video stream processing method, apparatus, computer device, and storage medium.
  • In a live broadcast scenario, the information that needs to be conveyed to the end user is not only images and sounds, but also subtitles, so as to enhance the user's viewing experience.
  • the application example provides a video stream processing method, where the method includes: acquiring first audio stream data in live video stream data; performing speech recognition on the first audio stream data to obtain a speech recognition text; generating subtitle data according to the speech recognition text, where the subtitle data includes subtitle text and time information corresponding to the subtitle text; and adding the subtitle text to a corresponding picture frame in the live video stream data according to the time information corresponding to the subtitle text, to obtain processed live video stream data.
  • the application example further provides a video stream processing apparatus, where the apparatus includes:
  • a first acquiring module configured to acquire first audio stream data in the live video stream data
  • a voice recognition module configured to perform voice recognition on the first audio stream data to obtain a voice recognition text
  • a subtitle generating module configured to generate subtitle data according to the speech recognition text, where the subtitle data includes subtitle text and time information corresponding to the subtitle text;
  • a subtitle adding module configured to add the subtitle text to a corresponding picture frame in the live video stream data according to the time information corresponding to the subtitle text, and obtain the processed live video stream data.
  • the application example further provides a computer device, the computer device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the video stream processing method described above.
  • the application example further provides a computer readable storage medium, where the storage medium stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the video stream processing method described above.
  • FIG. 1 is a schematic diagram of a live broadcast process provided by an example of the present application.
  • FIG. 2 is a schematic structural diagram of a live broadcast system according to an illustrative example
  • FIG. 3 is a flowchart of a video stream processing method according to an illustrative example
  • FIG. 4 is a flow chart showing a video stream processing method according to an illustrative example
  • FIG. 5 is a data structure diagram of a live video stream data related to the example shown in FIG. 4;
  • FIG. 6 is a flow chart of speech recognition involved in the example shown in FIG. 4;
  • FIG. 7 is a schematic structural diagram of a caption data related to the example shown in FIG. 4;
  • FIG. 8 is a schematic diagram of superimposition of subtitles involved in the example shown in FIG. 4;
  • FIG. 9 is a schematic diagram of a subtitle superimposition process involved in the example shown in FIG. 4;
  • FIG. 10 is a schematic diagram of a live stream selection according to the example shown in FIG. 4;
  • FIG. 11 is a schematic diagram of another live stream selection involved in the example shown in FIG. 4;
  • FIG. 12 is a schematic flowchart of processing of a live video stream according to an illustrative example
  • FIG. 13 is a flowchart of a video stream processing method according to an illustrative example
  • FIG. 14 is a schematic flowchart of processing of a live video stream according to an illustrative example
  • FIG. 15 is a block diagram showing the structure of a video stream processing apparatus in a live broadcast scenario, according to an illustrative example;
  • Figure 16 is a block diagram showing the structure of a computer device, according to an illustrative example.
  • Subtitles refer to non-image content such as dialogues or narrations displayed in online video, television, movies, and stage works in text form, and also refer to texts processed in the post-production of film and television works.
  • Live broadcast is a set of technologies that uses streaming media technology to present vivid and intuitive real scenes to users through images, sounds, texts, and the like; it involves a series of service modules such as encoding tools, streaming media data, servers, networks, and players.
  • Real-time translation refers to the simultaneous translation of speech or text in one language into speech or text in another language, either manually or by computer.
  • the real-time translation may be artificial intelligence based speech recognition and instant translation.
  • In related technologies, subtitles in live video are typically implemented by manual insertion at a live recording end (such as a recording scene/studio).
  • FIG. 1 shows a schematic diagram of a live broadcast process provided by some examples of the present application.
  • As shown in FIG. 1, the live recording terminal uploads the live video stream to the server through the live broadcast access service; the server transcodes the live video stream through the transcoding service, and sends the transcoded live video stream through the content distribution network to the player on the user terminal side for playing.
  • In the above scheme for inserting subtitles into live video, subtitle data must be inserted manually at the live recording end; the accuracy of synchronizing the subtitle data with the live video picture is low, and the approach usually results in a higher live broadcast delay, affecting the live broadcast effect.
  • FIG. 2 is a schematic structural diagram of a live broadcast system according to an illustrative example.
  • the system includes a live recording terminal 220, a server 240, and a plurality of user terminals 260.
  • the live recording terminal 220 can be a mobile phone, a tablet computer, an e-book reader, smart glasses, a smart watch, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop portable computer, a desktop computer, and the like.
  • the live recording terminal 220 corresponds to an image acquisition component and an audio collection component.
  • the image capturing component and the audio collecting component may be part of the live recording terminal 220.
  • the image capturing component and the audio collecting component may be a camera and a microphone built in the live recording terminal 220; or the image capturing component and the audio collection component may be connected to the live recording terminal 220 as peripheral devices of the live recording terminal 220.
  • for example, the image acquisition component and the audio collection component may respectively be a camera and a microphone connected to the live recording terminal 220; or the image acquisition component and the audio collection component may be partially built into the live recording terminal 220 and partially serve as peripheral devices of the live recording terminal 220.
  • for example, the image capturing component may be a camera built in the live recording terminal 220, and the audio collection component may be a microphone in an earphone connected to the live recording terminal 220.
  • the implementation examples of the present application do not limit the implementation form of the image acquisition component and the audio collection component.
  • the user terminal 260 may be a terminal device having a video playing function; for example, the user terminal may be a mobile phone, a tablet computer, an e-book reader, smart glasses, a smart watch, an MP3/MP4 player, a laptop portable computer, a desktop computer, and the like.
  • the live recording terminal 220 and the user terminal 260 are respectively connected to the server 240 via a communication network.
  • the communication network is a wired network or a wireless network.
  • the live recording terminal 220 can upload the live video stream recorded locally to the server 240, and the server 240 performs related processing on the live video stream and then pushes it to the user terminal 260.
  • Server 240 is a server, or a number of servers, or a virtualization platform, or a cloud computing service center.
  • the live broadcast recording terminal 220 may have a live application (APP) client installed, such as a Tencent Video client or a trick-cast client.
  • the server 240 may be a live broadcast server corresponding to the live application.
  • the live recording terminal runs the client of the live application.
  • when a broadcast starts, the client of the live application invokes the image capture component and the audio collection component in the live recording terminal to record the live video stream, and uploads the recorded live video stream to the live broadcast server; the live broadcast server receives the live video stream and establishes a live channel for it; a user corresponding to a user terminal can access the live channel through the user terminal; after the live broadcast server pushes the live video stream to the user terminal, the user terminal plays the live video stream in the live application interface or the browser interface.
  • the system can also include a management device (not shown in FIG. 2) that is coupled to the server 240 via a communication network.
  • the communication network is a wired network or a wireless network.
  • the wireless or wired network described above uses standard communication techniques and/or protocols.
  • the network is usually the Internet, but can also be any other network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, a virtual private network, or any combination thereof.
  • data exchanged over a network is represented using techniques and/or formats including Hyper Text Markup Language (HTML), Extensible Markup Language (XML), and the like.
  • In some examples, all or some of the links are encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec).
  • In other examples, the above data communication techniques may also be replaced or supplemented with custom and/or dedicated data communication techniques.
  • FIG. 3 is a flowchart of a video stream processing method in a live broadcast scenario according to an exemplary example.
  • the video stream processing method in the live broadcast scenario may be used in a live broadcast system as shown in FIG. 2 .
  • the video stream processing method in the live broadcast scenario may include the following steps:
  • Step 31 Acquire first audio stream data in the live video stream data.
  • the audio stream data may be streaming data including each audio frame in the live video stream.
  • Step 32 Perform speech recognition on the first audio stream data to obtain speech recognition text.
  • speech recognition refers to recognizing speech in the first audio stream data as text of a corresponding language type.
  • Step 33 Generate subtitle data according to the speech recognition text, where the subtitle data includes subtitle text and time information corresponding to the subtitle text.
  • the time information may be information indicating a play time of the caption data, the audio stream data, or the live video stream data.
  • Step 34 Add the subtitle text to the corresponding picture frame in the live video stream data according to the time information corresponding to the subtitle text, and obtain the processed live video stream data.
  • the steps of acquiring the audio stream data, performing speech recognition, and generating caption data according to the speech recognition result inevitably require a certain processing time; therefore, in the example of the present application, after a preset duration has elapsed from a first moment, the step of adding the subtitle text to the corresponding picture frame in the live video stream data according to the time information corresponding to the subtitle text is performed to obtain the processed live video stream data (ie, the above step 34), where the first moment is the time at which the live video stream data is acquired.
  • for example, a fixed delay duration (that is, the preset duration, for example, 5 minutes) may be preset; on one hand, timing is started when the live video stream data is acquired, and on the other hand, the live video stream data is cached.
  • the preset duration may be preset in the code by the developer, or the preset duration may be set or changed by the system administrator or the user. It should be noted that the preset duration may be greater than the duration required to perform step 31 to step 33 above.
  • In other examples, step 34 may also be performed directly when the caption data is obtained.
  • that is, on one hand the live video stream data is cached, and on the other hand the above steps 31 to 33 are performed; when the caption data is successfully generated and stored, the live video stream data corresponding to the caption data may be extracted from the cache, and step 34 is performed according to the generated caption data and the live video stream data extracted from the cache.
  • In some examples, the server may provide a caption generation service, a caption storage service, and a caption mixing service, where the caption generation service is configured to generate caption data according to the speech recognition text, the caption storage service is configured to receive and store the caption data generated by the caption generation service, and the caption mixing service is configured to add the caption text in the caption data stored by the caption storage service into the picture frames in the live video stream data.
  • when the caption mixing service receives a notification from the caption storage service that the caption data has been successfully stored, or when the caption mixing service queries, in the database, the caption data stored by the caption storage service, the caption mixing service can determine that the caption storage service has successfully stored the caption data, and at this time the caption mixing service can start performing the above step 34.
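  • To make this coordination concrete, the following is a minimal Python sketch (not part of the original example) of how a caption storage service and the caption mixing step could interact around the preset delay; all names, the polling approach, and the shortened delay value are illustrative assumptions.

```python
import time
from dataclasses import dataclass, field

PRESET_DELAY_SECONDS = 5  # the example mentions e.g. 5 minutes; shortened here for illustration

@dataclass
class SubtitleStore:
    """Stands in for the caption storage service (e.g. a database table)."""
    entries: dict = field(default_factory=dict)  # timestamp -> subtitle text

    def put(self, timestamp: int, text: str) -> None:
        self.entries[timestamp] = text

    def query(self, timestamp: int):
        # returns None if no subtitle data has been stored for this timestamp yet
        return self.entries.get(timestamp)

def mix_when_ready(store: SubtitleStore, cached_blocks: dict, acquired_at: float) -> dict:
    """Caption mixing step: wait out the preset delay measured from the moment the live
    video stream data was acquired, then attach any stored subtitle text to the cached
    picture frame data blocks that carry the same timestamp."""
    while time.time() - acquired_at < PRESET_DELAY_SECONDS:
        time.sleep(0.1)  # simple polling; a real service would use notifications or queries
    return {ts: (frames, store.query(ts)) for ts, frames in cached_blocks.items()}
```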
  • In summary, with the solution shown in this example, the audio stream data in the live video stream data can be acquired, speech recognition can be performed on the audio stream data, and caption data can be generated according to the recognition result; then, according to the time information, the subtitle text in the caption data is added into the picture frames corresponding to the subtitle text in the video stream, thereby obtaining a live video stream containing subtitles and realizing accurate synchronization between the subtitles and the video pictures; at the same time, since manual insertion of subtitle data is not required, the live broadcast delay can be effectively reduced.
  • In some examples, the foregoing method for processing a video stream may be performed by a server in the live broadcast system, that is, after receiving the live video stream uploaded by the live recording terminal, the server obtains the live video stream data and performs the processing shown in FIG. 3 above on the obtained live video stream data.
  • In other examples, the foregoing method for processing a video stream may also be performed by a live recording terminal in the live broadcast system, that is, before uploading the live video stream data to the server, the live recording terminal obtains the live video stream data and performs the processing shown in FIG. 3 above on it.
  • In still other examples, the foregoing method for processing a video stream may also be performed by a user terminal in the live broadcast system, that is, after the user terminal receives the live video stream data pushed by the server and before the user terminal plays the live video stream data, the live video stream data is subjected to the processing shown in FIG. 3 above.
  • In subsequent examples of the present application, the method is described by taking the case where the video stream processing method is performed by a server in the live broadcast system as an example.
  • FIG. 4 is a flowchart of a video stream processing method in a live broadcast scenario according to an exemplary example; the video stream processing method in the live broadcast scenario may be used in a server, for example, the server 240 shown in FIG. 2 above.
  • the video stream processing method in the live broadcast scenario may include the following steps:
  • Step 401 Acquire first audio stream data in the live video stream data.
  • the live recording terminal records the live video at the live broadcast site, encodes the recorded video into a live video stream (also referred to as the original video stream), and pushes it to the server; after receiving the live video stream pushed by the live recording terminal, the server first transcodes the received live video stream to obtain the live video stream data.
  • the live video stream data is composed of picture frame stream data and audio stream data, where the picture frame stream data is composed of a series of picture frame data blocks, each picture frame data block including several picture frames; correspondingly, the audio stream data is composed of a series of audio frame data blocks, each audio frame data block containing several audio frames.
  • the picture frame data block in the live video stream data corresponds to the audio frame data block in time, that is, the play time of one picture frame data block is exactly the same as the play time of one audio frame data block.
  • the picture frame data block and the audio frame data block each include their own time information, and the correspondence between a picture frame data block and an audio frame data block is indicated by this time information; that is, a picture frame data block and its corresponding audio frame data block contain the same time information.
  • FIG. 5 shows a data structure diagram of a live video stream data related to an example of the present application.
  • As shown in FIG. 5, a picture frame data block in the live video stream data includes a data block header and a payload, where the payload includes each picture frame in the picture frame data block, and the data block header includes information such as the header size (header_size), payload size (payload_size), duration, index, Coordinated Universal Time (UTC), and timestamp.
  • the header size is used to indicate the amount of data occupied by the data block header in the current picture frame data block; the payload size is used to indicate the amount of data occupied by the payload in the current picture frame data block; the duration is used to indicate the total playing time of the picture frames in the current picture frame data block; the index is used to index each picture frame in the current picture frame data block; the Coordinated Universal Time is used to indicate the system time at which the current picture frame data block is transcoded (for example, it may be the system time at which the first picture frame in the data block is transcoded); and the timestamp is used to indicate the temporal position of the current picture frame data block in the live video stream.
  • Similarly, an audio frame data block in the live video stream data also includes a data block header and a payload, where the payload includes each audio frame in the audio frame data block, and the data block header includes information such as the header size, payload size, duration, index, Coordinated Universal Time, and timestamp.
  • the time information of a picture frame data block or an audio frame data block can be represented by the Coordinated Universal Time and/or the timestamp in its data block header; that is, for a group of picture frame data blocks and audio frame data blocks that are synchronized in time, the Coordinated Universal Time and timestamp in their data block headers are also the same.
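  • For illustration only, the data block layout described above can be sketched as follows in Python; the field types and millisecond units are assumptions, since the example does not fix a concrete encoding.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataBlockHeader:
    header_size: int   # amount of data occupied by the header in this data block
    payload_size: int  # amount of data occupied by the payload in this data block
    duration: int      # total play time of the frames in this block (assumed milliseconds)
    index: int         # index of the frames in the current data block
    utc: int           # system time at which the block was transcoded (Coordinated Universal Time)
    timestamp: int     # temporal position of the block in the live video stream

@dataclass
class DataBlock:
    header: DataBlockHeader
    payload: List[bytes]  # the picture frames or audio frames themselves

def is_synchronized(picture_block: DataBlock, audio_block: DataBlock) -> bool:
    """A picture frame data block and an audio frame data block that are synchronized
    in time carry the same UTC and timestamp in their data block headers."""
    return (picture_block.header.utc == audio_block.header.utc
            and picture_block.header.timestamp == audio_block.header.timestamp)
```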
  • In some examples, while transcoding the live video stream to obtain the live video stream data, the server can obtain the first audio stream data in the live video stream data and cache the live video stream data locally.
  • Step 402 Perform speech recognition on the first audio stream data to obtain speech recognition text.
  • In some examples, the server may extract, from the first audio stream data, the audio frames corresponding to each speech segment, and perform speech recognition separately on the audio frames corresponding to each speech segment.
  • for example, the server may perform voice start and stop detection on the first audio stream data to obtain the voice start frames and voice end frames in the first audio stream data, where a voice start frame is an audio frame at which a speech segment starts and a voice end frame is an audio frame at which a speech segment ends;
  • the server then extracts at least one piece of voice data from the first audio stream data according to the voice start frames and voice end frames, where each piece of voice data includes the audio frames between a corresponding pair of voice start frame and voice end frame; the server performs speech recognition on the at least one piece of voice data to obtain the recognized sub-text corresponding to each piece of voice data; finally, the server determines the recognized sub-texts respectively corresponding to the at least one piece of voice data as the speech recognition text.
  • In some examples, the server can detect the start and end of speech through feature detection; that is, the server determines, according to features of the audio data, whether an audio frame in the audio data corresponds to the audio tail point.
  • FIG. 6 illustrates a speech recognition flowchart related to an example of the present application. As shown in FIG. 6, after identifying a voice start frame in the audio data (ie, the first audio stream data), the server performs feature detection on each audio frame after the voice start frame to determine whether the currently detected audio frame corresponds to the audio tail point (corresponding to the above voice end frame), that is, step 601; each detected audio frame is input into the speech recognition model for speech recognition, that is, step 602; when the audio tail point is detected, the server stops the speech recognition, that is, step 603, outputs the recognized text (step 604), and then proceeds to the subsequent subtitle output flow (step 606).
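  • The segmentation-plus-recognition loop of FIG. 6 can be sketched as follows; the start/end detectors and the recognizer are placeholders standing in for the feature-based detection and the speech recognition model, which the example does not specify in code form.

```python
from typing import Callable, Iterable, List

def recognize_stream(audio_frames: Iterable[bytes],
                     is_voice_start: Callable[[bytes], bool],
                     is_voice_end: Callable[[bytes], bool],
                     recognize: Callable[[List[bytes]], str]) -> List[str]:
    """Collect audio frames between each voice start frame and the audio tail point,
    and run speech recognition on every collected segment."""
    texts: List[str] = []
    segment: List[bytes] = []
    in_speech = False
    for frame in audio_frames:
        if not in_speech:
            if is_voice_start(frame):        # voice start frame found
                in_speech = True
                segment = [frame]
        else:
            segment.append(frame)            # frames are fed to recognition (step 602)
            if is_voice_end(frame):          # audio tail point detected (step 603)
                texts.append(recognize(segment))  # output recognized text (step 604)
                in_speech = False
                segment = []
    return texts  # the recognized sub-texts together form the speech recognition text
```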
  • Step 403 Generate subtitle data according to the speech recognition text, where the subtitle data includes subtitle text and time information corresponding to the subtitle text.
  • In some examples, the server may translate the speech recognition text obtained in the above step into a translated text corresponding to a target language and generate the subtitle text according to the translated text, where the subtitle text includes the translated text, or the subtitle text includes both the speech recognition text and the translated text; the server then generates the subtitle data containing the subtitle text.
  • the server may separately generate corresponding subtitle data for each target language; for example, assuming that the language corresponding to the speech recognition text is Chinese, the target languages include English, Russian, Korean, and Japanese, and the subtitle text includes both the speech recognition text and the translated text, the server can generate four pieces of subtitle data, that is, subtitle data corresponding to "Chinese + English", subtitle data corresponding to "Chinese + Russian", subtitle data corresponding to "Chinese + Korean", and subtitle data corresponding to "Chinese + Japanese".
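  • A minimal sketch of generating one piece of subtitle text per target language is shown below; the translate callable is a placeholder for whatever machine translation service is used, and the language codes are illustrative.

```python
from typing import Callable, Dict, List

def build_subtitle_variants(recognized_text: str,
                            target_languages: List[str],
                            translate: Callable[[str, str], str]) -> Dict[str, str]:
    """Produce one subtitle text per target language, each combining the recognized
    text with its translation (e.g. 'Chinese + English')."""
    variants: Dict[str, str] = {}
    for lang in target_languages:
        translated = translate(recognized_text, lang)
        variants[lang] = f"{recognized_text}\n{translated}"  # source text + translated text
    return variants

# e.g. build_subtitle_variants("大家好", ["en", "ru", "ko", "ja"], my_translate)
# yields four subtitle texts, one per language combination.
```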
  • the subtitle data further includes time information corresponding to the subtitle text.
  • In some examples, the subtitle data may include a plurality of subtitle entries, and each subtitle entry corresponds to one complete speech segment.
  • FIG. 7 is a schematic structural diagram of a caption data according to an example of the present application.
  • As shown in FIG. 7, each subtitle entry includes information such as a sequence number (seq), a Coordinated Universal Time, a duration, a timestamp, and the subtitle text.
  • the duration in the subtitle data may be the duration of the corresponding speech segment; the Coordinated Universal Time in the subtitle data may be the start time of the corresponding complete speech segment (ie, the Coordinated Universal Time of the first audio frame of the complete speech segment), and the timestamp in the subtitle data may be the timestamp of the first audio frame of the corresponding complete speech segment.
  • the Coordinated Universal Time and/or timestamp in the subtitle data is the time information of the subtitle text included in the subtitle data.
  • the above-mentioned piece of speech may be a speech segment containing one or more sentences.
  • the time information corresponding to the subtitle text may be time information of the voice corresponding to the subtitle text.
  • the server records the time information of each piece of voice data when performing voice recognition on the first audio stream data.
  • the time information of the piece of voice data may include a time point corresponding to the voice start frame of the piece of voice data (such as utc/time stamp), and a duration of the piece of voice data.
  • for a piece of voice data and its recognized sub-text, the server uses the translated text of the recognized sub-text as the subtitle text in the corresponding subtitle data, and uses the time information of the piece of voice data as the time information of the subtitle text in the subtitle data; a minimal sketch of such a subtitle entry is shown below.
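  • The following Python sketch (illustrative only; field types are assumptions) shows one subtitle entry as described for FIG. 7, filled from the time information recorded for a recognized voice segment.

```python
from dataclasses import dataclass

@dataclass
class SubtitleEntry:
    seq: int        # sequence number of the subtitle entry
    utc: int        # UTC of the first audio frame of the corresponding speech segment
    duration: int   # duration of the speech segment (assumed milliseconds)
    timestamp: int  # timestamp of the first audio frame of the speech segment
    text: str       # subtitle text (translated text, possibly together with the source text)

def make_subtitle_entry(seq: int, segment_utc: int, segment_timestamp: int,
                        segment_duration: int, subtitle_text: str) -> SubtitleEntry:
    # the time information recorded for the voice segment becomes the time
    # information of the subtitle text
    return SubtitleEntry(seq, segment_utc, segment_duration, segment_timestamp, subtitle_text)
```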
  • Step 404 Decompose the live video stream data into second audio stream data and first picture frame stream data.
  • In some examples, the server may first decompose the live video stream data into the second audio stream data and the first picture frame stream data, where this decomposition step is also referred to as audio and video demultiplexing.
  • Step 405 Determine a target picture frame in the first picture frame stream data, where the target picture frame is a picture frame corresponding to time information of the subtitle text.
  • In some examples, the server may obtain the Coordinated Universal Time and the duration of the subtitle data, determine a target end time point according to them (the target end time point is a time point after the Coordinated Universal Time whose interval from the Coordinated Universal Time equals the duration), and determine each picture frame in the first picture frame stream data between the Coordinated Universal Time in the subtitle data and the target end time point as the target picture frame.
  • In other examples, the server may obtain the timestamp and the duration of the subtitle data, determine a target end time point according to them (the target end time point is a time point after the time point corresponding to the timestamp whose interval from that time point equals the duration), and determine each picture frame in the first picture frame stream data between the time point corresponding to the timestamp in the subtitle data and the target end time point as the target picture frame.
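  • Either variant of step 405 amounts to selecting the picture frames whose time falls between the subtitle's start time and the target end time point; a minimal sketch, assuming frames are represented as dictionaries with a "timestamp" field in the same units as the subtitle duration:

```python
from typing import Dict, List

def select_target_frames(frames: List[Dict[str, int]],
                         subtitle_timestamp: int,
                         subtitle_duration: int) -> List[Dict[str, int]]:
    """Return every picture frame between the subtitle timestamp and the
    target end time point (timestamp + duration)."""
    target_end = subtitle_timestamp + subtitle_duration
    return [f for f in frames if subtitle_timestamp <= f["timestamp"] < target_end]
```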
  • Step 406 Generate a caption image containing the caption text.
  • for each piece of subtitle data, the server may generate a subtitle image containing the subtitle text in that subtitle data.
  • the subtitle image may be a transparent or semi-transparent image containing subtitle text.
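  • One possible way (an assumption, not mandated by the example) to generate such a transparent subtitle image is with the Pillow library; the font, position, and frame size below are illustrative.

```python
from PIL import Image, ImageDraw, ImageFont

def render_subtitle_image(subtitle_text: str, frame_size=(1280, 720)) -> Image.Image:
    """Render the subtitle text onto a fully transparent RGBA canvas sized to the frame."""
    image = Image.new("RGBA", frame_size, (0, 0, 0, 0))  # transparent background
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()                      # placeholder font
    # draw white, fully opaque text near the bottom of the frame
    draw.text((40, frame_size[1] - 80), subtitle_text, font=font, fill=(255, 255, 255, 255))
    return image
```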
  • Step 407 Superimpose the subtitle image on an upper layer of the target picture frame to obtain superimposed picture frame stream data.
  • for a certain piece of subtitle data, the server may superimpose the subtitle image containing the subtitle text of that subtitle data on each target picture frame corresponding to the subtitle data, and obtain the superimposed picture frame stream data corresponding to the subtitle data.
  • FIG. 8 illustrates a schematic diagram of superimposition of subtitles involved in an example of the present application.
  • As shown in FIG. 8, the picture frame 81 is one of the target picture frames corresponding to the subtitle picture 82 in the picture frame stream data; the server superimposes the subtitle picture 82 on the picture frame 81 to obtain a superimposed picture frame 83, and replaces the picture frame 81 in the picture frame stream data with the superimposed picture frame 83.
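  • The superimposition of step 407 can be sketched as simple alpha compositing, assuming the subtitle image has the same size as the frames (as produced by the rendering sketch above):

```python
from typing import List
from PIL import Image

def superimpose(frames: List[Image.Image], target_indices: List[int],
                subtitle_image: Image.Image) -> List[Image.Image]:
    """Composite the subtitle image onto the upper layer of each target picture frame
    and replace the original frame in the stream, as with frames 81/82/83 in FIG. 8."""
    out = list(frames)
    for i in target_indices:
        base = frames[i].convert("RGBA")
        out[i] = Image.alpha_composite(base, subtitle_image)  # subtitle image on top
    return out
```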
  • Step 408 Combine the second audio stream data and the superimposed picture frame stream data into the processed live video stream data.
  • In some examples, the server may align the second audio stream data with the superimposed picture frame stream data according to the time information, and combine the aligned second audio stream data and superimposed picture frame stream data into the processed live video stream data.
  • the second audio stream data and the first picture frame stream data obtained by the decomposition in the above step 404 are respectively composed of audio frame data blocks and picture frame data blocks, and the time information in the audio frame data blocks and the picture frame data blocks does not change before and after the decomposition.
  • likewise, in the step of superimposing the subtitle image on the corresponding picture frame (ie, the above step 407), the time information corresponding to the picture frame data blocks remains unchanged.
  • therefore, the audio frame data blocks included in the second audio stream data are still in one-to-one correspondence with the picture frame data blocks in the superimposed picture frame stream data, and the server may align the data blocks in the second audio stream data and the superimposed picture frame stream data that correspond to the same time information (such as the timestamp and/or Coordinated Universal Time).
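  • A minimal sketch of this alignment, representing each data block as a dictionary carrying a "timestamp" key (an illustrative simplification of the header described for FIG. 5):

```python
from typing import Dict, List, Tuple

def align_and_combine(audio_blocks: List[dict],
                      picture_blocks: List[dict]) -> List[Tuple[dict, dict]]:
    """Pair audio frame data blocks with the superimposed picture frame data blocks
    that carry the same timestamp, yielding the processed stream in block order."""
    pictures_by_ts: Dict[int, dict] = {b["timestamp"]: b for b in picture_blocks}
    combined: List[Tuple[dict, dict]] = []
    for audio in audio_blocks:
        picture = pictures_by_ts.get(audio["timestamp"])
        if picture is not None:  # blocks correspond one-to-one in time
            combined.append((picture, audio))
    return combined
```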
  • FIG. 9 shows a schematic diagram of a subtitle superimposition process according to an example of the present application.
  • As shown in FIG. 9, the server performs audio and video demultiplexing on the input video stream (corresponding to the live video stream data) to obtain audio and video, and decodes the video portion to obtain each picture frame; the server also acquires the subtitle information (corresponding to the subtitle data) and generates a subtitle picture (corresponding to the subtitle image); the server superimposes the generated subtitle picture onto the corresponding decoded picture frame (ie, the video superimposition step in FIG. 9, corresponding to step 407), video-encodes the superimposed picture frames to obtain the video, and finally multiplexes the obtained video with the above audio to obtain a video stream containing the subtitles.
  • In some examples, after receiving a request sent by the user terminal, the server pushes the processed live video stream data to the user terminal, and the user terminal plays it.
  • for example, the server may receive a video stream acquisition request sent by the user terminal and obtain the language indication information carried in the video stream acquisition request, where the language indication information is used to indicate a subtitle language; when the subtitle language indicated by the language indication information is the language corresponding to the subtitle text, the processed live video stream data is pushed to the user terminal.
  • the user watching the live broadcast can request to obtain a live video stream containing subtitles in the specified language on the user terminal side.
  • the user may select a subtitle in a certain language in the subtitle selection interface on the user terminal side, and then the user terminal sends a video stream acquisition request to the server, where the video stream acquisition request includes language indication information indicating a subtitle language selected by the user.
  • correspondingly, upon receiving the video stream acquisition request, the server may obtain the language indication information; when the subtitle language indicated by the language indication information in the video stream acquisition request sent by the user terminal is the language of the subtitles superimposed in the live video stream data obtained in the above step 408, the server may push the processed live video stream data to the user terminal, and the user terminal plays it.
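  • The selection itself is straightforward; a hypothetical sketch, where processed_streams maps a subtitle language (or language combination) indication to the corresponding processed live stream handle:

```python
from typing import Dict, Optional

def handle_stream_request(language_indication: str,
                          processed_streams: Dict[str, object]) -> Optional[object]:
    """Return the processed live video stream whose superimposed subtitle language
    matches the language indication carried in the video stream acquisition request."""
    # e.g. language_indication == "zh+en" selects the Chinese + English stream
    return processed_streams.get(language_indication)
```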
  • In some examples, the server may generate, for the subtitle text of each language or language combination, a corresponding live video stream with that subtitle superimposed; when a user terminal requests a certain language or language combination, the server sends the live video stream with the subtitles of that language or language combination superimposed to the user terminal.
  • In some examples, the user may select the live video stream corresponding to a desired subtitle when entering the live broadcast interface.
  • FIG. 10 illustrates a live stream selection diagram involved in an example of the present application.
  • As shown in FIG. 10, the user terminal displays a live video stream selection interface 101, which includes a plurality of live portals 101a, each corresponding to subtitles in one language or language combination; when the user clicks one of the live portals 101a shown in FIG. 10, the user terminal displays the live broadcast interface 102 and simultaneously sends a video stream acquisition request to the server, the request indicating that the user has selected Chinese and English subtitles; the server pushes the live video stream with the superimposed Chinese and English subtitles to the user terminal, and the user terminal displays the live broadcast interface 102, in which the subtitle 102a is a Chinese + English subtitle.
  • In some examples, the user may also switch between live video streams with different subtitles while watching the live broadcast.
  • FIG. 11 illustrates another live stream selection diagram involved in the example of the present application.
  • As shown in FIG. 11, the subtitle 112a in the live broadcast picture displayed on the live broadcast interface 112 of the user terminal is a Chinese and English subtitle; the user may call out the subtitle selection menu 114 by clicking or other operations and select subtitles in another language/language combination (as shown in FIG. 11, the user selects the Chinese + Japanese combination); the user terminal then sends a video stream acquisition request to the server, the server pushes the live video stream with the superimposed Chinese and Japanese subtitles to the user terminal, and the user terminal displays it on the live broadcast interface; as shown in FIG. 11, after the user selects the Japanese subtitles, the subtitle in the live broadcast interface 112 is switched to the Chinese and Japanese subtitle 112b.
  • In summary, with the solution shown in this example, the server may obtain the audio stream data in the live video stream data, perform speech recognition on the audio stream data, and generate subtitle data according to the recognition result; then, according to the time information, the subtitle text in the subtitle data is added into the picture frames corresponding to the subtitle text in the video stream, thereby obtaining a live video stream containing subtitles and realizing accurate synchronization between the subtitles and the video pictures; at the same time, since manual insertion of subtitle data is not required, the live broadcast delay can be effectively reduced.
  • In addition, subtitles have already been added into the picture frames of the live video stream pushed to the user terminal, so the user terminal can display the live video with subtitles without performing further processing on the live video stream.
  • FIG. 12 is a schematic flowchart of processing of a live video stream according to an exemplary example.
  • As shown in FIG. 12, the live recording terminal collects and encodes the live video, and uploads the live stream to the server through the live broadcast access service; the server transcodes the live stream and outputs, with synchronized time information, a video stream (including picture frame data blocks and audio frame data blocks) and a pure audio stream (including only audio frame data blocks).
  • On one hand, the server realizes the delayed output of the video stream through the live broadcast delay service (for example, delaying by a predetermined duration); on the other hand, the server sends the audio data obtained by transcoding (ie, the pure audio stream) to the speech recognition module through the live translation service for recognition and translation, and writes the translated result (ie, the subtitle data) to the subtitle storage service (the live translation service and the speech recognition module are equivalent to the subtitle generation service described above), and the subtitle storage service is responsible for storing the subtitle data.
  • After that, the server pulls the video data (ie, the video stream) from the live broadcast delay service through the subtitle mixing service, extracts the subtitle data with corresponding time information from the subtitle storage service, and synchronously mixes them into a live stream containing subtitles according to the time information (such as the timestamp) in the video stream, the audio stream, and the subtitle data.
  • FIG. 12 above provides a solution for real-time recognition, translation, and synchronized subtitle superimposition based on a live stream: the live broadcast background (ie, the server) superimposes the subtitles obtained by recognition and translation into the video pictures, and mixes the video pictures containing the subtitles with the audio synchronized with them, thereby realizing real-time addition of subtitles to the live stream.
  • the solution has a wide range of usage scenarios and does not require manual participation; the subtitles are superimposed into the original video pictures in real time, so the playback terminal can display the subtitle information by direct playback, without additional processing.
  • FIG. 13 is a flowchart of a video stream processing method in a live broadcast scenario according to an exemplary example; the video stream processing method in the live broadcast scenario can be used in a server, for example, the server 240 shown in FIG. 2 above.
  • the video stream processing method in the live broadcast scenario may include the following steps:
  • Step 1301 Acquire first audio stream data in the live video stream data, and acquire second picture frame stream data in the live video stream data.
  • In some examples, after receiving the live video stream pushed by the live recording terminal, the server transcodes the received live video stream to obtain the live video stream data.
  • after transcoding to obtain the live video stream data, the server may decompose (ie, demultiplex) the live video stream data into audio stream data (ie, the above first audio stream data) and picture frame stream data (ie, the above second picture frame stream data).
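  • One possible way to perform such demultiplexing (an assumption for illustration; the example does not prescribe a tool) is to invoke the ffmpeg command line utility on the transcoded stream:

```python
import subprocess

def demultiplex(input_path: str, audio_out: str, video_out: str) -> None:
    """Split a transcoded live video file into a pure audio stream and a pure
    picture (video) stream using ffmpeg."""
    # -vn drops the video so only the audio stream is copied out
    subprocess.run(["ffmpeg", "-y", "-i", input_path, "-vn", "-c:a", "copy", audio_out],
                   check=True)
    # -an drops the audio so only the picture (video) stream is copied out
    subprocess.run(["ffmpeg", "-y", "-i", input_path, "-an", "-c:v", "copy", video_out],
                   check=True)
```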
  • Step 1302 Perform speech recognition on the first audio stream data to obtain a speech recognition text.
  • Step 1303 Generate subtitle data according to the speech recognition text, where the subtitle data includes subtitle text and time information corresponding to the subtitle text.
  • For the execution process of the foregoing step 1302 and step 1303, reference may be made to the descriptions of steps 402 and 403 in the example corresponding to FIG. 4, and details are not described herein again.
  • Step 1304 Determine a target picture frame in the second picture frame stream data, where the target picture frame is a picture frame corresponding to the time information of the subtitle text.
  • Step 1305 Generate a caption image containing the caption text.
  • Step 1306 Superimpose the caption image on the upper layer of the target picture frame to obtain superimposed picture frame stream data.
  • Step 1307 Combine the first audio stream data and the superimposed picture frame stream data into the processed live video stream data.
  • In summary, with the solution shown in this example, the server may obtain the audio stream data in the live video stream data, perform speech recognition on the audio stream data, and generate subtitle data according to the recognition result; then, according to the time information, the subtitle text in the subtitle data is added into the picture frames corresponding to the subtitle text in the video stream, thereby obtaining a live video stream containing subtitles and realizing accurate synchronization between the subtitles and the video pictures; at the same time, since manual insertion of subtitle data is not required, the live broadcast delay can be effectively reduced.
  • In addition, subtitles have already been added into the picture frames of the live video stream pushed to the user terminal, so the user terminal can display the live video with subtitles without performing further processing on the live video stream.
  • FIG. 14 is a schematic flowchart of processing of a live video stream according to an exemplary example.
  • As shown in FIG. 14, the live recording terminal collects and encodes the live video, and uploads the live stream to the server through the live broadcast access service; the server transcodes the live stream and outputs, with synchronized time information, a pure picture stream (containing only picture frame data blocks) and a pure audio stream (containing only audio frame data blocks).
  • On one hand, the server realizes the delayed output of the pure picture stream through the live broadcast delay service (for example, delaying by a predetermined duration); on the other hand, the server divides the pure audio stream into two paths: one path realizes the delayed output of the pure audio stream through the live broadcast delay service, and the other path is input into the live translation service, which sends the pure audio stream to the speech recognition module for recognition and translation and writes the translated result (ie, the subtitle data) to the subtitle storage service, and the subtitle storage service is responsible for storing the subtitle data.
  • After that, the server pulls the video data (ie, the pure picture stream and the pure audio stream) from the live broadcast delay service through the subtitle mixing service, extracts the subtitle data with corresponding time information from the subtitle storage service, and synchronously mixes them into a live stream containing subtitles according to the time information (such as the timestamp) in the pure picture stream, the pure audio stream, and the subtitle data.
  • FIG. 15 is a block diagram showing the structure of a video stream processing apparatus in a live broadcast scenario, according to an illustrative example.
  • the video stream processing apparatus in the live broadcast scenario can be used in the system shown in FIG. 2 to perform all or part of the steps of the methods provided by the examples shown in FIG. 3, FIG. 4, or FIG. 13.
  • the video stream processing apparatus in the live broadcast scenario may include:
  • the first obtaining module 1501 is configured to acquire first audio stream data in the live video stream data.
  • the voice recognition module 1502 is configured to perform voice recognition on the first audio stream data to obtain voice recognition text.
  • a subtitle generating module 1503 configured to generate subtitle data according to the speech recognition text, where the subtitle data includes subtitle text and time information corresponding to the subtitle text;
  • the subtitle adding module 1504 is configured to add the subtitle text to a corresponding picture frame in the live video stream data according to the time information corresponding to the subtitle text, and obtain the processed live video stream data.
  • In some examples, the caption adding module 1504 includes:
  • a decomposition unit configured to decompose the live video stream data into second audio stream data and first picture frame stream data
  • a first picture frame determining unit configured to determine a target picture frame in the first picture frame stream data, where the target picture frame is a picture frame corresponding to the time information
  • a first image generating unit configured to generate a caption image including the caption text
  • a first superimposing unit configured to superimpose the subtitle image on an upper layer of the target picture frame to obtain superimposed picture frame stream data
  • a first combining unit configured to combine the second audio stream data and the superimposed picture frame stream data into the processed live video stream data.
  • In some examples, the first combining unit is specifically configured to align the second audio stream data with the superimposed picture frame stream data according to the time information, and combine the aligned second audio stream data and the superimposed picture frame stream data into the processed live video stream data.
  • the apparatus further includes:
  • a second acquiring module configured to add the caption text to a corresponding picture frame in the live video stream data according to the time information corresponding to the caption text, to obtain the processed live video stream data And acquiring the second picture frame stream data in the live video stream data;
  • the caption adding module 1504 includes:
  • a second picture frame determining unit configured to determine a target picture frame in the second picture frame stream data, where the target picture frame is a picture frame corresponding to the time information
  • a second image generating unit configured to generate a caption image including the caption text
  • a second superimposing unit configured to superimpose the subtitle image on an upper layer of the target picture frame to obtain superimposed picture frame stream data
  • a second combining unit configured to combine the first audio stream data and the superimposed picture frame stream data into the processed live video stream data.
  • the subtitle adding module 1504 is specifically configured to: after a preset time length has elapsed from a first moment, perform the step of adding the subtitle text to the corresponding picture frame in the live video stream data according to the time information corresponding to the subtitle text, to obtain the processed live video stream data; the first moment is the moment at which the live video stream data is acquired.
  • alternatively, the subtitle adding module 1504 is specifically configured to: at the moment when the subtitle data is obtained, perform the step of adding the subtitle text to the corresponding picture frame in the live video stream data according to the time information corresponding to the subtitle text, to obtain the processed live video stream data.
  • the speech recognition module 1502 is specifically configured to: perform speech start-and-end detection on the first audio stream data to obtain a speech start frame and a speech end frame in the first audio stream data, the speech start frame being an audio frame at which a segment of speech starts and the speech end frame being an audio frame at which a segment of speech ends; extract at least one segment of speech data from the first audio stream data according to the speech start frame and the speech end frame, each segment of speech data including the audio frames between a corresponding pair of speech start and speech end frames; perform speech recognition on each segment of speech data to obtain a corresponding recognized sub-text; and determine the recognized sub-texts as the speech recognition text.
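  • By way of illustration, such start/end detection could be approximated by a simple energy threshold over the audio frames, with each detected segment sent to recognition separately. The sketch below is hypothetical; the threshold, the energy function, and the recognizer are placeholders, not the module described here.

    # Hypothetical energy-based speech start/end detection: split the audio frames into
    # speech segments, then run speech recognition on each segment separately.
    from typing import Callable, List, Sequence

    def split_speech(frames: Sequence[bytes],
                     energy: Callable[[bytes], float],
                     threshold: float = 0.01) -> List[List[bytes]]:
        segments, current = [], []
        for frame in frames:
            if energy(frame) >= threshold:   # voiced frame: inside a speech segment
                current.append(frame)
            elif current:                    # energy dropped: this is a speech end frame
                segments.append(current)
                current = []
        if current:                          # stream ended while speech was still running
            segments.append(current)
        return segments

    def recognize_stream(frames, energy, recognize) -> List[str]:
        """recognize: any speech-recognition function applied to one segment at a time."""
        return [recognize(segment) for segment in split_speech(frames, energy)]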
  • the subtitle generating module 1503 is specifically configured to: translate the speech recognition text into translated text of a target language; generate the subtitle text according to the translated text, where the subtitle text includes the translated text, or the subtitle text includes the speech recognition text and the translated text; and generate the subtitle data containing the subtitle text.
  • the apparatus further includes:
  • a request receiving module configured to receive a video stream acquisition request sent by the user terminal
  • an instruction obtaining module configured to acquire language indication information carried in the video stream obtaining request, where the language indication information is used to indicate a subtitle language
  • a pushing module configured to: when the subtitle language indicated by the language indication information is a language corresponding to the subtitle text, push the processed live video stream data to the user terminal.
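  • For example, if one processed stream has been prepared per subtitle language, the push decision is essentially a lookup from the language indication to that stream. The registry and URLs below are purely illustrative assumptions.

    # Hypothetical sketch: choose the processed live stream whose burned-in subtitles
    # match the subtitle language indicated in the user terminal's request.
    processed_streams = {
        "zh-en": "rtmp://example.com/live/channel1_zh_en",   # Chinese + English subtitles
        "zh-ja": "rtmp://example.com/live/channel1_zh_ja",   # Chinese + Japanese subtitles
    }

    def stream_for_request(language_indication: str) -> str:
        url = processed_streams.get(language_indication)
        if url is None:
            raise KeyError(f"no processed stream for subtitle language {language_indication!r}")
        return url    # this is the stream pushed to the requesting user terminal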
  • in summary, with the apparatus shown in this example, the video stream processing apparatus can acquire the audio stream data in the live video stream data, perform speech recognition on the audio stream data, and generate subtitle data according to the recognition result; the subtitle text in the subtitle data is then added into the picture frames corresponding to the subtitle text in the video stream according to the time information, so that a live video stream containing subtitles is obtained and accurate synchronization between the subtitles and the video picture is achieved; at the same time, because no manual insertion of subtitle data is needed, the live broadcast delay can be effectively reduced.
  • FIG. 16 is a structural block diagram of a computer device 1600 according to an exemplary example. The computer device 1600 can be the live recording terminal 220, the server 240, or the user terminal 260 in the live broadcast system.
  • the computer device 1600 includes a central processing unit (CPU) 1601, a system memory 1604 including a random access memory (RAM) 1602 and a read only memory (ROM) 1603, and a system bus 1605 that connects the system memory 1604 and the central processing unit 1601.
  • the computer device 1600 also includes a basic input/output system (I/O system) 1606 that facilitates the transfer of information between devices within the computer, and a mass storage device 1607 for storing an operating system 1613, application programs 1614, and other program modules 1615.
  • the basic input/output system 1606 includes a display 1608 for displaying information and an input device 1609 such as a mouse or keyboard for user input of information.
  • the display 1608 and input device 1609 are both connected to the central processing unit 1601 by an input and output controller 1610 that is coupled to the system bus 1605.
  • the basic input/output system 1606 can also include an input output controller 1610 for receiving and processing input from a plurality of other devices, such as a keyboard, mouse, or electronic stylus.
  • input/output controller 1610 also provides output to a display screen, printer, or other type of output device.
  • the mass storage device 1607 is connected to the central processing unit 1601 by a mass storage controller (not shown) connected to the system bus 1605.
  • the mass storage device 1607 and its associated computer readable medium provide non-volatile storage for the computer device 1600. That is, the mass storage device 1607 can include a computer readable medium (not shown) such as a hard disk or a CD-ROM drive.
  • the computer readable medium can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid state storage technologies, CD-ROM, DVD or other optical storage, tape cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices.
  • RAM: random access memory
  • ROM: read-only memory
  • EPROM: erasable programmable read-only memory
  • EEPROM: electrically erasable programmable read-only memory
  • the computer device 1600 can be connected to the Internet or other network devices via a network interface unit 1611 coupled to the system bus 1605.
  • the memory further includes one or more programs stored in the memory, and the central processing unit 1601 executes the one or more programs to implement all or part of the steps of the method shown in any one of FIG. 3, FIG. 4, or FIG. 13.
  • in an exemplary example, a non-transitory computer readable storage medium including instructions is also provided, such as a memory including a computer program (instructions) executable by a processor of a computer device to perform the video stream processing method in the live broadcast scenario shown in the examples of the present application.
  • for example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

This application discloses a video stream processing method and apparatus. The method includes: acquiring first audio stream data in live video stream data; performing speech recognition on the first audio stream data to obtain speech recognition text; generating subtitle data containing subtitle text according to the speech recognition text; and adding the subtitle text into corresponding picture frames in the live video stream data according to time information, to obtain processed live video stream data.

Description

视频流处理方法、装置、计算机设备及存储介质
本申请要求于2018年04月25日提交中国专利局、申请号为201810380157.X、发明名称为“视频流处理方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及互联网应用技术领域,特别涉及一种视频流处理方法、装置、计算机设备及存储介质。
背景
随着移动互联网的不断发展,视频直播类的应用也越来越广泛。视频直播作为一种新的内容传播方式,已经越来越受到欢迎。它不仅具有实时的特点,而且覆盖面更广(可以覆盖到网络电视、PC和移动终端),成本更低,操作起来更容易。
当在网络上做视频直播时,有时候需要传达给终端用户的信息不仅是图像和声音,还需要有字幕来提高用户的观看体验。
技术内容
本申请实例提供了一种视频流处理方法,所述方法包括:
获取直播视频流数据中的第一音频流数据;
对所述第一音频流数据进行语音识别,获得语音识别文本;
根据所述语音识别文本生成字幕数据,所述字幕数据中包含字幕文本以及所述字幕文本对应的时间信息;
根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据。
本申请实例还提供了一种视频流处理装置,所述装置包括:
第一获取模块,用于获取直播视频流数据中的第一音频流数据;
语音识别模块,用于对所述第一音频流数据进行语音识别,获得语 音识别文本;
字幕生成模块,用于根据所述语音识别文本生成字幕数据,所述字幕数据中包含字幕文本以及所述字幕文本对应的时间信息;
字幕添加模块,用于根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据。
本申请实例还提供了一种计算机设备,所述计算机设备包含处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现上述的视频流处理方法。
本申请实例还提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述的视频流处理方法。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本申请。
附图简要说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实例,并与说明书一起用于解释本申请的原理。
图1是本申请实例提供的一种直播流程示意图;
图2是是根据一示例性实例示出的一种直播系统的结构示意图;
图3是根据一示例性实例示出的一种视频流处理方法的流程图;
图4是根据一示例性实例示出的一种视频流处理方法的流程图;
图5是图4所示实例涉及的一种直播视频流数据的数据结构图;
图6是图4所示实例涉及的一种语音识别流程图;
图7是图4所示实例涉及的一种字幕数据的结构示意图;
图8是图4所示实例涉及的一种字幕叠加示意图;
图9是图4所示实例涉及的一种字幕叠加流程的示意图;
图10是图4所示实例涉及的一种直播流选择示意图;
图11是图4所示实例涉及的另一种直播流选择示意图;
图12是根据一示例性实例示出的一种直播视频流的处理流程示意图;
图13是根据一示例性实例示出的一种视频流处理方法的流程图;
图14是根据一示例性实例示出的一种直播视频流的处理流程示意图;
图15是根据一示例性实例示出的直播场景中的视频流处理装置的结构方框图;
图16是根据一示例性实例示出的一种计算机设备的结构框图。
实施方式
这里将详细地对示例性实例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
在对本申请所示的各个实例进行说明之前,首先对本申请涉及到的几个概念进行介绍:
1)字幕
字幕是指以文字形式显示在网络视频、电视、电影、舞台作品中的对话或者旁白等非影像内容,也泛指影视作品后期加工的文字。
2)直播
直播是一种通过流媒体技术,将图像、声音、文字等丰富的元素经互联网向用户展示生动、直观的真实画面的一整套技术,其涉及编码工具、流媒体数据、服务器、网络以及播放器等一系列服务模块。
3)实时翻译
实时翻译是指通过人工或者计算机将一种语言的语音或者文本即时翻译为另一种语言的语音或者文本。在本申请实例中,实时翻译可以是基于人工智能的语音识别和即时翻译。
在一些实例中,直播视频中的字幕通常在直播录制端(比如录制现场/演播室)通过人工插入来实现。比如,请参考图1,其示出了本申请一些实例提供的一种直播流程示意图。如图1所示,在直播录制端采集视频图像并进行编码的过程中,通过现场工作人员人工插入字幕数据,直播录制端通过直播接入服务,将直播视频流上传给服务器,服务器通过直播转码服务对直播视频流进行转码,并将转码后的直播视频流通过内容分发网络发送至用户终端侧的播放器进行播放。其中,所述直播录制端、服务器和用户终端的关系可以参见下图2。
然而,上述在直播视频中插入字幕的方案,需要在直播录制端通过人工插入字幕数据,字幕数据与直播视频画面同步的准确性较低,且通常会导致较高的直播延时,影响直播效果。
图2是根据一示例性实例示出的一种直播系统的结构示意图。该系统包括:直播录制终端220、服务器240以及若干个用户终端260。
直播录制终端220可以是手机、平板电脑、电子书阅读器、智能眼镜、智能手表、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机等等。
直播录制终端220对应有图像采集组件和音频采集组件。其中,该图像采集组件和音频采集组件可以是直播录制终端220的一部分,比如,该图像采集组件和音频采集组件可以是直播录制终端220内置的摄像头和内置的麦克风;或者,该图像采集组件和音频采集组件也可以作为直播录制终端220的外设设备与该用户终端220相连接,比如,该图像采集组件和音频采集组件可以分别是连接该直播录制终端220的摄像机和话筒;或者,该图像采集组件和音频采集组件也可以部分内置于直播录制终端220,部分作为直播录制终端220的外设设备,比如,该图像采集组件可以是直播录制终端220内置的摄像头,该音频采集组件可以是连接该直播录制终端220的耳机中的麦克风。本申请实例对于图像采集 组件和音频采集组件的实现形式不做限定。
用户终端260可以是具有视频播放功能的终端设备,比如,用户终端可以是手机、平板电脑、电子书阅读器、智能眼镜、智能手表、MP3/MP4播放器、膝上型便携计算机和台式计算机等等。
直播录制终端220和用户终端260分别与服务器240之间通过通信网络相连。在一些实例中,通信网络是有线网络或无线网络。
在本申请实例中,直播录制终端220可以将在本地录制的直播视频流上传至服务器240,并由服务器240对直播视频流进行相关处理后推送给用户终端260。
服务器240是一台服务器,或者由若干台服务器,或者是一个虚拟化平台,或者是一个云计算服务中心。
其中,上述直播录制终端220中可以安装有直播应用程序(Application,APP)客户端,比如腾讯视频客户端或者花样直播客户端等,服务器240可以是上述直播应用程序对应的直播服务器。
在直播时,直播录制终端运行直播应用程序的客户端,用户(也可以称为主播)在直播应用程序界面中触发启动直播功能后,直播应用程序的客户端调用直播录制终端中的图像采集组件和音频采集组件来录制直播视频流,并将录制的直播视频流上传至直播服务器,直播服务器接收该直播视频流,并为该直播视频流建立直播频道,用户终端对应的用户可以通过用户终端中安装的直播应用程序客户端或者浏览器客户端访问直播服务器,并在访问页面中选择该直播频道后,直播服务器将该直播视频流推送给用户终端,由用户终端在直播应用程序界面或者浏览器界面中播放该直播视频流。
在一些实例中,该系统还可以包括管理设备(图2未示出),该管理设备与服务器240之间通过通信网络相连。在一些实例中,通信网络是有线网络或无线网络。
在一些实例中,上述的无线网络或有线网络使用标准通信技术和/或协议。网络通常为因特网、但也可以是任何网络,包括但不限于局域 网(Local Area Network,LAN)、城域网(Metropolitan Area Network,MAN)、广域网(Wide Area Network,WAN)、移动、有线或者无线网络、专用网络或者虚拟专用网络的任何组合)。在一些实例中,使用包括超文本标记语言(Hyper Text Mark-up Language,HTML)、可扩展标记语言(Extensible Markup Language,XML)等的技术和/或格式来代表通过网络交换的数据。此外还可以使用诸如安全套接字层(Secure Socket Layer,SSL)、传输层安全(Transport Layer Security,TLS)、虚拟专用网络(Virtual Private Network,VPN)、网际协议安全(Internet Protocol Security,IPsec)等常规加密技术来加密所有或者一些链路。在另一些实例中,还可以使用定制和/或专用数据通信技术取代或者补充上述数据通信技术。
图3是根据一示例性实例示出的一种直播场景中的视频流处理方法的流程图,该直播场景中的视频流处理方法可以用于如图2所示的直播系统中。如图3所示,该直播场景中的视频流处理方法可以包括如下步骤:
步骤31,获取直播视频流数据中的第一音频流数据。
其中,音频流数据可以是包含直播视频流中的各个音频帧的流式数据。
步骤32,对该第一音频流数据进行语音识别,获得语音识别文本。
在本申请实例中,语音识别是指将第一音频流数据中的语音识别为对应语言类型的文本。
步骤33,根据该语音识别文本生成字幕数据,该字幕数据中包含字幕文本以及该字幕文本对应的时间信息。
在本申请实例中,上述时间信息可以是用于指示字幕数据、音频流数据或者直播视频流数据的播放时间的信息。
步骤34,根据该字幕文本对应的时间信息,将该字幕文本添加入该直播视频流数据中对应的画面帧,获得处理后的直播视频流数据。
由于上述步骤31至步骤33所示的,获取音频流数据、进行语音识别以及根据语音识别结果生成字幕数据的步骤不可避免的需要消耗一 定的处理时间,因此,在本申请实例中,可以从第一时刻起延时预设时长后,执行上述根据字幕文本对应的时间信息,将该字幕文本添加入该直播视频流数据中对应的画面帧,获得处理后的直播视频流数据的步骤(即上述步骤34);其中,该第一时刻是获取到该直播视频流数据的时刻。
在本申请实例中,可以预先设置一个固定的延时时长(即上述预设时长,比如5分钟),在获取到直播视频流数据开始计时,一方面缓存该直播视频流数据,另一方面开始执行上述步骤31至步骤33,并缓存步骤33生成的字幕数据,当计时到达上述延时时长时,提取缓存的直播视频流数据和字幕数据,并根据提取到的直播视频流数据和字幕数据执行步骤34。
其中,上述预设时长可以由开发人员预先设置在代码中,或者,上述预设时长也可以由系统管理人员或者用户自行设置或更改。需要说明的是,该预设时长可以大于执行上述步骤31至步骤33所需的时长。
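A minimal sketch of such a fixed-delay buffer, assuming the preset delay is measured from the moment each block of the live video stream arrives, is given below (hypothetical Python, not the patent's implementation):

    # Hypothetical fixed-delay buffer: cache incoming live stream blocks and release a block
    # for subtitle mixing only after the preset delay has elapsed since it was received.
    import collections
    import time

    PRESET_DELAY_SECONDS = 300        # e.g. 5 minutes; should exceed the time needed for
                                      # recognition, translation, and subtitle storage
    _buffer = collections.deque()     # holds (arrival_time, block) pairs in arrival order

    def on_block_received(block) -> None:
        _buffer.append((time.monotonic(), block))

    def pop_ready_blocks() -> list:
        """Blocks whose preset delay has expired and that may now be mixed with subtitles."""
        ready, now = [], time.monotonic()
        while _buffer and now - _buffer[0][0] >= PRESET_DELAY_SECONDS:
            ready.append(_buffer.popleft()[1])
        return ready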
在另一种可能的实现方式中,也可以直接在获得上述字幕数据的时刻,执行上述步骤34。
在本申请实例中,对于一份直播视频流数据来说,在获取到该直播视频流数据后,一方面缓存该直播视频流数据,另一方面开始执行上述步骤31至步骤33,在成功存储字幕数据后,即可以从缓存中提取该字幕数据对应的直播视频流数据,并根据生成的字幕数据以及从缓存中提取到的直播视频流数据执行步骤34。
比如,服务器可以提供字幕生成服务、字幕存储服务和字幕混合服务,其中,字幕生成服务用于根据语音识别文本生成字幕数据,字幕存储服务用于接收字幕生成服务生成的字幕数据并进行存储,字幕混合服务用于将字幕存储服务存储的字幕数据中的字幕文本添加入直播视频流数据中的画面帧中。在本申请实例中,当字幕混合服务接收到字幕存储服务发送的,成功存储字幕数据的通知时,或者,当字幕混合服务查询到数据库中已经存在字幕存储服务存储的字幕数据时,字幕混合服务 可以确定字幕存储服务成功存储了上述字幕数据,此时,字幕混合服务可以开始执行上述步骤34。
通过上述图3所示的方案,在直播场景中,可以获取直播视频流数据中的音频流数据,并对音频流数据进行语音识别并根据识别结果生成字幕数据,再根据时间信息将字幕数据中的字幕文本添加入视频流中对应字幕文本的画面帧中,从而获得包含字幕的直播视频流,实现字幕与视频画面的准确同步,同时,由于不需要人工插入字幕数据,能够有效降低直播延时。
上述图3所示的方案可以由直播系统中的不同设备实现。比如,在一种可能的实现方式中,上述对视频流进行处理的方法可以由直播系统中的服务器执行,即服务器接收到直播录制终端上传的直播视频流之后,获取直播视频流数据,并对获取到的直播视频流数据进行上述图3所示的处理。
或者,在另一种可能的实现方式中,上述对视频流进行处理的方法也可以由直播系统中的直播录制终端执行,即直播录制终端在将直播视频流数据上传服务器之前,获取直播视频流数据,并对获取到的直播视频流数据进行上述图3所示的处理。
或者,在又一种可能的实现方式中,上述对视频流进行处理的方法也可以由直播系统中的用户终端执行,即用户终端接收到服务器推送的直播视频流数据后,在播放直播视频流数据之前,对直播视频流数据进行上述图3所示的处理。
本申请后续的实例,将以上述对视频流进行处理的方法由直播系统中的服务器执行为例进行说明。
图4是根据一示例性实例示出的一种直播场景中的视频流处理方法的流程图,该直播场景中的视频流处理方法可以用于服务器中,比如,该方法可以用于上述图1所示的服务器240。如图4所示,该直播场景中的视频流处理方法可以包括如下步骤:
步骤401,获取直播视频流数据中的第一音频流数据。
以执行主体是服务器为例,直播录制终端在直播现场录制直播视频,并将录制的视频编码为直播视频流(也可以称为原始视频流)后推送给服务器;服务器接收到直播录制终端推送的直播视频流后,首先对接收到的直播视频流进行转码,获得上述直播视频流数据。
在本申请实例中,直播视频流数据由画面帧流数据和音频流数据构成,其中,画面帧流数据由一系列的画面帧数据块组成,每个画面帧数据块包含若干画面帧,相应的,音频流数据由一系列的音频帧数据块组成,每个音频帧数据块包含若干音频帧。
其中,上述直播视频流数据中的画面帧数据块与音频帧数据块在时间上一一对应,也就是说,一个画面帧数据块的播放时间,与一个音频帧数据块的播放时间是完全相同的。比如,上述画面帧数据块和音频帧数据块中分别包含各自的时间信息,且画面帧数据块和音频帧数据块之间的对应关系通过各自的时间信息进行指示,即对于一一对应的画面帧数据块和音频帧数据块,两者包含的时间信息也是相同的。
比如,请参考图5,其示出了本申请实例涉及的一种直播视频流数据的数据结构图。
如图5所示,直播视频流数据中的一个画面帧数据块包含数据块头(header)和有效载荷(payload)两部分,其中,有效载荷包括画面帧数据块中的各个画面帧,数据块头中包含数据块头大小(header_size)、有效载荷大小(payload_size)、时长(duration)、索引(index)、协调世界时(Universal Time Coordinated,UTC)以及时间戳(timestamp)等信息。其中数据块头大小用于指示当前画面帧数据块中的数据块头所占用的数据量,有效载荷大小用于指示当前画面帧数据块中的有效载荷所占用的数据量,时长用于指示当前画面帧数据块中的各个画面帧的播放时长,索引用于指示当前画面帧数据块中的各个画面帧,协调世界时用于指示当前画面帧数据块被转码的系统时间(比如,可以是画面帧数据块中第一个画面帧被转码的系统时间),时间戳用于指示当前画面帧数据块在直播视频流中的时间位置。
相应的,在图5中,直播视频流数据中的一个音频帧数据块也包含数据块头和有效载荷两部分,其中,数据块头包括音频帧数据块中的各个音频帧,数据块头中包含数据块头大小、有效载荷大小、时长、索引、协调世界时以及时间戳等信息。
在图5所示的直播视频流数据中,画面帧数据块和音频帧数据块各自的时间信息可以通过各自的数据块头中的协调世界时和/或时间戳来表示,也就是说,在时间上同步的一组画面帧数据块和音频帧数据块,两者的数据块头中的协调世界时和时间戳也是相同的。
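For illustration, the data block header described above could be modeled as follows; this is a hypothetical Python rendering of the fields named in this example, and the field types are assumptions:

    # Hypothetical model of a data block header carrying the fields described above.
    from dataclasses import dataclass

    @dataclass
    class BlockHeader:
        header_size: int    # data volume occupied by the header itself
        payload_size: int   # data volume occupied by the frames in the payload
        duration: int       # playback duration of the frames in this block
        index: int          # index of the frames in this block
        utc: int            # system time at which the block was transcoded
        timestamp: int      # time position of the block within the live stream

    def is_time_synchronized(picture: BlockHeader, audio: BlockHeader) -> bool:
        """Picture and audio blocks that correspond in time carry the same UTC and timestamp."""
        return picture.utc == audio.utc and picture.timestamp == audio.timestamp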
在本申请实例中,服务器转码获得直播视频流数据后,可以获取直播视频流数据中的第一音频流数据,同时,将直播视频流数据缓存在本地。
步骤402,对该第一音频流数据进行语音识别,获得语音识别文本。
由于一段音频流数据中可能包含多句语音,为了提高语音识别的准确性,在本申请实例中,服务器可以从第一音频流数据中提取出各段语音对应的音频帧,并对各段语音对应的音频帧分别进行语音识别。
比如,服务器可以对该第一音频流数据进行语音起止检测,获得该第一音频流数据中的语音起始帧和语音结束帧;该语音起始帧是一段语音开始的音频帧,该语音结束帧是一段语音结束的音频帧;服务器根据该第一音频流数据中的语音起始帧和语音结束帧,从该第一音频流数据中提取至少一段语音数据,该语音数据包括对应的一组语音起始帧和语音结束帧之间的音频帧;之后,服务器对该至少一段语音数据分别进行语音识别,获得该至少一段语音数据分别对应的识别子文本;最后,服务器将该至少一段语音数据分别对应的识别子文本确定为该语音识别文本。
服务器可以通过基因检测来实现语音起止检测。本实例中,基因检测又可以称为特性检测,服务器可以根据音频数据的特性来判断音频数据中的音频帧是否对于音频尾点。比如,请参考图6,其示出了本申请实例涉及的一种语音识别流程图。如图6所示,服务器在音频数据(即 上述第一音频流数据)中识别出一个语音起始帧之后,开始对该语音起始帧之后的各个音频帧进行基因检测,以确定当前检测的音频帧是否对应音频尾点(相当于上述语音结束帧),即执行步骤601,同时将检测后的各个音频帧输入语音识别模型进行语音识别,即执行步骤602,当检测到音频尾点时,服务器停止语音识别,即执行步骤603,并输出识别出的文本(步骤604),经过拆句处理(步骤605)后,进入后续的字幕输出(步骤606)流程。
步骤403,根据该语音识别文本生成字幕数据,该字幕数据中包含字幕文本以及该字幕文本对应的时间信息。
在本申请实例中,服务器可以将上述步骤获得的该语音识别文本翻译为目标语言对应的翻译文本,并根据该翻译文本生成该字幕文本;该字幕文本中包含该翻译文本,或者,该字幕文本中包含该语音识别文本和该翻译文本;然后,服务器再生成包含该字幕文本的该字幕数据。
在本申请实例中,服务器可以针对每种语言分别生成对应的字幕数据,比如,假设上述语音识别获得的语音识别文本对应的语言是中文,而目标语言包括英文、俄文、韩文和日文四种,以字幕文本中包含语音识别文本和翻译文本为例,服务器可以生成四种字幕数据,即“中文+英文”对应的字幕数据、“中文+俄文”对应的字幕数据、“中文+韩文”对应的字幕数据以及“中文+日文”对应的字幕数据。
在本申请实例中,字幕数据中还包含字幕文本对应的时间信息。比如,字幕数据中可以包含若干个字幕子数据,每个字幕子数据对应一段完整语音。请参考图7,其示出了本申请实例涉及的一种字幕数据的结构示意图。如图7所示,每个字幕子数据包括序列号(seq)、协调世界时、时长、时间戳以及字幕文本(text)等信息。其中,字幕子数据中的时长可以是一段语音的持续时长,字幕子数据中的协调世界时可以是对应的一段完整语音的起始时间点(即该段完整语音对应的第一个音频帧被转码时的协调世界时),字幕子数据中的时间戳可以是对应的一段完整语音的第一个音频帧的时间戳。其中,字幕子数据中的协调世界时 和/或时间戳即为该字幕子数据中包含的字幕文本的时间信息。其中,上述的一段语音可以是包含一个或者多个句子的语音片段。
其中,字幕文本对应的时间信息可以是字幕文本对应的语音的时间信息。比如,在上述步骤402中,服务器在对第一音频流数据进行语音识别时,记录每一段语音数据的时间信息。其中,一段语音数据的时间信息可以包括该段语音数据的语音起始帧对应的时间点(比如utc/时间戳),以及该段语音数据的持续时长。服务器在生成一段语音数据的识别字文本对应的字幕子数据时,将该段识别字文本的翻译文本作为对应的字幕子数据中的字幕文本,并将该段语音数据的时间信息作为该字幕子数据中的字幕文本的时间信息。
步骤404,将该直播视频流数据分解为第二音频流数据和第一画面帧流数据。
在本申请实例中,在将字幕数据中的字幕文本添加入直播视频流数据中的画面帧时,服务器可以首先将直播视频流数据分解为第二音频流数据和第一画面帧流数据,该分解步骤也称为音视频解复用。
步骤405,确定该第一画面帧流数据中的目标画面帧,该目标画面帧是与该字幕文本的时间信息对应的画面帧。
在本申请实例中,对于上述每一个字幕子数据,服务器可以获取该字幕子数据中的协调世界时和持续时长,根据该协调世界时和持续时长确定目标结束时间点(该目标结束时间点是该协调世界时之后,且与该协调世界时之间的时长为上述持续时长的时间点),并将上述第一画面帧流数据中,处于该字幕子数据中的协调世界时和该目标结束时间点之间的各个画面帧确定为上述目标画面帧。
或者,对于上述每一个字幕子数据,服务器可以获取该字幕子数据中的时间戳和持续时长,根据该时间戳和持续时长确定目标结束时间点(该目标结束时间点是处于该时间戳对应的时间点之后,且与该时间戳对应的时间点之间的时长为上述持续时长的时间点),并将上述第一画面帧流数据中,处于该字幕子数据中的时间戳对应的时间点和该目标结 束时间点之间的各个画面帧确定为上述目标画面帧。
步骤406,生成包含该字幕文本的字幕图像。
服务器可以对应每一个字幕子数据,分别生成该字幕子数据中的字幕文本对应的字幕图像。其中,该字幕图像可以是一个包含字幕文本的透明或者半透明图像。
步骤407,将该字幕图像叠加在该目标画面帧的上层,获得叠加后的画面帧流数据。
对于某一个字幕子数据,服务器可以将包含该字幕子数据中的字幕文本的字幕图像,叠加在该字幕子数据对应的每一个目标画面帧中,获得该字幕子数据对应的叠加后的画面帧流数据。
请参考图8,其示出了本申请实例涉及的一种字幕叠加示意图。如图8所示,画面帧81是画面帧流数据中,与字幕图像82相对应的目标画面帧中的一个画面帧,服务器将画面帧81与字幕图像82进行叠加,获得叠加后的画面帧83,并将画面帧流数据中的图像帧81替换为叠加后的画面帧83。
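As an illustration of this overlay step, the subtitle image can be a transparent image with the text drawn on it, alpha-composited onto each target picture frame. The Pillow-based sketch below is hypothetical; the text position, colors, and default font are assumptions:

    # Hypothetical sketch: render the subtitle text onto a transparent image and
    # superimpose it on a target picture frame (replacing the frame, as in FIG. 8).
    from PIL import Image, ImageDraw

    def make_subtitle_image(size, text: str) -> Image.Image:
        img = Image.new("RGBA", size, (0, 0, 0, 0))          # fully transparent background
        draw = ImageDraw.Draw(img)
        draw.text((20, size[1] - 60), text, fill=(255, 255, 255, 255))  # near the bottom edge
        return img

    def overlay_subtitle(frame: Image.Image, text: str) -> Image.Image:
        subtitle_image = make_subtitle_image(frame.size, text)
        return Image.alpha_composite(frame.convert("RGBA"), subtitle_image)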
步骤408,将该第二音频流数据和该叠加后的画面帧流数据组合为处理后的直播视频流数据。
服务器可以将该第二音频流数据与该叠加后的画面帧流数据按照时间信息进行数据对齐;并将对齐后的该第二音频流数据与该叠加后的画面帧流数据组合为该处理后的直播视频流数据。
在本申请实例中,上述步骤404中分解获得的第二音频流数据和第一画面帧流数据分别由音频帧数据块和画面帧数据块组成,且分解前后的音频帧数据块和画面帧数据块中的时间信息不变。而在上述将字幕图像叠加至对应的画面帧的步骤(即上述步骤407)中,画面帧数据块对应的时间信息也保持不变。也就是说,上述第二音频流数据中包含的音频帧数据块,与叠加后的画面帧流数据中的画面帧数据块之间也是一一对应的关系,服务器可以将第二音频流数据和叠加后的画面帧流数据,对应相同时间信息(比如时间戳和/或协调世界时)的数据块进行对齐。
请参考图9,其示出了本申请实例涉及的一种字幕叠加流程的示意图。在图9中,一方面,服务器将输入的视频流(对应上述直播视频流数据)进行音视频解复用,获得音频和视频,并对视频部分进行解码得到各个画面帧;另一方面,服务器还获取字幕信息(对应上述字幕数据),并生成字幕图片(对应上述字幕图像);服务器将生成的字幕图片叠加到解码得到的对应的画面帧中(即图7中的视频叠加步骤,步骤407),并对叠加后画面帧进行视频编码获得视频,最后将编码获得的视频与上述音频进行复用,获得包含字幕的视频流。
在本申请实例中,服务器在接收到用户终端发送的请求后,将上述处理后的直播视频流数据推送给用户终端,由用户终端进行播放。
比如,服务器可以接收用户终端发送的视频流获取请求;获取该视频流获取请求中携带的语言指示信息,该语言指示信息用于指示字幕语言;当该语言指示信息指示的字幕语言是该字幕文本对应的语言时,向该用户终端推送该处理后的直播视频流数据。
观看直播的用户可以在用户终端侧请求获取包含指定语言的字幕的直播视频流。比如,用户可以在用户终端侧的字幕选择界面中选择某种语言的字幕,之后,用户终端向服务器发送视频流获取请求,该视频流获取请求中包含指示用户选择的字幕语言的语言指示信息,服务器接收到用户终端发送的视频流获取请求后,即可以获取到该语言指示信息。
对于上述步骤408中获得处理后的直播视频流数据,当用户终端发送的视频流获取请求中的语言指示信息所指示的字幕语言是上述步骤408中获得处理后的直播视频流数据中叠加的字幕文本对应的语言时,服务器即可以将上述处理后的直播视频流数据推送给用户终端,由用户终端进行播放。
在本申请实例中,服务器可以针对每一种语言或者语言组合的字幕文本生成对应的一条叠加字幕的直播视频流,当用户终端侧选择一种语言或者语言组合时,服务器即可以将叠加后该语言或者语言组合的字幕 的直播视频流发送给用户终端。
在一种可能的实现方式中,用户可以在进入直播界面时选择哪一种字幕对应的直播视频流。比如,请参考图10,其示出了本申请实例涉及的一种直播流选择示意图。如图10所示,用户点开某个直播频道时,用户终端展示直播视频流选择界面101,其中包含若干个直播入口101a,每个直播入口101a对应一种语言/语言组合的字幕,用户点击其中一个直播入口101a(图10示出为中文+英文的语言组合的字幕对应的直播入口)后,用户终端展示直播界面102,同时向服务器发送视频流获取请求,该视频流获取请求指示用户选择了中文+英文的语言组合的字幕,服务器将中英文字幕对应的直播视频流推送给用户终端,由用户终端在直播界面102中进行展示,此时,直播界面102中的字幕102a为中文+英文字幕。
在另一种可能的实现方式中,用户也可以在观看直播的过程中,切换不同字幕的直播视频流。比如,请参考图11,其示出了本申请实例涉及的另一种直播流选择示意图。如图11所示,在第一时刻,用户终端的直播界面112中展示的直播画面中的字幕112a为中英文字幕,当用户想要切换直播画面中的字幕的语言时,可以通过点击等方式呼出字幕选择菜单114,并选择另一语言/语言组合的字幕(如图11所示,用户选择中文+日文组合的字幕),之后,用户终端向服务器发送视频流获取请求,该视频流获取请求指示用户选择了中文+日文的语言组合的字幕,服务器将中日文字幕对应的直播视频流推送给用户终端,由用户终端在直播界面进行展示,如图11所示,在用户选择中日文字幕之后的第二时刻,直播界面112中的字幕切换为中日文字幕112b。
综上所述,本申请实例所示的方案,服务器可以获取直播视频流数据中的音频流数据,并对音频流数据进行语音识别并根据识别结果生成字幕数据,再根据时间信息将字幕数据中的字幕文本添加入视频流中对应字幕文本的画面帧中,从而获得包含字幕的直播视频流,实现字幕与视频画面的准确同步,同时,由于不需要人工插入字幕数据,能够有效 降低直播延时。
此外,本申请实例所示的方案,推送给用户终端的直播视频流的画面帧中已经添加了字幕,用户终端不需要对直播视频流做进一步处理即可以向用户展示带字幕的直播画面。
基于上述图4所示的方案,请参考图12,其是根据一示例性实例示出的一种直播视频流的处理流程示意图。如图12所示,直播录制终端通过摄像机采集直播画面并进行编码后,通过直播接入服务将直播流上传给服务器,服务器通过直播转码服务将接入的直播流转码,并输出时间信息同步的视频流(包含画面帧数据块和音频帧数据块)与纯音频流(只包含音频帧数据块)。在转码之后,一方面,服务器通过直播延时服务实现视频流的延时输出(比如,延时预定时长),另一方面,服务器通过直播翻译服务将转码获取的音频数据(即纯音频流)发送到语音识别模块进行识别和翻译,其中,该语音识别模块用于实现语音的识别与翻译,并将翻译的结果(即字幕数据)写入到字幕存储服务(这里的直播翻译服务和语音识别模块相当于上述字幕生成服务),由字幕存储服务负责字幕数据的存储。在上述延时的预定时长到达时,服务器通过字幕混合服务,从直播延时服务拉取视频数据(即上述视频流),并从字幕存储服务拉取到时间信息相对应的字幕数据,根据视频流、音频流与字幕数据中的时间信息(比如时间戳),同步混合为包含字幕的直播流。
上述图12提供了一种基于直播流的实时识别、翻译以及字幕同步叠加的解决方案,直播后台(即服务器)实时从直播流中获取音频流,采用人工智能算法,实时识别音频流中的音频信号,并翻译为各种目标语言字幕;然后根据在视频流、音频流以及字幕数据中插入的时间信息,实现视频画面、声音、字幕内容完全同步对齐;最后将内容同步的字幕与视频画面实时叠加为包含字幕的视频画面,并将包含字幕的视频画面与内容同步的音频混合在一起,实现直播流字幕实时添加功能。本方案具有广泛的使用场景,不需要人工的参与,且本方案的字幕实时叠加在 原始视频画面中,播放终端不需要做额外的处理,直接播放就能展现字幕信息。
图13是根据一示例性实例示出的一种直播场景中的视频流处理方法的流程图,该直播场景中的视频流处理方法可以用于服务器中,比如,该方法可以用于上述图1所示的服务器240。如图13所示,该直播场景中的视频流处理方法可以包括如下步骤:
步骤1301,获取直播视频流数据中的第一音频流数据,并获取直播视频流数据中的第二画面帧流数据。
以执行主体是服务器为例,服务器接收到直播录制终端推送的直播视频流后,对接收到的直播视频流进行转码获得上述直播视频流数据。在本申请实例中,服务器可以在转码获得直播视频流数据之后,将直播视频流数据分解(即解复用)为音频流数据(即上述第一音频流数据)和画面帧流数据(即上述第二画面帧流数据)。
其中,音频流数据和画面帧流数据的构成形式可以参考图4对应实例中的描述,此处不再赘述。
步骤1302,对该第一音频流数据进行语音识别,获得语音识别文本。
步骤1303,根据该语音识别文本生成字幕数据,该字幕数据中包含字幕文本以及该字幕文本对应的时间信息。
其中,上述步骤1302和步骤1303的执行过程可以参考图4对应实例中的步骤402和步骤403下的描述,此处不再赘述。
步骤1304,确定第二画面帧流数据中的目标画面帧,该目标画面帧是与字幕文本的时间信息相对应的画面帧。
步骤1305,生成包含该字幕文本的字幕图像。
步骤1306,将该字幕图像叠加在该目标画面帧的上层,获得叠加后的画面帧流数据。
步骤1307,将该第一音频流数据和该叠加后的画面帧流数据组合为处理后的直播视频流数据。
上述步骤1304至步骤1307所示的方案,与图4对应实例中的步骤 405至步骤408下的描述类似,此处不再赘述。
综上所述,本申请实例所示的方案,服务器可以获取直播视频流数据中的音频流数据,并对音频流数据进行语音识别并根据识别结果生成字幕数据,再根据时间信息将字幕数据中的字幕文本添加入视频流中对应字幕文本的画面帧中,从而获得包含字幕的直播视频流,实现字幕与视频画面的准确同步,同时,由于不需要人工插入字幕数据,能够有效降低直播延时。
此外,本申请实例所示的方案,推送给用户终端的直播视频流的画面帧中已经添加了字幕,用户终端不需要对直播视频流做进一步处理即可以向用户展示带字幕的直播画面。
基于上述图13所示的方案,请参考图14,其是根据一示例性实例示出的一种直播视频流的处理流程示意图。如图14所示,直播录制终端通过摄像机采集直播画面并进行编码后,通过直播接入服务将直播流上传给服务器,服务器通过直播转码服务将接入的直播流转码,并输出时间信息同步的纯画面流(只包含画面帧数据块)与纯音频流(只包含音频帧数据块)。在转码之后,一方面,服务器通过直播延时服务实现纯画面流的延时输出(比如,延时预定时长),另一方面,服务器将纯音频流分为两路,一路通过直播延时服务实现纯音频流的延时输出,另一路输入直播翻译服务,通过直播翻译服务将纯音频流发送到语音识别模块进行识别和翻译,并将翻译的结果(即字幕数据)写入到字幕存储服务,由字幕存储服务负责字幕数据的存储。在上述延时的预定时长到达时,服务器通过字幕混合服务,从直播延时服务拉取视频数据(即上述纯画面流和纯音频流),并从字幕存储服务拉取到时间信息相对应的字幕数据,根据纯画面流、纯音频流与字幕数据中的时间信息(比如时间戳),同步混合为包含字幕的直播流。
图15是根据一示例性实例示出的一种直播场景中的视频流处理装置的结构方框图。该直播场景中的视频流处理装置可以用于如图1所示系统中,以执行图3、图4或图13所示实例提供的方法的全部或者部分 步骤。该直播场景中的视频流处理装置可以包括:
第一获取模块1501,用于获取直播视频流数据中的第一音频流数据;
语音识别模块1502,用于对所述第一音频流数据进行语音识别,获得语音识别文本;
字幕生成模块1503,用于根据所述语音识别文本生成字幕数据,所述字幕数据中包含字幕文本以及所述字幕文本对应的时间信息;
字幕添加模块1504,用于根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据。
在一些实例中,所述字幕添加模块1504,包括:
分解单元,用于将所述直播视频流数据分解为第二音频流数据和第一画面帧流数据;
第一画面帧确定单元,用于确定所述第一画面帧流数据中的目标画面帧,所述目标画面帧是与所述时间信息对应的画面帧;
第一图像生成单元,用于生成包含所述字幕文本的字幕图像;
第一叠加单元,用于将所述字幕图像叠加在所述目标画面帧的上层,获得叠加后的画面帧流数据;
第一组合单元,用于将所述第二音频流数据和所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
在一些实例中,所述第一组合单元,具体用于,
将所述第二音频流数据与所述叠加后的画面帧流数据按照时间信息进行数据对齐;
将对齐后的所述第二音频流数据与所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
在一些实例中,所述装置还包括:
第二获取模块,用于在所述字幕添加模块根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面 帧,获得处理后的直播视频流数据之前,获取所述直播视频流数据中的第二画面帧流数据;
所述字幕添加模块1504,包括:
第二画面帧确定单元,用于确定所述第二画面帧流数据中的目标画面帧,所述目标画面帧是与所述时间信息对应的画面帧;
第二图像生成单元,用于生成包含所述字幕文本的字幕图像;
第二叠加单元,用于将所述字幕图像叠加在所述目标画面帧的上层,获得叠加后的画面帧流数据;
第二组合单元,用于将所述第一音频流数据和所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
在一些实例中,所述字幕添加模块1504,具体用于从第一时刻起延时预设时长后,执行所述根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据的步骤;其中,所述第一时刻是获取到所述直播视频流数据的时刻。
在一些实例中,所述字幕添加模块1504,具体用于在获得所述字幕数据的时刻,执行所述根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据的步骤。
在一些实例中,所述语音识别模块1502,具体用于,
对所述第一音频流数据进行语音起止检测,获得所述第一音频流数据中的语音起始帧和语音结束帧;所述语音起始帧是一段语音开始的音频帧,所述语音结束帧是一段语音结束的音频帧;
根据所述第一音频流数据中的语音起始帧和语音结束帧,从所述第一音频流数据中提取至少一段语音数据,所述语音数据包括对应的一组语音起始帧和语音结束帧之间的音频帧;
对所述至少一段语音数据分别进行语音识别,获得所述至少一段语音数据分别对应的识别子文本;
将所述至少一段语音数据分别对应的识别子文本获取为所述语音识别文本。
在一些实例中,所述字幕生成模块1503,具体用于,
将所述语音识别文本翻译为目标语言对应的翻译文本;
根据所述翻译文本生成所述字幕文本;所述字幕文本中包含所述翻译文本,或者,所述字幕文本中包含所述语音识别文本和所述翻译文本;
生成包含所述字幕文本的所述字幕数据。
在一些实例中,所述装置还包括:
请求接收模块,用于接收用户终端发送的视频流获取请求;
指示获取模块,用于获取所述视频流获取请求中携带的语言指示信息,所述语言指示信息用于指示字幕语言;
推送模块,用于当所述语言指示信息指示的字幕语言是所述字幕文本对应的语言时,向所述用户终端推送所述处理后的直播视频流数据。
综上所述,本申请实例所示的方案,视频流处理装置可以获取直播视频流数据中的音频流数据,并对音频流数据进行语音识别并根据识别结果生成字幕数据,再根据时间信息将字幕数据中的字幕文本添加入视频流中对应字幕文本的画面帧中,从而获得包含字幕的直播视频流,实现字幕与视频画面的准确同步,同时,由于不需要人工插入字幕数据,能够有效降低直播延时。
图16是本申请一个示例性实例示出的计算机设备1600的结构框图。计算机设备1600可以为所述直播系统中的直播录制终端220、服务器240或用户终端260。所述计算机设备1600包括中央处理单元(CPU)1601、包括随机存取存储器(RAM)1602和只读存储器(ROM)1603的系统存储器1604,以及连接系统存储器1604和中央处理单元1601的系统总线1605。所述计算机设备1600还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)1606,和用于存储操作系统1613、应用程序1614和其他程序模块1615的大容量存储设备1607。
所述基本输入/输出系统1606包括有用于显示信息的显示器1608和 用于用户输入信息的诸如鼠标、键盘之类的输入设备1609。其中所述显示器1608和输入设备1609都通过连接到系统总线1605的输入输出控制器1610连接到中央处理单元1601。所述基本输入/输出系统1606还可以包括输入输出控制器1610以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1610还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备1607通过连接到系统总线1605的大容量存储控制器(未示出)连接到中央处理单元1601。所述大容量存储设备1607及其相关联的计算机可读介质为计算机设备1600提供非易失性存储。也就是说,所述大容量存储设备1607可以包括诸如硬盘或者CD-ROM驱动器之类的计算机可读介质(未示出)。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM、EEPROM、闪存或其他固态存储其技术,CD-ROM、DVD或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1604和大容量存储设备1607可以统称为存储器。
计算机设备1600可以通过连接在所述系统总线1605上的网络接口单元1611连接到互联网或者其它网络设备。
所述存储器还包括一个或者一个以上的程序,所述一个或者一个以上程序存储于存储器中,中央处理器1601通过执行该一个或一个以上程序来实现图3、图4或图13任一所示的方法中的全部或者部分步骤。
在示例性实例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括计算机程序(指令)的存储器,上述程序(指令)可由计算机设备的处理器执行以完成本申请各个实例所示的直播场景中的视频流处理方法。例如,所述非临时性计算机可读存储介质可以是 ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (16)

  1. 一种视频流处理方法,由计算机设备执行,所述方法包括:
    获取直播视频流数据中的第一音频流数据;
    对所述第一音频流数据进行语音识别,获得语音识别文本;
    根据所述语音识别文本生成字幕数据,所述字幕数据中包含字幕文本以及所述字幕文本对应的时间信息;
    根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据。
  2. 根据权利要求1所述的方法,其中,所述根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据,包括:
    将所述直播视频流数据分解为第二音频流数据和第一画面帧流数据;
    确定所述第一画面帧流数据中的目标画面帧,所述目标画面帧是与所述时间信息对应的画面帧;
    生成包含所述字幕文本的字幕图像;
    将所述字幕图像叠加在所述目标画面帧的上层,获得叠加后的画面帧流数据;
    将所述第二音频流数据和所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
  3. 根据权利要求2所述的方法,其中,所述将所述第二音频流数据和所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据,包括:
    将所述第二音频流数据与所述叠加后的画面帧流数据按照时间信息进行数据对齐;
    将对齐后的所述第二音频流数据与所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
  4. 根据权利要求1所述的方法,其中,所述根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据之前,还包括:
    获取所述直播视频流数据中的第二画面帧流数据;
    其中,所述根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据,包括:
    确定所述第二画面帧流数据中的目标画面帧,所述目标画面帧是与所述时间信息对应的画面帧;
    生成包含所述字幕文本的字幕图像;
    将所述字幕图像叠加在所述目标画面帧的上层,获得叠加后的画面帧流数据;
    将所述第一音频流数据和所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
  5. 根据权利要求1至4任一所述的方法,其中,
    从第一时刻起延时预设时长后,根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得所述处理后的直播视频流数据;
    其中,所述第一时刻是获取到所述直播视频流数据的时刻。
  6. 根据权利要求1至4任一所述的方法,其中,
    在成功存储所述字幕数据后,根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得所述处理后的直播视频流数据。
  7. 根据权利要求1至4任一所述的方法,其中,所述对所述第一音频流数据进行语音识别,获得语音识别文本,包括:
    对所述第一音频流数据进行语音起止检测,获得所述第一音频流数据中的语音起始帧和语音结束帧;所述语音起始帧是一段语音开始的音频帧,所述语音结束帧是一段语音结束的音频帧;
    根据所述第一音频流数据中的语音起始帧和语音结束帧,从所述第一音频流数据中提取至少一段语音数据,所述语音数据包括对应的一组语音起始帧和语音结束帧之间的音频帧;
    对所述至少一段语音数据分别进行语音识别,获得所述至少一段语音数据分别对应的识别子文本;
    将所述至少一段语音数据分别对应的识别子文本确定为所述语音识别文本。
  8. 根据权利要求1至4任一所述的方法,其中,所述根据所述语音识别文本生成字幕数据,包括:
    将所述语音识别文本翻译为目标语言对应的翻译文本;
    根据所述翻译文本生成所述字幕文本;所述字幕文本中包含所述翻译文本;
    生成包含所述字幕文本的所述字幕数据。
  9. 根据权利要求1至4任一所述的方法,其中,所述根据所述语音识别文本生成字幕数据,包括:
    将所述语音识别文本翻译为目标语言对应的翻译文本;
    根据所述翻译文本生成所述字幕文本;所述字幕文本中包含所述语音识别文本和所述翻译文本;
    生成包含所述字幕文本的所述字幕数据。
  10. 根据权利要求1至4任一所述的方法,所述方法还包括:
    接收用户终端发送的视频流获取请求;
    获取所述视频流获取请求中携带的语言指示信息,所述语言指示信息用于指示字幕语言;
    当所述语言指示信息指示的字幕语言是所述字幕文本对应的语言时,向所述用户终端推送所述处理后的直播视频流数据。
  11. 一种视频流处理装置,所述装置包括:
    第一获取模块,用于获取直播视频流数据中的第一音频流数据;
    语音识别模块,用于对所述第一音频流数据进行语音识别,获得语音识别文本;
    字幕生成模块,用于根据所述语音识别文本生成字幕数据,所述字幕数据中包含字幕文本以及所述字幕文本对应的时间信息;
    字幕添加模块,用于根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据。
  12. 根据权利要求11所述的装置,其中,所述字幕添加模块,包括:
    分解单元,用于将所述直播视频流数据分解为第二音频流数据和第一画面帧流数据;
    第一画面帧确定单元,用于确定所述第一画面帧流数据中的目标画面帧,所述目标画面帧是与所述时间信息对应的画面帧;
    第一图像生成单元,用于生成包含所述字幕文本的字幕图像;
    第一叠加单元,用于将所述字幕图像叠加在所述目标画面帧的上层,获得叠加后的画面帧流数据;
    第一组合单元,用于将所述第二音频流数据和所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
  13. 根据权利要求12所述的装置,其中,所述第一组合单元,具体用于,
    将所述第二音频流数据与所述叠加后的画面帧流数据按照时间信息进行数据对齐;
    将对齐后的所述第二音频流数据与所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
  14. 根据权利要求11所述的装置,所述装置还包括:
    第二获取模块,用于在所述字幕添加模块根据所述字幕文本对应的时间信息,将所述字幕文本添加入所述直播视频流数据中对应的画面帧,获得处理后的直播视频流数据之前,获取所述直播视频流数据中的第二画面帧流数据;
    所述字幕添加模块,包括:
    第二画面帧确定单元,用于确定所述第二画面帧流数据中的目标画面帧,所述目标画面帧是与所述时间信息对应的画面帧;
    第二图像生成单元,用于生成包含所述字幕文本的字幕图像;
    第二叠加单元,用于将所述字幕图像叠加在所述目标画面帧的上层,获得叠加后的画面帧流数据;
    第二组合单元,用于将所述第一音频流数据和所述叠加后的画面帧流数据组合为所述处理后的直播视频流数据。
  15. 一种计算机设备,所述计算机设备包含处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至10任一所述的视频流处理方法。
  16. 一种计算机可读存储介质,所述存储介质中存储有至少一条指 令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求1至10任一所述的视频流处理方法。
PCT/CN2019/079830 2018-04-25 2019-03-27 视频流处理方法、装置、计算机设备及存储介质 WO2019205872A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19792095.2A EP3787300A4 (en) 2018-04-25 2019-03-27 VIDEO STREAM PROCESSING METHOD AND DEVICE, COMPUTER DEVICE AND STORAGE MEDIUM
US16/922,904 US11463779B2 (en) 2018-04-25 2020-07-07 Video stream processing method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810380157.XA CN108401192B (zh) 2018-04-25 2018-04-25 视频流处理方法、装置、计算机设备及存储介质
CN201810380157.X 2018-04-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/922,904 Continuation US11463779B2 (en) 2018-04-25 2020-07-07 Video stream processing method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2019205872A1 true WO2019205872A1 (zh) 2019-10-31

Family

ID=63100553

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/079830 WO2019205872A1 (zh) 2018-04-25 2019-03-27 视频流处理方法、装置、计算机设备及存储介质

Country Status (4)

Country Link
US (1) US11463779B2 (zh)
EP (1) EP3787300A4 (zh)
CN (1) CN108401192B (zh)
WO (1) WO2019205872A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814732A (zh) * 2020-07-23 2020-10-23 上海优扬新媒信息技术有限公司 一种身份验证方法及装置
CN112637670A (zh) * 2020-12-15 2021-04-09 上海哔哩哔哩科技有限公司 视频生成方法及装置
CN113806570A (zh) * 2021-09-22 2021-12-17 维沃移动通信有限公司 图像生成方法和生成装置、电子设备和存储介质
EP3926968A1 (en) * 2020-06-15 2021-12-22 Interactive Standard LLC System and method for exchanging ultra short media content
CN113873306A (zh) * 2021-09-23 2021-12-31 深圳市多狗乐智能研发有限公司 一种将实时翻译字幕叠加画面经硬件投射到直播间的方法
CN114007091A (zh) * 2021-10-27 2022-02-01 北京市商汤科技开发有限公司 一种视频处理方法、装置、电子设备及存储介质
CN114063863A (zh) * 2021-11-29 2022-02-18 维沃移动通信有限公司 视频处理方法、装置及电子设备
CN114584830A (zh) * 2020-12-02 2022-06-03 青岛海尔多媒体有限公司 用于处理视频的方法及装置、家电设备
CN116471435A (zh) * 2023-04-12 2023-07-21 央视国际网络有限公司 语音和字幕的调整方法和装置、电子设备、存储介质

Families Citing this family (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108401192B (zh) * 2018-04-25 2022-02-22 腾讯科技(深圳)有限公司 视频流处理方法、装置、计算机设备及存储介质
CN109005470B (zh) * 2018-08-27 2020-11-10 佛山龙眼传媒科技有限公司 一种在线合成字幕的方法、系统与装置
CN109195007B (zh) * 2018-10-19 2021-09-07 深圳市轱辘车联数据技术有限公司 视频生成方法、装置、服务器及计算机可读存储介质
CN109348252B (zh) * 2018-11-01 2020-01-10 腾讯科技(深圳)有限公司 视频播放方法、视频传输方法、装置、设备及存储介质
CN109257659A (zh) * 2018-11-16 2019-01-22 北京微播视界科技有限公司 字幕添加方法、装置、电子设备及计算机可读存储介质
CN109379628B (zh) * 2018-11-27 2021-02-02 Oppo广东移动通信有限公司 视频处理方法、装置、电子设备及计算机可读介质
CN109495792A (zh) * 2018-11-30 2019-03-19 北京字节跳动网络技术有限公司 一种视频的字幕添加方法、装置、电子设备及可读介质
CN109819319A (zh) * 2019-03-07 2019-05-28 重庆蓝岸通讯技术有限公司 一种录像记录关键帧的方法
CN110035326A (zh) * 2019-04-04 2019-07-19 北京字节跳动网络技术有限公司 字幕生成、基于字幕的视频检索方法、装置和电子设备
CN110225265A (zh) * 2019-06-21 2019-09-10 深圳市奥拓电子股份有限公司 视频转播过程中的广告替换方法、系统及存储介质
CN110393921B (zh) * 2019-08-08 2022-08-26 腾讯科技(深圳)有限公司 云游戏的处理方法、装置、终端、服务器及存储介质
CN110933485A (zh) * 2019-10-21 2020-03-27 天脉聚源(杭州)传媒科技有限公司 一种视频字幕生成方法、系统、装置和存储介质
CN112752047A (zh) * 2019-10-30 2021-05-04 北京小米移动软件有限公司 视频录制方法、装置、设备及可读存储介质
CN111107383B (zh) * 2019-12-03 2023-02-17 广州方硅信息技术有限公司 视频处理方法、装置、设备及存储介质
CN113014984A (zh) * 2019-12-18 2021-06-22 深圳市万普拉斯科技有限公司 实时添加字幕方法、装置、计算机设备和计算机存储介质
CN111010614A (zh) * 2019-12-26 2020-04-14 北京奇艺世纪科技有限公司 一种显示直播字幕的方法、装置、服务器及介质
CN111212320B (zh) * 2020-01-08 2023-07-14 腾讯科技(深圳)有限公司 一种资源合成方法、装置、设备及存储介质
US11195533B2 (en) * 2020-03-25 2021-12-07 Disney Enterprises, Inc. Systems and methods for incremental natural language understanding
CN111432229B (zh) * 2020-03-31 2024-05-10 卡斯柯信号有限公司 一种对行车指挥进行记录分析与直播的方法与装置
CN111522971A (zh) * 2020-04-08 2020-08-11 广东小天才科技有限公司 一种直播教学中辅助用户听课的方法及装置
CN111479124A (zh) * 2020-04-20 2020-07-31 北京捷通华声科技股份有限公司 一种实时播放方法和装置
CN113613025A (zh) * 2020-05-05 2021-11-05 安徽文徽科技有限公司 一种实时语音转换字幕数据同步处理与画面合成直播的方法及装置
CN111601154B (zh) * 2020-05-08 2022-04-29 北京金山安全软件有限公司 一种视频处理方法及相关设备
CN111901615A (zh) * 2020-06-28 2020-11-06 北京百度网讯科技有限公司 直播视频的播放方法和装置
CN111836062A (zh) * 2020-06-30 2020-10-27 北京小米松果电子有限公司 视频播放方法、装置及计算机可读存储介质
CN113301357B (zh) * 2020-07-27 2022-11-29 阿里巴巴集团控股有限公司 直播方法、装置及电子设备
CN112102843A (zh) * 2020-09-18 2020-12-18 北京搜狗科技发展有限公司 一种语音识别方法、装置和电子设备
CN112087653A (zh) * 2020-09-18 2020-12-15 北京搜狗科技发展有限公司 一种数据处理方法、装置和电子设备
CN114257843A (zh) * 2020-09-24 2022-03-29 腾讯科技(深圳)有限公司 一种多媒体数据处理方法、装置、设备及可读存储介质
CN112188241A (zh) * 2020-10-09 2021-01-05 上海网达软件股份有限公司 一种用于直播流实时生成字幕的方法及系统
CN112380922B (zh) * 2020-10-23 2024-03-22 岭东核电有限公司 复盘视频帧确定方法、装置、计算机设备和存储介质
CN114501160A (zh) * 2020-11-12 2022-05-13 阿里巴巴集团控股有限公司 生成字幕的方法和智能字幕系统
CN114598893B (zh) * 2020-11-19 2024-04-30 京东方科技集团股份有限公司 文字的视频实现方法及系统、电子设备、存储介质
CN112511910A (zh) * 2020-11-23 2021-03-16 浪潮天元通信信息系统有限公司 实时字幕的处理方法和装置
CN112637620A (zh) * 2020-12-09 2021-04-09 杭州艾耕科技有限公司 一种对音视频流中物品和语言实时识别分析的方法与装置
CN112616062B (zh) * 2020-12-11 2023-03-10 北京有竹居网络技术有限公司 一种字幕显示方法、装置、电子设备及存储介质
CN112653932B (zh) * 2020-12-17 2023-09-26 北京百度网讯科技有限公司 用于移动终端的字幕生成方法、装置、设备以及存储介质
CN112770146B (zh) * 2020-12-30 2023-10-03 广州酷狗计算机科技有限公司 内容数据的设置方法、装置、设备以及可读存储介质
CN112839237A (zh) * 2021-01-19 2021-05-25 阿里健康科技(杭州)有限公司 网络直播中的视音频处理方法、计算机设备和介质
CN112929744B (zh) * 2021-01-22 2023-04-07 北京百度网讯科技有限公司 用于分割视频剪辑的方法、装置、设备、介质和程序产品
CN115150631A (zh) * 2021-03-16 2022-10-04 北京有竹居网络技术有限公司 字幕处理方法、装置、电子设备和存储介质
CN115086753A (zh) * 2021-03-16 2022-09-20 北京有竹居网络技术有限公司 直播视频流的处理方法、装置、电子设备和存储介质
CN113099282B (zh) * 2021-03-30 2022-06-24 腾讯科技(深圳)有限公司 一种数据处理方法、装置及设备
KR102523813B1 (ko) * 2021-04-06 2023-05-15 주식회사 한글과컴퓨터 영상에 대한 키워드 기반 검색을 가능하게 하는 영상 스트리밍 서비스 서버 및 그 동작 방법
CN113114687B (zh) * 2021-04-14 2022-07-15 深圳维盟科技股份有限公司 一种iptv合流方法及系统
CN113259776B (zh) * 2021-04-14 2022-11-22 北京达佳互联信息技术有限公司 字幕与音源的绑定方法及装置
KR102523814B1 (ko) * 2021-04-15 2023-05-15 주식회사 한글과컴퓨터 음성 인식을 기반으로 영상이 재생되는 화면에 자막을 출력하는 전자 장치 및 그 동작 방법
CN113099292A (zh) * 2021-04-21 2021-07-09 湖南快乐阳光互动娱乐传媒有限公司 一种基于视频的多语种字幕生成方法及装置
CN112995736A (zh) * 2021-04-22 2021-06-18 南京亿铭科技有限公司 语音字幕合成方法、装置、计算机设备及存储介质
KR102523816B1 (ko) * 2021-05-07 2023-05-15 주식회사 한글과컴퓨터 영상이 재생되는 화면 상에 사용자의 질의 문장에 대한 답변 자막을 표시하는 전자 장치 및 그 동작 방법
CN113596494B (zh) * 2021-07-27 2023-05-30 北京达佳互联信息技术有限公司 信息处理方法、装置、电子设备、存储介质及程序产品
CN113613059B (zh) * 2021-07-30 2024-01-26 杭州时趣信息技术有限公司 一种短播视频处理方法、装置及设备
EP4322536A1 (en) 2021-08-05 2024-02-14 Samsung Electronics Co., Ltd. Electronic device and method for multimedia playback in electronic device
CN115967822A (zh) * 2021-10-12 2023-04-14 北京字跳网络技术有限公司 信息显示方法、装置、电子设备和存储介质
CN113891108A (zh) * 2021-10-19 2022-01-04 北京有竹居网络技术有限公司 字幕优化方法、装置、电子设备和存储介质
CN114125331A (zh) * 2021-11-11 2022-03-01 北京有竹居网络技术有限公司 一种字幕添加系统
CN114125358A (zh) * 2021-11-11 2022-03-01 北京有竹居网络技术有限公司 云会议字幕显示方法、系统、装置、电子设备和存储介质
CN114040220A (zh) * 2021-11-25 2022-02-11 京东科技信息技术有限公司 直播方法和装置
CN114007116A (zh) * 2022-01-05 2022-02-01 凯新创达(深圳)科技发展有限公司 一种视频处理方法、视频处理装置
CN114420104A (zh) * 2022-01-27 2022-04-29 网易有道信息技术(北京)有限公司 自动生成字幕的方法及其相关产品
CN114501113B (zh) * 2022-01-30 2024-05-31 深圳创维-Rgb电子有限公司 投屏录制方法、设备及计算机可读存储介质
CN114449310A (zh) * 2022-02-15 2022-05-06 平安科技(深圳)有限公司 视频剪辑方法、装置、计算机设备及存储介质
CN114554238B (zh) * 2022-02-23 2023-08-11 北京有竹居网络技术有限公司 直播语音同传方法、装置、介质及电子设备
CN114598933B (zh) * 2022-03-16 2022-12-27 平安科技(深圳)有限公司 一种视频内容处理方法、系统、终端及存储介质
KR20240050038A (ko) * 2022-10-11 2024-04-18 삼성전자주식회사 디스플레이 장치 및 디스플레이 방법
CN115643424A (zh) * 2022-10-25 2023-01-24 上海哔哩哔哩科技有限公司 直播数据处理方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103561217A (zh) * 2013-10-14 2014-02-05 深圳创维数字技术股份有限公司 一种生成字幕的方法及终端
CN104581221A (zh) * 2014-12-25 2015-04-29 广州酷狗计算机科技有限公司 视频直播的方法和装置
CN105744346A (zh) * 2014-12-12 2016-07-06 深圳Tcl数字技术有限公司 字幕切换方法及装置
CN108401192A (zh) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 视频流处理方法、装置、计算机设备及存储介质
CN108600773A (zh) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 字幕数据推送方法、字幕展示方法、装置、设备及介质

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7047191B2 (en) * 2000-03-06 2006-05-16 Rochester Institute Of Technology Method and system for providing automated captioning for AV signals
US7171365B2 (en) * 2001-02-16 2007-01-30 International Business Machines Corporation Tracking time using portable recorders and speech recognition
US6925438B2 (en) * 2002-10-08 2005-08-02 Motorola, Inc. Method and apparatus for providing an animated display with translated speech
KR20050118733A (ko) * 2003-04-14 2005-12-19 코닌클리케 필립스 일렉트로닉스 엔.브이. 시청각 스트림상에 자동 더빙을 수행하는 시스템 및 방법
JP4127668B2 (ja) * 2003-08-15 2008-07-30 株式会社東芝 情報処理装置、情報処理方法、およびプログラム
TW201104563A (en) * 2009-07-27 2011-02-01 Ipeer Multimedia Internat Ltd Multimedia subtitle display method and system
KR101830656B1 (ko) * 2011-12-02 2018-02-21 엘지전자 주식회사 이동 단말기 및 이의 제어방법
US9437246B2 (en) * 2012-02-10 2016-09-06 Sony Corporation Information processing device, information processing method and program
WO2013136715A1 (ja) * 2012-03-14 2013-09-19 パナソニック株式会社 受信装置、放送通信連携システムおよび放送通信連携方法
US20130295534A1 (en) * 2012-05-07 2013-11-07 Meishar Meiri Method and system of computerized video assisted language instruction
CN103458321B (zh) * 2012-06-04 2016-08-17 联想(北京)有限公司 一种字幕加载方法及装置
GB2510116A (en) * 2013-01-23 2014-07-30 Sony Corp Translating the language of text associated with a video
IL225480A (en) * 2013-03-24 2015-04-30 Igal Nir A method and system for automatically adding captions to broadcast media content
US9355094B2 (en) * 2013-08-14 2016-05-31 Google Inc. Motion responsive user interface for realtime language translation
US9953631B1 (en) * 2015-05-07 2018-04-24 Google Llc Automatic speech recognition techniques for multiple languages
CN105959772B (zh) * 2015-12-22 2019-04-23 合一网络技术(北京)有限公司 流媒体与字幕即时同步显示、匹配处理方法、装置及系统
CN105828101B (zh) * 2016-03-29 2019-03-08 北京小米移动软件有限公司 生成字幕文件的方法及装置
CN107690089A (zh) * 2016-08-05 2018-02-13 阿里巴巴集团控股有限公司 数据处理方法、直播方法及装置
CN106303303A (zh) * 2016-08-17 2017-01-04 北京金山安全软件有限公司 一种媒体文件字幕的翻译方法、装置及电子设备
US10397645B2 (en) * 2017-03-23 2019-08-27 Intel Corporation Real time closed captioning or highlighting method and apparatus
CN107222792A (zh) * 2017-07-11 2017-09-29 成都德芯数字科技股份有限公司 一种字幕叠加方法及装置
CN108063970A (zh) * 2017-11-22 2018-05-22 北京奇艺世纪科技有限公司 一种处理直播流的方法和装置
CN111758264A (zh) * 2018-02-26 2020-10-09 谷歌有限责任公司 预先录制的视频的自动语音翻译配音

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103561217A (zh) * 2013-10-14 2014-02-05 深圳创维数字技术股份有限公司 一种生成字幕的方法及终端
CN105744346A (zh) * 2014-12-12 2016-07-06 深圳Tcl数字技术有限公司 字幕切换方法及装置
CN104581221A (zh) * 2014-12-25 2015-04-29 广州酷狗计算机科技有限公司 视频直播的方法和装置
CN108401192A (zh) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 视频流处理方法、装置、计算机设备及存储介质
CN108600773A (zh) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 字幕数据推送方法、字幕展示方法、装置、设备及介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3787300A4 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3926968A1 (en) * 2020-06-15 2021-12-22 Interactive Standard LLC System and method for exchanging ultra short media content
US11438287B2 (en) 2020-06-15 2022-09-06 Interactive Standard LLC System and method for generating and reproducing ultra short media content
CN111814732A (zh) * 2020-07-23 2020-10-23 上海优扬新媒信息技术有限公司 一种身份验证方法及装置
CN111814732B (zh) * 2020-07-23 2024-02-09 度小满科技(北京)有限公司 一种身份验证方法及装置
CN114584830A (zh) * 2020-12-02 2022-06-03 青岛海尔多媒体有限公司 用于处理视频的方法及装置、家电设备
CN112637670A (zh) * 2020-12-15 2021-04-09 上海哔哩哔哩科技有限公司 视频生成方法及装置
CN113806570A (zh) * 2021-09-22 2021-12-17 维沃移动通信有限公司 图像生成方法和生成装置、电子设备和存储介质
CN113873306A (zh) * 2021-09-23 2021-12-31 深圳市多狗乐智能研发有限公司 一种将实时翻译字幕叠加画面经硬件投射到直播间的方法
CN114007091A (zh) * 2021-10-27 2022-02-01 北京市商汤科技开发有限公司 一种视频处理方法、装置、电子设备及存储介质
CN114063863A (zh) * 2021-11-29 2022-02-18 维沃移动通信有限公司 视频处理方法、装置及电子设备
CN116471435A (zh) * 2023-04-12 2023-07-21 央视国际网络有限公司 语音和字幕的调整方法和装置、电子设备、存储介质

Also Published As

Publication number Publication date
US11463779B2 (en) 2022-10-04
EP3787300A1 (en) 2021-03-03
EP3787300A4 (en) 2021-03-03
CN108401192B (zh) 2022-02-22
CN108401192A (zh) 2018-08-14
US20200336796A1 (en) 2020-10-22

Similar Documents

Publication Publication Date Title
WO2019205872A1 (zh) 视频流处理方法、装置、计算机设备及存储介质
US11252444B2 (en) Video stream processing method, computer device, and storage medium
US11272257B2 (en) Method and apparatus for pushing subtitle data, subtitle display method and apparatus, device and medium
US9286940B1 (en) Video editing with connected high-resolution video camera and video cloud server
WO2017107578A1 (zh) 流媒体与字幕即时同步显示、匹配处理方法、装置及系统
US20170311006A1 (en) Method, system and server for live streaming audio-video file
US9736552B2 (en) Authoring system for IPTV network
WO2017063399A1 (zh) 一种视频播放方法和装置
WO2016150317A1 (zh) 直播视频的合成方法、装置及系统
CN109348252B (zh) 视频播放方法、视频传输方法、装置、设备及存储介质
US20150062353A1 (en) Audio video playback synchronization for encoded media
US11227620B2 (en) Information processing apparatus and information processing method
CN112601101B (zh) 一种字幕显示方法、装置、电子设备及存储介质
CN112616062B (zh) 一种字幕显示方法、装置、电子设备及存储介质
US20140208351A1 (en) Video processing apparatus, method and server
KR20150083355A (ko) 증강 미디어 서비스 제공 방법, 장치 및 시스템
US20140003792A1 (en) Systems, methods, and media for synchronizing and merging subtitles and media content
CN112437337A (zh) 一种直播实时字幕的实现方法、系统及设备
CN114040255A (zh) 直播字幕生成方法、系统、设备及存储介质
KR20140106161A (ko) 콘텐츠 재생 방법 및 장치
WO2021101024A1 (ko) 클라우드 기반 동영상 가상 스튜디오 서비스 시스템
WO2024087732A1 (zh) 直播数据处理方法及系统
CN113891108A (zh) 字幕优化方法、装置、电子设备和存储介质
CN114339284A (zh) 直播延迟的监控方法、设备、存储介质及程序产品
JP4755717B2 (ja) 放送受信端末装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19792095

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019792095

Country of ref document: EP

Effective date: 20201125