CN113099282B - Data processing method, device and equipment - Google Patents

Data processing method, device and equipment

Info

Publication number
CN113099282B
CN113099282B
Authority
CN
China
Prior art keywords
media stream
subtitle
stream
text
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110344443.2A
Other languages
Chinese (zh)
Other versions
CN113099282A (en)
Inventor
时永方
何玫峻
周煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110344443.2A
Publication of CN113099282A
Application granted
Publication of CN113099282B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47: End-user applications
    • H04N21/488: Data services, e.g. news ticker
    • H04N21/4884: Data services, e.g. news ticker for displaying subtitles


Abstract

An embodiment of the present application discloses a data processing method, apparatus, and device. The data processing method includes: pulling a first media stream to be pushed, and extracting an audio stream from the first media stream; performing recognition processing on the audio stream to obtain subtitle text corresponding to the audio stream; synchronizing the subtitle text with the first media stream, where the synchronization includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream; encapsulating the synchronized subtitle text and the first media stream to form a second media stream; and pushing the second media stream to a decoding device. By adopting the embodiments of the present application, synchronized real-time subtitles can be generated for a media stream to be pushed.

Description

Data processing method, device and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing apparatus, and a data processing device.
Background
During the playback of multimedia such as audio and video, synchronized subtitles have become an important aid to comprehension. Traditional multimedia playback, such as VoD (Video on Demand) and recorded-video playback, deals with content that is standardized and non-real-time: before such multimedia is played, the server already holds the corresponding media resources, so the resources can be transcribed in advance to obtain subtitles, which are then encoded into the media resources and become part of them; the subtitles are thus displayed synchronously when the media resources are played. In recent years, with the development of mobile internet technology, non-standard, highly real-time streaming media applications (e.g., live video applications) have become increasingly popular. Taking live video as an example, the media content generated during a live broadcast is User Generated Content (UGC) produced in real time, and it is non-standard and highly time-sensitive; because the media content is generated and transmitted in real time as a media stream, it cannot be transcribed in advance to obtain subtitles as in the traditional approach. How to generate real-time subtitles for media streams has therefore become a hot topic of current research.
Disclosure of Invention
The embodiments of the present application provide a data processing method, a data processing apparatus, and a data processing device, which can generate synchronized real-time subtitles for a media stream to be pushed.
In one aspect, an embodiment of the present application provides a data processing method, which includes:
pulling a first media stream to be pushed, and extracting an audio stream from the first media stream;
performing recognition processing on the audio stream to obtain subtitle text corresponding to the audio stream;
synchronizing the subtitle text with the first media stream, where the synchronization includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream;
encapsulating the synchronized subtitle text and the first media stream to form a second media stream; and
pushing the second media stream to a decoding device.
In the embodiments of the present application, an audio stream is extracted from a first media stream to be pushed and subjected to recognition processing to obtain subtitle text corresponding to the audio stream; by pulling the first media stream promptly and performing audio recognition on it, real-time subtitle text can be generated for the first media stream. Moreover, the subtitle text does not need to be encoded and embedded into the first media stream: it suffices to synchronize the subtitle text with the first media stream and then encapsulate the synchronized subtitle text and the first media stream to obtain a second media stream that is pushed to the decoding device. The synchronization may include adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. In the second media stream obtained in this way, the first media stream and the subtitle text are decoupled from, yet synchronized with, each other, which ensures that the decoding device can display synchronized real-time subtitles while playing the first media stream. This effectively improves the playback effect of the media stream and is particularly suitable for non-standard, highly real-time multimedia playback scenarios.
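Purely as an illustration of the flow just described, the following Python sketch models the pipeline end to end. Every name in it (MediaStream, recognize, synchronize_and_encapsulate, and so on) is an assumption made for the sketch; the patent does not prescribe any concrete API.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative data model only; all names below are assumptions of this sketch.

@dataclass
class MediaStream:
    audio: bytes                                          # encoded audio stream
    video: bytes                                          # encoded video stream
    subtitles: List[str] = field(default_factory=list)    # decoupled subtitle text

def recognize(audio: bytes) -> List[str]:
    # Stand-in for the recognition processing step; a real system would
    # invoke a speech-recognition engine on the extracted audio stream here.
    return ["example subtitle line"]

def synchronize_and_encapsulate(first: MediaStream, text: List[str]) -> MediaStream:
    # The subtitle text rides alongside the audio/video payloads (custom
    # field or subtitle encapsulation packet) instead of being re-encoded
    # into the video frames, so the two stay decoupled yet synchronized.
    return MediaStream(audio=first.audio, video=first.video, subtitles=text)

first_stream = MediaStream(audio=b"\x00", video=b"\x00")  # pulled stream to be pushed
second_stream = synchronize_and_encapsulate(first_stream, recognize(first_stream.audio))
print(second_stream.subtitles)   # subtitle text travels with the second media stream
```

The design point is visible in the last field: the subtitle text is carried alongside the encoded audio and video rather than re-encoded into them.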
In another aspect, an embodiment of the present application provides a data processing method, including:
pulling a second media stream, where the second media stream is obtained by encapsulating a first media stream and subtitle text synchronized with the first media stream; the subtitle text is obtained by performing recognition processing on the audio stream in the first media stream; and the synchronization between the first media stream and the subtitle text includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream;
decapsulating the second media stream to obtain the first media stream and the subtitle text; and
if a subtitle display instruction is detected, parsing the first media stream and the subtitle text, playing the parsed first media stream, and synchronously displaying the parsed subtitle text while the first media stream plays.
In the embodiments of the present application, the pulled second media stream is obtained by encapsulating the first media stream and subtitle text synchronized with it in real time, the subtitle text having been obtained by performing recognition processing on the audio stream in the first media stream. After the second media stream is pulled, it can be decapsulated to obtain the first media stream and the subtitle text; if the decoding device side detects a subtitle display instruction, it can parse the first media stream and the subtitle text, play the parsed first media stream, and synchronously display the parsed subtitle text during playback. The synchronization between the first media stream and the subtitle text includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream; the first media stream and the subtitle text within the second media stream are thus decoupled from, yet synchronized with, each other, so a decoding device that detects a subtitle display instruction can display synchronized real-time subtitles while playing the first media stream. This effectively improves the playback effect of the media stream and is particularly suitable for non-standard, highly real-time multimedia playback scenarios.
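A minimal sketch of this decoding-device branch, reusing the illustrative MediaStream model from the sketch above (again, every name is an assumption made for the sketch):

```python
from typing import List, Optional

def render(video: bytes, audio: bytes, overlay: Optional[List[str]]) -> None:
    # Stand-in for the player; a real client would decode and display.
    print("playing", "with subtitles" if overlay else "without subtitles")

def play_second_stream(second_stream, show_subtitles: bool) -> None:
    # Decapsulating the second media stream yields the first media stream
    # (audio + video) plus the decoupled subtitle text.
    if show_subtitles:                                        # display instruction
        render(second_stream.video, second_stream.audio, second_stream.subtitles)
    else:                                                     # prohibition instruction:
        render(second_stream.video, second_stream.audio, None)  # text is ignored
```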
In another aspect, an embodiment of the present application provides a data processing apparatus, including:
a stream pulling unit, configured to pull a first media stream to be pushed and extract an audio stream from the first media stream;
a processing unit, configured to perform recognition processing on the audio stream to obtain subtitle text corresponding to the audio stream;
the processing unit is further configured to synchronize the subtitle text with the first media stream, where the synchronization includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream;
the processing unit is further configured to encapsulate the synchronized subtitle text and the first media stream to form a second media stream; and
a stream pushing unit, configured to push the second media stream to a decoding device.
In one implementation, when performing recognition processing on the audio stream to obtain the subtitle text corresponding to the audio stream, the processing unit is specifically configured to perform the following steps:
sequentially intercepting N audio segments from the audio stream; and
sequentially recognizing the N audio segments to obtain M groups of subtitle information and the time offset of each group, where the M groups of subtitle information and their time offsets form the subtitle text.
Any one of the N audio segments is denoted as the i-th audio segment, and the i-th audio segment corresponds to K groups of subtitle information; any one of the M groups of subtitle information is denoted as the j-th group, and the time offset of the j-th group refers to the offset of the start timestamp of the j-th group relative to the start timestamp of the audio stream. i, j, N, M, and K are positive integers, with i ≤ N, K ≤ M, and j ≤ M.
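The offset bookkeeping in this implementation can be illustrated with a short sketch; asr here is a hypothetical recognizer standing in for the actual recognition processing, and the data layout is an assumption of the sketch:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SubtitleGroup:
    text: str
    offset_ms: int   # offset of this group's start timestamp relative to the
                     # start timestamp of the whole audio stream

def asr(segment: bytes) -> List[Tuple[str, int]]:
    # Hypothetical recognizer: returns K (text, offset-within-segment) pairs
    # per audio segment; here it returns a single placeholder group.
    return [("example line", 0)]

def recognize_segments(segments: List[Tuple[bytes, int]]) -> List[SubtitleGroup]:
    # Each element of `segments` is (audio_bytes, segment_start_ms), the N
    # segments intercepted sequentially from the audio stream. Adding the
    # segment's own start time converts each group's local offset into an
    # offset relative to the start of the whole audio stream.
    groups: List[SubtitleGroup] = []
    for audio, seg_start_ms in segments:
        for text, local_offset_ms in asr(audio):
            groups.append(SubtitleGroup(text, seg_start_ms + local_offset_ms))
    return groups   # the M groups plus their offsets form the subtitle text
```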
In one implementation, the first media stream includes an audio stream and a video stream; when synchronizing the subtitle text with the first media stream, the processing unit is specifically configured to perform the following steps:
determining, according to the time offsets of the M groups of subtitle information, M video frames in the video stream that respectively match the M groups of subtitle information; and
constructing custom fields at the associated positions of the M matched video frames, and adding the M groups of subtitle information and the time offset of each group into the corresponding custom fields.
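One simple way to realize the matching step is sketched below; the matching rule used here, the first frame whose timestamp is at or after the group's offset, is an assumption of the sketch rather than a rule fixed by the patent:

```python
from typing import Dict, List

def match_frames(group_offsets_ms: List[int],
                 frame_timestamps_ms: List[int]) -> Dict[int, int]:
    # For the j-th subtitle group, pick the first video frame whose
    # presentation timestamp is at or after the group's time offset; the
    # custom field is then built at that frame's associated position.
    matches: Dict[int, int] = {}
    for j, offset in enumerate(group_offsets_ms):
        for frame_index, pts in enumerate(frame_timestamps_ms):
            if pts >= offset:
                matches[j] = frame_index
                break
    return matches

print(match_frames([0, 900], [0, 40, 80, 920]))   # {0: 0, 1: 3}
```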
In one implementation, the video stream includes a plurality of video frames that are sequentially encapsulated in a plurality of video code stream units, and the video frame matching the j-th group of subtitle information is encapsulated in a target video code stream unit.
When constructing a custom field at the associated position of the video frame matching the j-th group of subtitle information, the processing unit is specifically configured to: construct a target subtitle code stream unit at the associated position of the target video code stream unit, and configure the target subtitle code stream unit as the custom field.
When adding the j-th group of subtitle information and its time offset to the corresponding custom field, the processing unit is specifically configured to: encapsulate the j-th group of subtitle information and its time offset into the target subtitle code stream unit.
The associated position of the target video code stream unit includes the position between the reference video code stream unit and the target video code stream unit, where the reference video code stream unit is the video code stream unit that encapsulates the video frame preceding the one matching the j-th group of subtitle information.
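As a sketch of this unit-level insertion, with an assumed byte layout for the subtitle code stream unit (the real layout is a design choice that the patent does not fix):

```python
from typing import List

def build_subtitle_unit(text: str, offset_ms: int) -> bytes:
    # One assumed layout: a 1-byte type label, a 4-byte big-endian time
    # offset, then UTF-8 text. 0x06 (SEI) is used here only because decoders
    # skip unknown SEI payloads safely; the actual unit format is an
    # assumption of this sketch.
    return bytes([0x06]) + offset_ms.to_bytes(4, "big") + text.encode("utf-8")

def insert_subtitle_unit(units: List[bytes], target_index: int,
                         subtitle_unit: bytes) -> List[bytes]:
    # Place the target subtitle code stream unit between the reference unit
    # (the one encapsulating the previous frame, at target_index - 1) and
    # the target video code stream unit, so it is parsed just before the
    # matched frame.
    return units[:target_index] + [subtitle_unit] + units[target_index:]
```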
In one implementation, the custom fields are subtitle code stream units, and each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units.
When encapsulating the synchronized subtitle text and the first media stream to form the second media stream, the processing unit is specifically configured to perform the following step: encapsulating the M subtitle code stream units and the first media stream to form the second media stream.
In one implementation, the first media stream includes an audio stream and a video stream; when synchronizing the subtitle text with the first media stream, the processing unit is specifically configured to perform the following steps:
obtaining the time offset of the first group of subtitle information in the subtitle text;
determining, according to that time offset, the video frame in the video stream that matches the first group of subtitle information; and
aligning the first group of subtitle information with the determined video frame.
In one implementation, the video stream is encapsulated in a video encapsulation packet and the audio stream is encapsulated in an audio encapsulation packet; when encapsulating the synchronized subtitle text and the first media stream to form the second media stream, the processing unit is specifically configured to perform the following steps:
encapsulating the aligned M groups of subtitle information and the time offset of each group into a subtitle encapsulation packet; and
encapsulating the subtitle encapsulation packet, the video encapsulation packet, and the audio encapsulation packet to form the second media stream.
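An illustrative sketch of this packet-level alternative follows; the tag layout loosely mirrors FLV TAGs, and the use of the script-data tag type (0x12) as a subtitle carrier is an assumption of the sketch, not a requirement of the scheme:

```python
import json
from typing import List

def build_subtitle_packet(groups: List[dict]) -> bytes:
    # `groups` holds the aligned M groups, e.g. [{"text": ..., "offset_ms": ...}].
    # Layout assumed for the sketch: 1-byte tag type + 3-byte body size + body.
    body = json.dumps(groups).encode("utf-8")
    return bytes([0x12]) + len(body).to_bytes(3, "big") + body

def form_second_stream(subtitle_packet: bytes, video_packet: bytes,
                       audio_packet: bytes) -> bytes:
    # The three encapsulation packets travel together in the second media
    # stream; the subtitle packet stays decoupled from the media payloads.
    return subtitle_packet + video_packet + audio_packet
```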
In the embodiments of the present application, an audio stream is extracted from a first media stream to be pushed and subjected to recognition processing to obtain subtitle text corresponding to the audio stream; by pulling the first media stream promptly and performing audio recognition on it, real-time subtitle text can be generated for the first media stream. Moreover, the subtitle text does not need to be encoded and embedded into the first media stream: it suffices to synchronize the subtitle text with the first media stream and then encapsulate the synchronized subtitle text and the first media stream to obtain a second media stream that is pushed to the decoding device. The synchronization includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. In the second media stream obtained in this way, the first media stream and the subtitle text are decoupled from, yet synchronized with, each other, which ensures that the decoding device can display synchronized real-time subtitles while playing the first media stream. This effectively improves the playback effect of the media stream and is particularly suitable for non-standard, highly real-time multimedia playback scenarios.
In another aspect, an embodiment of the present application provides a data processing apparatus, including:
a stream pulling unit, configured to pull a second media stream, where the second media stream is obtained by encapsulating a first media stream and subtitle text synchronized with the first media stream; the subtitle text is obtained by performing recognition processing on the audio stream in the first media stream; and the synchronization between the first media stream and the subtitle text includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream;
a processing unit, configured to decapsulate the second media stream to obtain the first media stream and the subtitle text;
the processing unit is further configured to: if a subtitle display instruction is detected, parse the first media stream and the subtitle text, play the parsed first media stream, and synchronously display the parsed subtitle text while the first media stream plays.
In one implementation, the first media stream further includes a video stream, and the plurality of video frames included in the video stream are sequentially encapsulated in a plurality of video code stream units; the subtitle text includes M groups of subtitle information and the time offset of each group, where M is a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units. When parsing the first media stream and the subtitle text, the processing unit is specifically configured to perform the following steps:
parsing the current code stream unit to be processed in the first media stream;
if the current code stream unit is a subtitle code stream unit, extracting the corresponding group of subtitle information and its time offset from the current code stream unit; and
parsing the next code stream unit to obtain the video frame matching the extracted subtitle information.
When playing the parsed first media stream and synchronously displaying the parsed subtitle text during playback, the processing unit is specifically configured to perform the following step:
displaying the extracted subtitle information while playing the video frame obtained by parsing.
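A sketch of this parsing loop, reusing the unit layout assumed in the encoding-side sketch above (1-byte 0x06 label, 4-byte offset, UTF-8 text):

```python
from typing import Iterable, Optional, Tuple

def decode_subtitle(unit: bytes) -> Tuple[str, int]:
    # Mirrors the assumed unit layout: 1-byte label, 4-byte offset, text.
    return unit[5:].decode("utf-8"), int.from_bytes(unit[1:5], "big")

def parse_and_play(units: Iterable[bytes]) -> None:
    pending: Optional[Tuple[str, int]] = None
    for unit in units:
        if unit[:1] == b"\x06":               # current unit is a subtitle unit:
            pending = decode_subtitle(unit)   # extract its group and offset
        else:                                 # next unit holds the matched frame:
            text = pending[0] if pending else ""
            print(f"frame of {len(unit)} bytes shown with subtitle: {text!r}")
            pending = None
```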
In one implementation, the first media stream further includes a video stream; the video stream is encapsulated in a video encapsulation packet, the audio stream is encapsulated in an audio encapsulation packet, and the subtitle text is encapsulated in a subtitle encapsulation packet.
When playing the parsed first media stream and synchronously displaying the parsed subtitle text during playback, the processing unit is specifically configured to perform the following step:
playing the audio stream parsed from the audio encapsulation packet and the video stream parsed from the video encapsulation packet, and synchronously displaying the subtitle text parsed from the subtitle encapsulation packet during playback.
In one implementation, the processing unit is further configured to perform the following steps: if a subtitle display prohibition instruction is detected, ignoring the subtitle text and parsing the first media stream; and playing the parsed first media stream.
In one implementation, the first media stream further includes a video stream, and the plurality of video frames included in the video stream are sequentially encapsulated in a plurality of video code stream units; the subtitle text includes M groups of subtitle information and the time offset of each group, where M is a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units. When ignoring the subtitle text and parsing the first media stream, the processing unit is specifically configured to perform the following steps:
reading the header information of the current code stream unit to be processed in the first media stream, where a type label is encapsulated in the header information;
if the type label in the header information is a subtitle label, skipping the current code stream unit; and
parsing the next code stream unit to obtain a video frame.
When playing the parsed first media stream, the processing unit is configured to play the video frame obtained by parsing.
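Sketched under the same assumed layout, skipping costs only a header read per unit:

```python
from typing import Iterable

def play_ignoring_subtitles(units: Iterable[bytes]) -> None:
    # Reading only the type label in each unit's header is enough to skip
    # subtitle units without parsing their payloads; 0x06 is the subtitle
    # label assumed in the sketches above.
    for unit in units:
        if unit[:1] == b"\x06":
            continue                                        # skip the subtitle unit
        print(f"playing video frame of {len(unit)} bytes")  # parse and play
```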
In the embodiments of the present application, the pulled second media stream is obtained by encapsulating the first media stream and subtitle text synchronized with it in real time, the subtitle text having been obtained by performing recognition processing on the audio stream in the first media stream. After the second media stream is pulled, it can be decapsulated to obtain the first media stream and the subtitle text; if the decoding device side detects a subtitle display instruction, it can parse the first media stream and the subtitle text, play the parsed first media stream, and synchronously display the parsed subtitle text during playback. The synchronization between the first media stream and the subtitle text includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream; the first media stream and the subtitle text within the second media stream are thus decoupled from, yet synchronized with, each other, so a decoding device that detects a subtitle display instruction can display synchronized real-time subtitles while playing the first media stream. This effectively improves the playback effect of the media streams and is particularly suitable for non-standard, highly real-time multimedia playback scenarios.
In another aspect, an embodiment of the present application provides a data processing apparatus, which includes a processor and a computer-readable storage medium, wherein:
a processor adapted to execute a computer program; and
a computer-readable storage medium storing a computer program adapted to be loaded by the processor to carry out the data processing method described above.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is read and executed by a processor of a computer device, the computer program causes the computer device to execute the data processing method described above.
In another aspect, embodiments of the present application provide a computer program product or a computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method described above.
In the embodiments of the present application, an audio stream is extracted from a first media stream to be pushed and subjected to recognition processing to obtain subtitle text corresponding to the audio stream; by pulling the first media stream promptly and performing audio recognition on it, real-time subtitle text can be generated for the first media stream. Moreover, the subtitle text does not need to be encoded and embedded into the first media stream: it suffices to synchronize the subtitle text with the first media stream and then encapsulate the synchronized subtitle text and the first media stream to obtain a second media stream that is pushed to the decoding device. The synchronization between the first media stream and the subtitle text includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. In the second media stream obtained in this way, the first media stream and the subtitle text are decoupled from, yet synchronized with, each other, which ensures that the decoding device can display synchronized real-time subtitles while playing the first media stream. This effectively improves the playback effect of the media stream and is particularly suitable for non-standard, highly real-time multimedia playback scenarios.
Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a package file according to an exemplary embodiment of the present application;
FIG. 2 is an architecture diagram of a data processing system according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of the subtitle generating flow of a subtitle generating device according to an exemplary embodiment of the present application;
FIG. 4 is a flowchart of a data processing method according to an exemplary embodiment of the present application;
FIG. 5a is a schematic diagram of a process for determining an associated position according to an exemplary embodiment of the present application;
FIG. 5b is a schematic diagram of a synchronization manner between subtitle text and a first media stream according to an exemplary embodiment of the present application;
FIG. 5c is a schematic structural diagram of a package file of a second media stream according to an exemplary embodiment of the present application;
FIG. 5d is a schematic structural diagram of a package file of a second media stream according to another exemplary embodiment of the present application;
FIG. 6 is a flowchart of a data processing method according to another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of the data processing flow on the decoding device side according to an exemplary embodiment of the present application;
FIG. 8a is a schematic diagram of an application of a data processing method in a live video scenario according to an exemplary embodiment of the present application;
FIG. 8b is a schematic diagram of an application of a data processing method in a video session scenario according to an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment of the present application;
FIG. 10 is a schematic structural diagram of a data processing apparatus according to another exemplary embodiment of the present application;
FIG. 11 is a schematic structural diagram of a data processing device according to an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present application.
The embodiments of the present application relate to audio and video encoding of media content. The media content mentioned herein refers to real-time multimedia that is generated in real time during live video, video sessions, and the like, and must be transmitted in real time; it may include audio content and video content. Audio and video encoding refers to encoding the audio content and video content of the media content to obtain the corresponding media stream: the media stream may include an audio stream, obtained by encoding the audio content, and a video stream, obtained by encoding the video content.
A media stream must be encapsulated before transmission. Specifically, after acquiring media content, the encoding device side encodes it to obtain a media stream, encapsulates the media stream, and pushes the resulting encapsulated file to the decoding device side; the decoding device side pulls the encapsulated file, decapsulates it to obtain the media stream, and then parses the media stream to obtain the media content for playback. In one implementation, the media stream may be encapsulated in the FLV (Flash Video) format to obtain an FLV file; FLV is a media stream encapsulation format whose files are compact, load quickly, and yield high-quality media content after decapsulation. FIG. 1 shows a schematic structural diagram of an encapsulated file provided by an exemplary embodiment of the present application. As shown in FIG. 1, an FLV file includes a file header and a file body. The file header may include, but is not limited to, the type identification of the FLV format and the number of bytes occupied by the header. The file body may include a sequence of combinations of "size of the previous encapsulation packet + encapsulation packet"; an encapsulation packet may include, but is not limited to, at least one of a script encapsulation packet (script TAG), an audio encapsulation packet (audio TAG), and a video encapsulation packet (video TAG). The script encapsulation packet is used to encapsulate meta information of the video and audio streams (e.g., duration, width, height), the audio encapsulation packet is used to encapsulate the audio stream, and the video encapsulation packet is used to encapsulate the video stream.
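The "previous packet size + packet" layout can be walked with a few lines of code; the sketch below follows the public FLV specification and is independent of the patent:

```python
from typing import Iterator, Tuple

def walk_flv_body(body: bytes) -> Iterator[Tuple[int, bytes]]:
    # Per the public FLV layout: the body is a sequence of
    # "4-byte PreviousTagSize + TAG", and each TAG header is 11 bytes
    # (1-byte type: 8 = audio, 9 = video, 18 = script data; 3-byte data
    # size; 4-byte timestamp; 3-byte stream ID) followed by the payload.
    pos = 0
    while pos + 15 <= len(body):
        pos += 4                                        # skip PreviousTagSize
        tag_type = body[pos]
        data_size = int.from_bytes(body[pos + 1:pos + 4], "big")
        yield tag_type, body[pos + 11:pos + 11 + data_size]
        pos += 11 + data_size
```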
Within a video encapsulation packet, the encapsulation unit of the video stream is the NALU (Network Abstraction Layer Unit); that is, the NALU is the basic unit of a video stream during transmission. A NALU may include header information and an actual transport payload: the header information may include a type label of the NALU, and the payload encapsulates data of the type indicated by that label. For example, a type label of 0x5 in a NALU's header indicates that the payload carries an I frame (key frame), and a type label of 0x6 indicates that the payload carries SEI (Supplemental Enhancement Information).
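Extracting the type label is a one-line operation under the public H.264 specification; the sketch below is illustrative only:

```python
def nal_unit_type(nalu: bytes) -> int:
    # Per the H.264 specification, the low 5 bits of a NALU's first byte
    # carry nal_unit_type: 5 marks an IDR (key frame) slice and 6 marks SEI.
    return nalu[0] & 0x1F

assert nal_unit_type(bytes([0x65])) == 5   # 0x65 is a typical IDR slice header
assert nal_unit_type(bytes([0x06])) == 6   # 0x06 marks an SEI unit
```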
A video stream may include a plurality of video frames; in general, one video frame is encapsulated in one NALU to form one video code stream unit. As shown in FIG. 1, a plurality of video code stream units are encapsulated in the video encapsulation packet of the FLV file. It should be noted that the video stream provided in the embodiments of the present application may be an H.264 video stream; H.264 is a digital video compression format characterized by low bit rate, high quality of the decoded video frames, strong error resilience, and strong network adaptability.
Based on this, the embodiments of the present application provide a data processing scheme. On the subtitle generating device side: after the first media stream to be pushed is pulled, the audio stream extracted from it can be recognized to obtain the corresponding subtitle text; the subtitle text can then be synchronized with the first media stream in real time, the synchronized subtitle text and first media stream encapsulated, and the resulting second media stream pushed to the decoding device. The synchronization includes two manners: adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. The subtitle generating device can thus generate real-time subtitles for the first media stream, subtitles synchronized with both the video content and the audio content, by synchronizing the subtitle text with the first media stream in real time, without re-encoding the subtitle text into the first media stream. On the decoding device side: the pulled second media stream is obtained by encapsulating the first media stream and subtitle text synchronized with it in real time, the subtitle text having been obtained by recognizing the audio stream in the first media stream; after pulling the second media stream, the decoding device can decapsulate it to obtain the first media stream and the subtitle text. If a subtitle display instruction is detected, the decoding device parses the first media stream and the subtitle text, plays the parsed first media stream, and synchronously displays the parsed subtitle text during playback; if a subtitle display prohibition instruction is detected, it ignores the subtitle text and parses and plays the first media stream. In other words, the subtitle text carried in the second media stream is a real-time subtitle generated for the first media stream: when a subtitle display instruction is detected, it is parsed and displayed in synchronization with the playing first media stream; when a subtitle display prohibition instruction is detected, the first media stream is parsed and played directly.
The data processing scheme provided by the embodiments of the present application features seamless insertion, seamless transmission, and seamless display of real-time subtitles. First, on the subtitle generating device side, after the first media stream and the subtitle text are synchronized in real time, the synchronized subtitle text and first media stream are encapsulated into a second media stream that is pushed to the decoding device side; real-time synchronization is achieved without re-encoding the subtitle text into the first media stream, so the two are decoupled from yet synchronized with each other, and the subtitle text is seamlessly inserted into the first media stream. Second, because no re-encoding is required, the overall delay introduced by encoding is avoided and the step of forming the second media stream from the first is very short; the second media stream is thus transmitted seamlessly to the decoding device side. Finally, a decoding device that detects a subtitle display instruction can decapsulate the second media stream and synchronously display the real-time subtitle text while playing the first media stream, whereas for a decoding device that detects a subtitle display prohibition instruction, the presence of the subtitle text has no effect on playback and the first media stream can be parsed and played directly; the subtitle text is therefore displayed seamlessly on the decoding device side.
FIG. 2 shows an architecture diagram of a data processing system suitable for implementing the above data processing scheme, provided by an exemplary embodiment of the present application. As shown in FIG. 2, the data processing system may include an encoding device 201, a subtitle generating device 202, a first decoding device 203, and a second decoding device 204; the numbers of first decoding devices 203 and second decoding devices 204 are not limited in the embodiments of the present application. The encoding device 201 is a device for generating the first media stream and may be a terminal or a server. The subtitle generating device 202 is a device for generating the subtitle text corresponding to the first media stream and forming the second media stream after synchronizing the subtitle text with the first media stream; it may likewise be a terminal or a server. The first decoding device 203 is a device with subtitle display capability, the second decoding device 204 is a device without subtitle display capability, and both may be terminals. The encoding device 201, the subtitle generating device 202, the first decoding device 203, and the second decoding device 204 may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application. It should be noted that the server mentioned in the embodiments of the present application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network) services, and big data and artificial intelligence platforms. The terminal mentioned in the embodiments of the present application may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted device, a smart speaker, or a smart watch.
Having subtitle display capability means being provided with hardware or software that supports subtitle display (for example, an application program supporting subtitle display, or subtitle display driver software); the first decoding device 203 is provided with such hardware or software. Lacking subtitle display capability means not being provided with hardware or software that supports subtitle display; the second decoding device 204 has neither.
The following describes the encoding device 201, the subtitle generating device 202, the first decoding device 203, and the second decoding device 204 included in the data processing system, respectively:
(1) encoding apparatus 201
The encoding device is configured to generate the first media stream and push it into a content distribution network. A streaming media application runs on the encoding device; through it, the encoding device can collect media content, encode the media content to obtain its first media stream, and push the first media stream to the content distribution network, which forwards it. In one implementation, the streaming media application running on the encoding device may be a live video application: the encoding device encodes the live media content collected by the application (live video and live audio) to obtain a first media stream and pushes it to the content distribution network for forwarding, where the first media stream includes a video stream encoded from the live video and an audio stream encoded from the live audio. In another implementation, the streaming media application may be a video session application: the encoding device encodes the session media content collected by the application (session video and session audio) to obtain a first media stream and pushes it to the content distribution network for forwarding, where the first media stream includes a video stream encoded from the session video and an audio stream encoded from the session audio.
(2) Subtitle generating apparatus 202
The subtitle generating device is configured to synchronize the first media stream with its corresponding subtitle text, form the second media stream from the synchronized subtitle text and first media stream, and push the second media stream to the content distribution network. FIG. 3 shows a schematic diagram of the subtitle generating flow of the subtitle generating device provided by an exemplary embodiment of the present application. As shown in FIG. 3, the subtitle generating device pulls from the content distribution network a first media stream that contains no subtitle text, separates the audio stream from it, and performs recognition processing on the audio stream to obtain the corresponding subtitle text. The subtitle generating device then synchronizes the subtitle text with the first media stream, encapsulates the synchronized subtitle text and first media stream to form the second media stream, and pushes the second media stream, now containing subtitle text, back to the content distribution network, so that the first decoding device 203 and the second decoding device 204 can pull it from there. After pushing the second media stream, the subtitle generating device may check whether a new media stream, still being pushed by the encoding device, exists in the content distribution network; if so, it continues to pull and process the new media stream until no new media stream pushed by the encoding device remains; if not, the subtitle generating flow of the subtitle generating device ends.
It should be noted that the encoding device 201 and the subtitle generating device 202 may be integrated in the same device; for example, the subtitle generating device 202 may be integrated into the encoding device 201 as a subtitle generating module. In this case, after generating the first media stream, the encoding device 201 may pull the first media stream to be pushed, extract the audio stream from it, and perform recognition processing on the audio stream to obtain the corresponding subtitle text; the encoding device 201 then synchronizes the subtitle text with the first media stream and encapsulates the synchronized subtitle text and first media stream to form the second media stream. After forming the second media stream, the encoding device 201 pushes it to the content distribution network so that the first decoding device 203 and the second decoding device 204 can pull it. After pushing the second media stream, the encoding device 201 may check whether the streaming media application running on it has finished collecting media content; if not, it continues to generate new media streams from the newly collected media content and process them until the streaming media application finishes collecting; if so, the processing flow of the encoding device 201 ends.
(3) First decoding device 203
The first decoding device is a device with subtitle display capability, which may include at least one of the following: the first decoding device is provided with hardware supporting subtitle display; the first decoding device runs driver software supporting subtitle display; or a streaming media application supporting subtitle display runs on the first decoding device. The streaming media application running on the first decoding device pulls the second media stream from the content distribution network. For a first decoding device with subtitle display capability: if a subtitle display instruction is detected, the subtitle text encapsulated in the second media stream is rendered synchronously while the first media stream encapsulated in the second media stream plays; if a subtitle display prohibition instruction is detected, the subtitle text can be ignored and the first media stream played directly. Specifically, the streaming media application running on the first decoding device pulls the second media stream and decapsulates it to obtain the first media stream and the subtitle text; then, if a subtitle display instruction is detected, the first decoding device parses the first media stream and the subtitle text, plays the parsed first media stream, and synchronously displays the parsed subtitle text during playback; if a subtitle display prohibition instruction is detected, the first decoding device ignores the subtitle text and plays the parsed first media stream. In one implementation, the streaming media application running on the first decoding device is a live video application containing a subtitle display switch; when the switch is turned on, a subtitle display instruction is detected. The first media stream is generated from live media content; after pulling the second media stream, the live video application synchronously displays the encapsulated subtitle text while the encapsulated first media stream plays, i.e., the live video application plays the live media content and shows the subtitle text during playback. In another implementation, the streaming media application running on the first decoding device is a video session application containing a subtitle display switch; when the switch is turned off, a subtitle display prohibition instruction is detected. The first media stream is generated from session media content; after pulling the second media stream, the video session application ignores the encapsulated subtitle text and parses and plays the encapsulated first media stream, i.e., the video session application plays the session media content.
(4) Second decoding device 204
The second decoding device is a device without subtitle display capability, meaning that it is not provided with hardware or software supporting subtitle display. The streaming media application running on the second decoding device pulls the second media stream from the content distribution network, ignores the subtitle text, and plays the parsed first media stream. Specifically, the application pulls the second media stream and decapsulates it to obtain the first media stream and the subtitle text. The application contains a subtitle display switch; for a second decoding device without subtitle display capability, in one implementation the switch may default to an unavailable state, in which case the second decoding device ignores the subtitle text by default, parses the first media stream, and plays the result. In another implementation the switch may be set to an available state, but whether it is turned on or off, that is, whether a subtitle display instruction or a subtitle display prohibition instruction is detected, the second decoding device still ignores the subtitle text, parses the first media stream, and plays the parsed first media stream. In one implementation, the streaming media application running on the second decoding device is a live video application containing a subtitle display switch; even if the switch is turned on and a subtitle display instruction is detected, because the second decoding device lacks subtitle display capability, the live video application ignores the subtitle text in the second media stream and plays the first media stream in the second media stream, i.e., the live media content. In another implementation, the streaming media application running on the second decoding device is a video session application, and the second decoding device has no hardware or software supporting subtitle display; even if the subtitle display function of the application is turned on and a subtitle display instruction is detected, the video session application ignores the subtitle text in the second media stream and plays the first media stream in the second media stream, i.e., the session media content.
In the embodiments of the present application, the encoding device 201 is configured to generate the first media stream and push it to the content distribution network. The subtitle generating device 202 is configured to synchronize the first media stream with its corresponding subtitle text, form the second media stream from the synchronized subtitle text and first media stream, and push the second media stream back to the content distribution network. The first decoding device 203, which has subtitle display capability, is configured to pull the second media stream and, when a subtitle display instruction is detected, synchronously display the encapsulated subtitle text while playing the encapsulated first media stream; when a subtitle display prohibition instruction is detected, it ignores the subtitle text and directly plays the parsed first media stream. The second decoding device 204, which lacks subtitle display capability, is configured to pull the second media stream and, whether a subtitle display instruction or a subtitle display prohibition instruction is detected, ignore the encapsulated subtitle text, parse the encapsulated first media stream, and play the parsed first media stream. Through the cooperation of the devices in this data processing system, the subtitle generating device 202 achieves real-time synchronization between the first media stream and the subtitle text without re-encoding the subtitle text into the first media stream: the two are decoupled from yet synchronized with each other, and the subtitle text is seamlessly inserted into the first media stream. Because no re-encoding is needed, the overall delay introduced by encoding is avoided, the step in which the subtitle generating device 202 forms the second media stream from the first is very short, and the second media stream is transmitted seamlessly to the decoding devices. Furthermore, when the first decoding device 203 detects a subtitle display instruction, it can synchronously display the real-time subtitle text while playing the first media stream, so the subtitle text is displayed seamlessly on the first decoding device; when it detects a subtitle display prohibition instruction, or when the second decoding device 204 detects either instruction, the presence of the subtitle text has no effect on playback, and the first media stream can be parsed and played directly.
It is to be understood that the data processing system described in the embodiment of the present application is intended to illustrate the technical solution of the embodiment more clearly and does not constitute a limitation on the technical solution provided in the embodiment; a person of ordinary skill in the art can appreciate that, with the evolution of system architectures and the appearance of new service scenarios, the technical solution provided in the embodiment of the present application is equally applicable to similar technical problems.
It should be noted that the data processing method provided in the embodiment of the present application further relates to blockchain technology. A blockchain is a novel application mode of computer technologies such as distributed data storage, P2P (Peer-to-Peer) transmission, consensus mechanisms, and encryption algorithms. The blockchain is essentially a decentralized database: a chain of data blocks associated by means of cryptography, where each data block contains the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain cryptographically ensures that the data cannot be tampered with or forged. For example, the encoding device, the subtitle generating device, and the decoding device may be nodes in a blockchain network; the first media stream, the subtitle text, the second media stream, and the like referred to in the embodiments of the present application may be stored in the blockchain network in the form of blocks, and the first media stream, the second media stream, and the like may be transmitted in the blockchain network, so that, based on the characteristic that blocks in the blockchain cannot be tampered with or forged, the transmission process of the media streams is more secure and reliable.
Fig. 4 is a flowchart illustrating a data processing method provided by an exemplary embodiment of the present application, which may be executed by the subtitle generating device 202 in the data processing system shown in fig. 2; the data processing method may include the following steps S401 to S405:
S401, pulling a first media stream to be pushed, and extracting an audio stream from the first media stream.
The first media stream is encoded from media content. The media content may include video content and audio content, and the first media stream may include a video stream and an audio stream, the video stream being encoded with the video content and the audio stream being encoded with the audio content. After pulling the first media stream to be streamed, the audio stream may be extracted from the first media stream.
The first media stream may be pulled from the content distribution network, and the first media stream is streamed to the content distribution network by the encoding device. The encoding device runs a streaming media application (e.g., a live video application, a session video application, etc.), and the encoding device may encode media content (e.g., live media content, session media content, etc.) acquired by the streaming media application to obtain a first media stream corresponding to the media content, and push the first media stream to a content distribution network. In this way, the subtitle generating device and the encoding device are two independent devices, the first media stream is pushed to the content distribution network by the encoding device, and the subtitle generating device can pull the first media stream from the content distribution network and extract the audio stream from the first media stream.
The first media stream may also be encoded from captured media content. In this manner, the subtitle generating device and the encoding device are integrated in the same device, and the subtitle generating device may be integrated in the encoding device as a subtitle generating module. The encoding device runs a streaming media application (e.g., a live video application, a video session application, etc.) and can encode the media content (e.g., live media content, session media content, etc.) acquired by the streaming media application to obtain a first media stream corresponding to the media content; the subtitle generating module of the encoding device can then pull the first media stream to be pushed and extract the audio stream from the first media stream.
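For illustration only, the pulling-and-extraction step might be sketched in Python as follows, assuming the PyAV bindings for FFmpeg are available; the pull address is hypothetical, and the embodiment does not mandate any particular library:

    import av  # PyAV bindings for FFmpeg (an assumption; any demuxer would do)

    def extract_audio_frames(pull_url):
        # Pull the first media stream (e.g., from the content distribution
        # network) and separate the audio stream while demuxing.
        container = av.open(pull_url)
        audio_frames = []
        for packet in container.demux():
            if packet.stream.type == "audio":
                # Decode only the audio packets; the video packets are left
                # untouched, since only the audio stream is needed for recognition.
                audio_frames.extend(packet.decode())
        return audio_frames

    # Hypothetical pull address for illustration only.
    frames = extract_audio_frames("rtmp://cdn.example.com/live/stream-1")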
S402, performing identification processing on the audio stream to obtain a subtitle text corresponding to the audio stream.
The identification process of the audio stream is performed in segments. The N audio segments can be sequentially intercepted from the audio stream, and the N audio segments are sequentially identified to obtain M groups of subtitle information and the time offset of each group of subtitle information, where the M groups of subtitle information and the time offset of each group of subtitle information together form a subtitle text, and N, M are positive integers. The recognition processing of the audio segment may refer to performing speech recognition processing on the audio segment by using a speech recognition model.
Wherein, the time length of each audio clip can be the same or different; that is, the audio stream may be intercepted according to a fixed duration to obtain N audio segments with the same duration; the audio stream may also be randomly intercepted, and the duration of the intercepted N audio segments is not necessarily the same. Any one of the N audio segments is represented as the ith audio segment, the ith audio segment corresponds to K groups of subtitle information, i is a positive integer less than or equal to N, and K is a positive integer less than or equal to M; that is, the subtitle information identified for an audio clip may be in one or more groups. Any one group of subtitle information in the M groups of subtitle information is represented as jth group of subtitle information, the time offset of the jth group of subtitle information refers to the offset of a start time stamp of the jth group of subtitle information relative to a start time stamp of the audio stream, and j is a positive integer less than or equal to M; for example, if the start time stamp of the jth subtitle information is 158663 and the start time stamp of the audio stream is 158660, the time offset of the jth subtitle information is 3.
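As a non-authoritative sketch of this segmented recognition (plain Python; the fixed segment length and the recognize_segment placeholder are assumptions, since the embodiment allows segments of equal or unequal duration and any speech recognition model):

    def recognize_segment(segment):
        # Placeholder for a speech recognition model: returns a list of
        # (subtitle_info, start_timestamp) pairs recognized in the segment
        # (one audio segment may yield one or more groups of subtitle information).
        raise NotImplementedError("plug in an ASR model here")

    def build_subtitle_text(samples, audio_start_ts, segment_len=5000):
        # Intercept N audio segments in sequence (a fixed duration is used
        # here for simplicity; random-length interception is also possible).
        subtitle_text = []
        for seg_start in range(0, len(samples), segment_len):
            segment = samples[seg_start:seg_start + segment_len]
            for info, start_ts in recognize_segment(segment):
                # The time offset of a group is its start timestamp minus the
                # start timestamp of the audio stream, e.g. 158663 - 158660 = 3.
                subtitle_text.append({"info": info,
                                      "offset": start_ts - audio_start_ts})
        return subtitle_text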
S403, synchronizing between the subtitle text and the first media stream.
The synchronization may be performed in either of the following two manners: adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. The two synchronization manners are described below:
(1) Adding the subtitle text to the custom field of the first media stream.
The synchronization method for adding the subtitle text into the custom field of the first media stream may be as follows: determining M video frames respectively matched with the M groups of subtitle information in the video stream according to the time offset of the M groups of subtitle information; respectively constructing custom fields at the associated positions of the M matched video frames; and adding M groups of subtitle information and the time offset of each group of subtitle information into the corresponding custom field.
Any group of subtitle information in the M groups is represented as the jth group of subtitle information, and the video frame in the video stream matching the jth group of subtitle information may refer to the video frame whose timestamp is the same as the start timestamp of the jth group of subtitle information. The video stream includes a plurality of video frames, each corresponding to a timestamp. The start timestamp of the jth group of subtitle information is determined according to the time offset of the jth group of subtitle information and the start timestamp of the audio stream; specifically, the start timestamp of the jth group equals the start timestamp of the audio stream plus the time offset of the jth group. For example, if the start timestamp of the audio stream is 158600 and the time offset of the jth group of subtitle information is 8, the start timestamp of the jth group of subtitle information is 158608. Then, the video frame in the video stream whose timestamp is the same as this start timestamp can be determined as the video frame matching the jth group of subtitle information.
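A minimal sketch of this matching rule (plain Python; the dictionary-based frame representation is an assumption made for illustration):

    def find_matching_frame(video_frames, audio_start_ts, offset_j):
        # Start timestamp of the j-th group of subtitle information equals the
        # start timestamp of the audio stream plus its time offset,
        # e.g. 158600 + 8 = 158608 in the example above.
        start_ts_j = audio_start_ts + offset_j
        for frame in video_frames:
            # The matching video frame is the one whose timestamp equals the
            # start timestamp of the j-th group of subtitle information.
            if frame["timestamp"] == start_ts_j:
                return frame
        return None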
The video stream may comprise a plurality of video frames, and the plurality of video frames are sequentially encapsulated in a plurality of video code stream units; any one group of subtitle information in the M groups is represented as the jth group of subtitle information, and the video frame matched with the jth group of subtitle information is encapsulated in the target video code stream unit. Constructing the custom field at the associated position of the video frame matching the jth group of subtitle information may refer to: constructing a target subtitle code stream unit at the associated position of the target video code stream unit, and configuring the target subtitle code stream unit as the custom field, where the target subtitle code stream unit is a network abstraction layer unit whose type label in the header information is the subtitle label (namely, '0x6').
It should be noted that the associated position of the target video code stream unit may include: the position between the reference video code stream unit and the target video code stream unit, where the reference video code stream unit is the video code stream unit used to encapsulate the video frame preceding the video frame matched with the jth group of subtitle information. Fig. 5a is a schematic diagram illustrating the determination process of an associated position according to an exemplary embodiment of the present application. As shown in fig. 5a, the target video code stream unit 501 encapsulates the video frame matching the jth group of subtitle information, the reference video code stream unit 502 encapsulates the preceding video frame, the position between the target video code stream unit 501 and the reference video code stream unit 502 is the associated position of the target video code stream unit 501, and the target subtitle code stream unit 503 is constructed at this associated position.
After a target subtitle code stream unit is constructed at the associated position of the target video code stream unit and configured as the custom field, adding the jth group of subtitle information and the time offset of the jth group of subtitle information to the corresponding custom field may refer to: encapsulating the jth group of subtitle information and the time offset of the jth group of subtitle information in the target subtitle code stream unit. Further, this may include: encapsulating the jth group of subtitle information in the actual transmission load of the target subtitle code stream unit, and encapsulating the time offset of the jth group of subtitle information in the header information of the target subtitle code stream unit; alternatively, the start timestamp of the jth group of subtitle information may be calculated from the time offset of the jth group of subtitle information and the start timestamp of the audio stream, the jth group of subtitle information may be encapsulated in the actual transmission load of the target subtitle code stream unit, and the start timestamp of the jth group of subtitle information may be encapsulated in the header information of the target subtitle code stream unit.
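A sketch of constructing such a unit, with an assumed byte layout (one type-label byte followed by a 4-byte big-endian time offset); the embodiment does not prescribe this exact layout:

    import json
    import struct

    SUBTITLE_TYPE_LABEL = 0x6  # type label marking a subtitle code stream unit

    def build_subtitle_unit(subtitle_info_j, time_offset_j):
        # Header information: the type label plus the time offset of the
        # j-th group of subtitle information (illustrative layout only).
        header = struct.pack(">BI", SUBTITLE_TYPE_LABEL, time_offset_j)
        # Actual transmission load: the j-th group of subtitle information.
        payload = json.dumps(subtitle_info_j, ensure_ascii=False).encode("utf-8")
        return header + payload

    def insert_at_associated_position(units, target_index, subtitle_unit):
        # Place the custom field between the reference video code stream unit
        # (at target_index - 1) and the target video code stream unit.
        return units[:target_index] + [subtitle_unit] + units[target_index:]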
Any one of the M groups of subtitle information is represented as the jth group of subtitle information, and the jth group of subtitle information may include translated text in one or more language types. Encapsulating the jth group of subtitle information in the actual transmission load of the target subtitle code stream unit may refer to: encapsulating the translated text in each language type together with the corresponding language type in the actual transmission load of the target subtitle code stream unit according to a target format. The target format mentioned in the embodiment of the application may be the JSON format, which is a lightweight, development-language-independent data storage format characterized by easy addition of encapsulated content and a clear hierarchical structure, among other things. Specifically, when the jth group of subtitle information includes translated text in one language type, this may refer to: encapsulating the key-value pair consisting of the language type and the subtitle text in that language type in the actual transmission load of the target subtitle code stream unit. When the jth group of subtitle information includes translated texts in multiple language types, this may refer to: encapsulating the translated texts in the multiple language types in the actual transmission load of the target subtitle code stream unit by means of multiple key-value pairs; that is, each language type is used as a keyword and the translated text in that language type as the value of the keyword, so that each of the multiple language types corresponds to one key-value pair, multiple key-value pairs are generated, and the multiple key-value pairs are encapsulated in the actual transmission load of the target subtitle code stream unit. Encapsulating in the form of key-value pairs makes the encapsulation process of the subtitle information more standard and convenient.
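For example, a group of subtitle information carrying translated text in two language types might be serialized as follows (the language codes and sentences are illustrative only):

    import json

    # One key-value pair per language type: keyword = language type,
    # value = translated text in that language type.
    subtitle_info_j = {
        "zh-CN": "大家好，欢迎观看",
        "en-US": "Hello everyone, welcome to the stream",
    }
    payload = json.dumps(subtitle_info_j, ensure_ascii=False).encode("utf-8")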
In summary, the synchronization manner, synchronization process, and synchronization result of adding the subtitle text to the custom field of the first media stream can be seen in fig. 5b. Fig. 5b is a schematic diagram illustrating a synchronization manner between the subtitle text and the first media stream according to an exemplary embodiment of the present application: the white squares in fig. 5b represent the multiple video code stream units; the M groups of subtitle information and the time offset of each group of subtitle information are encapsulated in the corresponding subtitle code stream units to obtain M subtitle code stream units, represented by the gray squares in fig. 5b; and the M subtitle code stream units are added to the video stream at the associated positions of the corresponding video code stream units.
(2) Encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream.
The synchronous manner for encapsulating the caption text into the caption encapsulation packet synchronized with the first media stream may be: acquiring the time offset of a first group of subtitle information in the subtitle text; determining a video frame matched with the first group of subtitle information in the video stream according to the time offset of the first group of subtitle information; and aligning the first group of subtitle information with the determined video frame.
The video frames in the video stream that match the first set of subtitle information may refer to: and video frames in the video stream with the same time stamp as the starting time stamp of the first group of subtitle information. The video stream comprises a plurality of video frames, and each video frame corresponds to a time stamp; determining a start time stamp of the first set of subtitle information according to the time offset of the first set of subtitle information and the start time stamp of the audio stream; for example, the start timestamp of the first set of subtitle information is equal to the sum of the start timestamp of the audio stream and the time offset of the first set of subtitle information; a video frame having a timestamp that is the same as the starting timestamp of the first set of subtitle information may then be determined in the video stream, and a video frame having a timestamp that is the same as the starting timestamp of the first set of subtitle information in the video stream may be determined as a video frame that matches the first set of subtitle information.
Aligning the first group of subtitle information with the determined video frame may refer to: adding a synchronization tag to the first group of subtitle information and to the video frame matched with the first group of subtitle information. It can be understood that a synchronization tag is added to the subtitle code stream unit encapsulating the first group of subtitle information and its time offset, and the same synchronization tag is added to the video code stream unit encapsulating the video frame matched with the first group of subtitle information; in this way, while the video frame encapsulated in the tagged video code stream unit is being displayed, the subtitle information encapsulated in the tagged subtitle code stream unit can be displayed synchronously.
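A minimal sketch of this alignment, assuming the units are represented as dictionaries (an assumption for illustration; the tag value is arbitrary):

    def align_first_group(subtitle_unit, matching_video_unit, tag="sync-0"):
        # Add the same synchronization tag to the subtitle code stream unit
        # encapsulating the first group of subtitle information and to the
        # video code stream unit encapsulating the matching video frame;
        # playback later pairs the two units by this tag.
        subtitle_unit["sync_tag"] = tag
        matching_video_unit["sync_tag"] = tag
        return subtitle_unit, matching_video_unit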
S404, encapsulating the synchronized subtitle text and the first media stream to form a second media stream.
Corresponding to the above two manners of synchronizing the subtitle text with the first media stream, encapsulating the synchronized subtitle text and the first media stream to form the second media stream likewise includes two cases:
(1) For the synchronization manner of adding the subtitle text to the custom field of the first media stream, the encapsulation process may include: encapsulating the M subtitle code stream units and the first media stream to form the second media stream. Specifically, the first media stream may include a video stream and an audio stream; after synchronization, the subtitle text and the first media stream consist of the audio stream and a mixed media stream composed of the subtitle code stream units and the video code stream units. The audio stream can be encapsulated in an audio encapsulation packet, the mixed media stream composed of the subtitle code stream units and the video code stream units can be encapsulated in a video encapsulation packet, and the audio encapsulation packet and the video encapsulation packet are encapsulated to form the second media stream. Fig. 5c illustrates a structural diagram of an encapsulation file of the second media stream according to an exemplary embodiment of the present application; as shown in fig. 5c, the audio stream is encapsulated in the audio encapsulation packet, and the mixed media stream composed of subtitle code stream units (the gray squares in fig. 5c) and video code stream units (the white squares in fig. 5c) is encapsulated in the video encapsulation packet.
(2) For the synchronization manner of encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream, the encapsulation process can refer to fig. 5d. Fig. 5d illustrates a schematic structural diagram of an encapsulation file of the second media stream according to another exemplary embodiment of the present application; as shown in fig. 5d, the video stream is encapsulated in a video encapsulation packet, the audio stream is encapsulated in an audio encapsulation packet, the M groups of aligned subtitle information and the time offset of each group of subtitle information are encapsulated in a subtitle encapsulation packet, and the subtitle encapsulation packet, the video encapsulation packet, and the audio encapsulation packet are encapsulated to form the second media stream.
The M groups of subtitle information and the time offset of each group of subtitle information can be encapsulated in the subtitle encapsulation packet in the form of M key-value pairs. Specifically, a time offset may be used as a keyword and the subtitle information corresponding to that time offset as the value of the keyword, so that M key-value pairs are generated from the M groups of subtitle information and the time offset of each group, and the M key-value pairs are encapsulated in the subtitle encapsulation packet.
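A sketch of building such a packet (plain Python; assumes the per-group time offsets are distinct, since each offset serves as a keyword):

    def build_subtitle_packet(subtitle_text):
        # One key-value pair per group of subtitle information:
        # keyword = time offset, value = the corresponding subtitle information.
        return {str(group["offset"]): group["info"] for group in subtitle_text}

    # Example with two groups at offsets 3 and 8 (values are illustrative).
    packet = build_subtitle_packet([
        {"offset": 3, "info": {"en-US": "Hello everyone"}},
        {"offset": 8, "info": {"en-US": "Welcome to the stream"}},
    ])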
S405, pushing the second media stream to the decoding device.
After the synchronized subtitle text and the first media stream are encapsulated to form a second media stream, the second media stream may be directly pushed to a decoding device, so that the decoding device decapsulates the second media stream. Or after the synchronized subtitle text and the first media stream are encapsulated to form a second media stream, the second media stream may also be pushed to a content distribution network, so that the decoding device pulls the second media stream from the content distribution network.
In the embodiment of the application, a custom field is constructed at the associated position of the video frame matched with the subtitle information, and the subtitle information and its time offset are added to the custom field; the subtitle text does not need to be re-encoded and embedded into the first media stream, real-time synchronization between the subtitle text and the first media stream can be achieved, the first media stream and the subtitle text are decoupled from and synchronized with each other, and seamless access of the subtitle text is realized. Likewise, when the first group of subtitle information in the subtitle text is aligned with its matching video frame, the subtitle text again does not need to be re-encoded into the first media stream, so real-time synchronization and mutual decoupling between the subtitle text and the first media stream are equally achieved. Moreover, synchronizing the subtitle text with the first media stream in this way leaves the original first media stream unchanged, which reduces the load of re-encoding the first media stream and eliminates side effects of re-encoding such as delay and damage to the image quality of the video frames. In addition, the network abstraction layer unit is the basic unit in the transmission of a video stream, and the embodiment of the application encapsulates the subtitle information using a network abstraction layer unit of the SEI type (namely, the subtitle type), so that the synchronization process between the subtitle text and the first media stream has higher compatibility and lower complexity.
Fig. 6 shows a flowchart of a data processing method provided in another exemplary embodiment of the present application, which may be performed by a decoding device (e.g., the first decoding device 203 or the second decoding device 204 in the data processing system shown in fig. 2); the data processing method may include the following steps S601 to S604:
S601, pulling the second media stream.
The second media stream is obtained by encapsulating the first media stream and the subtitle text synchronized with the first media stream; the synchronization between the first media stream and the subtitle text includes adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream; the subtitle text is obtained by performing recognition processing on the audio stream in the first media stream. For the specific generation process of the second media stream, reference may be made to the description of the embodiment shown in fig. 4, which is not repeated here.
S602, decapsulating the second media stream to obtain the first media stream and the subtitle text.
After the second media stream is pulled, the second media stream may be decapsulated to obtain the first media stream and the subtitle text synchronized with the first media stream. The first media stream may include a video stream and an audio stream; the video stream may include a plurality of video frames; the subtitle text may include M sets of subtitle information and a time offset for each set of subtitle information, M being a positive integer. In one implementation, the second media stream may be obtained by encapsulating an audio encapsulation packet and a video encapsulation packet; the audio encapsulating packet is used for encapsulating the audio stream; the video packaging packet is used for packaging a plurality of video code stream units and M subtitle code stream units; the video stream comprises a plurality of video frames which are sequentially encapsulated in a plurality of video stream units; m groups of subtitle information and the time offset of each group of subtitle information are encapsulated in M subtitle code stream units; and de-encapsulating the second media stream to obtain a plurality of video code stream units and M subtitle code stream units. In another implementation, the second media stream may be obtained by encapsulating an audio encapsulation packet, a video encapsulation packet, and a subtitle encapsulation packet, where the audio encapsulation packet is used to encapsulate an audio stream, the video encapsulation packet is used to encapsulate a video stream, and the subtitle encapsulation packet is used to encapsulate a subtitle text; and decapsulating the second media stream to obtain the first media stream and the subtitle text.
S603, if a subtitle display instruction is detected, parsing the first media stream and the subtitle text.
S604, playing the parsed first media stream, and synchronously displaying the parsed subtitle text during the playing of the first media stream.
In steps S603 to S604, if a subtitle display instruction is detected, the subtitle display capability may be checked. If subtitle display capability is detected, the first media stream and the subtitle text can be parsed, the parsed first media stream played, and the parsed subtitle text displayed synchronously during the playing of the first media stream. If it is detected that there is no subtitle display capability, the subtitle text can be ignored, the first media stream parsed, and the parsed first media stream played. Having subtitle display capability means being provided with hardware or software supporting subtitle display; lacking subtitle display capability means having no hardware or software supporting subtitle display. In this way, when a subtitle display instruction is detected and the decoding device is detected to have subtitle display capability, the subtitle text can be displayed seamlessly and synchronously in real time while the first media stream is played.
In one implementation, if the synchronization manner of adding the subtitle text to the custom field of the first media stream in the embodiment shown in fig. 4 is adopted, a specific implementation of parsing the first media stream and the subtitle text may include: parsing the current code stream unit to be processed in the first media stream to obtain the type label in the header information of the current code stream unit; if the type label in the header information of the current code stream unit is the subtitle label, determining that the current code stream unit is a subtitle code stream unit; if the current code stream unit is a subtitle code stream unit, extracting the corresponding group of subtitle information and the time offset of the extracted subtitle information from the current code stream unit; and parsing the next code stream unit to obtain the video frame matched with the extracted subtitle information. Accordingly, a specific implementation of playing the parsed first media stream and synchronously displaying the parsed subtitle text during playback may include: displaying the extracted subtitle information while playing the parsed video frame. After the current code stream unit and the next code stream unit have been parsed, the remaining code stream units in the first media stream can continue to be parsed.
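A condensed sketch of this decoder-side loop, including the skip branch used when subtitles are off or unsupported (plain Python; the decode/render/display helpers are placeholders, not part of the embodiment):

    SUBTITLE_TYPE_LABEL = 0x6  # type label of a subtitle code stream unit

    def decode(unit):
        # Placeholder: decode a video code stream unit into a video frame.
        return unit["payload"]

    def render(frame):
        print("play frame:", frame)

    def display(info, offset):
        print("show subtitle:", info, "(time offset", offset, ")")

    def play_first_media_stream(units, show_subtitles):
        pending = None
        for unit in units:
            if unit["type"] == SUBTITLE_TYPE_LABEL:
                if show_subtitles:
                    # Extract the group of subtitle information and its offset.
                    pending = (unit["info"], unit["offset"])
                # Without subtitle display (instruction prohibited or no
                # capability), the subtitle code stream unit is simply skipped.
                continue
            frame = decode(unit)
            render(frame)
            if pending is not None:
                # Display the subtitle information while its matching frame plays.
                display(*pending)
                pending = None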
In another implementation manner, in the process of synchronizing by using the synchronization manner of encapsulating the subtitle text into the subtitle encapsulation packet synchronized with the first media stream in the embodiment shown in fig. 4, the first group of subtitle information in the subtitle text and the video frames matched with the first group of subtitle information are aligned, both the first group of subtitle information and the video frames matched with the first group of subtitle information have synchronization tags, the video stream is encapsulated in the video encapsulation packet, the audio stream is encapsulated in the audio encapsulation packet, and the subtitle text is encapsulated in the subtitle encapsulation packet. Then, the specific implementation of playing the parsed first media stream and synchronously displaying the parsed subtitle text during the playing process of the first media stream may include: and playing the audio stream obtained by analyzing the audio packaging packet and the video stream obtained by analyzing the video packaging packet, and synchronously displaying the caption text obtained by analyzing the caption packaging packet in the playing process. Specifically, when the video encapsulation packet is analyzed to obtain a video frame matched with the first group of subtitle information, the first group of subtitle information with the same synchronous label as the video frame matched with the first group of subtitle information can be searched in the subtitle text, and the first group of subtitle information can be synchronously displayed in the process of playing the video frame matched with the first group of subtitle information; because the first group of subtitle information and the video frames matched with the first group of subtitle information are aligned, each group of subtitle information in the subsequent groups of subtitle information of the subtitle texts is aligned with the video frames matched with the subtitle information, and the subtitle texts can be synchronously displayed in the process of playing the first media stream.
It should be noted that a group of subtitle information may include one or more language types and the translated text in each language type, and each language type together with the translated text in that language type is encapsulated in the subtitle code stream unit in the form of a key-value pair; that is, one or more key-value pairs are encapsulated in one subtitle code stream unit, each key-value pair corresponding to one language type. When a group of subtitle information includes one language type, displaying the parsed subtitle information means obtaining the translated text from the key-value pair corresponding to that language type and displaying it. When a group of subtitle information includes multiple language types, the target language type needs to be obtained before the parsed subtitle information is displayed, so that the translated text in the target language type can be obtained from the key-value pair corresponding to the target language type and displayed. The target language type may be custom-set by the user, for example as the operating system language type of the decoding device or as the language type of the streaming media application (e.g., a live video application, a video session application, etc.); alternatively, the target language type may be set by default by the operating system of the decoding device or by the streaming media application (e.g., a live video application, a video session application, etc.).
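A sketch of the language selection step (plain Python; the fallback behavior shown is an assumption, since the embodiment does not specify what happens when the target language type is absent):

    def pick_translated_text(subtitle_info, target_language):
        # subtitle_info maps language types to translated texts, one key-value
        # pair per language type; the target language type may come from user
        # settings, the operating system, or the streaming media application.
        if target_language in subtitle_info:
            return subtitle_info[target_language]
        # Assumed fallback: show the first available language type.
        return next(iter(subtitle_info.values()), None)

    text = pick_translated_text(
        {"zh-CN": "大家好", "en-US": "Hello everyone"}, "en-US")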
If a subtitle display prohibition instruction is detected, the subtitle text can be ignored, the first media stream parsed, and the parsed first media stream played. In this case, because the subtitle text has not been re-encoded into the first media stream, the subtitle text in the second media stream does not affect the decoding process of the first media stream, and the decoding device can decode and play the first media stream normally when the subtitle display prohibition instruction is detected.
The specific implementation of ignoring the subtitle text and parsing the first media stream may include: reading the header information of a current code stream unit to be processed in a first media stream, wherein the header information is packaged with a type label; if the type label in the header information is a subtitle label, skipping the current code stream unit; and analyzing the next code stream unit of the current code stream unit to obtain the video frame. The specific implementation of playing the parsed first media stream may include: and playing the video frame obtained by analysis. After the current code stream unit and the next code stream unit of the current code stream unit are analyzed, the residual code stream units in the first media stream can be analyzed continuously.
The second media stream may be a media stream received by a streaming media application; taking the streaming media application being a live video application as an example, the second media stream is a live video stream received by the live video application. In one implementation, the live video application includes a subtitle display switch. If the subtitle display switch is turned on (for example, a user of the live video application actively turns it on), it is determined that a subtitle display instruction is detected, and the subtitle display capability can then be checked. If the decoding device has subtitle display capability, the subtitle text can be displayed synchronously while the first media stream is played; if it does not, the subtitle text in the second media stream can be ignored and the first media stream directly parsed and played. In another implementation, if the subtitle display switch is turned off (for example, a user of the live video application actively turns it off), it is determined that a subtitle display prohibition instruction is detected, and the subtitle text in the second media stream can be ignored and the first media stream directly parsed and played. In this way, each user of the live video application can turn the subtitle display switch on or off according to personal preference without affecting other users of the live video application, which effectively improves the live broadcast effect and the use experience of the live video application.
For the scheme of synchronizing the subtitle text with the first media stream by adding the subtitle text to the custom field of the first media stream in the embodiment shown in fig. 4, the data processing flow on the decoding device side may be summarized as the flowchart shown in fig. 7. Fig. 7 is a schematic diagram illustrating the data processing flow on the decoding device side according to an exemplary embodiment of the present application; taking a decoding device with subtitle display capability as shown in fig. 7 as an example: first, after the second media stream is pulled, the video encapsulation packet encapsulated in the second media stream may be transmitted to the video buffer of the decoding device. Second, the current code stream unit to be processed is read from the first media stream encapsulated in the video encapsulation packet, the type label in its header information is obtained, and it is judged whether the type label is the subtitle label. If the type label in the header information of the current code stream unit is not the subtitle label, the current code stream unit is parsed and the parsed video frame is played. If the type label is the subtitle label, it is further checked whether a subtitle display instruction is detected. If a subtitle display instruction is detected, the decoding device extracts the corresponding group of subtitle information and the time offset of the extracted subtitle information from the current code stream unit, parses the next code stream unit to obtain the video frame matched with the extracted subtitle information, and displays the extracted subtitle information while playing the parsed video frame. If a subtitle display prohibition instruction is detected, the current code stream unit can be skipped, the next code stream unit parsed to obtain the video frame, and the parsed video frame played. After the second media stream has been parsed, it can be judged whether a new media stream exists in the content distribution network to be parsed; if so, the new media stream is pulled for data processing; if not, the data processing flow ends.
In the embodiment of the application, when a subtitle display instruction is detected and subtitle display capability is available, the first media stream and the subtitle text can be parsed, the parsed first media stream played, and the parsed subtitle text displayed synchronously during playback, achieving seamless, real-time synchronous display of the subtitle text. When a subtitle display prohibition instruction is detected, or a subtitle display instruction is detected but subtitle display capability is unavailable, the subtitle text can be ignored, the first media stream parsed, and the parsed first media stream played; in these cases, because the subtitle text is not re-encoded into the first media stream, the subtitle text in the second media stream does not affect the decoding process of the first media stream, ensuring that the first media stream can be decoded and played normally. In addition, a user of a streaming media application (such as a live video application) can independently turn the subtitle display switch on or off in the live video application without affecting other users of the live video application, which effectively improves the live broadcast effect and the use experience of the live video application.
The foregoing describes a specific implementation process of the data processing method and a data processing system suitable for implementing it; the following describes applicable scenarios of the data processing method:
(1) Live video scene.
Fig. 8a shows an application diagram of the data processing method in a live video scene according to an exemplary embodiment of the present application. As shown in fig. 8a, in the live video scene, the encoding device is the anchor terminal used by the anchor user and runs a live video application, and the decoding device is a viewer terminal used by a viewer user and also runs the live video application. First, the live video application of the anchor terminal collects live media content and displays it in the video playing interface 81 of the live video application of the anchor terminal; the live video application of the anchor terminal encodes the collected live media content to form a first media stream, which includes a video stream encoded from the video content in the live media content and an audio stream encoded from the audio content in the live media content; the live video application of the anchor terminal then pushes the first media stream to the content distribution network. Second, the subtitle generating device can pull the first media stream from the content distribution network, generate a subtitle text synchronized with the first media stream, and encapsulate the synchronized subtitle text and the first media stream to obtain a second media stream; this process is described in detail in the embodiment shown in fig. 4 above. Then, the live video application of the viewer terminal may pull the second media stream from the content distribution network and decapsulate it to obtain the first media stream and the subtitle text. The viewer terminal is provided with hardware or software supporting subtitle display, and the video playing interface of the live video application of the viewer terminal can include a subtitle display switch 821. If the subtitle display switch 821 is turned on by the viewer user, the subtitle text can be displayed synchronously in the subtitle display area 822 of the video playing interface 82 while the first media stream is played in the video playing interface 82 of the live video application of the viewer terminal; during playback, the viewer user may also choose to turn the subtitle display switch 821 off or on at any time according to his or her own needs, to hide or synchronously display the subtitle text. In this way, the live broadcast effect of the live video application can be effectively improved, and the use experience of the live video application can be effectively improved.
(2) Video session scene.
Fig. 8b is a schematic diagram illustrating an application of a data processing method in a video session scenario according to an exemplary embodiment of the present application, and as shown in fig. 8b, in the video session scenario, taking a two-person video session as an example, terminals used by two parties of the video session are both an encoding device and a decoding device; in the process of transmitting session media content of a first session user to a second terminal used by a second session user by a first terminal used by the first session user, an encoding device is the first terminal used by the first session user, a video session application runs in the first terminal, a decoding device is the second terminal used by the second session user, and a video session application runs in the second terminal; in the process of transmitting session media content of a second session user to a first terminal used by a first session user by a second terminal used by the second session user, an encoding device is the second terminal used by the second session user, a video session application is operated in the second terminal, a decoding device is the first terminal used by the first session user, and a video session application is operated in the first terminal.
In the process of transmitting session media content of a first session user to a second terminal used by a second session user by a first terminal used by the first session user, firstly, a video session application of the first terminal collects the session media content of the first session user, the collected session media content is displayed in a session interface 83 of the video session application of the first terminal, the video session application of the first terminal encodes the collected session media content of the first session user to form a first media stream, and the first media stream comprises a video stream encoded by video content in the session media content of the first session user and an audio stream encoded by audio content in the session media content of the first session user; the video session application of the first terminal pushes the first media stream to the content distribution network. Secondly, the caption generating device can pull the first media stream from the content distribution network, generate a caption text synchronous with the first media stream, and package the synchronous caption text and the first media stream to obtain a second media stream; this process can be seen in the detailed description of the embodiment shown in fig. 4 above. Then, the video session application of the second terminal may pull the second media stream from the content distribution network, and decapsulate the second media stream to obtain the first media stream and the subtitle text. The second terminal is provided with hardware or software supporting subtitle display, and a subtitle display switch 841 can be included in the session interface 84 of the video session application of the second terminal; if the subtitle display switch 841 is turned on by the second session user, subtitle texts may be synchronously displayed in the subtitle display area 842 of the session interface 84 during the process of playing the first media stream in the session interface 84; during the process of playing the first media stream in the session interface 84, the second session user may also autonomously select to close or open the subtitle display switch 841 according to his own needs, so as to close the display of the subtitle text or synchronously display the subtitle text.
Similarly, in the process of transmitting the session media content of the second session user to the first terminal used by the first session user by the second terminal used by the second session user, firstly, the video session application of the second terminal collects the session media content of the second session user, the collected session media content is displayed in the session interface 84 of the video session application of the second terminal, the video session application of the second terminal encodes the collected session media content of the second session user to form a third media stream, and the third media stream includes a video stream encoded by the video content in the session media content of the second session user and an audio stream encoded by the audio content in the session media content of the second session user; the video session application of the second terminal streams the third media stream to the content distribution network. Secondly, the subtitle generating device can pull the third media stream from the content distribution network, generate a subtitle text synchronous with the third media stream, and package the synchronous subtitle text and the third media stream to obtain a fourth media stream; this process can be seen in the description above of forming the second media stream from the first media stream in the embodiment shown in fig. 4. Then, the video session application of the first terminal may pull the fourth media stream from the content distribution network, and decapsulate the fourth media stream to obtain the third media stream and the subtitle text. The first terminal is provided with hardware or software supporting subtitle display, and a subtitle display switch 831 can be included in the session interface 83 of the video session application of the first terminal; if the caption display switch 831 is turned on by the first session user, the caption text can be synchronously displayed in the caption display area 832 of the session interface 83 during the playing of the third media stream in the session interface 83; during the playing of the third media stream in the session interface 83, the first session user may also independently select to turn off or turn on the subtitle display switch 831 to turn off the display of the subtitle text or synchronously display the subtitle text according to his/her needs. By the method, the session effect of the video session application can be effectively improved, and the use experience of the video session application can be effectively improved.
While the method of the embodiments of the present application has been described in detail above, to facilitate better implementation of the above-described aspects of the embodiments of the present application, the apparatus of the embodiments of the present application is provided below accordingly.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment of the present application, where the data processing apparatus 90 may be disposed in a data processing device, and the data processing device may be a subtitle generating device 202 in the data processing system shown in fig. 2; the data processing device 90 may be used to perform the corresponding steps in the method embodiment shown in fig. 4, and the data processing device 90 may comprise the following units:
a stream pulling unit 901, configured to pull a first media stream to be pushed, and extract an audio stream from the first media stream;
the processing unit 902 is configured to perform identification processing on the audio stream to obtain a subtitle text corresponding to the audio stream;
a processing unit 902, further configured to synchronize between the subtitle text and the first media stream; the synchronization comprises adding caption texts into a self-defined field of the first media stream or packaging the caption texts into a caption packaging packet which is synchronous with the first media stream;
the processing unit 902 is further configured to encapsulate the synchronized subtitle text and the first media stream to form a second media stream;
a stream pushing unit 903, configured to push the second media stream to the decoding apparatus.
In an implementation manner, the processing unit 902 is configured to perform identification processing on an audio stream to obtain a subtitle text corresponding to the audio stream, and specifically configured to perform the following steps:
sequentially intercepting N audio segments from an audio stream;
sequentially identifying the N audio clips to obtain M groups of subtitle information and the time offset of each group of subtitle information, wherein the M groups of subtitle information and the time offset of each group of subtitle information form a subtitle text;
any one of the N audio clips is represented as the ith audio clip, and the ith audio clip corresponds to K groups of subtitle information; any one group of subtitle information in the M groups of subtitle information is represented as jth group of subtitle information, and the time offset of the jth group of subtitle information refers to the offset of the start timestamp of the jth group of subtitle information relative to the start timestamp of the audio stream; i. j, N, M and K are positive integers, i is less than or equal to N, K is less than or equal to M, and j is less than or equal to M.
In one implementation, a first media stream includes an audio stream and a video stream; the processing unit 902 is configured to, when synchronizing between a subtitle text and a first media stream, specifically perform the following steps:
determining M video frames respectively matched with the M groups of subtitle information in the video stream according to the time offset of the M groups of subtitle information;
constructing custom fields at the associated positions of the matched M video frames respectively; and adding M groups of subtitle information and the time offset of each group of subtitle information into the corresponding self-defined field.
In one implementation mode, the video stream comprises a plurality of video frames, the video frames are sequentially encapsulated in a plurality of video code stream units in sequence, and the video frame matched with the jth group of subtitle information is encapsulated in a target video code stream unit;
the processing unit 902 is configured to, when a custom field is constructed at an associated position of a video frame matched with the jth group of subtitle information, specifically execute the following steps: constructing a target subtitle code stream unit at the relevant position of the target video code stream unit; configuring a target caption code stream unit into a custom field;
the processing unit 902 is configured to, when adding the jth group of subtitle information and the time offset of the jth group of subtitle information to the corresponding custom field, specifically execute the following steps: packaging the jth group of subtitle information and the time offset of the jth group of subtitle information into a target subtitle code stream unit;
wherein, the relevant position of the target video code stream unit comprises: the position between the reference video code stream unit and the target video code stream unit; the reference video code stream unit is a video code stream unit used for packaging a previous video frame of the video frame matched with the jth group of subtitle information.
In one implementation, the custom field is a caption code stream unit, and each group of caption information and the time offset of each group of caption information in the M groups of caption information are respectively packaged into the M caption code stream units;
the processing unit 902 is configured to perform the following steps when encapsulating the synchronized subtitle text and the first media stream to form a second media stream: and packaging the M subtitle code stream units and the first media stream to form a second media stream.
In one implementation, a first media stream includes an audio stream and a video stream; the processing unit 902 is configured to, when synchronizing the subtitle text with the first media stream, specifically perform the following steps:
acquiring the time offset of a first group of subtitle information in the subtitle text;
determining a video frame matched with the first group of subtitle information in the video stream according to the time offset of the first group of subtitle information;
the first set of subtitle information is aligned with the determined video frame.
In one implementation, the video stream is encapsulated in a video encapsulation packet and the audio stream is encapsulated in an audio encapsulation packet; when encapsulating the synchronized subtitle text and the first media stream to form the second media stream, the processing unit 902 is specifically configured to perform the following steps:
encapsulating the aligned M groups of subtitle information and the time offset of each group into a subtitle encapsulation packet;
and encapsulating the subtitle encapsulation packet, the video encapsulation packet and the audio encapsulation packet to form the second media stream.
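For the encapsulation-packet variant, a sketch under the assumption of a simple tag-style container (the Packet layout and JSON body are hypothetical stand-ins for whatever container the push protocol uses, and SubtitleGroup continues the earlier sketch) could carry the aligned subtitle groups in one subtitle encapsulation packet next to the audio and video packets:

    import json
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Packet:
        kind: str          # "audio", "video" or "subtitle"
        timestamp_ms: int  # presentation timestamp used for synchronization
        payload: bytes

    def build_subtitle_packet(groups: List[SubtitleGroup], frame_ts_ms: int) -> Packet:
        """Pack the aligned M groups of subtitle information and their time
        offsets into a subtitle encapsulation packet, stamped with the
        timestamp of the video frame matched by the first group."""
        body = json.dumps([{"offset_ms": g.time_offset_ms, "text": g.text}
                           for g in groups]).encode("utf-8")
        return Packet(kind="subtitle", timestamp_ms=frame_ts_ms, payload=body)

    def mux(audio_pkts: List[Packet], video_pkts: List[Packet],
            subtitle_pkt: Packet) -> List[Packet]:
        """Form the second media stream by merging all packets in timestamp order."""
        return sorted([*audio_pkts, *video_pkts, subtitle_pkt],
                      key=lambda p: p.timestamp_ms)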
According to an embodiment of the present application, the units of the data processing apparatus 90 shown in fig. 9 may be separately or wholly combined into one or several other units, or some unit(s) thereof may be further split into multiple functionally smaller units, which can achieve the same operations without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present application, the data processing apparatus 90 may also include other units, and in practical applications these functions may be implemented with the assistance of other units or through the cooperation of multiple units. According to another embodiment of the present application, the data processing apparatus 90 shown in fig. 9 may be constructed, and the data processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 4 on a general-purpose computing device, such as a computer, that includes a Central Processing Unit (CPU), a random access memory (RAM), a read-only memory (ROM) and other processing and storage elements. The computer program may be recorded on, for example, a computer-readable storage medium, loaded into the above computing device via the computer-readable storage medium, and executed therein.
In the embodiment of the application, an audio stream is extracted from the first media stream to be pushed and subjected to recognition processing to obtain the subtitle text corresponding to the audio stream; thus, by pulling the first media stream to be pushed in time for audio recognition, real-time subtitle text can be generated for the first media stream. In addition, the subtitle text does not need to be hard-coded into the first media stream: the second media stream can be obtained and pushed to the decoding device simply by synchronizing the subtitle text with the first media stream and then encapsulating the two together. The synchronization comprises adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. In the second media stream obtained in this way, the first media stream and the subtitle text are decoupled from yet synchronized with each other, which ensures that the decoding device can display synchronized real-time subtitles while playing the first media stream, effectively improving the playing effect of the media stream; this is particularly suitable for non-standard, highly real-time multimedia playing scenarios.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a data processing apparatus according to another exemplary embodiment of the present application. The data processing apparatus 100 may be disposed in a data processing device, which may be a decoding device (e.g., the first decoding device 203 or the second decoding device 204) in the data processing system shown in fig. 2. The data processing apparatus 100 may be used to perform the corresponding steps in the method embodiment shown in fig. 6, and may comprise the following units:
a stream pulling unit 1001, configured to pull the second media stream; the second media stream is obtained by encapsulating the first media stream and the subtitle text synchronized with the first media stream; the subtitle text is obtained by performing recognition processing on the audio stream in the first media stream; the synchronization between the first media stream and the subtitle text comprises adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream;
a processing unit 1002, configured to decapsulate the second media stream to obtain the first media stream and the subtitle text;
the processing unit 1002 is further configured to: if a subtitle display instruction is detected, parse the first media stream and the subtitle text; and play the parsed first media stream, synchronously displaying the parsed subtitle text during the playing of the first media stream.
In one implementation, the first media stream further includes a video stream, and the plurality of video frames included in the video stream are sequentially encapsulated in a plurality of video code stream units; the subtitle text comprises M groups of subtitle information and the time offset of each group of subtitle information, M being a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units; when parsing the first media stream and the subtitle text, the processing unit 1002 is specifically configured to perform the following steps:
parsing a current code stream unit to be processed in the first media stream;
if the current code stream unit is a subtitle code stream unit, extracting a corresponding group of subtitle information and the time offset of the extracted subtitle information from the current code stream unit;
parsing the next code stream unit to obtain a video frame matched with the extracted subtitle information;
when playing the parsed first media stream and synchronously displaying the parsed subtitle text during the playing of the first media stream, the processing unit 1002 is specifically configured to perform the following step:
displaying the extracted subtitle information while playing the parsed video frame.
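The unit-by-unit parsing just described reduces, decoder-side, to a loop of the following shape (continuing the illustrative unit layout assumed in the encode-side sketches; decode_video_unit and render are hypothetical decoder and renderer calls):

    def play_with_subtitles(units: List[bytes]) -> None:
        """Walk the code stream units of the second media stream: when a subtitle
        unit is met, hold its text and display it with the next video frame."""
        pending: List[str] = []
        for unit in units:
            if unit[0] == SUBTITLE_TYPE:
                # after the type label: 4-byte time offset, then UTF-8 subtitle text
                pending.append(unit[5:].decode("utf-8"))
            else:
                frame = decode_video_unit(unit)   # hypothetical decoder call
                render(frame, overlay=pending)    # hypothetical renderer call
                pending = []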
In one implementation, the first media stream further includes a video stream, the video stream is encapsulated in a video encapsulation packet, the audio stream is encapsulated in an audio encapsulation packet, and the subtitle text is encapsulated in a subtitle encapsulation packet;
when playing the parsed first media stream and synchronously displaying the parsed subtitle text during the playing of the first media stream, the processing unit 1002 is specifically configured to perform the following step:
playing the audio stream parsed from the audio encapsulation packet and the video stream parsed from the video encapsulation packet, and synchronously displaying, during playing, the subtitle text parsed from the subtitle encapsulation packet.
In one implementation, the processing unit 1002 is further configured to perform the following steps: if an instruction prohibiting subtitle display is detected, ignoring the subtitle text and parsing the first media stream; and playing the parsed first media stream.
In one implementation, the first media stream further includes a video stream, and the plurality of video frames included in the video stream are sequentially encapsulated in a plurality of video code stream units; the subtitle text comprises M groups of subtitle information and the time offset of each group of subtitle information, M being a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units; when ignoring the subtitle text and parsing the first media stream, the processing unit 1002 is specifically configured to perform the following steps:
reading header information of a current code stream unit to be processed in the first media stream, wherein a type label is encapsulated in the header information;
if the type label in the header information is a subtitle label, skipping the current code stream unit;
parsing the next code stream unit to obtain a video frame;
when playing the parsed first media stream, the processing unit 1002 is specifically configured to perform the following step: playing the parsed video frame.
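The skip path only ever reads the type label in each unit's header, so the subtitle payload is never parsed — a sketch, again with the hypothetical helpers above:

    def play_without_subtitles(units: List[bytes]) -> None:
        """Ignore subtitle units entirely: check the type label in the header,
        skip subtitle units without parsing their payload, play the rest."""
        for unit in units:
            if unit[0] == SUBTITLE_TYPE:
                continue  # skipped, not parsed
            render(decode_video_unit(unit), overlay=[])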
According to an embodiment of the present application, the units of the data processing apparatus 100 shown in fig. 10 may be separately or wholly combined into one or several other units, or some unit(s) thereof may be further split into multiple functionally smaller units, which can achieve the same operations without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present application, the data processing apparatus 100 may also include other units, and in practical applications these functions may be implemented with the assistance of other units or through the cooperation of multiple units. According to another embodiment of the present application, the data processing apparatus 100 shown in fig. 10 may be constructed, and the data processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 6 on a general-purpose computing device, such as a computer, that includes a Central Processing Unit (CPU), a random access memory (RAM), a read-only memory (ROM) and other processing and storage elements. The computer program may be recorded on, for example, a computer-readable storage medium, loaded into the above computing device via the computer-readable storage medium, and executed therein.
In the embodiment of the application, the pulled second media stream is obtained by encapsulating the first media stream and subtitle text synchronized with the first media stream in real time, the subtitle text being obtained by performing recognition processing on the audio stream in the first media stream. After being pulled, the second media stream can be decapsulated to obtain the first media stream and the subtitle text; if the decoding device detects a subtitle display instruction, it can parse the first media stream and the subtitle text, play the parsed first media stream, and synchronously display the parsed subtitle text during playing. Here, the synchronization between the first media stream and the subtitle text comprises adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. Because the first media stream and the subtitle text in the second media stream are decoupled from yet synchronized with each other, a decoding device that detects a subtitle display instruction can display synchronized real-time subtitles while playing the first media stream, effectively improving the playing effect of the media stream; this is particularly suitable for non-standard, highly real-time multimedia playing scenarios.
Referring to fig. 11, fig. 11 shows a schematic structural diagram of a data processing device according to an exemplary embodiment of the present application. The data processing device 110 includes at least a processor 1101, a computer-readable storage medium 1102 and a communication interface 1103, which may be connected by a bus or in other ways. The communication interface 1103 may be used to pull or push media streams. The computer-readable storage medium 1102 may be stored in a memory of the data processing device 110 and is used to store a computer program comprising computer instructions. The processor 1101 is used to execute the computer instructions. The processor 1101 (or Central Processing Unit, CPU) is the computing core and control core of the data processing device 110; it is adapted to implement one or more computer instructions, and in particular to load and execute one or more computer instructions so as to implement the corresponding method flows or functions.
Embodiments of the present application also provide a computer-readable storage medium (memory), which is a memory device in the data processing device 110 and is used for storing programs and data. It is understood that the computer-readable storage medium 1102 here may comprise a storage medium built into the data processing device 110 and, of course, may also comprise an extended storage medium supported by the data processing device 110. The computer-readable storage medium provides a storage space that stores the operating system of the data processing device 110. One or more computer instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are suitable for being loaded and executed by the processor 1101. It should be noted that the computer-readable storage medium 1102 may be a high-speed RAM memory, or a non-volatile memory such as at least one disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the aforementioned processor 1101.
The data processing device 110 may be the subtitle generating device 202 in the data processing system shown in fig. 2, and the computer-readable storage medium 1102 stores a computer program comprising one or more computer instructions; the one or more computer instructions are loaded and executed by the processor 1101 to implement the corresponding steps in the method embodiment shown in fig. 4. In a specific implementation, the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 to perform the following steps:
pulling a first media stream to be pushed, and extracting an audio stream from the first media stream;
performing recognition processing on the audio stream to obtain a subtitle text corresponding to the audio stream;
synchronizing the subtitle text with the first media stream; the synchronization comprises adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream;
encapsulating the synchronized subtitle text and the first media stream to form a second media stream;
and pushing the second media stream to a decoding device.
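Taken together, these steps form an encode-side pipeline like the following sketch; pull_stream, extract_audio, encapsulate and push_stream are hypothetical helpers standing in for the embodiments' components, and recognize is sketched further below:

    def subtitle_pipeline(ingest_url: str, egress_url: str) -> None:
        """End-to-end sketch: pull, recognize, synchronize, encapsulate, push."""
        first_stream = pull_stream(ingest_url)        # first media stream to be pushed
        audio = extract_audio(first_stream)
        groups = recognize(audio)                     # M groups of subtitle information
        matches = match_subtitles_to_frames(groups, first_stream.video_frames)
        units = interleave_units(first_stream.video_units, groups, matches)
        second_stream = encapsulate(units, first_stream)
        push_stream(egress_url, second_stream)        # push to the decoding device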
In one implementation, when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to perform recognition processing on the audio stream to obtain the subtitle text corresponding to the audio stream, the following steps are specifically performed:
sequentially intercepting N audio segments from the audio stream;
sequentially recognizing the N audio segments to obtain M groups of subtitle information and the time offset of each group of subtitle information, wherein the M groups of subtitle information and their time offsets form the subtitle text;
any one of the N audio segments is denoted as the ith audio segment, and the ith audio segment corresponds to K groups of subtitle information; any one of the M groups of subtitle information is denoted as the jth group of subtitle information, and the time offset of the jth group of subtitle information refers to the offset of the start timestamp of the jth group of subtitle information relative to the start timestamp of the audio stream; i, j, N, M and K are positive integers, i is less than or equal to N, K is less than or equal to M, and j is less than or equal to M.
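The offset bookkeeping in this step can be sketched as follows, assuming 16-bit mono PCM audio and a recognizer recognize_clip that returns (text, start-offset-within-segment) pairs — both assumptions for illustration, not components named by the application:

    def recognize(audio: bytes, sample_rate: int = 16000,
                  clip_seconds: int = 5, bytes_per_sample: int = 2) -> List[SubtitleGroup]:
        """Cut the audio stream into N consecutive segments, recognize each one,
        and convert per-segment timestamps into offsets from the stream start."""
        seg_bytes = sample_rate * bytes_per_sample * clip_seconds
        groups: List[SubtitleGroup] = []
        for i in range(0, len(audio), seg_bytes):
            segment = audio[i:i + seg_bytes]
            seg_start_ms = (i // bytes_per_sample) * 1000 // sample_rate
            for text, rel_start_ms in recognize_clip(segment):  # hypothetical ASR call
                groups.append(SubtitleGroup(text=text,
                                            time_offset_ms=seg_start_ms + rel_start_ms))
        return groups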
In one implementation, the first media stream includes an audio stream and a video stream; when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to synchronize the subtitle text with the first media stream, they are specifically configured to perform the following steps:
determining, in the video stream, the M video frames respectively matched with the M groups of subtitle information according to the time offsets of the M groups of subtitle information;
constructing a custom field at the associated position of each of the M matched video frames; and adding the M groups of subtitle information and the time offset of each group of subtitle information into the corresponding custom fields.
In one implementation, the video stream comprises a plurality of video frames that are sequentially encapsulated in a plurality of video code stream units, and the video frame matched with the jth group of subtitle information is encapsulated in a target video code stream unit;
when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to construct a custom field at the associated position of the video frame matched with the jth group of subtitle information, they are specifically configured to perform the following steps: constructing a target subtitle code stream unit at the associated position of the target video code stream unit; and configuring the target subtitle code stream unit as a custom field;
when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to add the jth group of subtitle information and its time offset to the corresponding custom field, they are specifically configured to perform the following step: encapsulating the jth group of subtitle information and its time offset into the target subtitle code stream unit;
wherein the associated position of the target video code stream unit comprises the position between the reference video code stream unit and the target video code stream unit, the reference video code stream unit being the video code stream unit that encapsulates the video frame immediately preceding the video frame matched with the jth group of subtitle information.
In one implementation, the custom fields are subtitle code stream units, and each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units;
when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to encapsulate the synchronized subtitle text and the first media stream to form the second media stream, they are specifically configured to perform the following step: encapsulating the M subtitle code stream units and the first media stream to form the second media stream.
In one implementation, the first media stream includes an audio stream and a video stream; when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to synchronize the subtitle text with the first media stream, they are specifically configured to perform the following steps:
acquiring the time offset of a first group of subtitle information in the subtitle text;
determining a video frame matched with the first group of subtitle information in the video stream according to the time offset of the first group of subtitle information;
and aligning the first group of subtitle information with the determined video frame.
In one implementation, the video stream is encapsulated in a video encapsulation packet and the audio stream is encapsulated in an audio encapsulation packet; when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to encapsulate the synchronized subtitle text and the first media stream to form the second media stream, they are specifically configured to perform the following steps:
encapsulating the aligned M groups of subtitle information and the time offset of each group into a subtitle encapsulation packet;
and encapsulating the subtitle encapsulation packet, the video encapsulation packet and the audio encapsulation packet to form the second media stream.
The data processing device 110 may be a decoding device (e.g., the first decoding device 203 or the second decoding device 204) in the data processing system shown in fig. 2, and the computer-readable storage medium 1102 stores a computer program comprising one or more computer instructions; the one or more computer instructions are loaded and executed by the processor 1101 to implement the corresponding steps in the method embodiment shown in fig. 6. In a specific implementation, the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 to perform the following steps:
pulling the second media stream; the second media stream is obtained by encapsulating the first media stream and the subtitle text synchronized with the first media stream; the subtitle text is obtained by performing recognition processing on the audio stream in the first media stream; the synchronization between the first media stream and the subtitle text comprises adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream;
decapsulating the second media stream to obtain the first media stream and the subtitle text;
if a subtitle display instruction is detected, parsing the first media stream and the subtitle text; and playing the parsed first media stream, synchronously displaying the parsed subtitle text during the playing of the first media stream.
In one implementation, the first media stream further includes a video stream, and the plurality of video frames included in the video stream are sequentially encapsulated in a plurality of video code stream units; the subtitle text comprises M groups of subtitle information and the time offset of each group of subtitle information, M being a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units; when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to parse the first media stream and the subtitle text, they are specifically configured to perform the following steps:
parsing a current code stream unit to be processed in the first media stream;
if the current code stream unit is a subtitle code stream unit, extracting a corresponding group of subtitle information and the time offset of the extracted subtitle information from the current code stream unit;
parsing the next code stream unit to obtain a video frame matched with the extracted subtitle information;
when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to play the parsed first media stream and synchronously display the parsed subtitle text during the playing of the first media stream, they are specifically configured to perform the following step:
displaying the extracted subtitle information while playing the parsed video frame.
In one implementation, the first media stream further includes a video stream, the video stream is encapsulated in a video encapsulation packet, the audio stream is encapsulated in an audio encapsulation packet, and the subtitle text is encapsulated in a subtitle encapsulation packet;
when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to play the parsed first media stream and synchronously display the parsed subtitle text during the playing of the first media stream, they are specifically configured to perform the following step:
playing the audio stream parsed from the audio encapsulation packet and the video stream parsed from the video encapsulation packet, and synchronously displaying, during playing, the subtitle text parsed from the subtitle encapsulation packet.
In one implementation, the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and further configured to perform the following steps: if an instruction prohibiting subtitle display is detected, ignoring the subtitle text and parsing the first media stream; and playing the parsed first media stream.
In one implementation, the first media stream further includes a video stream, and the plurality of video frames included in the video stream are sequentially encapsulated in a plurality of video code stream units; the subtitle text comprises M groups of subtitle information and the time offset of each group of subtitle information, M being a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units; when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to ignore the subtitle text and parse the first media stream, they are specifically configured to perform the following steps:
reading header information of a current code stream unit to be processed in the first media stream, wherein a type label is encapsulated in the header information;
if the type label in the header information is a subtitle label, skipping the current code stream unit;
parsing the next code stream unit to obtain a video frame;
when the computer instructions in the computer-readable storage medium 1102 are loaded by the processor 1101 and executed to play the parsed first media stream, they are specifically configured to perform the following step: playing the parsed video frame.
In the embodiment of the application, an audio stream is extracted from the first media stream to be pushed and subjected to recognition processing to obtain the subtitle text corresponding to the audio stream; thus, by pulling the first media stream to be pushed in time for audio recognition, real-time subtitle text can be generated for the first media stream. In addition, the subtitle text does not need to be hard-coded into the first media stream: the second media stream can be obtained and pushed to the decoding device simply by synchronizing the subtitle text with the first media stream and then encapsulating the two together. The synchronization comprises adding the subtitle text to a custom field of the first media stream, or encapsulating the subtitle text into a subtitle encapsulation packet synchronized with the first media stream. In the second media stream obtained in this way, the first media stream and the subtitle text are decoupled from yet synchronized with each other, so the decoding device can display synchronized real-time subtitles while playing the first media stream, effectively improving the playing effect of the media stream; this is particularly suitable for non-standard, highly real-time multimedia playing scenarios.
According to an aspect of the present application, a computer program product or a computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method provided in the various optional modes above.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can readily occur to a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method of data processing, the method comprising:
pulling a first media stream to be pushed, and extracting an audio stream from the first media stream;
performing recognition processing on the audio stream to obtain a subtitle text corresponding to the audio stream;
synchronizing the subtitle text with the first media stream; the synchronizing comprises adding the subtitle text to a custom field of the first media stream; the custom field is obtained by constructing, at the associated position of the video frame matched with the subtitle text in the video stream of the first media stream, a subtitle code stream unit and configuring it as the custom field;
encapsulating the synchronized subtitle text and the first media stream to form a second media stream; the first media stream and the subtitle text in the second media stream are decoupled from and synchronized with each other;
streaming the second media stream to a decoding device, so that the decoding device, upon detecting a subtitle display instruction, performs subtitle display capability detection: if subtitle display capability is detected, the decoding device parses the first media stream and the subtitle text respectively, plays the parsed first media stream, and synchronously displays the parsed subtitle text during the playing of the first media stream; if no subtitle display capability is detected, the decoding device ignores the subtitle text, parses the first media stream, and plays the parsed first media stream; or, upon detecting an instruction prohibiting subtitle display, the decoding device ignores the subtitle text, parses the first media stream, and plays the parsed first media stream;
wherein ignoring the subtitle text, parsing the first media stream and playing the parsed first media stream means: skipping, without parsing, the subtitle code stream units in the second media stream, and parsing and playing the data in the second media stream other than the subtitle code stream units.
2. The method of claim 1, wherein the recognizing the audio stream to obtain the subtitle text corresponding to the audio stream comprises:
sequentially intercepting N audio segments from the audio stream;
sequentially recognizing the N audio segments to obtain M groups of subtitle information and the time offset of each group of subtitle information, wherein the M groups of subtitle information and their time offsets form the subtitle text;
any one of the N audio segments is denoted as the ith audio segment, and the ith audio segment corresponds to K groups of subtitle information; any one of the M groups of subtitle information is denoted as the jth group of subtitle information, and the time offset of the jth group of subtitle information refers to the offset of the start timestamp of the jth group of subtitle information relative to the start timestamp of the audio stream; i, j, N, M and K are positive integers, i is less than or equal to N, K is less than or equal to M, and j is less than or equal to M.
3. The method of claim 2, wherein the first media stream comprises the audio stream and a video stream; the synchronizing the subtitle text with the first media stream comprises:
determining M video frames respectively matched with the M groups of subtitle information in the video stream according to the time offset of the M groups of subtitle information;
constructing custom fields at the associated positions of the matched M video frames respectively; and
and adding the M groups of subtitle information and the time offset of each group of subtitle information into corresponding custom fields.
4. The method of claim 3, wherein the video stream comprises a plurality of video frames sequentially encapsulated in a plurality of video code stream units, and the video frame matched with the jth group of subtitle information is encapsulated in a target video code stream unit;
constructing a custom field at the associated position of the video frame matched with the jth group of subtitle information comprises: constructing a target subtitle code stream unit at the associated position of the target video code stream unit; and configuring the target subtitle code stream unit as a custom field;
adding the jth group of subtitle information and the time offset of the jth group of subtitle information to the corresponding custom field comprises: encapsulating the jth group of subtitle information and the time offset of the jth group of subtitle information into the target subtitle code stream unit;
wherein the associated position of the target video code stream unit comprises: the position between a reference video code stream unit and the target video code stream unit; the reference video code stream unit is the video code stream unit encapsulating the video frame immediately preceding the video frame matched with the jth group of subtitle information.
5. The method of claim 3 or 4, wherein the custom fields are subtitle code stream units, and each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units;
the encapsulating the synchronized subtitle text and the first media stream to form a second media stream comprises: encapsulating the M subtitle code stream units and the first media stream to form the second media stream.
6. A method of data processing, the method comprising:
pulling a second media stream; the second media stream is obtained by encapsulating a first media stream and a subtitle text synchronized with the first media stream; the subtitle text is obtained by performing recognition processing on the audio stream in the first media stream; the synchronization between the first media stream and the subtitle text comprises adding the subtitle text to a custom field of the first media stream; the custom field is obtained by constructing, at the associated position of the video frame matched with the subtitle text in the video stream of the first media stream, a subtitle code stream unit and configuring it as the custom field; the first media stream and the subtitle text in the second media stream are decoupled from and synchronized with each other;
decapsulating the second media stream to obtain the first media stream and the subtitle text;
if a subtitle display instruction is detected, performing subtitle display capability detection: if subtitle display capability is detected, parsing the first media stream and the subtitle text, playing the parsed first media stream, and synchronously displaying the parsed subtitle text during the playing of the first media stream; if no subtitle display capability is detected, ignoring the subtitle text, parsing the first media stream, and playing the parsed first media stream;
if an instruction prohibiting subtitle display is detected, ignoring the subtitle text, parsing the first media stream, and playing the parsed first media stream;
wherein ignoring the subtitle text, parsing the first media stream and playing the parsed first media stream means: skipping, without parsing, the subtitle code stream units in the second media stream, and parsing and playing the data in the second media stream other than the subtitle code stream units.
7. The method of claim 6, wherein the first media stream further comprises a video stream, and the video stream comprises a plurality of video frames sequentially encapsulated in a plurality of video code stream units; the subtitle text comprises M groups of subtitle information and the time offset of each group of subtitle information, M being a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units; the parsing the first media stream and the subtitle text comprises:
parsing a current code stream unit to be processed in the first media stream;
if the current code stream unit is a subtitle code stream unit, extracting a corresponding group of subtitle information and the time offset of the extracted subtitle information from the current code stream unit;
parsing the next code stream unit to obtain a video frame matched with the extracted subtitle information;
the playing the parsed first media stream and synchronously displaying the parsed subtitle text during the playing of the first media stream comprises:
displaying the extracted subtitle information while playing the parsed video frame.
8. The method of claim 6, wherein the first media stream further comprises a video stream, and the video stream comprises a plurality of video frames sequentially encapsulated in a plurality of video code stream units; the subtitle text comprises M groups of subtitle information and the time offset of each group of subtitle information, M being a positive integer; each of the M groups of subtitle information, together with its time offset, is encapsulated into one of M subtitle code stream units; the ignoring the subtitle text and parsing the first media stream comprises:
reading header information of a current code stream unit to be processed in the first media stream, wherein a type label is encapsulated in the header information;
if the type label in the header information is a subtitle label, skipping the current code stream unit;
parsing the next code stream unit to obtain a video frame;
the playing the parsed first media stream comprises: playing the parsed video frame.
9. A data processing apparatus, characterized in that the data processing apparatus comprises:
a stream pulling unit, configured to pull a first media stream to be pushed and extract an audio stream from the first media stream;
a processing unit, configured to perform recognition processing on the audio stream to obtain a subtitle text corresponding to the audio stream;
the processing unit is further configured to synchronize the subtitle text with the first media stream; the synchronizing comprises adding the subtitle text to a custom field of the first media stream; the custom field is obtained by constructing, at the associated position of the video frame matched with the subtitle text in the video stream of the first media stream, a subtitle code stream unit and configuring it as the custom field;
the processing unit is further configured to encapsulate the synchronized subtitle text and the first media stream to form a second media stream; the first media stream and the subtitle text in the second media stream are decoupled from and synchronized with each other;
a stream pushing unit, configured to push the second media stream to a decoding device, so that the decoding device, upon detecting a subtitle display instruction, performs subtitle display capability detection: if subtitle display capability is detected, the decoding device parses the first media stream and the subtitle text respectively, plays the parsed first media stream, and synchronously displays the parsed subtitle text during the playing of the first media stream; if no subtitle display capability is detected, the decoding device ignores the subtitle text, parses the first media stream, and plays the parsed first media stream; or, upon detecting an instruction prohibiting subtitle display, the decoding device ignores the subtitle text, parses the first media stream, and plays the parsed first media stream;
wherein ignoring the subtitle text, parsing the first media stream and playing the parsed first media stream means: skipping, without parsing, the subtitle code stream units in the second media stream, and parsing and playing the data in the second media stream other than the subtitle code stream units.
10. A data processing apparatus, characterized in that the data processing apparatus comprises:
a stream pulling unit, configured to pull a second media stream; the second media stream is obtained by encapsulating a first media stream and a subtitle text synchronized with the first media stream; the subtitle text is obtained by performing recognition processing on the audio stream in the first media stream; the synchronization between the first media stream and the subtitle text comprises adding the subtitle text to a custom field of the first media stream; the custom field is obtained by constructing, at the associated position of the video frame matched with the subtitle text in the video stream of the first media stream, a subtitle code stream unit and configuring it as the custom field; the first media stream and the subtitle text in the second media stream are decoupled from and synchronized with each other;
a processing unit, configured to decapsulate the second media stream to obtain the first media stream and the subtitle text;
the processing unit is further configured to: if a subtitle display instruction is detected, perform subtitle display capability detection; if subtitle display capability is detected, parse the first media stream and the subtitle text, play the parsed first media stream, and synchronously display the parsed subtitle text during the playing of the first media stream; if no subtitle display capability is detected, ignore the subtitle text, parse the first media stream, and play the parsed first media stream; and if an instruction prohibiting subtitle display is detected, ignore the subtitle text, parse the first media stream, and play the parsed first media stream;
wherein ignoring the subtitle text, parsing the first media stream and playing the parsed first media stream means: skipping, without parsing, the subtitle code stream units in the second media stream, and parsing and playing the data in the second media stream other than the subtitle code stream units.
11. A data processing apparatus, characterized in that the data processing apparatus comprises:
a processor adapted to implement a computer program; and
a computer-readable storage medium, having stored thereon a computer program adapted to be loaded by the processor to execute the data processing method of any one of claims 1 to 5, or the data processing method of any one of claims 6 to 8.
12. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor and to execute the data processing method according to any one of claims 1 to 5, or the data processing method according to any one of claims 6 to 8.