CN114827663A - Distributed live broadcast frame insertion system and method - Google Patents

Distributed live broadcast frame insertion system and method

Info

Publication number
CN114827663A
Authority
CN
China
Prior art keywords
frame
video
metadata
frame number
target database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210381421.8A
Other languages
Chinese (zh)
Other versions
CN114827663B (en)
Inventor
朱侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd
Priority to CN202210381421.8A
Publication of CN114827663A
Application granted
Publication of CN114827663B
Legal status: Active (current)
Anticipated expiration (date not listed)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23424 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44 Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence

Abstract

The invention discloses a distributed live broadcast frame insertion system and method. The system comprises a decoder, an encoder and a plurality of AI engines respectively deployed on different hosts. The decoder decodes the video stream, stores the video frames in a database, and stores the audio data and the video frame metadata in a message queue. Each AI engine acquires video frame metadata from the message queue, pulls the corresponding video frames from the database, generates an intermediate frame and frame insertion metadata, stores the intermediate frame in the database, and stores the frame insertion metadata in the message queue. The encoder acquires the audio data and the frame insertion metadata from the message queue, pulls the corresponding video frames and intermediate frames from the database, merges and encodes the acquired data, and outputs a target video stream. By distributing the frame insertion task for high-quality video across a plurality of AI engines, the invention meets the frame insertion throughput required in live broadcast scenarios, improves live frame insertion efficiency, and improves the smoothness of live video.

Description

Distributed live broadcast frame insertion system and method
Technical Field
The invention relates to the technical field of video processing, and in particular to a distributed live broadcast frame insertion system and method.
Background
Current frame interpolation technology mainly relies on an AI engine that processes pairs of adjacent frames to create an interpolated frame between them, raising the number of frames played per second and thereby improving playback smoothness. Existing live broadcast frame interpolation usually handles only relatively smooth scenes and slow motion; scenes with intense motion, such as sports events, are often difficult to improve significantly. A common requirement in live broadcast scenarios is to raise a 25 frames/s video to 50 frames/s, yet current AI engines demand substantial computing power: a typical AI engine working on a single machine interpolates only about 5 frames/s, which falls far short of the frame insertion throughput required in a live broadcast scenario and therefore cannot support live broadcasts of scenes with motion.
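To make the gap concrete, a back-of-envelope calculation based on the figures above (the five-engine lower bound is an inference drawn here for illustration, not a statement from the disclosure):

```python
# Throughput check using the figures quoted in the background above.
source_fps = 25      # input live stream frame rate
target_fps = 50      # required output frame rate
engine_fps = 5       # interpolation throughput of one single-host AI engine

new_frames_per_second = target_fps - source_fps        # 25 intermediate frames/s
min_engines = -(-new_frames_per_second // engine_fps)  # ceil(25 / 5) = 5

print(min_engines)   # one engine covers only a fifth of the required throughput
```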
The above content is provided only to assist in understanding the technical solution of the present invention, and does not constitute an admission that it is prior art.
Disclosure of Invention
The main objective of the present invention is to provide a distributed live broadcast frame insertion system and method, aiming to solve the technical problem that common live frame insertion approaches are inefficient and cannot meet the frame insertion throughput required in live broadcast scenarios.
In order to achieve the above object, the present invention provides a distributed live broadcast frame insertion system, including: the decoder, the encoder and the AI engines are respectively arranged on different hosts;
the decoder is used for decoding the acquired video stream to obtain audio data, video frames and video frame metadata, storing the video frames into a target database and storing the audio data and the video frame metadata into a message queue;
each AI engine is used for acquiring, from the message queue, the video frame metadata decoded by the decoder, pulling the video frame corresponding to the video frame metadata from the target database, generating an intermediate frame according to the pulled video frame, generating frame insertion metadata corresponding to the intermediate frame, storing the intermediate frame into the target database, and storing the frame insertion metadata into the message queue;
and the encoder is used for acquiring the audio data decoded by the decoder and the frame insertion metadata generated by the AI engine from the message queue, pulling the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, merging and encoding the video frame, the intermediate frame and the audio data, and outputting a target video stream.
Optionally, the encoder is further configured to delete the intermediate frame from the target database when the video frame and the intermediate frame corresponding to the frame insertion metadata have been obtained through local buffering, record a first frame number corresponding to the intermediate frame in a preset integer array, search the preset integer array forward, based on the first frame number, for a previous intermediate frame number corresponding to the first frame number, and delete the video frame from the target database if the previous intermediate frame number corresponding to the first frame number is recorded in the preset integer array.
Optionally, the encoder is further configured to search the preset integer array backward, based on the first frame number, for a next intermediate frame number corresponding to the first frame number, and delete the video frame following the intermediate frame from the target database if the next intermediate frame number corresponding to the first frame number is recorded in the preset integer array.
Optionally, the encoder is further configured to delete, after the target video stream is output, the video frames and intermediate frames in the target database whose frame numbers are smaller than the first frame number.
Optionally, the decoder is further configured to number a video frame obtained through decoding, generate a second frame number corresponding to the video frame, generate video frame metadata corresponding to the video frame according to the second frame number, store the video frame including the second frame number in the target database, and store the audio data and the video frame metadata in a message queue;
each AI engine is also used for determining a third frame number corresponding to the pulled video frame, generating a fourth frame number corresponding to the intermediate frame according to the third frame number, generating frame insertion metadata corresponding to the intermediate frame according to the third frame number and the fourth frame number, storing the intermediate frame containing the fourth frame number into the target database, and storing the frame insertion metadata into the message queue;
the encoder is further configured to extract a fifth frame number and a sixth frame number from the obtained frame insertion metadata, and pull the video frame and the intermediate frame corresponding to the fifth frame number and the sixth frame number respectively from the target database.
Optionally, the decoder is further configured to number the video frame obtained through decoding, generate a first odd frame number corresponding to the video frame, and generate video frame metadata corresponding to the video frame according to the first odd frame number;
each AI engine is further configured to determine a second odd frame number corresponding to the pulled video frame, generate a first even frame number corresponding to the intermediate frame according to an even number adjacent to the second odd frame number, and generate frame insertion metadata corresponding to the intermediate frame according to the second odd frame number and the first even frame number.
Optionally, the decoder is further configured to number the video frame obtained through decoding, generate a second even frame number corresponding to the video frame, and generate video frame metadata corresponding to the video frame according to the second even frame number;
each AI engine is further configured to determine a third even frame number corresponding to the pulled video frame, generate a third odd frame number corresponding to the intermediate frame according to an odd number adjacent to the third even frame number, and generate frame insertion metadata corresponding to the intermediate frame according to the third even frame number and the third odd frame number.
Optionally, the encoder is further configured to obtain a video absolute time and an audio absolute time, determine whether the video absolute time is greater than the audio absolute time, obtain, from the message queue, audio data decoded by the decoder when the video absolute time is greater than the audio absolute time, and update the audio absolute time according to a display timestamp corresponding to the audio data.
Optionally, the encoder is further configured to, when the video absolute time is smaller than the audio absolute time, obtain frame insertion metadata generated by the AI engine from the message queue, pull a corresponding video frame and an intermediate frame from the target database according to the frame insertion metadata, and update the video absolute time according to a display timestamp corresponding to the video frame or the intermediate frame.
In addition, in order to achieve the above object, the present invention further provides a distributed live frame insertion method, where the method is applied to the distributed live frame insertion system, and the distributed live frame insertion system includes: a decoder, an encoder, and a plurality of AI engines respectively deployed on different hosts, the method comprising:
the decoder decodes the acquired video stream to obtain audio data, video frames and video frame metadata, stores the video frames into a target database, and stores the audio data and the video frame metadata into a message queue;
each AI engine acquires, from the message queue, the video frame metadata decoded by the decoder, pulls a video frame corresponding to the video frame metadata from the target database, generates an intermediate frame according to the pulled video frame, generates frame insertion metadata corresponding to the intermediate frame, stores the intermediate frame into the target database, and stores the frame insertion metadata into the message queue;
and the encoder acquires the audio data decoded by the decoder and the frame insertion metadata generated by the AI engine from the message queue, pulls the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, combines and encodes the video frame, the intermediate frame and the audio data, and outputs a target video stream.
The distributed live broadcast frame insertion system of the invention comprises: a decoder, an encoder and a plurality of AI engines respectively deployed on different hosts. The decoder decodes the acquired video stream to obtain audio data, video frames and video frame metadata, stores the video frames into a target database, and stores the audio data and the video frame metadata into a message queue. Each AI engine acquires the video frame metadata decoded by the decoder from the message queue, pulls the video frame corresponding to the video frame metadata from the target database, generates an intermediate frame according to the pulled video frame, generates frame insertion metadata corresponding to the intermediate frame, stores the intermediate frame into the target database, and stores the frame insertion metadata into the message queue. The encoder acquires the audio data decoded by the decoder and the frame insertion metadata generated by the AI engines from the message queue, pulls the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, merges and encodes the video frame, the intermediate frame and the audio data, and outputs the target video stream. In this manner, a plurality of distributed AI engines execute the frame insertion tasks for high-quality video, so that many interpolated frames can be generated in a short time, the frame insertion throughput required in live broadcast scenarios is met, live frame insertion efficiency is improved, and the smoothness of live video is improved.
Drawings
Fig. 1 is a block diagram of a distributed live frame insertion system according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of video frame cleaning according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a distributed live frame insertion method according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
An embodiment of the present invention provides a distributed live broadcast frame insertion system, and referring to fig. 1, fig. 1 is a structural block diagram of a first embodiment of a distributed live broadcast frame insertion system according to the present invention.
In this embodiment, the distributed live frame insertion system includes: a decoder 10, an encoder 30, and a plurality of AI engines 20, each deployed on a different host;
the decoder 10 is configured to decode the acquired video stream to obtain audio data, video frames, and video frame metadata, store the video frames in a target database, and store the audio data and the video frame metadata in a message queue.
Preferably, the target database is a key-value database in which video frame data is stored as values indexed by key. The decoder 10 numbers the decoded video frames to generate a frame number for each video frame, generates key-value pairs that map each frame number to its corresponding video frame, and stores the key-value pairs in the key-value database.
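As a minimal sketch of this storage scheme (assuming Redis as the key-value store, raw frame bytes as values, and a "frame:" key prefix; the patent does not name a specific database or schema):

```python
import redis

# Hypothetical key-value store holding decoded frames, keyed by frame number.
kv = redis.Redis(host="localhost", port=6379)

def store_video_frame(frame_no: int, frame_bytes: bytes) -> None:
    """Store one decoded video frame under its frame number."""
    kv.set(f"frame:{frame_no}", frame_bytes)

def pull_video_frame(frame_no: int) -> bytes:
    """Pull a frame by number; the AI engines and the encoder read frames this way."""
    data = kv.get(f"frame:{frame_no}")
    if data is None:
        raise KeyError(f"frame {frame_no} not ready or already cleaned")
    return data
```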
It should be understood that, compared with a message bus, the advantage of using a key-value database for video frame transfer in this embodiment is as follows: video frames are large (a single 1080p frame is approximately 3 MB), and transferring them over a conventional message bus would degrade the bus's performance. Moreover, a message bus deletes a message once it has been pulled; if the bus were used for transfer, the AI engine 20 would have to re-upload each video frame after pulling it, wasting bandwidth. With a key-value database, after the AI engine 20 pulls the video frames it only needs to store the newly generated intermediate frame into the database, without re-uploading the pulled video frames, so the transfer performance is better.
Accordingly, the decoder 10 generates video frame metadata from the frame number and Presentation Time Stamp (PTS) corresponding to each video frame, where the presentation time stamp tells the player when to display the video frame. In this embodiment the video frames and the metadata are split and transferred through different channels, and the metadata is used to inform the AI engines 20 which video frames are ready. Optionally, the decoder 10 generates a piece of video frame metadata for every single frame, i.e. each piece of metadata contains the frame number and display timestamp of only one video frame, for example: {frameno1:1, pts1:1234567}. Preferably, the decoder 10 generates one piece of video frame metadata from the frame numbers and display timestamps of every two adjacent video frames. For example, after decoding video frames A, B and C ..., the decoder 10 generates one piece of metadata from the frame numbers and display timestamps of frames A and B, and another from those of frames B and C. Metadata in this form specifies a frame insertion task for an AI engine 20: upon acquiring the metadata, the AI engine pulls the two adjacent video frames it references and executes the corresponding frame insertion task, namely generating the intermediate frame between the two adjacent video frames.
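A minimal sketch of the decoder-side metadata path under the same assumptions (a Redis list standing in for the message queue, JSON-encoded metadata, and the queue name "task_queue" are all illustrative choices):

```python
import json
import redis

mq = redis.Redis()  # a Redis list stands in for the message queue here

def publish_adjacent_pair(prev_no: int, prev_pts: int,
                          cur_no: int, cur_pts: int) -> None:
    """Publish one frame insertion task covering two adjacent decoded frames."""
    meta = {"frameno1": prev_no, "pts1": prev_pts,
            "frameno2": cur_no, "pts2": cur_pts}
    # RPUSH preserves queue order, so engines consume tasks in decode order.
    mq.rpush("task_queue", json.dumps(meta))
```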
In another implementation, the decoder 10 generates the video frame metadata according to the frame insertion frequency of the frame insertion task: starting from the first decoded video frame, the two adjacent reference video frames for each frame insertion task are determined by the frame insertion frequency. The decoder 10 generates one piece of video frame metadata from the frame numbers and display timestamps of the two adjacent reference video frames; for video frames that are not referenced by any frame insertion task, the decoder 10 generates a separate piece of metadata per frame. For example, if the frame insertion frequency is one intermediate frame per 4 frames, the two adjacent reference video frames of the first frame insertion task are the fourth and fifth video frames; separate metadata is generated for each of the first three video frames, and one piece of metadata is generated from the frame numbers and display timestamps of the fourth and fifth video frames.
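A sketch of that scheduling rule (the modulus test is one illustrative reading of the every-4-frames example above, and the metadata shape carries over from the earlier sketch):

```python
def metadata_for_frame(frame_no: int, pts: int, prev: dict | None,
                       insert_every: int = 4) -> dict | None:
    """Decide what metadata to emit for a newly decoded frame.

    Frames are numbered 1, 2, 3, ...; with insert_every == 4, the reference
    pairs are (4, 5), (8, 9), ... and all other frames get singleton metadata.
    """
    if prev is not None and prev["no"] % insert_every == 0:
        # Current frame closes a reference pair: emit a frame insertion task.
        return {"frameno1": prev["no"], "pts1": prev["pts"],
                "frameno2": frame_no, "pts2": pts}
    if frame_no % insert_every == 0:
        # This frame is the first half of a pair; wait for the next frame.
        return None
    # Not referenced by any insertion task: emit singleton metadata.
    return {"frameno1": frame_no, "pts1": pts}
```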
Each AI engine 20 is configured to acquire the video frame metadata decoded by the decoder 10 from the message queue, pull the video frame corresponding to the video frame metadata from the target database, generate an intermediate frame according to the pulled video frames, generate the frame insertion metadata corresponding to the intermediate frame, store the intermediate frame in the target database, and store the frame insertion metadata in the message queue.
It should be noted that, taking as an example video frame metadata that the decoder 10 generates from the frame numbers and display timestamps of every two adjacent video frames: the AI engine 20 pulls two video frames from the target database for each frame insertion task. The AI engines 20 each acquire video frame metadata from the message queue in queue order, pull the adjacent video frames X and Y from the target database according to the adjacent frame numbers in the metadata, generate an intermediate frame from the two adjacent video frames, and generate a frame number for the intermediate frame. To prevent the encoder from missing the leading video frame, the frame insertion metadata is generated from the frame numbers and display timestamps of the intermediate frame and the video frame preceding it, i.e. from those of video frame X and the intermediate frame. For example, if the metadata the AI engine 20 acquires from the message queue is {frameno1:1, frameno2:3, pts1:1234567, pts2:1234569}, where frame numbers 1 and 3 identify the two adjacent video frames, then the metadata the AI engine 20 stores into the message queue is {frameno1:1, frameno2:2, pts1:1234567, pts2:1234568}, where 2 is the frame number of the intermediate frame.
It should be understood that each of the plurality of AI engines 20 in this embodiment generates the intermediate frame from two adjacent video frames according to its own interpolation algorithm, for example by analyzing the two adjacent frames with an optical flow method and determining the intermediate position between corresponding displacement points in the two frames. After completing a frame insertion task, the AI engine 20 generates the frame insertion metadata corresponding to the intermediate frame and stores it in the message queue, so that the encoder 30 can obtain the frames for which insertion has been completed.
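A sketch of one AI engine's worker loop under the same assumed Redis layout (plain frame averaging stands in for a real optical-flow interpolator, purely to keep the sketch self-contained):

```python
import json
import numpy as np
import redis

kv = redis.Redis()   # frame store
mq = redis.Redis()   # message queue

def interpolate(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    # Placeholder for the engine's own algorithm (e.g. optical-flow based).
    return ((frame_a.astype(np.uint16) + frame_b.astype(np.uint16)) // 2).astype(np.uint8)

def worker_loop(height: int, width: int) -> None:
    while True:
        _, raw = mq.blpop("task_queue")            # blocking pop, queue order
        meta = json.loads(raw)
        a = np.frombuffer(kv.get(f"frame:{meta['frameno1']}"),
                          dtype=np.uint8).reshape(height, width, 3)
        b = np.frombuffer(kv.get(f"frame:{meta['frameno2']}"),
                          dtype=np.uint8).reshape(height, width, 3)
        mid_no = (meta["frameno1"] + meta["frameno2"]) // 2   # e.g. (1 + 3) // 2 = 2
        mid_pts = (meta["pts1"] + meta["pts2"]) // 2
        kv.set(f"frame:{mid_no}", interpolate(a, b).tobytes())
        # Frame insertion metadata pairs the preceding video frame with the
        # new intermediate frame, e.g. {frameno1:1, frameno2:2, ...}.
        mq.rpush("insert_queue", json.dumps(
            {"frameno1": meta["frameno1"], "frameno2": mid_no,
             "pts1": meta["pts1"], "pts2": mid_pts}))
```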
The encoder 30 is configured to obtain, from the message queue, the audio data decoded by the decoder 10 and the frame insertion metadata generated by the AI engine 20, pull the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, merge and encode the video frame, the intermediate frame, and the audio data, and output a target video stream.
It should be noted that, continuing the example in which the decoder 10 generates video frame metadata from the frame numbers and display timestamps of every two adjacent video frames: the AI engines 20 pull the metadata generated by the decoder 10 from the message queue, whereupon the message queue automatically deletes that metadata; after an AI engine 20 generates an intermediate frame, it generates the frame insertion metadata from the frame numbers of the preceding video frame and the intermediate frame and stores it in the message queue. The frame insertion metadata that the encoder 30 acquires from the message queue is therefore the metadata produced by the AI engines 20 after frame insertion processing, containing the frame numbers of a video frame and an intermediate frame. The encoder extracts the frame numbers from the frame insertion metadata, pulls the corresponding video frame and intermediate frame from the target database on that basis, and outputs the target video stream after merging and encoding the acquired video frames, intermediate frames and audio data.
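A sketch of the encoder-side pull step under the same assumptions (the actual merge and encode, e.g. handing the frames to an FFmpeg-style encoder, is elided):

```python
import json
import redis

kv = redis.Redis()
mq = redis.Redis()

def next_frame_pair() -> tuple[dict, bytes, bytes]:
    """Pull the next (video frame, intermediate frame) pair for encoding."""
    _, raw = mq.blpop("insert_queue")
    meta = json.loads(raw)
    video_frame = kv.get(f"frame:{meta['frameno1']}")
    intermediate = kv.get(f"frame:{meta['frameno2']}")
    return meta, video_frame, intermediate
```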
In a specific implementation, when the encoder 30 has pulled a video frame and an intermediate frame from the target database into its local buffer, it cleans the data in the target database according to a first cleaning policy, and after the target video stream is output it cleans the target database again according to a second cleaning policy. The first cleaning policy is to delete the currently pulled intermediate frame from the target database, check whether the previous intermediate frame has already been cleaned, and if so, delete the currently pulled video frame from the target database as well; the second cleaning policy is to delete from the target database all data whose frame numbers are smaller than that of the currently pulled intermediate frame.
Specifically, the decoder 10 is further configured to number a video frame obtained by decoding, generate a second frame number corresponding to the video frame, generate video frame metadata corresponding to the video frame according to the second frame number, store the video frame including the second frame number in the target database, and store the audio data and the video frame metadata in a message queue;
each AI engine 20 is further configured to determine a third frame number corresponding to the pulled video frame, generate a fourth frame number corresponding to the intermediate frame according to the third frame number, generate frame insertion metadata corresponding to the intermediate frame according to the third frame number and the fourth frame number, store the intermediate frame including the fourth frame number in the target database, and store the frame insertion metadata in the message queue;
The encoder 30 is further configured to extract a fifth frame number and a sixth frame number from the obtained frame insertion metadata, and pull the video frame and the intermediate frame corresponding to the fifth frame number and the sixth frame number, respectively, from the target database.
In a specific implementation, preferably, the decoder 10 generates video frame metadata from the frame numbers of two adjacent video frames, so that an AI engine 20 can obtain the reference video frames of a frame insertion task, i.e. directly pull the two video frames used to generate the intermediate frame; the AI engine 20 in turn generates frame insertion metadata from the frame numbers of the obtained video frame and the generated intermediate frame, so that the encoder 30 can directly pull the two frames to be merged and encoded. The first to sixth frame numbers serve only to distinguish and denote the frame numbers of different video frames or intermediate frames.
The distributed live broadcast frame insertion system in this embodiment includes: a decoder 10, an encoder 30, and a plurality of AI engines 20 respectively deployed on different hosts. The decoder 10 decodes the acquired video stream to obtain audio data, video frames and video frame metadata, stores the video frames in a target database, and stores the audio data and the video frame metadata in a message queue. Each AI engine 20 acquires the video frame metadata decoded by the decoder 10 from the message queue, pulls the video frame corresponding to the video frame metadata from the target database, generates an intermediate frame according to the pulled video frame, generates frame insertion metadata corresponding to the intermediate frame, stores the intermediate frame in the target database, and stores the frame insertion metadata in the message queue. The encoder 30 acquires the audio data decoded by the decoder 10 and the frame insertion metadata generated by the AI engines 20 from the message queue, pulls the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, merges and encodes the video frame, the intermediate frame and the audio data, and outputs the target video stream. In this manner, the plurality of distributed AI engines 20 execute the frame insertion tasks for high-quality video, so that many interpolated frames can be generated in a short time, the frame insertion throughput required in live broadcast scenarios is met, live frame insertion efficiency is improved, and the smoothness of live video is improved.
Referring to fig. 1, in the second embodiment of the distributed live broadcast frame insertion system according to the present invention, the encoder 30 is further configured to delete the intermediate frame from the target database when the video frame and the intermediate frame corresponding to the frame insertion metadata have been obtained through local buffering, record the first frame number corresponding to the intermediate frame in a preset integer array, search the preset integer array forward, based on the first frame number, for the previous intermediate frame number corresponding to the first frame number, and delete the video frame from the target database if the previous intermediate frame number corresponding to the first frame number is found to be recorded in the preset integer array.
It is understood that the encoder 30 acquires metadata from the message queue, pulls video frames from the target database according to the metadata, buffers them locally, and after buffering a certain amount or for a certain time, merges and encodes the audio data, the video frames and the intermediate frames to generate the output target video stream. The encoder 30 extracts the video frame and the intermediate frame from the target database in the form of a frame pair, and once a frame pair has been obtained through local buffering, the pair's intermediate frame is deleted from the target database directly.
It should be noted that when the decoder 10 assigns odd frame numbers to the decoded video frames and the AI engines 20 assign even numbers to the generated intermediate frames, the first frame number is an even number; when the decoder 10 assigns even frame numbers to the decoded video frames and the AI engines 20 assign odd numbers to the generated intermediate frames, the first frame number is an odd number. The preset integer array is an array set up in advance to record deleted frames. For example, a 512-element integer array named frameslot[512] is defined, and the deletion of frame X is recorded as frameslot[X % 512] = X (to save resources, this is not fully equivalent to the conventional notion of recording that frame X was deleted, since an older frame number mapping to the same slot is overwritten). Frames falling outside this range are usually the result of timeouts, so cleaning them directly is sufficient.
It should be understood that deleted frames are recorded in the preset integer array because this embodiment implements frame insertion in a distributed manner, so the intermediate frames are not necessarily generated in order. Each video frame serves as reference data for generating both the intermediate frame after it and the intermediate frame before it; only when the first frame number of the one intermediate frame and the frame number of the previous intermediate frame are both recorded in the preset integer array is the video frame deleted from the target database. The search then continues forward, looking for and cleaning any video frames lying between two adjacent intermediate frames that have already been cleaned.
The method of this embodiment is described below with an example. Referring to fig. 2, fig. 2 is a schematic diagram of video frame cleaning in an embodiment of the distributed live broadcast frame insertion system according to the present invention. In this example, video frames carry odd frame numbers and intermediate frames carry even frame numbers; the rectangles represent frames arranged in order, and an arrow below a rectangle marks the corresponding frame as deleted. The frame pair currently buffered locally is (7,8). Starting from frame number 7 of the current pair, the even frame 8 of the pair can be deleted from the target database directly, and the deletion of frame number 8 is recorded as frameslot[8 % 512] = 8. Searching forward from the starting frame number: since the previous even frame 6 is recorded as deleted, frame 7 in the target database is deleted and recorded; and since frame 4 is recorded as deleted while frame 5 is not, frame 5 in the target database is also deleted and recorded.
The encoder 30 is further configured to search the preset integer array backward, based on the first frame number, for the next intermediate frame number corresponding to the first frame number, and delete the video frame following the intermediate frame from the target database if the next intermediate frame number corresponding to the first frame number is recorded in the preset integer array.
It should be understood that when the frame number of the intermediate frame and the frame number of the next intermediate frame are both recorded in the preset integer array, the video frame immediately following the intermediate frame is deleted from the target database. Referring to fig. 2, searching backward from the starting frame number: since the next even frame 10 is recorded as deleted, frame 9 in the target database is deleted and recorded; and since frame 12 is recorded as deleted while frame 11 is not, frame 11 in the target database is also deleted and recorded.
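A sketch of this cleaning logic using the odd-video / even-intermediate numbering of fig. 2 (the 512-slot ring follows the frameslot[X % 512] = X rule above; the kv client and key names carry over from the earlier sketches, and only a single step of the chained search is shown):

```python
SLOTS = 512
frameslot = [-1] * SLOTS            # ring of deletion records, frameslot[x % 512] = x

def record_deleted(frame_no: int) -> None:
    frameslot[frame_no % SLOTS] = frame_no

def is_deleted(frame_no: int) -> bool:
    return frameslot[frame_no % SLOTS] == frame_no

def clean_frame_pair(kv, video_no: int, inter_no: int) -> None:
    """First cleaning policy, run once the pair (video_no, inter_no) is buffered."""
    kv.delete(f"frame:{inter_no}")      # the intermediate frame is safe to drop
    record_deleted(inter_no)
    # Forward check: a video frame may be dropped only once both of its
    # neighbouring intermediate frames are recorded as deleted.
    if is_deleted(video_no - 1):        # previous intermediate frame, e.g. 6 for pair (7,8)
        kv.delete(f"frame:{video_no}")
        record_deleted(video_no)
    # Backward check: likewise for the video frame after this intermediate frame.
    if is_deleted(inter_no + 2):        # next intermediate frame, e.g. 10
        kv.delete(f"frame:{inter_no + 1}")   # the video frame in between, e.g. 9
        record_deleted(inter_no + 1)
```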
In a specific implementation, the cleaning of frames is executed in parallel; with an extremely low probability a parallel-access anomaly occurs, causing a frame to be missed by the cleaning.
The encoder 30 is further configured to delete the video frames and the intermediate frames with frame numbers smaller than the first frame number in the target database after the target video stream is output.
It should be noted that this embodiment provides a cleaning process for timed-out frames: after the current frame has been encoded into the target video stream and output, all frames in the target database with frame numbers smaller than that of the current frame are cleaned.
In this embodiment, when the encoder 30 has obtained through local buffering the video frame and the intermediate frame corresponding to the frame insertion metadata, the intermediate frame is deleted from the target database, the first frame number corresponding to the intermediate frame is recorded in the preset integer array, the preset integer array is searched forward, based on the first frame number, for the previous intermediate frame number corresponding to the first frame number, and if the previous intermediate frame number is recorded in the preset integer array, the video frame is deleted from the target database. In this manner, video frames and intermediate frames that have already been encoded and output are deleted, saving memory resources and improving the utilization efficiency of memory resources, while avoiding the impact of parallel-access anomalies on frame insertion efficiency during distributed live frame insertion, further improving live frame insertion efficiency and the smoothness of live video.
Referring to fig. 1, in a third embodiment of the distributed live broadcast frame insertion system according to the present invention, the decoder 10 is further configured to number a video frame obtained by decoding, generate a first odd frame number corresponding to the video frame, and generate video frame metadata corresponding to the video frame according to the first odd frame number;
each AI engine 20 is further configured to determine a second odd frame number corresponding to the pulled video frame, generate a first even frame number corresponding to the intermediate frame according to an even number adjacent to the second odd frame number, and generate frame insertion metadata corresponding to the intermediate frame according to the second odd frame number and the first even frame number.
It should be understood that the decoder 10 in this embodiment assigns an odd frame number to every video frame, for example 1, 3, 5 ..., and generates video frame metadata from the frame numbers of two adjacent video frames, for example generating {frameno1:1, frameno2:3, pts1:1234567, pts2:1234569} from frame numbers 1 and 3 of two adjacent video frames.
It should be noted that after generating an intermediate frame from two adjacent video frames, the AI engine 20 in this embodiment takes the even number lying between the two odd frame numbers of the adjacent video frames as the frame number of the intermediate frame, and generates the frame insertion metadata from the odd frame number of the earlier of the two video frames and the even frame number of the intermediate frame. For example, if the video frame metadata the AI engine 20 acquires from the message queue is {frameno1:1, frameno2:3, pts1:1234567, pts2:1234569}, then 2 becomes the even frame number of the generated intermediate frame, and {frameno1:1, frameno2:2, pts1:1234567, pts2:1234568} is stored in the message queue. In this embodiment, the video frames of the original video stream and the intermediate frames generated by the AI engines 20 are distinguished by odd and even numbers, which makes it easy for the encoder 30 to distinguish, encode and order the data, further improving data processing efficiency and the smoothness of the live video.
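A small sketch of this odd/even numbering convention (illustrative; the midpoint PTS assumes evenly spaced timestamps, as in the example above):

```python
def insertion_metadata_odd_scheme(meta: dict) -> dict:
    """Map a task over odd-numbered frames, e.g. (1, 3), to insertion metadata (1, 2)."""
    assert meta["frameno1"] % 2 == 1 and meta["frameno2"] % 2 == 1
    mid_no = meta["frameno1"] + 1                  # the even number in between
    mid_pts = (meta["pts1"] + meta["pts2"]) // 2   # e.g. 1234568
    return {"frameno1": meta["frameno1"], "frameno2": mid_no,
            "pts1": meta["pts1"], "pts2": mid_pts}

# Example from the text:
# insertion_metadata_odd_scheme({"frameno1": 1, "frameno2": 3,
#                                "pts1": 1234567, "pts2": 1234569})
# => {"frameno1": 1, "frameno2": 2, "pts1": 1234567, "pts2": 1234568}
```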
Referring to fig. 1, in a fourth embodiment of the distributed live broadcast frame insertion system according to the present invention, the decoder 10 is further configured to number a video frame obtained by decoding, generate a second even frame number corresponding to the video frame, and generate video frame metadata corresponding to the video frame according to the second even frame number;
each AI engine 20 is further configured to determine a third even frame number corresponding to the pulled video frame, generate a third odd frame number corresponding to the intermediate frame according to an odd number adjacent to the third even frame number, and generate frame insertion metadata corresponding to the intermediate frame according to the third even frame number and the third odd frame number.
It should be understood that the decoder 10 in this embodiment assigns an even frame number to every video frame, for example 0, 2, 4 ..., and generates video frame metadata from the frame numbers of two adjacent video frames, for example generating {frameno1:2, frameno2:4, pts1:1234568, pts2:1234570} from frame numbers 2 and 4 of two adjacent video frames.
It should be noted that after generating an intermediate frame from two adjacent video frames, the AI engine 20 in this embodiment takes the odd number lying between the two even frame numbers of the adjacent video frames as the frame number of the intermediate frame, and generates the frame insertion metadata from the even frame number of the earlier of the two video frames and the odd frame number of the intermediate frame. For example, if the video frame metadata the AI engine 20 acquires from the message queue is {frameno1:2, frameno2:4, pts1:1234568, pts2:1234570}, then 3 becomes the odd frame number of the generated intermediate frame, and {frameno1:2, frameno2:3, pts1:1234568, pts2:1234569} is stored in the message queue. In this embodiment, the video frames of the original video stream and the intermediate frames generated by the AI engines 20 are distinguished by even and odd numbers, which makes it easy for the encoder 30 to distinguish, encode and order the data, further improving data processing efficiency and the smoothness of the live video.
Referring to fig. 1, in a fifth embodiment of the distributed live broadcast frame insertion system according to the present invention, the encoder 30 is further configured to obtain a video absolute time and an audio absolute time, determine whether the video absolute time is greater than the audio absolute time, obtain, from the message queue, audio data decoded by the decoder 10 when the video absolute time is greater than the audio absolute time, and update the audio absolute time according to a display timestamp corresponding to the audio data.
It should be understood that two timers are defined in advance in this embodiment: vtime, which records the video absolute time of the current video, and atime, which records the audio absolute time of the current video, both initialized to 0. When vtime > atime, audio data decoded by the decoder 10 is acquired from the message queue in queue order, the corresponding time is calculated from the display timestamp of the audio data and assigned to atime to update it, and the process returns to the step of judging whether the video absolute time is greater than the audio absolute time.
In a specific implementation, the time corresponding to the audio data is calculated according to the following formula:
atime=packet.pts*timebase;
where packet.pts is the display timestamp corresponding to the audio data, and timebase is a preset time base.
The encoder 30 is further configured to, when the video absolute time is smaller than the audio absolute time, obtain the frame insertion metadata generated by the AI engine 20 from the message queue, pull the corresponding video frame and intermediate frame from the target database according to the frame insertion metadata, and update the video absolute time according to the display timestamp corresponding to the video frame or the intermediate frame.
It should be noted that when atime > vtime, the frame insertion metadata generated by the AI engines 20 is acquired from the message queue in queue order, the corresponding video frame and intermediate frame are pulled from the target database, the one of the two with the larger display timestamp is selected as the basis for updating the video absolute time, and the time corresponding to the video data is calculated by the following formula:
vtime=frame.pts*timebase;
where frame.pts is the display timestamp corresponding to the video frame or the intermediate frame, and timebase is a preset time base.
The vtime is then updated with the calculated time corresponding to the video data, and the process returns to the step of judging whether the video absolute time is greater than the audio absolute time. In this way, the scheduling of audio data and video data during distributed live frame insertion is realized, and the encoding efficiency of the encoder 30 is improved.
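A sketch of that scheduling loop (the pts * timebase products follow the two formulas above; the 1/90000 time base and the queue helpers are assumptions carried over from the earlier sketches):

```python
from fractions import Fraction

TIMEBASE = Fraction(1, 90000)       # illustrative MPEG-TS-style time base

def mux_loop(pull_audio, pull_video_pair, emit):
    """Interleave audio packets and video/intermediate frames by absolute time."""
    vtime = Fraction(0)             # video absolute time
    atime = Fraction(0)             # audio absolute time
    while True:
        if vtime > atime:
            packet = pull_audio()                    # next audio packet, queue order
            atime = packet["pts"] * TIMEBASE         # atime = packet.pts * timebase
            emit(packet)
        else:
            meta, video, inter = pull_video_pair()   # see next_frame_pair() above
            # The larger display timestamp of the pair drives the video clock.
            vtime = max(meta["pts1"], meta["pts2"]) * TIMEBASE  # vtime = frame.pts * timebase
            emit(video)
            emit(inter)
```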
Referring to fig. 3, fig. 3 is a flowchart illustrating a distributed live frame insertion method according to a first embodiment of the present invention.
The method of the distributed live broadcast frame insertion method of the embodiment is applied to the distributed live broadcast frame insertion system, and the distributed live broadcast frame insertion system comprises: a decoder, an encoder, and a plurality of AI engines respectively deployed on different hosts, the method comprising:
step S10: the decoder decodes the acquired video stream to obtain audio data, video frames and video frame metadata, stores the video frames in a target database, and stores the audio data and the video frame metadata in a message queue.
Step S20: each AI engine acquires, from the message queue, the video frame metadata decoded by the decoder, pulls a video frame corresponding to the video frame metadata from the target database, generates an intermediate frame according to the pulled video frame, generates frame insertion metadata corresponding to the intermediate frame, stores the intermediate frame into the target database, and stores the frame insertion metadata into the message queue.
Step S30: and the encoder acquires the audio data decoded by the decoder and the frame insertion metadata generated by the AI engine from the message queue, pulls the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, combines and encodes the video frame, the intermediate frame and the audio data, and outputs a target video stream.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
The distributed live broadcast frame insertion system in this embodiment includes: a decoder, an encoder and a plurality of AI engines respectively deployed on different hosts. The decoder decodes the acquired video stream to obtain audio data, video frames and video frame metadata, stores the video frames into a target database, and stores the audio data and the video frame metadata into a message queue. Each AI engine acquires the video frame metadata decoded by the decoder from the message queue, pulls the video frame corresponding to the video frame metadata from the target database, generates an intermediate frame according to the pulled video frame, generates frame insertion metadata corresponding to the intermediate frame, stores the intermediate frame into the target database, and stores the frame insertion metadata into the message queue. The encoder acquires the audio data decoded by the decoder and the frame insertion metadata generated by the AI engines from the message queue, pulls the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, merges and encodes the video frame, the intermediate frame and the audio data, and outputs the target video stream. In this manner, the distributed AI engines execute the frame insertion tasks for high-quality video, so that many interpolated frames can be generated in a short time, the frame insertion throughput required in live broadcast scenarios is met, live frame insertion efficiency is improved, and the smoothness of live video is improved.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the distributed live frame insertion system provided in any embodiment of the present invention, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A distributed live frame insertion system, comprising: the decoder, the encoder and the AI engines are respectively arranged on different hosts;
the decoder is used for decoding the acquired video stream to obtain audio data, video frames and video frame metadata, storing the video frames into a target database and storing the audio data and the video frame metadata into a message queue;
each AI engine is used for acquiring video frame metadata obtained by decoding of the decoder from the message queue, pulling a video frame corresponding to the video frame metadata from the target database, generating an intermediate frame according to the pulled video frame, generating frame insertion metadata corresponding to the intermediate frame, storing the intermediate frame into the target database, and storing the frame insertion metadata into the message queue;
and the encoder is used for acquiring the audio data decoded by the decoder and the frame insertion metadata generated by the AI engine from the message queue, pulling the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, merging and encoding the video frame, the intermediate frame and the audio data, and outputting a target video stream.
2. The distributed live broadcast frame insertion system according to claim 1, wherein the encoder is further configured to delete the intermediate frame from the target database when the video frame and the intermediate frame corresponding to the frame insertion metadata have been obtained through local buffering, record a first frame number corresponding to the intermediate frame in a preset integer array, search the preset integer array forward, based on the first frame number, for a previous intermediate frame number corresponding to the first frame number, and delete the video frame from the target database if the previous intermediate frame number corresponding to the first frame number is found to be recorded in the preset integer array.
3. The distributed live broadcast frame insertion system according to claim 2, wherein the encoder is further configured to search the preset integer array backward, based on the first frame number, for a next intermediate frame number corresponding to the first frame number, and delete the video frame following the intermediate frame from the target database if the next intermediate frame number corresponding to the first frame number is found to be recorded in the preset integer array.
4. The distributed live broadcast frame insertion system of claim 2, wherein the encoder is further configured to delete the video frames and intermediate frames in the target database having frame numbers smaller than the first frame number after outputting the target video stream.
5. The distributed live broadcast frame insertion system according to claim 1, wherein the decoder is further configured to number a video frame obtained by decoding, generate a second frame number corresponding to the video frame, generate video frame metadata corresponding to the video frame according to the second frame number, store the video frame including the second frame number in the target database, and store the audio data and the video frame metadata in a message queue;
each AI engine is further configured to determine a third frame number corresponding to the pulled video frame, generate a fourth frame number corresponding to the intermediate frame according to the third frame number, generate frame insertion metadata corresponding to the intermediate frame according to the third frame number and the fourth frame number, store the intermediate frame including the fourth frame number in the target database, and store the frame insertion metadata in the message queue;
the encoder is further configured to extract a fifth frame number and a sixth frame number from the obtained frame insertion metadata, and pull the video frame and the intermediate frame corresponding to the fifth frame number and the sixth frame number respectively from the target database.
6. The distributed live broadcast frame insertion system according to claim 5, wherein the decoder is further configured to number a video frame obtained by decoding, generate a first odd frame number corresponding to the video frame, and generate video frame metadata corresponding to the video frame according to the first odd frame number;
each AI engine is further configured to determine a second odd frame number corresponding to the pulled video frame, generate a first even frame number corresponding to the intermediate frame according to an even number adjacent to the second odd frame number, and generate frame insertion metadata corresponding to the intermediate frame according to the second odd frame number and the first even frame number.
7. The distributed live broadcast frame insertion system according to claim 5, wherein the decoder is further configured to number each video frame obtained by decoding to generate a second even frame number corresponding to the video frame, and generate the video frame metadata corresponding to the video frame according to the second even frame number;
each AI engine is further configured to determine a third even frame number corresponding to the pulled video frame, generate a third odd frame number for the intermediate frame from an odd number adjacent to the third even frame number, and generate the frame insertion metadata corresponding to the intermediate frame according to the third even frame number and the third odd frame number.
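Claims 6 and 7 are mirror-image numbering schemes whose point is that decoded and generated frames can never collide on a frame number, even with many AI engines working independently on different hosts. A sketch, reading "adjacent" as +1 (one plausible interpretation, since the claims do not pin the direction down):

```python
# Sketch of the numbering schemes in claims 6 and 7: the decoder hands out only
# odd (or only even) numbers, so every adjacent even (or odd) number is free for
# the intermediate frame an AI engine generates, with no cross-host coordination.
from itertools import count

def decoder_numbers(odd: bool = True):
    """Claim 6 numbering when odd=True; claim 7 (even) when odd=False."""
    return count(1 if odd else 2, 2)   # yields 1, 3, 5, ... or 2, 4, 6, ...

def inter_frame_number(video_frame_no: int) -> int:
    # The adjacent number of opposite parity; +1 places the intermediate frame
    # between decoded frames n and n + 2.
    return video_frame_no + 1
```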
8. The distributed live broadcast frame insertion system according to claim 1, wherein the encoder is further configured to obtain a video absolute time and an audio absolute time, determine whether the video absolute time is greater than the audio absolute time, and, when the video absolute time is greater than the audio absolute time, obtain the audio data decoded by the decoder from the message queue and update the audio absolute time according to the display timestamp corresponding to the audio data.
9. The distributed live broadcast frame insertion system according to claim 8, wherein the encoder is further configured to, when the video absolute time is less than the audio absolute time, obtain the frame insertion metadata generated by the AI engines from the message queue, pull the corresponding video frame and intermediate frame from the target database according to the frame insertion metadata, and update the video absolute time according to the display timestamp corresponding to the video frame or the intermediate frame.
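Claims 8 and 9 describe lip-sync pacing: the encoder tracks one clock per stream and always services the stream that lags. A minimal sketch, with deques standing in for the message-queue and database reads and a pts field playing the display-timestamp role (both stand-ins are assumptions):

```python
# Hedged sketch of the pacing in claims 8-9: keep a video clock and an audio
# clock, consume from whichever stream is behind, and advance that clock from
# the display timestamp (PTS) of the item just consumed.
from collections import deque

def interleave(audio_q: deque, video_q: deque) -> list:
    video_time = audio_time = 0.0
    out = []
    while audio_q or video_q:
        # Claim 8: video has run ahead (or video is exhausted) -> pull audio.
        if (video_time > audio_time and audio_q) or not video_q:
            item = audio_q.popleft()
            audio_time = item["pts"]
        # Claim 9: otherwise pull a video or intermediate frame.
        else:
            item = video_q.popleft()
            video_time = item["pts"]
        out.append(item)
    return out
```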
10. A distributed live broadcast frame insertion method, applied to the distributed live broadcast frame insertion system according to any one of claims 1-9, the system comprising a decoder, an encoder, and a plurality of AI engines respectively deployed on different hosts, the method comprising:
the decoder decodes the acquired video stream to obtain audio data, video frames and video frame metadata, stores the video frames into a target database, and stores the audio data and the video frame metadata into a message queue;
each AI engine acquires, from the message queue, the video frame metadata obtained by the decoder through decoding, pulls the video frame corresponding to the video frame metadata from the target database, generates an intermediate frame from the pulled video frame, generates frame insertion metadata corresponding to the intermediate frame, stores the intermediate frame in the target database, and stores the frame insertion metadata in the message queue;
and the encoder acquires, from the message queue, the audio data decoded by the decoder and the frame insertion metadata generated by the AI engines, pulls the video frame and the intermediate frame corresponding to the frame insertion metadata from the target database, merges and encodes the video frame, the intermediate frame and the audio data, and outputs a target video stream.
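As an aid to reading claim 10, here is the whole method collapsed into a single-process simulation. The queue, dict and string "frames" are stand-ins, and the odd/even numbering follows claim 6; a real deployment would place each stage on its own host behind a distributed queue and store, which the claims leave unspecified.

```python
# End-to-end sketch of the claim-10 method: queue.Queue plays the message
# queue, a dict the target database, and strings the frames.
import queue

def run_pipeline(num_frames: int = 4) -> list:
    mq, db = queue.Queue(), {}
    # Decoder: number decoded frames oddly (claim 6) and publish their metadata.
    for i in range(num_frames):
        no = 2 * i + 1
        db[no] = f"frame{no}"
        mq.put({"frame_no": no})
    mq.put(None)                               # end-of-stream marker
    # AI engine: synthesize an intermediate frame for each consecutive pair
    # and give it the adjacent even number.
    metas, prev = [], None
    while (m := mq.get()) is not None:
        if prev is not None:
            db[prev + 1] = f"interp({db[prev]},{db[m['frame_no']]})"
            metas.append((prev, prev + 1))     # frame insertion metadata
        prev = m["frame_no"]
    # Encoder: pull pairs by the numbers in the metadata and emit in order.
    out = []
    for video_no, inter_no in metas:
        out += [db[video_no], db[inter_no]]
    return out

print(run_pipeline())   # ['frame1', 'interp(frame1,frame3)', 'frame3', ...]
```

Running it prints the decoded and interpolated frames interleaved in display order, which is exactly the merge the encoder performs before encoding.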

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210381421.8A 2022-04-12 2022-04-12 Distributed live broadcast frame inserting system and method

Publications (2)

Publication Number Publication Date
CN114827663A (en) 2022-07-29
CN114827663B (en) 2023-11-21

Family

ID=82534866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210381421.8A Distributed live broadcast frame inserting system and method 2022-04-12 2022-04-12 Active

Country Status (1)

Country Link
CN (1) CN114827663B (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010023436A1 (en) * 1998-09-16 2001-09-20 Anand Srinivasan Method and apparatus for multiplexing seperately-authored metadata for insertion into a video data stream
US20100080299A1 (en) * 2008-09-30 2010-04-01 Sony Corporation Frame frequency conversion apparatus, frame frequency conversion method, program for achieving the method, computer readable recording medium recording the program, motion vector detection apparatus, and prediction coefficient generation apparatus
US20160088308A1 (en) * 2010-02-03 2016-03-24 Imagination Technologies Limited Method and system for staggered parallelized video decoding
CN106341696A (en) * 2016-09-28 2017-01-18 北京奇虎科技有限公司 Live video stream processing method and device
US20180132009A1 (en) * 2016-11-08 2018-05-10 Ati Technologies Ulc Video frame rate conversion using streamed metadata
CN109891891A (en) * 2016-11-08 2019-06-14 Ati科技无限责任公司 It is converted using the video frame rate of streaming metadata
US20180192057A1 (en) * 2016-12-30 2018-07-05 Axis Ab Block level update rate control based on gaze sensing
US10225587B1 (en) * 2017-12-19 2019-03-05 Novatek Microelectronics Corp. Motion estimation method for frame rate converter and video processor using the same
CN109086457A (en) * 2018-09-05 2018-12-25 华南理工大学 A kind of metadata service system that distributed video frame is read at random and working method
CN110267098A (en) * 2019-06-28 2019-09-20 连尚(新昌)网络科技有限公司 A kind of method for processing video frequency and terminal
US20210160565A1 (en) * 2019-11-27 2021-05-27 Sagemcom Broadband Sas Decoder equipment delivering metadata to auxiliary equipment in order to control it
CN111010589A (en) * 2019-12-19 2020-04-14 腾讯科技(深圳)有限公司 Live broadcast method, device, equipment and storage medium based on artificial intelligence
CN111641835A (en) * 2020-05-19 2020-09-08 Oppo广东移动通信有限公司 Video processing method, video processing device and electronic equipment
CN112584077A (en) * 2020-12-11 2021-03-30 北京百度网讯科技有限公司 Video frame interpolation method and device and electronic equipment
US11250888B1 (en) * 2020-12-14 2022-02-15 Shanghai Imilab Technology Co., Ltd. Flash memory and method for storing and retrieving embedded audio video data
CN112788235A (en) * 2020-12-31 2021-05-11 深圳追一科技有限公司 Image processing method, image processing device, terminal equipment and computer readable storage medium
CN113014937A (en) * 2021-02-24 2021-06-22 北京百度网讯科技有限公司 Video frame insertion method, device, equipment and storage medium
CN113014936A (en) * 2021-02-24 2021-06-22 北京百度网讯科技有限公司 Video frame insertion method, device, equipment and storage medium
CN215453147U (en) * 2021-08-13 2022-01-07 北京都是科技有限公司 Live broadcast equipment and live broadcast system
CN114245035A (en) * 2021-12-17 2022-03-25 深圳市慧鲤科技有限公司 Video generation method and device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林传健; 邓炜; 童同; 高钦泉: "Blurred video frame interpolation method based on deep voxel flow" (基于深度体素流的模糊视频插帧方法), Journal of Computer Applications (计算机应用), no. 03 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115883869A (en) * 2022-11-28 2023-03-31 江汉大学 Swin transform-based video frame interpolation model processing method, device and equipment
CN115883869B (en) * 2022-11-28 2024-04-19 江汉大学 Processing method, device and processing equipment of video frame insertion model based on Swin converter
CN116886961A (en) * 2023-09-06 2023-10-13 中移(杭州)信息技术有限公司 Distributed live video frame inserting method, device, system and storage medium
CN116886961B (en) * 2023-09-06 2023-12-26 中移(杭州)信息技术有限公司 Distributed live video frame inserting method, device, system and storage medium

Also Published As

Publication number Publication date
CN114827663B (en) 2023-11-21

Similar Documents

Publication Title
US6944629B1 (en) Method and device for managing multimedia file
CN114827663A (en) Distributed live broadcast frame insertion system and method
WO2018076998A1 (en) Method and device for generating playback video file
CA2766936C (en) Time compressing video content
CN111447455A (en) Live video stream playback processing method and device and computing equipment
TW201206202A (en) Moving image prediction encoding device, moving image prediction encoding method, moving image prediction encoding program, moving image prediction decoding device, moving image prediction decoding method, and moving image prediction decoding program
CN112653904B (en) Rapid video clipping method based on PTS and DTS modification
JP2007336263A (en) Image processing method, apparatus, and program
CN113225585A (en) Video definition switching method and device, electronic equipment and storage medium
JP4548226B2 (en) Data processing method, apparatus and program thereof
US20200371994A1 (en) Methods and systems for providing file data for a media file
EP1455360A2 (en) Disc apparatus, disc recording method, disc playback method, recording medium, and program
CN112019878B (en) Video decoding and editing method, device, equipment and storage medium
CN102118633B (en) Method, device and system for playing video files
JP2008166895A (en) Video display device, its control method, program and recording medium
EP1643764A1 (en) Video reproducing apparatus
JP4154799B2 (en) Compressed video editing apparatus and storage medium
JP3547210B2 (en) Video data generator with sound
JP2003023600A (en) Image processor, animation recording/playback equipment, image processing method, program, and computer- readable storage medium
CN111447486A (en) Reverse playing method for forward pushing of historical code stream
CN116939291B (en) Video quick stripping method and related device
CN114245231B (en) Multi-video synchronous skipping method, device and equipment and readable storage medium
CN113315931B (en) HLS stream-based data processing method and electronic equipment
JP4279622B2 (en) Data flow reproduction method and apparatus, and related system and signal
JP5326724B2 (en) Video processing apparatus and control program for video processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant