CN116866665A - Video playing method and device, electronic equipment and storage medium - Google Patents

Video playing method and device, electronic equipment and storage medium

Info

Publication number
CN116866665A
CN116866665A
Authority
CN
China
Prior art keywords: frame, video frame, video, extracted, frames
Legal status: Granted
Application number
CN202311135574.5A
Other languages: Chinese (zh)
Other versions: CN116866665B (en)
Inventor
李剑戈
焦阳
曹震
周能
赵天远
Current Assignee: China Securities Co Ltd
Original Assignee: China Securities Co Ltd
Application filed by China Securities Co Ltd filed Critical China Securities Co Ltd
Priority to CN202311135574.5A priority Critical patent/CN116866665B/en
Publication of CN116866665A publication Critical patent/CN116866665A/en
Application granted granted Critical
Publication of CN116866665B publication Critical patent/CN116866665B/en
Status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics


Abstract

The embodiment of the invention provides a video playing method and device, an electronic device, and a storage medium, relating to the technical field of image processing. The video playing method comprises the following steps: extracting frames from a video according to the frame rate of the video, and obtaining frame data and frame sequence numbers of the extracted video frames; determining a frame classification corresponding to an extracted video frame based on the obtained frame data; if the frame classification indicates that the extracted video frame is a video frame that needs to be repaired, determining, from the normal frames of the video, a reference video frame to be consulted when repairing the extracted video frame; obtaining a first image feature of the reference video frame; performing feature diffusion processing on the first image feature based on a cross-attention mechanism, and generating a repair frame of the extracted video frame based on the processing result; and playing the repair frame according to the frame sequence number of the extracted video frame. The scheme provided by the embodiment of the invention enables complete and continuous video playback.

Description

Video playing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a video playing method, a video playing device, an electronic device, and a storage medium.
Background
With the development of internet technology, people's habits of obtaining information are rapidly shifting from reading text and images to watching video. Long-video and short-video APPs (applications) of all kinds have emerged one after another, and fields such as finance, news, and self-media have begun moving to video. For various reasons such as network fluctuation and storage medium performance, video frames may be lost, or noise frames may appear, during video data transmission and storage. When such a damaged video is played, stuttering, frame skipping, and irregular noise regions appear in the picture, seriously affecting the playing effect and the user's viewing experience. In view of this, it is necessary to repair lost video frames and noise frames in the video so as to play a complete, continuous video for the user.
Disclosure of Invention
The embodiment of the invention aims to provide a video playing method, a video playing device, electronic equipment and a storage medium, so as to realize complete and continuous video playing. The specific technical scheme is as follows:
according to a first aspect of an embodiment of the present invention, there is provided a video playing method, including:
extracting frames from the video according to the frame rate of the video, and obtaining frame data and frame sequence numbers of extracted video frames;
Determining a frame classification corresponding to the extracted video frame based on the obtained frame data;
if the frame classification characterizes that the extracted video frame is a video frame needing to be repaired, determining a reference video frame needing to be referenced for repairing the extracted video frame from normal frames of the video, wherein the normal frames are: video frames that do not require repair;
obtaining a first image feature of the reference video frame;
performing feature diffusion processing on the first image features based on a cross attention mechanism, and generating a repair frame of the extracted video frame based on a processing result;
and playing the repair frame according to the frame sequence number of the extracted video frame.
Optionally, if the frame classification characterizes that the extracted video frame is a video frame that needs to be repaired, determining, from normal frames of the video, a reference video frame that needs to be referenced for repairing the extracted video frame includes:
if the frame classification characterizes that the extracted video frame is a noise frame, selecting a leading video frame nearest to the extracted video frame from normal frames of the video;
and determining a reference video frame to be referred to for repairing the extracted video frame based on the preamble video frame.
Optionally, the determining, based on the preamble video frame, a reference video frame to be referred to for repairing the extracted video frame includes:
Detecting whether scene switching exists between the preamble video frame and the extracted video frame;
if so, selecting a subsequent video frame which is nearest to the extracted video frame and has no scene switching from the normal frames of the video, and taking the subsequent video frame as a reference video frame which needs to be referred to for repairing the extracted video frame.
Optionally, the performing feature diffusion processing on the first image feature based on the cross-attention mechanism, and generating a repair frame of the extracted video frame based on a processing result includes:
obtaining a second image feature of the extracted video frame;
performing cross attention calculation on the first image feature and the second image feature;
and taking the reference video frame as an auxiliary control condition, denoising the extracted video frame based on the cross attention calculation result, and generating a repair frame of the extracted video frame.
Optionally, the obtaining the first image feature of the reference video frame includes:
inputting the reference video frame, as control-path input data of a pre-trained first frame repair model, into a first self-coding sub-network in the first frame repair model, and obtaining a first image feature of the reference video frame output by the first self-coding sub-network, wherein the first frame repair model further comprises: a first forward diffusion layer, a first cross-attention computation layer, a first reverse diffusion layer, and a first decoder network;
The obtaining a second image feature of the extracted video frame includes:
inputting the extracted video frames serving as main path input data of the first frame repair model into the first self-coding sub-network to obtain second image features of the extracted video frames output by the first self-coding sub-network;
the performing cross-attention calculation on the first image feature and the second image feature, and the denoising, with the reference video frame as an auxiliary control condition, of the extracted video frame based on the cross-attention calculation result to generate a repair frame of the extracted video frame, include:
inputting the second image features into the first forward diffusion layer to obtain first noise features of the extracted video frames output by the first forward diffusion layer;
inputting the first noise feature and the first image feature into the first cross attention computing layer to obtain a first cross attention computing result output by the first cross attention computing layer;
inputting the first cross attention calculation result into the first reverse diffusion layer to obtain a first denoising feature which is output after the first reverse diffusion layer performs noise reduction treatment on the first cross attention calculation result;
And inputting the first denoising characteristic into the first decoder network to obtain a repair frame of the extracted video frame output by the first decoder network.
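For illustration, the data flow just described — a shared self-coding sub-network, forward diffusion on the main path, cross-attention against the control path, reverse diffusion, and decoding — can be sketched as follows. This is a minimal, non-limiting sketch in PyTorch: the module sizes, the single-step noise schedule, and all names (FrameRepairModel, beta, and so on) are illustrative assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn

class FrameRepairModel(nn.Module):
    """Structural sketch of the first frame repair model; sizes are illustrative."""
    def __init__(self, channels=64):
        super().__init__()
        # first self-coding sub-network (shared by the main path and the control path)
        self.encoder = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # first cross-attention computation layer
        self.cross_attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4,
                                                batch_first=True)
        # first reverse-diffusion layer (denoiser); a real model would use a U-Net
        self.denoiser = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # first decoder network
        self.decoder = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, main_frame, reference_frame, beta=0.1):
        # second image feature (main path) and first image feature (control path)
        feat_main = self.encoder(main_frame)
        feat_ref = self.encoder(reference_frame)
        # forward diffusion: one illustrative noise-adding step
        noisy = (1 - beta) ** 0.5 * feat_main + beta ** 0.5 * torch.randn_like(feat_main)
        # flatten spatial dimensions so attention runs over pixel positions
        b, c, h, w = noisy.shape
        q = noisy.flatten(2).transpose(1, 2)      # queries: noisy main-path features
        kv = feat_ref.flatten(2).transpose(1, 2)  # keys/values: reference features
        attn_out, _ = self.cross_attn(q, kv, kv)
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)
        # reverse diffusion (noise reduction), then decode into the repair frame
        return self.decoder(self.denoiser(attn_out))

repair = FrameRepairModel()
fixed = repair(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```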
Optionally, if the frame classification characterizes that the extracted video frame is a video frame that needs to be repaired, determining, from normal frames of the video, a reference video frame that needs to be referenced for repairing the extracted video frame includes:
if the frame classification characterizes that the extracted video frame is a lost frame, selecting a leading video frame which is nearest to the extracted video frame and a trailing video frame which is nearest to the extracted video frame from normal frames of the video;
and determining a reference video frame to be referred to for repairing the extracted video frame based on the preceding video frame and the subsequent video frame.
Optionally, the determining, based on the preceding video frame and the subsequent video frame, a reference video frame to be referred to for repairing the extracted video frame includes:
detecting whether scene switching exists between the preceding video frame and the subsequent video frame;
if not, determining the preceding video frame and the subsequent video frame as reference video frames to be referred for repairing the extracted video frames;
and if the video frame exists, determining the preceding video frame or the subsequent video frame as a reference video frame which needs to be referred to for repairing the extracted video frame.
Optionally, the performing feature diffusion processing on the first image feature based on the cross-attention mechanism, and generating a repair frame of the extracted video frame based on a processing result includes:
obtaining a third image feature of the random noise image;
performing cross attention calculation on the first image feature and the third image feature;
and denoising the random noise image based on the cross attention calculation result, and generating a supplementary frame based on the denoising result as a repair frame of the extracted video frame.
Optionally, the obtaining the first image feature of the reference video frame includes:
inputting the reference video frame as control path input data of a pre-trained second frame repair model, and inputting the control path input data into a second self-coding sub-network in the second frame repair model to obtain a first image characteristic of the reference video frame output by the second self-coding sub-network, wherein the second frame repair model further comprises: a second forward diffusion layer, a second cross-attention computation layer, a second reverse diffusion layer, a second decoder network;
the obtaining a third image feature of the random noise image includes:
inputting a random noise image serving as main road input data of the second frame repair model into the second self-coding sub-network to obtain a third image characteristic of the random noise image output by the second self-coding sub-network;
the performing cross-attention calculation on the first image feature and the third image feature, denoising the random noise image based on the cross-attention calculation result, and generating a supplementary frame based on the denoising result as a repair frame of the extracted video frame, includes the following steps:
inputting the third image feature into the second forward diffusion layer to obtain a second noise feature of the extracted video frame output by the second forward diffusion layer;
inputting the second noisy feature and the first image feature into the second cross attention computing layer to obtain a second cross attention computing result output by the second cross attention computing layer;
inputting the second cross attention calculation result into the second reverse diffusion layer to obtain a second denoising feature which is output after the second reverse diffusion layer performs noise reduction processing on the second cross attention calculation result;
and inputting the second denoising feature into the second decoder network to obtain a repair frame of the extracted video frame output by the second decoder network.
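For the lost-frame case just described, the main-path input is a random noise image rather than a damaged frame; a hedged usage sketch, reusing the FrameRepairModel class assumed in the earlier sketch:

```python
import torch

# Lost frame: the main path receives a random noise image; the reference
# frame still steers generation through the cross-attention layer.
repair = FrameRepairModel()                 # assumed class from the earlier sketch
noise_image = torch.rand(1, 3, 64, 64)      # random noise image (main path)
reference = torch.rand(1, 3, 64, 64)        # nearest normal frame (control path)
supplementary_frame = repair(noise_image, reference)
```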
Optionally, the extracting frames from the video according to the frame rate of the video to obtain frame data and frame sequence numbers of the extracted video frames includes:
determining, according to the frame rate of the video, whether image data of a video frame exists at a corresponding position in the video;
if the video frame exists, extracting the image data recorded at the corresponding position, taking the image data as the frame data of the extracted video frame, and obtaining the frame number of the extracted video frame;
if not, determining the empty frame identification as the frame data of the extracted video frame, and obtaining the frame number of the extracted video frame.
According to a second aspect of an embodiment of the present invention, there is provided a video playing device, the device including:
the video frame extraction module is used for extracting frames of the video according to the frame rate of the video to obtain frame data and frame sequence numbers of extracted video frames;
the video frame classification module is used for determining frame classification corresponding to the extracted video frames based on the obtained frame data;
the reference video frame determining module is configured to determine, from normal frames of the video, a reference video frame to be referred to for repairing the extracted video frame, where the normal frames are: video frames that do not require repair;
a first image feature obtaining module, configured to obtain a first image feature of the reference video frame;
The repair frame generation module is used for carrying out feature diffusion processing on the first image features based on a cross attention mechanism and generating a repair frame of the extracted video frame based on a processing result;
and the video frame playing module is used for playing the repair frame according to the frame sequence number of the extracted video frame.
Optionally, the reference video frame determining module includes:
a first video frame selection sub-module, configured to select, in a case where the frame classification characterizes the extracted video frame as a noise frame, a leading video frame that is most adjacent to the extracted video frame from normal frames of the video;
and the first reference video frame determining submodule is used for determining a reference video frame which needs to be referred to for repairing the extracted video frame based on the preamble video frame.
Optionally, the first reference video frame determining submodule includes:
a first scene switching detection unit, configured to detect whether scene switching exists between the preamble video frame and the extracted video frame;
the first reference video frame determining unit is used for selecting a subsequent video frame which is nearest to the extracted video frame and does not have scene switching from normal frames of the video, and is used as a reference video frame to be referred for repairing the extracted video frame under the condition that scene switching exists between the preceding video frame and the extracted video frame.
Optionally, the repair frame generation module includes:
a second image feature obtaining unit configured to obtain a second image feature of the extracted video frame;
a first feature calculation unit configured to perform cross-attention calculation on the first image feature and the second image feature;
and the first repair frame generation unit is used for denoising the extracted video frame based on the cross attention calculation result by taking the reference video frame as an auxiliary control condition to generate a repair frame of the extracted video frame.
Optionally, the first image feature obtaining module is specifically configured to: input the reference video frame, as control-path input data of a pre-trained first frame repair model, into a first self-coding sub-network in the first frame repair model, and obtain a first image feature of the reference video frame output by the first self-coding sub-network, wherein the first frame repair model further comprises: a first forward diffusion layer, a first cross-attention computation layer, a first reverse diffusion layer, and a first decoder network;
the second image feature obtaining unit is specifically configured to: inputting the extracted video frames serving as main path input data of the first frame repair model into the first self-coding sub-network to obtain second image features of the extracted video frames output by the first self-coding sub-network;
The first feature calculating unit and the first repair frame generating unit are specifically configured to: inputting the second image features into the first forward diffusion layer to obtain first noise features of the extracted video frames output by the first forward diffusion layer; inputting the first noise feature and the first image feature into the first cross attention computing layer to obtain a first cross attention computing result output by the first cross attention computing layer; inputting the first cross attention calculation result into the first reverse diffusion layer to obtain a first denoising feature which is output after the first reverse diffusion layer performs noise reduction treatment on the first cross attention calculation result; and inputting the first denoising characteristic into the first decoder network to obtain a repair frame of the extracted video frame output by the first decoder network.
Optionally, the reference video frame determining module includes:
a second video frame selection sub-module, configured to select, in a case where the frame classification characterizes the extracted video frame as a lost frame, a leading video frame that is nearest to the extracted video frame and a trailing video frame that is nearest to the extracted video frame from normal frames of the video;
And the second reference video frame determining submodule is used for determining a reference video frame which needs to be referred to for repairing the extracted video frame based on the preceding video frame and the subsequent video frame.
Optionally, the second reference video frame determining submodule includes:
the second scene switching detection unit is used for detecting whether scene switching exists between the preceding video frame and the subsequent video frame;
a second reference video frame determining unit, configured to determine, when there is no scene switching between the preceding video frame and the subsequent video frame, the preceding video frame and the subsequent video frame as reference video frames to be referred for repairing the extracted video frame; and under the condition that scene switching exists between the front video frame and the rear video frame, determining the front video frame or the rear video frame as a reference video frame to be referred for repairing the extracted video frame.
Optionally, the repair frame generation module includes:
a third image feature obtaining unit configured to obtain a third image feature of the random noise image;
a second feature calculation unit configured to perform cross-attention calculation on the first image feature and the third image feature;
and the second repair frame generation unit is configured to perform denoising processing on the random noise image based on the cross-attention calculation result, and generate a supplementary frame based on the denoising result as a repair frame of the extracted video frame.
Optionally, the first image feature obtaining module is specifically configured to: inputting the reference video frame as control path input data of a pre-trained second frame repair model, and inputting the control path input data into a second self-coding sub-network in the second frame repair model to obtain a first image characteristic of the reference video frame output by the second self-coding sub-network, wherein the second frame repair model further comprises: a second forward diffusion layer, a second cross-attention computation layer, a second reverse diffusion layer, a second decoder network;
the third image feature obtaining unit is specifically configured to: inputting a random noise image serving as main road input data of the second frame repair model into the second self-coding sub-network to obtain a third image characteristic of the random noise image output by the second self-coding sub-network;
the second feature calculating unit and the second repair frame generating unit are specifically configured to: inputting the third image feature into the second forward diffusion layer to obtain a second noise feature of the extracted video frame output by the second forward diffusion layer; inputting the second noisy feature and the first image feature into the second cross attention computing layer to obtain a second cross attention computing result output by the second cross attention computing layer; inputting the second cross attention calculation result into the second reverse diffusion layer to obtain a second denoising feature which is output after the second reverse diffusion layer performs noise reduction processing on the second cross attention calculation result; and inputting the second denoising feature into the second decoder network to obtain a repair frame of the extracted video frame output by the second decoder network.
Optionally, the video frame extraction module includes:
an image data determining unit, configured to determine, according to the frame rate of the video, whether image data of a video frame exists at a corresponding position in the video;
a frame data obtaining unit configured to extract, in the presence of image data, the image data recorded at the corresponding position as frame data of the extracted video frame, and obtain a frame number of the extracted video frame; in the absence of image data, a null frame identification is determined as frame data of the extracted video frame, and a frame number of the extracted video frame is obtained.
According to a third aspect of an embodiment of the present invention, there is provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the video playing method in the first aspect when executing the program stored in the memory.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the video playing method described in the first aspect.
The embodiment of the invention has the beneficial effects that:
according to the video playing scheme provided by the embodiment of the invention, frame data and frame sequence numbers of video frames can be obtained according to the frame rate of the video, so that the frame classification of each extracted video frame can be determined from the frame data, i.e., whether each extracted video frame needs to be repaired. For a video frame to be repaired, a reference video frame is determined from the normal frames based on the frame sequence number of the video frame to be repaired, so that the first image feature of a reference video frame that does not need repair can be consulted and feature diffusion processing can be performed based on a cross-attention mechanism, thereby generating a repair frame for the video frame to be repaired. Since the normal frame and the video frame to be repaired are correlated, the repair frame is generated with reference to the image features of the normal frame, and the generated image content of the repair frame is correlated with, but not completely identical to, the image content of the normal frame. Therefore, during playback, the image content of each played repair frame is associated with, yet not identical to, that of the played reference video frames, so the played video frames are continuous; problems such as stuttering and frame skipping during playback are reduced, and the user's viewing experience is ensured. The scheme provided by the embodiment of the invention can therefore play the video completely and continuously.
Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention, and other drawings may be obtained from these drawings by those skilled in the art.
Fig. 1 is a flowchart of a first video playing method according to an embodiment of the present invention;
fig. 2 is a flow chart of a method for obtaining frame data according to an embodiment of the present invention;
fig. 3 is a flowchart of a second video playing method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a first frame repair model according to an embodiment of the present invention;
fig. 5 is a flowchart of a third video playing method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a second frame repair model according to an embodiment of the present invention;
fig. 7 is a flowchart of a fourth video playing method according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a video playing device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application fall within the scope of protection of the present application.
With the development of internet technology, watching video has become an important way for users to obtain information. Users can watch videos online in various APPs, or watch pre-stored videos offline on their own devices. Video frames in these videos may be damaged, so that the video cannot be played completely and continuously.
In view of this, it is desirable to provide a video playback scheme.
The execution body of the embodiment of the present application is described below.
The execution subject of the embodiment of the present application is an electronic device, which may specifically be a computer, a server, a smartphone, or the like.
The frame classification in the embodiment of the present invention is described below.
Video frames can be classified from different perspectives, yielding different frame classifications. In the embodiment of the invention, according to whether the image content of a video frame is damaged and the type of damage, there are the following three frame classifications:
1. normal frame
A normal frame is a video frame whose image content is intact, i.e., a video frame that does not need repair.
2. Noise frame
A noise frame is a video frame whose image content contains a noise region; it is a video frame that needs repair.
3. Lost frame
A lost frame is a video frame whose image content should exist but is actually absent; it is a video frame that needs repair.
The video playing method provided by the embodiment of the invention is explained below through specific embodiments.
Referring to fig. 1, a flow chart of a first video playing method is provided, and the method includes the following steps S101 to S106.
Step S101: and extracting frames from the video according to the frame rate of the video, and obtaining frame data and frame sequence numbers of the extracted video frames.
The frame data is data for describing the content of a video frame.
The frame numbers are used to indicate the playing order of each video frame in the video. For example, the frame number of the video frame may be a number determined according to the playing order of each video frame, specifically, the frame numbers of the video frames may be 0, 1, 2, … …, etc., and the smaller the frame number, the earlier the playing order of the video frames corresponding to the frame number.
The manner in which the frame rate of a video is determined is described below.
The electronic device may parse the obtained video file to parse a frame rate of the video from the video header information of the video file.
The manner of frame extraction is described below.
As described above, the electronic device may determine the frame rate of the video by parsing the video file, and thereby extract frames from the video according to that frame rate. The frame rate of the video affects the number of video frames that are extracted.
For example, assuming a frame rate of 30 for video, the electronic device will extract video frames at 30 frames per second, i.e., 30 video frames per 1 second of video will be extracted. Assuming a frame rate of 60 for video, 60 video frames are extracted every 1 second of video.
In addition, the video frames extracted in the video are different according to the frame rate of the video. For example, for a video with a frame rate of 30, it is necessary to extract 30 video frames for every 1 second of video, taking the 0 th to 1 st second portion of video as an example, the first extracted video frame is the 0 th second image of the video, the second extracted video frame is the 1/30 th second image of the video, and so on. If the frame rate of the video is 60, taking the portion of the video from 0 th second to 1 st second as an example, the first video frame extracted is the 0 th second image of the video, and the second video frame extracted is the 1/60 th second image of the video.
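As an illustration of frame extraction at the video's frame rate, the following is a minimal sketch using OpenCV; the library choice and all names are assumptions, not part of the claimed method.

```python
import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)      # frame rate parsed from the video file
frames = []
seq = 0
while True:
    ok, frame_data = cap.read()      # one frame per 1/fps seconds of video
    if not ok:
        break
    frames.append((seq, frame_data)) # (frame sequence number, frame data)
    seq += 1
cap.release()
```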
During frame extraction, for normal frames and noise frames, the electronic device can identify basic information of the video file, such as the stream information it contains, the pixel information of each video frame, and the frame rate, and can therefore extract all the pixel information of each frame in order as the image data of the extracted video frame. In one case, the electronic device may identify this basic information of the video file through a tool such as FFprobe. In addition, the electronic device may further identify basic information such as the container format of the video file, the color space of the frame images, the pixel width and height of the frame images, the video duration, and the bit rate, which is not limited by the embodiment of the present invention.
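The FFprobe-based identification of basic information mentioned above could, for example, look like the following sketch (which fields to query is an assumption):

```python
import json
import subprocess

# Query the first video stream's basic information with ffprobe.
out = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "stream=r_frame_rate,width,height,pix_fmt,duration",
     "-of", "json", "input.mp4"],
    capture_output=True, text=True, check=True)
stream = json.loads(out.stdout)["streams"][0]
num, den = map(int, stream["r_frame_rate"].split("/"))  # e.g. "30/1"
fps = num / den
```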
The extraction of missing frames will be described in more detail later.
The frame data acquisition method will be described below.
As described above for the extraction of frames, the electronic device may extract image data of video frames, so that, for normal frames and noise frames in which image content exists, the image data of the normal frames may be regarded as frame data of the normal frames, and the image data of the noise frames may be regarded as frame data of the noise frames.
For a missing frame, the missing frame does not have image content, in which case the electronic device may generate a null frame identification as frame data for the missing frame. The null frame identifier may be a preset character string.
The specific manner of obtaining the frame data will be further described in the embodiment corresponding to fig. 2, which will not be described in detail here.
The following describes a frame number acquisition method.
As described above with respect to the extraction of frames, the electronic device may extract image data for each video frame, the extracted image data including header data for the image. Thus, the electronic device can determine the frame number of the extracted video frame according to the sequence number information carried in the image header data.
Specifically, for the normal frame and the missing frame, in one case, the electronic device may directly determine the sequence number information carried in the header data of the video frame as the frame sequence number of the video frame.
In another case, the sequence number information may be a set of cyclically reused sequence numbers; for example, the sequence numbers may cycle through 0 to 65534, 65535 values in total. That is, the sequence number information of the first video frame in the video is 0, that of the second video frame is 1, that of the 65535th video frame is 65534, that of the 65536th video frame is 0 again, that of the 65537th video frame is 1, and so on cyclically. In this case, the electronic device may determine the frame sequence number of a video frame according to the sequence number information carried in the header data of that video frame and the sequence number information carried in the header data of the preceding video frames. For example, if the sequence number information carried in the header data of a certain video frame is 0, there are 65535 preceding video frames in total, and the sequence number information carried in the header data of the immediately preceding video frame is 65534, then the frame sequence number of that video frame can be determined to be 65535.
For the missing frame, the electronic device may acquire the header data of the missing frame, and in this case, the acquiring manner of the frame sequence number of the missing frame is the same as that of the frame sequence number of the normal frame or the noise frame, and since the foregoing description has been given of the acquiring manner of the frame sequence number of the normal frame or the noise frame, the description is omitted here.
In another case, the electronic device cannot obtain the image header data of the lost frame. In this case, the electronic device may determine the frame sequence number of the lost frame according to the sequence number information and frame sequence numbers carried in the header data of the video frames it did obtain. For example, suppose the lost frame is the 3rd frame, neither the previous frame nor the next frame is lost, the sequence number information of the previous frame is 2 (so its frame sequence number is 2), and the sequence number information of the next frame is 4. The electronic device extracts the frames before and after the lost frame, detects that their sequence number information is discontinuous, and thereby determines that a lost frame lies between them; the frame sequence number of the lost frame can then be determined to be 3 from the frame sequence number of the frame preceding it, namely 2.
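Both ideas — unwrapping the cyclic sequence numbers (0 to 65534) and inferring lost-frame numbers from gaps — can be sketched as follows, assuming the raw sequence numbers are read in playing order:

```python
WRAP = 65535  # sequence number information cycles through 0..65534

def unwrap(raw_seqs):
    """Convert cyclic sequence numbers into monotonically increasing frame numbers."""
    frame_numbers, wraps, prev = [], 0, None
    for raw in raw_seqs:
        if prev is not None and raw < prev:   # counter wrapped past 65534
            wraps += 1
        frame_numbers.append(wraps * WRAP + raw)
        prev = raw
    return frame_numbers

def find_lost(frame_numbers):
    """Frame numbers missing between consecutive readable frames are lost frames."""
    lost = []
    for a, b in zip(frame_numbers, frame_numbers[1:]):
        lost.extend(range(a + 1, b))          # e.g. 2 then 4 -> lost frame 3
    return lost
```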
Step S102: based on the obtained frame data, a frame classification corresponding to the extracted video frame is determined.
As explained above, the frame classification includes: normal frames, noise frames, and lost frames, wherein both normal and noise frames are video frames in which image content is present, and lost frames are video frames in which image content is not present. Accordingly, a lost frame in the extracted video frame may be determined based on whether the frame data of the extracted video frame is a null frame identification.
In one embodiment of the present invention, as described in the description of step S101C, the frame data of the lost frame is a null frame identification. Therefore, in this embodiment, it may be determined that the video frame whose frame data is the null frame identification is a lost frame.
The manner in which noise frames are determined from video frames in which image content exists is described below.
In a first implementation, a video frame with a noise region may be identified as a noise frame according to the pixel values of the pixel points in the video frame. For example, whether white noise exists in the image content of a video frame may be detected from the gray value of a pixel point and the average gray value of the area near that pixel point; whether green-curtain noise or black-curtain noise exists in the image content may likewise be detected from the RGB (Red Green Blue) values of a pixel point and the average RGB values of the surrounding region. Specifically, the video frame may be divided into different regions, and whether a noise region exists in the image content is then detected from the gray values and/or RGB values of the pixel points in each region and the average gray values and/or RGB values of the pixel points in nearby regions. A nearby region may be a region whose distance from the region under detection is less than a preset distance.
Noise region detection for video frames is described below with a specific example.
When noise region detection is performed on a video frame, a sliding detection block can be created and slid across the video frame. The size of the sliding detection block may be determined by the size of the video frame and a preset detection coefficient, where the detection coefficient represents the proportion of the largest noise region tolerable for viewing in the image content of the video frame. In one case, assume that the detection coefficient is k, the pixel width of the video frame is A and its pixel height is B, and the pixel width of the created sliding detection block is a and its pixel height is b; then the pixel dimensions of the video frame and of the sliding detection block satisfy the following relationship:

a = A / √k, b = B / √k (equivalently, a·b = (A·B) / k)

For example, when the detection coefficient is 16, the largest tolerable noise block proportion is 1/16, and the size of the sliding detection block determined from the detection coefficient is (A/4) × (B/4). Assuming that the pixel width of the video frame is 1280 and the pixel height is 720, the above relationship gives a sliding detection block with a pixel width of 320 and a pixel height of 180.
When the sliding detection block traverses the video frame, the information of the pixel points inside the block is extracted at each traversal position and their average gray value is calculated. This average gray value is compared with the global gray value of the whole video frame's image content and with the average gray value of the adjacent blocks; if the difference between the block's average gray value and the global gray value, or the adjacent blocks' average gray value, is greater than a preset gray value threshold, the block is considered a white-noise block and the video frame a noise frame. Similarly, the average RGB triple values of the extracted pixel points may be calculated and compared with the global RGB triple values of the whole video frame's image content and the average RGB triple values of the adjacent blocks; if the difference exceeds a preset RGB threshold, the block is considered a green-curtain or black-curtain noise block, and the video frame a noise frame. An adjacent block may be a block of the same size adjacent to the block under detection.
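A minimal sketch of the sliding-block detection described above, simplified to compare each block only against the global mean gray value (the adjacent-block comparison is omitted for brevity, and the threshold value is an assumption):

```python
import numpy as np

def find_noise_blocks(gray, k=16, thresh=40.0):
    """Slide a detection block over a grayscale frame (H x W array) and flag
    blocks whose mean gray value deviates strongly from the global mean.
    k is the detection coefficient; thresh is an assumed gray-value threshold."""
    H, W = gray.shape
    bh, bw = int(H / k ** 0.5), int(W / k ** 0.5)  # block size: B/sqrt(k), A/sqrt(k)
    global_mean = gray.mean()
    noisy = []
    for y in range(0, H - bh + 1, bh):
        for x in range(0, W - bw + 1, bw):
            block_mean = gray[y:y + bh, x:x + bw].mean()
            if abs(block_mean - global_mean) > thresh:
                noisy.append((x, y, bw, bh))       # candidate white-noise block
    return noisy  # a non-empty result marks the frame as a noise frame
```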
In a second implementation manner, the image features of each video frame may be extracted and matched with the image features of various noise images obtained in advance, so as to determine that the video frame with successfully matched image features is a noise frame.
In a third implementation, a pre-trained noise frame discrimination model may be used to distinguish noise frames from normal frames. The noise frame discrimination model is an image classification model: it classifies the input image data, i.e., it outputs a classification result for the input image data of an extracted video frame, so that the type of the video frame corresponding to that image data can be determined from the classification result.
The backbone of the noise frame discrimination model may adopt a residual neural network structure. To obtain a trained noise frame discrimination model, the neural network model can be trained with manually labeled normal frames as positive samples and various noise frames as negative samples, keeping the numbers of positive and negative samples at approximately 1:1. The negative samples may specifically include full-frame white-noise frames, regional white-noise frames, green-curtain occlusion frames, locally irregular black-block occlusion frames, and the like, with approximately equal numbers of samples of each noise frame type.
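A hedged sketch of such a discriminator in PyTorch, with torchvision's ResNet-18 standing in for the unspecified residual structure and the class indices chosen arbitrarily:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Binary classifier: class 0 = normal frame (positive), class 1 = noise frame.
model = resnet18(num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(frames, labels):
    """frames: (N, 3, H, W) float tensor; labels: (N,) long tensor.
    Positive and negative samples are kept at roughly a 1:1 ratio."""
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```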
Step S103: if the frame classification characterizes that the extracted video frame is a video frame needing to be repaired, determining a reference video frame needing to be referred to for repairing the extracted video frame from normal frames of the video.
As described above, the video frames to be repaired include both noise frames and missing frames. The reference video frames determined are also different for different video frames requiring repair.
In one case, for a noise frame, its reference video frame is a video frame determined based on the nearest preceding normal frame, and for a missing frame, its reference video frame is a video frame determined based on the nearest preceding normal frame and the nearest following normal frame. The manner in which the reference video frames are specifically determined will be described in the embodiments below, respectively, and will not be described in detail here.
In another case, for a noise frame, the reference video frame is a plurality of preamble normal frames in the same scene as the noise frame; for a missing frame, the reference video frame is a plurality of preceding normal frames and subsequent normal frames in the same scene as the missing frame. The specific manner of determination will be further described later and will not be described in detail herein.
Step S104: a first image feature of a reference video frame is obtained.
The first image feature may be used to describe the image content of the reference video frame, and in particular, the first image feature may be in a vector form or may be in another form. The electronic device may perform feature extraction through a plurality of algorithms to obtain the first image feature, for example, an LBP (Local Binary Patterns, local binary pattern) algorithm, a HOG (Histogram of Oriented Gradient, gradient direction histogram) feature extraction algorithm, a SIFT (Scale-invariant feature transform, scale invariant feature transform) operator, and the like, which is not limited by the embodiment of the present invention.
In one implementation, a reference video frame may be input to a trained self-encoder model to obtain a first image feature output by the trained self-encoder. The manner in which the first image features are obtained by the trained self-encoder model will be described in the embodiments corresponding to fig. 3 and 5, which will not be described in detail herein.
In another implementation, as explained for step S103, there may be a plurality of reference video frames. In this case, the image feature of each reference video frame may be extracted first, and the extracted image features then fused based on the playing order of the reference video frames to obtain the first image feature. The specific manner of obtaining it will be further described later and is not detailed here.
Step S105: and performing feature diffusion processing on the first image features based on the cross attention mechanism, and generating a repair frame of the extracted video frame based on the processing result.
The characteristic diffusion treatment comprises forward diffusion and reverse diffusion. Specifically, forward diffusion is to continuously add random noise features to image features of a video frame to be repaired, and reverse diffusion is to extract noise features from the image features of the video frame to be repaired to which the random noise features are added according to the first image features, so as to obtain a repair frame.
In one implementation, the feature diffusion processing described above can be performed using a fully convolutional neural network, for example a U-Net self-encoder. The diffusion operation using the U-Net self-encoder is a reverse Markov chain of length T, where T is the number of time steps, which can also be viewed as the number of denoising self-encoders; T can be set according to the image width and height of the video frame.
Specifically, the forward diffusion process satisfies the following relationship:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) · x_{t−1}, β_t · I)

where t denotes the step index of gradually adding noise during diffusion, with value range [0, T]; x_t denotes the frame image diffused to step t; x_{t−1} denotes the frame image diffused to step t−1; q denotes the mapping function of the diffusion process; I denotes the identity matrix; N(·) denotes a Gaussian distribution; and β_t denotes the noise deviation coefficient added at step t, with value range [0, 1], its value increasing gradually from step 0 to step T.
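The relationship above corresponds to the following single sampling step; a minimal numeric sketch, in which the linear β schedule is an assumption:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # noise deviation coefficients, increasing in t

def forward_diffuse_step(x_prev, t):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t]
    return torch.sqrt(1 - beta_t) * x_prev + torch.sqrt(beta_t) * torch.randn_like(x_prev)
```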
The manner of training to obtain the U-Net self-encoder is the same as the manner of training to obtain the reverse diffusion layer in the following explanation of step E3 and step K3, and will not be described in detail here.
As described above, the video frames to be repaired include two types of noise frames and missing frames, and the manner of generating the repair frames is different for different video frames to be repaired. The specific manner in which the repair frame is generated will be further described in the embodiments that follow and will not be described in detail here.
Step S106: and playing the repair frame according to the frame sequence number of the extracted video frame.
As described above, the frame numbers of the extracted video frames represent the playing order of each video frame in the video, and therefore, based on the frame numbers of the extracted video frames, the playing order of the repair frames in the video can be determined, so that the repair frames can be played in the order of the frame numbers, and the video images can be played normally.
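A trivial sketch of ordered playback, assuming the repaired frames are held in a mapping from frame sequence number to frame data (render stands in for the player's actual display call):

```python
def play(repaired, render):
    # repaired: dict mapping frame sequence number -> frame data (assumed structure)
    for seq in sorted(repaired):
        render(repaired[seq])
```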
By applying the scheme provided by the embodiment of the invention, frame data and frame sequence numbers of video frames can be obtained according to the frame rate of the video, so that the video frames to be repaired are determined from the frame data, and a reference video frame is determined from the normal frames based on the frame sequence number of the video frame to be repaired. The determined reference video frame is correlated with the video frame to be repaired, so the image content of the repair frame can be generated based on the image content of the reference video frame, and the generated image content of the repair frame is correlated with that of the reference video frame. Therefore, when the video is played, the repair frames are played according to their frame sequence numbers, and the played image contents are correlated without being completely identical, which reduces problems such as stuttering and frame skipping during playback and ensures the user's viewing experience. The scheme provided by the embodiment of the invention can thus ensure that the video is played completely and continuously.
In an embodiment of the present invention, referring to fig. 2, a flow chart of a frame data obtaining method is provided, in which step S101 may be completed through the following steps S101A to S101C.
Step S101A: and determining whether image data of the video frame exists at a corresponding position in the video according to the frame rate of the video. If so, step S101B is performed, and if not, step S101C is performed.
If image data is present, the video frame is a normal frame or a noise frame; if no image data is present, the video frame is a lost frame.
As described above for step S101, the image header data of a lost frame may or may not be readable; accordingly, when determining whether image data of a video frame exists at a corresponding position in the video, there are the following two cases.
In one implementation, it may be determined whether data between header data of a video frame and header data of a next frame of the video frame is image data. In this case, the number of bytes of data between the header data of the video frame and the header data of the next frame of the video frame may be compared with a preset number of bytes, and if the number of bytes is smaller than the preset number of bytes, the data is considered to be corrupted, and the image content of the video frame cannot be obtained from the data, that is, the image data of the video frame does not exist at the corresponding position in the video.
In another case, the electronic device may only obtain the frame number of the missing frame corresponding to the missing frame where the header data is not obtained, and may not obtain other data. For video frames having only a frame number and no other data, it can be considered that there is no image data of the video frame at a corresponding position in the video.
Step S101B: image data recorded at the corresponding position is extracted as frame data of the extracted video frame, and a frame number of the extracted video frame is obtained.
The use of the extracted image data as the frame data of the extracted video frame, and the obtaining of the frame sequence number, were described in detail above for step S101 and are not repeated here.
Step S101C: and determining the empty frame identification as frame data of the extracted video frame, and obtaining a frame sequence number of the extracted video frame.
When there is no image data, the extracted video frame is a missing frame, and as described above for step S101, the electronic device may generate a null frame identifier as frame data of the missing frame. The above-mentioned null frame identifier may be a preset character string, which is used to indicate that the extracted video frame does not have image data at a corresponding position in the video, that is, indicates that the extracted video frame is a lost frame.
The manner in which the frame number of the missing frame is obtained is also described in detail above and will not be described in detail here.
The frame data and the sequence number of the extracted video frame are obtained in this way, so that whether the image data of the video frame exist or not can be ensured, and the frame data of the video frame can be obtained. And, the empty frame identification is used as the frame data of the video frame without image data, so that the lost frame can be distinguished from the normal frame or the noise frame, and the corresponding repair frame can be generated for each lost frame.
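To make steps S101A-S101C concrete, the following minimal Python sketch illustrates assigning either the recorded image data or a null frame identifier to every frame position. The helper name, the `NULL_FRAME` string, and the byte threshold `MIN_FRAME_BYTES` are assumptions standing in for the preset values described above, not values fixed by the patent.

```python
NULL_FRAME = "__NULL_FRAME__"   # preset string marking a lost frame (assumed value)
MIN_FRAME_BYTES = 1024          # preset byte-count threshold (assumed value)

def collect_frame_data(frame_records):
    """frame_records: list of (frame_no, payload) pairs parsed per the video's
    frame rate; payload is None when not even header data could be read."""
    frames = []
    for frame_no, payload in frame_records:
        if payload is None or len(payload) < MIN_FRAME_BYTES:
            # No usable image data at this position: record a null frame identifier.
            frames.append((frame_no, NULL_FRAME))
        else:
            # Image data exists: keep it as the frame data (normal or noise frame).
            frames.append((frame_no, payload))
    return frames
```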
As described above for step S103, the video frames to be repaired include both noise frames and lost frames.
The following describes the case where the extracted video frame is a noise frame and the reference video frame is determined from the nearest preceding normal frame.
In one embodiment of the present invention, if the extracted video frame is a noise frame, the step S103 may be completed by the following steps S103A-S103B.
Step S103A: from the normal frames of the video, the leading video frame that is most adjacent to the extracted video frame is selected.
The preamble video frame is a video frame with a frame number smaller than the frame number of the extracted video frame. Specifically, the electronic device may select, from the frame numbers of the normal frames, a normal frame that is smaller than the frame number of the extracted video frame and that corresponds to the frame number closest to the frame number of the extracted video frame as the selected leading video frame.
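As a minimal sketch of this nearest-preceding selection (the helper name and list-based scan are assumptions for illustration):

```python
def nearest_preceding_normal(normal_ids, k):
    """Return the normal-frame sequence number closest below frame k, or None."""
    candidates = [n for n in normal_ids if n < k]
    return max(candidates) if candidates else None
```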
Step S103B: based on the preamble video frames, a reference video frame to be referenced for repairing the extracted video frame is determined.
Under the condition that a plurality of scenes exist in the video to be repaired, scene switching possibly exists between the extracted video frame and the previous video frame, at this time, a large difference exists between the image content of the previous video frame and the image content of the extracted video frame, and a reference effect cannot be achieved when the video frame is repaired. In this case, the above step S103B may be completed by the following steps a to B.
Step A: a detection is made as to whether a scene cut exists between the leading video frame and the extracted video frame. And if scene switching exists between the preamble video frame and the extracted video frame, executing the step B.
Specifically, whether or not there is scene change can be detected in the following manner.
In one implementation, image features of the extracted video frames and image features of the preceding video frames may be extracted and a similarity between the extracted image features calculated. When the calculated similarity is below a threshold, a scene cut is considered to exist between the leading video frame and the extracted video frame.
In another implementation manner, the image features of the video frames may be clustered, and a clustering category to which each video frame belongs is determined according to a clustering result, so that when the clustering categories to which the video frames with adjacent frame numbers of the video frames belong are different, it is considered that video scene switching exists between the adjacent video frames. In one case, an unsupervised clustering algorithm may be used to cluster image features of the video frame, where the algorithm used may be K-means (K-means clustering algorithm ) or DBSCAN (Density-Based Spatial Clustering of Applications with Noise, density-based clustering algorithm), which is not limited in this embodiment of the present invention.
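A minimal sketch of the similarity-based check, using cosine similarity over the feature vectors; the specific similarity measure and threshold value are assumptions, since the patent fixes neither:

```python
import numpy as np

SCENE_CUT_THRESHOLD = 0.85  # assumed similarity threshold

def has_scene_cut(feat_a, feat_b, threshold=SCENE_CUT_THRESHOLD):
    """feat_a, feat_b: 1-D image-feature vectors of two video frames."""
    cos = float(np.dot(feat_a, feat_b) /
                (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-8))
    # A similarity below the threshold indicates a scene cut between the frames.
    return cos < threshold
```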
Step B: from the normal frames of the video, select the subsequent video frame that is nearest to the extracted video frame and has no scene cut from it, as the reference video frame to be consulted for repairing the extracted video frame.
Selecting the normal frame that is nearest to the extracted video frame and has no scene cut from it as the reference video frame ensures the highest degree of association between the selected reference video frame and the extracted video frame, so that the image content of the generated repair frame is highly associated with the image content of the reference video frame.
In another case, the video to be repaired has only one scene; the selected nearest preceding video frame is then determined as the reference video frame.
The reference video frame is thus selected from the normal frames according to the frame sequence numbers, and the selected reference video frame is associated with the extracted video frame. The image content of the repair frame generated from the reference video frame is therefore associated with the image content of the reference video frame, which ensures the continuity of the video images to be played.
The following describes the case where the extracted video frame is a noise frame and the reference video frames are multiple preceding normal frames in the same scene as the noise frame.
In one embodiment of the present invention, if the frame classification characterizes the extracted video frame as a noise frame, step S103 may be completed through the following step S103A1.
Step S103A1: select the preceding video frames that have no scene cut from the extracted video frame as the reference video frames to be consulted for repairing the extracted video frame.
Specifically, the electronic device may use all preceding normal frames that have no scene cut from the extracted video frame as the reference video frames, or it may select, working backwards from the extracted video frame in playing order, a preset number of such preceding normal frames as the reference video frames.
The manner of determining whether a scene cut exists between a preceding video frame and the extracted video frame has been described for step A above and is not repeated here.
In this way, multiple frames preceding the extracted video frame are used as reference video frames. The determined reference video frames are associated with the extracted video frame and have a temporal association among themselves, so the image content of the repair frame generated from them is associated with that of the reference video frames, which ensures that the video images to be played are continuous.
In one embodiment of the present invention, if the frame classification characterizes the extracted video frame as a noise frame and the reference video frames are multiple preceding normal frames, step S104 may be completed through the following steps S104A-S104B.
Step S104A: obtain the image features of each selected preceding video frame.
The image features of each preceding video frame may be in vector form or in another form. The electronic device may extract them with various algorithms, for example the LBP (Local Binary Patterns) algorithm, the HOG (Histogram of Oriented Gradients) feature extraction algorithm, or the SIFT (Scale-Invariant Feature Transform) operator, which is not limited by the embodiments of the present invention.
In one case, the image features of each preceding video frame may be extracted by a trained self-encoder (autoencoder) model.
Step S104B: fuse the obtained image features of the preceding video frames according to the playing order of the selected preceding video frames, to obtain the first image feature of the reference video frames.
Specifically, the first image feature of the reference video frames may be obtained through a trained feature fusion model; for example, an LSTM (Long Short-Term Memory) model may be used to fuse the image features of the preceding video frames.
The first image feature obtained in this way takes into account the features of multiple preceding normal frames before the extracted video frame as well as the temporal relationship among those frames, which ensures that the obtained first image feature is associated with the extracted video frame.
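A sketch of the fusion step in PyTorch terms; the layer sizes are assumptions, and taking the last hidden state as the fused first image feature is one common design choice for LSTM-based fusion, not a detail fixed by the patent:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse per-frame image features in playing order with an LSTM."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim), ordered by playing sequence.
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]  # (batch, hidden_dim): the fused first image feature
```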
In an embodiment of the present invention, if the extracted video frame is a noise frame, referring to fig. 3, a flow chart of a second video playing method is provided; in this embodiment, step S105 may be completed through the following steps S105A-S105C.
Step S105A: obtain a second image feature of the extracted video frame.
Similar to the description of step S104 above, the second image feature describes the image data of the extracted video frame; specifically, it may be in vector form or in another form. The electronic device may extract it with various algorithms, for example the LBP algorithm, the HOG feature extraction algorithm, or the SIFT operator, which is not limited by the embodiments of the present invention.
Step S105B: perform cross-attention calculation on the first image feature and the second image feature.
The first image feature characterizes the features of the image content of the reference video frame, and the second image feature characterizes the features of the image content of the extracted video frame. As described above for steps S103A-S103B, the reference video frame and the extracted video frame have high similarity in image content, so performing cross-attention calculation on the first and second image features yields the features of the repair frame to be generated.
Specifically, as described above, both the first and second image features may be in vector form. The electronic device may determine, based on the attention mechanism, weights for the components of each dimension of the two feature vectors, and fuse the first and second image features according to the determined weights to obtain the image feature of the repair frame to be generated.
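A sketch of this weighted fusion in scaled-dot-product form, a common realization of cross-attention; the tensor shapes and scaling are assumptions, since the patent does not fix a formula:

```python
import torch
import torch.nn.functional as F

def cross_attention(query_feat, context_feat, d_k=64):
    """query_feat: feature of the frame being repaired, (batch, n_q, d_k);
    context_feat: first image feature of the reference frames, (batch, n_c, d_k)."""
    scores = torch.matmul(query_feat, context_feat.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)          # attention weights per component
    return torch.matmul(weights, context_feat)   # fused feature of the repair frame
```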
Step S105C: taking the reference video frame as an auxiliary control condition, denoise the extracted video frame based on the cross-attention calculation result to generate the repair frame of the extracted video frame.
The image feature of the repair frame to be generated, obtained in step S105B, describes the image content of the repair frame to be generated. The electronic device may therefore denoise the extracted video frame based on the image content, described by that feature, of the region corresponding to the noise region.
The repair frame is generated based on the cross-attention calculation result, which is a fusion of the first and second image features. Moreover, the first image feature is the image feature of the reference video frame, which is associated with the extracted video frame, so when the repair frame is generated, not only the extracted video frame but also normal frames associated with it are taken into account. The repair frame thus generated is natural and associated with the reference video frame.
In one embodiment of the present invention, if the frame classification characterizes the extracted video frame as a noise frame, a corresponding repair frame may be generated using a pre-trained first frame repair model.
Referring to fig. 4, a schematic structural diagram of a first frame repair model is provided. It can be seen that the first frame repair model comprises a first self-encoding sub-network, a first forward diffusion layer, a first cross-attention computation layer, a first reverse diffusion layer, and a first decoder network.
Specifically, in this embodiment, the step S104 may be completed by the following step C, the step S105A may be completed by the following step D, and the steps S105B to S105C may be completed by the following steps E1 to E4.
Step C: and taking the reference video frame as control path input data of a pre-trained first frame repair model, inputting the control path input data into a first self-coding sub-network in the first frame repair model, and obtaining a first image characteristic of the reference video frame output by the first self-coding sub-network.
The first self-coding sub-network can be obtained by training the self-coder model by using a noise-free sample video frame, specifically, the sample video frame can be input into the self-coder model to obtain a first sample image characteristic output by the self-coder model, the first sample image characteristic is compared with a sample target image characteristic, and parameters of the self-coder model are adjusted based on a comparison result to obtain the first self-coding sub-network.
The self-encoder model may be a general self-encoder model.
Step D: and taking the extracted video frames as main path input data of the first frame repair model, inputting the main path input data into the first self-coding sub-network, and obtaining second image characteristics of the extracted video frames output by the first self-coding sub-network.
Step E1: and inputting the second image characteristic into the first forward diffusion layer to obtain a first noise characteristic of the extracted video frame output by the first forward diffusion layer.
The first forward diffusion layer performs a forward diffusion operation on the second image feature, which has been described in the previous step S105 and will not be repeated here.
Step E2: and inputting the first noise feature and the first image feature into a first cross attention computing layer to obtain a first cross attention computing result output by the first cross attention computing layer.
The first cross-attention computing layer performs cross-attention computation on the first noisy feature and the first image feature, and the computation process is described in the foregoing step S105B and is not repeated here.
Step E3: and inputting the first cross attention calculation result into the first reverse diffusion layer to obtain a first denoising feature which is output after the first reverse diffusion layer performs noise reduction processing on the first cross attention calculation result.
The first reverse diffusion layer may be obtained by training a diffusion model. Specifically, two consecutive frames of a noise-free video may be extracted; the former frame serves as a first sample reference video frame, the latter as a first sample target video frame, and noise is added at random to the latter to obtain a sample noise frame. The first sample reference video frame and the sample noise frame are input into the diffusion model to obtain a first sample repair frame output by the diffusion model; this is compared with the first sample target video frame, and the diffusion model parameters are adjusted based on the comparison result to obtain the first reverse diffusion layer. As described above for step S105, the trained first reverse diffusion layer is a U-Net self-encoder.
Specifically, when the diffusion model is trained, the loss function is:

$$\mathcal{L} = \mathrm{MSE}(Z, \hat{Z})$$

where $\mathrm{MSE}(\cdot,\cdot)$ denotes the mean square error, $Z$ denotes the noise information in the first sample target video frame, and $\hat{Z}$ denotes the noise information in the first sample repair frame output by the diffusion model.
As described above, the noise frames may include several types, such as white-noise frames, regional white-noise frames, green-curtain noise mask frames, and local irregular black-block noise mask frames. When the self-encoder is trained, the numbers of the various types of noise frames need to be balanced, that is, kept close to one another.
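A compressed sketch of one such training step, under the assumptions that `diffusion_model` exposes a call taking the noised frame and a `cond` conditioning argument (a hypothetical interface) and that the added noise is Gaussian; the loss is the mean square error between the true and predicted noise, matching the formula above:

```python
import torch
import torch.nn.functional as F

def train_step(diffusion_model, optimizer, prev_frame, next_frame):
    """prev_frame: first sample reference video frame; next_frame: clean target."""
    noise = torch.randn_like(next_frame)          # noise added at random (Z)
    sample_noise_frame = next_frame + noise       # the sample noise frame
    pred_noise = diffusion_model(sample_noise_frame, cond=prev_frame)  # Z_hat
    loss = F.mse_loss(pred_noise, noise)          # L = MSE(Z, Z_hat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```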
Step E4: input the first denoising feature into the first decoder network to obtain the repair frame of the extracted video frame output by the first decoder network.
Repairing noise frames with the first frame repair model in this way efficiently generates repair frames for the extracted noise frames, and when a repair frame is generated, not only the extracted video frame but also normal frames associated with it are taken into account. The repair frames thus generated are natural and associated with the reference video frames.
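Putting steps C, D and E1-E4 together, the inference path of the first frame repair model can be sketched as below. The module attribute names mirror the structure of fig. 4 but are assumptions, not the patent's actual interface:

```python
def repair_noise_frame(model, noise_frame, reference_frame):
    """model: holds the trained sub-networks of the first frame repair model."""
    first_feat = model.encoder(reference_frame)        # step C: control path
    second_feat = model.encoder(noise_frame)           # step D: main path
    noised = model.forward_diffusion(second_feat)      # step E1: first noise feature
    fused = model.cross_attention(noised, first_feat)  # step E2
    denoised = model.reverse_diffusion(fused)          # step E3: U-Net denoising
    return model.decoder(denoised)                     # step E4: the repair frame
```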
The following describes the case where the extracted video frame is a lost frame and the reference video frames are determined from the nearest preceding and subsequent normal frames.
As described above for step S103, the video frames to be repaired include both noise frames and lost frames.
In one embodiment of the present invention, if the frame classification indicates that the extracted video frame is a lost frame, the above step S103 may be completed by the following steps S103C-S103D.
Step S103C: from the normal frames of the video, select the preceding video frame nearest to the extracted video frame and the subsequent video frame nearest to the extracted video frame.
Step S103D: based on the preceding video frame and the subsequent video frame, determine the reference video frames to be consulted for repairing the extracted video frame.
When the video to be repaired contains multiple scenes, a scene cut may exist between the preceding video frame and the subsequent video frame. In that case there is a large difference between the image content of the preceding video frame and that of the subsequent video frame, so they cannot serve as a useful reference when repairing the extracted video frame. In this case, step S103D may be completed through the following steps F to H.
Step F: detect whether a scene cut exists between the preceding video frame and the subsequent video frame. If no scene cut exists between them, step G is performed; if a scene cut exists between them, step H is performed.
Specifically, whether a scene cut exists can be detected in the following ways.
In one implementation, the image features of the preceding video frame and of the subsequent video frame may be extracted and the similarity between them calculated. When the calculated similarity is below a threshold, a scene cut is considered to exist between the preceding video frame and the subsequent video frame.
In another implementation, similar to the description of step A above, the image features of the video frames may be clustered and the cluster to which each video frame belongs determined from the clustering result; when two video frames with adjacent frame sequence numbers belong to different clusters, a scene cut is considered to exist between them. In one case, an unsupervised clustering algorithm such as K-means or DBSCAN (Density-Based Spatial Clustering of Applications with Noise) may be used, which is not limited by the embodiments of the present invention.
Step G: determine both the preceding video frame and the subsequent video frame as reference video frames to be consulted for repairing the extracted video frame.
Step H: determine the preceding video frame or the subsequent video frame as the reference video frame to be consulted for repairing the extracted video frame.
The preceding and/or subsequent normal frames nearest to the extracted video frame are selected as reference video frames, and it is ensured that no scene cut exists between the selected reference video frames. Because the selected preceding and subsequent normal frames are nearest to the extracted video frame, and the extracted video frame lies between them, the degree of association between their image contents is high, so the image content of the repair frame generated from the selected reference video frames is highly associated with that of the reference video frames.
In another case, the video to be repaired has only one scene; the selected nearest preceding video frame and nearest subsequent video frame are then determined as the reference video frames to be consulted for repairing the extracted video frame.
The reference video frames are thus selected from the normal frames according to the frame sequence numbers, and the selected reference video frames are associated with the extracted video frame. The image content of the repair frame generated from the reference video frames is therefore associated with the image content of the reference video frames, which ensures the continuity of the video images to be played.
The following describes the case where the extracted video frame is a lost frame and the reference video frames are multiple preceding and subsequent normal frames in the same scene as the lost frame.
In one embodiment of the present invention, if the frame classification characterizes the extracted video frame as a lost frame, step S103 may be completed through the following step S103A1.
Step S103A1: from the normal frames of the video, select preceding video frames and subsequent video frames that have no scene cut from the extracted video frame, as the reference video frames to be consulted for repairing the extracted video frame.
The manner of determining whether a scene cut exists between a preceding or subsequent video frame and the extracted video frame has been described for step F above and is not repeated here.
In this way, frames before and after the extracted video frame are used as reference video frames. The determined reference video frames are associated with the extracted video frame and have a temporal association among themselves, so the image content of the repair frame generated from them is associated with that of the reference video frames, which ensures that the video images to be played are continuous.
In one embodiment of the present invention, if the frame classification characterizes the extracted video frame as a lost frame and the reference video frames are multiple preceding and subsequent normal frames, step S104 may be completed through the following steps S104C-S104F.
Step S104C: obtain the image features of each selected preceding video frame and each selected subsequent video frame.
The image features of each preceding and subsequent video frame may be in vector form or in another form. The electronic device may extract them with various algorithms, for example the LBP algorithm, the HOG feature extraction algorithm, or the SIFT operator, which is not limited by the embodiments of the present invention.
In one case, the image features of each preceding and subsequent video frame may be extracted by a trained self-encoder model.
Step S104D: fuse the obtained image features of the preceding video frames according to the playing order of the selected preceding video frames.
Specifically, the fused feature may be obtained through a trained feature fusion model; for example, an LSTM (Long Short-Term Memory) model may be used to fuse the image features of the preceding video frames.
Step S104E: fuse the obtained image features of the subsequent video frames according to the playing order of the selected subsequent video frames.
Similar to step S104D, the fused feature may be obtained through a trained feature fusion model; for example, an LSTM model may be used to fuse the image features of the subsequent video frames.
Step S104F: determine the fused image features as the first image feature of the reference video frames.
The first image feature obtained in this way takes into account the features of multiple preceding normal frames before the extracted video frame and of multiple subsequent normal frames after it, as well as the temporal relationships among the preceding normal frames and among the subsequent normal frames, which ensures that the obtained first image feature is associated with the extracted video frame.
In an embodiment of the present invention, if the frame classification characterizes the extracted video frame as a lost frame, referring to fig. 5, a flow chart of a third video playing method is provided; in this embodiment, step S105 may be completed through the following steps S105D-S105F.
Step S105D: obtain a third image feature of a random noise image.
The random noise image is a noise image whose size matches that of the video frames. In one case, the size of the video frames may be carried in the basic video information, so the electronic device can determine the size of the random noise image from that information; in another case, the electronic device may parse the video and determine the frame size from the parsing result, thereby determining the size of the random noise image.
Similar to the description of step S104 above, the third image feature describes the image data of the random noise image; specifically, it may be in vector form or in another form. The electronic device may extract it with various algorithms, for example the LBP algorithm, the HOG feature extraction algorithm, or the SIFT operator, which is not limited by the embodiments of the present invention.
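A short sketch of constructing such a noise image once the frame size is known; using Gaussian noise is an assumption, as the patent only requires the sizes to match:

```python
import numpy as np

def make_random_noise_image(height, width, channels=3):
    # Noise image matching the video frame size, used as main-path input.
    return np.random.randn(height, width, channels).astype(np.float32)
```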
Step S105E: perform cross-attention calculation on the first image feature and the third image feature.
Similar to the description of step S105B, the first and third image features may be in vector form. The electronic device may determine, based on the attention mechanism, weights for the components of each dimension of the two feature vectors, and fuse the first and third image features according to the determined weights to obtain the image feature of the supplemental frame to be generated.
Step S105F: denoise the random noise image based on the cross-attention calculation result, and generate a supplemental frame based on the denoising result as the repair frame of the extracted video frame.
The image feature of the supplemental frame to be generated, obtained in step S105E, describes the image content of the supplemental frame to be generated, so the electronic device may generate the supplemental frame by denoising based on the image content described by that feature.
The supplemental frame is generated based on the cross-attention calculation result, which is computed from the first and third image features and is therefore a fusion of the two. Moreover, the first image feature is the image feature of the reference video frames, which are associated with the extracted lost frame, so normal frames associated with the extracted video frame are taken into account when the supplemental frame is generated. The supplemental frame thus generated is associated with the reference video frames.
In one embodiment of the present invention, if the frame classification characterizes the extracted video frame as a lost frame, a pre-trained second frame repair model may be used to generate a corresponding repair frame.
Referring to fig. 6, a schematic structural diagram of a second frame repair model is provided. It can be seen that the second frame repair model includes a second self-encoding sub-network, a second forward diffusion layer, a second cross-attention computation layer, a second reverse diffusion layer, and a second decoder network.
Specifically, in this embodiment, step S104 may be completed through the following step I, step S105D through the following step J, and steps S105E-S105F through the following steps K1-K4.
Step I: input the reference video frames, as control-path input data of the pre-trained second frame repair model, into the second self-encoding sub-network of the model, and obtain the first image feature of the reference video frames output by the second self-encoding sub-network.
Similar to the description of step C, the second self-encoding sub-network may be obtained by training a self-encoder model with noise-free sample video frames. Specifically, a sample video frame may be input into the self-encoder model to obtain a second sample image feature output by the model; the feature is compared with a sample target image feature, and the self-encoder model parameters are adjusted based on the comparison result to obtain the second self-encoding sub-network. In one case, the trained first self-encoding sub-network may be used as the second self-encoding sub-network.
The self-encoder model may be a general self-encoder model.
Step J: input the random noise image, as main-path input data of the second frame repair model, into the second self-encoding sub-network, and obtain the third image feature of the random noise image output by the second self-encoding sub-network.
Step K1: input the third image feature into the second forward diffusion layer to obtain the second noise feature output by the second forward diffusion layer.
The second forward diffusion layer performs a forward diffusion operation on the third image feature, as described above for step S105; this is not repeated here.
Step K2: input the second noise feature and the first image feature into the second cross-attention calculation layer to obtain the second cross-attention calculation result output by that layer.
The second cross-attention calculation layer performs cross-attention calculation on the second noise feature and the first image feature; the calculation process has been described for step S105E above and is not repeated here.
Step K3: input the second cross-attention calculation result into the second reverse diffusion layer to obtain the second denoising feature output after the second reverse diffusion layer performs noise reduction on the second cross-attention calculation result.
The second reverse diffusion layer may be obtained by training a diffusion model. Specifically, three consecutive frames of a noise-free video may be extracted; the middle frame serves as the second sample target video frame, the other two as second sample reference video frames, and random noise is added to the middle frame to obtain a sample noise image. The sample noise image and the second sample reference video frames are input into the diffusion model to obtain a second sample repair frame output by the diffusion model; this is compared with the second sample target video frame, and the diffusion model parameters are adjusted based on the comparison result to obtain the second reverse diffusion layer. As described above for step S105, the trained second reverse diffusion layer is a U-Net self-encoder.
Specifically, when the diffusion model is trained, the loss function is:

$$\mathcal{L} = \mathrm{MSE}(Z, \hat{Z})$$

where $\mathrm{MSE}(\cdot,\cdot)$ denotes the mean square error, $Z$ denotes the noise information in the second sample target video frame, and $\hat{Z}$ denotes the noise information in the second sample repair frame output by the diffusion model.
As described above, the noise frames may include several types, such as white-noise frames, regional white-noise frames, green-curtain noise mask frames, and local irregular black-block noise mask frames. When the self-encoder is trained, the numbers of the various types of noise frames need to be balanced, that is, kept close to one another.
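The training samples for the second reverse diffusion layer can be assembled from frame triplets as sketched below; the helper name and tensor layout are assumptions, while the pairing (middle frame noised, neighbors as references) follows the recipe above:

```python
import torch

def make_triplet_sample(frames, i):
    """frames: tensor of consecutive noise-free frames; i: index of the middle frame.
    Returns (reference frames, sample noise image, target) for one training step."""
    prev_frame, mid_frame, next_frame = frames[i - 1], frames[i], frames[i + 1]
    sample_noise_image = mid_frame + torch.randn_like(mid_frame)  # noise the middle
    references = torch.stack([prev_frame, next_frame])  # second sample reference frames
    return references, sample_noise_image, mid_frame    # middle frame is the target
```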
Step K4: input the second denoising feature into the second decoder network to obtain the repair frame of the extracted video frame output by the second decoder network.
Repairing lost frames with the second frame repair model in this way efficiently generates supplemental frames for the extracted lost frames, and normal frames associated with the extracted lost frame are taken into account when a supplemental frame is generated. The supplemental frames thus generated are associated with the reference video frames.
The overall flow of video playback is described below by way of a specific embodiment.
Referring to fig. 7, a flowchart of a fourth video playing method is provided. In this embodiment, the following steps S701 to S707 are included.
Step S701: obtain the video to be played.
Step S702: extract frames from the video according to the frame rate of the video, and obtain the frame data and frame sequence numbers of the extracted video frames.
Step S703: determine the frame classification corresponding to each extracted video frame based on the obtained frame data.
When a frame is classified as a noise frame, step S704 is performed; when a frame is classified as a lost frame, step S705 is performed; when a frame is classified as a normal frame, step S706 is performed.
Step S704: generate a repair frame for the noise frame.
Step S705: generate a supplemental frame for the lost frame as its repair frame.
Step S706: combine the normal frames and the repair frames according to the frame sequence numbers of the extracted video frames.
Step S707: play the repaired video.
Specifically, after step S706, the combined normal frames and repair frames may be merged with the audio track of the video to regenerate the video; the regenerated video is the repaired video played in step S707.
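The overall flow S701-S707 can be summarized in a sketch; all helper functions stand for the steps described above and are hypothetical names, not the patent's interface:

```python
def play_repaired_video(video):
    frames = collect_frame_data(extract_frames(video))      # S701-S702
    repaired = []
    for frame_no, data in frames:
        kind = classify_frame(data)                         # S703
        if kind == "noise":
            data = repair_noise_frame_for(video, frame_no)  # S704
        elif kind == "lost":
            data = generate_supplemental_frame(video, frame_no)  # S705
        repaired.append((frame_no, data))                   # normal frames kept as-is
    repaired.sort(key=lambda x: x[0])                       # S706: combine by number
    play(merge_with_audio(repaired, video.audio_track))     # S707
```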
Corresponding to the video playing method, the embodiment of the invention also provides a video playing device.
Referring to fig. 8, there is provided a schematic structural diagram of a video playing device, the device comprising:
a video frame extraction module 801, configured to extract frames from a video according to a frame rate of the video, and obtain frame data and a frame sequence number of the extracted video frames;
A video frame classification module 802, configured to determine a frame classification corresponding to the extracted video frame based on the obtained frame data;
a reference video frame determining module 803, configured to determine, from normal frames of the video, a reference video frame to be referred to for repairing the extracted video frame, where the normal frames are: video frames that do not require repair;
a first image feature obtaining module 804, configured to obtain a first image feature of the reference video frame;
a repair frame generation module 805, configured to perform feature diffusion processing on the first image feature based on a cross-attention mechanism, and generate a repair frame of the extracted video frame based on a processing result;
the video frame playing module 806 is configured to play the repair frame according to the frame number of the extracted video frame.
By applying the scheme provided by the embodiments of the invention, the frame data and frame sequence number of each video frame can be obtained according to the frame rate of the video, so the video frames to be repaired can be determined based on the frame data, and the reference video frames to be consulted can be determined from the normal frames based on the frame sequence numbers of the video frames to be repaired. The determined reference video frames are thus associated with the video frames to be repaired, the image content of a repair frame can be generated based on the image content of the reference video frames, and the generated image content of the repair frame is associated with that of the reference video frames. Therefore, when the video is played and the repair frames are played according to their frame sequence numbers, the image contents played are associated yet not entirely identical, which reduces problems such as stuttering and frame skipping during playback and ensures the user's viewing experience. The scheme provided by the embodiments of the invention can therefore ensure that the video is played completely and continuously.
Optionally, the reference video frame determining module 803 includes:
a first video frame selection sub-module, configured to select, when the frame classification characterizes the extracted video frame as a noise frame, the preceding video frame nearest to the extracted video frame from the normal frames of the video;
a first reference video frame determining sub-module, configured to determine, based on the preceding video frame, the reference video frame to be consulted for repairing the extracted video frame.
The reference video frame is selected from the normal frames according to the frame sequence numbers, and the selected reference video frame is associated with the extracted video frame, so the image content of the repair frame generated from the reference video frame is associated with the image content of the reference video frame, which ensures the continuity of the video images to be played.
Optionally, the first reference video frame determining submodule includes:
a first scene cut detection unit, configured to detect whether a scene cut exists between the preceding video frame and the extracted video frame;
a first reference video frame determining unit, configured to select, when a scene cut exists between the preceding video frame and the extracted video frame, the subsequent video frame that is nearest to the extracted video frame and has no scene cut from it, from the normal frames of the video, as the reference video frame to be consulted for repairing the extracted video frame.
Selecting the normal frame that is nearest to the extracted video frame and has no scene cut from it as the reference video frame ensures the highest degree of association between the selected reference video frame and the extracted video frame, so that the image content of the generated repair frame is highly associated with the image content of the reference video frame.
Optionally, the repair frame generating module 805 includes:
a second image feature obtaining unit configured to obtain a second image feature of the extracted video frame;
a first feature calculation unit configured to perform cross-attention calculation on the first image feature and the second image feature;
and the first repair frame generation unit is used for denoising the extracted video frame based on the cross attention calculation result by taking the reference video frame as an auxiliary control condition to generate a repair frame of the extracted video frame.
The repair frame is generated based on the cross-attention calculation result, which is a fusion of the first and second image features. Moreover, the first image feature is the image feature of the reference video frame, which is associated with the extracted video frame, so when the repair frame is generated, not only the extracted video frame but also normal frames associated with it are taken into account. The repair frame thus generated is natural and associated with the reference video frame.
Optionally, the first image feature obtaining module 804 is specifically configured to: input the reference video frame, as control-path input data of the pre-trained first frame repair model, into the first self-encoding sub-network of the model, and obtain the first image feature of the reference video frame output by the first self-encoding sub-network, where the first frame repair model further comprises: a first forward diffusion layer, a first cross-attention calculation layer, a first reverse diffusion layer, and a first decoder network;
the second image feature obtaining unit is specifically configured to: inputting the extracted video frames serving as main path input data of the first frame repair model into the first self-coding sub-network to obtain second image features of the extracted video frames output by the first self-coding sub-network;
the first feature calculating unit and the first repair frame generating unit are specifically configured to: inputting the second image features into the first forward diffusion layer to obtain first noise features of the extracted video frames output by the first forward diffusion layer; inputting the first noise feature and the first image feature into the first cross attention computing layer to obtain a first cross attention computing result output by the first cross attention computing layer; inputting the first cross attention calculation result into the first reverse diffusion layer to obtain a first denoising feature which is output after the first reverse diffusion layer performs noise reduction treatment on the first cross attention calculation result; and inputting the first denoising characteristic into the first decoder network to obtain a repair frame of the extracted video frame output by the first decoder network.
Repairing noise frames with the first frame repair model in this way efficiently generates repair frames for the extracted noise frames, and when a repair frame is generated, not only the extracted video frame but also normal frames associated with it are taken into account. The repair frames thus generated are natural and associated with the reference video frames.
Optionally, the reference video frame determining module 803 includes:
a second video frame selection sub-module, configured to select, when the frame classification characterizes the extracted video frame as a lost frame, the preceding video frame nearest to the extracted video frame and the subsequent video frame nearest to the extracted video frame from the normal frames of the video;
a second reference video frame determining sub-module, configured to determine, based on the preceding video frame and the subsequent video frame, the reference video frames to be consulted for repairing the extracted video frame.
The reference video frames are selected from the normal frames according to the frame sequence numbers, and the selected reference video frames are associated with the extracted video frame, so the image content of the repair frame generated from the reference video frames is associated with the image content of the reference video frames, which ensures the continuity of the video images to be played.
Optionally, the second reference video frame determining submodule includes:
a second scene cut detection unit, configured to detect whether a scene cut exists between the preceding video frame and the subsequent video frame;
a second reference video frame determining unit, configured to determine, when no scene cut exists between the preceding video frame and the subsequent video frame, both of them as reference video frames to be consulted for repairing the extracted video frame; and to determine, when a scene cut exists between them, the preceding video frame or the subsequent video frame as the reference video frame to be consulted for repairing the extracted video frame.
The preceding and/or subsequent normal frames nearest to the extracted video frame are selected as reference video frames, and it is ensured that no scene cut exists between the selected reference video frames. Because the selected preceding and subsequent normal frames are nearest to the extracted video frame, and the extracted video frame lies between them, the degree of association between their image contents is high, so the image content of the repair frame generated from the selected reference video frames is highly associated with that of the reference video frames.
Optionally, the repair frame generating module 805 includes:
a third image feature obtaining unit, configured to obtain a third image feature of the random noise image;
a second feature calculation unit, configured to perform cross-attention calculation on the first image feature and the third image feature;
a second repair frame generation unit, configured to perform denoising processing on the random noise image based on the cross-attention calculation result, and generate a supplemental frame based on the denoising result as the repair frame of the extracted video frame.
The supplemental frame is generated based on the cross-attention calculation result, which is computed from the first and third image features and is therefore a fusion of the two. Moreover, the first image feature is the image feature of the reference video frames, which are associated with the extracted lost frame, so normal frames associated with the extracted video frame are taken into account when the supplemental frame is generated. The supplemental frame thus generated is associated with the reference video frames.
Optionally, the first image feature obtaining module 804 is specifically configured to: input the reference video frames, as control-path input data of the pre-trained second frame repair model, into the second self-encoding sub-network of the model, and obtain the first image feature of the reference video frames output by the second self-encoding sub-network, where the second frame repair model further comprises: a second forward diffusion layer, a second cross-attention calculation layer, a second reverse diffusion layer, and a second decoder network;
The third image feature obtaining unit is specifically configured to: input the random noise image, as main-path input data of the second frame repair model, into the second self-encoding sub-network, to obtain the third image feature of the random noise image output by the second self-encoding sub-network;
The second feature calculation unit and the second repair frame generation unit are specifically configured to: input the third image feature into the second forward diffusion layer to obtain the second noise feature output by the second forward diffusion layer; input the second noise feature and the first image feature into the second cross-attention calculation layer to obtain the second cross-attention calculation result output by that layer; input the second cross-attention calculation result into the second reverse diffusion layer to obtain the second denoising feature output after the second reverse diffusion layer performs noise reduction on the second cross-attention calculation result; and input the second denoising feature into the second decoder network to obtain the repair frame of the extracted video frame output by the second decoder network.
Repairing lost frames with the second frame repair model in this way efficiently generates supplemental frames for the extracted lost frames, and normal frames associated with the extracted lost frame are taken into account when a supplemental frame is generated. The supplemental frames thus generated are associated with the reference video frames.
Optionally, the video frame extraction module includes:
an image data determining unit, configured to determine, according to the frame rate of the video, whether image data of a video frame exists at the corresponding position in the video;
a frame data obtaining unit, configured to, when image data exists, extract the image data recorded at the corresponding position as the frame data of the extracted video frame and obtain the frame sequence number of the extracted video frame; and, when no image data exists, determine a null frame identifier as the frame data of the extracted video frame and obtain the frame sequence number of the extracted video frame.
Obtaining the frame data and frame sequence number of the extracted video frames in this way ensures that frame data is obtained for every video frame, whether or not its image data exists. Moreover, using the null frame identifier as the frame data of video frames without image data allows lost frames to be distinguished from normal frames and noise frames, so that a corresponding repair frame can be generated for each lost frame.
The embodiment of the present invention also provides an electronic device, as shown in fig. 9, comprising a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 communicate with each other through the communication bus 904;
A memory 903 for storing a computer program;
the processor 901 is configured to implement the video playing method described in the foregoing method embodiment when executing the program stored in the memory 903.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), for example at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, where a computer program is stored, where the computer program, when executed by a processor, implements the video playing method according to the foregoing method embodiment.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It should be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus comprising that element.
In this specification, the embodiments are described in a related manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, for the apparatus, electronic device, and storage medium embodiments, the description is relatively brief because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the method embodiments.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit its scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (13)

1. A video playing method, the method comprising:
extracting frames from the video according to the frame rate of the video, and obtaining frame data and frame sequence numbers of extracted video frames;
determining a frame classification corresponding to the extracted video frame based on the obtained frame data;
if the frame classification characterizes the extracted video frame as a video frame needing repair, determining, from normal frames of the video, a reference video frame to be referred to for repairing the extracted video frame, wherein the normal frames are: video frames that do not require repair;
obtaining a first image feature of the reference video frame;
performing feature diffusion processing on the first image feature based on a cross attention mechanism, and generating a repair frame of the extracted video frame based on a processing result;
and playing the repair frame according to the frame sequence number of the extracted video frame.
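For illustration, the following is a minimal Python sketch of the overall flow of claim 1, assuming the simplest possible stand-ins: the helper names classify_frame and repair_frame, the label strings, and the use of None as lost-frame data are hypothetical choices of the sketch, not identifiers from this disclosure.

```python
from typing import List, Optional, Tuple

Frame = Tuple[int, Optional[bytes]]  # (frame sequence number, frame data); None marks a lost frame

def classify_frame(data: Optional[bytes]) -> str:
    # Stand-in for the claimed frame classification step; a real system would
    # also detect noise frames, e.g. with a trained classifier.
    return "lost" if data is None else "normal"

def repair_frame(data: Optional[bytes], references: List[Frame]) -> Optional[bytes]:
    # Stand-in for the cross-attention diffusion repair of claims 4-9;
    # this stub simply reuses the nearest reference frame's data.
    return references[0][1]

def play_order(frames: List[Frame]) -> List[Frame]:
    # Assumes at least one normal frame exists in the extracted sequence.
    normal = [(n, d) for n, d in frames if classify_frame(d) == "normal"]
    playable = []
    for number, data in frames:
        if classify_frame(data) == "normal":
            playable.append((number, data))
        else:
            refs = sorted(normal, key=lambda f: abs(f[0] - number))
            playable.append((number, repair_frame(data, refs)))
    return sorted(playable)  # repair frames are played by the original frame sequence number
```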
2. The method of claim 1, wherein, if the frame classification characterizes the extracted video frame as a video frame needing repair, the determining, from normal frames of the video, a reference video frame to be referred to for repairing the extracted video frame comprises:
if the frame classification characterizes the extracted video frame as a noise frame, selecting, from normal frames of the video, a preceding video frame nearest to the extracted video frame;
and determining, based on the preceding video frame, a reference video frame to be referred to for repairing the extracted video frame.
3. The method of claim 2, wherein the determining, based on the preceding video frame, a reference video frame to be referred to for repairing the extracted video frame comprises:
detecting whether a scene switch exists between the preceding video frame and the extracted video frame;
if so, selecting, from the normal frames of the video, a subsequent video frame that is nearest to the extracted video frame and involves no scene switch, as the reference video frame to be referred to for repairing the extracted video frame.
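Claims 2 and 3 leave the scene-switch detector unspecified. The sketch below assumes a simple histogram-distance test and walks the claimed fallback: take the nearest preceding normal frame, and only if a cut separates it from the damaged frame, search forward for the nearest normal frame without a cut. The names scene_switch and noise_frame_reference are illustrative.

```python
import numpy as np

def scene_switch(a: np.ndarray, b: np.ndarray, threshold: float = 0.5) -> bool:
    # Assumed detector: flags a scene cut when the total variation distance
    # between normalized gray-level histograms exceeds a threshold.
    ha, _ = np.histogram(a, bins=64, range=(0, 256))
    hb, _ = np.histogram(b, bins=64, range=(0, 256))
    ha = ha / max(ha.sum(), 1)
    hb = hb / max(hb.sum(), 1)
    return 0.5 * float(np.abs(ha - hb).sum()) > threshold

def noise_frame_reference(idx: int, noise_frame: np.ndarray, normal: dict):
    # normal: {frame sequence number: image array} of frames needing no repair.
    preceding = [n for n in sorted(normal) if n < idx]
    if preceding and not scene_switch(normal[preceding[-1]], noise_frame):
        return normal[preceding[-1]]      # claim 2: nearest preceding normal frame
    for n in sorted(k for k in normal if k > idx):
        if not scene_switch(normal[n], noise_frame):
            return normal[n]              # claim 3: nearest subsequent frame without a cut
    return None
```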
4. The method of claim 2, wherein the performing feature diffusion processing on the first image feature based on the cross attention mechanism and generating a repair frame of the extracted video frame based on the processing result comprises:
obtaining a second image feature of the extracted video frame;
performing cross attention calculation on the first image feature and the second image feature;
and taking the reference video frame as an auxiliary control condition, denoising the extracted video frame based on the cross attention calculation result, and generating a repair frame of the extracted video frame.
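As one illustrative reading of the cross attention calculation of claim 4 (the claim does not fix the exact formulation), a standard scaled dot-product form can be assumed, with queries drawn from the features of the frame being repaired and keys/values from the reference frame's features; all dimensions below are arbitrary choices of the sketch.

```python
import torch
from torch import nn

class CrossAttention(nn.Module):
    # Assumed scaled dot-product cross attention: the second image feature
    # (damaged frame) queries the first image feature (reference frame).
    def __init__(self, dim: int = 64, d_k: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, d_k)
        self.k = nn.Linear(dim, d_k)
        self.v = nn.Linear(dim, d_k)

    def forward(self, frame_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(frame_feats), self.k(ref_feats), self.v(ref_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # reference content routed to the positions being repaired
```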
5. The method of claim 4, wherein the obtaining a first image feature of the reference video frame comprises:
inputting the reference video frame, as control-path input data of a pre-trained first frame repair model, into a first self-coding sub-network in the first frame repair model, to obtain a first image feature of the reference video frame output by the first self-coding sub-network, wherein the first frame repair model further comprises: a first forward diffusion layer, a first cross attention calculation layer, a first reverse diffusion layer, and a first decoder network;
the obtaining a second image feature of the extracted video frame comprises:
inputting the extracted video frame, as main-path input data of the first frame repair model, into the first self-coding sub-network to obtain a second image feature of the extracted video frame output by the first self-coding sub-network;
the performing cross attention calculation on the first image feature and the second image feature, taking the reference video frame as an auxiliary control condition, denoising the extracted video frame based on the cross attention calculation result, and generating a repair frame of the extracted video frame comprises:
inputting the second image feature into the first forward diffusion layer to obtain a first noise feature of the extracted video frame output by the first forward diffusion layer;
inputting the first noise feature and the first image feature into the first cross attention calculation layer to obtain a first cross attention calculation result output by the first cross attention calculation layer;
inputting the first cross attention calculation result into the first reverse diffusion layer to obtain a first denoising feature output after the first reverse diffusion layer performs noise reduction processing on the first cross attention calculation result;
and inputting the first denoising feature into the first decoder network to obtain a repair frame of the extracted video frame output by the first decoder network.
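A hedged sketch of the claim 5 data flow follows, with each stage reduced to a stub so that only the wiring is shown: in a real system the self-coding sub-network would be a learned encoder and the forward/reverse diffusion layers a trained noising/denoising pair; the module choices and sizes here are assumptions of the sketch, not of the claims.

```python
import torch
from torch import nn

class FirstFrameRepairSketch(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encode = nn.Conv2d(3, dim, 3, padding=1)      # stub self-coding sub-network (shared)
        self.attend = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.denoise = nn.Conv2d(dim, dim, 3, padding=1)   # stub first reverse diffusion layer
        self.decode = nn.Conv2d(dim, 3, 3, padding=1)      # stub first decoder network

    def forward(self, damaged: torch.Tensor, reference: torch.Tensor,
                noise_scale: float = 0.1) -> torch.Tensor:
        ref_feat = self.encode(reference)                  # first image feature (control path)
        frame_feat = self.encode(damaged)                  # second image feature (main path)
        noised = frame_feat + noise_scale * torch.randn_like(frame_feat)  # forward diffusion
        b, c, h, w = noised.shape
        q = noised.flatten(2).transpose(1, 2)              # (B, HW, C)
        kv = ref_feat.flatten(2).transpose(1, 2)
        attn, _ = self.attend(q, kv, kv)                   # cross attention with the reference
        attn = attn.transpose(1, 2).reshape(b, c, h, w)
        return self.decode(self.denoise(attn))             # reverse diffusion, then decode
```

For example, FirstFrameRepairSketch()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)) produces a 3-channel output of the same spatial size as the damaged frame.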
6. The method of claim 1, wherein, if the frame classification characterizes the extracted video frame as a video frame needing repair, the determining, from normal frames of the video, a reference video frame to be referred to for repairing the extracted video frame comprises:
if the frame classification characterizes the extracted video frame as a lost frame, selecting, from normal frames of the video, a preceding video frame nearest to the extracted video frame and a subsequent video frame nearest to the extracted video frame;
and determining, based on the preceding video frame and the subsequent video frame, a reference video frame to be referred to for repairing the extracted video frame.
7. The method of claim 6, wherein the determining, based on the preceding video frame and the subsequent video frame, a reference video frame to be referred to for repairing the extracted video frame comprises:
detecting whether a scene switch exists between the preceding video frame and the subsequent video frame;
if not, determining both the preceding video frame and the subsequent video frame as reference video frames to be referred to for repairing the extracted video frame;
and if so, determining the preceding video frame or the subsequent video frame as the reference video frame to be referred to for repairing the extracted video frame.
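Reusing the assumed scene_switch detector sketched after claim 3, the claim 6/7 selection for a lost frame might look as follows; normal again maps frame sequence numbers to the image arrays of frames needing no repair.

```python
def lost_frame_references(idx: int, normal: dict) -> list:
    # Claim 6: nearest preceding and nearest subsequent normal frames.
    prev_n = max((n for n in normal if n < idx), default=None)
    next_n = min((n for n in normal if n > idx), default=None)
    if prev_n is None or next_n is None:
        return [normal[n] for n in (prev_n, next_n) if n is not None]
    # Claim 7: both sides when no cut lies between them, one side otherwise.
    if scene_switch(normal[prev_n], normal[next_n]):
        return [normal[prev_n]]
    return [normal[prev_n], normal[next_n]]
```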
8. The method of claim 6, wherein the performing feature diffusion processing on the first image feature based on the cross attention mechanism and generating a repair frame of the extracted video frame based on the processing result comprises:
obtaining a third image feature of a random noise image;
performing cross attention calculation on the first image feature and the third image feature;
and denoising the random noise image based on the cross attention calculation result, and generating a supplementary frame based on the denoising result as a repair frame of the extracted video frame.
9. The method of claim 8, wherein the obtaining a first image feature of the reference video frame comprises:
inputting the reference video frame, as control-path input data of a pre-trained second frame repair model, into a second self-coding sub-network in the second frame repair model, to obtain a first image feature of the reference video frame output by the second self-coding sub-network, wherein the second frame repair model further comprises: a second forward diffusion layer, a second cross attention calculation layer, a second reverse diffusion layer, and a second decoder network;
the obtaining a third image feature of the random noise image comprises:
inputting the random noise image, as main-path input data of the second frame repair model, into the second self-coding sub-network to obtain a third image feature of the random noise image output by the second self-coding sub-network;
the performing cross attention calculation on the first image feature and the third image feature, denoising the random noise image based on the cross attention calculation result, and generating a supplementary frame based on the denoising result as a repair frame of the extracted video frame comprises:
inputting the third image feature into the second forward diffusion layer to obtain a second noise feature of the random noise image output by the second forward diffusion layer;
inputting the second noise feature and the first image feature into the second cross attention calculation layer to obtain a second cross attention calculation result output by the second cross attention calculation layer;
inputting the second cross attention calculation result into the second reverse diffusion layer to obtain a second denoising feature output after the second reverse diffusion layer performs noise reduction processing on the second cross attention calculation result;
and inputting the second denoising feature into the second decoder network to obtain a repair frame of the extracted video frame output by the second decoder network.
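The claim 8/9 path differs from claims 4/5 mainly in the main-path input: a random noise image replaces the damaged frame, so the repair frame is generated under the guidance of the reference rather than denoised from original content. A sketch, reusing the stub model wiring assumed after claim 5 with a single reference frame:

```python
import torch

def repair_lost_frame(model: torch.nn.Module, reference: torch.Tensor) -> torch.Tensor:
    # model is assumed to expose the (main_path, control_path) interface of the
    # FirstFrameRepairSketch above; the noise image plays the main-path role.
    noise_image = torch.randn_like(reference)   # random noise image, main-path input
    return model(noise_image, reference)        # reference frame as control-path input
```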
10. The method according to any one of claims 1-9, wherein the extracting frames from the video according to the frame rate of the video to obtain frame data and frame sequence numbers of extracted video frames comprises:
determining, position by position according to the frame rate of the video, whether image data of a video frame exists at the corresponding position in the video;
if so, extracting the image data recorded at the corresponding position as the frame data of the extracted video frame, and obtaining the frame sequence number of the extracted video frame;
if not, determining an empty frame identifier as the frame data of the extracted video frame, and obtaining the frame sequence number of the extracted video frame.
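A minimal extraction sketch for claim 10 using OpenCV: the stream is walked position by position, and a failed read is recorded under the empty frame identifier. The choice of None as that identifier is an assumption of the sketch, not of the claim.

```python
import cv2

EMPTY_FRAME = None  # assumed empty frame identifier

def extract_frames(path: str):
    # Walks the video position by position at its native frame rate; each
    # position yields (frame sequence number, frame data or EMPTY_FRAME).
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for number in range(total):
        ok, image = cap.read()
        frames.append((number, image if ok else EMPTY_FRAME))
    cap.release()
    return frames
```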
11. A video playing device, the device comprising:
a video frame extraction module, configured to extract frames from a video according to the frame rate of the video to obtain frame data and frame sequence numbers of extracted video frames;
a video frame classification module, configured to determine a frame classification corresponding to the extracted video frame based on the obtained frame data;
a reference video frame determining module, configured to, if the frame classification characterizes the extracted video frame as a video frame needing repair, determine, from normal frames of the video, a reference video frame to be referred to for repairing the extracted video frame, wherein the normal frames are: video frames that do not require repair;
a first image feature obtaining module, configured to obtain a first image feature of the reference video frame;
a repair frame generation module, configured to perform feature diffusion processing on the first image feature based on a cross attention mechanism and generate a repair frame of the extracted video frame based on a processing result;
and a video frame playing module, configured to play the repair frame according to the frame sequence number of the extracted video frame.
12. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method steps of any one of claims 1-10 when executing the program stored in the memory.
13. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1-10.
CN202311135574.5A 2023-09-05 2023-09-05 Video playing method and device, electronic equipment and storage medium Active CN116866665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311135574.5A CN116866665B (en) 2023-09-05 2023-09-05 Video playing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116866665A 2023-10-10
CN116866665B 2023-11-14

Family

ID=88236352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311135574.5A Active CN116866665B (en) 2023-09-05 2023-09-05 Video playing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116866665B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101651786A (en) * 2008-08-14 2010-02-17 深圳华为通信技术有限公司 Method for restoring brightness change of video sequence and video processing equipment
US20200118594A1 (en) * 2018-10-12 2020-04-16 Adobe Inc. Video inpainting via user-provided reference frame
CN111641835A (en) * 2020-05-19 2020-09-08 Oppo广东移动通信有限公司 Video processing method, video processing device and electronic equipment
CN113365103A (en) * 2021-06-02 2021-09-07 深圳市帧彩影视科技有限公司 Automatic bad frame detection method, device, equipment, storage medium and program product
CN113411571A (en) * 2021-06-16 2021-09-17 福建师范大学 Video frame definition detection method based on sliding window gradient entropy
CN113436100A (en) * 2021-06-28 2021-09-24 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for repairing video
CN114418882A (en) * 2022-01-17 2022-04-29 京东方科技集团股份有限公司 Processing method, training method, device, electronic equipment and medium
CN114972050A (en) * 2021-02-27 2022-08-30 华为技术有限公司 Image restoration method and device
CN115526815A (en) * 2022-08-31 2022-12-27 深圳市汇顶科技股份有限公司 Image processing method and device and electronic equipment
CN115965791A (en) * 2022-12-19 2023-04-14 北京字跳网络技术有限公司 Image generation method and device and electronic equipment
CN116233491A (en) * 2023-05-04 2023-06-06 阿里巴巴达摩院(杭州)科技有限公司 Video generation method and server
CN116437093A (en) * 2021-12-30 2023-07-14 北京字跳网络技术有限公司 Video frame repair method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant