WO2023193521A1

WO2023193521A1 - Video inpainting method, related apparatus, device and storage medium

Info

Publication number: WO2023193521A1
Application number: PCT/CN2023/075576
Authority: WO
Inventors: 钟立耿; 朱允全; 谯睿智
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2022-04-06
Filing date: 2023-02-13
Publication date: 2023-10-12
Also published as: CN115170400A

Abstract

The present application discloses a video inpainting method based on artificial intelligence, with application scenarios at least comprising different types of terminal, such as a mobile phone, a computer, and a vehicle-mounted terminal. The present application comprises: acquiring a video sample sequence; acquiring a target mask sample sequence according to the video sample sequence; acquiring an optical flow data sequence according to the video sample sequence; on the basis of each piece of optical flow data in the optical flow data sequence, performing clustering processing on pixel points comprised by a target mask region in each target mask frame, to obtain an optical flow clustering result for each target mask frame; determining an optical flow quality score according to the optical flow clustering result of each target mask frame; and performing inpainting processing on a video to be inpainted using a video inpainting mode matching the optical flow quality score. The present application also provides a related apparatus. According to the present application, the optical flow quality is used as a basis for selecting a video inpainting mode, enabling different video inpainting modes to complement each other, thereby helping to obtain a video picture having a better inpainting effect.

Description

A video repair method, related devices, equipment and storage media

This application requests the priority of the Chinese patent application submitted to the China Patent Office on April 6, 2022, with the application number 2022103555942 and the application title "A video repair method, related devices, equipment and storage media", and its entire content incorporated herein by reference.

Technical field

This application relates to the field of data processing technology, especially to video repair technology.

Background technique

Video inpainting is a task used to fill in the missing areas in the video frame with reasonable content. It mainly uses the unmasked area information in the video to repair the masked areas. For example, repair damaged videos, remove unwanted objects, relocate videos, repair underexposed images, etc.

At present, video restoration technology is mainly divided into two types. One is the technology that uses optical flow propagation and image restoration. This technology first propagates available pixels to the corresponding areas through optical flow, and then uses image restoration to fill isolated pixel blocks. The other is to use an end-to-end neural network method to fill the occluded area using a generative model.

However, the above video repair technology has at least the following problems. The content based on optical flow filling has higher definition, but relies too much on optical flow. The optical flow itself is easily disturbed and the optical flow estimation may be inaccurate. Therefore, it is easy to Distortions and incorrect fillings occur. The end-to-end neural network method takes into account semantic information and usually does not cause distortions and serious errors. However, due to the complex background, it is easy to cause the filling content to be blurred.

Contents of the invention

Embodiments of the present application provide a video repair method, related devices, equipment, and storage media. This application uses optical flow quality as the basis for selecting a video repair method, so that different video repair methods can learn from each other's strengths and weaknesses, thereby helping to obtain video images with better repair effects.

In view of this, on the one hand, this application provides a video repair method, which is executed by a computer device, including:

Obtain a video sample sequence corresponding to the video to be repaired, where the video sample sequence includes K video frame pairs, each video frame pair includes two adjacent video frames, and K is an integer greater than or equal to 1;

Obtain the target mask sample sequence according to the video sample sequence, where the target mask sample sequence includes K target mask frames, and each target mask frame includes a target mask area obtained by expanding the original mask area, and, There is a one-to-one correspondence between K target mask frames and K video frame pairs;

Obtain an optical flow data sequence according to the video sample sequence, where the optical flow data sequence includes K optical flow data, and there is a one-to-one correspondence between the K optical flow data and K video frame pairs;

Based on each optical flow data in the optical flow data sequence, the target mask area in each target mask frame is The pixels included in the domain are clustered to obtain the optical flow clustering results of each target mask frame;

Determine the optical flow quality score based on the optical flow clustering results of each target mask frame;

Use a video repair method that matches the optical flow quality score to repair the video to be repaired.

Another aspect of this application provides a video repair device, including:

An acquisition module is used to obtain a video sample sequence for the video to be repaired, where the video sample sequence includes K video frame pairs, each video frame pair includes two adjacent video frames, and K is an integer greater than or equal to 1;

The acquisition module is also used to obtain a target mask sample sequence according to the video sample sequence, wherein the target mask sample sequence includes K target mask frames, and each target mask frame includes a target obtained by expanding the original mask area. Mask area, and there is a one-to-one correspondence between K target mask frames and K video frame pairs;

The acquisition module is also used to obtain the optical flow data sequence according to the video sample sequence, wherein the optical flow data sequence includes K optical flow data, and there is a one-to-one correspondence between the K optical flow data and the K video frame pairs;

The processing module is used to cluster the pixels included in the target mask area in each target mask frame based on each optical flow data in the optical flow data sequence, and obtain the optical flow clustering of each target mask frame. result;

A determination module used to determine the optical flow quality score based on the optical flow clustering results of each target mask frame;

The repair module is used to repair the video to be repaired using a video repair method that matches the optical flow quality score.

Another aspect of the present application provides a computer device, including a memory and a processor. The memory stores a computer program. When the processor executes the computer program, the methods of the above aspects are implemented.

Another aspect of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the methods of the above aspects are implemented.

Another aspect of the present application provides a computer program product, including a computer program, which implements the methods of the above aspects when executed by a processor.

It can be seen from the above technical solutions that the embodiments of the present application have the following advantages:

In the embodiment of the present application, a video repair method is provided. First, a video sample sequence corresponding to the video to be repaired is obtained, and then a target mask sample sequence can be obtained according to the video sample sequence, wherein each target mask frame includes a pair of original The target mask area obtained after the mask area is expanded. Furthermore, the optical flow data sequence is obtained according to the video sample sequence, and then based on each optical flow data in the optical flow data sequence, the pixels included in the target mask area in each target mask frame are clustered to obtain each target. Optical flow clustering results of mask frames. Based on this, the optical flow quality score can be determined based on the optical flow clustering results of each target mask frame, and the video to be repaired can be repaired using a video repair method that matches the optical flow quality score. Through the above method, the optical flow clustering results of the masked area are used to predict the optical flow quality. When the optical flow quality is good, the optical flow method can be used as the video Repair methods to obtain filler content with higher clarity and credibility. When the optical flow quality is poor, the generative model can be used as a video repair method to obtain a more stable filling effect. It can be seen that this application uses optical flow quality as the basis for selecting a video repair method, so that different video repair methods can learn from each other's strengths and weaknesses, thereby helping to obtain a video picture with better repair effects.

Description of drawings

Figure 1 is an architectural schematic diagram of a video repair system in an embodiment of the present application;

Figure 2 is an effect diagram of video frame filling based on the optical flow method in the embodiment of the present application;

Figure 3 is an effect diagram of video frame filling based on the model method in the embodiment of the present application;

Figure 4 is a schematic flow chart of a video repair method in an embodiment of the present application;

Figure 5 is a schematic diagram of generating a target mask frame in an embodiment of the present application;

Figure 6 is another schematic diagram of generating a target mask frame in an embodiment of the present application;

Figure 7 is another schematic diagram of generating a target mask frame in an embodiment of the present application;

Figure 8 is another schematic diagram of generating a target mask frame in an embodiment of the present application;

Figure 9 is a schematic diagram of determining a two-dimensional optical flow value based on forward optical flow in an embodiment of the present application;

Figure 10 is a schematic diagram of determining a two-dimensional optical flow value based on backward optical flow in an embodiment of the present application;

Figure 11 is a schematic diagram of the effect of removing a mark based on a video repair application in an embodiment of the present application;

Figure 12 is a schematic diagram of the effect of removing subtitles based on a video repair application in an embodiment of the present application;

Figure 13 is a schematic diagram of the effect of object removal based on video repair application in the embodiment of the present application;

Figure 14 is a schematic diagram comparing the effects of video frame restoration based on the optical flow method and the model method in the embodiment of the present application;

Figure 15 is a schematic diagram of the video repair device in the embodiment of the present application;

Figure 16 is a schematic structural diagram of a terminal in an embodiment of the present application;

Figure 17 is a schematic structural diagram of a server in an embodiment of the present application.

Detailed ways

With the advent of the era of multimedia and artificial intelligence (AI), video has gradually become the mainstream method of information exchange, and massive videos pose more challenges to video quality management. Currently, videos may be defective due to some reasons. For example, there is a mosaic pattern in the video screen. The mosaic pattern will affect the user's viewing experience. For another example, there may be station logos or advertising patterns during the video creation process. Based on this, this application proposes a video repair method, aiming to remove unnecessary objects in the video or restore damaged pictures.

Video repair methods specifically involve AI-based computer vision (CV) technology and machine learning (ML). That is, repairable objects (for example, station logos, subtitles, etc.) are identified from the video through CV technology. The video image is repaired through the neural network trained by ML.

In order to improve the effect of video picture repair, this application proposes a video repair method, which is applied to the video repair system shown in Figure 1. As shown in the figure, the video repair system includes services Server and terminal, and a client is deployed on the terminal. The client can run on the terminal in the form of a browser, or can also run on the terminal in the form of an independent application (application, APP). Here, for The specific presentation form of the client is not limited. The server involved in this application can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides cloud services. The terminal can be a mobile phone, computer, intelligent voice interaction device, smart home appliance, vehicle terminal, aircraft, etc., but is not limited to this. The terminal and the server can be connected directly or indirectly through wired or wireless communication methods, which is not limited in this application. There is no limit on the number of servers and terminals. The solution provided by this application can be completed independently by the terminal, or independently by the server, or it can also be completed by the terminal and the server in cooperation. This application does not specifically limit this.

The following will introduce the workflow of two video repair scenarios based on the architecture shown in Figure 1.

For example, in one case, the user can upload the video to the server through the terminal, and the server can directly call the video repair function. That is, the selected video repair algorithm (ie, optical flow method or model method) is first determined, and based on this, the corresponding video repair algorithm is used to repair the video. Finally, the repaired video is stored in the database. When the terminal requests the server to play a video, the server can obtain the corresponding video from the database and feed it back to the terminal.

For example, in another case, the user can upload the video to the server through the terminal, and the server stores the video uploaded by the terminal into the database. When a video needs to be repaired, the corresponding video can be selected from the database and then the video repair function can be called. That is, the selected video repair algorithm (ie, optical flow method or model method) is first determined. Based on this, the video is repaired using the corresponding video repair algorithm. Finally, the repaired video is stored in the database.

There are certain differences in the effects of video repair using optical flow method and model method. The following will be introduced with illustrations.

1. Fill the mask area based on the optical flow method;

For ease of introduction, please refer to Figure 2. Figure 2 is an effect diagram of video frame filling based on the optical flow method in the embodiment of the present application. As shown in (a) of Figure 2, the mask object is detected in the video frame. . After filling with the optical flow method, the video frame shown in (b) in Figure 2 can be obtained. It can be seen that in the case of object occlusion and complex movement of the background, the filling effect based on the optical flow method will be greatly affected, and the erroneous pixels caused by the optical flow estimation error will gradually expand as it propagates, resulting in incorrect filling content. .

2. Fill the mask area based on the model method;

For ease of introduction, please refer to Figure 3. Figure 3 is an rendering of video frame filling based on the model method in an embodiment of the present application. As shown in (a) of Figure 3, a mask object is detected in the video frame. After filling with the model method, the video frame shown in (b) in Figure 3 can be obtained. It can be seen that the filling part is blurry, and it is difficult to process high-resolution input due to limitations such as video memory, but the overall effect is relatively stable and obvious errors with strong contrast are not prone to occur.

Based on the above introduction, limited by the repair quality of the optical flow method and the model method, this application proposes a visual The frequency repair method can determine in advance which video repair method to choose for picture repair, and then use a more appropriate video repair method to repair the video picture, so as to achieve a more robust filling effect. The video repair method in the present application will be introduced below. Please refer to Figure 4. The video repair method in the embodiment of the present application can be executed by a computer device, and the computer device can be a terminal or a server. The embodiment of the present application includes:

110. Obtain the video sample sequence corresponding to the video to be repaired, where the video sample sequence includes K video frame pairs, each video frame pair includes two adjacent video frames, and K is an integer greater than or equal to 1.

In one or more embodiments, the computer device can obtain the video to be repaired, and then extract K video frame pairs from the video to be repaired to form a video sample sequence. Each video frame pair includes two adjacent video frames, and each video frame pair includes two adjacent video frames. A video frame has a corresponding video frame number. For example, if no normalization is performed, the video sample sequence can be expressed as x _s , x _s ={(x ₁ , x ₂ ), (x ₁₁ , x ₁₂ ),...}. ; Among them, the video sample sequence includes K video frame pairs, that is, the first video frame pair is expressed as (x ₁ , x ₂ ), the second video frame pair is expressed as (x ₁₁ , x ₁₂ ), and so on. . _For example, _if normalization _has been _performed , _the video sample sequence is _expressed as A video frame pair, that is, the first video frame pair is expressed as (x _r1 , x _r2 ), the second video frame pair is expressed as (x _r11 , x _r12 ), and so on.

120. Obtain the target mask sample sequence according to the video sample sequence, where the target mask sample sequence includes K target mask frames, and each target mask frame includes a target mask area obtained by expanding the original mask area, Moreover, there is a one-to-one correspondence between K target mask frames and K video frame pairs.

In one or more embodiments, after the computer device obtains the video sample sequence, for each video frame pair, the corresponding at least one original mask frame can be obtained. For each original mask frame, the corresponding original mask area is marked, and then the original mask area is expanded according to a certain number of pixels to obtain the target mask area. Then the target mask frame is obtained according to the target mask area. Thus, a target mask sample sequence including K target mask frames is obtained.

130. Obtain the optical flow data sequence according to the video sample sequence, where the optical flow data sequence includes K optical flow data, and there is a one-to-one correspondence between the K optical flow data and K video frame pairs.

In one or more embodiments, the computer device may generate corresponding optical flow data respectively according to K video frame pairs in the video sample sequence, thereby obtaining an optical flow data sequence including K optical flow data. Among them, the optical flow data can be expressed as an optical flow matrix of two channels. One optical flow matrix is used to record the horizontal offset of all pixels in the video frame pair, and the other optical flow matrix is used to record the vertical offset of all pixels in the video frame pair.

It should be noted that in practical applications, the computer device can perform step 120 first and then step 130, or it can perform step 130 first and then step 120, or it can also perform steps 120 and 130 at the same time. In the embodiment of the present application, steps The execution order of step 120 and step 130 is not limited in any way.

140. Based on each optical flow data in the optical flow data sequence, calculate the target mask in each target mask frame. The pixels included in the mask area are clustered to obtain the optical flow clustering results of each target mask frame.

In one or more embodiments, the optical flow data sequence and the target mask sample sequence are aligned, that is, the optical flow data in the optical flow data sequence corresponds to the target mask frame in the target mask sample sequence. Based on this, for each target mask frame, the corresponding optical flow data is used to assign corresponding two-dimensional optical flow values to each pixel in the target mask area. Then, based on the two-dimensional optical flow value of each pixel, a clustering algorithm is used to cluster these pixels, thereby obtaining the optical flow clustering result of each target mask frame.

It can be understood that this application can use density-based spatial clustering of applications with noise (DBSCAN), or mean shift clustering method (meanshift), or other clustering methods for Pixel points are clustered, which is not limited here.

150. Determine the optical flow quality score based on the optical flow clustering results of each target mask frame.

In one or more embodiments, the computer device can combine the optical flow clustering results of each target mask frame to comprehensively determine the quality of the optical flow, thereby generating a corresponding optical flow quality score. For example, the optical flow quality score in this application is the first score or the second score. Wherein, when the optical flow quality score is the first score, it means that the optical flow quality is good. For example, the first score may be "1". When the optical flow quality score is the second score, it means that the optical flow quality is poor. For example, the second score may be "0".

It can be understood that in actual applications, other values can also be set for the first score and the second score respectively. This is only an illustration and should not be understood as a limitation of the present application.

160. Use a video repair method that matches the optical flow quality score to repair the video to be repaired.

In one or more embodiments, the computer device may select a corresponding video repair method based on the optical flow quality score. That is, if the optical flow quality score is the first score, the optical flow method is used to repair the video to be repaired. If the optical flow quality score is the second score, the neural network is called to repair the video to be repaired.

Specifically, the process of using the optical flow method to repair the video to be repaired mainly includes: using adjacent frames for optical flow estimation, then filling the original mask area in each frame with optical flow, and using optical flow to fill the unmasked area with optical flow. Pixel gradients are propagated to the original mask area. Then Poisson reconstruction is performed on the pixel gradient to generate red green blue (RGB) pixels. Finally, image inpainting is performed on areas that cannot be filled by optical flow.

The process of calling the neural network to repair the video to be repaired is to receive the frame sequence information as input, and then output the repaired video frame after being processed by the neural network. It is understandable that neural networks mostly use an encoder-decoder structure. The neural network used in this application can be a fine-grained visual categorization (FGVC) network, or a spatial-temporal transformer network (STTN), or a decoupled spatial-temporal attention network (decoupled spatial- Temporal attention network, DSTT), etc., are not limited here.

The embodiment of this application provides a video repair method. In the above way, using the masked The regional optical flow clustering results predict the optical flow quality. When the optical flow quality is good, the optical flow method can be used as a video repair method to obtain filling content with higher clarity and credibility. When the optical flow quality is poor, the generative model can be used as a video repair method to obtain a more stable filling effect. It can be seen that this application uses the optical flow quality as the basis for selecting a video repair method, so as to achieve the purpose of making different video repair methods complement each other, thereby conducive to obtaining a video picture with better repair effect.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, in another optional embodiment provided by the embodiment of the present application, obtaining a video sample sequence corresponding to the video to be repaired may specifically include:

Obtain a video sequence from the video to be repaired, where the video sequence includes T original video frames, each original video frame displays a target object, and T is an integer greater than 1;

Extract K video frame pairs to be processed from the video sequence, where each video frame pair to be processed includes two adjacent original video frames;

The size of each original video frame in the K video frame pairs to be processed is normalized respectively to obtain K video frame pairs, and the K video frame pairs are used as a video sample sequence.

In one or more embodiments, a way of generating a sequence of video samples is introduced. As can be seen from the foregoing embodiments, the video sample sequence originates from the video to be repaired, and the video to be repaired is expressed as x={x _t }(t=1,2,...,T). It can be seen that the video to be repaired includes T original video frames, That is, x _t represents the t-th original video frame.

Specifically, adjacent original video frames can be extracted at certain intervals. For example, if a group of adjacent original video frames is extracted every 10 frames, the extracted sequence can be expressed as x _s , x _s = {(x ₁ ,x ₂ ), (x ₁₁ ,x ₁₂ ),…}, Among them, the sequence includes K video frame pairs to be processed, that is, the first video frame pair to be processed is expressed as (x ₁ , x ₂ ), and the second video frame pair to be processed is expressed as (x ₁₁ , x ₁₂ ) , and so on. Based on this, each original video frame in the pair of video frames to be processed is subjected to size normalization processing to obtain the corresponding video frame. Adjacent video frames constitute a video frame pair, and K video frame pairs constitute a video sample sequence, where the video sample sequence can be expressed as x _sr , x _sr = {(x _r1 ,x _r2 ), (x _r11 ,x _r12 ),…}.

It should be noted that the video frame after size normalization has a fixed size, for example, 512×288.

Secondly, embodiments of this application provide a way to generate a video sample sequence. Through the above method, on the one hand, extracting several video frame pairs to be processed from the video sequence for subsequent processing can reduce the amount of data processing and save data processing resources. On the other hand, normalizing the size of the original video frame can not only align the statistics of each video frame, but also reduce the size of the video frame, thereby improving processing efficiency.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, in another optional embodiment provided by the embodiment of the present application, obtaining the target mask sample sequence according to the video sample sequence may specifically include:

For each video frame pair in the video sample sequence, obtain the original mask frame corresponding to the video frame pair according to any video frame in the video frame pair, where the original mask frame includes the corresponding The original mask area obtained after masking the target object in the video frame;

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair to obtain the target mask frame corresponding to the video frame pair;

The corresponding target mask frames of the K video frame pairs are used as the target mask sample sequence.

In one or more embodiments, a method of generating a target mask frame based on a single video frame is introduced. It can be known from the foregoing embodiments that the target object is displayed in the video to be repaired. For this purpose, the target object needs to be masked to obtain the corresponding original mask area. Then the original mask area is expanded according to a certain number of pixels to obtain the target mask area.

It should be noted that the target object can be a logo, subtitle, object, etc. It can be understood that the method of identifying target objects includes but is not limited to manual annotation and model recognition. For example, using a fully convolutional network (fully convolution network, FCN) to identify the target object.

For example, one processing method is that the video to be repaired is x={x _t }(t=1,2,...,T). Mask processing is performed on each original video frame in the video to be repaired, thereby obtaining m={m _t } (t=1, 2,...,T). Assuming that a group of adjacent original video frames is extracted every 10 frames, the extracted video sample sequence is expressed as x _s = {(x ₁ , x ₂ ), (x ₁₁ , x ₁₂ ),...}, thus, we get The corresponding mask frame sequence is expressed as m _s = {(m ₁ ,m ₂ ), (m ₁₁ ,m ₁₂ ),…}, where (m ₁ ,m ₂ ) corresponds to (x ₁ ,x ₂ ), (m ₁₁ ,m ₁₂ ) corresponds to (x ₁₁ ,x ₁₂ ), and so on. If forward optical flow is used, the previous video frame of each video frame pair is extracted based on m _s , and m _sF = {m ₁ , m ₁₁ ,…} is obtained. Then m _sF is normalized to obtain the original mask. The membrane frame sequence is expressed as m _sr = {m _r1 , m _r11 ,…}; where m _r1 corresponds to (x ₁ , x ₂ ), m _r11 corresponds to (x ₁₁ , x ₁₂ ), and so on. If backward optical flow is used, the next video frame of each video frame pair is extracted based on m _s , and m _sB = {m ₂ , m ₁₂ ,…} is obtained. Then m _sB is normalized to obtain the original mask. The membrane frame sequence is expressed as m _sr = {m _r2 , m _r12 ,…}, where m _r2 corresponds to (x ₁ , x ₂ ), m _r12 corresponds to (x ₁₁ , x ₁₂ ), and so on. Wherein, the original mask frame sequence includes K original mask frames.

For example, one processing method is that the video to be repaired is x={x _t }(t=1,2,...,T). Assuming that a group of adjacent original video frames is extracted every 10 frames, the extracted sequence is expressed as x _s ={(x ₁ ,x ₂ ), (x ₁₁ ,x ₁₂ ),…}. Each original video frame in x _s is normalized, and the video sample sequence is obtained as x _sr = {(x _r1 ,x _r2 ), (x _r11 ,x _r12 ),…}. If forward optical flow is used, the previous video frame of each video frame pair is extracted based on x _sr , and x _srF = {x _r1 , x _r11 ,...} is obtained. Then x _srF is masked to obtain the original mask. The membrane frame sequence is expressed as m _sr ={m _r1 ,m _r11 ,…}; where m _r1 corresponds to (x _r1 ,x _r2 ), m _r11 corresponds to (x _r11 ,x _r12 ), and so on. If backward optical flow is used, the next video frame of each video frame pair is extracted based on x _sr , and x _srB = {x _r2 , x _r12 ,…} is obtained. Then x _srB is masked to obtain the original mask. The membrane frame sequence is expressed as m _sr ={m _r2 ,m _r12 ,…},; where m _r2 corresponds to (x _r1 ,x _r2 ), m _r12 corresponds to (x _r11 ,x _r12 ), and so on. Wherein, the original mask frame sequence includes K original mask frames.

Specifically, for ease of understanding, please refer to Figure 5. Figure 5 is a schematic diagram of generating a target mask frame in an embodiment of the present application, taking the original mask frame shown in (a) of Figure 5 as an example, where the mark The 15 pixels of "1" constitute the original mask area. Assume that the original mask is mapped according to the number of 2 pixels The area is expanded to obtain the target mask area (ie, the gray area composed of pixels marked "1"). Based on this, the target mask frame shown in (b) in Figure 5 is obtained.

By analogy, each original mask frame is processed until the target mask sample sequence is obtained. The target mask sample sequence can be expressed as {m _dst } (t=1,2,...,K). Among them, m _dst represents the t-th target mask frame.

Secondly, the embodiment of the present application provides a way to generate a target mask frame based on a single video frame. Through the above method, considering that there is little difference between the original mask areas of the two video frames before and after the video frame is centered, only one of the original mask frames can be subjected to area expansion processing, thereby reducing the complexity of the operation.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, in another optional embodiment provided by the embodiment of the present application, for each video frame pair in the video sample sequence, the original corresponding to the video frame pair The original mask area in the mask frame is expanded to obtain the target mask frame corresponding to the video frame, which may include:

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair according to the number of first pixels to obtain the first mask area corresponding to the video frame pair. ;

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair according to the number of second pixels to obtain the second mask area corresponding to the video frame pair. , where the number of second pixels is greater than the number of first pixels;

For each video frame pair in the video sample sequence, an XOR operation is performed on the first mask area and the second mask area corresponding to the video frame pair to obtain the target mask frame corresponding to the video frame pair.

In one or more embodiments, a way of expanding the original mask area is introduced. It can be known from the foregoing embodiments that for each original mask frame in the original mask frame sequence, the original mask area can also be expanded to obtain a target mask area. In this way, the target mask frame containing the target mask area is obtained.

Specifically, for ease of understanding, please refer to Figure 6, which is another schematic diagram of generating a target mask frame in an embodiment of the present application, taking the original mask frame shown in (a) of Figure 6 as an example, where, The 15 pixels marked "1" constitute the original mask area. Assume that the original mask area is expanded according to the first number of pixels (for example, 2 pixels) to obtain the first mask area (that is, a gray area composed of pixels marked "1"), that is, we get The mask frame shown in (b) of Figure 6. Assume that the original mask area is expanded according to the second number of pixels (for example, 4 pixels) to obtain the second mask area (that is, the gray area composed of pixels marked "1"), that is, we get The mask frame shown in (c) of Figure 6. Furthermore, an XOR operation is performed on the first mask area and the second mask area to obtain a target mask frame as shown in (d) of Figure 6 , where the target mask frame includes the target mask area (i.e., The gray area consisting of pixels marked "1").

By analogy, each original mask frame is processed until the target mask sample sequence is obtained. The target mask sample sequence can be expressed as {m _dst =m _da ^m _db } (t=1,2,...,K). Among them, m _dst table represents the t-th target mask frame, m _da represents the mask frame including the first mask area, a represents the number of first pixels, m _db represents the mask frame including the second mask area, and b represents the second pixel number, "^" represents the XOR operator.

In practical applications, the number of first pixels can be 7, and the number of second pixels can be 9. Therefore, the target mask sample sequence can be expressed as {m _dst =m _d7 ^m _d9 }(t=1,2, …,K). It should be noted that the number of first pixels and the number of second pixels can also be adjusted according to the situation, and are not limited here.

Thirdly, the embodiment of the present application provides a way to expand the original mask area. Through the above method, the optical flow inside the original mask area is obtained from the peripheral optical flow. If the peripheral optical flow is relatively chaotic, the optical flow inside the original mask area cannot be well filled. Considering that there may be some noise in the pixels close to the original mask area, the target mask area obtained by deviating from the original mask area has less noise, which is beneficial to improving the judgment effect of optical flow quality.

For each video frame pair in the video sample sequence, obtain the first original mask frame corresponding to the video frame pair based on the previous video frame in the video frame pair, and obtain the first original mask frame corresponding to the video frame pair based on the next video frame in the video frame pair. , obtain the second original mask frame corresponding to the video frame pair, wherein the first original mask frame and the second original mask frame respectively include masking the target object in the previous video frame and the next video frame. The original mask area obtained after;

For each video frame pair in the video sample sequence, perform a union process on the first original mask frame and the second original mask frame corresponding to the video frame pair to obtain the original mask frame corresponding to the video frame pair;

In one or more embodiments, a method of generating a target mask frame based on multiple video frames is introduced. It can be known from the foregoing embodiments that the target object is displayed in the video to be repaired. For this purpose, the target object needs to be masked to obtain the corresponding original mask area. Then the original mask area is expanded according to a certain number of pixels to obtain the target mask area.

It should be noted that the target object can be a logo, subtitle, object, etc. It can be understood that the method of identifying the target object includes but is not limited to manual annotation and model recognition. For example, FCN is used to identify the target object.

For example, one processing method is that the video to be repaired is x={x _t }(t=1,2,...,T). Each original video frame in the video to be repaired can be masked, thereby obtaining m={m _t } (t=1,2,...,T). Assuming that a group of adjacent original video frames is extracted every 10 frames, the extracted video sample sequence is expressed as x _s = {(x ₁ , x ₂ ), (x ₁₁ , x ₁₂ ),...}, thus, we get The corresponding mask frame sequence is expressed as m _s ={(m ₁ ,m ₂ ), (m ₁₁ ,m ₁₂ ),…}. Then m _s is normalized, and the obtained original mask frame sequence is expressed as m _sr = {(m _r1 ,m _r2 ), (m _r11 ,m _r12 ),…}. The original mask frame sequence includes K first original mask frames (ie, {m _r1 , m _r11 ,...}) and K second original mask frames (ie, m _sr ={m _r2 , m _r12 ,...}). }); In the first original mask frame, m _r1 corresponds to (x ₁ , x ₂ ), m _r11 corresponds to (x ₁₁ , x ₁₂ ), and so on; in the second original mask frame, m _r2 corresponds to (x ₁ ,x ₂ ), m _r12 corresponds to (x ₁₁ ,x ₁₂ ), and so on.

For example, one processing method is that the video to be repaired is x={x _t }(t=1,2,...,T). Assuming that a group of adjacent original video frames is extracted every 10 frames, the extracted sequence is expressed as x _s ={(x ₁ ,x ₂ ), (x ₁₁ ,x ₁₂ ),…}. Each original video frame in x _s is normalized, and the video sample sequence is obtained as x _sr = {(x _r1 ,x _r2 ), (x _r11 ,x _r12 ),…}. Then mask processing is performed on x _sr , and the obtained original mask frame sequence is expressed as m _sr = {(m _r1 ,m _r2 ), (m _r11 ,m _r12 ),…}. Wherein, the original mask frame sequence includes K first original mask frames (ie, {m _r1 , m _r11 ,...}) and K second original mask frames (ie, m _sr ={m _r2 , m _r12 ,...}); In the first original mask frame, m _r1 corresponds to (x _r1 ,x _r2 ), m _r11 corresponds to (x _r11 ,x _r12 ), and so on; in the second original mask frame , m _r2 corresponds to (x _r1 ,x _r2 ), m _r12 corresponds to (x _r11 ,x _r12 ), and so on.

Specifically, for ease of understanding, please refer to Figure 7. Figure 7 is another schematic diagram of generating a target mask frame in an embodiment of the present application. Figure 7(a) illustrates the first original mask frame, where, The 13 pixel points marked "1" constitute the original mask area of the first original mask frame. (b) of FIG. 7 illustrates the second original mask frame, in which the 13 pixel points marked “1” constitute the original mask area of the second original mask frame. After the union processing of the first original mask frame and the second original mask frame, the original mask frame is obtained as shown in (c) of Figure 7, in which 15 pixels marked as "1" are formed. The original mask area of this original mask frame. Assume that the original mask area is expanded according to the number of 2 pixels to obtain the target mask area (ie, the gray area composed of pixels marked "1"). Based on this, the target mask frame shown in (d) of Figure 7 is obtained.

Secondly, embodiments of the present application provide a method of generating target mask frames based on multiple video frames. This method takes into account that the original mask areas of the two video frames before and after the video frame pair may be different. Therefore, the original mask areas of the two frames before and after are first combined to obtain a more accurate original mask area. As a result, the processing effect of video frames is improved.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, in another optional embodiment provided by the embodiment of the present application, for each video frame pair in the video sample sequence, the original mask corresponding to the video pair is The original mask area in the film frame is expanded to obtain the target mask frame corresponding to the video pair, which may include:

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video pair according to the number of first pixels to obtain the first mask area corresponding to the video pair;

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video pair according to the number of second pixels to obtain the second mask area corresponding to the video pair, where , the number of second pixels is greater than the number of first pixels;

For each video frame pair in the video sample sequence, an XOR operation is performed on the first mask area and the second mask area corresponding to the video pair to obtain the target mask frame corresponding to the video pair.

Specifically, for ease of understanding, please refer to Figure 8. Figure 8 is another schematic diagram of generating a target mask frame in an embodiment of the present application. Figure 8 (a) illustrates the first original mask frame, wherein, The 13 pixel points marked "1" constitute the original mask area of the first original mask frame. (b) in FIG. 8 illustrates the second original mask frame, in which the 13 pixel points marked “1” constitute the original mask area of the second original mask frame. After performing a union process on the first original mask frame and the second original mask frame, an original mask frame is obtained as shown in (c) of Figure 8, in which 15 pixels marked as "1" are formed. The original mask area of this original mask frame. Assume that the original mask area is expanded according to the first number of pixels (for example, 2 pixels) to obtain the first mask area (that is, a gray area composed of pixels marked "1"), that is, we get The mask frame shown in (d) in Figure 8. Assume that the original mask area is expanded according to the second number of pixels (for example, 4 pixels) to obtain the second mask area (that is, the gray area composed of pixels marked "1"), that is, we get The mask frame shown in (e) of Figure 8. Based on this, an XOR operation is performed on the first mask area and the second mask area to obtain the target mask frame as shown in (f) of Figure 8, where the target mask frame includes the target mask area (i.e. , the gray area consisting of pixels marked "1").

By analogy, each original mask frame is processed until the target mask sample sequence is obtained. The target mask sample sequence can be expressed as {m _dst =m _da ^m _db } (t=1,2,...,K). Among them, m _dst represents the t-th target mask frame, m _da represents the mask frame including the first mask area, a represents the number of first pixels, m _db represents the mask frame including the second mask area, b Indicates the number of second pixels, and "^" indicates the XOR operator.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, in another optional embodiment provided by the embodiment of the present application, obtaining the optical flow data sequence according to the video sample sequence may specifically include:

For each video frame pair in the video sample sequence, according to the horizontal offset and vertical offset of each pixel in the next video frame in the video pair relative to each pixel in the previous video frame, Determine the optical flow data corresponding to the video pair;

The optical flow data corresponding to each of the K video pairs is used as an optical flow data sequence;

or,

Obtaining the optical flow data sequence based on the video sample sequence may include:

For each video frame pair in the video sample sequence, determine the video based on the horizontal offset and vertical offset of each pixel in the previous video frame relative to each pixel in the next video frame. To the corresponding optical flow data;

The optical flow data corresponding to each of the K video pairs is used as an optical flow data sequence.

In one or more embodiments, two ways of determining optical flow data based on video frame pairs are introduced. As can be seen from the foregoing embodiments, the video sample sequence includes K video frame pairs, and each video frame pair includes two video frames. If the video frame has been size normalized, the video sample sequence can be expressed as x _sr = {(x _r1 ,x _r2 ), (x _r11 ,x _r12 ),…}. Assuming that the size of the video frame is 512×288, then the optical flow data F _t can represent an optical flow matrix with a channel number of 2 and a size of 512×288. The optical flow data sequence is expressed as {Fl _t } (t=1,...,K). Therefore, the two-dimensional optical flow value (w′, h′) corresponding to each pixel can be determined by combining the optical flow data, where w ^′ represents the horizontal offset of the pixel and h′ represents the vertical offset of the pixel. quantity.

The following will take a pixel as an example and illustrate how to determine optical flow data.

1. Determine optical flow data based on forward optical flow;

Specifically, if forward optical flow is used, the video needs to be determined based on the horizontal offset and vertical offset of each pixel in the next video frame relative to each pixel in the previous video frame. The optical flow data corresponding to the frame pair. For ease of understanding, please refer to Figure 9. Figure 9 is a schematic diagram of determining a two-dimensional optical flow value based on forward optical flow in an embodiment of the present application. In the previous video frame, the pixel point coordinates are (3,4). In the next video frame, the pixel coordinates are (4,5). The horizontal offset of this pixel from the previous video frame to the next video frame is 1 (ie, 4-3), and the vertical offset is 1 (ie, 5-4). It can be seen that the two-dimensional The optical flow value is (1,1).

2. Determine optical flow data based on backward optical flow;

Specifically, if backward optical flow is used, the video frame needs to be determined based on the horizontal offset and vertical offset of each pixel in the previous video frame relative to each pixel in the next video frame. The optical flow data corresponding to the frame pair. For ease of understanding, please refer to Figure 10. Figure 10 is a schematic diagram of determining a two-dimensional optical flow value based on backward optical flow in an embodiment of the present application. In the previous video frame, the pixel point coordinates are (1,3). In the next video frame, the pixel coordinates are (4,5). The horizontal offset of this pixel from the next video frame to the previous video frame is -4 (i.e., 1-4), and the vertical offset is -2 (i.e., 3-5). It can be seen that this pixel The two-dimensional optical flow value is (-4,-2).

Secondly, the embodiments of this application provide two ways of determining optical flow data based on video frame pairs. Through the above method, it is supported to generate optical flow data based on forward optical flow or backward optical flow, thus improving the flexibility of the solution.

Optionally, on the basis of the above-mentioned embodiment corresponding to Figure 4, another embodiment of the present application provides In an optional embodiment, based on each optical flow data in the optical flow data sequence, the pixels included in the target mask area in each target mask frame are clustered to obtain the optical flow of each target mask frame. Clustering results may include:

For each target mask frame, determine the two-dimensional optical flow values of X pixels in the target mask area in the target mask frame based on the optical flow data corresponding to the target mask frame in the optical flow data sequence, where, The optical flow data corresponding to the target mask frame and the target mask frame correspond to the same video frame pair, and X is an integer greater than 1;

For each target mask frame, cluster the X pixels according to the two-dimensional optical flow values of the X pixels in the target mask area to obtain the optical flow clustering result of the target mask frame.

In one or more embodiments, a method of clustering pixels in a target mask area is introduced. As can be seen from the foregoing embodiments, the target mask sample sequence includes K target mask frames, and it is necessary to perform optical flow clustering on the pixels in the target mask area in each target mask frame. It is understandable that in actual situations, the number of pixels included in the target mask area may be large. Therefore, the pixels in the target mask area may also be randomly sampled in advance to obtain X pixels. Among them, X is an integer greater than 1. For example, X can be set to 15000.

Specifically, the target mask sample sequence is {m _dst }(t=1,2,...,K), and the optical flow data sequence is {Fl _t }(t=1,...,K). Based on this, it can be calculated Among them, "*" means element multiplication, whereby the target mask frame can be retained The two-dimensional optical flow value corresponding to the pixel marked "1" inside, while the rest is set to 0. Therefore, the DBSCAN algorithm can be used to map the target mask frame X pixels within are clustered, and the clustering is based on the two-dimensional optical flow value of each pixel, so as to obtain the optical flow clustering result of the target mask frame.

It should be noted that the optical flow clustering result of each target mask frame includes the category label corresponding to each pixel after clustering. Among them, pixels with a category label of "0" belong to noise pixels and need to be eliminated. After elimination, the total number of categories corresponding to the target mask frame is obtained. Taking the t-th target mask frame as an example, the corresponding total number of categories can be expressed as C _t , that is, there are C _t clusters. A cluster may include N _ct pixels.

Secondly, the embodiment of the present application provides a way to cluster pixels in the target mask area. Through the above method, the DBSCAN algorithm can be used to cluster pixels. On the one hand, adaptive clustering can be achieved without setting the number of categories in advance. On the other hand, the DBSCAN algorithm can better judge outliers and can find clusters of arbitrary shapes.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, in another optional embodiment provided by the embodiment of the present application, the optical flow quality score is determined according to the optical flow clustering result of each target mask frame. , specifically can include:

Based on the optical flow clustering results of each target mask frame, determine the total number of categories for each target mask frame;

Count the number of target mask frames in which the total number of categories is less than or equal to the category number threshold;

According to the ratio between the number of target mask frames and the K value, determine the single proportion of the category;

If the single proportion of a category is greater than the proportion threshold, the optical flow quality score is determined to be the first score;

If the category single ratio is less than or equal to the ratio threshold, the optical flow quality score is determined to be the second score.

In one or more embodiments, a method of determining the optical flow quality score based on a category single ratio (clean rate, CR) is provided. As can be seen from the foregoing embodiments, the optical flow clustering result of each target mask frame includes the category label corresponding to each pixel after clustering. By eliminating pixels with a category label of "0", the total number of categories corresponding to the target mask frame can be obtained.

Specifically, for the optical flow clustering results For example, the following method can be used to calculate the single proportion of a category:

Among them, CR represents the category single ratio. t represents the frame number of the target mask frame, and K represents the total number of target mask frames. c represents the category label, and C _t represents the total number of categories in the t-th target mask frame. i represents the pixel number, and N _ct represents the number of pixels corresponding to the c-th category label in the t-th target mask frame. Represents an indicator function. If the input is 1, it returns 1, otherwise it returns 0.

Based on this, the proportion of frames among the K target mask frames in which the total number of categories is less than or equal to the category number threshold (for example, 1) can be calculated, that is, a single category ratio is obtained.

Combined with the single proportion of categories, the criterion for determining optical flow quality can be defined as:

Among them, Q represents the optical flow quality score. CR represents category single ratio. CR _threshold represents a proportional threshold. For example, the proportional threshold can be set to 0.8, or other reasonable values, which are not limited here.

Secondly, the embodiment of the present application provides a way to determine the optical flow quality score based on a single proportion of categories. The above method takes into account that the larger the proportion of a single category, the smaller the total number of categories and the more stable the video optical flow is. Therefore, a single ratio of categories is used to filter out videos with disturbed optical flow, which is used as a basis for judging optical flow quality, thus improving the feasibility and operability of the solution.

For the optical flow clustering result of each target mask frame, the moving average of each cluster is determined based on the two-dimensional optical flow value of each pixel in each cluster, where the optical flow clustering result Used to characterize one or more clusters;

For the optical flow clustering result of each target mask frame, determine the moving average of the target mask frame based on the moving average of each cluster;

The moving average of each target mask frame is accumulated to obtain the total moving distance;

If the total distance moved is greater than or equal to the distance threshold, the optical flow quality score is determined to be the first score;

If the total distance moved is less than the distance threshold, the optical flow quality score is determined to be the second score.

In one or more embodiments, a method for determining the optical flow quality score based on the total distance moved is introduced. The way. As can be seen from the foregoing embodiments, the optical flow clustering result of each target mask frame includes the category label corresponding to each pixel after clustering. By eliminating pixels with a category label of "0", the total number of categories corresponding to the target mask frame can be obtained.

Specifically, for the optical flow clustering results Specifically, the following method can be used to calculate the total moving distance accumulated by K target mask frames:

Among them, D represents the total distance moved. D _t represents the moving average of the t-th target mask frame. t represents the frame number of the target mask frame, and K represents the total number of target mask frames.

The moving average of the target mask frame can be calculated as follows:

Among them, D _t represents the moving average of the t-th target mask frame. D _tc represents the moving average of the c-th cluster in the t-th target mask frame. c represents the category label, and C _t represents the total number of categories in the t-th target mask frame.

The moving average of a cluster can be calculated as follows:

in, Represents the moving average of the c-th cluster in the t-th target mask frame. Represents the two-dimensional optical flow value of the i-th pixel in the c-th cluster in the t-th target mask frame. i represents the pixel number, and N _ct represents the number of pixels corresponding to the c-th category label in the t-th target mask frame. ||.|| represents the Euclidean distance.

Based on this, the total moving distance of K target mask frames can be calculated. Combined with the total distance moved, the criterion for determining optical flow quality can be defined as:

Among them, Q represents the optical flow quality score. D represents the total distance moved. D _threshold represents a distance threshold. For example, the distance threshold can be set to 4, or other reasonable values, which are not limited here.

Secondly, the embodiment of the present application provides a way to determine the optical flow quality score based on the total distance moved. The above method takes into account that the larger the total distance moved, the more obvious the frame motion is, which is beneficial to optical flow estimation. Therefore, the total moving distance is used to filter out relatively stationary videos, which is used as a basis for determining the optical flow quality, thereby improving the feasibility and operability of the solution.

For the optical flow clustering result of each target mask frame, determine the moving average of the target mask frame based on the moving average of each cluster cluster;

If the single proportion of a category is greater than the proportion threshold, and the total distance moved is greater than or equal to the distance threshold, then the optical flow quality score is determined to be the first score;

If the single proportion of a category is less than or equal to the proportion threshold, and the total distance moved is less than the distance threshold, then the optical flow quality score is determined to be the second score.

In one or more embodiments, a method of jointly determining the optical flow quality score based on a single proportion of a category and the total distance moved is introduced. As can be seen from the foregoing embodiments, on the one hand, the proportion of frames in which the total number of categories is less than or equal to the category number threshold (for example, 1) among the K target mask frames can be counted, that is, a single category proportion is obtained. On the other hand, the total moving distance of K target mask frames can be calculated. It can be understood that the method for determining the single proportion of a category and the total distance of movement may refer to the foregoing embodiments, and will not be described in detail here.

Specifically, combining the single proportion of categories and the total distance moved, the criterion for determining optical flow quality can be defined as:

Among them, Q represents the optical flow quality score. D represents the total distance moved. D _threshold represents a distance threshold. For example, the distance threshold can be set to 4, or other reasonable values, which are not limited here. CR represents category single ratio. CR _threshold represents a proportional threshold. For example, the proportional threshold can be set to 0.8, or other reasonable values, which are not limited here.

Secondly, the embodiment of the present application provides a way to jointly determine the optical flow quality score based on a single proportion of categories and the total distance moved. Through the above method, on the one hand, the single ratio of the category can be used to filter out the video whose optical flow is disturbed. On the other hand, the total distance of movement can be used to filter out the relatively static video. Therefore, the combination of the two can reflect the optical flow quality more comprehensively and accurately, thus improving the reliability of the optical flow quality score.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, in another optional embodiment provided by the embodiment of the present application, a video repair method matching the optical flow quality score is used to repair the video to be repaired, Specifics may include:

If the optical flow quality score is the first score, the optical flow method is used to repair the video to be repaired.

If the optical flow quality score is the second score, the neural network is called to repair the video to be repaired.

In one or more embodiments, a method for video repair based on optical flow quality scores is introduced. As can be seen from the foregoing embodiments, the optical flow quality score can be a first score or a second score. The following uses the first score as "1" and the second score as "0" as an example for introduction.

Specifically, the video repair method can be selected as follows:

Among them, F ₁ (x, m) indicates that the optical flow method is used for video repair processing. F ₂ (x,m) means calling the neural network for video repair processing. Q represents the optical flow quality score.

It should be noted that the goal of this application is to solve the video sequence y={y _t } (t=0,1,2,...,T). The video sequence differs from the video to be repaired only in the original mask area, and makes the video sequence natural and consistent in time and space. Since naturalness and consistency are difficult to define in a formula, when training a neural network, it is hoped that the filled video sequence is close to the real video sequence y _gt . Among them, y _gt represents the true value of the video sequence without the original mask area. Based on this, by constructing algorithm F, the solution of video sequence y can be defined as y=F(x,m).

Secondly, embodiments of this application provide a method for video repair based on optical flow quality scores. Through the above method, before video repair, if the optical flow quality is judged to be good, clear and reliable filling content can be obtained directly by using the optical flow method. If the optical flow is unreliable, the model method is used to fill the content, thereby avoiding erroneous filling caused by inaccurate optical flow estimation and obtaining an overall more stable filling effect.

Optionally, based on the above-mentioned embodiment corresponding to Figure 4, another optional embodiment provided by the embodiment of this application may also include:

Display the video to be repaired and a list of repairable objects, where the list of repairable objects includes at least one repairable object;

In response to the selection operation on the target object, perform the step of obtaining a video sample sequence for the video to be repaired, wherein the target object belongs to at least one repairable object;

Using a video repair method that matches the optical flow quality score, after repairing the video to be repaired, it can also include:

Plays the repaired video in response to a playback operation on the repaired video.

In one or more embodiments, a method of intelligently repairing videos is introduced. As can be seen from the foregoing embodiments, the present application can be applied to various video repair tasks, such as removing logos, removing subtitles, removing objects, etc. If the user wants to use videos from certain platforms, but the videos carry the logo of that platform, which affects the look and feel, they can use a video repair application to remove the logo. Similarly, users can erase subtitles from some videos, or remove certain moving objects from videos. The following will be introduced separately with the illustrations.

Exemplarily, please refer to Figure 11. Figure 11 is a schematic diagram of the effect of removing a flag based on a video repair application in an embodiment of the present application. As shown in the figure, the video to be repaired and a list of repair objects are displayed on the interface provided by the video repair application. , where the list of repairable objects shows that there is at least one repairable object (e.g. logo, subtitles, boats, clouds, etc.). Assume that the user selects the "one-click removal" control corresponding to the "logo", thereby triggering a selection operation on the target object (ie, the logo). Then, in response to the selection operation, the video repair function is called. Based on this, a suitable video repair method is used to repair the video to obtain a repaired video. There is no mark in the repaired video. The repaired video can be played when the user triggers the playback action on the repaired video.

Exemplarily, please refer to Figure 12. Figure 12 is a schematic diagram of the effect of removing subtitles based on a video repair application in an embodiment of the present application. As shown in the figure, the video to be repaired and a list of repair objects are displayed on the interface provided by the video repair application. , wherein the repair object list shows that there is at least one repairable object (for example, a sign, a subtitle, a boat, a cloud, etc.). Assume that the user selects the "one-click removal" control corresponding to "subtitles", thereby triggering a selection operation on the target object (ie, subtitles). Then, in response to the selection operation, the video repair function is called. Based on this, a suitable video repair method is used to repair the video to obtain a repaired video. There are no subtitles in the repaired video. The repaired video can be played when the user triggers the play command for the repaired video.

Exemplarily, please refer to Figure 13. Figure 13 is a schematic diagram of the effect of removing objects based on a video repair application in an embodiment of the present application. As shown in the figure, the video to be repaired and a list of repair objects are displayed on the interface provided by the video repair application. , wherein the repair object list shows that there is at least one repairable object (for example, a sign, a subtitle, a boat, a cloud, etc.). Assume that the user selects the "one-click removal" control corresponding to "boat", thereby triggering a selection operation on the target object (ie, boat). Then, in response to the selection operation, the video repair function is called. Based on this, a suitable video repair method is used to repair the video to obtain a repaired video. There is no object "ship" in the repaired video. The repaired video can be played when the user triggers the play command for the repaired video.

It should be noted that the interface elements, interface arrangement, interface copy, etc. shown in Figures 11, 12 and 13 are all schematic and should not be understood as limitations of this application.

Secondly, embodiments of this application provide a method of intelligently repairing videos. Through the above method, users can use the video repair application to choose to repair one or more objects in the video to achieve the purpose of intelligent repair. This not only improves the practicality of the solution, but also improves the efficiency of video repair.

It can be seen that this application can accurately and efficiently judge the quality of optical flow in video clips, and then select the optical flow method or the model method before calling the video repair method, that is, use a better video repair method to repair, so that the repair The effect is better than the two used alone. The following will introduce the effect of video frame restoration based on optical flow method and model method with examples. Please refer to Figure 14. Figure 14 is a schematic diagram comparing the effects of video frame repair based on the optical flow method and the model method in the embodiment of the present application. As shown in the figure, in one example, (a) in Figure 14 shows the effect based on the optical flow method and the model method. The effect of optical flow filling. (b) in Figure 14 shows the effect of filling based on the model method. Among them, the original mask area is located in the lower left corner of the video frame (ie, the area enclosed by the rectangular frame). In the example, the lens movement is smooth and the optical flow estimation is good. Therefore, this application chooses to use the optical flow method for filling. In another example, (c) in Figure 14 shows the effect of filling based on the optical flow method, and (d) in Figure 14 shows the effect of filling based on the model method. Among them, the original mask area is located in the lower left corner of the video frame (i.e., the area enclosed by the rectangular frame domain), example: Since the optical flow is affected by the character's watch, this application chooses to use the model method for filling.

The video repair device in the present application will be described in detail below. Please refer to Figure 15. Figure 15 is a schematic diagram of a video repair device in the embodiment of the present application. The video repair device 20 includes:

The acquisition module 210 is used to acquire a video sample sequence corresponding to the video to be repaired, where the video sample sequence includes K video frame pairs, each video frame pair includes two adjacent video frames, and K is an integer greater than or equal to 1. ;

The acquisition module 210 is also used to obtain a target mask sample sequence according to the video sample sequence, wherein the target mask sample sequence includes K target mask frames, and each target mask frame includes a target mask obtained by expanding the original mask area. The target mask area, and there is a one-to-one correspondence between K target mask frames and K video frame pairs;

The acquisition module 210 is also used to obtain an optical flow data sequence according to the video sample sequence, where the optical flow data sequence includes K optical flow data, and there is a one-to-one correspondence between the K optical flow data and the K video frame pairs. ;

The processing module 220 is configured to perform clustering processing on the pixels included in the target mask area in each target mask frame based on each optical flow data in the optical flow data sequence, and obtain the optical flow cluster of each target mask frame. class result;

The determination module 230 is used to determine the optical flow quality score according to the optical flow clustering result of each target mask frame;

The repair module 240 is used to repair the video to be repaired using a video repair method that matches the optical flow quality score.

Optionally, based on the above embodiment corresponding to Figure 15, in another embodiment of the video repair device 20 provided by the embodiment of the present application,

The acquisition module 210 is specifically used to acquire a video sequence from the video to be repaired, where the video sequence includes T original video frames, each original video frame displays a target object, and T is an integer greater than 1;

The acquisition module 210 is specifically configured to obtain, for each video frame pair in the video sample sequence, an original mask frame corresponding to the video frame pair according to any video frame in the video frame pair, where the original mask frame includes a pair of The original mask area obtained after masking the target object in any video frame;

For each video frame pair in the video sample sequence, the original mask frame corresponding to the video frame pair The original mask area in is expanded to obtain the target mask frame corresponding to the video frame;

The acquisition module 210 is specifically configured to, for each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair according to the first number of pixels to obtain the video frame pair. The corresponding first mask area;

The acquisition module 210 is specifically configured to, for each video frame pair in the video sample sequence, obtain the first original mask frame corresponding to the video frame pair according to the previous video frame in the video frame pair, and obtain the first original mask frame corresponding to the video frame pair according to the video frame pair. Obtain the second original mask frame corresponding to the video frame pair in the latter video frame, where the first original mask frame and the second original mask frame respectively include the target object in the previous video frame and the next video frame. The original mask area obtained after mask processing;

The acquisition module 210 is specifically used for each video frame pair in the video sample sequence, according to the The horizontal offset and vertical offset of each pixel in the next video frame relative to each pixel in the previous video frame determines the optical flow data corresponding to the video frame pair;

The optical flow data corresponding to each of the K video frame pairs is used as an optical flow data sequence;

or,

The acquisition module 210 is specifically used for each video frame pair in the video sample sequence, according to the horizontal offset and vertical offset of each pixel point in the previous video frame relative to each pixel point in the next video frame in the video frame pair. Straight offset to determine the optical flow data corresponding to the video frame pair;

The optical flow data corresponding to each of the K video frame pairs is used as an optical flow data sequence.

The processing module 220 is specifically configured to determine, for each target mask frame, the two pixels of X pixels in the target mask area in the target mask frame according to the optical flow data corresponding to the target mask frame in the optical flow data sequence. dimensional optical flow value, where the optical flow data corresponding to the target mask frame and the target mask frame correspond to the same video frame pair, and X is an integer greater than 1;

The determination module 230 is specifically configured to determine the total number of categories of each target mask frame based on the optical flow clustering results of each target mask frame;

The determination module 230 is specifically configured to determine the moving average of each cluster cluster according to the two-dimensional optical flow value of each pixel in each cluster cluster based on the optical flow clustering result of each target mask frame. , where the optical flow clustering results are used to characterize one or more clusters;

The determination module 230 is specifically configured to determine each target mask frame according to the optical flow clustering result. The total number of categories of target mask frames;

If the single proportion of a category is less than or equal to the proportion threshold, or the total distance moved is less than the distance threshold, then the optical flow quality score is determined to be the second score.

The repair module 240 is specifically configured to use the optical flow method to repair the video to be repaired if the optical flow quality score is the first score.

Optionally, based on the above embodiment corresponding to Figure 15, in another embodiment of the video repair device 20 provided by the embodiment of the present application, the video repair device 20 further includes a display module 250;

The display module 250 is used to display the video to be repaired and a repair object list, where the repair object list includes at least one repairable object;

The acquisition module 210 is also configured to respond to the selection operation on the target object, and execute the step of acquiring the video sample sequence for the video to be repaired, wherein the target object belongs to at least one repairable object;

The display module 250 is also configured to use a video repair method that matches the optical flow quality score. After repairing the video to be repaired, respond to the playback operation of the repaired video and play the repaired video.

The embodiment of the present application also provides a terminal, as shown in Figure 16. For convenience of explanation, only the parts related to the embodiment of the present application are shown. If the specific technical details are not disclosed, please refer to the method part of the embodiment of the present application. In the embodiment of this application, the terminal is a mobile phone as an example for explanation:

Figure 16 shows a block diagram of a partial structure of a mobile phone related to the terminal provided by the embodiment of the present application. Referring to Figure 16, the mobile phone includes: a radio frequency (RF) circuit 310, a memory 320, an input unit 330 (which includes a touch panel 331 and other input devices 332), a display unit 340 (which includes a display panel 341), and a sensor 350 , audio circuit 360 (which is connected to a speaker 361 and a microphone 362), a wireless fidelity (wireless fidelity, WiFi) module 370, a processor 380, As well as power supply 390 and other components. Those skilled in the art can understand that the structure of the mobile phone shown in FIG. 16 does not limit the mobile phone, and may include more or fewer components than shown in the figure, or combine certain components, or arrange different components.

The memory 320 can be used to store software programs and modules, and the processor 380 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 320 . The memory 320 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the storage data area may store a program based on Data created by the use of mobile phones (such as audio data, phone books, etc.), etc. In addition, memory 320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

Among them, the processor 380 is the control center of the mobile phone, using various interfaces and lines to connect various parts of the entire mobile phone, by running or executing software programs and/or modules stored in the memory 320, and calling data stored in the memory 320. , perform various functions of the phone and process data. Optionally, the processor 380 may include one or more processing units; optionally, the processor 380 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system, user interface and application programs. etc., the modem processor mainly handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 380 .

The steps performed by the terminal in the above embodiment may be based on the terminal structure shown in FIG. 16 .

Figure 17 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 400 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPUs) 422 (for example, , one or more processors) and memory 432, one or more storage media 430 (eg, one or more mass storage devices) that stores applications 442 or data 444. Among them, the memory 432 and the storage medium 430 may be short-term storage or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processor 422 may be configured to communicate with the storage medium 430 and execute a series of instruction operations in the storage medium 430 on the server 400 .

Server 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input and output interfaces 458, and/or, one or more operating systems 441, such as Windows Server ^™ , Mac OS X ^TM , Unix ^TM , Linux ^TM , FreeBSD ^TM and more.

The steps performed by the server in the above embodiment may be based on the server structure shown in FIG. 17 .

An embodiment of the present application also provides a computer device, including a memory and a processor. The memory stores a computer program. When the processor executes the computer program, it implements the steps of the methods described in the foregoing embodiments.

Embodiments of the present application also provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the method described in each of the foregoing embodiments are implemented.

The embodiments of the present application also provide a computer program product, which includes a computer program. When the computer program is executed by a processor, the steps of the method described in each of the foregoing embodiments are implemented.

It can be understood that in the specific implementation of this application, user information and other related data are involved. When the above embodiments of this application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection of relevant data , use and processing need to comply with relevant laws, regulations and standards of relevant countries and regions.

As mentioned above, the above embodiments are only used to illustrate the technical solution of the present application, but not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still make the foregoing technical solutions. The technical solutions described in each embodiment may be modified, or some of the technical features may be equivalently replaced; however, these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions in each embodiment of the present application.

Claims

A method of video repair, performed by computer equipment, including:

Obtain a video sample sequence corresponding to the video to be repaired, wherein the video sample sequence includes K video frame pairs, each video frame pair includes two adjacent video frames, and the K is an integer greater than or equal to 1;

Obtain a target mask sample sequence according to the video sample sequence, wherein the target mask sample sequence includes K target mask frames, and each target mask frame includes a target mask obtained by expanding the original mask area. area, and there is a one-to-one correspondence between the K target mask frames and the K video frame pairs;

Obtain an optical flow data sequence according to the video sample sequence, wherein the optical flow data sequence includes K optical flow data, and there is a one-to-one correspondence between the K optical flow data and the K video frame pairs. relation;

Based on each optical flow data in the optical flow data sequence, cluster the pixels included in the target mask area in each target mask frame to obtain the optical flow cluster of each target mask frame. class result;

Determine the optical flow quality score according to the optical flow clustering result of each target mask frame;

The video to be repaired is repaired using a video repair method that matches the optical flow quality score.
The method according to claim 1, obtaining the video sample sequence corresponding to the video to be repaired includes:

Obtain a video sequence from the video to be repaired, wherein the video sequence includes T original video frames, each original video frame displays a target object, and T is an integer greater than 1;

Extract K video frame pairs to be processed from the video sequence, where each video frame pair to be processed includes two adjacent original video frames;

The size of each original video frame in the K video frame pairs to be processed is respectively normalized to obtain the K video frame pairs, and the K video frame pairs are used as the video sample sequence.
The method according to claim 1, said obtaining a target mask sample sequence according to the video sample sequence, including:

For each video frame pair in the video sample sequence, obtain the original mask frame corresponding to the video frame pair according to any video frame in the video frame pair, wherein the original mask frame includes a pair of The original mask area obtained after masking the target object in any of the video frames;

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair to obtain the target mask frame corresponding to the video frame pair;

The target mask frames corresponding to each of the K video frame pairs are used as the target mask sample sequence.
The method according to claim 3, wherein for each video frame pair in the video sample sequence, the original mask area in the original mask frame corresponding to the video frame pair is expanded to obtain the video The target mask frame corresponding to the frame pair includes:

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair according to the first number of pixels to obtain the first corresponding video frame pair. a mask area;

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair according to the second number of pixels to obtain the third corresponding video frame pair. Two mask areas, wherein the second number of pixels is greater than the first number of pixels;

For each video frame pair in the video sample sequence, perform an XOR operation on the first mask area and the second mask area corresponding to the video frame pair to obtain the target mask corresponding to the video frame pair. Membrane frame.
The method according to claim 1, said obtaining a target mask sample sequence according to the video sample sequence, including:

For each video frame pair in the video sample sequence, obtain the first original mask frame corresponding to the video frame pair based on the previous video frame in the video frame pair, and center the video frame pair based on the video frame pair. of the latter video frame, obtain the second original mask frame corresponding to the video frame pair, wherein the first original mask frame and the second original mask frame respectively include the pair of the previous video frame and The original mask area obtained after masking the target object in the latter video frame;

For each video frame pair in the video sample sequence, perform union processing on the first original mask frame and the second original mask frame corresponding to the video frame pair to obtain the corresponding original mask frame;

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair to obtain the target mask frame corresponding to the video frame pair;

The target mask frames corresponding to each of the K video frame pairs are used as the target mask sample sequence.
The method according to claim 5, wherein for each video frame pair in the video sample sequence, the original mask area in the original mask frame corresponding to the video frame pair is expanded to obtain the video The target mask frame corresponding to the frame pair includes:

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair according to the first number of pixels to obtain the first corresponding video frame pair. a mask area;

For each video frame pair in the video sample sequence, expand the original mask area in the original mask frame corresponding to the video frame pair according to the second number of pixels to obtain the third corresponding video frame pair. Two mask areas, wherein the second number of pixels is greater than the first number of pixels;

For each video frame pair in the video sample sequence, perform an XOR operation on the first mask area and the second mask area corresponding to the video frame pair to obtain the target mask corresponding to the video frame pair. Membrane frame.
The method according to claim 1, obtaining an optical flow data sequence according to the video sample sequence includes:

For each video frame pair in the video sample sequence, according to the latter of the video frame pairs The horizontal offset and vertical offset of each pixel in the video frame relative to each pixel in the previous video frame determines the optical flow data corresponding to the video frame pair;

Use the optical flow data corresponding to each of the K video frame pairs as the optical flow data sequence;

or,

Obtaining the optical flow data sequence according to the video sample sequence includes:

For each video frame pair in the video sample sequence, according to the horizontal offset and vertical offset of each pixel point in the previous video frame relative to each pixel point in the next video frame in the video frame pair , determine the optical flow data corresponding to the video frame pair;

The optical flow data corresponding to each of the K video frame pairs is used as the optical flow data sequence.
The method according to claim 1, performing clustering processing on the pixels included in the target mask area in each target mask frame based on each optical flow data in the optical flow data sequence to obtain the The optical flow clustering results of each target mask frame are described, including:

For each target mask frame, determine the two-dimensional image of X pixels in the target mask area in the target mask frame based on the optical flow data corresponding to the target mask frame in the optical flow data sequence. Optical flow value, wherein the optical flow data corresponding to the target mask frame and the target mask frame correspond to the same video frame pair, and the X is an integer greater than 1;

For each target mask frame, cluster the X pixels according to the two-dimensional optical flow values of X pixels in the target mask area to obtain the target mask frame. Optical flow clustering results.
The method according to claim 1, wherein determining the optical flow quality score based on the optical flow clustering results of each target mask frame includes:

Determine the total number of categories for each target mask frame according to the optical flow clustering results of each target mask frame;

Count the number of target mask frames in which the total number of categories is less than or equal to the category number threshold;

According to the ratio between the number of target mask frames and the K value, determine the single proportion of the category;

If the single proportion of the category is greater than the proportion threshold, the optical flow quality score is determined to be the first score;

If the single proportion of the category is less than or equal to the proportion threshold, the optical flow quality score is determined to be the second score.
The method according to claim 1, wherein determining the optical flow quality score based on the optical flow clustering results of each target mask frame includes:

For the optical flow clustering result of each target mask frame, the moving average of each cluster cluster is determined according to the two-dimensional optical flow value of each pixel in each cluster cluster, where, The above optical flow clustering results are used to characterize one or more clusters;

For the optical flow clustering result of each target mask frame, determine the moving average of the target mask frame according to the moving average of each cluster cluster;

Accumulate the moving average of each target mask frame to obtain the total moving distance;

If the total moving distance is greater than or equal to the distance threshold, then determine the optical flow quality score as the first score;

If the total moving distance is less than the distance threshold, the optical flow quality score is determined to be the second score.
The method according to claim 1, wherein determining the optical flow quality score based on the optical flow clustering results of each target mask frame includes:

Determine the total number of categories for each target mask frame according to the optical flow clustering results of each target mask frame;

Count the number of target mask frames in which the total number of categories is less than or equal to the category number threshold;

According to the ratio between the number of target mask frames and the K value, determine the single proportion of the category;

For the optical flow clustering result of each target mask frame, the moving average of each cluster cluster is determined according to the two-dimensional optical flow value of each pixel in each cluster cluster, where, The above optical flow clustering results are used to characterize one or more clusters;

For the optical flow clustering result of each target mask frame, determine the moving average of the target mask frame according to the moving average of each cluster cluster;

Accumulate the moving average of each target mask frame to obtain the total moving distance;

If the single proportion of the category is greater than the proportion threshold, and the total moving distance is greater than or equal to the distance threshold, then the optical flow quality score is determined to be the first score;

If the single category proportion is less than or equal to the proportion threshold, or the total movement distance is less than the distance threshold, then the optical flow quality score is determined to be the second score.
The method according to any one of claims 9 to 11, using a video repair method that matches the optical flow quality score to perform repair processing on the video to be repaired, including:

If the optical flow quality score is the first score, the optical flow method is used to repair the video to be repaired.

If the optical flow quality score is the second score, a neural network is called to perform repair processing on the video to be repaired.
The method of claim 1, further comprising:

Display the video to be repaired and a repair object list, wherein the repair object list includes at least one repairable object;

In response to a selection operation for a target object, performing the step of obtaining a video sample sequence for the video to be repaired, wherein the target object belongs to the at least one repairable object;

After the video repair method is used to match the optical flow quality score and the video to be repaired is repaired, the method further includes:

In response to a play operation for the repaired video, the repaired video is played.
A video repair device including:

An acquisition module is used to acquire a video sample sequence corresponding to the video to be repaired, wherein the video sample sequence includes K video frame pairs, each video frame pair includes two adjacent video frames, and the K is an integer greater than or equal to 1;

The acquisition module is also configured to acquire a target mask sample sequence according to the video sample sequence, wherein the target mask sample sequence includes K target mask frames, and each target mask frame includes a pair of original mask regions. The target mask area obtained after expansion, and there is a one-to-one correspondence between the K target mask frames and the K video frame pairs;

The acquisition module is also configured to acquire an optical flow data sequence according to the video sample sequence, wherein the optical flow data sequence includes K optical flow data, and the K optical flow data is consistent with the K video There is a one-to-one correspondence between frame pairs;

A processing module configured to cluster the pixels included in the target mask area in each target mask frame based on each optical flow data in the optical flow data sequence to obtain each target mask. Optical flow clustering results of frames;

A determination module, configured to determine the optical flow quality score according to the optical flow clustering result of each target mask frame;

A repair module, configured to use a video repair method that matches the optical flow quality score to repair the video to be repaired.
A computer device includes a memory and a processor. The memory stores a computer program. When the processor executes the computer program, the steps of the method according to any one of claims 1 to 13 are implemented.
A computer-readable storage medium having a computer program stored thereon, which implements the steps of the method according to any one of claims 1 to 13 when executed by a processor.
A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.