CN113344807A - Image restoration method and device, electronic equipment and storage medium - Google Patents

Image restoration method and device, electronic equipment and storage medium

Info

Publication number
CN113344807A
Authority
CN
China
Prior art keywords
conversion
image
feature
pixel block
fusion
Prior art date
Legal status
Withdrawn
Application number
CN202110576846.XA
Other languages
Chinese (zh)
Inventor
刘睿
邓瀚铭
黄洋逸
石晓宇
卢乐炜
孙文秀
王晓刚
代季峰
李鸿升
Current Assignee
Sensetime Group Ltd
Original Assignee
Sensetime Group Ltd
Priority date
Filing date
Publication date
Application filed by Sensetime Group Ltd
Priority to CN202110576846.XA
Publication of CN113344807A
Current legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/77 - Retouching; Inpainting; Scratch removal
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 - Image coding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present application provide an image restoration method and apparatus, an electronic device and a storage medium. A first image is acquired; the first image is segmented into at least two pixel blocks having overlapping regions; dimension conversion is performed on each pixel block to obtain a first conversion feature of each pixel block; at least one fusion conversion process is performed on the first conversion features to obtain repaired second conversion features; and the pixel blocks corresponding to the second conversion features are synthesized to obtain a second image.

Description

Image restoration method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image restoration, and relates to but is not limited to an image restoration method and device, an electronic device and a storage medium.
Background
Video image restoration aims to generate a vivid and natural video by repairing a video in which some pixels are damaged. In the related art, to limit the amount of computation in vision tasks, image features are commonly hard-split into many small pixel blocks and modeling is performed between these small pixel blocks, which makes the restoration effect on the image relatively poor.
Disclosure of Invention
The embodiment of the application provides an image restoration technical scheme.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an image restoration method, which includes: acquiring a first image; segmenting the first image into at least two pixel blocks having overlapping regions; performing dimension conversion on each pixel block to obtain a first conversion feature of each pixel block; performing at least one fusion conversion process on the first conversion features to obtain repaired second conversion features that carry more detailed information; and synthesizing the pixel blocks corresponding to the second conversion features to obtain a second image. In this way, the first image is divided into a plurality of pixel blocks with overlapping regions and these overlapping pixel blocks are later combined into one feature map, so that the second conversion features retain fine-grained information and the repairing effect on the first image can be improved.
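Purely as an illustrative sketch (not the claimed implementation), the steps above can be organized as follows in Python with PyTorch; every sub-module name (encoder, soft_split, embed, blocks, unembed, soft_comp, decoder) is a hypothetical placeholder for the components described in the remainder of this description:

import torch.nn as nn

class ImageRestorationSketch(nn.Module):
    # Hypothetical end-to-end sketch; every sub-module is a placeholder.
    def __init__(self, encoder, soft_split, embed, blocks, unembed, soft_comp, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder              # CNN encoder / deconvolution decoder
        self.soft_split, self.soft_comp = soft_split, soft_comp    # overlapping split / overlapping synthesis
        self.embed, self.unembed = embed, unembed                  # pixel block <-> token conversion
        self.blocks = nn.ModuleList(blocks)                        # fusion conversion processing

    def forward(self, first_image):
        feat = self.encoder(first_image)                # feature extraction
        tokens = self.embed(self.soft_split(feat))      # first conversion features
        for blk in self.blocks:                         # at least one fusion conversion process
            tokens = blk(tokens)                        # -> second conversion features
        feat = self.soft_comp(self.unembed(tokens))     # synthesize overlapping pixel blocks
        return self.decoder(feat)                       # second (repaired) image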
In some embodiments, the segmenting the first image into at least two blocks of pixels having overlapping regions comprises: performing feature extraction on the first image to obtain a first feature map; and dividing the first feature map into the at least two pixel blocks with overlapping areas. Thus, the obtained pixel blocks have an overlapping region, and the characteristics of finer granularity in the image can be focused.
In some embodiments, the dividing the first feature map into the at least two pixel blocks having overlapping regions comprises: determining a segmentation step size and a segmentation size based on the size of the first feature map; wherein the segmentation step size is smaller than the segmentation size; and traversing the first feature map according to the segmentation step length, and segmenting the first feature map into at least two pixel blocks meeting the segmentation size. In this way, the first feature map is divided in a manner that the division step size is smaller than the division size, so that the obtained pixel block can retain the detail information of the first feature map.
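A minimal sketch of such an overlapping segmentation, assuming a PyTorch implementation; torch.nn.functional.unfold extracts overlapping k × k blocks whenever the stride is smaller than the kernel size (the tensor sizes below are illustrative only):

import torch
import torch.nn.functional as F

feat = torch.randn(1, 32, 64, 64)                    # first feature map (batch, channels, h, w)
k, s = 7, 3                                          # segmentation size k is larger than segmentation step s
patches = F.unfold(feat, kernel_size=k, stride=s)    # (1, 32*7*7, n): overlapping k x k pixel blocks
patches = patches.transpose(1, 2)                    # (1, n, 32*7*7): one row per pixel block
print(patches.shape)                                 # n = 400 overlapping pixel blocks for this example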
In some embodiments, the performing the dimension conversion on each pixel block to obtain the first conversion characteristic of each pixel block includes: performing linear conversion on each pixel block to obtain a dimensionality reduction pixel block; and performing linear projection on the dimensionality reduction pixel block by adopting a first preset numerical value to obtain the first conversion characteristic. Therefore, the high-dimensional pixel block is subjected to linear conversion to obtain the first conversion characteristic with low dimensionality, so that the subsequent processing is facilitated, and the calculation amount of the subsequent processing can be reduced.
In some embodiments, the performing at least one fusion conversion process on the first conversion feature to obtain a second conversion feature after repairing the first conversion feature includes: performing the j-th attention fusion process on the first conversion feature to obtain a first fusion feature, where j = 1, 2, …, M and M is the preset number of fusion conversion processes; and performing the j-th feed-forward fusion process on the first fusion feature to obtain the second conversion feature. In this way, by performing the fusion conversion process on the first conversion feature multiple times, the fineness of the obtained second conversion feature can be improved.
In some embodiments, the performing the jth attention fusion process on the first converted feature to obtain a first fused feature includes: regularizing the first conversion feature to obtain a first regularization feature; processing the first regularization feature based on a multi-head attention mechanism to obtain a third conversion feature; and fusing the third conversion characteristic and the first conversion characteristic to obtain the first fused characteristic. In this way, the regularized features are processed based on a multi-head attention mechanism, and the obtained third conversion features and the first conversion features are fused, so that the first fused features can include more key information in the first image.
In some embodiments, the performing a j-th feed-forward fusion process on the first fusion feature to obtain the second conversion feature includes: regularizing the first fusion feature to obtain a second regularization feature; performing feedforward conversion on the second regularization characteristic to obtain a feedforward conversion characteristic; fusing the feedforward conversion characteristic and the first fusion characteristic to obtain a second fusion characteristic; determining the second fused feature as the second transformed feature. In this way, more detailed information can be captured, resulting in a finer granularity of the second transformation feature.
In some embodiments, the performing feed-forward conversion on the second regularization feature to obtain a feed-forward conversion feature includes: converting the second regularization features into pixel blocks to obtain at least two converted pixel blocks; synthesizing the at least two conversion pixel blocks based on the position information of each conversion pixel block in the first image to obtain a second feature map; dividing the second feature map into at least two intermediate pixel blocks having overlapping regions; and carrying out dimension conversion on each middle pixel block to obtain the feedforward conversion characteristic of each middle pixel block. In this way, the attention conversion characteristics obtained after the re-segmentation have richer edge information, and more information of adjacent pixel blocks can be fused.
In some embodiments, the converting the second regularization features into pixel blocks to obtain at least two converted pixel blocks includes: performing linear conversion on the second regularization features to obtain at least two dimension-raised features; and performing linear projection on each dimension-raised feature by using a second preset value to obtain the at least two converted pixel blocks, where the second preset value has a proportional relationship with the first preset value. In this way, the second regularization features are converted into pixel blocks having the same size as the initial pixel blocks by an inverse linear conversion, which facilitates aggregating different second regularization features in the overlapping regions.
In some embodiments, the synthesizing the pixel blocks corresponding to the second conversion feature to obtain the second image includes: performing linear conversion on each second conversion characteristic to obtain a fourth conversion characteristic; reshaping each fourth conversion characteristic into a pixel block to obtain an updated pixel block set; and synthesizing the updating pixel blocks in the updating pixel block set into the second image based on the position information of each updating pixel block in the first image. In this way, by soft-combining the updated pixel blocks having the overlapping regions, the overlapping regions between different pixel blocks can be better utilized, thereby improving the image restoration effect.
In some embodiments, the synthesizing of the updated pixel blocks in the set of updated pixel blocks into the second image based on the position information of each updated pixel block in the first image includes: fusing the updated pixel blocks that overlap in the same overlapping region to obtain a fused pixel block; determining first position information of the fused pixel block and second position information of the non-overlapping pixel blocks based on the position information of each updated pixel block in the first image; splicing the fused pixel block and the non-overlapping pixel blocks based on the first position information and the second position information to obtain a third feature map; and decoding the third feature map to obtain the second image. In this way, by fusing information from adjacent pixel blocks, the boundaries of the pixel blocks are smoothed and the receptive field of each pixel block is enlarged.
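A minimal sketch of this overlap-aware synthesis, again assuming PyTorch; torch.nn.functional.fold sums the contributions of overlapping pixel blocks at each position, and dividing by the per-pixel overlap count fuses the overlapping regions (all sizes illustrative):

import torch
import torch.nn.functional as F

c, h, w, k, s = 32, 64, 64, 7, 3
n = ((h - k) // s + 1) * ((w - k) // s + 1)          # number of updated pixel blocks (400 here)
blocks = torch.randn(1, n, c * k * k)                # updated pixel blocks as flattened vectors
cols = blocks.transpose(1, 2)                        # (1, c*k*k, n)
summed = F.fold(cols, output_size=(h, w), kernel_size=k, stride=s)   # overlapping values are added
count = F.fold(torch.ones_like(cols), output_size=(h, w), kernel_size=k, stride=s)
fused = summed / count                               # fuse overlapping regions by averaging
# "fused" corresponds to the third feature map that is subsequently decoded into the second image.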
An embodiment of the present application provides an image restoration apparatus, the apparatus including:
the first acquisition module is used for acquiring a first image;
a first segmentation module for segmenting the first image into at least two blocks of pixels having overlapping regions;
the first conversion module is used for carrying out dimension conversion on each pixel block to obtain a first conversion characteristic of each pixel block;
the first processing module is used for performing at least one fusion conversion processing on the first conversion characteristic to obtain a second conversion characteristic after the first conversion characteristic is repaired;
and the first synthesis module is used for synthesizing the pixel blocks corresponding to the second conversion characteristics to obtain a second image.
Correspondingly, an embodiment of the present application provides a computer storage medium having computer-executable instructions stored thereon which, when executed, implement the image restoration method described above.
An embodiment of the present application provides an electronic device, which includes a memory and a processor, where computer-executable instructions are stored in the memory, and the image restoration method described above is implemented when the processor runs the computer-executable instructions stored in the memory.
Drawings
Fig. 1A is a schematic view of an application scenario of an image restoration method according to an embodiment of the present application;
fig. 1B is a schematic flowchart of an implementation process of the image restoration method according to the embodiment of the present application;
fig. 2 is a schematic flowchart of another implementation of the image restoration method according to the embodiment of the present application;
fig. 3 is a schematic diagram of an implementation framework of an image restoration method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an implementation framework of soft segmentation and soft synthesis provided in an embodiment of the present application;
fig. 5 is a schematic view of an application scenario of an image restoration method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an image restoration apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, specific technical solutions of the present invention will be described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" are only used to distinguish similar objects and do not denote a particular order or importance; it is to be understood that "first/second/third" may be interchanged in a particular order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Convolutional neural networks (CNNs) are a class of feedforward neural networks (FNNs) that contain convolution computations and have a deep structure; convolutional neural networks have feature learning capability and can perform translation-invariant classification of input information according to their hierarchical structure.
2) The feedforward neural network is the simplest kind of neural network, in which the neurons are arranged in layers. Each neuron is connected only to neurons in the previous layer; each layer receives the output of the previous layer and passes its own output to the next layer.
3) The attention mechanism is an important component of human cognition: when facing massive amounts of information, humans focus on some of it and ignore the rest. When a neural network processes a large amount of input information, it can, by analogy with the attention mechanism of the human brain, select only some key inputs for processing, thereby improving the efficiency of the neural network. In neural network models, max pooling and gating mechanisms can be regarded approximately as bottom-up, saliency-based attention mechanisms.
An exemplary application of the image inpainting device provided by the embodiment of the present application is described below, and the device provided by the embodiment of the present application may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a camera, a mobile device (e.g., a personal digital assistant, a dedicated messaging device, a mobile phone), and the like, and may also be implemented as a server. In the following, an exemplary application will be explained when the device is implemented as a terminal or a server.
The method can be applied to a computer device, and the functions implemented by the method can be realized by a processor in the computer device calling program code; the program code can be stored in a computer storage medium, so the computer device comprises at least the processor and the storage medium.
Video image restoration is an important problem in the field of computer vision; it requires that a machine learn from a large number of natural videos so that a vivid and natural video can be generated by repairing a video whose pixels are partially damaged. Video image restoration has many application scenarios, such as debris removal, foreground removal, and video image generation and reconstruction. In the image splitting process of the related art, the hard split operation used by pixel-block-based transformer models deprives them of feature learning at the sub-block (sub-patch) level. Since attention weights are calculated between different vectors (tokens), there is no direct sub-token-level information interaction. This lack of accurate sub-pixel-block-level feature interaction results in feature inconsistencies between adjacent blocks, whereas a smooth feature transition is critical for generating vivid content.
Based on this, an embodiment of the present application provides an image restoration method that divides a first image into a plurality of pixel blocks having overlapping regions, performs dimension conversion on the pixel blocks to obtain first conversion features, and performs fusion conversion processing on the first conversion features of the overlapping pixel blocks to obtain second conversion features with fine-grained characteristics; by combining the plurality of second conversion features into a feature map, a clearer second image can be generated. As shown in fig. 1A, which is a schematic view of an application scenario of the image restoration method provided in an embodiment of the present application, the image 11 is hard-split into 4 pixel blocks that do not overlap: block (0, 0), block (0, 1), block (1, 0) and block (1, 1), i.e., the pixel block set 12; the four pixel blocks in the pixel block set 12 are then hard-synthesized, resulting in an image 13 whose picture is still not sufficiently sharp. In contrast, using the image restoration method provided in the embodiment of the present application, the image 14 is soft-split into 4 pixel blocks with overlapping regions: block (0, 0), block (0, 1), block (1, 0) and block (1, 1), i.e., the pixel block set 15; the four pixel blocks in the pixel block set 15 are then soft-synthesized, resulting in an image 16 with clearer picture content.
An embodiment of the present application provides an image restoration method, which is described below with reference to the steps shown in fig. 1B:
step S101, a first image is acquired.
The first image may be an image or a video frame containing one or more objects, and may have either a complex or a simple appearance. The first image may be a sharp image or a blurred image that needs to be restored. The first image may be an image acquired by any acquisition device in any scene, or an image transmitted by another device. For example, it may be an image captured by a camera at a certain angle, or an image of an object received from any device.
Step S102, the first image is divided into at least two pixel blocks having an overlapping area.
In some embodiments, the entire first image is traversed with a given pixel block size and step size, and the first image is partitioned into a plurality of pixel blocks satisfying that size. If the step size is set to be smaller than the pixel block size, the pixel blocks obtained by the division have overlapping regions, i.e., a soft split of the first image is realized. The number of pixel blocks covering an overlapping region depends on the ratio between the pixel block size and the step size.
Step S103, performing dimension conversion on each pixel block to obtain a first conversion characteristic of each pixel block.
In some embodiments, the dimensions of each block of pixels are converted, resulting in a first conversion characteristic that facilitates processing by a conversion network with attention mechanism. And performing dimension conversion on each pixel block, wherein the dimension conversion can be to reduce the dimension of the pixel block or to increase the dimension of the pixel block.
In some possible implementations, the first conversion feature of each pixel block is obtained by unfolding the pixel block into a one-dimensional vector and performing linear projection on the one-dimensional vector; that is, step S103 can be realized by the following steps S131 and 132 (not shown in the figure):
step S131, each pixel block is subjected to linear conversion to obtain a dimensionality reduction pixel block.
First, the segmented pixel block, which is a high-dimensional feature (e.g., 7 × 7 × c in size, where c represents the number of pixel channels), is input into the fully connected layer of the convolutional neural network. Then, the high-dimensional pixel block is expanded into a one-dimensional vector, and the dimension-reduced pixel block is obtained.
Step S132, adopting a first preset numerical value to perform linear projection on the dimensionality reduction pixel block to obtain the first conversion characteristic.
The first preset value may be any preset value, for example, 64, 32, 512, etc. are set as the first preset value. The first transformation characteristic can be obtained by fixing the dimension-reduced pixel block at a specific size, for example, setting the fixed size to 512, and linearly projecting the dimension-reduced pixel block into a 512-dimensional vector. For example, a pixel block expanded into a one-dimensional vector is subjected to linear mapping to obtain a first conversion feature. Therefore, the high-dimensional pixel block is subjected to linear conversion to obtain the first conversion characteristic with low dimensionality, so that the subsequent processing is facilitated, and the calculation amount of the subsequent processing can be reduced.
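A minimal sketch of this dimension conversion, assuming PyTorch; each flattened pixel block is linearly projected to a fixed token dimension (512 here, matching the example above; the variable names are illustrative):

import torch
import torch.nn as nn

k, c, d = 7, 32, 512                        # pixel block size, channels, first preset value d
to_token = nn.Linear(k * k * c, d)          # linear projection after flattening (dimension reduction)
pixel_blocks = torch.randn(400, k * k * c)  # 400 flattened pixel blocks from one feature map
first_conversion_features = to_token(pixel_blocks)   # shape (400, 512)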
And step S104, performing at least one fusion conversion treatment on the first conversion characteristic to obtain a second conversion characteristic after the first conversion characteristic is repaired.
In some embodiments, the fusion conversion network is used to perform at least one attention process on the first conversion features to predict the important features of the first conversion features that need attention; in the at least one fusion conversion process, the previous output serves as the next input. By predicting the important features of the first conversion feature of each pixel block, the image regions of interest in the first image can be predicted, so that the resulting second conversion features can include the features of the attention regions in the first image. For example, in the second conversion features, the attention weight of the features representing the attention regions of the first image is high.
In some possible implementation manners, firstly, an attention fusion module with an attention mechanism and a fusion feedforward network capable of fusing a plurality of pixel blocks are combined together to form a fusion conversion network; then, connecting a plurality of fusion conversion networks together to form a fusion conversion network set; finally, inputting the first conversion characteristic into a first layer of fusion conversion network in the fusion conversion network set for processing to obtain a processing result; and inputting the processing result into the fusion conversion network of the next layer. And circulating the steps to obtain the processing result output by the last layer, namely obtaining the second conversion characteristic with the image characteristic with finer granularity. The dimension of the second conversion feature is the same as the dimension of the first conversion feature input into the first-layer converged conversion network.
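A sketch of chaining the fusion conversion networks, assuming PyTorch; FusionConversionStack and make_block are hypothetical names, and each block stands for one attention-fusion plus fusion-feed-forward layer as described below:

import torch.nn as nn

class FusionConversionStack(nn.Module):
    # Hypothetical container for the set of fusion conversion networks described above.
    def __init__(self, make_block, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([make_block() for _ in range(num_layers)])

    def forward(self, tokens):
        for layer in self.layers:      # the previous layer's output is the next layer's input
            tokens = layer(tokens)
        return tokens                  # second conversion features, same dimension as the input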
And step S105, synthesizing the pixel blocks corresponding to the second conversion characteristics to obtain a second image.
In some embodiments, each pixel block has a corresponding second conversion characteristic, i.e. the second conversion characteristic has a one-to-one correspondence with the pixel block. By synthesizing the plurality of second conversion features into one feature map, the second feature map synthesized by the plurality of second conversion features sufficiently takes the fine information of each pixel block into consideration since each second conversion feature highlights the finer information in the first image. Therefore, the second image is obtained based on the second feature map, so that the second image is clearer.
In the embodiment of the present application, by first dividing a first image into a plurality of pixel blocks having overlapping regions; performing dimension conversion on the pixel block to obtain a first conversion characteristic which is more beneficial to subsequent processing; then, performing fusion conversion processing on the first conversion features obtained by dimension conversion to obtain second conversion features which can highlight fine pictures in the first image, so that the second conversion features comprise the features with finer granularity of the first image; finally, synthesizing second conversion characteristics containing the updated granularity characteristics to obtain a second image after the first image is repaired; in this way, the first image is divided into a plurality of pixel blocks with overlapping regions, and the plurality of pixel blocks with overlapping regions are combined into a feature map, so that the second conversion feature has a fine-grained feature, and a clearer second image can be generated.
In some embodiments, in order to improve the accuracy of segmenting the first image, the first image is subjected to feature extraction by using a neural network, and the extracted features are segmented, that is, the step S102 may be implemented by the steps shown in fig. 2, where fig. 2 is a schematic flow chart of another implementation of the image repairing method provided by the embodiment of the present application, and the following description is performed in conjunction with the steps shown in fig. 1B and 2:
step S201, performing feature extraction on the first image to obtain a first feature map.
In some embodiments, the first image is input into a convolutional neural network (or other neural network capable of feature extraction), and downsampled through two layers of convolutional layers to obtain a first feature map. For example, an encoder using a convolutional neural network encodes the first image to obtain a convolutional feature map of the frame image, i.e., the first feature map.
Step S202, dividing the first feature map into the at least two pixel blocks having the overlapping area.
In some embodiments, the first feature map is partitioned into a plurality of pixel blocks having overlapping regions by traversing the first feature map with a pixel block size and a step size; the obtained pixel blocks have an overlapping area, so that the characteristics of finer granularity in the image, such as the edge characteristics of the image, can be focused.
In some possible implementations, the first feature map is soft-divided according to a certain step size and size, that is, the step S202 can be implemented by the following steps S221 and 222 (not shown in the figure):
step S221, determining a segmentation step size and a segmentation size based on the size of the first feature map.
In some embodiments, the segmentation step size is set smaller than the segmentation size; the division size is set in proportion to the size of the first feature map by analyzing the size of the first feature map. In a specific example, the division size is set to 7 × 7, and the division step size is set to 3.
Step S222, traversing the first feature map according to the segmentation step, and segmenting the first feature map into at least two pixel blocks meeting the segmentation size.
According to the segmentation step length, traversing the first feature map from left to right from top to bottom in the first feature map; thereby enabling the first feature map to be divided into a plurality of pixel blocks; since the division size is larger than the division step size, an overlapping region exists in the resulting plurality of pixel blocks. As shown in the soft segmentation module 401 in fig. 4, in the soft segmentation module 401, the first feature map is divided into a plurality of pixel blocks having an overlapping area, wherein the size of each pixel block is set to k × k × c (where k denotes a pixel block length and height, and c denotes the number of channels of a pixel). As can be seen from the first characteristic diagram, the first characteristic diagram is divided into 9 pixel blocks from left to right in each row, and from top to bottom, each column is divided into 5 pixel blocks. Wherein the pixel block 411 (block (1, 1)) indicates that the position of the pixel block is located at the coordinate (1, 1) of the first feature map; the pixel block 412 (block (5, 5)) indicates that the position of the pixel block is located at the coordinate (5, 5) of the first feature map; the pixel block 413 (block (9, 9)) indicates that the position of the pixel block is located at the coordinates (9, 9) of the first feature map. After the first feature map is divided into a plurality of pixel blocks, dimension conversion is performed on the plurality of pixel blocks, and therefore the first conversion feature which is convenient for subsequent processing can be obtained. Wherein the pixel block 411 corresponds to the first conversion characteristic 414; the pixel block 412 corresponds to a first conversion characteristic 415; the pixel block 413 corresponds to a first conversion characteristic of 416. In the soft segmentation operation shown in fig. 4, the step s is smaller than the size k of the pixel block, each feature map of the first image is soft segmented into overlapping pixel blocks of size k × k, and the overlapping pixel blocks are expanded into one-dimensional vectors, i.e., first transformed features. Wherein the number of first conversion features is as shown in equation (1):
n = (⌊(h − p) / s⌋ + 1) × (⌊(w − p) / s⌋ + 1)        equation (1);
where p denotes the size of the pixel block, s denotes the segmentation step, w denotes the width of the first image, and h denotes the height of the first image.
In some possible implementations, in the process of soft-splitting the first feature map into a plurality of pixel blocks, different segmentation sizes and segmentation steps lead to different numbers of pixel blocks covering each overlapping region. In order to reduce the resulting differences between the pixel values of the pixel blocks obtained after splitting, the pixel blocks are normalized as shown in equation (2), which expresses the normalized token-to-pixel-block (T2P) mapping (the formula appears as an image in the original publication).
In this way, the first feature map is divided in such a manner that the division step size is smaller than the division size, so that the obtained pixel blocks have an overlapping region and detail information of the first feature map can be retained.
In the embodiment of the application, a damaged first image is input into an encoder based on a convolutional neural network to extract a first feature map, and the first feature map is subjected to soft segmentation by adopting a certain segmentation step size and a segmentation size, so that a plurality of pixel blocks with fine granularity are obtained.
In some embodiments, to improve the fineness of the obtained second conversion feature, the first conversion feature is subjected to a plurality of times of fusion conversion processing, so that more detailed information in the first feature map can be focused in the second conversion feature, that is, the above step S104 can be realized by the following steps S131 to 133 (not shown in the figure):
step S131, the jth attention fusion processing is carried out on the first conversion feature, and a first fusion feature is obtained.
In some embodiments, j = 1, 2, …, M, where M is the preset number of fusion conversion processes. In the j-th attention fusion process, the first conversion features are input into an attention fusion module (such as the attention fusion module 321 in fig. 3) to implement the attention fusion process. In the attention fusion module, after a normalization operation is performed on the plurality of first conversion features corresponding to the plurality of pixel blocks, they are input into a multi-head attention network, which learns the regions in the first image that require attention and thereby adjusts the first conversion features into third conversion features; the third conversion features and the first conversion features are then summed element by element to obtain the first fusion features.
In some possible implementations, to improve the accuracy of predicting the fine-grained feature of the first image in the third transformation feature, the step S131 may be implemented by:
firstly, regularization processing is carried out on the first conversion characteristic to obtain a first regularization characteristic.
And performing data preprocessing on the first conversion characteristics corresponding to each pixel block to enable the plurality of first conversion characteristics to be more normalized. For example, the plurality of first conversion features are normalized by subtracting the mean value of the plurality of first conversion features column by column and dividing by the variance of the plurality of first conversion features; thereby obtaining a normalized first regularization feature. Each pixel block corresponds to a first regularization feature.
And secondly, processing the first regularization feature based on a multi-head attention mechanism to obtain a third conversion feature.
In some embodiments, first, query features and key-value pairs of the regularization features are determined.
For example, for the regularized feature of each pixel block, a query (Query) vector, a key (Key) and a value (Value) are determined; the attention distribution is computed from the query and the keys and applied to the values. This information can then be fed into a multi-head attention network to determine the attention values for the first image, i.e. the important regions of interest in the first image. In a specific example, the important region may be determined according to the picture content of the first image and the application scenario. For example, if image restoration is performed on a first image whose picture content is a damaged building and a background, the important region is the damaged building in the image.
Then, the third conversion feature is determined based on the query feature and the key-value pair.
Respectively inputting query features and key value pairs of the regularized features into a multi-head attention network, determining attention distribution of the first conversion features by adopting the multi-head attention network, and determining a weighted average value of the input first conversion features based on the attention distribution, so that third conversion features representing important regions of the first image can be obtained. In this way, by using the multi-head attention network, the attention weight value of the input information of the first conversion feature is predicted, so that a third conversion feature which highlights an important region of the first image can be obtained, and the calculation amount for repairing the first image can be reduced.
And thirdly, fusing the third conversion characteristic and the first conversion characteristic to obtain the first fusion characteristic.
For a pixel block, the first conversion feature and the third conversion feature corresponding to the pixel block are fused to obtain a first fusion feature corresponding to the pixel block. The number of first fusion features is the same as the number of pixel blocks, i.e. each pixel block has a corresponding first fusion feature. In this way, the regularized features are processed based on a multi-head attention mechanism, and the obtained third conversion features and the first conversion features are fused, so that the first fused features can include more key information in the first image.
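A minimal sketch of this attention fusion, assuming PyTorch's nn.MultiheadAttention with the query, key and value all taken from the regularized first conversion features; the residual sum with the original first conversion features yields the first fusion features:

import torch
import torch.nn as nn

d, heads, n = 512, 8, 400
norm = nn.LayerNorm(d)                                    # regularization of the first conversion features
mha = nn.MultiheadAttention(embed_dim=d, num_heads=heads, batch_first=True)

tokens = torch.randn(1, n, d)                             # first conversion features (batch, n, d)
regularized = norm(tokens)                                # first regularization features
attended, _ = mha(regularized, regularized, regularized)  # third conversion features (query = key = value)
first_fusion_features = tokens + attended                 # element-by-element residual sum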
Step S132, the feed-forward fusion processing of the jth time is carried out on the first fusion characteristic to obtain the second conversion characteristic.
In some embodiments, the third conversion feature and the first conversion feature are fused to obtain a first fused feature; taking the first fusion characteristic as the input of a fusion feedforward network (shown as a fusion feedforward network 331 in fig. 3), and finally obtaining an optimized characteristic, namely a second conversion characteristic, by performing soft synthesis and soft segmentation for multiple times in the fusion feedforward network; the second conversion feature may be regarded as a feature obtained by one fusion conversion process. In some possible implementation manners, multiple times of fusion conversion processing are performed based on the second conversion feature obtained by the one-time feedforward fusion processing, so that the feature output by the last fusion conversion processing is obtained, and the second image after the first image is repaired is synthesized.
After step S132, based on the second conversion characteristic, a plurality of times of fusion conversion processing are performed to obtain a second conversion characteristic output by the last fusion conversion processing.
In some embodiments, the output of the last fusion transformation process is used as the input for the next fusion transformation process. The processing procedures of step S131 and step S132 are performed a plurality of times with the first conversion feature and the attention conversion feature as inputs of the fusion conversion process, resulting in a second conversion feature having detail information of the first image. In this way, the fusion conversion processing is executed a plurality of times in a loop, so that the second conversion characteristic in which finer-grained information is fused can be obtained.
Referring to fig. 3, fig. 3 is a schematic diagram of an implementation framework of an image repairing method provided in the embodiment of the present application, where the framework may be regarded as an image repairing network, and an implementation process of the network is as follows:
first, t frames of video are input to the CNN network as a first image.
Secondly, extracting the characteristics of the t frame video by adopting an encoder formed by a CNN network, and dividing the t frame video into a plurality of pixel blocks with overlapping regions in a soft division mode; wherein the plurality of pixel blocks in the first frame image 301 includes: block (1, 1), block (2, 2), block (3, 3), etc. Other pixel blocks are included in the first frame image 301, for example, the pixel block of the first line includes: a pixel block (1, 1), a pixel block (1, 2), a pixel block (1, 9); and a pixel block (1, 1), a pixel block (2, 1), a pixel block (9, 1), etc. of the first column; in the second frame image 302, there are included: blocks (1, 1), blocks (2, 2), blocks (3, 3), etc.; other pixel blocks are also included in the second frame image 302.
Thirdly, for each frame image, dimension conversion is performed on the pixel blocks (1, 1), …, (5, 5), …, (9, 9) in the first frame image 301 and on the pixel blocks (1, 1), …, (5, 5), …, (9, 9) in the second frame image 302, respectively, to obtain the first conversion features of the pixel blocks in the first frame image 301: (1, 1), (5, 5), (9, 9); and the first conversion features of the pixel blocks in the second frame image 302: (1, 1), (5, 5), (9, 9). That is, the pixel blocks in the first frame image 301 and the second frame image 302 are input into the linear projection processing module 303, where the pixel blocks are expanded into one-dimensional vectors; linear mapping is then performed on the one-dimensional vectors to obtain the first conversion features of the pixel blocks, and spatial position encoding is added on the basis of the first conversion features to obtain the encoding of each first conversion feature, including: (1, 1), (1, 2), (1, 9), (2, 1), (9, 1).
In some possible implementations, for corrupted video frames X, an encoder based on a convolutional neural network encodes each video frame to obtain a c-channel convolution feature map of that frame. Each feature map is divided into k × k pixel blocks with step size s, and all pixel blocks are linearly embedded into feature vectors Z ∈ R^(t·n×d), where n is the number of feature vectors in one image (given by equation (1) above) and d is the token channel dimension.
Again, the plurality of first conversion features are input into the attention conversion block 304 to obtain a second conversion feature subjected to the fusion conversion processing a plurality of times.
In some possible implementations, the network contains multiple layers of the attention conversion block 304. Z is input into the stacked attention conversion blocks 304 for spatio-temporal information propagation, producing optimized feature vectors of the same shape.
In some embodiments, the attention translation block 304 of each layer is comprised of an attention fusion module 321 and a fusion feed-forward network 331.
The processing procedure of the attention conversion block 304 of any layer is as follows: a first step, in an attention fusion module 321, of regularizing 322 the plurality of first conversion features 305; secondly, determining three inputs of the multi-head attention network, namely query vectors, key values and pixel values, based on the regularized features, and inputting the three inputs into the multi-head attention network 323; thirdly, the output result (i.e. the third conversion characteristic) of the multi-head attention network 323 and the first conversion characteristic 305 are added element by element to obtain a first fusion characteristic; fourthly, regularizing 324 the first fusion feature, and taking the regularized feature as the input of a fusion feed-forward network 331; fifthly, the output of the fused feedforward network 331 and the first conversion characteristic 305 are summed element by element to obtain the final second conversion characteristic.
In some possible implementations, the attention conversion block 304 replaces the two-layer multi-layer perceptron (MLP) feed-forward sub-layer of a standard transformer block with the fusion feed-forward network 331 provided in the embodiments of the present application. Let the input feature vectors of the attention fusion module 321 in the l-th attention conversion block 304 be Z_l, where l ∈ [0, L); the attention conversion block can then be expressed as equations (3) and (4):
Z_l' = MSA(LN1(Z_l)) + Z_l        equation (3);
Z_(l+1) = F3N(LN2(Z_l')) + Z_l'        equation (4);
where MSA and LN denote standard multi-head self-attention and layer normalization, respectively, and F3N denotes the fusion feed-forward network 331.
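Equations (3) and (4) describe a pre-norm transformer layer whose feed-forward sub-layer is replaced by F3N; a minimal sketch in Python, assuming msa, f3n, ln1 and ln2 are modules defined elsewhere (for example as in the snippets above and below):

def attention_conversion_block(z, msa, f3n, ln1, ln2):
    # Equation (3): multi-head self-attention sub-layer with a residual connection.
    z = msa(ln1(z)) + z
    # Equation (4): fusion feed-forward network (F3N) sub-layer with a residual connection.
    z = f3n(ln2(z)) + z
    return z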
In the process in which this layer's attention conversion block 304 processes the input first conversion features, the fusion feed-forward network 331 operates as follows: first, the features processed by the regularization 324 are input into the MLP 332, linear conversion is performed in the MLP 332, and the linearly converted features are reshaped into pixel blocks 333 (i.e., converted pixel blocks); second, the plurality of reshaped pixel blocks 333 are synthesized into a feature map 334 (i.e., a second feature map); third, the feature map 334 is split again to obtain intermediate pixel blocks 335 with overlapping regions; finally, the plurality of intermediate pixel blocks 335 (e.g., the fused intermediate pixel blocks (1, 1), (5, 5) and (9, 9)) are input into another MLP 336, linearly converted, and the linearly converted pixel blocks are converted back into conversion features, i.e., feature vectors.
In some possible implementations, each optimized feature vector in Z is linearly transformed into a k·k·c channel vector and reshaped into a k × k × c pixel block; all the generated pixel blocks are then re-synthesized pixel by pixel at their positions in the original frame to obtain the second feature map. No additional parameters are introduced into the fusion feed-forward network 331; soft composition and soft split operations are simply inserted between its two MLP layers. Let z_i', i ∈ [0, t·n), be the output feature vectors of LN2 in equation (4). The fusion feed-forward network 331 can then be expressed as equations (5) to (8):
p_i = MLP1(z_i'), i ∈ [0, t·n)        equation (5);
A_j = SC(p_(j·n), …, p_((j+1)·n−1)), j ∈ [0, t)        equation (6);
p'_(j·n), …, p'_((j+1)·n−1) = SS(A_j), j ∈ [0, t)        equation (7);
z_i'' = MLP2(p_i'), i ∈ [0, t·n)        equation (8);
where SC denotes the soft composition operation and SS denotes the soft split operation. In a standard design, the input and output channels of MLP 332 are (d, 4·d) and those of MLP 336 are (4·d, d); the fusion feed-forward network 331 instead linearly converts the feature vectors into pixel blocks between the two MLPs and then converts the pixel blocks back into feature vectors, changing the input and output channels of the two MLPs to (d, k²·c') and (k²·c', d), where c' denotes the channel number of the intermediate pixel blocks.
In the soft composition operation on the feature vectors, each feature vector is first reshaped from [1 × d] into [k × k × c'] and placed at its original spatial location in the convolution feature map. Then, in the following soft split operation, each pixel block is soft-split again from the re-composed feature map A_j of the corresponding frame j, using the segmentation size k and the step size s. Finally, the re-split pixel blocks are expanded into one-dimensional vectors and fed into MLP 336.
Again, inputting the plurality of second conversion features into another linear projection processing module 306, performing linear projection in the linear projection processing module 306, and reshaping the plurality of second conversion features into a plurality of updated pixel blocks 307, including: a block (1, 1), a block (5, 5), a block (9, 9) obtained by performing fusion conversion processing on the first frame image; and a block (1, 1), a block (5, 5), and a block (9, 9) obtained by performing fusion conversion processing on the second frame image.
Finally, the plurality of updated pixel blocks 307 (including block (1, 1), block (5, 5) and block (9, 9)) are synthesized, each at its position in the video frame, resulting in a video frame 308 that repairs the first frame image 301 and a video frame 309 that repairs the second frame image 302.
In some possible implementations, several deconvolution layers are used to decode the re-synthesized second feature map and output restored video frames (i.e., the second image) at the original size.
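A possible shape of the deconvolution decoder, assuming PyTorch; the two ConvTranspose2d stages mirror the two-layer downsampling of the convolutional encoder mentioned earlier, and the channel counts are illustrative:

import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),   # 2x upsampling
    nn.LeakyReLU(0.2),
    nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),    # back to an RGB frame at original size
)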
In the embodiment of the application, the fusion conversion network formed by a plurality of groups of fusion conversion blocks and the fusion feedforward network is adopted to perform fusion conversion processing on the first conversion characteristics of each pixel block for a plurality of times, so that second conversion characteristics which can contain richer edge information and deeper information are obtained, and a second image with a better repairing effect can be synthesized.
In some embodiments, the image inpainting method provided by the embodiments of the present application may be implemented by an image inpainting network including the framework shown in fig. 3, where a training process of the image inpainting network is as follows:
the embodiment of the application trains the network for image restoration by reducing the following loss to the maximum extent, wherein the loss function is as the formula (9):
L=λR·LRadv·Ladvformula (9);
where L_R is the reconstruction loss over all pixels, L_adv is the adversarial loss of the adversarial network, and λ_R and λ_adv balance the importance of the different loss terms. The reconstruction loss L_R measures the distance between the synthesized video and the original video Y, and is given by equation (10) (the formula appears as an image in the original publication).
Moreover, in the embodiments of the present application, a classifier D is trained jointly with the generator containing the attention fusion module 321 as an auxiliary, so as to obtain better overall fidelity and temporal consistency. The classifier takes real videos and synthesized videos as input, and its output lies in the range [0, 1], where 0 represents fake and 1 represents real. The classifier is trained to distinguish all synthesized videos from real videos; the generator is trained in the opposite direction, to generate videos that classifier D cannot distinguish from real ones. The loss function of classifier D is given by equation (11), and the loss function of the generator is given by equation (12) (both formulas appear as images in the original publication).
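A minimal sketch of the combined objective of equation (9), assuming PyTorch; the reconstruction term is written here as an L1 distance (one common choice, since equations (10) to (12) are not reproduced above), and D's score on the synthesized video is assumed to lie in [0, 1] as stated:

import torch
import torch.nn.functional as F

def generator_loss(pred_video, real_video, d_score_on_pred, lambda_r=1.0, lambda_adv=0.01):
    # Reconstruction term L_R: pixel-wise distance between synthesized and original video.
    l_rec = F.l1_loss(pred_video, real_video)
    # Adversarial term L_adv: push the classifier's score on synthesized frames towards "real" (1).
    l_adv = F.binary_cross_entropy(d_score_on_pred, torch.ones_like(d_score_on_pred))
    # Equation (9): weighted sum of the two terms.
    return lambda_r * l_rec + lambda_adv * l_adv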
in some embodiments, the first fused feature is soft-synthesized and soft-segmented a plurality of times by using a fused feedforward network based on fine-grained feature fusion, so as to obtain a second transformed feature with fine-grained information, that is, the step S132 may be implemented by:
firstly, regularization processing is carried out on the first fusion feature to obtain a second regularization feature.
Data preprocessing is performed on the first fusion feature corresponding to each pixel block so that the plurality of first fusion features are better normalized. The process of regularizing the first fusion features may be the same as the process of regularizing the first conversion features.
And secondly, performing feedforward conversion on the second regularization characteristic to obtain a feedforward conversion characteristic.
The second regularization features are input into the fusion feed-forward network, and fusion feed-forward processing is performed to obtain the feed-forward conversion features output by the fusion feed-forward network. The output feed-forward conversion features have the same dimension as the input second regularization features.
And thirdly, fusing the feedforward conversion characteristic and the first fusion characteristic to obtain a second fusion characteristic.
And fourthly, determining the second fusion characteristic as the second conversion characteristic.
And adding the feedforward conversion characteristic and the first fusion characteristic element by element to obtain a second fusion characteristic, namely obtaining a second conversion characteristic with finer granularity.
In the embodiment of the present application, in order to improve normalization between the first fusion features corresponding to each pixel block, regularization processing is performed on the obtained multiple first fusion features. And taking the regularized features as the input of a fusion feedforward network, and performing soft synthesis and soft segmentation operation on the input first fusion features for multiple times in the fusion feedforward network to obtain the final output attention conversion features. In this way, the first conversion feature and the third conversion feature of the same pixel block are fused, and the fused features are subjected to soft synthesis and soft segmentation for multiple times, so that more detailed information can be captured, and the granularity of the obtained attention conversion features is finer.
In some possible implementations, the second regularization feature is processed by the fusion feedforward network, and the feedforward conversion feature is obtained as follows:
First, the second regularization features are converted into pixel blocks to obtain at least two converted pixel blocks.
In the fusion feedforward network, the input second regularization feature is subjected to dimension-raising processing and reshaped into a pixel block; for example, a second regularization feature that is a 512-dimensional vector is converted into a 7 × 7 × c pixel block by raising the dimension of the vector. A pixel block is thus obtained for each second regularization feature, i.e., at least two converted pixel blocks. In some possible implementations, the at least two converted pixel blocks are obtained by dimension-raising conversion of the second regularization features: first, each second regularization feature is linearly converted to obtain at least two dimension-raised features. In a specific example, a multilayer perceptron network is used to linearly process the second regularization feature so that its dimension is raised based on a second preset value, and the dimension-raised feature has the same dimension as the initial pixel block (i.e., the pixel block obtained by segmenting the first image in step S102). Then, each dimension-raised feature is linearly projected with the second preset value to obtain the at least two converted pixel blocks. Here, the second preset value is in a proportional relation with the first preset value (for example, the first preset value and the second preset value are reciprocal); the linear projection of the dimension-raised feature is performed in the same manner as the linear projection of the dimension-reduced pixel block, but the preset values in the projection parameters are reciprocal, so that linearly projecting the dimension-raised feature with the second preset value yields converted pixel blocks of the same size as the initial pixel blocks. In this way, the second regularization features are converted, by an inverse linear conversion, into converted pixel blocks of the same size as the initial pixel blocks, which facilitates aggregating different second regularization features in the overlapping regions.
Second, the at least two converted pixel blocks are synthesized based on the position information of each converted pixel block in the first image to obtain a second feature map.
For each converted pixel block, the position information of the converted pixel block in the first image, i.e., its coordinates in the first image, is determined. The converted pixel blocks are then spliced based on these coordinates, so that a second feature map can be synthesized. In the process of synthesizing the second feature map, for a region where a plurality of converted pixel blocks overlap, the converted pixel blocks in the overlapping region are added element by element, and the resulting sum is taken as the pixel value of the overlapping region. In this way, the pixel values of the overlapping region gather features from different converted pixel blocks, which makes the edge regions in the resulting second feature map smoother.
Third, the second feature map is divided into at least two intermediate pixel blocks having overlapping regions.
The second feature map is segmented in a manner similar to step S102; that is, the second feature map is traversed with the set segmentation size and segmentation step, and is divided into at least two intermediate pixel blocks having overlapping regions.
Finally, dimension conversion is performed on each intermediate pixel block to obtain the feedforward conversion feature of each intermediate pixel block.
Each intermediate pixel block is unfolded into a one-dimensional feature vector through dimension conversion, and the one-dimensional feature vector is linearly projected with a fixed size to obtain the feedforward conversion feature of the intermediate pixel block. In this way, in the fusion feedforward network, the first fusion features are soft-synthesized multiple times and the soft-synthesized feature map is soft-segmented again, so that the feedforward conversion features obtained after re-segmentation carry richer edge information and fuse more information from adjacent pixel blocks.
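Combining the sub-steps above, the fusion feedforward path amounts to a dimension-raising linear conversion, an overlap-summing soft synthesis, a soft re-segmentation into overlapping intermediate pixel blocks, and a linear projection back to the original feature dimension. The sketch below assumes a PyTorch-style implementation in which nn.Fold sums overlapping values (matching the element-by-element addition described above) and nn.Unfold performs the overlapping re-segmentation; the patch size, stride, channel count and feature-map size are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class FusionFeedForward(nn.Module):
    """Illustrative sketch of the fusion feedforward path; all names and sizes
    are hypothetical."""

    def __init__(self, dim=512, patch=7, stride=3, channels=16, fmap_hw=(64, 64)):
        super().__init__()
        block_dim = channels * patch * patch
        self.up = nn.Linear(dim, block_dim)     # dimension-raising linear conversion
        self.down = nn.Linear(block_dim, dim)   # projection back to the token dimension
        # soft synthesis: overlapping blocks placed at their positions, overlaps summed
        self.fold = nn.Fold(output_size=fmap_hw, kernel_size=patch, stride=stride)
        # soft re-segmentation: split the synthesized map into overlapping blocks again
        self.unfold = nn.Unfold(kernel_size=patch, stride=stride)

    def forward(self, x):                       # x: (batch, num_blocks, dim)
        # num_blocks must equal the block count implied by fmap_hw, patch and stride
        blocks = self.up(x)                              # (batch, num_blocks, c*p*p)
        fmap = self.fold(blocks.transpose(1, 2))         # second feature map, overlaps summed
        blocks = self.unfold(fmap).transpose(1, 2)       # intermediate pixel blocks
        return self.down(blocks)                         # feedforward conversion features

# Example: a 64x64 feature map split with 7x7 blocks and stride 3 gives 20*20 = 400 blocks.
ffn = FusionFeedForward()
out = ffn(torch.randn(2, 400, 512))
```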
In some embodiments, in order to improve the effect of repairing an incomplete image, the conversion features are decoded and reshaped into pixel blocks, and the repaired second image is synthesized according to the positions of the pixel blocks in the original image; that is, the step S105 may be implemented by the following steps S151 to S153 (not shown in the figure):
step S151, performing linear conversion on each second conversion feature to obtain a fourth conversion feature.
For each second conversion feature in the plurality of second conversion features, inputting the second conversion feature into the linear layer and decoding the second conversion feature into a one-dimensional vector to obtain a fourth conversion feature; the number of the fourth conversion features is the same as the number of the second conversion features.
Step S152, reshaping each of the fourth conversion features into a pixel block, so as to obtain an updated pixel block set.
Dimension-raising processing is performed on each fourth conversion feature to obtain the updated pixel block set; for example, the dimension of each fourth conversion feature is raised so that the one-dimensional feature vector is reshaped into a two-dimensional pixel block, and the set of update pixel blocks is thereby obtained; the size of each two-dimensional pixel block is the same as the size of the original pixel blocks, as shown by the update pixel blocks 421, 422 and 423 in fig. 4.
Step S153, synthesizing the update pixel blocks in the update pixel block set into the second image based on the position information of each update pixel block in the first image.
The coordinates of each update pixel block in the first image are determined, the update pixel blocks having overlapping regions are combined into a feature map based on these coordinates, and the repaired second image is obtained by decoding the feature map. In this way, by soft-combining the update pixel blocks having overlapping regions, the overlapping regions between different pixel blocks can be better utilized, thereby improving the image restoration effect.
In some possible implementations, the feature vector input into the linear layer is decoded into a one-dimensional vector and reshaped into a pixel block, and a plurality of pixel blocks having overlapping regions are then combined into a feature map, so that the overlapping regions between different pixel blocks can be better utilized; that is, the step S153 may be implemented by the following steps:
First, the overlapping pixel portions of the update pixel blocks in the same overlapping region are fused to obtain a fusion pixel block.
The partial pixels of the update pixel blocks that lie in the same overlapping region are summed element by element, so that the fusion pixel block of that overlapping region is obtained. That is, because there are overlapping regions between the update pixel blocks, the pixel values that overlap at the same spatial position are summed element by element to obtain the pixel block of the overlapping region, i.e., the fusion pixel block.
And secondly, determining the first position information of the fusion pixel block and the second position information of the non-overlapped pixel block based on the position information of each updating pixel block in the first image.
And respectively determining the coordinates of the overlapping area corresponding to the fusion pixel block and the coordinates of the non-overlapping pixel block in the first image according to the coordinates of each updating pixel block in the first image.
Third, the fusion pixel block and the non-overlapping pixel blocks are spliced based on the first position information and the second position information to obtain a third feature map.
According to the position of the overlapping region in the first image and the positions of the other, non-overlapping regions of the update pixel blocks to which the overlapping region belongs, the fusion pixel block and the non-overlapping pixel blocks are spliced, so that a third feature map with the same size as the original first feature map is obtained.
Fourth, the third feature map is decoded to obtain the second image.
A decoder based on a convolutional neural network is used to decode the third feature map to obtain the repaired image, i.e., the second image. As shown by the soft combining module 402 in fig. 4, in the soft combining module 402, the conversion features 414, 415 and 416 are first reshaped into the update pixel blocks 421, 422 and 423, respectively; then, based on the position information of the pixel blocks 421, 422 and 423 in the original first feature map, the pixels in the overlapping regions are fused and spliced with the pixels in the non-overlapping regions, so as to obtain the feature map 424 used for repairing the first image (424 being the reference numeral of this feature map).
In the embodiment of the application, the second conversion features are input into the soft combining module to obtain a third feature map with the same size as the first feature map, and the learned third feature map is input into a decoder based on a convolutional neural network to obtain the repaired second image. In this way, when the second conversion features obtained after the fusion conversion processing are soft-combined back to their original positions, each overlapping region gathers information from different conversion features; fusing information from adjacent pixel blocks helps to smooth the boundaries of the pixel blocks and to enlarge the receptive field of each pixel block.
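As a concrete illustration of steps S151 to S153, the following sketch assumes a PyTorch-style soft combination in which each second conversion feature is linearly decoded, reshaped into an update pixel block, and placed back at its position with overlapping pixels summed element by element; the resulting third feature map would then be passed to a convolutional decoder. All sizes and names here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftComposition(nn.Module):
    """Hypothetical sketch of steps S151-S153: decode each second conversion
    feature, reshape it into an update pixel block, and soft-combine the
    overlapping blocks into a feature map for the CNN decoder."""

    def __init__(self, dim=512, channels=16, patch=7, stride=3, fmap_hw=(64, 64)):
        super().__init__()
        self.decode_linear = nn.Linear(dim, channels * patch * patch)
        self.fold = nn.Fold(output_size=fmap_hw, kernel_size=patch, stride=stride)

    def forward(self, second_conv_features):      # (batch, num_blocks, dim)
        # steps S151/S152: linear conversion + reshape into update pixel blocks
        blocks = self.decode_linear(second_conv_features)   # (batch, num_blocks, c*p*p)
        # step S153: place blocks at their positions; pixels in overlapping
        # regions are summed element by element (soft combination)
        feature_map = self.fold(blocks.transpose(1, 2))      # third feature map
        return feature_map                                   # fed to a CNN decoder

# Example: 400 blocks of dimension 512 recombined into a 16-channel 64x64 map.
soft_comp = SoftComposition()
fmap = soft_comp(torch.randn(2, 400, 512))
```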
An exemplary application of the embodiment of the present application in an actual application scenario will be described below, taking implementation of high-quality image restoration by using a high-quality image restoration framework as an example.
In some embodiments, video image restoration is an important problem in the field of computer vision: a machine is required to learn from a large number of natural videos, so that a video with partially damaged pixels can be restored into a vivid and natural video. The technology has many application scenarios, such as removal of unwanted objects, foreground removal, and video image generation and reconstruction.
Networks for repairing images include convolutional neural networks, attention mechanisms, and the like. Convolutional neural networks can accommodate most video image tasks. The attention mechanism can better model relations between pixels that are far apart, which is particularly important for video image restoration, since restoration needs to refer to visual features from other spatio-temporal positions. The attention mechanism can be flexibly inserted into a convolutional neural network so that the network learns better visual features, which benefits video image restoration. However, when the attention mechanism is applied to visual tasks in the related art, the image features often have to be hard-segmented into many pixel blocks because of the computational cost. Hard segmentation means that an image feature is sliced along both the width and height dimensions into non-overlapping small pixel blocks (coarse granularity). The attention mechanism then models relations directly among these small pixel blocks, which leaves the pixels inside each small pixel block (fine granularity) under-utilized and affects the final image restoration effect.
Based on the above, the embodiments of the present application provide a method of performing soft segmentation and soft synthesis on an image, where soft segmentation divides the feature map of an image into a plurality of pixel blocks having overlapping regions, and soft synthesis combines a plurality of pixel blocks having overlapping regions into one feature map. The pair of soft segmentation and soft synthesis modules is used once at the beginning and once at the end of the fusion conversion processing, and is used repeatedly in the feedforward network inside the fusion conversion processing, so that the fusion conversion processing has the capability of learning fine-grained sub-patch features and a clearer video can be generated.
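For illustration, the difference between hard segmentation and the soft segmentation proposed here can be seen by comparing a patch-extraction stride equal to the patch size (non-overlapping) with a stride smaller than the patch size (overlapping); the sketch below uses PyTorch's nn.Unfold with assumed sizes.

```python
import torch
import torch.nn as nn

# Illustration only: hard segmentation vs. soft segmentation of a feature map.
feature_map = torch.randn(1, 16, 64, 64)          # (batch, channels, H, W), sizes assumed

hard_split = nn.Unfold(kernel_size=7, stride=7)   # stride == kernel: non-overlapping blocks (hard)
soft_split = nn.Unfold(kernel_size=7, stride=3)   # stride < kernel: overlapping blocks (soft)

hard_blocks = hard_split(feature_map)             # fewer blocks, no shared pixels
soft_blocks = soft_split(feature_map)             # more blocks, neighbours share border pixels
print(hard_blocks.shape, soft_blocks.shape)       # e.g. (1, 784, 81) vs. (1, 784, 400)
```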
Referring to fig. 5, fig. 5 is a schematic view of an application scenario of the image inpainting method according to the embodiment of the present application, where an image set 501 represents a first column of images in fig. 5, and is an image frame marked with a detection frame; the image set 502 represents the second column of images in fig. 5, which are real reference images; the image set 503 is a result of repairing the images in the image set 501 in the related art; the image set 504 is based on the implementation framework shown in fig. 3, but does not include the multi-layer attention conversion block 304 in fig. 3, that is, the images in the image set 501 are repaired by performing soft segmentation and soft synthesis on the images in the image set 501, so as to obtain a repaired image set 504; the image set 505 is a restored image obtained by restoring the image in the image set 501 based on the implementation framework shown in fig. 3.
As can be seen from fig. 5, thanks to the combination of soft segmentation and soft synthesis, the images in the image set 504 handle detailed information better than those in the image set 503. When, on top of the processing used for the image set 504, the conversion (transform) function is further implemented by the multi-layer attention conversion block 304 provided in the embodiment of the present application, the images in the image set 505 recover both fine details and severely damaged objects.
In the embodiment of the application, the damaged video is first input into a convolutional neural network-based encoder to extract a first feature map; second, the first feature map is passed through a soft segmentation module to obtain first feature vectors, and this group of first feature vectors (corresponding to the first conversion features in the above embodiments) is further learned by the attention fusion module to obtain learned feature vectors; third, the learned feature vectors are input into a soft synthesis module to obtain a feature map with the same size as the first feature map. Finally, the learned feature map is input into a decoder based on a convolutional neural network to obtain the repaired video, thereby improving the performance of video image restoration.
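The overall flow summarized above could be wired together as in the following sketch, in which every sub-module (a toy convolutional encoder/decoder, standard transformer encoder layers in place of the attention fusion module, and nn.Unfold/nn.Fold for soft segmentation and soft synthesis) is a stand-in chosen for illustration rather than the patent's exact architecture; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class InpaintingPipeline(nn.Module):
    """End-to-end sketch: encode -> soft split -> attention-style fusion ->
    soft combine -> decode. Sub-modules are illustrative stand-ins."""

    def __init__(self, channels=16, dim=512, patch=7, stride=3, fmap_hw=(64, 64), depth=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.soft_split = nn.Unfold(kernel_size=patch, stride=stride)
        self.embed = nn.Linear(channels * patch * patch, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
             for _ in range(depth)])              # stand-in for the attention fusion module
        self.unembed = nn.Linear(dim, channels * patch * patch)
        self.soft_combine = nn.Fold(output_size=fmap_hw, kernel_size=patch, stride=stride)
        self.decoder = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, damaged):                   # (batch, 3, H, W), H and W match fmap_hw
        fmap = self.encoder(damaged)                                  # first feature map
        tokens = self.embed(self.soft_split(fmap).transpose(1, 2))    # first conversion features
        for blk in self.blocks:
            tokens = blk(tokens)                                      # fusion conversion processing
        blocks = self.unembed(tokens).transpose(1, 2)
        fmap = self.soft_combine(blocks)                              # overlaps summed element-wise
        return self.decoder(fmap)                                     # repaired second image
```

For example, `InpaintingPipeline()(torch.randn(2, 3, 64, 64))` would return a repaired batch of the same spatial size, provided the assumed feature-map size matches the input.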
An image restoration apparatus according to an embodiment of the present application is provided, fig. 6 is a schematic structural composition diagram of the image restoration apparatus according to an embodiment of the present application, and as shown in fig. 6, the image restoration apparatus 600 includes:
a first obtaining module 601, configured to obtain a first image;
a first segmentation module 602 for segmenting the first image into at least two blocks of pixels having overlapping regions;
a first conversion module 603, configured to perform dimension conversion on each pixel block to obtain a first conversion characteristic of each pixel block;
a first processing module 604, configured to perform at least one fusion conversion processing on the first conversion feature to obtain a second conversion feature after the first conversion feature is repaired;
the first synthesizing module 605 is configured to synthesize the pixel blocks corresponding to the second conversion feature to obtain a second image.
In some embodiments, the first segmentation module 602 includes:
the first extraction submodule is used for extracting the features of the first image to obtain a first feature map;
a first segmentation submodule for segmenting the first feature map into the at least two pixel blocks having an overlap region.
In some embodiments, the first segmentation submodule comprises:
a first determination unit configured to determine a segmentation step size and a segmentation size based on a size of the first feature map; wherein the segmentation step size is smaller than the segmentation size;
and the first segmentation unit is used for traversing the first feature map according to the segmentation step length and segmenting the first feature map into at least two pixel blocks meeting the segmentation size.
In some embodiments, the first conversion module 603 includes:
the first conversion submodule is used for carrying out linear conversion on each pixel block to obtain a dimensionality reduction pixel block;
and the first projection submodule is used for performing linear projection on the dimensionality reduction pixel block by adopting a first preset numerical value to obtain the first conversion characteristic.
In some embodiments, the first processing module 604 includes:
the first processing submodule is used for carrying out attention fusion processing on the first conversion characteristic for the jth time to obtain a first fusion characteristic; wherein j = 1, 2, …, M, and M is the preset number of times of fusion conversion processing;
and the second processing submodule is used for carrying out the j-th feedforward fusion processing on the first fusion characteristic to obtain the second conversion characteristic.
In some embodiments, the first processing sub-module comprises:
the first processing unit is used for conducting regularization processing on the first conversion characteristic to obtain a first regularization characteristic;
the second processing unit is used for processing the first regularization feature based on a multi-head attention mechanism to obtain a third conversion feature;
and the first fusion unit is used for fusing the third conversion characteristic and the first conversion characteristic to obtain the first fusion characteristic.
In some embodiments, the second processing sub-module comprises:
the third processing unit is used for conducting regularization processing on the first fusion feature to obtain a second regularization feature;
the first conversion unit is used for carrying out feedforward conversion on the second regularization characteristic to obtain a feedforward conversion characteristic;
the second fusion unit is used for fusing the feedforward conversion characteristic and the first fusion characteristic to obtain a second fusion characteristic;
a second determination unit configured to determine the second fusion feature as the second conversion feature.
In some embodiments, the first conversion unit comprises:
the first conversion subunit is used for converting the second regularization features into pixel blocks to obtain at least two conversion pixel blocks;
the first synthesis subunit is used for synthesizing the at least two conversion pixel blocks based on the position information of each conversion pixel block in the first image to obtain a second feature map;
a first dividing subunit, configured to divide the second feature map into at least two middle pixel blocks having an overlapping region;
and the first conversion subunit is used for carrying out dimension conversion on each middle pixel block to obtain the feedforward conversion characteristic of each middle pixel block.
In some embodiments, the first conversion subunit is further configured to: perform linear conversion on the second regularization features to obtain at least two ascending-dimension features; and perform linear projection on each ascending-dimension feature by adopting a second preset numerical value to obtain the at least two conversion pixel blocks; the second preset value and the first preset value have a proportional relation.
In some embodiments, the first synthesis module 605 includes:
the second conversion submodule is used for performing linear conversion on each second conversion characteristic to obtain a fourth conversion characteristic;
the first reshaping submodule is used for reshaping each fourth conversion characteristic into a pixel block to obtain an updated pixel block set;
and the first synthesis submodule is used for synthesizing the updating pixel blocks in the updating pixel block set into the second image based on the position information of each updating pixel block in the first image.
In some embodiments, the first synthesis submodule comprises:
the third fusion unit is used for fusing the overlapped pixel blocks of the updated pixel blocks in the same overlapped region to obtain a fused pixel block;
a third determining unit configured to determine first position information of the fused pixel block and second position information of a non-overlapping pixel block based on position information of each of the updated pixel blocks in the first image;
the first splicing unit is used for splicing the fusion pixel block and the non-overlapped pixel block based on the first position information and the second position information to obtain a third feature map;
and the first decoding unit is used for decoding the third feature map to obtain the second image.
It should be noted that the above description of the embodiment of the apparatus, similar to the above description of the embodiment of the method, has similar beneficial effects as the embodiment of the method. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the image restoration method is implemented in the form of a software functional module and sold or used as a standalone product, the image restoration method may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a hard disk drive, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the present application further provides a computer program product, where the computer program product includes computer-executable instructions, and after the computer-executable instructions are executed, the image inpainting method provided by the embodiment of the present application can be implemented.
Accordingly, an embodiment of the present application further provides a computer storage medium, where computer-executable instructions are stored on the computer storage medium, and when the computer-executable instructions are executed by a processor, the image inpainting method provided by the foregoing embodiment is implemented.
Accordingly, an electronic device is provided in an embodiment of the present application, fig. 7 is a schematic view of a composition structure of the electronic device in the embodiment of the present application, and as shown in fig. 7, the electronic device 700 includes: a processor 701, at least one communication bus, a communication interface 702, at least one external communication interface, and a memory 703. Wherein communication interface 702 is configured to enable connectivity communications between these components. The communication interface 702 may include a display screen, and the external communication interface may include a standard wired interface and a wireless interface, among others. The processor 701 is configured to execute an image processing program in the memory to implement the image repairing method provided by the above embodiment.
The above descriptions of the embodiments of the image restoration apparatus, the computer device and the storage medium are similar to the descriptions of the method embodiments above, and have technical descriptions and beneficial effects similar to those of the corresponding method embodiments, which are not repeated here due to space limitations. For technical details not disclosed in the embodiments of the image restoration apparatus, the computer device and the storage medium of the present application, reference is made to the description of the method embodiments of the present application.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution order; the execution order of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The above serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of features does not include only those features but may include other features not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, reference to a feature identified by the phrase "comprising a …" does not exclude the presence of additional similar features in the process, method, article, or apparatus that comprises the feature.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, each unit may be regarded as one unit separately, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that all or part of the steps for implementing the method embodiments can be completed by hardware related to program instructions; the program can be stored in a computer-readable storage medium, and when executed, the program performs the steps of the method embodiments; and the aforementioned storage medium includes various media that can store program code, such as a removable memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may essentially, or in the part contributing to the prior art, be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.

The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto; any change or substitution that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. An image inpainting method, comprising:
acquiring a first image;
segmenting the first image into at least two blocks of pixels having overlapping regions;
performing dimension conversion on each pixel block to obtain a first conversion characteristic of each pixel block;
performing at least one fusion conversion processing on the first conversion characteristic to obtain a second conversion characteristic after the first conversion characteristic is repaired;
and synthesizing the pixel blocks corresponding to the second conversion characteristics to obtain a second image.
2. The method of claim 1, wherein the segmenting the first image into at least two blocks of pixels having overlapping regions comprises:
performing feature extraction on the first image to obtain a first feature map;
and dividing the first feature map into the at least two pixel blocks with overlapping areas.
3. The method of claim 2, wherein said partitioning the first feature map into the at least two pixel blocks having overlapping regions comprises:
determining a segmentation step size and a segmentation size based on the size of the first feature map; wherein the segmentation step size is smaller than the segmentation size;
and traversing the first feature map according to the segmentation step length, and segmenting the first feature map into at least two pixel blocks meeting the segmentation size.
4. The method according to any one of claims 1 to 3, wherein the performing the dimension conversion on each pixel block to obtain the first conversion characteristic of each pixel block comprises:
performing linear conversion on each pixel block to obtain a dimensionality reduction pixel block;
and performing linear projection on the dimensionality reduction pixel block by adopting a first preset numerical value to obtain the first conversion characteristic.
5. The method according to any one of claims 1 to 4, wherein the performing at least one fusion transformation process on the first transformed feature to obtain a second transformed feature after repairing the first transformed feature comprises:
performing attention fusion processing on the first conversion characteristic for the jth time to obtain a first fusion characteristic; wherein j = 1, 2, …, M, and M is the preset number of times of fusion conversion processing;
and performing the j-th feedforward fusion processing on the first fusion characteristic to obtain the second conversion characteristic.
6. The method according to claim 5, wherein the performing the jth attention fusion process on the first converted feature to obtain a first fused feature comprises:
regularizing the first conversion feature to obtain a first regularization feature;
processing the first regularization feature based on a multi-head attention mechanism to obtain a third conversion feature;
and fusing the third conversion characteristic and the first conversion characteristic to obtain the first fused characteristic.
7. The method according to claim 5, wherein the performing the j-th feedforward fusion process on the first fusion feature to obtain the second conversion feature comprises:
regularizing the first fusion feature to obtain a second regularization feature;
performing feedforward conversion on the second regularization characteristic to obtain a feedforward conversion characteristic;
fusing the feedforward conversion characteristic and the first fusion characteristic to obtain a second fusion characteristic;
determining the second fused feature as the second transformed feature.
8. The method of claim 7, wherein the feed-forward converting the second regularization feature to obtain a feed-forward converted feature comprises:
converting the second regularization features into pixel blocks to obtain at least two converted pixel blocks;
synthesizing the at least two conversion pixel blocks based on the position information of each conversion pixel block in the first image to obtain a second feature map;
dividing the second feature map into at least two intermediate pixel blocks having overlapping regions;
and carrying out dimension conversion on each intermediate pixel block to obtain the feedforward conversion characteristic of each intermediate pixel block.
9. The method of claim 8, wherein converting the second regularization feature into a block of pixels results in at least two converted blocks of pixels comprising:
performing linear conversion on the second regularization features to obtain at least two ascending-dimension features;
performing linear projection on each ascending-dimension characteristic by adopting a second preset numerical value to obtain at least two conversion pixel blocks; the second preset value and the first preset value have a proportional relation.
10. The method according to any one of claims 1 to 9, wherein the synthesizing of the pixel blocks corresponding to the second conversion feature to obtain the second image comprises:
performing linear conversion on each second conversion characteristic to obtain a fourth conversion characteristic;
reshaping each fourth conversion characteristic into a pixel block to obtain an updated pixel block set;
and synthesizing the updating pixel blocks in the updating pixel block set into the second image based on the position information of each updating pixel block in the first image.
11. The method according to claim 10, wherein said composing the update pixel blocks of the set of update pixel blocks into the second image based on the position information of each update pixel block in the first image comprises:
fusing overlapping pixel blocks of the updating pixel blocks in the same overlapping area to obtain a fused pixel block;
determining first position information of the fused pixel block and second position information of a non-overlapping pixel block based on position information of each update pixel block in the first image;
based on the first position information and the second position information, splicing the fusion pixel block and the non-overlapped pixel block to obtain a third feature map;
and decoding the third feature map to obtain the second image.
12. An image restoration apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring a first image;
a first segmentation module for segmenting the first image into at least two blocks of pixels having overlapping regions;
the first conversion module is used for carrying out dimension conversion on each pixel block to obtain a first conversion characteristic of each pixel block;
the first processing module is used for performing at least one fusion conversion processing on the first conversion characteristic to obtain a second conversion characteristic after the first conversion characteristic is repaired;
and the first synthesis module is used for synthesizing the pixel blocks corresponding to the second conversion characteristics to obtain a second image.
13. A computer storage medium having computer-executable instructions stored thereon, the computer-executable instructions being executable to implement the image inpainting method of any one of claims 1 to 11.
14. An electronic device, comprising a memory having computer-executable instructions stored thereon and a processor capable of implementing the image inpainting method of any one of claims 1 to 11 when the processor executes the computer-executable instructions on the memory.
CN202110576846.XA 2021-05-26 2021-05-26 Image restoration method and device, electronic equipment and storage medium Withdrawn CN113344807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110576846.XA CN113344807A (en) 2021-05-26 2021-05-26 Image restoration method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110576846.XA CN113344807A (en) 2021-05-26 2021-05-26 Image restoration method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113344807A true CN113344807A (en) 2021-09-03

Family

ID=77471526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110576846.XA Withdrawn CN113344807A (en) 2021-05-26 2021-05-26 Image restoration method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113344807A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146786A (en) * 2018-08-07 2019-01-04 北京市商汤科技开发有限公司 Scene chart generation method and device, electronic equipment and storage medium
CN109658401A (en) * 2018-12-14 2019-04-19 上海商汤智能科技有限公司 Image processing method and device, electronic equipment and storage medium
CN110288535A (en) * 2019-05-14 2019-09-27 北京邮电大学 A kind of image rain removing method and device
CN112712472A (en) * 2019-10-25 2021-04-27 北京三星通信技术研究有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
US20210150678A1 (en) * 2019-11-15 2021-05-20 Zili Yi Very high-resolution image in-painting with neural networks
CN111738968A (en) * 2020-06-09 2020-10-02 北京三快在线科技有限公司 Training method and device of image generation model and image generation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
宋娅菲; 谌雨章; 沈君凤; 曾张帆: "Underwater image reconstruction method based on improved residual network", 计算机科学 (Computer Science), no. 1 *
曹真; 杨云; 齐勇; 李程辉: "Image inpainting method based on multi-loss constraints and attention blocks", 陕西科技大学学报 (Journal of Shaanxi University of Science and Technology), no. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689328A (en) * 2021-09-13 2021-11-23 中国海洋大学 Image harmony system based on self-attention transformation
CN113689328B (en) * 2021-09-13 2024-06-04 中国海洋大学 Image harmony system based on self-attention transformation

Similar Documents

Publication Publication Date Title
Dong et al. Model-guided deep hyperspectral image super-resolution
Lim et al. DSLR: Deep stacked Laplacian restorer for low-light image enhancement
Ye et al. PMBANet: Progressive multi-branch aggregation network for scene depth super-resolution
Cao et al. Total variation regularized tensor RPCA for background subtraction from compressive measurements
Patwardhan et al. Video inpainting under constrained camera motion
GB2553782A (en) Predicting depth from image data using a statistical model
Zhou et al. Face parsing via a fully-convolutional continuous CRF neural network
Wang et al. Multi-direction dictionary learning based depth map super-resolution with autoregressive modeling
WO2014155290A1 (en) Enhancing motion pictures with accurate motion information
Li et al. Learning dual memory dictionaries for blind face restoration
Guo et al. Exploiting non-local priors via self-convolution for highly-efficient image restoration
Tomar et al. Deep hyfeat based attention in attention model for face super-resolution
Tu et al. Optical flow for video super-resolution: A survey
Yuan et al. A novel deep pixel restoration video prediction algorithm integrating attention mechanism
CN113344807A (en) Image restoration method and device, electronic equipment and storage medium
Wang et al. Perception-guided multi-channel visual feature fusion for image retargeting
Gupta et al. A robust and efficient image de-fencing approach using conditional generative adversarial networks
Hu et al. CNN-based deghosting in high dynamic range imaging
CN114565624B (en) Image processing method for liver focus segmentation based on multi-stage stereo primitive generator
CN115965531A (en) Model training method, image generation method, device, equipment and storage medium
Haris et al. An efficient super resolution based on image dimensionality reduction using accumulative intensity gradient
Dixit et al. A Review of Single Image Super Resolution Techniques using Convolutional Neural Networks
Li et al. A review of image colourisation
Zhuang et al. Dimensional transformation mixer for ultra-high-definition industrial camera dehazing
Kang et al. Lightweight Image Matting via Efficient Non-Local Guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210903