WO2022057837A1 - Image processing method and apparatus, portrait super-resolution reconstruction method and apparatus, and portrait super-resolution reconstruction model training method and apparatus, electronic device, and storage medium


Info

Publication number
WO2022057837A1
Authority
WO
WIPO (PCT)
Prior art keywords
image, training, resolution, super, processing
Application number
PCT/CN2021/118591
Other languages
French (fr)
Chinese (zh)
Inventor
侯剑堃
Original Assignee
广州虎牙科技有限公司
Priority claimed from CN202010977254.4A (published as CN114266697A)
Priority claimed from CN202011000670.5A (published as CN114298901A)
Application filed by 广州虎牙科技有限公司
Publication of WO2022057837A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof

Definitions

  • the present application relates to the technical field of computer vision, and in particular, to a method, apparatus, electronic device, and storage medium for image processing, super-resolution reconstruction of portraits, and model training.
  • Image super-resolution reconstruction, or image super-resolution restoration, refers to the process of restoring a given low-resolution image or image sequence to a corresponding high-resolution image through specific processing. It is widely used in fields where the quality of videos or images needs to be improved, such as video image processing, medical imaging, remote sensing imaging, and video surveillance.
  • Image super-resolution reconstruction technology is also widely used in many fields such as face recognition, big data analysis, and security, where it is of great help for portrait restoration, portrait recognition, and matching.
  • In image super-resolution reconstruction, for example in the super-resolution reconstruction of a human portrait, the method usually adopted is to reconstruct the entire image. Because this method does not focus on the information that is more important to human visual perception, the reconstructed image often fails to meet actual needs.
  • Embodiments of the present application provide an image processing and model training method, apparatus, electronic device, and storage medium, so as to improve the processing speed while ensuring the reconstruction effect.
  • Embodiments of the present application also provide a portrait super-resolution reconstruction method, a model training method, an apparatus, an electronic device, and a readable storage medium, which can improve the recognition of the obtained super-resolution image and meet user requirements.
  • Some embodiments of the present application provide an image processing method, the method may include:
  • the reconstructed feature map is enlarged by using the sub-pixel convolution layer of the image reconstruction model to obtain a reconstructed image.
  • the feature extraction network may include a convolutional layer, a plurality of cascaded blocks, and a plurality of first convolutional layers, with the cascaded blocks and the first convolutional layers arranged alternately; the feature extraction network can adopt a global cascade structure;
  • the step of using the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the to-be-processed image to obtain a reconstructed feature map may include:
  • the output of the last first convolutional layer is used as the reconstructed feature map.
  • the number of the cascaded blocks may be 3 to 5, and the number of the first convolutional layers may be 3 to 5.
  • the cascaded block may include a plurality of residual blocks and a plurality of second convolutional layers, with the residual blocks and the second convolutional layers arranged alternately; the cascaded block can adopt a local cascade structure;
  • the step of using the cascaded blocks to perform multi-scale feature extraction and outputting an intermediate feature map may include:
  • using the residual block to learn residual features and obtain a residual feature map;
  • channel-stacking the input of the cascaded block and the output of each of the residual blocks before the Nth second convolutional layer, and inputting the stacked result to the Nth second convolutional layer for convolution processing;
  • the output of the last second convolutional layer is used as the intermediate feature map.
  • the number of the residual blocks may be 3 to 5, and the number of the second convolutional layers may be 3 to 5.
  • the residual block may include a grouped convolutional layer, a third convolutional layer, and a fourth convolutional layer; the grouped convolutional layer adopts a ReLU activation function, the grouped convolutional layer and the third convolutional layer are connected to form a residual path, and the residual block can adopt a local skip connection structure;
  • the step of using the residual block to learn residual features to obtain a residual feature map may include:
  • the input of the residual block is used as the input of the grouped convolution layer, and features are extracted through the residual path;
  • performing feature fusion between the input of the residual block and the output of the third convolutional layer, inputting the fused result to the fourth convolutional layer for convolution processing, and outputting the residual feature map.
  • the step of using the sub-pixel convolution layer of the image reconstruction model to amplify the reconstructed feature map to obtain a reconstructed image may include:
  • Other embodiments of the present application also provide an image reconstruction model training method, and the method may include:
  • training samples include low-resolution images and high-resolution images, and the low-resolution images are obtained by down-sampling the high-resolution images;
  • the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer;
  • Back-propagation training is performed on the image reconstruction model based on the training reconstructed image, the high-resolution image and the preset objective function to obtain a trained image reconstruction model.
  • the objective function may be an L2 loss function
  • performing back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and the L2 loss function to adjust the parameters of the image reconstruction model until a preset training completion condition is reached, so as to obtain the trained image reconstruction model.
  • the image reconstruction model training method may further include:
  • the trained image reconstruction model is pruned to preserve long-line cascades and delete short-line cascades.
  • the method may further include:
  • performing self-subtracting mean processing on the low-resolution image to highlight the texture details of the low-resolution image.
  • the method may further include:
  • the step of inputting the low-resolution image into a pre-built image reconstruction model may include:
  • the step of using the feature extraction network to perform multi-scale feature extraction on the low-resolution image to obtain a training feature map may include:
  • Still other embodiments of the present application also provide an image processing apparatus, and the apparatus may include:
  • an image acquisition module which can be configured to acquire an image to be processed
  • the first execution module can be configured to input the image to be processed into an image reconstruction model, and use the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the to-be-processed image and expand image channels to obtain a reconstruction feature map;
  • the second execution module may be configured to use the sub-pixel convolution layer of the image reconstruction model to amplify the reconstructed feature map to obtain a reconstructed image.
  • Still other embodiments of the present application also provide an image reconstruction model training apparatus, the apparatus may include:
  • a sample acquisition module which can be configured to acquire training samples, where the training samples include low-resolution images and high-resolution images, and the low-resolution images are obtained by down-sampling the high-resolution images;
  • a first processing module which can be configured to input the low-resolution image into a pre-built image reconstruction model, the image reconstruction model including a feature extraction network and a sub-pixel convolution layer;
  • the second processing module can be configured to use the feature extraction network to perform multi-scale feature extraction on the low-resolution image and expand image channels to obtain a training feature map;
  • a third processing module may be configured to use the sub-pixel convolutional layer to amplify the training feature map to obtain a training reconstructed image
  • the fourth processing module may be configured to perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image and a preset objective function to obtain a trained image reconstruction model.
  • In the image processing and model training methods, apparatuses, electronic device, and storage medium provided by the embodiments of the present application, an image to be processed is acquired and input into an image reconstruction model.
  • the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer.
  • the feature extraction network is used to extract multi-scale features of the image to be processed and expand the image channels to obtain a reconstructed feature map, and the sub-pixel convolution layer is then used to enlarge the reconstructed feature map to obtain a reconstructed image. Since the feature extraction network can extract multi-scale features and expand image channels, a better reconstruction effect can be obtained without increasing the depth of the network, and the amount of computation and the number of parameters are greatly reduced, thereby improving the processing speed while ensuring the reconstruction effect.
  • Some embodiments of the present application provide a method for super-resolution reconstruction of a portrait, the method may include:
  • the super-resolution reconstruction process is performed using the image processing method described above.
  • the key point detection, super-resolution reconstruction processing, and restoration processing may include multiple rounds of iterative processing, and the image to be processed is an unprocessed image, or the super-resolution image obtained after the key point detection, super-resolution reconstruction processing, and restoration processing of a previous round of iteration.
  • the face key points may include a plurality of face key points, and the step of performing restoration processing on the image to be processed by using the high-frequency information of the image to obtain a super-resolution image corresponding to the image to be processed may include:
  • restoration processing is performed on the to-be-processed image to obtain a super-resolution image corresponding to the to-be-processed image.
  • the step of performing restoration processing on the image to be processed based on the position information of each of the face key points and the high-frequency information of the image to obtain a super-resolution image corresponding to the image to be processed may include:
  • restoration processing is performed on the corresponding face key points in the to-be-processed image.
  • the reconstructed model may include a discriminator and a generation network, and the generation network is obtained after training with training samples under the supervision of the trained discriminator.
  • the face key points may include the contours of the left eye, the right eye, the nose, the mouth and the chin.
  • Other embodiments of the present application provide a method for training a portrait super-resolution reconstruction model, and the method may include:
  • the training continues until a reconstructed model is obtained when a first preset condition is satisfied.
  • the step of comparing the output image and the target sample, performing network parameter adjustment on the generation network based on the comparison result, and continuing training until the reconstructed model is obtained when a first preset condition is satisfied may include:
  • the reconstruction model further includes a discriminator, and the discriminator is used to supervise the training of the generation network, and the method may further include:
  • the parameters of the discriminator are adjusted until the trained discriminator is obtained when the second preset condition is satisfied.
  • the step of comparing the output image and the target sample, performing network parameter adjustment on the generation network based on the comparison result, and continuing training until the reconstructed model is obtained when a first preset condition is satisfied may include:
  • the step of performing network parameter adjustment on the generation network according to the discrimination information and the comparison result, and continuing to train until the reconstructed model is obtained when a first preset condition is satisfied, may include:
  • constructing a third loss function based on the discrimination information of the discriminator about the output image, and constructing a fourth loss function based on the image difference between the output image and the target sample obtained by the pre-built portrait cognitive model;
  • the reconstructed model is obtained when the function value satisfies the first preset condition.
  • the device may include:
  • the detection module can be configured to use the pre-built reconstruction model to perform key point detection on the image to be processed to obtain face key points;
  • a processing module which can be configured to perform super-resolution reconstruction processing according to the face key points and image features obtained based on the to-be-processed image to obtain high-frequency image information
  • the restoration module may be configured to perform restoration processing on the to-be-processed image by using the high-frequency information of the image to obtain a super-resolution image corresponding to the to-be-processed image.
  • the apparatus for super-resolution reconstruction of a portrait may further include the above-described image processing apparatus, and the image processing apparatus may be configured to perform the super-resolution reconstruction processing.
  • Still other embodiments of the present application provide a portrait super-resolution reconstruction model training device, and the model training device may include:
  • an acquisition module which can be configured to acquire training samples and target samples corresponding to the training samples
  • a key point obtaining module which can be configured to perform key point detection on the training sample by using the constructed generation network to obtain training key points
  • an output image obtaining module which can be configured to perform super-resolution reconstruction processing and restoration processing based on the training key points and the training samples to obtain an output image
  • a training module which can be configured to compare the output image and the target sample, and adjust the network parameters of the generation network based on the comparison result and continue training until a reconstructed model is obtained when the first preset condition is satisfied .
  • the electronic device may include: one or more processors; and one or more storage media storing one or more machine-executable instructions that, when executed by the one or more processors, cause the one or more processors to implement the image processing method according to some embodiments, or the image reconstruction model training method according to other embodiments, or the portrait super-resolution reconstruction method according to still other embodiments, or the portrait super-resolution reconstruction model training method according to still other embodiments.
  • Still other embodiments of the present application provide a computer-readable storage medium storing machine-executable instructions that, when executed, implement the image processing method according to some embodiments, or the image reconstruction model training method according to other embodiments, or the portrait super-resolution reconstruction method according to still other embodiments, or the portrait super-resolution reconstruction model training method according to still other embodiments.
  • In the portrait super-resolution reconstruction method, key point detection is first performed on the image to be processed by using the pre-built reconstruction model to obtain the face key points, and super-resolution reconstruction processing is then performed according to the face key points and the image features obtained based on the image to be processed, so as to obtain the high-frequency information of the image.
  • the super-resolution reconstruction of the image is realized by combining face key point detection and face restoration, which improves the recognizability of the obtained super-resolution image and meets the needs of users in practical applications.
  • FIG. 1 shows an application scenario diagram of the image processing method provided by the embodiment of the present application.
  • FIG. 2 shows a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • FIG. 3 shows an example diagram of an image reconstruction model provided by an embodiment of the present application.
  • FIG. 4 shows an example diagram of a cascaded block provided by an embodiment of the present application.
  • FIG. 5 shows an example diagram of a residual block provided by an embodiment of the present application.
  • FIG. 6 shows another example diagram of an image reconstruction model provided by an embodiment of the present application.
  • FIG. 7 shows an image processing result presentation diagram provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a method for super-resolution reconstruction of a portrait provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a processing flow of a method for super-resolution reconstruction of a portrait provided by an embodiment of the present application.
  • FIG. 10 is another schematic diagram of a processing flow of the method for super-resolution reconstruction of a portrait provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of a method for obtaining a super-resolution image in the method for super-resolution reconstruction of a portrait provided by an embodiment of the present application.
  • FIG. 12 is another schematic diagram of a processing flow of the method for super-resolution reconstruction of a portrait provided by an embodiment of the present application.
  • FIG. 13 shows a schematic flowchart of an image reconstruction model training method provided by an embodiment of the present application.
  • FIG. 14 is a flowchart of a method for training a portrait super-resolution reconstruction model provided by an embodiment of the present application.
  • FIG. 15 is one of the flowcharts of a method for obtaining a reconstructed model in the method for training a super-resolution reconstruction model of a portrait provided by an embodiment of the present application.
  • FIG. 16 is the second flowchart of a method for obtaining a reconstructed model in the method for training a super-resolution reconstruction model of a portrait provided by an embodiment of the present application.
  • FIGS. 17(a) to 17(c) are schematic diagrams of output images obtained by the interpolation processing method, the method without the discriminator, and the method with the discriminator, respectively.
  • FIG. 18 shows a schematic block diagram of an image processing apparatus provided by an embodiment of the present application.
  • FIG. 19 shows a schematic block diagram of an apparatus for training an image reconstruction model provided by an embodiment of the present application.
  • FIG. 20 shows a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 21 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 22 is a block diagram of functional modules of an apparatus for super-resolution reconstruction of a portrait provided by an embodiment of the present application.
  • FIG. 23 is a block diagram of functional modules of an apparatus for training a super-resolution reconstruction model of a portrait provided by an embodiment of the present application.
  • Icons: 10-electronic equipment; 11-processor; 12-memory; 13-bus; 20-first terminal; 30-second terminal; 40-network; 50-server; 100-image processing device; 110-image acquisition module; 120-first execution module; 130-second execution module; 200-model training device; 210-sample acquisition module; 220-first processing module; 230-second processing module; 240-third processing module; 250-fourth processing module.
  • 2110-storage medium; 2120-processor; 2130-machine-executable instructions; 131-portrait super-resolution reconstruction device; 1311-detection module; 1312-processing module; 1313-restoration module; 132-model training device; 1321-acquisition module; 1322-key point acquisition module; 1323-output image acquisition module; 1324-training module; 140-communication interface.
  • FIG. 1 shows an application scenario diagram of the image processing method provided by the embodiment of the present application, including a first terminal 20, a second terminal 30, a network 40, and a server 50; the first terminal 20 and the second terminal 30 are each connected to the server 50 through the network 40.
  • the first terminal 20 and the second terminal 30 may be mobile terminals, and various application programs (Application, App) may be installed on the mobile terminals, for example, a video playing App, an instant messaging App, a video/image capturing App, a shopping App, and so on.
  • the network 40 may be a wide area network or a local area network, or a combination of the two, using a wireless link for data transmission.
  • the first terminal 20 and the second terminal 30 may be any mobile terminals having a screen display function, for example, a smart phone, a notebook computer, a tablet computer, a desktop computer, a smart TV, and the like.
  • the first terminal 20 may upload the video file or picture to the server 50, and the server 50 may store the video file or picture after receiving the video file or picture uploaded by the first terminal 20.
  • the second terminal 30 can request the video file or picture from the server 50 , and the server 50 can return the video file or picture to the second terminal 30 .
  • the video file or picture will typically have been compressed, so the resolution of the video file or picture is relatively low.
  • After receiving the video file or picture, the second terminal 30 can process it in real time by using the image processing method provided in the embodiments of the present application to obtain a high-resolution video or picture and display it in the display interface of the second terminal 30, thereby improving the user's picture quality experience.
  • the image processing method provided by the embodiment of the present application may be integrated into a video playback App or a gallery App of the second terminal 30 as a functional plug-in.
  • the first terminal 20 may be the mobile terminal of the host, and the second terminal 30 may be the mobile terminal of the viewer.
  • the first terminal 20 can upload the live video to the server 50, and the server 50 can store the live video.
  • When the second terminal 30 requests it, the server 50 can return the live video to the second terminal 30.
  • the second terminal 30 can process the live video in real time by using the image processing method provided in the embodiments of the present application to obtain a high-resolution live video and display it, so that the audience can watch the live video clearly.
  • the image processing method provided in this embodiment of the present application can be applied to a mobile terminal.
  • the above description takes application to the second terminal 30 as an example; it should be understood that the image processing method can also be applied to the first terminal 20.
  • the specific value can be determined according to the actual application scenario, which is not limited here.
  • FIG. 2 shows a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • the image processing method may include the following steps:
  • the image to be processed may be a picture displayed on the mobile terminal that needs super-resolution reconstruction to improve image quality, or a video frame in a video stream, for example, a video frame of the low-resolution video obtained by the second terminal 30 from the server 50.
  • the mobile terminal can perform super-resolution reconstruction directly when receiving a low-resolution picture or low-resolution video file; alternatively, it can first display the low-resolution picture or video file in the display interface and perform super-resolution reconstruction only after the user performs a resolution switching operation. For example, when a low-resolution video is received, it is played first, and super-resolution reconstruction is performed when the user switches the resolution from "standard definition" to "ultra high definition".
  • S102: Input the image to be processed into an image reconstruction model, and use a feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the image to be processed and expand image channels to obtain a reconstructed feature map.
  • the to-be-processed image is input into the image reconstruction model for super-resolution reconstruction.
  • the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer.
  • the feature extraction network is used to extract multi-scale features of the image to be processed and expand the image channels, and the sub-pixel convolution layer is used to enlarge the reconstructed feature map output by the feature extraction network.
  • Multi-scale feature extraction refers to extracting feature information at different levels by means of global cascade and local cascade.
  • feature extraction can be performed step by step from the bottom layer to the high layer, or the bottom layer information can be directly transferred to the high layer.
  • An image channel refers to one or more color channels after an image is divided according to color components.
  • according to the number of image channels, an image can be classified as a single-channel image, a three-channel image, or a four-channel image.
  • a single-channel image means that each pixel in the image is represented by only one value, such as a grayscale image; a three-channel image means that each pixel is represented by three values, such as an RGB color image; and a four-channel image is a three-channel image plus a transparency (alpha) channel.
  • Expanding image channels means increasing the number of channels of an image without changing the size of the image.
  • For the feature extraction network, the input is an image of H × W × C, where H × W is the size of the input image and C is the number of channels of the input image; the output is a feature map of H × W × r²C, where H × W is the size of the output feature map and r²C is its number of channels.
  • the sub-pixel convolution layer, also known as PixelShuffle, is a convolutional layer that can efficiently produce high-resolution feature maps. Compared with handcrafted upsampling filters such as bilinear or bicubic samplers, the sub-pixel convolution layer can be trained to learn more complex upsampling operations, while the overall computation time is reduced.
  • the main function of the sub-pixel convolution layer is to recombine the feature maps of the r² channels into a new upsampling result of (r × H) × (r × W) × C; that is, an output image of rH × rW × C is obtained, completing the r-fold enlargement from the input feature map to the output image.
  • the working process of the sub-pixel convolution layer can be as follows: first, each original low-resolution pixel is divided into r × r small grids; then, according to certain rules, these small grids are filled with the values at the corresponding positions of the r² input feature maps; the recombination is completed by filling the small grids of every low-resolution pixel in the same way.
  • a sub-pixel convolutional layer may be used to adjust pixel positions in the reconstructed feature map to obtain a reconstructed image.
  • For example, if the reconstructed feature map output by the feature extraction network is H × W × r²C, the sub-pixel convolution layer is used to adjust the pixel positions to obtain a reconstructed image of rH × rW × C, thereby completing the r-fold enlargement.
  • the sub-pixel convolutional layer can support multiple magnification sizes.
  • For example, a 4-times magnification operation can be accomplished with a combination of two 2-times sub-pixel convolution layers, and a combination of a 2-times and a 3-times sub-pixel convolution layer accomplishes a 6-times magnification operation.
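  • For illustration, the following is a minimal PyTorch sketch of the pixel rearrangement described above, assuming an upscale factor r and C output channels (the tensor shapes and variable names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

r, C = 2, 3                     # upscale factor and output channels (illustrative)
shuffle = nn.PixelShuffle(r)    # rearranges (r^2 * C, H, W) into (C, r*H, r*W)

x = torch.randn(1, r * r * C, 64, 64)   # feature map with r^2 * C channels
y = shuffle(x)
print(y.shape)                  # torch.Size([1, 3, 128, 128]) -- a 2x enlargement

# Chaining shuffles realizes larger factors: two 2x stages give a 4x enlargement,
# provided the input carries 16*C channels for the two stages to consume.
up4 = nn.Sequential(nn.PixelShuffle(2), nn.PixelShuffle(2))
print(up4(torch.randn(1, 16 * C, 64, 64)).shape)   # torch.Size([1, 3, 256, 256])
```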
  • Many existing super-resolution reconstruction algorithms first interpolate to high resolution and then make corrections, whereas the image reconstruction model in the embodiments of the present application places the enlarging sub-pixel convolution layer at the end, which ensures that the feature extraction network at the front of the model processes small-sized images, greatly reducing the amount of computation and the number of parameters.
  • Step S102 will be described in detail below.
  • the feature extraction network includes a convolutional layer, multiple cascaded blocks, and multiple first convolutional layers; the cascaded blocks and the first convolutional layers are arranged alternately, and the feature extraction network adopts a global cascade structure.
  • the global cascade structure refers to the left shortcut channels and the right shortcut channels in FIG. 3.
  • through the left shortcut channels, the output of each cascaded block can be directly sent to every first convolutional layer after that cascaded block, and through the right shortcut channels, the output of the convolutional layer can be directly sent to every first convolutional layer.
  • the transmission here refers to the superposition of channels, not the addition of data.
  • the feature extraction network of the image reconstruction model is used to perform multi-scale feature extraction on the image to be processed and expand the image channel to obtain the reconstructed feature map, which may include:
  • performing multi-scale feature extraction by using the cascaded blocks, and outputting intermediate feature maps;
  • the convolutional layer and the first convolutional layers can expand the image channels, and the convolutional layer, the cascaded blocks, and the first convolutional layers can all extract features.
  • the channel stacking of the initial feature map and the intermediate feature map refers to combining the channels of the initial feature map and the channels of the intermediate feature map.
  • Assume the initial feature map has 4 channels and the intermediate feature map has 8 channels; the feature map after stacking then has 12 channels. In other words, each pixel in the initial feature map is represented by 4 values, each pixel in the intermediate feature map is represented by 8 values, and each pixel in the feature map after channel stacking is represented by 12 values.
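  • In PyTorch terms, this channel stacking is concatenation along the channel dimension; the following minimal sketch mirrors the 4-channel and 8-channel example above (all shapes are illustrative):

```python
import torch

initial = torch.randn(1, 4, 32, 32)        # initial feature map: 4 channels
intermediate = torch.randn(1, 8, 32, 32)   # intermediate feature map: 8 channels

# channels are combined side by side; pixel values are not added together
stacked = torch.cat([initial, intermediate], dim=1)
print(stacked.shape)   # torch.Size([1, 12, 32, 32]): each pixel now has 12 values
```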
  • the structure of the cascaded block is shown in FIG. 4; the cascaded block includes multiple residual blocks and multiple second convolutional layers, the residual blocks and the second convolutional layers are arranged alternately, and the cascaded block adopts a local cascade structure.
  • the local cascade structure refers to the left shortcut channels and the right shortcut channels in FIG. 4.
  • through the left shortcut channels, the output of each residual block can be directly sent to every second convolutional layer after that residual block, and through the right shortcut channels, the input of the cascaded block can be directly sent to every second convolutional layer.
  • again, the transmission here refers to the superposition of channels, not the addition of data.
  • the manner of performing multi-scale feature extraction by using the cascaded block and outputting the intermediate feature map may include:
  • channel-stacking the input of the cascaded block and the output of each residual block before the Nth second convolutional layer, and inputting the stacked result into the Nth second convolutional layer for convolution processing;
  • the second convolutional layers can expand the image channels, and the residual blocks and the second convolutional layers can extract features.
  • the process of channel-stacking the input of the cascaded block and the outputs of the residual blocks is similar to the above-described process of channel-stacking the initial feature map and the intermediate feature map, and will not be repeated here.
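  • As a hedged sketch of the local cascade just described (a reading of FIG. 4 that assumes three residual blocks and 1 × 1 second convolutional layers; the module names and channel counts are assumptions, and a plain convolution stands in for the residual block, for which a sketch is given further below):

```python
import torch
import torch.nn as nn

class CascadeBlock(nn.Module):
    """Local cascade: the Nth 1x1 "second conv" fuses the block input
    channel-stacked with the outputs of all preceding residual blocks."""
    def __init__(self, channels=48, num_residual=3, make_residual=None):
        super().__init__()
        default = lambda: nn.Conv2d(channels, channels, 3, padding=1)
        self.residuals = nn.ModuleList(
            [(make_residual or default)() for _ in range(num_residual)])
        # the Nth second conv sees the block input plus N residual outputs
        self.second_convs = nn.ModuleList(
            [nn.Conv2d((i + 2) * channels, channels, 1)
             for i in range(num_residual)])

    def forward(self, x):
        stacked = [x]                # the block input joins every channel stack
        out = x
        for res, conv in zip(self.residuals, self.second_convs):
            out = res(out)
            stacked.append(out)
            out = conv(torch.cat(stacked, dim=1))  # superpose channels, then convolve
        return out
```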
  • the structure of the residual block is shown in FIG. 5.
  • the residual block may include a grouped convolutional layer, a third convolutional layer, and a fourth convolutional layer.
  • the grouped convolutional layer adopts a ReLU activation function; the grouped convolutional layer and the third convolutional layer are connected to form a residual path, and the residual block adopts a local skip connection structure.
  • the local skip connection structure refers to the fusion of the input of the residual block and the output of the residual path to learn residual features.
  • the manner of using the residual block to learn residual features and obtain the residual feature map may include:
  • performing feature fusion on the input of the residual block and the output of the third convolutional layer, inputting the fused result to the fourth convolutional layer for convolution processing, and outputting the residual feature map.
  • the third convolutional layer and the fourth convolutional layer can expand the image channel, and the grouped convolutional layer can extract features.
  • the grouped convolution (Group Convolution) layer groups the input feature maps, and each group is then convolved separately. Compared with regular convolution, grouped convolution reduces the model parameters, thereby increasing the processing speed of the model.
  • the number of grouped convolutional layers and the number of groups into which each grouped convolutional layer divides the input feature maps can be flexibly selected according to actual needs; for example, the number of grouped convolutional layers may be 2 and the number of groups may be 3.
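  • A hedged sketch of such a residual block, assuming two grouped convolutional layers with 3 groups and 1 × 1 third and fourth convolutions (the channel count of 48 is only chosen so that it divides evenly into 3 groups):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual path = grouped convolutions (with ReLU) plus a third conv;
    the block input is fused with the path output, then a fourth conv follows."""
    def __init__(self, channels=48, groups=3):  # channels must be divisible by groups
        super().__init__()
        self.residual_path = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),  # grouped conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),  # grouped conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),     # "third" conv closing the path
        )
        self.fourth_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        fused = x + self.residual_path(x)         # local skip connection (feature fusion)
        return self.fourth_conv(fused)

block = ResidualBlock()
print(block(torch.randn(1, 48, 32, 32)).shape)    # torch.Size([1, 48, 32, 32])
```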
  • the types of the convolutional layer, the first convolutional layer, the second convolutional layer, the third convolutional layer, and the fourth convolutional layer are not limited in this embodiment; they may be, for example, 1 × 1 pointwise convolutions or depthwise convolutions, and can be flexibly adjusted according to actual needs.
  • the expressiveness of the image reconstruction model increases with the complexity of the global cascade or the local cascade; that is, the greater the number of cascaded blocks and first convolutional layers in the feature extraction network, or the greater the number of residual blocks and second convolutional layers in the cascaded blocks, the more expressive the image reconstruction model will be.
  • However, the more complex the network structure, the slower the computation. Therefore, in order to improve the processing speed while ensuring the reconstruction effect, the number of each module should not be too large.
  • the number of cascaded blocks and the number of first convolutional layers in the feature extraction network can both be 3 to 5, the number of residual blocks and the number of second convolutional layers in the cascaded block can both be 3 to 5, and the number of grouped convolutional layers in the residual block can be 2 to 4.
  • For example, the feature extraction network can be set to include 3 cascaded blocks and 3 first convolutional layers, each cascaded block includes 3 residual blocks and 3 second convolutional layers, and each residual block includes 2 grouped convolutional layers.
  • In addition, the modules within the cascaded blocks can be set to share parameters, that is, the multiple residual blocks share parameters and the multiple second convolutional layers share parameters, so that the image reconstruction model can be made even more lightweight and the processing speed further improved.
  • In FIG. 7, the left picture and the middle picture are reconstructed images obtained by the image processing method provided by the embodiments of the present application, where the left picture does not share parameters and the middle picture shares parameters; the right picture is the reconstructed image obtained by the bicubic interpolation (Bicubic) algorithm.
  • FIG. 8 shows a schematic flowchart of a method for super-resolution reconstruction of a portrait.
  • the detailed steps of the portrait super-resolution reconstruction method are introduced as follows.
  • Step S110: Use a pre-built reconstruction model to perform key point detection on the image to be processed to obtain face key points.
  • Step S120: Perform super-resolution reconstruction processing according to the face key points and the image features obtained based on the image to be processed to obtain image high-frequency information.
  • Step S130: Perform restoration processing on the image to be processed by using the high-frequency information of the image to obtain a super-resolution image corresponding to the image to be processed.
  • the image to be processed may be an image with low definition; in such images the face is often unclear, which creates obstacles for tasks such as image recognition and image matching.
  • the image to be processed may be a face image collected by a monitoring device, or a face image obtained by taking a screenshot of a web page, or a host's face image collected during a live broadcast, and so on.
  • the constructed reconstruction model may be used to first perform key point detection on the image to be processed to obtain the face key points.
  • the obtained face key points may include the left eye, the right eye, the nose, the mouth, and the chin contour. Based on these face key points, the outline of the face can be roughly traced, and they cover the parts of the face that are most important to human visual cognition.
  • the key points of the face are obtained by means of key point detection, and super-resolution reconstruction processing is then performed by combining the face key points with the obtained image features, so as to obtain the high-frequency information of the image.
  • the high-frequency information of an image mainly embodies the information at the edges and contours in the image, while the slowly changing grayscale within the contours is the low-frequency information.
  • the high-frequency information of the image can reflect the information of the relative change area, so it is very important for the reconstruction of the image.
  • the image processing method according to the present application described in conjunction with FIG. 2 may be used to perform super-resolution reconstruction processing of the image, so as to obtain the high-frequency information of the image.
  • the image high-frequency information obtained in the exemplary embodiments of the present application is local information in the face image, so the image high-frequency information needs to be restored to the image to be processed, thereby performing restoration processing on the image to be processed and obtaining the super-resolution image corresponding to it.
  • In the portrait super-resolution reconstruction method described above, the key points of the face are detected, the image high-frequency information is obtained by using the face key points and the image features, and the image to be processed is then restored by using the image high-frequency information; in this way, the recognizability of the obtained super-resolution image can be improved, meeting the needs of users in practical applications.
  • the above-mentioned key point detection, super-resolution reconstruction processing and restoration processing may include multiple rounds of iterative processing.
  • the above image to be processed may be an unprocessed image to be processed, or a super-resolution image obtained after the key point detection, super-resolution reconstruction processing and restoration processing in the previous iteration.
  • In the first round, the above-mentioned key point detection, super-resolution reconstruction processing, and restoration processing yield the super-resolution image SRFace (Super Resolution Face) of this round. Then, on the basis of the obtained super-resolution image, the above-mentioned key point detection, super-resolution reconstruction processing, and restoration processing are performed again to obtain the super-resolution image after the second round of iteration. Following this processing logic, the final super-resolution image is obtained after multiple iterations, once certain requirements are met.
  • the image high-frequency information can be obtained, and the first-round super-resolution image Face SR1 is then obtained according to the image high-frequency information and the input image (Input).
  • key points are detected for Face SR1, and the corresponding face key points Face Points 1 are obtained.
  • the image high-frequency information is obtained, and the second-round super-resolution image Face SR2 is then obtained according to the image high-frequency information and Face SR1.
  • the final super-resolution image Face SR N can be obtained after N iterations (where N is the preset number of iterations at which processing stops, or the image obtained after N iterations meets the preset requirements).
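  • The recursion can be summarized by the following sketch, in which detect_keypoints, reconstruct_hf, and restore are hypothetical placeholders for the three stages described above (none of these names come from the patent):

```python
def portrait_super_resolution(lr_face, model, num_iters=3):
    """Recursive portrait SR: each round detects key points on the previous
    round's output, rebuilds high-frequency detail, and restores the image."""
    sr = lr_face                                        # round 0: the unprocessed LR face
    for _ in range(num_iters):                          # parameters shared across rounds
        key_points = model.detect_keypoints(sr)         # Face Points i
        hf_info = model.reconstruct_hf(key_points, sr)  # image high-frequency information
        sr = model.restore(sr, hf_info)                 # Face SR(i+1)
    return sr                                           # Face SR N
```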
  • the image obtained by the previous round of processing is used as the detection object to perform multiple loop processing in a recursive manner, which can continuously improve the quality of the obtained super-resolution image.
  • model parameters in multiple loops can be shared, thereby making the model more lightweight and providing support for applying the model to devices with weak processing capabilities, such as mobile terminals.
  • Within a certain range, the network width, that is, the number of feature extraction channels, can be increased preferentially instead of the network depth, that is, the number of network layers; combined with the recursive processing method, this can improve the recognition accuracy of the model.
  • the detected face key points include a plurality of face key points, and the image high-frequency information obtained based on the face key points and the image features is used to restore the image to be processed.
  • the above-mentioned restoration processing can be implemented by the following steps:
  • Step S131: Process the image to be processed by using a pre-built portrait cognitive model, and output the position information of each of the face key points.
  • Step S132: Perform restoration processing on the corresponding face key points in the image to be processed according to each face key point, its corresponding position information, and the image high-frequency information, to obtain the super-resolution image corresponding to the image to be processed.
  • a neural network model can be constructed, and the neural network model can be, for example, a convolutional neural network model (Convolutional Neural Networks, CNN) or the like.
  • Multiple training samples can be collected, where each training sample contains a face image and the face key points in each face image carry position information; the position information can be the position of each face key point in the face area.
  • the face area can also be mapped into the coordinate system, and the coordinate value of the key point of the face in the coordinate system is used as its position information.
  • the constructed neural network model is trained with the training samples to obtain a portrait cognitive model that meets the requirements.
  • Using the trained portrait cognitive model, the face key points in the image to be processed (LR Face), such as the left eye, right eye, nose, mouth, and chin contour, can be identified and obtained, and restoration processing is then performed based on the position information of the obtained face key points and the high-frequency information of the corresponding face key points contained in the image high-frequency information, so as to obtain the final super-resolution image SR Face.
  • Obtaining the position information of each face key point with the portrait cognitive model allows the restoration to be performed accurately at the corresponding position in the image based on the position of each face key point, avoiding displacement of the restored face key points.
  • In restoration, the specific restoration requirements of different face key points often differ. For example, for the eyes, it is hoped that the restored eyes will be brighter, while for the chin contour, it may be desired that the restored chin contour be more clearly defined.
  • the restoration attribute corresponding to each face key point may be obtained first, where the restoration attribute is information describing the different requirements of the restoration processing described above. Then, restoration processing is performed on the image to be processed according to the position information of each face key point, the restoration attribute, and the image high-frequency information, so as to obtain the corresponding super-resolution image.
  • By distinguishing each face key point and restoring it independently based on its corresponding position information and restoration attributes, not only can the specific restoration requirements of different face key points be satisfied, but the reconstruction model can also process the key points synchronously based on grouped convolution, which can greatly reduce the processing time.
  • the super-resolution reconstruction process is implemented by using a reconstruction model constructed and trained in advance.
  • the model training method provided in the embodiments of the present application can be applied to any electronic device with an image processing function, for example, a server, a mobile terminal, a general-purpose computer, or a special-purpose computer.
  • FIG. 13 shows a schematic flowchart of an image reconstruction model training method provided by an embodiment of the present application.
  • the model training method may include the following steps:
  • training samples include low-resolution images and high-resolution images, and the low-resolution images are obtained by down-sampling the high-resolution images.
  • the training samples here form a dataset: a large number of high-resolution images (for example, images whose resolution is higher than a certain preset value) can be obtained as original samples, and these high-resolution images can be various types of pictures or video frames, for example, frames of high-definition live video in a live streaming scenario.
  • down-sampling is then performed on the original samples, that is, each high-resolution image is down-sampled in the same way to obtain the training samples.
  • the way of downsampling processing can be bicubic interpolation or the like.
  • S202: Input the low-resolution image into a pre-built image reconstruction model, where the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer.
  • steps S203-S204 are similar to the processing procedures of steps S102-S103, and are not repeated here.
  • the objective function may be an L2 loss function, also called the mean square error (Mean Square Error, MSE) function, which is a type of regression loss function.
  • the curve of the L2 loss function is smooth, continuous, and differentiable everywhere, which is convenient for the gradient descent algorithm; and as the error decreases, the gradient also decreases, which is conducive to convergence, so that even with a fixed learning rate the function can converge to the minimum value quickly.
  • back-propagation training can be performed on the image reconstruction model based on the training reconstructed image, the high-resolution image, and the L2 loss function, so as to adjust the parameters of the image reconstruction model until a preset training completion condition is reached, thereby obtaining the trained image reconstruction model.
  • the training completion condition can be that the number of iterations reaches a set value (for example, 2000 times), or the L2 loss function converges to a minimum value, etc., which is not limited here and can be set according to actual needs.
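  • A minimal sketch of one such back-propagation step with PyTorch's built-in MSE (L2) loss, assuming the model and optimizer have already been constructed (all names are illustrative):

```python
import torch.nn as nn

def train_step(model, optimizer, lr_image, hr_image):
    """One back-propagation step with the L2 (MSE) objective."""
    criterion = nn.MSELoss()
    optimizer.zero_grad()
    sr_image = model(lr_image)            # training reconstructed image
    loss = criterion(sr_image, hr_image)  # L2 distance to the high-resolution target
    loss.backward()                       # back-propagate gradients
    optimizer.step()                      # adjust the model parameters
    return loss.item()

# a possible completion condition: a fixed iteration budget, e.g.
# for step in range(2000):
#     train_step(model, optimizer, lr_batch, hr_batch)
```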
  • After training, the image reconstruction model can be pruned according to the requirements and test results, retaining the long-line cascades and deleting the short-line cascades, thereby reducing excessive jumps in the middle and making the model more lightweight.
  • the low-resolution image can also be preprocessed first and then input into the image reconstruction model, and the preprocessing can be a self-subtracting mean operation on the image. Therefore, before step S202, the model training method may further include:
  • performing self-subtracting mean processing on the low-resolution image to highlight the texture details of the low-resolution image.
  • the self-subtracting mean processing can leave the foreground in the image unprocessed while subtracting the mean pixel value of the background image from each pixel in the background, thereby enhancing the contrast between the background part and the foreground part and highlighting the texture details.
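  • A minimal sketch of this self-subtracting mean preprocessing, assuming a boolean background_mask with the same shape as the image is available (the mask and all names are assumptions for illustration):

```python
import torch

def subtract_background_mean(image, background_mask):
    """Subtract the background's mean pixel value from background pixels only,
    leaving the foreground untouched, to raise contrast and highlight texture."""
    out = image.clone()
    bg = image[background_mask]            # background pixel values
    out[background_mask] = bg - bg.mean()  # self-subtracting mean on the background
    return out
```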
  • In order for the feature extraction network to extract more features, the preprocessing can also apply flip/rotation symmetry operations to the image before inputting it into the model, then apply the reverse flip/rotation symmetry to the model outputs and compute their average, thereby reducing the deviation of some feature layers or parameters caused by anisotropy. Therefore, before step S202, the model training method may further include:
  • inputting at least one processed low-resolution image into the image reconstruction model, and using the feature extraction network to perform multi-scale feature extraction on the at least one processed low-resolution image to obtain at least one auxiliary feature map; performing reverse flip-symmetry processing on the auxiliary feature maps, and averaging the values after the reverse flip-symmetry processing to obtain the training feature map.
  • For example, for a low-resolution image of n × n, rotate it 3 times in the clockwise direction, 90° each time, so that 4 images of n × n are obtained; the 4 images of n × n are then input into the image reconstruction model, and the feature extraction network outputs 4 auxiliary feature maps; the corresponding 3 auxiliary feature maps are then rotated by 90°, 180°, and 270° in the counterclockwise direction; finally, pixel averaging is performed on the 4 processed auxiliary feature maps to obtain the final training feature map.
  • When both preprocessing operations are used, the low-resolution image can be subjected to the self-subtracting mean processing first and then flipped symmetrically, or flipped symmetrically first and then subjected to the self-subtracting mean processing; this can be set flexibly according to actual needs and is not limited here.
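  • A hedged sketch of this rotate-and-average procedure using torch.rot90 (the rotation direction does not affect the result, as long as the inverse rotation is applied to the outputs; all names are illustrative):

```python
import torch

def rotation_ensemble(model, lr_image):
    """Run the model on the image and its three 90-degree rotations, rotate the
    outputs back, and average them to reduce anisotropy-induced bias."""
    outputs = []
    for k in range(4):                                    # 0, 90, 180, 270 degrees
        rotated = torch.rot90(lr_image, k, dims=(2, 3))   # NCHW spatial rotation
        restored = torch.rot90(model(rotated), -k, dims=(2, 3))  # rotate back
        outputs.append(restored)
    return torch.stack(outputs).mean(dim=0)               # pixel-wise average
```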
  • A new model can also be trained on the basis of an already trained model. For example, when training 3-times and 4-times magnification models, assuming that the 2-times magnification model has been trained, the parameters of the 2-times magnification model can be used as the initial parameters of the 3-times and 4-times magnification models, and training continues on this basis.
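  • One way to realize this initialization in PyTorch, assuming the 2-times and 3-times models share their feature extraction layers (the file name is hypothetical; strict=False simply skips parameters whose names or shapes differ, such as the scale-specific upsampling stage):

```python
import torch

def init_from_2x(model_3x, path_2x="model_x2.pth"):
    """Initialize a 3x (or 4x) model from trained 2x weights and train from there."""
    state = torch.load(path_2x, map_location="cpu")
    model_3x.load_state_dict(state, strict=False)  # load only matching parameters
    return model_3x
```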
  • FIG. 14 shows a schematic flowchart of a method for training a portrait super-resolution reconstruction model provided in an embodiment of the present application.
  • the method for training a portrait super-resolution reconstruction model includes:
  • Step S2100: Acquire training samples and target samples corresponding to the training samples;
  • Step S2200: Use the constructed generation network to perform key point detection on the training samples to obtain training key points;
  • Step S2300: Perform super-resolution reconstruction processing and restoration processing based on the training key points and the training samples to obtain an output image;
  • Step S2400: Compare the output image with the target sample, adjust the network parameters of the generation network based on the comparison result, and continue training until a reconstructed model is obtained when a first preset condition is satisfied.
  • In this way, the reconstruction accuracy of the obtained reconstruction model can be improved.
  • a plurality of training samples are collected in advance, and each training sample may be a sample image including a face image with lower definition.
  • the target sample corresponding to a training sample is the sample that meets the requirements, that is, the high-definition sample expected to be obtained after the training sample is processed.
  • the pre-built generation network may be a recursive loop network, and for the process of using the generation network to perform key point detection, super-resolution reconstruction processing, and restoration processing on the training samples, reference can be made to the above description.
  • After processing, the generation network can output the output images corresponding to the training samples.
  • the target sample is used as a comparison standard for the processing quality of the generation network.
  • the generation network can be continuously trained according to the comparison result, so that the difference between the output image and the target sample is reduced to meet the requirements.
  • the reconstructed model is obtained.
  • the samples may be preprocessed, for example by the self-subtracting mean method, so as to bring out the image texture details and improve the effect of subsequent processing and recognition.
• the preprocessed samples can also be flipped symmetrically and then input to the generation network.
• correspondingly, the output results of the relevant network layers of the generation network can be reverse-flipped and averaged. In this way, the deviation of some network layers or parameters caused by anisotropy can be reduced.
  • the network can be pruned according to the requirements and the results of the test to retain several previous cycles that have a greater impact on the results.
• these measures improve the reconstruction accuracy of the resulting generative network, and the peak signal-to-noise ratio and structural similarity of the subsequently processed images can also be greatly improved.
• a loss function may be constructed to supervise the training of the generative network.
• step S2400 of the method for training a portrait super-resolution reconstruction model of the present application can be implemented in the following way:
• Step S2410: constructing a first loss function based on the difference between the pixel information of the output image and the pixel information of the target sample;
• Step S2420: constructing a second loss function based on the differences between each face key point in the output image and the corresponding face key point in the target sample;
• Step S2430: comparing the output image and the target sample, adjusting the network parameters of the generation network based on the comparison result, and continuing training until the reconstructed model is obtained when the weighted function value of the first loss function and the second loss function satisfies the first preset condition.
  • the first loss function and the second loss function may be constructed to comprehensively evaluate the training of the generative network.
• the first loss function evaluates the training from the perspective of pixel differences between images.
• on this basis, a second loss function constructed from the difference information between face key points is added.
  • the first loss function represents the overall pixel-level Euclidean distance between the output image of the generation network and the target sample (that is, the desired output effect)
• the second loss function represents the Euclidean distance between the face key points detected by the generation network in the output image and the corresponding face key points in the target sample (the desired output effect).
• the above-mentioned first loss function and second loss function are weighted and combined to jointly serve as the loss function of the generation network, as sketched below.
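• a hedged sketch of this weighted combination; the weights w_pix and w_kpt are illustrative values, not values fixed by this application:

    import torch

    def generator_loss(output, target, out_kpts, tgt_kpts, w_pix=1.0, w_kpt=0.1):
        # First loss: overall pixel-level Euclidean distance between the
        # generator output and the target sample.
        loss_pix = torch.norm(output - target, p=2)
        # Second loss: Euclidean distance between the detected face key points
        # and the corresponding key points in the target sample.
        loss_kpt = torch.norm(out_kpts - tgt_kpts, p=2)
        return w_pix * loss_pix + w_kpt * loss_kpt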
  • the function value of the comprehensive loss function including the first loss function and the second loss function is calculated by comparing the output image and the target sample.
  • the reconstructed model is obtained when the obtained function value satisfies the first preset condition.
  • the first preset condition may be that the value of the loss function no longer decreases to achieve convergence, or that the value of the loss function is lower than a preset value.
  • the training can be stopped to obtain a reconstructed model.
• in this embodiment, the first loss function constructed based on the difference in pixel information and the second loss function based on the differences between face key points are used to supervise and evaluate the training of the reconstruction model, which can improve the performance of the reconstruction model in subsequent applications.
• by applying the reconstruction model obtained by training the above generative network to the reconstruction of the to-be-processed image described above, the recognizability of the obtained super-resolution image can be improved.
• the reconstruction model in the method for training a portrait super-resolution reconstruction model includes a generation network, which is a pre-trained model constructed to process low-resolution images and output corresponding super-resolution images.
  • the reconstruction model may further include a discriminator, and the discriminator may be used to supervise the training of the generation network. Therefore, in this embodiment, the generation network is a generation network obtained after training with training samples under the supervision of the trained discriminator.
  • the method for training a portrait super-resolution reconstruction model according to the present application further comprises the following steps:
• a discriminator is constructed and used to perform discrimination processing on the output image and the target sample corresponding to the output image; the parameters of the discriminator are then adjusted according to the obtained discrimination results until the trained discriminator is obtained when a second preset condition is satisfied.
• the main principle of the discriminator is to discriminate a real image (that is, a high-resolution image that meets the requirements) as real as far as possible (for example, outputting a discrimination result of 1), and to judge the output image of the generation network as false as far as possible (for example, outputting a discrimination result of 0). In this way, the generation network can be supervised for continuous training until the discriminator finally judges the output image of the generation network as true. That is, the discriminator acts as the supervisor of the generation network, continuously optimizing its training.
• a loss function of the discriminator may be pre-built, and this loss function may be composed of the discriminator's discrimination information on the output image of the generation network and its discrimination information on the target sample.
  • the training process of the discriminator is the process of minimizing the above-mentioned loss function.
• when the value of the above loss function no longer decreases and convergence is achieved, it can be determined that the training of the discriminator satisfies the second preset condition, and the trained discriminator is obtained.
• during subsequent training of the generation network, the parameters of the discriminator can be fixed. A sketch of one discriminator update step follows.
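• one possible discriminator update step, sketched with a binary cross-entropy objective; the application describes the 1/0 discrimination targets but does not fix the exact loss form:

    import torch
    import torch.nn.functional as F

    def discriminator_step(disc, fake_img, real_img, optimizer):
        pred_real = disc(real_img)           # should be judged real (toward 1)
        pred_fake = disc(fake_img.detach())  # generator output, judged fake (toward 0)
        loss = (F.binary_cross_entropy(pred_real, torch.ones_like(pred_real))
                + F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Second preset condition: training stops once this value converges.
        return loss.item()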
  • a discriminator is added to the reconstruction model to form an adversarial network including the discriminator and the generation network, which can further improve the reconstruction effect of the obtained reconstruction model.
• the training and adjustment of the generation network may take the discrimination information of the discriminator into account.
• step S2400 in the training method of the portrait super-resolution reconstruction model of the present application may include the following sub-steps:
• Step S2410': inputting the output image to the trained discriminator to obtain discrimination information;
• Step S2420': comparing the output image and the target sample to obtain a comparison result;
• Step S2430': after adjusting the network parameters of the generation network according to the discrimination information and the comparison result, continuing training until a reconstructed model is obtained when the first preset condition is satisfied.
  • the difference between the output image and the target sample and the discriminator's discriminant information on the output image can be combined to adjust the training of the generation network.
  • the construction of the loss function may be performed in the following manner, and the reconstruction model training is performed by using the constructed loss function:
• a third loss function is constructed based on the discriminator's discrimination information on the output image, and a fourth loss function is constructed based on the image difference between the output image and the target sample obtained by the pre-built portrait cognition model; training then continues until the reconstructed model is obtained when the weighted function value of the loss functions satisfies the first preset condition.
  • the influence of the difference between the output image and the target sample on the adjustment of the generation network can be represented by the first loss function and the second loss function.
  • the influence of the discriminator's discriminant information on the output image on the training adjustment of the generation network can be represented by the third loss function.
  • a fourth loss function constructed from the image difference between the output image obtained by the portrait cognition model and the target sample can also be added.
• the above-mentioned first loss function is constructed based on the difference between the pixel information of the output image and the pixel information of the corresponding target sample, and the second loss function is constructed from the differences between each face key point in the output image and the corresponding face key point in the target sample. Since the purpose of constructing the discriminator to supervise the training of the generation network is for the output image of the generation network to finally be judged as true by the discriminator, the third loss function is constructed from the discriminator's discrimination information on the output image. The fourth loss function is constructed from the difference in facial features between the output image and the target sample, as extracted by the portrait cognition model.
• the final loss function of the generation network can be obtained by a weighted combination of the above-mentioned first, second, third and fourth loss functions.
  • the network parameters can be adjusted for the generation network according to the discrimination information of the discriminator and the comparison result between the output image and the target sample, and then the training can be continued.
• training is adjusted according to the calculated function value of the combined loss function, until the function value weighted from the first loss function, the second loss function, the third loss function and the fourth loss function satisfies the first preset condition, at which point the trained reconstruction model is obtained. A sketch of the combined loss follows.
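• a hedged sketch of the four-term weighted loss; `cognition` stands in for the pre-built portrait cognition model used as a feature extractor, and all weights are illustrative:

    import torch
    import torch.nn.functional as F

    def total_generator_loss(output, target, out_kpts, tgt_kpts, disc, cognition,
                             w1=1.0, w2=0.1, w3=1e-3, w4=1e-2):
        l1 = torch.norm(output - target, p=2)        # first loss: pixel difference
        l2 = torch.norm(out_kpts - tgt_kpts, p=2)    # second loss: key-point difference
        pred = disc(output)                          # third loss: be judged true
        l3 = F.binary_cross_entropy(pred, torch.ones_like(pred))
        l4 = F.mse_loss(cognition(output), cognition(target))  # fourth: facial features
        return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4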
• referring to FIGS. 17(a) to 17(c): FIG. 17(a) is an image obtained after conventional interpolation processing, FIG. 17(b) is an image obtained by an embodiment of the present application without a discriminator, and FIG. 17(c) is an image obtained by an embodiment of the present application with a discriminator added.
  • the image obtained under the solution of the present application has significantly higher definition and better effect than the conventional interpolation processing method.
• the image obtained with the discriminator added appears clearer to human perception than the image obtained without the discriminator.
  • FIG. 18 shows a schematic block diagram of an image processing apparatus 100 provided by an embodiment of the present application.
  • the image processing apparatus 100 is applied to a mobile terminal, and includes an image acquisition module 110 , a first execution module 120 and a second execution module 130 .
  • the image acquisition module 110 may be configured to acquire images to be processed.
  • the first execution module 120 may be configured to input the image to be processed into the image reconstruction model, and use the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the image to be processed and expand image channels to obtain a reconstructed feature map.
• the feature extraction network includes a convolutional layer, a plurality of concatenated blocks and a plurality of first convolutional layers; the plurality of concatenated blocks and the plurality of first convolutional layers are alternately arranged, and the feature extraction network adopts a global cascade structure;
• the first execution module 120 may be specifically configured to: input the image to be processed into the convolutional layer for convolution processing to obtain an initial feature map; use the initial feature map as the input of the first concatenated block and the output of the (N-1)-th first convolutional layer as the input of the N-th concatenated block, perform multi-scale feature extraction with the concatenated blocks, and output intermediate feature maps; channel-stack the initial feature map and the intermediate feature maps output by each concatenated block before the N-th first convolutional layer, and after stacking, input them to the N-th first convolutional layer for convolution processing; and use the output of the last first convolutional layer as the reconstructed feature map. A sketch of this global cascade follows.
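• a minimal sketch of this global cascade structure, assuming 3-channel inputs, 64 feature channels and 1×1 convolutions as the "first convolutional layers"; `block_fn` builds a concatenated block (a concrete sketch of that block follows the residual-block description below):

    import torch
    import torch.nn as nn

    class FeatureExtractionNet(nn.Module):
        def __init__(self, block_fn, channels=64, n_blocks=3):
            super().__init__()
            self.entry = nn.Conv2d(3, channels, 3, padding=1)   # initial convolution
            self.blocks = nn.ModuleList(block_fn(channels) for _ in range(n_blocks))
            # The N-th first convolutional layer fuses the initial feature map
            # plus the intermediate maps of all preceding concatenated blocks.
            self.first_convs = nn.ModuleList(
                nn.Conv2d(channels * (i + 2), channels, 1) for i in range(n_blocks))

        def forward(self, x):
            feat = self.entry(x)                      # initial feature map
            cascade, out = [feat], feat
            for block, conv in zip(self.blocks, self.first_convs):
                cascade.append(block(out))            # intermediate feature map
                out = conv(torch.cat(cascade, dim=1)) # channel stacking + fusion
            return out                                # reconstructed feature map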
• the concatenated block includes multiple residual blocks and multiple second convolutional layers; the multiple residual blocks and multiple second convolutional layers are alternately arranged, and the concatenated block adopts a local cascade structure;
• the first execution module 120 may perform multi-scale feature extraction using the concatenated blocks and output an intermediate feature map by: taking the input of the concatenated block as the input of the first residual block and the output of the (N-1)-th second convolutional layer as the input of the N-th residual block, and learning residual features with the residual blocks to obtain residual feature maps;
• channel-stacking the input of the concatenated block and the output of each residual block before the N-th second convolutional layer, and after stacking, inputting them to the N-th second convolutional layer for convolution processing; and using the output of the last second convolutional layer as the intermediate feature map.
• the residual block includes a grouped convolutional layer, a third convolutional layer and a fourth convolutional layer; the grouped convolutional layer adopts a ReLU activation function, the grouped convolutional layer and the third convolutional layer are connected to form a residual path, and the residual block adopts a local skip connection structure;
• the first execution module 120 may learn residual features with the residual block to obtain the residual feature map by: using the input of the residual block as the input of the grouped convolutional layer and extracting features through the residual path;
• performing feature fusion of the input of the residual block and the output of the third convolutional layer, inputting the fused result to the fourth convolutional layer for convolution processing, and outputting the residual feature map. A sketch of the residual and concatenated blocks follows.
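• a matching sketch of the residual block and the concatenated block under the same assumptions; the number of groups in the grouped convolution is an illustrative choice, and the feature fusion on the skip path is realized here as element-wise addition:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels, groups=4):  # groups=4 is illustrative
            super().__init__()
            self.grouped = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
            self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv4 = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, x):
            res = self.conv3(torch.relu(self.grouped(x)))  # residual path with ReLU
            return self.conv4(x + res)  # local skip connection, then 4th convolution

    class CascadeBlock(nn.Module):
        # Local cascade: residual blocks alternating with 1x1 "second" conv layers.
        def __init__(self, channels, n_res=3):
            super().__init__()
            self.res_blocks = nn.ModuleList(ResidualBlock(channels) for _ in range(n_res))
            self.second_convs = nn.ModuleList(
                nn.Conv2d(channels * (i + 2), channels, 1) for i in range(n_res))

        def forward(self, x):
            cascade, out = [x], x
            for res, conv in zip(self.res_blocks, self.second_convs):
                cascade.append(res(out))              # residual feature map
                out = conv(torch.cat(cascade, dim=1)) # channel stacking + fusion
            return out                                # intermediate feature map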
  • the second execution module 130 may be configured to use the sub-pixel convolution layer of the image reconstruction model to amplify the reconstructed feature map to obtain a reconstructed image.
  • the second execution module 130 may be specifically configured to: use a sub-pixel convolution layer to adjust the pixel positions in the reconstructed feature map to obtain a reconstructed image.
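• a minimal sketch of this sub-pixel convolution stage, assuming a 3-channel output image and PyTorch's PixelShuffle as the pixel-rearrangement operation:

    import torch.nn as nn

    class SubPixelUpsampler(nn.Module):
        def __init__(self, channels, scale, out_channels=3):
            super().__init__()
            # A convolution expands the channel count by scale^2, then PixelShuffle
            # rearranges those channels into a scale-times larger spatial grid.
            self.conv = nn.Conv2d(channels, out_channels * scale ** 2, 3, padding=1)
            self.shuffle = nn.PixelShuffle(scale)

        def forward(self, feat):
            return self.shuffle(self.conv(feat))  # (B, 3, scale*H, scale*W) image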
  • FIG. 19 shows a schematic block diagram of an image reconstruction model training apparatus 200 provided by an embodiment of the present application.
  • the model training apparatus 200 is applied to any electronic device with image processing function, and may include: a sample acquisition module 210 , a first processing module 220 , a second processing module 230 , a third processing module 240 and a fourth processing module 250 .
  • the sample acquisition module 210 may be configured to acquire training samples, where the training samples include low-resolution images and high-resolution images, and the low-resolution images are obtained by down-sampling the high-resolution images.
  • the first processing module 220 may be configured to input the low-resolution image into a pre-built image reconstruction model, where the image reconstruction model includes a feature extraction network and a sub-pixel convolutional layer.
  • the second processing module 230 may be configured to use a feature extraction network to perform multi-scale feature extraction on the low-resolution image and expand image channels to obtain a training feature map.
  • the third processing module 240 may be configured to use a sub-pixel convolutional layer to amplify the training feature map to obtain a training reconstructed image.
  • the fourth processing module 250 may be configured to perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image and the preset objective function to obtain a trained image reconstruction model.
• the objective function is an L2 loss function;
• the fourth processing module 250 may be specifically configured to: perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image and the L2 loss function, so as to adjust the parameters of the image reconstruction model until a preset training completion condition is reached, thereby obtaining the trained image reconstruction model. A sketch of one such training step follows.
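• a hedged sketch of one back-propagation step under the L2 objective, taken here as the mean squared error between the training reconstruction and the high-resolution ground truth:

    import torch.nn.functional as F

    def l2_train_step(model, lr_img, hr_img, optimizer):
        sr = model(lr_img)             # training reconstructed image
        loss = F.mse_loss(sr, hr_img)  # L2 objective against the high-resolution image
        optimizer.zero_grad()
        loss.backward()                # back-propagation adjusts model parameters
        optimizer.step()
        return loss.item()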
  • the first processing module 220 may also be configured to: prune the trained image reconstruction model, so as to retain long-line cascades and delete short-line cascades.
  • the first processing module 220 may also be configured to: perform flip symmetry processing on the low-resolution image to obtain at least one processed low-resolution image.
  • the second processing module 230 may be specifically configured to: input at least one processed low-resolution image into the image reconstruction model.
• the third processing module 240 may be specifically configured to: use the feature extraction network to perform multi-scale feature extraction on the at least one processed low-resolution image to obtain at least one auxiliary feature map; perform reverse flip-symmetry processing on the at least one auxiliary feature map; and average the results after the reverse flip-symmetry processing to obtain the training feature map.
  • FIG. 20 shows a schematic block diagram of an electronic device 10 provided by an embodiment of the present application.
  • the electronic device 10 may be a mobile terminal that executes the above image processing method, or may be any electronic device having an image processing function that executes the above model training method.
  • the electronic device 10 includes a processor 11 , a memory 12 and a bus 13 , and the processor 11 is connected to the memory 12 through the bus 13 .
  • the memory 12 is used to store programs, such as the image processing apparatus 100 shown in FIG. 18 or the model training apparatus 200 shown in FIG. 19 .
  • the image processing apparatus 100 includes at least one software function module that can be stored in the memory 12 in the form of software or firmware.
• the processor 11 executes the program to implement the image processing method disclosed in the above embodiments.
  • the memory 12 may include a high-speed random access memory (Random Access Memory, RAM), and may also include a non-volatile memory (non-volatile memory, NVM).
  • the processor 11 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 11 or an instruction in the form of software.
• the above-mentioned processor 11 can be a general-purpose processor, including a central processing unit (CPU), a microcontroller unit (MCU), a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), an embedded ARM, and other chips.
• Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by the processor 11, it implements the image processing method or the model training method disclosed in the foregoing embodiments.
• to sum up, in the image processing and model training methods, apparatuses, electronic devices and storage media provided by the embodiments of the present application, an image to be processed is acquired and input into an image reconstruction model that includes a feature extraction network and a sub-pixel convolution layer; the feature extraction network performs multi-scale feature extraction on the image to be processed and expands the image channels to obtain a reconstructed feature map, and the sub-pixel convolution layer then enlarges the reconstructed feature map to obtain the reconstructed image.
• in this way, the processing speed can be improved while the reconstruction effect is ensured.
  • FIG. 21 is a schematic diagram of an exemplary component of an electronic device provided in an embodiment of the present application.
• the electronic device may include a storage medium 2110, a processor 2120, machine-executable instructions 2130 (the machine-executable instructions 2130 may be the portrait super-resolution reconstruction apparatus 131 or the portrait super-resolution reconstruction model training apparatus 132 according to the present application) and a communication interface 140.
  • the storage medium 2110 and the processor 2120 are both located in the electronic device and are provided separately.
  • the storage medium 2110 may also be independent of the electronic device, and may be accessed by the processor 2120 through a bus interface.
  • the storage medium 2110 may also be integrated into the processor 2120, for example, may be a cache and/or a general purpose register.
• the machine-executable instructions 2130 can be understood as part of the electronic device described in FIG. 21 or of the processor 2120 of the electronic device, and can also be understood as a software function module that, independently of the electronic device or the processor 2120 described in FIG. 21, implements the above-mentioned portrait super-resolution reconstruction method or portrait super-resolution reconstruction model training method under the control of the electronic device.
  • the above-mentioned human portrait super-resolution reconstruction apparatus 131 may include a detection module 1311 , a processing module 1312 and a restoration module 1313 .
  • the functions of each functional module of the portrait super-resolution reconstruction apparatus 131 will be described in detail below.
  • the detection module 1311 can be configured to use a pre-built reconstruction model to perform key point detection on the image to be processed to obtain face key points;
  • the detection module 1311 may be configured to perform the above step S110, and for the detailed implementation of the detection module 1311, please refer to the above-mentioned content related to the step S110.
  • the processing module 1312 can be configured to perform super-resolution reconstruction processing according to the face key points and the image features obtained based on the to-be-processed image to obtain high-frequency image information;
• the processing module 1312 may be configured to execute the above-mentioned step S120, and for the detailed implementation of the processing module 1312, please refer to the above-mentioned content related to step S120.
  • the restoration module 1313 may be configured to perform restoration processing on the to-be-processed image by using the high-frequency information of the image to obtain a super-resolution image corresponding to the to-be-processed image.
  • the restoration module 1313 may be configured to perform the above step S130, and for the detailed implementation of the restoration module 1313, please refer to the above-mentioned content related to the step S130.
  • the portrait super-resolution reconstruction apparatus may further include: the image processing apparatus according to FIG. 18 , the image processing apparatus being configured to perform super-resolution reconstruction processing.
• the key point detection, super-resolution reconstruction processing and restoration processing include multiple rounds of iterative processing, and the to-be-processed image is either an unprocessed to-be-processed image or the super-resolution image obtained after the key point detection, super-resolution reconstruction processing and restoration processing of a previous round of iteration.
• there may be a plurality of face key points, and the above-mentioned restoration module 1313 can be configured to obtain a super-resolution image in the following manner: processing the to-be-processed image with a pre-built portrait cognition model to output the position information of each face key point; and, based on the position information of each face key point and the high-frequency information of the image, performing restoration processing on the to-be-processed image to obtain a super-resolution image corresponding to the to-be-processed image.
• the restoration module 1313 may be configured to obtain the super-resolution image based on the position information of each face key point and the high-frequency information of the image in the following manner: acquiring the restoration attribute corresponding to each face key point; and, according to each face key point and its corresponding position information, image high-frequency information and restoration attribute, performing restoration processing on the corresponding face key points in the to-be-processed image.
  • the reconstructed model includes a discriminator and a generation network, and the generation network is obtained after training with training samples under the supervision of the trained discriminator.
  • the face key points include left eye, right eye, nose, mouth and chin contours.
  • the above-mentioned portrait super-resolution reconstruction model training device 132 may include an acquisition module 1321 , a key point acquisition module 1322 , an output image acquisition module 1323 and a training module 1324 .
  • the functions of each functional module of the portrait super-resolution reconstruction model training device 132 will be described in detail below.
  • an acquisition module 1321 which can be configured to acquire training samples and target samples corresponding to the training samples
  • the acquisition module 1321 may be configured to perform the above step S2100, and for the detailed implementation of the acquisition module 1321, please refer to the above-mentioned content related to the step S2100.
  • the key point obtaining module 1322 can be configured to perform key point detection on the training sample by using the constructed generating network to obtain training key points;
  • the key point obtaining module 1322 may be configured to perform the above step S2200, and for the detailed implementation of the key point obtaining module 1322, reference may be made to the above-mentioned content related to step S2200.
  • the output image obtaining module 1323 can be configured to perform super-resolution reconstruction processing and restoration processing based on the training key points and the training samples to obtain an output image;
  • the output image obtaining module 1323 may be configured to perform the above-mentioned step S2300, and for the detailed implementation of the output image obtaining module 1323, reference may be made to the above-mentioned content related to the step S2300.
• the training module 1324 can be configured to compare the output image and the target sample, adjust the network parameters of the generation network based on the comparison result, and continue training until the reconstruction model is obtained when the first preset condition is met.
  • training module 1324 may be configured to perform the above-mentioned step S2400, and for the detailed implementation of the training module 1324, reference may be made to the above-mentioned content related to the step S2400.
• the training module 1324 may be configured to obtain the reconstruction model based on the comparison result between the output image and the target sample in the following manner: constructing a first loss function based on the difference between the pixel information of the output image and the pixel information of the target sample; constructing a second loss function based on the differences between each face key point in the output image and the corresponding face key point in the target sample; and continuing training after network parameter adjustment until the weighted function value of the first loss function and the second loss function satisfies the first preset condition.
  • the reconstruction model further includes a discriminator, and the discriminator is used to supervise the training of the generation network
• the portrait super-resolution reconstruction model training device 132 further includes a construction module, and the construction module is configured to: construct a discriminator, and use the discriminator to perform discrimination processing on the output image and the target sample corresponding to the output image; and adjust the parameters of the discriminator according to the obtained discrimination results until the trained discriminator is obtained when the second preset condition is satisfied.
• the training module 1324 can obtain the reconstruction model in the following manner: inputting the output image to the trained discriminator to obtain discrimination information; comparing the output image and the target sample to obtain a comparison result; and, after adjusting the network parameters of the generation network according to the discrimination information and the comparison result, continuing training until the reconstruction model is obtained when the first preset condition is satisfied.
• the training module 1324 may be configured to obtain the reconstruction model based on the discrimination information and the comparison result in the following manner: constructing a third loss function based on the discriminator's discrimination information on the output image, and constructing a fourth loss function based on the image difference between the output image and the target sample obtained by the pre-built portrait cognition model; and continuing training until the reconstruction model is obtained when the weighted function value satisfies the first preset condition.
• embodiments of the present application further provide a computer-readable storage medium that stores machine-executable instructions 130; when the machine-executable instructions 130 are executed, the portrait super-resolution reconstruction method or the portrait super-resolution reconstruction model training method provided by the above embodiments is implemented.
• the computer-readable storage medium can be a general-purpose storage medium, such as a removable disk or a hard disk; when the computer program on the computer-readable storage medium is run, the above-mentioned portrait super-resolution reconstruction method or portrait super-resolution reconstruction model training method can be executed.
• for the processes involved when the executable instructions on the computer-readable storage medium are executed, reference may be made to the relevant descriptions in the foregoing method embodiments, which will not be repeated here.
• to sum up, in the portrait super-resolution reconstruction method, the portrait super-resolution reconstruction model training method, the apparatuses, the electronic device and the readable storage medium provided by the embodiments of the present application, key point detection is performed on the image to be processed by using the pre-built reconstruction model to obtain face key points; super-resolution reconstruction processing is then performed according to the face key points and the image features obtained from the image to be processed, to obtain high-frequency image information; and the high-frequency image information is used to perform restoration processing on the image to be processed, to obtain the super-resolution image corresponding to the image to be processed.
• in this way, the super-resolution reconstruction of the image is realized by combining face key point detection with face restoration, and the recognizability of the obtained super-resolution image is improved, meeting the needs of users in practical applications.
  • the present application provides an image processing method, a portrait super-resolution reconstruction method, an image reconstruction model training method, a portrait super-resolution reconstruction model training method, and related devices, electronic equipment and storage media.
• the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer.
• the feature extraction network is used to extract multi-scale features of the image to be processed and expand the image channels to obtain a reconstructed feature map, and the sub-pixel convolution layer is then used to enlarge the reconstructed feature map to obtain the reconstructed image. Since the feature extraction network can extract multi-scale features and expand image channels, a good reconstruction effect can be obtained without increasing the network depth; moreover, because image enlargement is performed at the end of the model by the sub-pixel convolution layer, the feature extraction network processes small-sized images, which greatly reduces the amount of calculation and the number of parameters, thereby improving the processing speed while ensuring the reconstruction effect.
• the image processing method, the portrait super-resolution reconstruction method, the image reconstruction model training method, the portrait super-resolution reconstruction model training method, and the related apparatuses, electronic devices and storage media are reproducible and can be used in a variety of industrial applications.
• the image processing method, the portrait super-resolution reconstruction method, the image reconstruction model training method, the portrait super-resolution reconstruction model training method, and the related apparatuses, electronic devices and storage media of the present application can be used in any apparatus that performs super-resolution reconstruction of a low-resolution image or image sequence.

Abstract

The embodiments of the present application relate to the technical field of computer vision, and provide an image processing and model training method and apparatus, an electronic device, and a storage medium. Said method comprises: acquiring an image to be processed and inputting same into an image reconstruction model, the image reconstruction model comprising a feature extraction network and a sub-pixel convolution layer, using the feature extraction network to perform multi-scale feature extraction and image channel extension on said image, so as to obtain a reconstructed feature map, and then using the sub-pixel convolution layer to enlarge the reconstructed feature map, so as to obtain a reconstructed image. As the feature extraction network can extract multi-scale features and extend an image channel, a good reconstruction effect can be obtained without increasing the network depth; moreover, the sub-pixel convolution layer is used at the end of the model to perform image enlargement, and the feature extraction network processes a small-sized image, greatly reducing the amount of calculation and parameters, thereby increasing the processing speed while ensuring the reconstruction effect. In addition, the embodiments of the present application further provide a portrait super-resolution reconstruction method and apparatus, a model training method and apparatus, an electronic device, and a readable storage medium.

Description

Image processing and portrait super-resolution reconstruction and model training method, apparatus, electronic device and storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This disclosure claims priority to the Chinese patent application with application number 202010977254.4, entitled "Image Processing and Model Training Method, Apparatus, Electronic Device, and Storage Medium", filed with the State Intellectual Property Office of China on September 16, 2020, and to the Chinese patent application with application number 202011000670.5, entitled "Portrait Super-resolution Reconstruction Method, Model Training Method, Apparatus, Electronic Device and Readable Storage Medium", filed with the State Intellectual Property Office of China on September 22, 2020, the entire contents of which are incorporated into this disclosure by reference.
TECHNICAL FIELD
The present application relates to the technical field of computer vision, and in particular, to a method, apparatus, electronic device, and storage medium for image processing, super-resolution reconstruction of portraits, and model training.
BACKGROUND
Image super-resolution reconstruction, or image super-resolution restoration, refers to the process of restoring a given low-resolution image or image sequence into a corresponding high-resolution image through specific processing. It is widely used in fields where video or image quality needs to be improved, such as video image processing, medical imaging, remote sensing imaging, and video surveillance.
At present, when super-resolution reconstruction is performed by deep learning algorithms, a network with a sufficient depth of layers needs to be used to obtain a better reconstruction effect. Therefore, the network structure is usually complicated, and the amount of calculation is large, which affects the processing speed.
In addition, image super-resolution reconstruction technology is also widely used in many fields such as face recognition, big data analysis, and security, and is of great help in achieving portrait restoration, portrait recognition, and matching. However, in the current image super-resolution reconstruction process, for example when performing super-resolution reconstruction of a human portrait, the method usually adopted is to reconstruct the entire image; because this method does not focus on the information that is more important to human visual perception, the reconstructed image often fails to meet actual needs.
SUMMARY OF THE INVENTION
Embodiments of the present application provide an image processing and model training method, apparatus, electronic device, and storage medium, so as to improve the processing speed while ensuring the reconstruction effect.
Embodiments of the present application also provide a portrait super-resolution reconstruction method, a model training method, an apparatus, an electronic device, and a readable storage medium, which can improve the recognition of the obtained super-resolution image and meet user requirements.
In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:
Some embodiments of the present application provide an image processing method, and the method may include:
acquiring an image to be processed;
inputting the to-be-processed image into an image reconstruction model, and using the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the to-be-processed image and expand image channels to obtain a reconstructed feature map;
enlarging the reconstructed feature map by using the sub-pixel convolution layer of the image reconstruction model to obtain a reconstructed image.
In an optional embodiment, the feature extraction network may include a convolutional layer, a plurality of concatenated blocks and a plurality of first convolutional layers, the plurality of concatenated blocks and the plurality of first convolutional layers being alternately arranged, and the feature extraction network may adopt a global cascade structure;
the step of using the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the to-be-processed image to obtain a reconstructed feature map may include:
inputting the image to be processed into the convolutional layer for convolution processing to obtain an initial feature map;
using the initial feature map as the input of the first concatenated block and the output of the (N-1)-th first convolutional layer as the input of the N-th concatenated block, performing multi-scale feature extraction with the concatenated blocks, and outputting intermediate feature maps;
performing channel stacking on the initial feature map and the intermediate feature map output by each concatenated block before the N-th first convolutional layer, and inputting the stacked result to the N-th first convolutional layer for convolution processing;
using the output of the last first convolutional layer as the reconstructed feature map.
In an optional implementation manner, the number of the concatenated blocks may be 3 to 5, and the number of the first convolutional layers may be 3 to 5.
In an optional embodiment, the concatenated block may include a plurality of residual blocks and a plurality of second convolutional layers, the plurality of residual blocks and the plurality of second convolutional layers being alternately arranged, and the concatenated block may adopt a local cascade structure;
the step of using the concatenated block to perform multi-scale feature extraction and output an intermediate feature map may include:
taking the input of the concatenated block as the input of the first residual block and the output of the (N-1)-th second convolutional layer as the input of the N-th residual block, and learning residual features with the residual blocks to obtain residual feature maps;
performing channel stacking on the input of the concatenated block and the output of each residual block before the N-th second convolutional layer, and inputting the stacked result to the N-th second convolutional layer for convolution processing;
using the output of the last second convolutional layer as the intermediate feature map.
In an optional implementation manner, the number of the residual blocks may be 3 to 5, and the number of the second convolutional layers may be 3 to 5.
In an optional embodiment, the residual block may include a grouped convolutional layer, a third convolutional layer and a fourth convolutional layer, the grouped convolutional layer adopting a ReLU activation function, the grouped convolutional layer and the third convolutional layer being connected to form a residual path, and the residual block adopting a local skip connection structure;
the step of learning residual features with the residual block to obtain a residual feature map may include:
using the input of the residual block as the input of the grouped convolutional layer, and extracting features through the residual path;
performing feature fusion of the input of the residual block and the output of the third convolutional layer, inputting the fused result to the fourth convolutional layer for convolution processing, and outputting the residual feature map.
In an optional embodiment, the step of using the sub-pixel convolution layer of the image reconstruction model to enlarge the reconstructed feature map to obtain a reconstructed image may include:
using the sub-pixel convolution layer to adjust the pixel positions in the reconstructed feature map to obtain the reconstructed image.
Other embodiments of the present application further provide an image reconstruction model training method, and the method may include:
acquiring training samples, where the training samples include low-resolution images and high-resolution images, and the low-resolution images are obtained by down-sampling the high-resolution images;
inputting the low-resolution image into a pre-built image reconstruction model, where the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer;
using the feature extraction network to perform multi-scale feature extraction on the low-resolution image and expand image channels to obtain a training feature map;
using the sub-pixel convolution layer to enlarge the training feature map to obtain a training reconstructed image;
performing back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image and a preset objective function to obtain a trained image reconstruction model.
In an optional embodiment, the objective function may be an L2 loss function;
the step of performing back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image and the preset objective function to obtain the trained image reconstruction model may include:
performing back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image and the L2 loss function, so as to adjust the parameters of the image reconstruction model until a preset training completion condition is reached, thereby obtaining the trained image reconstruction model.
In an optional embodiment, the image reconstruction model training method may further include:
pruning the trained image reconstruction model, so as to retain long-line cascades and delete short-line cascades.
In an optional embodiment, before the step of inputting the low-resolution image into a pre-built image reconstruction model, the method may further include:
performing mean-subtraction ("self-reduction") processing on the low-resolution image to highlight the texture details of the low-resolution image.
In an optional embodiment, before the step of inputting the low-resolution image into a pre-built image reconstruction model, the method may further include:
performing flip-symmetry processing on the low-resolution image to obtain at least one processed low-resolution image;
the step of inputting the low-resolution image into a pre-built image reconstruction model may include:
inputting the at least one processed low-resolution image into the image reconstruction model;
the step of using the feature extraction network to perform multi-scale feature extraction on the low-resolution image to obtain a training feature map may include:
using the feature extraction network to perform multi-scale feature extraction on the at least one processed low-resolution image to obtain at least one auxiliary feature map;
performing reverse flip-symmetry processing on the at least one auxiliary feature map, and averaging the results after the reverse flip-symmetry processing to obtain the training feature map.
Still other embodiments of the present application further provide an image processing apparatus, and the apparatus may include:
an image acquisition module, which can be configured to acquire an image to be processed;
a first execution module, which can be configured to input the image to be processed into an image reconstruction model, and use the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the to-be-processed image and expand image channels to obtain a reconstructed feature map;
a second execution module, which can be configured to use the sub-pixel convolution layer of the image reconstruction model to enlarge the reconstructed feature map to obtain a reconstructed image.
Still other embodiments of the present application further provide an image reconstruction model training apparatus, and the apparatus may include:
a sample acquisition module, which can be configured to acquire training samples, where the training samples include low-resolution images and high-resolution images, and the low-resolution images are obtained by down-sampling the high-resolution images;
a first processing module, which can be configured to input the low-resolution image into a pre-built image reconstruction model, the image reconstruction model including a feature extraction network and a sub-pixel convolution layer;
a second processing module, which can be configured to use the feature extraction network to perform multi-scale feature extraction on the low-resolution image and expand image channels to obtain a training feature map;
a third processing module, which can be configured to use the sub-pixel convolutional layer to enlarge the training feature map to obtain a training reconstructed image;
a fourth processing module, which can be configured to perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image and a preset objective function to obtain a trained image reconstruction model.
Compared with the prior art, in the image processing and model training method, apparatus, electronic device and storage medium provided by the embodiments of the present application, an image to be processed is acquired and input into an image reconstruction model that includes a feature extraction network and a sub-pixel convolution layer; the feature extraction network is first used to perform multi-scale feature extraction on the image to be processed and expand the image channels to obtain a reconstructed feature map, and the sub-pixel convolution layer is then used to enlarge the reconstructed feature map to obtain a reconstructed image. Since the feature extraction network can extract multi-scale features and expand image channels, a good reconstruction effect can be obtained without increasing the network depth; meanwhile, because the sub-pixel convolution layer at the end of the model performs the image enlargement, the feature extraction network processes small-sized images, which greatly reduces the amount of calculation and the number of parameters, thereby improving the processing speed while ensuring the reconstruction effect.
Some embodiments of the present application provide a portrait super-resolution reconstruction method, and the method may include:
using an image reconstruction model to perform key point detection on an image to be processed to obtain face key points;
performing super-resolution reconstruction processing according to the face key points and the image features obtained based on the to-be-processed image to obtain high-frequency image information;
performing restoration processing on the to-be-processed image by using the high-frequency image information to obtain a super-resolution image corresponding to the to-be-processed image.
In an optional embodiment, the image processing method described above is used to perform the super-resolution reconstruction processing.
In an optional implementation manner, the key point detection, super-resolution reconstruction processing and restoration processing may include multiple rounds of iterative processing, and the to-be-processed image is an unprocessed to-be-processed image, or the super-resolution image obtained after the key point detection, super-resolution reconstruction processing and restoration processing of a previous round of iteration.
In an optional implementation manner, there may be a plurality of face key points, and the step of performing restoration processing on the to-be-processed image by using the high-frequency image information to obtain a super-resolution image corresponding to the to-be-processed image may include:
processing the to-be-processed image by using a pre-built portrait cognition model, and outputting the position information of each face key point;
performing restoration processing on the to-be-processed image based on the position information of each face key point and the high-frequency image information to obtain a super-resolution image corresponding to the to-be-processed image.
In an optional implementation manner, the step of performing restoration processing on the to-be-processed image based on the position information of each face key point and the high-frequency image information to obtain a super-resolution image corresponding to the to-be-processed image may include:
acquiring the restoration attribute corresponding to each face key point;
performing restoration processing on the corresponding face key points in the to-be-processed image according to each face key point and its corresponding position information, high-frequency image information and restoration attribute.
In an optional embodiment, the reconstruction model may include a discriminator and a generation network, and the generation network is obtained after training with training samples under the supervision of the trained discriminator.
In an optional embodiment, the face key points may include the contours of the left eye, the right eye, the nose, the mouth and the chin.
Other embodiments of the present application provide a method for training a portrait super-resolution reconstruction model, and the method may include:
acquiring a training sample and a target sample corresponding to the training sample;
performing key point detection on the training sample by using the constructed generation network to obtain training key points;
performing super-resolution reconstruction processing and restoration processing based on the training key points and the training sample to obtain an output image;
comparing the output image and the target sample, adjusting the network parameters of the generation network based on the comparison result, and continuing training until a reconstructed model is obtained when a first preset condition is satisfied.
In an optional implementation, the step of comparing the output image with the target sample, adjusting network parameters of the generation network based on the comparison result, and continuing training until the first preset condition is satisfied to obtain the reconstruction model may include the following (a sketch of the resulting weighted objective is given after these steps):
constructing a first loss function based on the difference between pixel information of the output image and pixel information of the target sample;
constructing a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample;
comparing the output image with the target sample, adjusting network parameters of the generation network based on the comparison result, and continuing training until the weighted value of the first loss function and the second loss function satisfies the first preset condition, so as to obtain the reconstruction model.
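As an illustration only, the following is a minimal PyTorch-style sketch of the weighted two-term objective described above; the use of L1 distances and the weights `w_pix` and `w_kpt` are assumptions, since the patent does not prescribe a particular distance measure or framework.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(output_img, target_img,
                        output_kpts, target_kpts,
                        w_pix=1.0, w_kpt=0.1):
    """Weighted sum of a pixel loss and a face-key-point loss (illustrative).

    output_img/target_img: (N, C, H, W) image tensors.
    output_kpts/target_kpts: (N, K, 2) key-point coordinates.
    """
    # First loss: difference between pixel information of the two images.
    l_pix = F.l1_loss(output_img, target_img)
    # Second loss: difference between corresponding face key points.
    l_kpt = F.l1_loss(output_kpts, target_kpts)
    return w_pix * l_pix + w_kpt * l_kpt
```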
In an optional implementation, the reconstruction model further includes a discriminator for supervising the training of the generation network, and the method may further include the following (see the sketch after these steps):
constructing a discriminator, and using the discriminator to discriminate the output image and the target sample corresponding to the output image;
adjusting parameters of the discriminator according to the obtained discrimination result, until a second preset condition is satisfied, so as to obtain a trained discriminator.
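A minimal sketch of one discriminator update under the supervision scheme described above, assuming a binary real/fake objective; the patent states only that the discriminator's parameters are adjusted according to the discrimination result until a second preset condition is met.

```python
import torch
import torch.nn.functional as F

def train_discriminator_step(discriminator, optimizer, output_img, target_img):
    """One illustrative parameter update of the discriminator."""
    optimizer.zero_grad()
    # Discriminate the target sample (real) and the output image (fake).
    real_logits = discriminator(target_img)
    fake_logits = discriminator(output_img.detach())
    loss = (F.binary_cross_entropy_with_logits(real_logits,
                                               torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits,
                                                 torch.zeros_like(fake_logits)))
    loss.backward()
    optimizer.step()
    return loss.item()
```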
In an optional implementation, the step of comparing the output image with the target sample, adjusting network parameters of the generation network based on the comparison result, and continuing training until the first preset condition is satisfied to obtain the reconstruction model may include:
inputting the output image into the trained discriminator to obtain discrimination information;
comparing the output image with the target sample to obtain a comparison result;
adjusting network parameters of the generation network according to the discrimination information and the comparison result, and continuing training until the first preset condition is satisfied, so as to obtain the reconstruction model.
In an optional implementation, the step of adjusting network parameters of the generation network according to the discrimination information and the comparison result and continuing training until the first preset condition is satisfied to obtain the reconstruction model may include the following (a sketch of the four-term objective follows these steps):
constructing a first loss function based on the difference between pixel information of the output image and pixel information of the target sample;
constructing a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample;
constructing a third loss function based on the discrimination information of the output image given by the discriminator, and constructing a fourth loss function based on the image difference between the output image and the target sample obtained by a pre-built portrait cognition model;
adjusting network parameters of the generation network according to the discrimination information and the comparison result, and continuing training until the weighted value of the first loss function, the second loss function, the third loss function, and the fourth loss function satisfies the first preset condition, so as to obtain the reconstruction model.
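Extending the earlier two-term sketch, the four-term weighted objective might be combined as follows; the adversarial form of the third loss, the L1 form of the fourth loss, and all weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(l_pix, l_kpt, fake_logits, cog_out, cog_tgt,
                         w=(1.0, 0.1, 0.01, 0.05)):
    """Weighted sum of the four loss terms (illustrative weights).

    l_pix / l_kpt: first and second losses from the earlier sketch.
    fake_logits: discriminator output on the generated image (third loss).
    cog_out / cog_tgt: portrait-cognition-model outputs for the generated
    image and the target sample (fourth loss).
    """
    l_adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    l_cog = F.l1_loss(cog_out, cog_tgt)
    return w[0] * l_pix + w[1] * l_kpt + w[2] * l_adv + w[3] * l_cog
```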
Still other embodiments of the present application provide a portrait super-resolution reconstruction apparatus. The apparatus may include:
a detection module, which may be configured to perform key point detection on an image to be processed by using a pre-built reconstruction model, to obtain face key points;
a processing module, which may be configured to perform super-resolution reconstruction according to the face key points and image features obtained based on the image to be processed, to obtain high-frequency image information;
a restoration module, which may be configured to restore the image to be processed by using the high-frequency image information, to obtain a super-resolution image corresponding to the image to be processed.
In an optional implementation, the portrait super-resolution reconstruction apparatus may further include the image processing apparatus described above, where the image processing apparatus may be configured to perform the super-resolution reconstruction processing.
Still other embodiments of the present application provide a portrait super-resolution reconstruction model training apparatus. The training apparatus may include:
an obtaining module, which may be configured to obtain a training sample and a target sample corresponding to the training sample;
a key point obtaining module, which may be configured to perform key point detection on the training sample by using a constructed generation network, to obtain training key points;
an output image obtaining module, which may be configured to perform super-resolution reconstruction and restoration based on the training key points and the training sample, to obtain an output image;
a training module, which may be configured to compare the output image with the target sample, adjust network parameters of the generation network based on the comparison result, and continue training until a first preset condition is satisfied, so as to obtain a reconstruction model.
Still other embodiments of the present application provide an electronic device. The electronic device may include one or more processors and one or more storage media storing one or more machine-executable instructions that, when executed by the one or more processors, cause the one or more processors to implement the image processing method according to some embodiments, the image reconstruction model training method according to other embodiments, the portrait super-resolution reconstruction method according to still other embodiments, or the portrait super-resolution reconstruction model training method according to still other embodiments.
Still other embodiments of the present application provide a computer-readable storage medium storing machine-executable instructions that, when executed, implement the image processing method according to some embodiments, the image reconstruction model training method according to other embodiments, the portrait super-resolution reconstruction method according to still other embodiments, or the portrait super-resolution reconstruction model training method according to still other embodiments.
The beneficial effects of the embodiments of the present application include, for example:
In the portrait super-resolution reconstruction method, model training method, apparatus, electronic device, and readable storage medium provided by the present application, key point detection is performed on an image to be processed by using a pre-built reconstruction model to obtain face key points; super-resolution reconstruction is then performed according to the face key points and image features obtained based on the image to be processed, to obtain high-frequency image information; and the image to be processed is restored by using the high-frequency image information, to obtain a super-resolution image corresponding to the image to be processed. In the present application, super-resolution reconstruction of the image is achieved by combining face key point detection with face restoration, improving the recognizability of the resulting super-resolution image and meeting user needs in practical applications.
Description of Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting the scope; those of ordinary skill in the art can derive other related drawings from these drawings without creative effort.
FIG. 1 shows an application scenario diagram of the image processing method provided by an embodiment of the present application.
FIG. 2 shows a schematic flowchart of the image processing method provided by an embodiment of the present application.
FIG. 3 shows an example diagram of the image reconstruction model provided by an embodiment of the present application.
FIG. 4 shows an example diagram of a cascade block provided by an embodiment of the present application.
FIG. 5 shows an example diagram of a residual block provided by an embodiment of the present application.
FIG. 6 shows another example diagram of the image reconstruction model provided by an embodiment of the present application.
FIG. 7 shows a presentation of image processing results provided by an embodiment of the present application.
FIG. 8 is a flowchart of the portrait super-resolution reconstruction method provided by an embodiment of the present application.
FIG. 9 is a schematic diagram of the processing flow of the portrait super-resolution reconstruction method provided by an embodiment of the present application.
FIG. 10 is another schematic diagram of the processing flow of the portrait super-resolution reconstruction method provided by an embodiment of the present application.
FIG. 11 is a flowchart of a method for obtaining a super-resolution image in the portrait super-resolution reconstruction method provided by an embodiment of the present application.
FIG. 12 is yet another schematic diagram of the processing flow of the portrait super-resolution reconstruction method provided by an embodiment of the present application.
FIG. 13 shows a schematic flowchart of the image reconstruction model training method provided by an embodiment of the present application.
FIG. 14 is a flowchart of the portrait super-resolution reconstruction model training method provided by an embodiment of the present application.
FIG. 15 is a first flowchart of a method for obtaining a reconstruction model in the portrait super-resolution reconstruction model training method provided by an embodiment of the present application.
FIG. 16 is a second flowchart of a method for obtaining a reconstruction model in the portrait super-resolution reconstruction model training method provided by an embodiment of the present application.
FIGS. 17(a) to 17(c) are schematic diagrams of output images obtained by an interpolation method, a method without a discriminator, and a method with a discriminator, respectively.
FIG. 18 shows a schematic block diagram of the image processing apparatus provided by an embodiment of the present application.
FIG. 19 shows a schematic block diagram of the image reconstruction model training apparatus provided by an embodiment of the present application.
FIG. 20 shows a schematic block diagram of an electronic device provided by an embodiment of the present application.
FIG. 21 is a structural block diagram of an electronic device provided by an embodiment of the present application.
FIG. 22 is a block diagram of the functional modules of the portrait super-resolution reconstruction apparatus provided by an embodiment of the present application.
FIG. 23 is a block diagram of the functional modules of the portrait super-resolution reconstruction model training apparatus provided by an embodiment of the present application.
Reference numerals: 10 - electronic device; 11 - processor; 12 - memory; 13 - bus; 20 - first terminal; 30 - second terminal; 40 - network; 50 - server; 100 - image processing apparatus; 110 - image acquisition module; 120 - first execution module; 130 - second execution module; 200 - model training apparatus; 210 - sample acquisition module; 220 - first processing module; 230 - second processing module; 240 - third processing module; 250 - fourth processing module.
2110 - storage medium; 2120 - processor; 2130 - machine-executable instructions; 131 - portrait super-resolution reconstruction apparatus; 1311 - detection module; 1312 - processing module; 1313 - restoration module; 132 - model training apparatus; 1321 - obtaining module; 1322 - key point obtaining module; 1323 - output image obtaining module; 1324 - training module; 140 - communication interface.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. The components of the embodiments of the present application, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations.
Therefore, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
In the description of the present application, it should be noted that terms such as "first" and "second", if present, are used only to distinguish between descriptions and should not be construed as indicating or implying relative importance. It should also be noted that the features of the embodiments of the present application may be combined with one another in the absence of conflict.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of the embodiments of the present application.
Referring to FIG. 1, FIG. 1 shows an application scenario diagram of the image processing method provided by an embodiment of the present application, including a first terminal 20, a second terminal 30, a network 40, and a server 50. The first terminal 20 and the second terminal 30 are both connected to the server 50 through the network 40. The first terminal 20 and the second terminal 30 may be mobile terminals on which various applications (Apps) may be installed, for example, a video playback App, an instant messaging App, a video/image capture App, or a shopping App. The network 40 may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission.
The first terminal 20 and the second terminal 30 may be any mobile terminals with a screen display function, for example, a smartphone, a laptop computer, a tablet computer, a desktop computer, or a smart TV.
The first terminal 20 may upload a video file or picture to the server 50, and the server 50 may store the video file or picture after receiving it. When a user watches a video or views a picture through the second terminal 30, the second terminal 30 may request the video file or picture from the server 50, and the server 50 may return it to the second terminal 30. Usually, to increase transmission speed, the video file or picture is compressed, so its resolution is low.
After receiving the video file or picture, the second terminal 30 may process it in real time by using the image processing method provided by the embodiments of the present application to obtain a high-resolution video or picture and display it on the display interface of the second terminal 30, thereby improving the user's picture-quality experience. The image processing method provided by the embodiments of the present application may be integrated as a functional plug-in into a video playback App or a gallery App of the second terminal 30.
Taking a live video scenario as an example, the first terminal 20 may be the host's mobile terminal, and the second terminal 30 may be a viewer's mobile terminal. During a live broadcast, the first terminal 20 may upload the live video to the server 50, and the server 50 may store it. When a viewer watches the live broadcast through the second terminal 30, the server 50 may return the live video to the second terminal 30. After receiving the live video, the second terminal 30 may process it in real time by using the image processing method provided by the embodiments of the present application to obtain and display a high-resolution live video, so that the viewer can watch a clear live video.
It should be pointed out that the image processing method provided by the embodiments of the present application can be applied to mobile terminals. Although the above description takes application to the second terminal 30 as an example, it should be understood that the image processing method can also be applied to the first terminal 20; the specific application can be determined according to the actual application scenario and is not limited here.
The image processing method provided by the embodiments of the present application is described in detail below.
On the basis of the application scenario shown in FIG. 1, please refer to FIG. 2, which shows a schematic flowchart of the image processing method provided by an embodiment of the present application. The image processing method may include the following steps:
S101: Acquire an image to be processed.
The image to be processed may be a picture displayed on a mobile terminal, or a video frame in a video stream, that requires super-resolution reconstruction to improve image quality; for example, it may be a video frame in a low-resolution video obtained by the second terminal 30 from the server 50.
In this embodiment, the mobile terminal may perform super-resolution reconstruction directly upon receiving a low-resolution picture or video file; alternatively, it may first display the received picture or video and perform super-resolution reconstruction only when the user performs a resolution-switching operation. For example, a received low-resolution video may be played first, and super-resolution reconstruction may be performed when the user switches the definition from "standard definition" to "ultra high definition".
S102: Input the image to be processed into an image reconstruction model, and use a feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the image to be processed and expand image channels, to obtain a reconstructed feature map.
After the image to be processed is acquired, it is input into the image reconstruction model for super-resolution reconstruction. Referring to FIG. 3, the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer. The feature extraction network is used to extract multi-scale features of the image to be processed and expand image channels, and the sub-pixel convolution layer is used to upscale the reconstructed feature map output by the feature extraction network.
Multi-scale feature extraction refers to extracting feature information at different levels by means of global and local cascading; for example, features can be extracted step by step from lower layers to higher layers, or low-layer information can be passed directly to higher layers.
An image channel refers to one or more color channels obtained by dividing an image according to its color components. Images can usually be classified into single-channel, three-channel, and four-channel images. In a single-channel image, each pixel is represented by a single value, for example, a grayscale image; in a three-channel image, each pixel is represented by three values, for example, an RGB color image; a four-channel image adds transparency (an alpha channel) to a three-channel image.
Expanding image channels means increasing the number of channels of an image without changing its size. For example, the input is an image of H×W×C, where H×W is the size of the input image and C is its number of channels; the output is an image of H×W×r²C, where H×W is the size of the output image and r²C is its number of channels.
S103: Use the sub-pixel convolution layer of the image reconstruction model to upscale the reconstructed feature map, to obtain a reconstructed image.
A sub-pixel convolution layer, also known as pixel shuffle (PixelShuffle), is a convolution layer that can be computed efficiently. Its main function is to turn low-resolution feature maps into a high-resolution feature map through convolution and reorganization across multiple channels. Compared with hand-crafted upsampling filters such as bilinear or bicubic samplers, a sub-pixel convolution layer can learn more complex upsampling operations through training, while the overall computation time is reduced.
For example, if the input feature map is H×W×r²C, the main function of the sub-pixel convolution layer is to combine the feature maps of the r² channels into a new upsampled result of r×H by r×W, that is, (r×H)×(r×W)×C, obtaining an output image of rH×rW×C and completing an r-fold upscaling from the input feature map to the output image.
The working process of the sub-pixel convolution layer may be as follows: first, each original low-resolution pixel is divided into r×r small cells; then, according to certain rules, these small cells are filled with the values at the corresponding positions of the r×r input feature maps; the reorganization is completed once the small cells divided from every low-resolution pixel are filled in the same way.
In one embodiment, the sub-pixel convolution layer may be used to rearrange the pixel positions in the reconstructed feature map, to obtain the reconstructed image.
For example, if the reconstructed feature map output by the feature extraction network is H×W×r²C, the sub-pixel convolution layer rearranges the pixel positions to obtain a reconstructed image of rH×rW×C, thereby completing the r-fold upscaling.
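As an illustration of the channel expansion and pixel rearrangement just described, the following is a minimal PyTorch-style sketch, assuming an input with C channels and an upscaling factor r; the layer sizes are assumptions, and the patent does not prescribe a specific framework.

```python
import torch
import torch.nn as nn

r, C = 2, 3  # upscaling factor and number of output channels (illustrative)

# Expand channels without changing spatial size: H×W×C -> H×W×r²C.
expand = nn.Conv2d(C, C * r * r, kernel_size=3, padding=1)
# Rearrange the r² channel groups into an r-fold larger image: H×W×r²C -> rH×rW×C.
shuffle = nn.PixelShuffle(r)

x = torch.randn(1, C, 64, 64)   # a 64×64 low-resolution input
y = shuffle(expand(x))
print(y.shape)                  # torch.Size([1, 3, 128, 128])
```

Stacking two such 2× stages would likewise realize the 4× case mentioned next.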
In this embodiment, the sub-pixel convolution layer can support multiple upscaling factors; for example, two 2× sub-pixel convolution layers can be combined to perform a 4× upscaling operation, or a 2× and a 3× sub-pixel convolution layer can be combined to perform a 6× upscaling operation.
Moreover, existing super-resolution reconstruction algorithms first interpolate to high resolution and then make corrections, whereas the image reconstruction model in the embodiments of the present application places the sub-pixel convolution layer at the end for upscaling. This ensures that the feature extraction network in the front of the model processes small-sized images, greatly reducing the amount of computation and the number of parameters.
Step S102 is described in detail below.
Referring again to FIG. 3, the feature extraction network includes a convolution layer, multiple cascade blocks, and multiple first convolution layers, with the cascade blocks and the first convolution layers arranged alternately; the feature extraction network adopts a global cascade structure. The global cascade structure refers to the left and right fast paths in FIG. 3: through the left fast path, the output of a cascade block can be fed directly to each first convolution layer after that cascade block, and through the right fast path, the output of the convolution layer can be fed directly to each first convolution layer. Here, "fed" refers to channel stacking, not data addition.
On the basis of the feature extraction network shown in FIG. 3, in step S102, the manner of using the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the image to be processed and expand image channels to obtain the reconstructed feature map may include the following steps (an illustrative sketch is given after the channel-stacking explanation below):
inputting the image to be processed into the convolution layer for convolution processing, to obtain an initial feature map;
taking the initial feature map as the input of the first cascade block and the output of the (N-1)-th first convolution layer as the input of the N-th cascade block, and performing multi-scale feature extraction with the cascade blocks to output intermediate feature maps;
channel-stacking the initial feature map with the intermediate feature map output by each cascade block before the N-th first convolution layer, and inputting the stacked result into the N-th first convolution layer for convolution processing;
taking the output of the last first convolution layer as the reconstructed feature map.
The convolution layer and the first convolution layers can expand image channels, and the convolution layer, the cascade blocks, and the first convolution layers can all extract features.
Channel-stacking the initial feature map and an intermediate feature map means merging the channels of the initial feature map with those of the intermediate feature map. For example, if the initial feature map has 4 channels and the intermediate feature map has 8 channels, the stacked feature map has 12 channels; in other words, each pixel in the initial feature map is represented by 4 values, each pixel in the intermediate feature map by 8 values, and each pixel in the stacked feature map by 12 values.
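The following is a minimal sketch, assuming a PyTorch implementation, of the global cascade forward pass described above; `block_cls` stands in for the cascade block described next, and the channel counts are illustrative assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class FeatureExtractionNet(nn.Module):
    """Global cascade: each 1x1 'first convolution layer' sees the initial
    feature map stacked with the outputs of all preceding cascade blocks."""
    def __init__(self, block_cls, channels=64, num_blocks=3):
        super().__init__()
        self.entry = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            [block_cls(channels) for _ in range(num_blocks)])
        # The i-th first convolution layer fuses (i + 2) stacked feature maps.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels * (i + 2), channels, 1)
             for i in range(num_blocks)])

    def forward(self, x):
        feat = self.entry(x)            # initial feature map
        stacked = [feat]
        out = feat
        for block, fuse in zip(self.blocks, self.fuse):
            stacked.append(block(out))  # intermediate feature map
            out = fuse(torch.cat(stacked, dim=1))  # channel stacking + conv
        return out                      # reconstructed feature map
```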
In one embodiment, the structure of a cascade block is shown in FIG. 4. A cascade block includes multiple residual blocks and multiple second convolution layers, with the residual blocks and the second convolution layers arranged alternately; the cascade block adopts a local cascade structure. The local cascade structure refers to the left and right fast paths in FIG. 4: through the left fast path, the output of a residual block can be fed directly to each second convolution layer after that residual block, and through the right fast path, the input of the cascade block can be fed directly to each second convolution layer. As above, "fed" here refers to channel stacking, not data addition.
On the basis of the cascade block shown in FIG. 4, the manner of performing multi-scale feature extraction with a cascade block and outputting an intermediate feature map may include:
taking the input of the cascade block as the input of the first residual block and the output of the (N-1)-th second convolution layer as the input of the N-th residual block, and learning residual features with the residual blocks to obtain residual feature maps;
channel-stacking the input of the cascade block with the output of each residual block before the N-th second convolution layer, and inputting the stacked result into the N-th second convolution layer for convolution processing;
taking the output of the last second convolution layer as the intermediate feature map.
The second convolution layers can expand image channels, and the residual blocks and the second convolution layers can extract features.
The process of channel-stacking the input of the cascade block with the outputs of the residual blocks is similar to the above process of channel-stacking the initial feature map with the intermediate feature maps, and is not repeated here.
In one embodiment, the structure of a residual block is shown in FIG. 5. A residual block may include grouped convolution layers, a third convolution layer, and a fourth convolution layer. The grouped convolution layers use the ReLU activation function, and the grouped convolution layers and the third convolution layer are connected to form a residual path; the residual block adopts a local skip connection structure. The local skip connection structure means fusing the input of the residual block with the output of the residual path to learn residual features.
On the basis of the residual block shown in FIG. 5, the manner of learning residual features with a residual block to obtain a residual feature map may include the following (see the sketch after the grouped-convolution notes below):
taking the input of the residual block as the input of the grouped convolution layers, and extracting features through the residual path;
fusing the input of the residual block with the output of the third convolution layer, and inputting the fused result into the fourth convolution layer for convolution processing, to output the residual feature map.
The third and fourth convolution layers can expand image channels, and the grouped convolution layers can extract features.
A grouped convolution layer (group convolution layer) divides the input feature map into groups, and each group is then convolved separately. Compared with regular convolution, grouped convolution reduces the number of model parameters, thereby increasing the processing speed of the model.
In this embodiment, the number of grouped convolution layers, as well as the number of groups each grouped convolution layer divides the input feature map into, can be flexibly chosen by the user according to actual needs; for example, there may be 2 grouped convolution layers with 3 groups each.
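Continuing the sketch above and under the same assumptions, a residual block with two ReLU-activated grouped convolution layers might look as follows; the kernel sizes and the default number of groups are illustrative (the group count must divide the channel count).

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual path: grouped convolutions (ReLU) plus a third convolution;
    the block input is fused with the path output, then a fourth convolution."""
    def __init__(self, channels, groups=2):
        super().__init__()
        self.path = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),    # third convolution layer
        )
        self.out = nn.Conv2d(channels, channels, 1)  # fourth convolution layer

    def forward(self, x):
        # Local skip connection: fuse input with the residual path output.
        return self.out(x + self.path(x))
```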
The types of the convolution layer and the first, second, third, and fourth convolution layers in this embodiment are not limited; for example, they may be regular convolution (Conv), 1×1 pointwise convolution, or depthwise convolution, and can be flexibly adjusted according to actual needs.
Generally, the expressiveness of the image reconstruction model increases with the complexity of the global or local cascading; that is, the more cascade blocks and first convolution layers in the feature extraction network, or the more residual blocks and second convolution layers in a cascade block, the more expressive the image reconstruction model will be. However, the more complex the network structure, the slower the computation. Therefore, to improve processing speed while ensuring the reconstruction effect, the number of each module should not be too large.
In one implementation, the numbers of cascade blocks and first convolution layers in the feature extraction network may each be 3 to 5, the numbers of residual blocks and second convolution layers in a cascade block may each be 3 to 5, and the number of grouped convolution layers in a residual block may be 2 to 4. For example, referring to FIG. 6, the feature extraction network may be set to include 3 cascade blocks and 3 first convolution layers, each cascade block may include 3 residual blocks and 3 second convolution layers, and each residual block may include 2 grouped convolution layers.
In addition, the modules within a cascade block can be set to share parameters, i.e., the multiple residual blocks share parameters and the multiple second convolution layers share parameters, making the image reconstruction model even more lightweight and improving processing speed; however, sharing parameters incurs some loss of effect.
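One simple way to realize this sharing, sketched below under the same assumptions, is to reuse a single module instance at every position of the cascade block, so only one set of weights is stored and trained. Note that this sketch simplifies the local cascade (it stacks only the block input with the latest residual output, so the shared fusion layer can keep a fixed input width), which is an assumption rather than the structure of FIG. 4.

```python
import torch
import torch.nn as nn

class SharedCascadeBlock(nn.Module):
    """All residual-block positions reuse one ResidualBlock instance, and
    all second-convolution positions reuse one 1x1 convolution (simplified)."""
    def __init__(self, channels, num_res=3):
        super().__init__()
        self.num_res = num_res
        self.res = ResidualBlock(channels)                 # shared parameters
        self.fuse = nn.Conv2d(channels * 2, channels, 1)   # shared parameters

    def forward(self, x):
        out = x
        for _ in range(self.num_res):
            out = self.fuse(torch.cat([x, self.res(out)], dim=1))
        return out
```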
For example, referring to FIG. 7, the left and middle images are reconstructed images obtained with the image processing method provided by the embodiments of the present application, where the left image does not share parameters and the middle image shares parameters; the right image is a reconstructed image obtained with the bicubic interpolation (Bicubic) algorithm. As can be seen from the figure, the left and middle images are clearly sharper than the right image.
According to an exemplary embodiment of the present application, FIG. 8 shows a schematic flowchart of a portrait super-resolution reconstruction method. The detailed steps of the portrait super-resolution reconstruction method are described below.
Step S110: Perform key point detection on an image to be processed by using a pre-built reconstruction model, to obtain face key points.
Step S120: Perform super-resolution reconstruction according to the face key points and image features obtained based on the image to be processed, to obtain high-frequency image information.
Step S130: Restore the image to be processed by using the high-frequency image information, to obtain a super-resolution image corresponding to the image to be processed.
It should be understood that, in other embodiments of the present application, the order of some steps of the portrait super-resolution reconstruction method of this embodiment may be interchanged according to actual needs, or some steps may be omitted or deleted.
Super-resolution processing of a face image is processing that improves the definition of the face image. In the portrait super-resolution reconstruction method according to the embodiments of the present application, the image to be processed may be an image with low definition; in such images, the face is often of low definition, which hinders tasks such as image recognition and image matching. For example, the image to be processed may be a face image captured by a surveillance device, a face image obtained from a webpage screenshot, or a host's face image captured during a live broadcast.
The face key points are often the information most important to human visual cognition. If the definition of the face key points is effectively improved, the resulting face image will better meet the requirements of portrait reconstruction. Therefore, in the portrait super-resolution reconstruction method according to the embodiments of the present application, key point detection may first be performed on the image to be processed by using the constructed reconstruction model, to obtain the face key points.
The obtained face key points may include the left eye, the right eye, the nose, the mouth, and the chin contour. Based on these face key points, the outline of the face can be roughly sketched, and they include the eyes, which are the most important part of the face for human visual cognition.
Although face key points are very important for portrait super-resolution reconstruction, the reconstruction of other regions of the face image also needs to be considered in the reconstruction process. In this way, a super-resolution image that meets the requirements can be obtained, with local regions emphasized and the whole image also processed.
Therefore, in the portrait super-resolution reconstruction method according to the embodiments of the present application, on the one hand, face key points are obtained by key point detection; on the other hand, image feature extraction may be performed on the image to be processed at the same time, to obtain image features. Super-resolution reconstruction is then performed by combining the face key points with the obtained image features, to obtain the high-frequency image information.
High-frequency image information mainly reflects information at edges and contours in an image, whereas the parts within contours where the grayscale changes slowly constitute low-frequency information. High-frequency image information reflects regions of relatively rapid change and is therefore very important for image reconstruction.
In some embodiments of the present application, the image processing method according to the present application described in conjunction with FIG. 2 may be used to perform the super-resolution reconstruction of the image, so as to obtain the high-frequency image information.
The high-frequency image information obtained in the exemplary embodiments of the present application is local information of the face image. It needs to be restored into the image to be processed so that the image to be processed can be restored, obtaining the super-resolution image corresponding to the image to be processed.
In the portrait super-resolution reconstruction method provided by this embodiment, face key points are detected, high-frequency image information is obtained by using the face key points and the image features of the image, and the image to be processed is then restored by using the high-frequency image information. This improves the recognizability of the resulting super-resolution image and meets user needs in practical applications.
In practical applications, since the definition of the image to be processed is often low, the face key points obtained by key point detection on a low-definition image are not ideal, which in turn leads to a poor subsequent super-resolution image. Therefore, according to the embodiments of the present application, the above key point detection, super-resolution reconstruction, and restoration may include multiple rounds of iterative processing, and the above image to be processed may be an unprocessed image to be processed, or a super-resolution image obtained after the key point detection, super-resolution reconstruction, and restoration of the previous iteration.
Specifically, referring to FIG. 9, for an unprocessed image to be processed, LR Face (Low Resolution Face), a super-resolution image SR Face (Super Resolution Face) after one round of iterative processing can be obtained through the above key point detection, super-resolution reconstruction, and restoration. Then, on the basis of the obtained super-resolution image, the above key point detection, super-resolution reconstruction, and restoration are performed again, to obtain the super-resolution image after the second iteration. Following this logic, after multiple iterations, the final super-resolution image is obtained once certain requirements are met.
Referring also to FIG. 10, for the input image to be processed, Input, key point detection may first be performed on it to obtain the corresponding face key points Face Points 0; high-frequency image information is obtained based on Face Points 0 combined with the image features of Input, and the first-round super-resolution image Face SR1 is then obtained from the high-frequency image information and Input. On this basis, key point detection is performed on Face SR1 to obtain the corresponding face key points Face Points 1; high-frequency image information is obtained based on Face Points 1 combined with the image features of Face SR1, and the second-round super-resolution image Face SR2 is obtained from the high-frequency image information and Face SR1. Following this logic, after N iterations (where N is a preset number of iterations at which to stop, or the number of iterations after which the obtained image meets preset requirements), the final super-resolution image Face SR N can be obtained.
In the portrait super-resolution reconstruction method according to the embodiments of the present application, a recurrent approach is adopted, in which the image obtained in the previous round of processing serves as the detection object for the next round; multiple rounds of such processing can continuously improve the quality of the resulting super-resolution image.
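As a sketch only, the iterative scheme of FIG. 10 could be expressed as the following loop; `detect_keypoints`, `extract_features`, `reconstruct_hf`, and `restore` are hypothetical function names standing in for the model components described above.

```python
def iterative_super_resolution(model, image, num_iters=3):
    """Recurrent refinement: each round re-detects key points on the
    previous round's output and restores a sharper image (illustrative)."""
    sr = image  # round 0 operates on the unprocessed image
    for _ in range(num_iters):
        keypoints = model.detect_keypoints(sr)
        features = model.extract_features(sr)
        hf_info = model.reconstruct_hf(keypoints, features)
        sr = model.restore(sr, hf_info)
    return sr
```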
Correspondingly, the model parameters can be shared across the multiple rounds, making the model more lightweight and thereby supporting deployment on devices with weaker processing capability, such as mobile terminals. In addition, in the portrait super-resolution reconstruction method according to the embodiments of the present application, when processing resources are limited, besides sharing parameters, the network width, i.e., the number of feature extraction channels, may be preferentially increased within a certain range, rather than focusing on the network depth, i.e., the number of network layers; combined with the recurrent processing described above, this improves the recognition accuracy of the model.
As can be seen from the above, in the portrait super-resolution reconstruction method according to the embodiments of the present application, a plurality of face key points are detected. When restoring the image to be processed with the high-frequency image information obtained from the face key points and the image features, each face key point needs to be mapped to its accurate position in the image to be processed, so as to avoid key point offsets. Therefore, referring to FIG. 11, in the portrait super-resolution reconstruction method according to the embodiments of the present application, the above restoration may be implemented through the following steps:
Step S131: Process the image to be processed by using a pre-built portrait cognition model, and output position information of each of the face key points.
Step S132: Restore each corresponding face key point in the image to be processed according to the face key point and its corresponding position information and high-frequency image information, to obtain the super-resolution image corresponding to the image to be processed.
In some embodiments, a neural network model may be constructed, for example, a convolutional neural network (CNN) model. Multiple training samples may be collected, each containing a face image in which the face key points carry position information. The position information may be the relative position of each face key point within the face region, or the face region may be mapped into a coordinate system, with the coordinate values of the face key points in that coordinate system serving as their position information.
The constructed neural network model is trained with the training samples to obtain a portrait cognition model that meets the requirements. The position information of each face key point in the image to be processed can then be identified by using this portrait cognition model.
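A minimal sketch of one such training step, assuming a CNN that regresses K key-point coordinates under an MSE objective; the architecture, the number of key points, and the loss are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class LandmarkNet(nn.Module):
    """Lightweight CNN that regresses K face-key-point (x, y) positions."""
    def __init__(self, num_points=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, num_points * 2)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train_step(net, optimizer, images, kpt_coords):
    """One update: regress key-point coordinates against the labels."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(net(images), kpt_coords.flatten(1))
    loss.backward()
    optimizer.step()
    return loss.item()
```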
如此,在进行复原处理时,结合参阅图12,对于待处理图像LR Face中的人脸关键点,如左眼、右眼、鼻子、嘴巴、以及下巴轮廓,则可以基于根据人像认知模型所得到的人脸关键点的位置信息以及图像高频信息包含的对应人脸关键点的高频信息,对待处理图像中该人脸关键点进行复原处理,以得到最终的超分辨率图像SR Face。In this way, when performing the restoration process, referring to FIG. 12 , the key points of the face in the LR Face of the image to be processed, such as the left eye, right eye, nose, mouth, and chin contour, can be The position information of the obtained face key points and the high-frequency information of the corresponding face key points contained in the high-frequency information of the image are restored and processed to obtain the final super-resolution image SR Face.
在根据本申请实施例的人像超分辨率重建方法中,由于只需利用人像认知模型识别各个人脸关键点的位置信息,所需分析、处理的数据信息较少,因此,该人像认知模型可基于轻量级的网络模型所构建,以避免网络模型构建以及运行对处理资源的不必要的过多占用。In the portrait super-resolution reconstruction method according to the embodiment of the present application, since it is only necessary to use the portrait cognitive model to identify the position information of each key point of the human face, there is less data information to be analyzed and processed. Models can be built based on lightweight network models to avoid unnecessary excessive consumption of processing resources by network model building and running.
在根据本申请实施例的人像超分辨率重建方法中,采用人像认知模型得到各人脸关键点的位置信息的方式,可在进行复原时,准确地基于各人脸关键点的位置对待处理图像中的对应位置进行复原处理,避免出现对应人脸关键点的复原移位的现象出现。In the portrait super-resolution reconstruction method according to the embodiment of the present application, the method of obtaining the position information of each face key point by adopting the portrait cognitive model can accurately process the processing based on the position of each face key point during restoration. The corresponding position in the image is restored to avoid the phenomenon of restoration and displacement of the corresponding key points of the face.
此外,不同的人脸关键点在复原时其具体的复原要求往往不同,例如,对于眼睛而言,希望复原处理后的眼睛可以亮度较高,而对于下巴轮廓而言,可能希望复原处理后的下巴轮廓线条更清晰。In addition, the specific restoration requirements of different face key points are often different during restoration. For example, for the eyes, it is hoped that the restored eyes can be brighter, while for the chin contour, it may be desirable to restore the processed eyes. The chin contour is more defined.
因此,在根据本申请实施例的人像超分辨率重建方法中,基于上述考虑,在进行复原处理时,可首先获取各个人脸关键点对应的复原属性,该复原属性即为上述的复原处理的不同要求信息。再根据各人脸关键点的位置信息、复原属性以及图像高频信息,对待处理图像进行复原处理,得到对应的超分辨率图像。Therefore, in the portrait super-resolution reconstruction method according to the embodiment of the present application, based on the above considerations, when performing the restoration process, the restoration attribute corresponding to each face key point may be obtained first, and the restoration attribute is the value of the restoration process described above. Different request information. Then, according to the position information of each face key point, the restoration attribute and the high frequency information of the image, the restoration processing is performed on the image to be processed, and the corresponding super-resolution image is obtained.
在根据本申请实施例的人像超分辨率重建方法中,通过上述方式,可按区分各人脸关键点并基于其对应的位置信息、复原属性进行人脸关键点的独立恢复,不仅可以满足不同人脸关键点的复原的针对性需求,且重建模型在处理时,也可基于分组卷积的方式进行同步处理,可大幅减少处理的时间。In the portrait super-resolution reconstruction method according to the embodiment of the present application, through the above method, the face key points can be independently restored by distinguishing each face key point and based on its corresponding position information and restoration attributes, which can not only satisfy different needs The specific requirements for the restoration of key points of the face, and the reconstruction model can also be processed synchronously based on the group convolution method, which can greatly reduce the processing time.
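The following is a minimal sketch of restoring several key-point regions in one pass with a grouped convolution. The class name, channel counts, and the five-group layout (left eye, right eye, nose, mouth, chin contour) are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class GroupedKeypointRestorer(nn.Module):
    def __init__(self, channels_per_group: int = 16, num_groups: int = 5):
        super().__init__()
        total = channels_per_group * num_groups
        # groups=num_groups gives each key point its own filter bank,
        # so all key points are restored in one pass instead of K passes.
        self.restore = nn.Conv2d(total, total, kernel_size=3,
                                 padding=1, groups=num_groups)

    def forward(self, keypoint_feats: torch.Tensor) -> torch.Tensor:
        # keypoint_feats: (B, num_groups * channels_per_group, H, W),
        # channel blocks ordered by key point.
        return self.restore(keypoint_feats)

feats = torch.randn(1, 80, 32, 32)       # 5 groups x 16 channels each
out = GroupedKeypointRestorer()(feats)   # same shape, restored per group
```

Because the groups share no weights, each key point effectively gets its own small restoration branch while the whole batch is processed synchronously.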
In the portrait super-resolution reconstruction method according to the above exemplary embodiments of the present application, the super-resolution reconstruction is implemented with a reconstruction model constructed and trained in advance.
Next, the training process of the image reconstruction model according to an embodiment of the present application is described in detail with reference to FIG. 13.
The model training method provided in the embodiments of the present application can be applied to any electronic device with image processing capability, for example a server, a mobile terminal, a general-purpose computer, or a special-purpose computer.
Referring to FIG. 13, which shows a schematic flowchart of the image reconstruction model training method provided by an embodiment of the present application, the model training method may include the following steps:
S201: acquire training samples, where the training samples include low-resolution images and high-resolution images, the low-resolution images being obtained by down-sampling the high-resolution images.
The training samples here form a dataset. A large number of high-resolution images (for example, images whose resolution exceeds a preset value) can be collected as original samples; these may be pictures of various types or video frames, for example frames of a high-definition stream in a live-video scenario.
After the original samples are obtained, each high-resolution image is down-sampled in the same way to produce the training samples. The down-sampling may use bicubic interpolation or the like.
In addition, if denoising is to be performed together with super-resolution reconstruction, noise can be added to the low-resolution images in the training samples before they are fed into the model for training; the trained model will then perform both super-resolution reconstruction and denoising. A sketch of building such a training pair is shown below.
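The following is a minimal sketch of producing an (LR, HR) training pair as just described: bicubic down-sampling, plus optional Gaussian noise when the model should also learn denoising. The scale factor and noise level are assumed example values.

```python
import torch
import torch.nn.functional as F

def make_training_pair(hr: torch.Tensor, scale: int = 4,
                       noise_sigma: float = 0.0):
    # hr: (B, C, H, W) high-resolution image with values in [0, 1]
    lr = F.interpolate(hr, scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)
    if noise_sigma > 0:
        # Optional noise so the trained model also denoises.
        lr = (lr + noise_sigma * torch.randn_like(lr)).clamp(0.0, 1.0)
    return lr, hr

hr = torch.rand(1, 3, 256, 256)
lr, hr = make_training_pair(hr, scale=4, noise_sigma=0.01)
```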
S202: input the low-resolution image into a pre-built image reconstruction model, where the image reconstruction model includes a feature extraction network and a sub-pixel convolution layer.
S203: use the feature extraction network to perform multi-scale feature extraction on the low-resolution image and expand the image channels, obtaining a training feature map.
S204: use the sub-pixel convolution layer to enlarge the training feature map, obtaining a training reconstructed image. A sketch of this sub-pixel upscaling step follows.
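The following is a minimal sketch of the sub-pixel convolution step: a convolution expands the channels by r², and PixelShuffle rearranges those channels into an r-times-larger spatial grid. The channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

r = 2  # assumed upscale factor
upsample = nn.Sequential(
    nn.Conv2d(64, 3 * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),  # (B, 3*r*r, H, W) -> (B, 3, H*r, W*r)
)

feat = torch.randn(1, 64, 60, 60)  # training feature map
sr = upsample(feat)                # (1, 3, 120, 120) training reconstruction
```

The enlargement thus comes purely from rearranging learned channel values into pixel positions, which is why it is described as adjusting pixel positions rather than interpolating.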
It should be noted that the processing in steps S203 to S204 is similar to that in steps S102 to S103 and is not repeated here.
S205: perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and a preset objective function, obtaining the trained image reconstruction model.
In this embodiment, the objective function may be the L2 loss function, also called the mean square error (MSE) function, a type of regression loss. The L2 loss curve is smooth, continuous, and differentiable everywhere, which suits the gradient descent algorithm; moreover, the gradient shrinks as the error shrinks, which aids convergence, so the loss converges to its minimum quickly even with a fixed learning rate.
In this embodiment, back-propagation training can be performed on the image reconstruction model based on the training reconstructed image, the high-resolution image, and the L2 loss, adjusting the model parameters until a preset training-completion condition is reached and the trained image reconstruction model is obtained.
The training-completion condition may be that the number of iterations reaches a set value (for example, 2000), that the L2 loss converges to its minimum, or the like; it is not limited here and can be set according to actual needs. A minimal training-loop sketch is given below.
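The following is a minimal back-propagation training sketch for step S205. The `model` and the `pairs` iterable of (lr, hr) batches are assumed inputs; the learning rate is illustrative, and the 2000-iteration stop mirrors the example value above.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, pairs, max_iters: int = 2000, lr: float = 1e-4):
    criterion = nn.MSELoss()  # the L2 objective described above
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step, (low_res, high_res) in enumerate(pairs):
        sr = model(low_res)                 # training reconstructed image
        loss = criterion(sr, high_res)
        optimizer.zero_grad()
        loss.backward()                     # back-propagation
        optimizer.step()                    # parameter adjustment
        if step + 1 >= max_iters:           # preset completion condition
            break
    return model
```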
Generally, the deeper the stage of the feature extraction network, the fewer features it extracts. Therefore, after training, the trained image reconstruction model can be pruned according to requirements and test results, retaining the long cascades and removing the short ones, thereby reducing excessive intermediate skips and making the model more lightweight.
In one embodiment, the low-resolution image can be preprocessed before being input into the image reconstruction model; the preprocessing can be subtracting the image's mean value from the image itself. Accordingly, before step S202, the model training method may further include:
performing mean-subtraction on the low-resolution image to highlight its texture details.
The mean-subtraction may leave the foreground of the image untouched while subtracting the average pixel value of the background from each background pixel, thereby enhancing the contrast between background and foreground and highlighting texture details. A sketch of this preprocessing follows.
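The following is a minimal sketch of this mean-subtraction preprocessing. The foreground mask `fg_mask` (1 = foreground, 0 = background) is an assumed input; how the foreground is segmented is outside this sketch.

```python
import torch

def subtract_background_mean(img: torch.Tensor,
                             fg_mask: torch.Tensor) -> torch.Tensor:
    # img and fg_mask share spatial shape; fg_mask is 0/1 valued.
    bg = img * (1 - fg_mask)
    n_bg = (1 - fg_mask).sum().clamp(min=1)
    bg_mean = bg.sum() / n_bg              # mean over background pixels only
    # Shift only the background; foreground pixels pass through untouched,
    # which raises the foreground/background contrast.
    return img - (1 - fg_mask) * bg_mean
```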
In another embodiment, to let the feature extraction network extract more features, the preprocessing can also flip the image symmetrically before it is input into the model, then inversely flip the model outputs and average them, thereby reducing the bias of certain feature layers or parameters caused by anisotropy. Accordingly, before step S202, the model training method may further include:
performing flip-symmetry processing on the low-resolution image to obtain at least one processed low-resolution image.
The at least one processed low-resolution image is then input into the image reconstruction model, and the feature extraction network performs multi-scale feature extraction on it to obtain at least one auxiliary feature map; the auxiliary feature maps are then inversely flipped and averaged after the inverse flipping, yielding the training feature map.
For example, an n×n image is rotated clockwise three times, 90° each time, giving four n×n images. The four images are input into the image reconstruction model, and the feature extraction network outputs four auxiliary feature maps; the corresponding three maps are then rotated back counterclockwise by 90°, 180°, and 270°; finally, the four processed maps are averaged pixel-wise to obtain the final training feature map. A sketch of this procedure is shown below.
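The following is a minimal sketch of the rotation-symmetry trick above: rotate the input 0/90/180/270 degrees, run the feature extractor on each copy, rotate each output back, and average the four auxiliary maps. The `extract` callable stands in for the feature extraction network.

```python
import torch

def symmetric_features(extract, img: torch.Tensor) -> torch.Tensor:
    # img: (B, C, H, W); extract: feature extraction network (assumed).
    maps = []
    for k in range(4):
        rotated = torch.rot90(img, k=-k, dims=(-2, -1))  # k quarter-turns CW
        feat = extract(rotated)                          # auxiliary feature map
        # Undo the rotation so all maps are spatially aligned again.
        maps.append(torch.rot90(feat, k=k, dims=(-2, -1)))
    return torch.stack(maps).mean(dim=0)                 # pixel-wise average
```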
It should be pointed out that the mean-subtraction may be applied to the low-resolution image before the flip-symmetry processing, or the flip-symmetry processing may come first and the mean-subtraction second. This can be set flexibly according to actual needs and is not limited here.
In addition, in practical applications, to improve processing speed, a new model can be trained on the basis of a model that has already been trained. For example, when training 3x and 4x upscaling models, if a 2x upscaling model has already been trained, its parameters can serve as the initial parameters of the 3x and 4x models, and training proceeds from there. A sketch of this warm start follows.
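The following is a minimal sketch of warm-starting a 4x model from a trained 2x model. The two tiny models are placeholders sharing a backbone; only the sub-pixel head depends on the scale factor, so only the weights whose names and shapes match are copied.

```python
import torch
import torch.nn as nn

def make_model(scale: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1),                  # shared backbone
        nn.Conv2d(64, 3 * scale * scale, 3, padding=1),  # scale-specific head
        nn.PixelShuffle(scale),
    )

model_2x, model_4x = make_model(2), make_model(4)
# ... assume model_2x has already been trained to convergence here ...
src, dst = model_2x.state_dict(), model_4x.state_dict()
# Copy every parameter whose name and shape match; the mismatched
# scale-specific head keeps its fresh initialization.
dst.update({k: v for k, v in src.items()
            if k in dst and dst[k].shape == v.shape})
model_4x.load_state_dict(dst)
```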
According to an exemplary embodiment of the present application, a portrait super-resolution reconstruction model training method is also provided, for training the reconstruction model used in the portrait super-resolution reconstruction method of the foregoing exemplary embodiments. FIG. 14 shows a schematic flowchart of the portrait super-resolution reconstruction model training method provided by an embodiment of the present application.
As shown in the figure, the portrait super-resolution reconstruction model training method according to the present application includes:
Step S2100: acquire training samples and the target samples corresponding to the training samples;
Step S2200: perform key point detection on the training samples using a constructed generation network to obtain training key points;
Step S2300: perform super-resolution reconstruction and restoration based on the training key points and the training samples to obtain an output image;
Step S2400: compare the output image with the target sample, adjust the network parameters of the generation network based on the comparison result, and continue training until a first preset condition is satisfied, obtaining the reconstruction model.
In the portrait super-resolution reconstruction model training method provided by the embodiments of the present application, performing key point detection on the training samples and training the model on the combination of the training key points and the image features of the training samples improves the reconstruction accuracy of the resulting reconstruction model.
In some embodiments, multiple training samples are collected in advance, each of which may be a sample image containing a low-definition face image. The target sample corresponding to a training sample is the one that meets the requirements, that is, the high-definition sample one hopes to obtain after processing the training sample.
In some embodiments, the pre-built generation network may be a recurrent network; the process of using the generation network to perform key point detection, super-resolution reconstruction, and restoration on the training samples is as described above. After processing, the generation network outputs the output image corresponding to each training sample.
The target sample serves as the benchmark for the processing quality of the generation network. By comparing the difference between the output image and the target sample and continually training the generation network according to the comparison result, the reconstruction model is obtained once the difference between the output image and the target sample falls to a level that meets a certain requirement.
In some embodiments, the samples input to the generation network can be preprocessed, for example by mean-subtraction, to bring out the details of the image texture and thereby improve subsequent processing and recognition.
On this basis, the preprocessed samples can also be flipped symmetrically before being input into the generation network; the outputs of each network layer can then be inversely flipped and averaged, which reduces the bias of certain network layers or parameters caused by anisotropy.
During training and testing of the generation network, the network can be pruned according to requirements and test results so as to keep the earlier iterations that influence the result most, and training then continues on that basis. This improves the reconstruction accuracy of the resulting generation network, and the peak signal-to-noise ratio and structural similarity of the subsequently processed images can also improve considerably.
In some embodiments of the present application, a loss function may be constructed to supervise the training of the generation network.
Referring to FIG. 15, the above step S2400 of the portrait super-resolution reconstruction model training method according to the present application can be implemented as follows:
Step S2410: construct a first loss function based on the difference between the pixel information of the output image and that of the target sample;
Step S2420: construct a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample;
Step S2430: compare the output image with the target sample, adjust the network parameters of the generation network based on the comparison result, and continue training until the weighted value of the first loss function and the second loss function satisfies the first preset condition, obtaining the reconstruction model.
In the embodiments according to the present application, the first and second loss functions can be constructed to evaluate the training of the generation network comprehensively. The first loss function evaluates from the perspective of pixel differences between images. In addition, since the images in this embodiment have undergone face key point detection, and face key points are especially important for portrait reconstruction, a second loss function built from the difference information between face key points is added.
The first loss function characterizes the overall pixel-level Euclidean distance between the output image of the generation network and the target sample (i.e., the desired output), while the second loss function characterizes the Euclidean distance between each face key point detected by the generation network and the corresponding key point in the target sample.
The first and second loss functions are combined with weights to jointly serve as the loss function of the generation network. During training, comparing the output image with the target sample amounts to evaluating this combined loss; the reconstruction model is obtained when the resulting value satisfies the first preset condition. The first preset condition may be that the loss value no longer decreases and has converged, or that it falls below a preset value; alternatively, training may stop and the reconstruction model be taken once the number of iterations reaches a preset maximum. A sketch of the weighted combination is given below.
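The following is a minimal sketch of the combined objective: a pixel-level term plus a key-point term, weighted and summed. The weights and the key-point tensor layout are illustrative assumptions, and mean squared error stands in for the squared Euclidean distance.

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, kp_sr, kp_hr, w_pixel=1.0, w_kp=0.1):
    # sr, hr: output image and target sample, (B, C, H, W)
    # kp_sr, kp_hr: detected / target key-point coordinates, (B, K, 2)
    pixel_term = F.mse_loss(sr, hr)      # first loss: pixel-level distance
    kp_term = F.mse_loss(kp_sr, kp_hr)   # second loss: key-point distance
    return w_pixel * pixel_term + w_kp * kp_term
```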
According to the embodiments of the present application, supervising the training of the reconstruction model with a first loss function built from pixel-information differences and a second loss function built from the differences between face key points improves the recognizability of the super-resolution images obtained when the reconstruction model is subsequently applied.
According to the embodiments of the present application, applying the reconstruction model pre-built from the generation network as described above to the reconstruction of the to-be-processed image improves the recognizability of the resulting super-resolution image.
As can be seen from the above, the reconstruction model in the portrait super-resolution reconstruction model training method according to the embodiments of the present application contains a generation network, which is built through pre-training and can process low-definition images to output the corresponding super-resolution images.
In a possible implementation according to the present application, to further improve the reconstruction effect of the resulting reconstruction model, the reconstruction model may also include a discriminator, which can be used to supervise the training of the generation network. In this implementation, the generation network is therefore a generation network obtained by training on the training samples under the supervision of a trained discriminator.
In some implementations, the portrait super-resolution reconstruction model training method according to the present application further includes the following steps:
constructing a discriminator and using the discriminator to discriminate the output image;
adjusting the parameters of the discriminator according to the obtained discrimination results until a second preset condition is satisfied, obtaining the trained discriminator.
In the embodiments according to the present application, the main principle of the discriminator is to judge real images (i.e., high-resolution images that meet the requirements) as real as far as possible (for example, outputting a discrimination result of 1) and to judge the output images of the generation network as fake as far as possible (for example, outputting a discrimination result of 0). In this way, the generation network is supervised and continually trained until, eventually, the discriminator judges the generation network's output images as real. That is, the discriminator acts as a supervisor that keeps optimizing the training of the generation network.
When using the discriminator as a supervisor to optimize the generation network, the discriminator must first be trained and optimized so that it can judge accurately. In this embodiment, a loss function for the discriminator can be built in advance from the discriminator's judgment of the output images of the generation network and its judgment of the target samples.
Training the discriminator is the process of minimizing this loss function. When the loss value no longer decreases and has converged, the training of the discriminator is deemed to satisfy the second preset condition, the trained discriminator is obtained, and the discriminator can then be fixed. A sketch of such a discriminator objective follows.
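The following is a minimal sketch of the discriminator objective described above: push real high-resolution samples toward label 1 and generated outputs toward label 0. The `disc` callable is an assumed discriminator returning raw logits.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_hr, fake_sr):
    real_logits = disc(real_hr)
    fake_logits = disc(fake_sr.detach())  # don't backprop into the generator
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))   # real images -> 1
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))  # generated images -> 0
    return loss_real + loss_fake
```

Minimizing this sum until it converges corresponds to the second preset condition, after which the discriminator is fixed and used to supervise the generation network.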
In the embodiments according to the present application, adding a discriminator to the reconstruction model so as to form an adversarial network composed of the discriminator and the generation network can further improve the reconstruction effect of the resulting reconstruction model.
In a possible implementation where a discriminator is added to the reconstruction model to form an adversarial network, the discriminator's judgments can be incorporated into the training and adjustment of the generation network.
Referring to FIG. 16, the above step S2400 in the portrait super-resolution reconstruction model training method according to the present application may include the following sub-steps:
Step S2410': input the output image into the trained discriminator to obtain discrimination information;
Step S2420': compare the output image with the target sample to obtain a comparison result;
Step S2430': adjust the network parameters of the generation network according to the discrimination information and the comparison result, and continue training until the first preset condition is satisfied, obtaining the reconstruction model.
According to the above embodiment, when a discriminator is added, the difference between the output image and the target sample can be combined with the discriminator's judgment of the output image to train and adjust the generation network.
In some embodiments according to the present application, the loss function can be constructed, and the reconstruction model trained with it, as follows:
constructing a first loss function based on the difference between the pixel information of the output image and that of the target sample;
constructing a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample;
constructing a third loss function based on the discriminator's judgment of the output image, and constructing a fourth loss function based on the image difference between the output image and the target sample as obtained by the pre-built portrait cognitive model;
adjusting the network parameters of the generation network according to the discrimination information and the comparison result and continuing training until the weighted value of the first, second, third, and fourth loss functions satisfies the first preset condition, obtaining the reconstruction model.
Here, the influence of the difference between the output image and the target sample on the adjustment of the generation network is captured by the first and second loss functions, and the influence of the discriminator's judgment of the output image on the training adjustment is captured by the third loss function. In addition, to strengthen the human-eye recognizability of the resulting super-resolution images, a fourth loss function built from the image difference between the output image and the target sample as seen by the portrait cognitive model can be added.
In this embodiment, the first loss function is built from the difference between the pixel information of the output image and that of the corresponding target sample, and the second loss function from the differences between the face key points of the output image and the corresponding key points of the target sample. Since the purpose of constructing the discriminator to supervise the generation network is precisely that the generation network's output images should eventually be judged real by the discriminator, the third loss function is built from the discriminator's judgment of the output image. The fourth loss function is built from the facial-feature differences between the output image and the target sample as obtained by the portrait cognitive model.
The final loss function of the generation network is obtained as a weighted combination of the first, second, third, and fourth loss functions, sketched below.
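The following is a minimal sketch of the four-term weighted generator objective. The `disc` and `perceive` callables (the discriminator and the portrait cognitive model used as a feature extractor) are assumed inputs, and all weights are illustrative.

```python
import torch
import torch.nn.functional as F

def full_generator_loss(sr, hr, kp_sr, kp_hr, disc, perceive,
                        w1=1.0, w2=0.1, w3=1e-3, w4=0.01):
    l1 = F.mse_loss(sr, hr)                      # 1st: pixel difference
    l2 = F.mse_loss(kp_sr, kp_hr)                # 2nd: key-point difference
    logits = disc(sr)                            # 3rd: adversarial term; the
    l3 = F.binary_cross_entropy_with_logits(     # generator wants "real"
        logits, torch.ones_like(logits))
    l4 = F.mse_loss(perceive(sr), perceive(hr))  # 4th: cognitive-model features
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4
```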
Therefore, in the portrait super-resolution reconstruction model training method according to the embodiments of the present application, the network parameters of the generation network can be adjusted according to the discriminator's judgment and the comparison result between the output image and the target sample, and training then continues. In essence, training consists of adjusting the network and evaluating the combined loss until the value obtained by weighting the first, second, third, and fourth loss functions satisfies the first preset condition, at which point the trained reconstruction model is obtained.
In this embodiment, adding the discriminator to supervise the training of the generation network improves the human-eye recognizability of the output images, yielding sharper results. See FIGS. 17(a) to 17(c): FIG. 17(a) shows an image obtained by conventional interpolation, FIG. 17(b) an image obtained by the implementation of this application without the discriminator, and FIG. 17(c) an image obtained by the implementation with the discriminator. As can be seen, the images produced by the solution of this application are clearly sharper and better than those of conventional interpolation, and among them, the image produced with the discriminator is, to the human eye, clearer than the one produced without it.
Now refer to FIG. 18, which shows a schematic block diagram of the image processing apparatus 100 provided by an embodiment of the present application. The image processing apparatus 100 is applied to a mobile terminal and includes an image acquisition module 110, a first execution module 120, and a second execution module 130.
The image acquisition module 110 may be configured to acquire the image to be processed.
The first execution module 120 may be configured to input the to-be-processed image into the image reconstruction model, and to use the feature extraction network of the image reconstruction model to perform multi-scale feature extraction on the image and expand the image channels, obtaining a reconstructed feature map.
In an optional implementation, the feature extraction network includes a convolution layer, multiple cascade blocks, and multiple first convolution layers, the cascade blocks and first convolution layers being arranged alternately; the feature extraction network adopts a global cascade structure;
the first execution module 120 may specifically be configured to: input the to-be-processed image into the convolution layer for convolution to obtain an initial feature map; use the initial feature map as the input of the first cascade block and the output of the (N-1)-th first convolution layer as the input of the N-th cascade block, performing multi-scale feature extraction with the cascade blocks and outputting intermediate feature maps; channel-concatenate the initial feature map with the intermediate feature maps output by every cascade block preceding the N-th first convolution layer, and input the concatenation into the N-th first convolution layer for convolution; and take the output of the last first convolution layer as the reconstructed feature map.
In an optional implementation, a cascade block includes multiple residual blocks and multiple second convolution layers, the residual blocks and second convolution layers being arranged alternately; the cascade block adopts a local cascade structure;
the first execution module 120 may perform multi-scale feature extraction with the cascade block and output an intermediate feature map by: using the input of the cascade block as the input of the first residual block and the output of the (N-1)-th second convolution layer as the input of the N-th residual block, learning residual features with the residual blocks to obtain residual feature maps; channel-concatenating the input of the cascade block with the outputs of every residual block preceding the N-th second convolution layer and inputting the concatenation into the N-th second convolution layer for convolution; and taking the output of the last second convolution layer as the intermediate feature map.
In an optional implementation, a residual block includes a grouped convolution layer, a third convolution layer, and a fourth convolution layer; the grouped convolution layer uses the ReLU activation function; the grouped convolution layer and the third convolution layer are connected to form a residual path; and the residual block adopts a local skip-connection structure;
the first execution module 120 may learn residual features with the residual block and obtain a residual feature map by: using the input of the residual block as the input of the grouped convolution layer and extracting features through the residual path; and fusing the input of the residual block with the output of the third convolution layer, inputting the fusion into the fourth convolution layer for convolution, and outputting the residual feature map. A sketch of such a residual block follows.
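The following is a minimal sketch of the residual block just described: a grouped convolution with ReLU and a third convolution form the residual path, a local skip connection fuses the block input back in by addition, and a fourth convolution produces the residual feature map. The channel count, group count, and additive fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.grouped = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),       # grouped conv layer with ReLU
        )
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv4 = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.conv3(self.grouped(x))  # the residual path
        fused = x + residual                    # local skip connection
        return self.conv4(fused)                # residual feature map

out = ResidualBlock()(torch.randn(1, 64, 48, 48))  # same shape as input
```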
The second execution module 130 may be configured to enlarge the reconstructed feature map with the sub-pixel convolution layer of the image reconstruction model, obtaining the reconstructed image.
In an optional implementation, the second execution module 130 may specifically be configured to: adjust the pixel positions in the reconstructed feature map with the sub-pixel convolution layer, obtaining the reconstructed image.
Referring to FIG. 19, which shows a schematic block diagram of the image reconstruction model training apparatus 200 provided by an embodiment of the present application: the model training apparatus 200 is applied to any electronic device with image processing capability and may include a sample acquisition module 210, a first processing module 220, a second processing module 230, a third processing module 240, and a fourth processing module 250.
The sample acquisition module 210 may be configured to acquire training samples, the training samples including low-resolution images and high-resolution images, the low-resolution images being obtained by down-sampling the high-resolution images.
The first processing module 220 may be configured to input the low-resolution image into the pre-built image reconstruction model, which includes a feature extraction network and a sub-pixel convolution layer.
The second processing module 230 may be configured to perform multi-scale feature extraction on the low-resolution image with the feature extraction network and expand the image channels, obtaining a training feature map.
The third processing module 240 may be configured to enlarge the training feature map with the sub-pixel convolution layer, obtaining a training reconstructed image.
The fourth processing module 250 may be configured to perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and the preset objective function, obtaining the trained image reconstruction model.
In an optional implementation, the objective function is the L2 loss function;
the fourth processing module 250 may specifically be configured to: perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and the L2 loss, adjusting the model parameters until the preset training-completion condition is reached, obtaining the trained image reconstruction model.
In an optional implementation, the first processing module 220 may also be configured to: prune the trained image reconstruction model so as to retain the long cascades and remove the short ones.
In an optional implementation, the first processing module 220 may also be configured to: perform flip-symmetry processing on the low-resolution image to obtain at least one processed low-resolution image.
The second processing module 230 may specifically be configured to: input the at least one processed low-resolution image into the image reconstruction model.
The third processing module 240 may specifically be configured to: perform multi-scale feature extraction on the at least one processed low-resolution image with the feature extraction network to obtain at least one auxiliary feature map; and inversely flip the at least one auxiliary feature map and average the results after the inverse flipping, obtaining the training feature map.
Those skilled in the art will clearly appreciate that, for convenience and brevity of description, the specific working processes of the image processing apparatus 100 and the model training apparatus 200 described above can refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Referring to FIG. 20, which shows a schematic block diagram of the electronic device 10 provided by an embodiment of the present application: the electronic device 10 may be a mobile terminal that executes the above image processing method, or any electronic device with image processing capability that executes the above model training method. The electronic device 10 includes a processor 11, a memory 12, and a bus 13, the processor 11 being connected to the memory 12 through the bus 13.
The memory 12 is used to store a program, for example the image processing apparatus 100 shown in FIG. 18 or the model training apparatus 200 shown in FIG. 19. Taking the image processing apparatus 100 as an example, it includes at least one software functional module that can be stored in the memory 12 in the form of software or firmware; after receiving an execution instruction, the processor 11 executes the program to implement the image processing method disclosed in the above embodiments.
The memory 12 may include high-speed random access memory (RAM) and may also include non-volatile memory (NVM).
The processor 11 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above methods can be completed by integrated logic circuits of the hardware in the processor 11 or by instructions in the form of software. The processor 11 may be a general-purpose processor, including a central processing unit (CPU), a microcontroller unit (MCU), a complex programmable logic device (CPLD), a field programmable gate array (FPGA), an embedded ARM chip, or the like.
Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when executed by the processor 11, the computer program implements the image processing method or the model training method disclosed in the above embodiments.
In summary, in the image processing and model training methods, apparatuses, electronic device, and storage medium provided by the embodiments of the present application, the image to be processed is acquired and input into an image reconstruction model comprising a feature extraction network and a sub-pixel convolution layer; the feature extraction network first performs multi-scale feature extraction on the image and expands the image channels to obtain a reconstructed feature map, and the sub-pixel convolution layer then enlarges the reconstructed feature map to obtain the reconstructed image. This improves processing speed while preserving the reconstruction effect.
Please refer to FIG. 21, a schematic diagram of exemplary components of an electronic device provided by an embodiment of the present application. The electronic device may include a storage medium 2110, a processor 2120, machine-executable instructions 2130 (which may be the portrait super-resolution reconstruction apparatus 131 or the portrait super-resolution reconstruction model training apparatus 132 according to the present application), and a communication interface 140. In this embodiment, the storage medium 2110 and the processor 2120 are both located in the electronic device and arranged separately. It should be understood, however, that the storage medium 2110 may also be independent of the electronic device and accessed by the processor 2120 through a bus interface. Alternatively, the storage medium 2110 may be integrated into the processor 2120, for example as a cache and/or general-purpose registers.
The machine-executable instructions 2130 can be understood as the electronic device described in FIG. 21 or its processor 2120, or as software functional modules that, independently of the electronic device or processor 2120 of FIG. 21, implement the above portrait super-resolution reconstruction method or portrait super-resolution reconstruction model training method under the control of the electronic device.
As shown in FIG. 22, the above portrait super-resolution reconstruction apparatus 131 may include a detection module 1311, a processing module 1312, and a restoration module 1313. The functions of each functional module of the portrait super-resolution reconstruction apparatus 131 are elaborated below.
The detection module 1311 may be configured to perform key point detection on the image to be processed with the pre-built reconstruction model, obtaining the face key points;
it can be understood that the detection module 1311 may be configured to perform the above step S110, and for its detailed implementation reference can be made to the content related to step S110 above.
The processing module 1312 may be configured to perform super-resolution reconstruction according to the face key points and the image features obtained from the to-be-processed image, obtaining the image high-frequency information;
it can be understood that the processing module 1312 may be configured to perform the above step S120, and for its detailed implementation reference can be made to the content related to step S120 above.
The restoration module 1313 may be configured to restore the to-be-processed image with the image high-frequency information, obtaining the super-resolution image corresponding to the to-be-processed image.
It can be understood that the restoration module 1313 may be configured to perform the above step S130, and for its detailed implementation reference can be made to the content related to step S130 above.
The portrait super-resolution reconstruction apparatus may further include the image processing apparatus described with reference to FIG. 18, the image processing apparatus being configured to perform the super-resolution reconstruction.
In a possible implementation, the key point detection, super-resolution reconstruction, and restoration comprise multiple rounds of iterative processing, and the image to be processed is either an unprocessed input image or the super-resolution image obtained after the key point detection, super-resolution reconstruction, and restoration of the previous round.
In a possible implementation, there are multiple face key points, and the restoration module 1313 can obtain the super-resolution image as follows:
processing the to-be-processed image with the pre-built portrait cognitive model and outputting the position information of each face key point;
restoring the to-be-processed image based on the position information of each face key point and the image high-frequency information, obtaining the super-resolution image corresponding to the to-be-processed image.
In a possible implementation, the restoration module 1313 can obtain the super-resolution image based on the position information of each face key point and the image high-frequency information as follows:
acquiring the restoration attribute corresponding to each face key point;
restoring the corresponding face key points in the to-be-processed image according to each face key point together with its position information, the image high-frequency information, and the restoration attribute.
In a possible implementation, the reconstruction model includes a discriminator and a generation network, the generation network being obtained by training on the training samples under the supervision of the trained discriminator.
In a possible implementation, the face key points include the left eye, right eye, nose, mouth, and chin contour.
For descriptions of the processing flow of each module in the apparatus and the interaction flow between modules, reference can be made to the relevant descriptions in the above method embodiments, which are not detailed here. As shown in FIG. 23, the above portrait super-resolution reconstruction model training apparatus 132 may include an acquisition module 1321, a key point obtaining module 1322, an output image obtaining module 1323, and a training module 1324. The functions of each functional module of the portrait super-resolution reconstruction model training apparatus 132 are elaborated below.
The acquisition module 1321 may be configured to acquire training samples and the target samples corresponding to the training samples;
it can be understood that the acquisition module 1321 may be configured to perform the above step S2100, and for its detailed implementation reference can be made to the content related to step S2100 above.
The key point obtaining module 1322 may be configured to perform key point detection on the training samples with the constructed generation network, obtaining the training key points;
it can be understood that the key point obtaining module 1322 may be configured to perform the above step S2200, and for its detailed implementation reference can be made to the content related to step S2200 above.
The output image obtaining module 1323 may be configured to perform super-resolution reconstruction and restoration based on the training key points and the training samples, obtaining the output image;
it can be understood that the output image obtaining module 1323 may be configured to perform the above step S2300, and for its detailed implementation reference can be made to the content related to step S2300 above.
The training module 1324 may be configured to compare the output image with the target sample, adjust the network parameters of the generation network based on the comparison result, and continue training until the first preset condition is satisfied, obtaining the reconstruction model.
It can be understood that the training module 1324 may be configured to perform the above step S2400, and for its detailed implementation reference can be made to the content related to step S2400 above.
In a possible implementation, the training module 1324 may be configured to obtain the reconstruction model based on the comparison result between the output image and the target sample as follows:
constructing a first loss function based on the difference between the pixel information of the output image and that of the target sample;
constructing a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample;
comparing the output image with the target sample, adjusting the network parameters of the generation network based on the comparison result, and continuing training until the weighted value of the first loss function and the second loss function satisfies the first preset condition, obtaining the reconstruction model.
在一种可能的实现方式中,所述重建模型还包括判别器,所述判别器用于监督所述生成网络的训练,人像超分辨率重建模型训练装置132还包括构建模块,该构建模块用于:In a possible implementation manner, the reconstruction model further includes a discriminator, and the discriminator is used to supervise the training of the generation network, and the portrait super-resolution reconstruction model training device 132 further includes a building module, and the building module is used for :
构建判别器,利用所述判别器对所述输出图像以及所述输出图像对应的目标样本进行判别处理;constructing a discriminator, and using the discriminator to discriminate the output image and the target sample corresponding to the output image;
根据得到的判别结果对所述判别器进行参数调整,直至满足第二预设条件时得到训练好的判别器。According to the obtained discrimination result, the parameters of the discriminator are adjusted until the trained discriminator is obtained when the second preset condition is satisfied.
In a possible implementation, the training module 1324 may obtain the reconstruction model in the following manner:
inputting the output image into the trained discriminator to obtain discrimination information;
comparing the output image with the target sample to obtain a comparison result; and
adjusting the network parameters of the generation network according to the discrimination information and the comparison result, and continuing training until the first preset condition is met, thereby obtaining the reconstruction model.
In a possible implementation, the training module 1324 may be configured to construct the reconstruction model based on the discrimination information and the comparison result in the following manner:
constructing a first loss function based on the difference between the pixel information of the output image and the pixel information of the target sample;
constructing a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample;
constructing a third loss function based on the discrimination information of the discriminator for the output image, and constructing a fourth loss function based on the image difference between the output image and the target sample obtained by a pre-built portrait cognition model; and
adjusting the network parameters of the generation network according to the discrimination information and the comparison result, and continuing training until the weighted function value of the first, second, third, and fourth loss functions satisfies the first preset condition, thereby obtaining the reconstruction model. A sketch of this four-term generator objective follows.
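The following is a minimal sketch of the four weighted loss terms described above. It treats the third loss as a standard adversarial loss and the fourth as a perceptual loss computed in the feature space of the portrait cognition model; those interpretations, and all weight values, are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def generator_loss(output_img, target_img, output_lm, target_lm,
                   discriminator, perceptual_net,
                   w1=1.0, w2=0.1, w3=0.01, w4=0.05):
    l1 = F.l1_loss(output_img, target_img)                  # first loss: pixels
    l2 = F.mse_loss(output_lm, target_lm)                   # second loss: key points
    fake_logits = discriminator(output_img)                 # discrimination information
    l3 = F.binary_cross_entropy_with_logits(                # third loss: adversarial
        fake_logits, torch.ones_like(fake_logits))
    l4 = F.l1_loss(perceptual_net(output_img),              # fourth loss: perceptual
                   perceptual_net(target_img))
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4
```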
For a description of the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the relevant descriptions in the foregoing method embodiments, which will not be repeated here.
Further, an embodiment of the present application also provides a computer-readable storage medium storing machine-executable instructions 130. When the machine-executable instructions 130 are executed, the portrait super-resolution reconstruction method or the portrait super-resolution reconstruction model training method provided by the foregoing embodiments is implemented.
Specifically, the computer-readable storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the computer-readable storage medium is run, the above portrait super-resolution reconstruction method or portrait super-resolution reconstruction model training method can be executed. For the processes involved when the machine-executable instructions in the computer-readable storage medium are executed, reference may be made to the relevant descriptions in the foregoing method embodiments, which will not be repeated here.
In summary, in the portrait super-resolution reconstruction method, the portrait super-resolution reconstruction model training method, the apparatus, the electronic device, and the readable storage medium provided by the embodiments of the present application, key point detection is performed on the image to be processed by using a pre-built reconstruction model to obtain face key points; super-resolution reconstruction processing is then performed according to the face key points and the image features obtained based on the image to be processed, to obtain image high-frequency information; and the image high-frequency information is used to perform restoration processing on the image to be processed, to obtain a super-resolution image corresponding to the image to be processed. In the present application, face key point detection is combined with face restoration to realize super-resolution reconstruction of an image, which improves the recognizability of the resulting super-resolution image and meets user needs in practical applications.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement that can readily occur to those skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Industrial Applicability
The present application provides an image processing method, a portrait super-resolution reconstruction method, an image reconstruction model training method, a portrait super-resolution reconstruction model training method, and related apparatuses, electronic devices, and storage media. An image to be processed is acquired and input into an image reconstruction model comprising a feature extraction network and a sub-pixel convolution layer. The feature extraction network first performs multi-scale feature extraction on the image to be processed and expands the image channels to obtain a reconstruction feature map, and the sub-pixel convolution layer then enlarges the reconstruction feature map to obtain a reconstructed image. Since the feature extraction network can extract multi-scale features and expand image channels, a good reconstruction effect can be obtained without increasing the network depth; meanwhile, because the sub-pixel convolution layer at the end of the model performs the image enlargement, the feature extraction network processes small-sized images, which greatly reduces the amount of computation and the number of parameters, thereby improving the processing speed while ensuring the reconstruction effect.
In addition, it can be understood that the image processing method, the portrait super-resolution reconstruction method, the image reconstruction model training method, the portrait super-resolution reconstruction model training method, and the related apparatuses, electronic devices, and storage media according to the present application are reproducible and can be used in a variety of industrial applications. For example, they can be used in any apparatus that needs to perform image super-resolution reconstruction on low-resolution images or image sequences.

Claims (31)

  1. An image processing method, characterized in that the image processing method comprises:
    acquiring an image to be processed;
    inputting the image to be processed into an image reconstruction model, and performing multi-scale feature extraction and image-channel expansion on the image to be processed by using a feature extraction network of the image reconstruction model to obtain a reconstruction feature map; and
    enlarging the reconstruction feature map by using a sub-pixel convolution layer of the image reconstruction model to obtain a reconstructed image.
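The following is a minimal PyTorch-style sketch of the two-stage pipeline in claim 1: a feature extraction network that widens the channels, followed by a sub-pixel convolution layer (PixelShuffle) that enlarges the feature map. The feature extractor shown here is a placeholder; the cascade structure of claims 2, 4, and 6 is sketched after those claims.

```python
import torch.nn as nn

class ImageReconstructionModel(nn.Module):
    def __init__(self, scale=2, channels=64):
        super().__init__()
        # Placeholder feature extraction network: expands 3 input channels to
        # 3 * scale**2 channels so the sub-pixel layer can rearrange them.
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
        )
        # Sub-pixel convolution layer: moves channel values into pixel
        # positions, enlarging height and width by `scale`.
        self.pixel_shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.pixel_shuffle(self.features(x))  # reconstructed image
```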
  2. The image processing method according to claim 1, characterized in that the feature extraction network comprises a convolution layer, a plurality of cascade blocks, and a plurality of first convolution layers, the plurality of cascade blocks and the plurality of first convolution layers are arranged alternately, and the feature extraction network adopts a global cascade structure;
    the step of performing multi-scale feature extraction on the image to be processed by using the feature extraction network of the image reconstruction model to obtain the reconstruction feature map comprises:
    inputting the image to be processed into the convolution layer for convolution processing to obtain an initial feature map;
    taking the initial feature map as the input of the first cascade block and the output of the (N-1)-th first convolution layer as the input of the N-th cascade block, and performing multi-scale feature extraction by using the cascade blocks to output intermediate feature maps;
    channel-concatenating the initial feature map with the intermediate feature map output by each cascade block preceding the N-th first convolution layer, and inputting the concatenated result into the N-th first convolution layer for convolution processing; and
    taking the output of the last first convolution layer as the reconstruction feature map.
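Below is a minimal sketch of the global cascade structure of claim 2 with three cascade blocks. `CascadeBlock` is defined in the sketch after claim 4; the 1x1 fusion convolutions and the omission of the final channel expansion for the sub-pixel layer are assumptions about details the claim leaves open.

```python
import torch
import torch.nn as nn

class FeatureExtractionNetwork(nn.Module):
    def __init__(self, channels=64, n_blocks=3):
        super().__init__()
        self.entry = nn.Conv2d(3, channels, 3, padding=1)  # initial feature map
        self.blocks = nn.ModuleList(CascadeBlock(channels) for _ in range(n_blocks))
        # The N-th first convolution fuses the initial feature map plus the
        # intermediate maps of the first N cascade blocks: (N + 1) * channels in.
        self.first_convs = nn.ModuleList(
            nn.Conv2d((i + 2) * channels, channels, 1) for i in range(n_blocks))

    def forward(self, x):
        feat = self.entry(x)
        kept = [feat]              # global cascade: keep every earlier output
        out = feat                 # the first cascade block takes the initial map
        for block, conv in zip(self.blocks, self.first_convs):
            kept.append(block(out))             # intermediate feature map
            out = conv(torch.cat(kept, dim=1))  # channel concat + convolution
        return out                 # output of the last first convolution layer
```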
  3. The image processing method according to claim 2, characterized in that the number of the cascade blocks is 3 to 5, and the number of the first convolution layers is 3 to 5.
  4. The image processing method according to claim 2 or 3, characterized in that the cascade block comprises a plurality of residual blocks and a plurality of second convolution layers, the plurality of residual blocks and the plurality of second convolution layers are arranged alternately, and the cascade block adopts a local cascade structure;
    the step of performing multi-scale feature extraction by using the cascade block and outputting the intermediate feature map comprises:
    taking the input of the cascade block as the input of the first residual block and the output of the (N-1)-th second convolution layer as the input of the N-th residual block, and learning residual features by using the residual blocks to obtain residual feature maps;
    channel-concatenating the input of the cascade block with the output of each residual block preceding the N-th second convolution layer, and inputting the concatenated result into the N-th second convolution layer for convolution processing; and
    taking the output of the last second convolution layer as the intermediate feature map.
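A minimal sketch of the local cascade structure of claim 4 follows; it mirrors the global structure at a smaller scale, with `ResidualBlock` defined in the sketch after claim 6. The 1x1 fusion convolutions are again an assumption.

```python
import torch
import torch.nn as nn

class CascadeBlock(nn.Module):
    def __init__(self, channels=64, n_res=3):
        super().__init__()
        self.res_blocks = nn.ModuleList(ResidualBlock(channels) for _ in range(n_res))
        self.second_convs = nn.ModuleList(
            nn.Conv2d((i + 2) * channels, channels, 1) for i in range(n_res))

    def forward(self, x):
        kept = [x]                 # local cascade: keep the block input as well
        out = x
        for res, conv in zip(self.res_blocks, self.second_convs):
            kept.append(res(out))               # residual feature map
            out = conv(torch.cat(kept, dim=1))  # channel concat + convolution
        return out                 # intermediate feature map
```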
  5. The image processing method according to claim 4, characterized in that the number of the residual blocks is 3 to 5, and the number of the second convolution layers is 3 to 5.
  6. The image processing method according to claim 4 or 5, characterized in that the residual block comprises a grouped convolution layer, a third convolution layer, and a fourth convolution layer, the grouped convolution layer adopts a ReLU activation function, the grouped convolution layer and the third convolution layer are connected to form a residual path, and the residual block adopts a local skip connection structure;
    the step of learning residual features by using the residual block to obtain the residual feature map comprises:
    taking the input of the residual block as the input of the grouped convolution layer, and extracting features through the residual path; and
    fusing the input of the residual block with the output of the third convolution layer, inputting the fused result into the fourth convolution layer for convolution processing, and outputting the residual feature map.
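Below is a minimal sketch of the residual block of claim 6. The group count is an assumption; the claim specifies only a grouped convolution with ReLU and a third convolution forming the residual path, a local skip connection, and a fourth convolution after fusion.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64, groups=4):
        super().__init__()
        self.grouped = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.relu = nn.ReLU(inplace=True)       # activation of the grouped conv
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv4 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        residual = self.conv3(self.relu(self.grouped(x)))  # residual path
        return self.conv4(x + residual)  # local skip connection, then fusion conv
```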
  7. The image processing method according to any one of claims 1 to 6, characterized in that the step of enlarging the reconstruction feature map by using the sub-pixel convolution layer of the image reconstruction model to obtain the reconstructed image comprises:
    adjusting pixel positions in the reconstruction feature map by using the sub-pixel convolution layer to obtain the reconstructed image.
  8. An image reconstruction model training method, characterized in that the image reconstruction model training method comprises:
    acquiring training samples, the training samples comprising low-resolution images and high-resolution images, the low-resolution images being obtained by down-sampling the high-resolution images;
    inputting the low-resolution image into a pre-built image reconstruction model, the image reconstruction model comprising a feature extraction network and a sub-pixel convolution layer;
    performing multi-scale feature extraction and image-channel expansion on the low-resolution image by using the feature extraction network to obtain a training feature map;
    enlarging the training feature map by using the sub-pixel convolution layer to obtain a training reconstructed image; and
    performing back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and a preset objective function to obtain a trained image reconstruction model.
  9. The image reconstruction model training method according to claim 8, characterized in that the objective function is an L2 loss function;
    the step of performing back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and the preset objective function to obtain the trained image reconstruction model comprises:
    performing back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and the L2 loss function, so as to adjust the parameters of the image reconstruction model until a preset training completion condition is reached, to obtain the trained image reconstruction model.
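A minimal sketch of the back-propagation training of claims 8 and 9 with an L2 (mean-squared-error) objective is given below; the optimizer choice and the concrete training completion condition (here, a fixed number of epochs) are assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=100, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    l2 = nn.MSELoss()
    for _ in range(epochs):                      # assumed completion condition
        for low_res, high_res in loader:         # low_res: downsampled high_res
            optimizer.zero_grad()
            loss = l2(model(low_res), high_res)  # L2 loss against the target
            loss.backward()                      # back-propagation
            optimizer.step()                     # adjust model parameters
    return model
```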
  10. The image reconstruction model training method according to claim 8 or 9, characterized in that the image reconstruction model training method further comprises:
    pruning the trained image reconstruction model so as to retain long-line cascade connections and delete short-line cascade connections.
  11. The image reconstruction model training method according to any one of claims 8 to 10, characterized in that before the step of inputting the low-resolution image into the pre-built image reconstruction model, the image reconstruction model training method further comprises:
    performing self-mean-subtraction processing on the low-resolution image to highlight the texture details of the low-resolution image.
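A minimal sketch of this preprocessing follows: the image's own mean is subtracted so that low-frequency brightness is removed and texture detail stands out. Per-channel subtraction is an assumption.

```python
def subtract_self_mean(img):
    # img: tensor of shape (C, H, W); subtract each channel's own mean.
    return img - img.mean(dim=(1, 2), keepdim=True)
```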
  12. The image reconstruction model training method according to any one of claims 8 to 11, characterized in that before the step of inputting the low-resolution image into the pre-built image reconstruction model, the image reconstruction model training method further comprises:
    performing flip-symmetry processing on the low-resolution image to obtain at least one processed low-resolution image;
    the step of inputting the low-resolution image into the pre-built image reconstruction model comprises:
    inputting the at least one processed low-resolution image into the image reconstruction model; and
    the step of performing multi-scale feature extraction on the low-resolution image by using the feature extraction network to obtain the training feature map comprises:
    performing multi-scale feature extraction on the at least one processed low-resolution image by using the feature extraction network to obtain at least one auxiliary feature map; and
    performing inverse flip-symmetry processing on the at least one auxiliary feature map, and averaging the results after the inverse flip-symmetry processing to obtain the training feature map.
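Below is a minimal sketch of the flip-symmetry processing of claim 12, assuming horizontal and vertical flips. Because each flip is its own inverse, applying the same flip to the extracted features performs the inverse flip-symmetry processing before averaging.

```python
import torch

def flip_averaged_features(extract, img):
    transforms = [
        lambda t: t,                         # identity
        lambda t: torch.flip(t, dims=[-1]),  # horizontal flip
        lambda t: torch.flip(t, dims=[-2]),  # vertical flip
    ]
    # Flip the image, extract an auxiliary feature map, then un-flip the map.
    feats = [f(extract(f(img))) for f in transforms]
    return torch.stack(feats).mean(dim=0)    # training feature map
```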
  13. An image processing apparatus, characterized in that the image processing apparatus comprises:
    an image acquisition module, configured to acquire an image to be processed;
    a first execution module, configured to input the image to be processed into an image reconstruction model, and perform multi-scale feature extraction and image-channel expansion on the image to be processed by using a feature extraction network of the image reconstruction model to obtain a reconstruction feature map; and
    a second execution module, configured to enlarge the reconstruction feature map by using a sub-pixel convolution layer of the image reconstruction model to obtain a reconstructed image.
  14. An image reconstruction model training apparatus, characterized in that the image reconstruction model training apparatus comprises:
    a sample acquisition module, configured to acquire training samples, the training samples comprising low-resolution images and high-resolution images, the low-resolution images being obtained by down-sampling the high-resolution images;
    a first processing module, configured to input the low-resolution image into a pre-built image reconstruction model, the image reconstruction model comprising a feature extraction network and a sub-pixel convolution layer;
    a second processing module, configured to perform multi-scale feature extraction and image-channel expansion on the low-resolution image by using the feature extraction network to obtain a training feature map;
    a third processing module, configured to enlarge the training feature map by using the sub-pixel convolution layer to obtain a training reconstructed image; and
    a fourth processing module, configured to perform back-propagation training on the image reconstruction model based on the training reconstructed image, the high-resolution image, and a preset objective function to obtain a trained image reconstruction model.
  15. A portrait super-resolution reconstruction method, characterized in that the portrait super-resolution reconstruction method comprises:
    performing key point detection on an image to be processed by using an image reconstruction model to obtain face key points;
    performing super-resolution reconstruction processing according to the face key points and image features obtained based on the image to be processed, to obtain image high-frequency information; and
    performing restoration processing on the image to be processed by using the image high-frequency information to obtain a super-resolution image corresponding to the image to be processed.
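The following sketch shows the three-stage flow of claim 15 as an interface only; the method names (`detect_keypoints`, `extract_features`, `reconstruct`, `restore`) are hypothetical stand-ins for the networks the disclosure describes, not API from the disclosure itself.

```python
def portrait_super_resolution(model, image):
    landmarks = model.detect_keypoints(image)           # face key points
    features = model.extract_features(image)            # image features
    high_freq = model.reconstruct(landmarks, features)  # image high-frequency info
    return model.restore(image, high_freq)              # super-resolution image
```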
  16. The portrait super-resolution reconstruction method according to claim 15, characterized in that the super-resolution reconstruction processing is performed by using the image processing method according to any one of claims 1 to 7.
  17. The portrait super-resolution reconstruction method according to claim 15 or 16, characterized in that the key point detection, the super-resolution reconstruction processing, and the restoration processing comprise multiple rounds of iterative processing, and the image to be processed is an unprocessed image to be processed, or a super-resolution image obtained after the key point detection, the super-resolution reconstruction processing, and the restoration processing in the previous round of iteration.
  18. The portrait super-resolution reconstruction method according to any one of claims 15 to 17, characterized in that the face key points comprise a plurality of key points, and the step of performing restoration processing on the image to be processed by using the image high-frequency information to obtain the super-resolution image corresponding to the image to be processed comprises:
    processing the image to be processed by using a pre-built portrait cognition model, and outputting position information of each of the face key points; and
    performing restoration processing on the image to be processed based on the position information of each of the face key points and the image high-frequency information, to obtain the super-resolution image corresponding to the image to be processed.
  19. The portrait super-resolution reconstruction method according to claim 18, characterized in that the step of performing restoration processing on the image to be processed based on the position information of each of the face key points and the image high-frequency information to obtain the super-resolution image corresponding to the image to be processed comprises:
    acquiring a restoration attribute corresponding to each of the face key points; and
    performing restoration processing on the corresponding face key points in the image to be processed according to each of the face key points and its corresponding position information, image high-frequency information, and restoration attribute.
  20. The portrait super-resolution reconstruction method according to any one of claims 15 to 19, characterized in that the reconstruction model comprises a discriminator and a generation network, and the generation network is obtained by training with training samples under the supervision of a trained discriminator.
  21. The portrait super-resolution reconstruction method according to any one of claims 15 to 20, characterized in that the face key points comprise the left eye, the right eye, the nose, the mouth, and the chin contour.
  22. A portrait super-resolution reconstruction model training method, characterized in that the portrait super-resolution reconstruction model training method comprises:
    acquiring a training sample and a target sample corresponding to the training sample;
    performing key point detection on the training sample by using a constructed generation network to obtain training key points;
    performing super-resolution reconstruction processing and restoration processing based on the training key points and the training sample to obtain an output image; and
    comparing the output image with the target sample, adjusting the network parameters of the generation network based on the comparison result, and continuing training until a first preset condition is met, to obtain a reconstruction model.
  23. The portrait super-resolution reconstruction model training method according to claim 22, characterized in that the step of comparing the output image with the target sample, adjusting the network parameters of the generation network based on the comparison result, and continuing training until the first preset condition is met to obtain the reconstruction model comprises:
    constructing a first loss function based on the difference between the pixel information of the output image and the pixel information of the target sample;
    constructing a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample; and
    comparing the output image with the target sample, adjusting the network parameters of the generation network based on the comparison result, and continuing training until the weighted function value of the first loss function and the second loss function satisfies the first preset condition, to obtain the reconstruction model.
  24. The portrait super-resolution reconstruction model training method according to claim 22 or 23, characterized in that the reconstruction model further comprises a discriminator for supervising the training of the generation network, and the portrait super-resolution reconstruction model training method comprises:
    constructing a discriminator, and performing discrimination processing on the output image and the target sample corresponding to the output image by using the discriminator; and
    adjusting the parameters of the discriminator according to the obtained discrimination result until a second preset condition is met, to obtain a trained discriminator.
  25. The portrait super-resolution reconstruction model training method according to any one of claims 22 to 24, characterized in that the step of comparing the output image with the target sample, adjusting the network parameters of the generation network based on the comparison result, and continuing training until the first preset condition is met to obtain the reconstruction model comprises:
    inputting the output image into the trained discriminator to obtain discrimination information;
    comparing the output image with the target sample to obtain a comparison result; and
    adjusting the network parameters of the generation network according to the discrimination information and the comparison result, and continuing training until the first preset condition is met, to obtain the reconstruction model.
  26. The portrait super-resolution reconstruction model training method according to claim 25, characterized in that the step of adjusting the network parameters of the generation network according to the discrimination information and the comparison result, and continuing training until the first preset condition is met to obtain the reconstruction model comprises:
    constructing a first loss function based on the difference between the pixel information of the output image and the pixel information of the target sample;
    constructing a second loss function based on the difference between each face key point in the output image and the corresponding face key point in the target sample;
    constructing a third loss function based on the discrimination information of the discriminator for the output image, and constructing a fourth loss function based on the image difference between the output image and the target sample obtained by a pre-built portrait cognition model; and
    adjusting the network parameters of the generation network according to the discrimination information and the comparison result, and continuing training until the weighted function value of the first loss function, the second loss function, the third loss function, and the fourth loss function satisfies the first preset condition, to obtain the reconstruction model.
  27. A portrait super-resolution reconstruction apparatus, characterized in that the portrait super-resolution reconstruction apparatus comprises:
    a detection module, configured to perform key point detection on an image to be processed by using a pre-built reconstruction model to obtain face key points;
    a processing module, configured to perform super-resolution reconstruction processing according to the face key points and image features obtained based on the image to be processed, to obtain image high-frequency information; and
    a restoration module, configured to perform restoration processing on the image to be processed by using the image high-frequency information to obtain a super-resolution image corresponding to the image to be processed.
  28. The portrait super-resolution reconstruction apparatus according to claim 27, characterized in that the processing module comprises the image processing apparatus according to claim 13 for performing the super-resolution reconstruction processing.
  29. A portrait super-resolution reconstruction model training apparatus, characterized in that the portrait super-resolution reconstruction model training apparatus comprises:
    an acquisition module, configured to acquire a training sample and a target sample corresponding to the training sample;
    a key point obtaining module, configured to perform key point detection on the training sample by using a constructed generation network to obtain training key points;
    an output image obtaining module, configured to perform super-resolution reconstruction processing and restoration processing based on the training key points and the training sample to obtain an output image; and
    a training module, configured to compare the output image with the target sample, adjust the network parameters of the generation network based on the comparison result, and continue training until a first preset condition is met, to obtain a reconstruction model.
  30. An electronic device, characterized in that the electronic device comprises:
    one or more processors; and
    one or more storage media for storing one or more machine-executable instructions that, when executed by the one or more processors, cause the one or more processors to implement the image processing method according to any one of claims 1 to 7, the image reconstruction model training method according to any one of claims 8 to 12, the portrait super-resolution reconstruction method according to any one of claims 15 to 21, or the portrait super-resolution reconstruction model training method according to any one of claims 22 to 26.
  31. A computer-readable storage medium, characterized in that the computer-readable storage medium stores machine-executable instructions which, when executed, implement the image processing method according to any one of claims 1 to 7, the image reconstruction model training method according to any one of claims 8 to 12, the portrait super-resolution reconstruction method according to any one of claims 15 to 21, or the portrait super-resolution reconstruction model training method according to any one of claims 22 to 26.
PCT/CN2021/118591 2020-09-16 2021-09-15 Image processing method and apparatus, portrait super-resolution reconstruction method and apparatus, and portrait super-resolution reconstruction model training method and apparatus, electronic device, and storage medium WO2022057837A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010977254.4A CN114266697A (en) 2020-09-16 2020-09-16 Image processing and model training method and device, electronic equipment and storage medium
CN202010977254.4 2020-09-16
CN202011000670.5A CN114298901A (en) 2020-09-22 2020-09-22 Portrait super-resolution reconstruction method, model training method, device, electronic equipment and readable storage medium
CN202011000670.5 2020-09-22

Publications (1)

Publication Number Publication Date
WO2022057837A1 true WO2022057837A1 (en) 2022-03-24

Family

ID=80776497

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118591 WO2022057837A1 (en) 2020-09-16 2021-09-15 Image processing method and apparatus, portrait super-resolution reconstruction method and apparatus, and portrait super-resolution reconstruction model training method and apparatus, electronic device, and storage medium

Country Status (1)

Country Link
WO (1) WO2022057837A1 (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0959433A2 (en) * 1998-05-20 1999-11-24 Itt Manufacturing Enterprises, Inc. Super resolution apparatus and methods for electro-optical systems
CN107154023A (en) * 2017-05-17 2017-09-12 电子科技大学 Face super-resolution reconstruction method based on generation confrontation network and sub-pix convolution
CN109903219A (en) * 2019-02-28 2019-06-18 深圳市商汤科技有限公司 Image processing method and device, electronic equipment, computer readable storage medium
CN111488779A (en) * 2019-07-19 2020-08-04 同观科技(深圳)有限公司 Video image super-resolution reconstruction method, device, server and storage medium
CN110782395A (en) * 2019-10-28 2020-02-11 西安电子科技大学 Image processing method and device, electronic equipment and computer readable storage medium
CN110992265A (en) * 2019-12-02 2020-04-10 北京数码视讯科技股份有限公司 Image processing method and model, model training method and electronic equipment
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111461983A (en) * 2020-03-31 2020-07-28 华中科技大学鄂州工业技术研究院 Image super-resolution reconstruction model and method based on different frequency information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI SUMEI; LEI GUOQING; FAN RU: "Depth Map Super-Resolution Based on Two-Channel Convolutional Neural Network", Acta Optica Sinica, vol. 38, no. 10, 31 October 2018, pages 136-142, ISSN: 0253-2239, DOI: 10.3788/AOS201838.1010002 *
LI WEI; XUDONG ZHANG: "Depth image super-resolution reconstruction based on convolution neural network", Journal of Electronic Measurement and Instrument, vol. 31, no. 12, 31 December 2017, pages 1918-1928, ISSN: 1000-7105, DOI: 10.13382/j.jemi.2017.12.006 *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049254A (en) * 2021-10-29 2022-02-15 华南农业大学 Low-pixel ox-head image reconstruction and identification method, system, equipment and storage medium
CN114663288A (en) * 2022-04-11 2022-06-24 桂林电子科技大学 Single-axial head MRI (magnetic resonance imaging) super-resolution reconstruction method
CN114841961A (en) * 2022-05-05 2022-08-02 扬州大学 Wheat scab detection method based on image enhancement and improvement of YOLOv5
CN114841961B (en) * 2022-05-05 2024-04-05 扬州大学 Wheat scab detection method based on image enhancement and improved YOLOv5
CN114943639A (en) * 2022-05-24 2022-08-26 北京瑞莱智慧科技有限公司 Image acquisition method, related device and storage medium
CN114972041A (en) * 2022-07-28 2022-08-30 中国人民解放军国防科技大学 Polarization radar image super-resolution reconstruction method and device based on residual error network
CN115331077A (en) * 2022-08-22 2022-11-11 北京百度网讯科技有限公司 Training method of feature extraction model, target classification method, device and equipment
CN115331077B (en) * 2022-08-22 2024-04-26 北京百度网讯科技有限公司 Training method of feature extraction model, target classification method, device and equipment
WO2024078403A1 (en) * 2022-10-13 2024-04-18 维沃移动通信有限公司 Image processing method and apparatus, and device
CN115409716A (en) * 2022-11-01 2022-11-29 杭州网易智企科技有限公司 Video processing method, device, storage medium and equipment
CN115409755A (en) * 2022-11-03 2022-11-29 腾讯科技(深圳)有限公司 Map processing method and device, storage medium and electronic equipment
CN115409755B (en) * 2022-11-03 2023-03-03 腾讯科技(深圳)有限公司 Map processing method and device, storage medium and electronic equipment
CN115546030A (en) * 2022-11-30 2022-12-30 武汉大学 Compressed video super-resolution method and system based on twin super-resolution network
CN115953296A (en) * 2022-12-09 2023-04-11 中山大学·深圳 Transform and convolutional neural network combined based face super-resolution reconstruction method and system
CN115953296B (en) * 2022-12-09 2024-04-05 中山大学·深圳 Face super-resolution reconstruction method and system based on combination of transducer and convolutional neural network
CN115908142A (en) * 2023-01-06 2023-04-04 诺比侃人工智能科技(成都)股份有限公司 Contact net tiny part damage testing method based on visual recognition
CN115937794B (en) * 2023-03-08 2023-08-15 成都须弥云图建筑设计有限公司 Small target object detection method and device, electronic equipment and storage medium
CN115937794A (en) * 2023-03-08 2023-04-07 北京龙智数科科技服务有限公司 Small target object detection method and device, electronic equipment and storage medium
CN116091712B (en) * 2023-04-12 2023-06-27 安徽大学 Multi-view three-dimensional reconstruction method and system for computing resource limited equipment
CN116091712A (en) * 2023-04-12 2023-05-09 安徽大学 Multi-view three-dimensional reconstruction method and system for computing resource limited equipment
CN116309591B (en) * 2023-05-19 2023-08-25 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116452424B (en) * 2023-05-19 2023-10-10 山东大学 Face super-resolution reconstruction method and system based on double generalized distillation
CN116452424A (en) * 2023-05-19 2023-07-18 山东大学 Face super-resolution reconstruction method and system based on double generalized distillation
CN116309591A (en) * 2023-05-19 2023-06-23 杭州健培科技有限公司 Medical image 3D key point detection method, model training method and device
CN116385318A (en) * 2023-06-06 2023-07-04 湖南纵骏信息科技有限公司 Image quality enhancement method and system based on cloud desktop
CN116385318B (en) * 2023-06-06 2023-10-10 湖南纵骏信息科技有限公司 Image quality enhancement method and system based on cloud desktop
CN117097876A (en) * 2023-07-07 2023-11-21 天津大学 Event camera image reconstruction method based on neural network
CN117097876B (en) * 2023-07-07 2024-03-08 天津大学 Event camera image reconstruction method based on neural network
CN117196947B (en) * 2023-09-06 2024-03-22 南通大学 High-efficiency compression reconstruction model construction method for high-resolution image
CN117196947A (en) * 2023-09-06 2023-12-08 南通大学 High-efficiency compression reconstruction model construction method for high-resolution image
CN117238020A (en) * 2023-11-10 2023-12-15 杭州启源视觉科技有限公司 Face recognition method, device and computer equipment
CN117238020B (en) * 2023-11-10 2024-04-26 杭州启源视觉科技有限公司 Face recognition method, device and computer equipment
CN117425013B (en) * 2023-12-19 2024-04-02 杭州靖安防务科技有限公司 Video transmission method and system based on reversible architecture
CN117425013A (en) * 2023-12-19 2024-01-19 杭州靖安防务科技有限公司 Video transmission method and system based on reversible architecture
CN117575916A (en) * 2024-01-19 2024-02-20 青岛漫斯特数字科技有限公司 Image quality optimization method, system, equipment and medium based on deep learning
CN117575916B (en) * 2024-01-19 2024-04-30 青岛漫斯特数字科技有限公司 Image quality optimization method, system, equipment and medium based on deep learning
CN117612017A (en) * 2024-01-23 2024-02-27 江西啄木蜂科技有限公司 Environment-adaptive remote sensing image change detection method

Similar Documents

Publication Publication Date Title
WO2022057837A1 (en) Image processing method and apparatus, portrait super-resolution reconstruction method and apparatus, and portrait super-resolution reconstruction model training method and apparatus, electronic device, and storage medium
TWI728465B (en) Method, device and electronic apparatus for image processing and storage medium thereof
US11688070B2 (en) Video frame segmentation using reduced resolution neural network and masks from previous frames
RU2697928C1 (en) Superresolution of an image imitating high detail based on an optical system, performed on a mobile device having limited resources, and a mobile device which implements
US10848746B2 (en) Apparatus including multiple cameras and image processing method
JP2018537748A (en) Light field rendering of images with variable computational complexity
WO2023284401A1 (en) Image beautification processing method and apparatus, storage medium, and electronic device
US20190114833A1 (en) Surface reconstruction for interactive augmented reality
CN112991171B (en) Image processing method, device, electronic equipment and storage medium
US11862053B2 (en) Display method based on pulse signals, apparatus, electronic device and medium
WO2023103378A1 (en) Video frame interpolation model training method and apparatus, and computer device and storage medium
CN113034358A (en) Super-resolution image processing method and related device
US11127111B2 (en) Selective allocation of processing resources for processing image data
Tang et al. Structure-embedded ghosting artifact suppression network for high dynamic range image reconstruction
CN113822803A (en) Image super-resolution processing method, device, equipment and computer readable storage medium
US11570384B2 (en) Image sensor employing varied intra-frame analog binning
WO2024032331A9 (en) Image processing method and apparatus, electronic device, and storage medium
WO2020259123A1 (en) Method and device for adjusting image quality, and readable storage medium
WO2023280266A1 (en) Fisheye image compression method, fisheye video stream compression method and panoramic video generation method
WO2023131111A1 (en) Image processing method, apparatus and system, and storage medium
CN112261296B (en) Image enhancement method, image enhancement device and mobile terminal
CN114266697A (en) Image processing and model training method and device, electronic equipment and storage medium
Fang et al. Artificial Intelligence: Second CAAI International Conference, CICAI 2022, Beijing, China, August 27–28, 2022, Revised Selected Papers, Part I
US20150229848A1 (en) Method and system for generating an image including optically zoomed and digitally zoomed regions
US20240144429A1 (en) Image processing method, apparatus and system, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868659

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21868659

Country of ref document: EP

Kind code of ref document: A1