CN113240584B - Multitasking gesture picture super-resolution method based on picture edge information - Google Patents

Multitasking gesture picture super-resolution method based on picture edge information

Info

Publication number
CN113240584B
Authority
CN
China
Prior art keywords
resolution
super
picture
edge information
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110508733.6A
Other languages
Chinese (zh)
Other versions
CN113240584A (en)
Inventor
方昱春
冉启材
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202110508733.6A priority Critical patent/CN113240584B/en
Publication of CN113240584A publication Critical patent/CN113240584A/en
Application granted granted Critical
Publication of CN113240584B publication Critical patent/CN113240584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multitasking gesture picture super-resolution method based on picture edge information, which specifically comprises the following steps: inputting a plurality of groups of "low-resolution picture - skeleton key point" pairs into a super-resolution model; the low-resolution pictures respectively undergo a picture super-resolution task and an edge detection task through a super-resolution unit and an edge information unit; the features obtained by the super-resolution unit and the edge information unit are sent to a designed multi-scale feature fusion module for feature fusion at 3 different scales; the parameters of the model are iteratively updated through the designed edge loss function and content loss function until convergence; and the low-resolution picture to be super-resolved and the corresponding hand skeleton key point information are input into the model, obtaining the finally generated super-resolution picture through a single forward pass. The method is effective, generates super-resolution pictures that better conform to real scenes, and is highly extensible: the latest super-resolution networks and edge detection networks can be incorporated to improve the performance of the model.

Description

Multitasking gesture picture super-resolution method based on picture edge information
Technical Field
The invention relates to the field of computer vision, mainly relates to a super-resolution method of a single picture, in particular to a multi-task gesture picture super-resolution method based on picture edge information.
Background
The single-image super-resolution task is a typical inverse problem in computer vision, whose goal is to reconstruct a High Resolution (HR) image from a Low Resolution (LR) input image. Image super-resolution technology is widely applied in the real world, for example in medical image processing and in monitoring and security. Besides improving the quality of images and videos, it also serves as an upstream task for other high-level computer vision tasks (such as image segmentation, image recognition and action localization) to improve their performance.
The existing general super-resolution technology has two main problems. First, methods based on Convolutional Neural Networks (CNNs) typically use the Mean Square Error (MSE) as the objective function of the network; by computing only the pixel-wise distance between the generated picture and the ground-truth HR picture, MSE ignores a large amount of high-frequency picture information, so the structure of the finally generated picture is blurred. Second, methods based on Generative Adversarial Networks (GANs) retain the high-frequency information of the picture but often introduce distortion. These two types of problems severely limit the application of image super-resolution technology in real life. Therefore, finding a technique that reduces image deformation while retaining high-frequency image information is a problem to be solved in the current image super-resolution task.
Disclosure of Invention
The invention aims to provide a multitasking gesture picture super-resolution method based on picture edge information, so as to solve the problems in the prior art and make the method more suitable for processing gesture pictures.
In order to achieve the above object, the present invention provides the following solutions:
the invention provides a multitasking gesture picture super-resolution method based on picture edge information, which comprises the following steps:
information acquisition, namely acquiring a plurality of high-resolution gesture pictures, and preprocessing data of the high-resolution gesture pictures to acquire low-resolution gesture pictures and skeletal key point information of hands;
information processing, namely constructing a super-resolution model based on the low-resolution gesture picture, wherein the super-resolution model comprises an edge information unit and a super-resolution unit, and performing edge information detection on the low-resolution gesture picture based on the edge information unit to obtain edge features and a first edge information picture; the hand skeleton key point information is subjected to feature extraction by a convolution block and then merged into the super-resolution unit, and super-resolution processing is performed on the low-resolution gesture picture based on the super-resolution unit to obtain picture features;
determining a loss function, inputting the edge features and the picture features into a multi-scale feature fusion module, and carrying out feature fusion of three different scales to obtain a super-resolution picture; obtaining a second edge information graph based on a high-resolution picture, obtaining an edge loss value based on the first edge information graph and the second edge information graph, obtaining a content loss value based on the high-resolution picture and the super-resolution picture, and obtaining a loss function by carrying out weighted summation based on the edge loss value and the content loss value;
and training and utilizing the model, carrying out one-time back propagation based on the loss function, updating parameters of the super-resolution model, repeating iteration until the parameters are converged, finishing the training of the super-resolution model, inputting a low-resolution gesture picture needing super resolution and corresponding hand skeleton key point information into the super-resolution model, and obtaining a finally generated super-resolution gesture picture through one-time forward propagation, thereby finishing super-resolution of the low-resolution gesture picture.
Further, the data preprocessing method comprises the following steps: and 4 times of downsampling is carried out on the high-resolution gesture image through a bilinear interpolation algorithm, so that a low-resolution gesture image is obtained.
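The 4x bilinear downsampling can be sketched in plain Python as below; `bilinear_downsample` is an illustrative name, and a production pipeline would normally rely on a library routine such as `cv2.resize` with bilinear interpolation instead.

```python
# Minimal sketch of 4x bilinear downsampling on a single-channel image
# stored as nested lists; sampling positions follow the common
# pixel-center (align_corners=False) convention.

def bilinear_downsample(img, scale=4):
    h, w = len(img), len(img[0])
    oh, ow = h // scale, w // scale
    out = []
    for i in range(oh):
        row = []
        for j in range(ow):
            # Sample the source at the output pixel's center position.
            y = (i + 0.5) * scale - 0.5
            x = (j + 0.5) * scale - 0.5
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            val = (img[y0][x0] * (1 - dy) * (1 - dx)
                   + img[y0][x1] * (1 - dy) * dx
                   + img[y1][x0] * dy * (1 - dx)
                   + img[y1][x1] * dy * dx)
            row.append(val)
        out.append(row)
    return out

# A 512x512 HR picture becomes a 128x128 LR picture.
hr = [[float(r + c) for c in range(512)] for r in range(512)]
lr = bilinear_downsample(hr, 4)
```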
Further, the hand skeleton key point information includes, but is not limited to, skeleton key point coordinates, which are obtained by the following steps:
and acquiring skeleton key point coordinates of the high-resolution gesture picture and the low-resolution gesture picture by using an OpenPose tool, wherein positions where the skeleton key point coordinates cannot be acquired are represented by (-1, -1) coordinates.
Further, the above acquisition method is used to acquire 21 skeleton key point coordinates; the distance between each skeleton key point coordinate and each pixel coordinate on the corresponding low-resolution picture is calculated to obtain the heat map data corresponding to each skeleton key point coordinate, and the heat map data corresponding to the 21 skeleton key point coordinates are stacked along the picture depth to obtain the skeleton key point input data.
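A minimal sketch of building the 21 stacked heat maps: the document only states that pixel-to-keypoint distances are computed, so turning each distance into a Gaussian response (and the `sigma` value) is an assumed, illustrative choice; keypoints at (-1, -1) mark failed detections and yield all-zero maps.

```python
import math

def keypoint_heatmaps(keypoints, size=128, sigma=3.0):
    """Build one heat map per keypoint and stack them depth-wise."""
    maps = []
    for (kx, ky) in keypoints:
        if (kx, ky) == (-1, -1):           # point that could not be estimated
            maps.append([[0.0] * size for _ in range(size)])
            continue
        # Gaussian of the pixel-to-keypoint distance (assumed weighting).
        hm = [[math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
               for x in range(size)]
              for y in range(size)]
        maps.append(hm)
    return maps                            # shape: 21 x size x size

kps = [(64, 64)] + [(-1, -1)] * 20         # one detected joint, 20 missing
heat = keypoint_heatmaps(kps)
```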
Further, the first edge information map is an edge information map with the same number of pixels as the high-resolution picture.
Further, in the information processing, feature extraction is performed on the skeleton key point information using three 3×3 convolution blocks, and the features acquired by each convolution block are sent to the super-resolution unit to be fused with the picture features of each layer; the super-resolution unit adopts two-dimensional convolution with a 3×3 convolution kernel and comprises four 3×3 convolution blocks; the edge information unit comprises four layers, wherein the first layer adopts 3×3 convolution and the remaining layers adopt multi-scale residual blocks.
Further, the multi-scale residual block includes, but is not limited to: two convolution blocks with a convolution kernel size of 3, two convolution blocks with a convolution kernel size of 5, and a 1x1 dimension-reducing convolution layer.
Further, the super-resolution unit and the edge information unit also form a dynamic multi-task structure, which runs the super-resolution unit and the edge information unit synchronously, uses the edge features acquired by the edge information unit to assist and enhance the super-resolution unit, and adds the skeleton key point coordinates as auxiliary features.
Further, in the model construction, when the features of three different scales are fused, residual groups each containing 16 residual modules are first used to extract the features at the three different scales, and then the fusion is carried out.
Further, in the model construction, performing one back propagation specifically includes:
and carrying out gradient back propagation on the loss function by using an Adam optimizer, updating parameters of the super-resolution model until the loss of the final model is no longer changed, and completing model training.
The invention discloses the following technical effects:
the system generates the high-resolution picture corresponding to the low-resolution picture through one-time operation, has high efficiency, performs special processing on the gesture picture, is more suitable for processing the gesture picture, can bear preprocessing work of tasks such as sign language recognition and gesture recognition, and improves quality and efficiency of related tasks of the gesture. The method is more focused on the super-resolution task of the gesture picture, and a clearer hand super-resolution result is obtained.
According to the invention, a dynamic multitasking method is used for linking the edge information detection task and the super-resolution task, so that the model learns the contribution degree of the image edge information to the super-resolution task from the data, and the better image characteristic representation is obtained.
The super-resolution unit and the edge information unit are plug-and-play unit structures, which means that either can be replaced by any network that performs the same task with better performance, further improving the performance of the whole network.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a super-resolution model according to the present invention;
fig. 2 is a schematic structural diagram of a multi-scale feature fusion module.
Detailed Description
Various exemplary embodiments of the invention will now be described in detail, which should not be considered as limiting the invention, but rather as more detailed descriptions of certain aspects, features and embodiments of the invention.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. In addition, for numerical ranges in this disclosure, it is understood that each intermediate value between the upper and lower limits of the ranges is also specifically disclosed. Every smaller range between any stated value or stated range, and any other stated value or intermediate value within the stated range, is also encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments of the invention described herein without departing from the scope or spirit of the invention. Other embodiments will be apparent to those skilled in the art from consideration of the specification of the present invention. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are intended to be inclusive and mean an inclusion, but not limited to.
The "parts" in the present invention are all parts by mass unless otherwise specified.
The invention relates to a gesture picture super-resolution method based on dynamic multitasking. A super-resolution task and an edge information detection task are constructed, the edge information is used to assist and enhance the super-resolution task, hand key point information is adopted to strengthen the model's attention around the hand, and the obtained features undergo feature fusion at different scales, thereby improving the performance of the model.
Example 1
According to the technical scheme, dynamic multitasking is adopted to construct a multi-task network that performs the super-resolution task and the edge detection task simultaneously; the features of the edge information unit assist in enhancing the features of the super-resolution unit, and hand skeleton key point information is added as an auxiliary feature so that the network's attention is focused on hand-related areas, finally generating a gesture super-resolution picture with sharper edges.
A multitasking gesture picture super-resolution method based on picture edge information comprises the following specific steps:
a. A plurality of low-resolution (LR) gesture pictures and the corresponding hand skeleton key point information are input into the network provided by the invention simultaneously. The gesture pictures undergo feature encoding, feature fusion and feature decoding in the super-resolution unit and the edge information unit; the hand skeleton key point information is fused into the super-resolution unit after each feature extraction through a convolution block; finally the super-resolution unit generates an image feature f_image and the edge information unit generates an edge feature f_edge. At the same time, a picture edge information map with the same size as the high-resolution picture is generated in the same step.
b. The two features generated in step a are sent together to the multi-scale feature fusion module, which finally generates a super-resolution output SR. The obtained SR and the real high-resolution picture HR are used to calculate the L1 loss, while the edge information map generated in step a and the real edge information map are used to calculate the edge loss. Both losses are then back-propagated once to update the network parameters. The iterative updating is repeated until the parameters in the network converge, obtaining the final model.
c. The low-resolution gesture picture to be super-resolved and the corresponding hand key point information (obtained through OpenPose) are input into the model obtained in step b, and the finally generated super-resolution gesture picture is obtained through a single forward propagation.
Preferably, the super-resolution unit in step a uses conventional 3x3 convolutions, i.e. two-dimensional convolutions with a 3x3 kernel. The edge information unit uses a 3x3 convolution in its first layer and multi-scale residual blocks in the remaining layers. The hand key point information is likewise processed with 3x3 convolution blocks for feature extraction.
The specific steps of the step a are as follows:
a-1. Data preprocessing: the existing high-resolution (HR) gesture picture data (picture size 512x512) is downsampled by a factor of 4 through a bilinear interpolation algorithm, obtaining the network's input LR picture data of size 128x128. Hand key point estimation is then performed on the HR and LR pictures using the OpenPose tool, where the positions of points that cannot be estimated are represented by (-1, -1). In total, 21 skeleton key points of the hand are obtained; heat map data for each key point is then obtained by calculating the distance between each skeleton key point coordinate and each pixel coordinate on the picture, and the skeleton key point input data is obtained by stacking the heat maps of the 21 key points along the picture depth. The overall structure of the network is shown in fig. 1.
a-2, model construction:
a-2-1. Hand key point features are extracted with three 3x3 convolution blocks, and the features after each convolution block are sent to the super-resolution unit to be fused with the picture features of each layer.
a-2-2. The super-resolution unit likewise performs feature extraction with four 3x3 convolution blocks; except for the last layer, the output of each layer is fused with the hand key point features, and the output of each layer is fused with the features obtained by the edge information unit through the dynamic multitasking method.
a-2-3. The first layer of the edge information unit is a 3x3 convolution block, and the following three layers are all multi-scale residual blocks with residual connections. Each multi-scale residual block consists of two convolution blocks with a convolution kernel size of 3, two convolution blocks with a convolution kernel size of 5, and a 1x1 dimension-reducing convolution layer.
The mathematical representation of process a-2 is:

f_K^i = Conv_K^i(f_K^{i-1}), i = 1, 2, 3    formula (1)
f_K^0 = I_K    formula (2)

where f_K^i represents the features obtained by the i-th convolution block; Conv_K^i represents the i-th convolution block of the hand key point information unit; and I_K represents the input hand key point information.

f_SR^1 = Conv_SR^1(I_LR)    formula (3)
f_SR^i = Conv_SR^i(f_SR^{i-1} + f_K^{i-1} + θ_i · f_E^{i-1}), i = 2, 3, 4    formula (4)
f_E^1 = Conv_E^1(I_LR)    formula (5)
f_E^i = MSRB_i(f_E^{i-1}), i = 2, 3, 4    formula (6)

where f_SR^i represents the features obtained by the i-th convolution block in the super-resolution unit; Conv_SR^i represents the i-th convolution block in the super-resolution unit; f_E^i represents the features obtained by the i-th layer of the edge information unit; Conv_E^1 represents the convolution block of the first layer of the edge information unit; MSRB_i(·) represents the i-th multi-scale residual block of the edge information unit; and θ_i represents the parameters of the dynamic multitasking.
The output feature of the fourth layer of the super-resolution unit is denoted f_image; at the same time, the outputs of the layers of the edge information unit are connected in a residual manner and then fused by a feature fusion block to obtain f_edge. The above process is mathematically expressed as:

f_image = f_SR^4    formula (7)
f_edge = Fusion(f_E^1 + f_E^2 + f_E^3 + f_E^4)    formula (8)

where Fusion(·) represents a feature fusion module consisting of a 1x1 convolution.
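Since the fusion module is a 1x1 convolution, it amounts to a per-pixel linear combination of the input channels. A minimal sketch with illustrative (not learned) weights:

```python
def conv1x1(feature_maps, weights, bias=0.0):
    """1x1 convolution over C input channels producing one output channel.

    feature_maps: list of C maps, each an HxW nested list.
    weights: C floats, one per input channel.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(wc * fm[y][x] for wc, fm in zip(weights, feature_maps)) + bias
             for x in range(w)]
            for y in range(h)]

# Fuse two 2x2 feature maps into a single output channel.
fused = conv1x1([[[1.0, 2.0], [3.0, 4.0]],
                 [[10.0, 20.0], [30.0, 40.0]]],
                weights=[0.5, 0.1])
```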
The specific steps of the step b are as follows:
b-1. The two features obtained from formulas (7) and (8) are sent to the multi-scale feature fusion module for feature fusion at three different scales; at each of the 3 scales the features are extracted by a Residual Group (RG) containing 16 residual modules, and each residual module contains two 3x3 convolutions. The multi-scale feature fusion module is shown in fig. 2.
b-2. In addition to generating f_edge, the feature fusion layer of the edge information unit in step a also generates an edge information map Edge' of the same size as the high-resolution (HR) picture. The edge loss, denoted l_E, is obtained through formula (9). Through the multi-scale feature fusion module, the network generates a Super-Resolution picture (SR). The L1 loss between the generated SR picture and the original HR picture is calculated according to formula (10) and denoted l_I. Based on the above, the final loss function of the model is:

l_E = ||Edge' - Edge||_1    formula (9)
l_I = ||I_SR - I_HR||_1    formula (10)
L_loss = l_I + λ · l_E    formula (11)

where I_SR represents the super-resolution picture generated by the model; I_HR represents the original high-resolution picture; λ is a hyper-parameter balancing the two losses, set to 0.5; and L_loss is the total loss function of the model.
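The weighted two-term loss above can be sketched directly; for brevity the inputs here are flattened pixel lists rather than full images.

```python
def l1(a, b):
    """Mean absolute (L1) difference between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def total_loss(sr, hr, edge_pred, edge_true, lam=0.5):
    l_i = l1(sr, hr)                 # content loss between SR and HR
    l_e = l1(edge_pred, edge_true)   # edge loss between Edge' and Edge
    return l_i + lam * l_e           # weighted sum with lambda = 0.5

loss = total_loss([0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0])
```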
Example 2
The invention relates to a gesture picture super-resolution method based on dynamic multitasking edge information and hand skeleton key point assistance. As shown in fig. 1, the specific implementation is as follows:
the network inputs a low resolution picture of RBG type, a picture size of 128x128, and three channels, so the input picture type is 3x128x128. The number of the hand key points corresponding to each picture is 21, the distance between each key point position and each pixel position of the picture is calculated to form a numpy distance matrix, and therefore the hand key point information corresponding to each picture is a numpy type three-dimensional matrix with the size of 21x128x 128.
Extracting features of the key point information of the hand by adopting 3 convolution blocks with the convolution kernel size of 3; 4 convolution blocks with the convolution kernel size of 3 are adopted to perform feature extraction on an input LR picture in a super-resolution unit; in the edge information unit, a convolution block with a convolution kernel size of 3 is adopted in the first layer, and then 3 multi-scale residual blocks are adopted, wherein each residual block comprises 2 convolutions with the convolution kernel size of 3 and two convolutions with the convolution kernel size of 5.
The outputs of the super-resolution unit and the edge information unit are up-sampled by a factor of 4 to obtain the features f_image and f_edge of size 512x512; at the same time, the edge information unit generates an edge information map Edge' of size 512x512. The edge loss is calculated by formula (9) from the obtained Edge' and the edge picture Edge corresponding to the existing high-resolution picture.
f_image and f_edge are fed simultaneously into the multi-scale feature fusion module, which fuses features at three different scales; features of the same scale are processed by the same number of residual group modules, and each residual group contains 16 residual blocks formed by 3x3 convolutions.
The multi-scale feature fusion module finally outputs the super-resolution picture SR, and the L1 loss of the picture is calculated through formula (10).
Multiple sets of "picture - hand key point information" pairs are fed into the network of the invention, and an Adam optimizer performs gradient back propagation on the final loss function to update the model parameters; when the loss of the model no longer changes, model training is considered complete.
The above embodiments are only illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solutions of the present invention should fall within the protection scope defined by the claims of the present invention without departing from the design spirit of the present invention.

Claims (6)

1. A multitasking gesture picture super-resolution method based on picture edge information is characterized in that: the method comprises the following steps:
information acquisition, namely acquiring a plurality of high-resolution gesture pictures, and preprocessing data of the high-resolution gesture pictures to acquire low-resolution gesture pictures and skeletal key point information of hands;
information processing, namely constructing a super-resolution model based on the low-resolution gesture picture, wherein the super-resolution model comprises an edge information unit and a super-resolution unit, and performing edge information detection on the low-resolution gesture picture based on the edge information unit to obtain edge characteristics and a first edge information picture; extracting features of the hand skeleton key point information based on the convolution block to obtain hand skeleton key point features; performing super-resolution processing on the low-resolution gesture image based on the super-resolution unit to obtain image characteristics;
in the information processing, feature extraction is carried out on the hand skeleton key point information, and three are adopted
Figure QLYQS_1
The convolution blocks extract features, and the hand skeleton key point features acquired by each convolution block are fused with the output of each layer in the super-resolution unit;
the super-resolution unit adopts a fixed convolution kernel size and comprises four convolution blocks; the outputs of the first three layers of the super-resolution unit are respectively fused with the hand skeleton key point features;
the edge information unit comprises four layers, wherein the first layer adopts a convolution layer and the remaining layers adopt multi-scale residual blocks;
the multi-scale residual block includes, but is not limited to: two convolution blocks with a convolution kernel size of 3, two convolution blocks with a convolution kernel size of 5, and a dimension-reducing convolution layer with a convolution kernel size of 1×1;
the super-resolution unit and the edge information unit further form a dynamic multi-task structure, which runs the two units synchronously; the edge features acquired by the edge information unit assist and enhance the super-resolution unit, and the hand skeleton key point coordinates are added as auxiliary features;
in the model construction, when the three features of different scales are fused, 16 residual groups each containing residual modules first extract the three features of different scales, and fusion is then performed;
determining a loss function, namely inputting the edge features and the picture features into a multi-scale feature fusion module and performing feature fusion at three different scales to obtain a super-resolution picture; obtaining a second edge information graph from the high-resolution picture; computing an edge loss value from the first and second edge information graphs and a content loss value from the high-resolution picture and the super-resolution picture; and obtaining the loss function as a weighted sum of the edge loss value and the content loss value;
and training and using the model, namely performing one back propagation based on the loss function to update the parameters of the super-resolution model, and repeating the iteration until the parameters converge, thereby completing the training of the super-resolution model; then inputting a low-resolution gesture picture requiring super-resolution and the corresponding hand skeleton key point information into the super-resolution model, and obtaining the final super-resolution gesture picture through one forward propagation, thereby completing super-resolution of the low-resolution gesture picture.
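To make the multi-scale residual block of claim 1 concrete, a minimal numpy sketch follows. The exact wiring is an assumption: the two 3×3 convolutions and the two 5×5 convolutions are taken as two parallel branches whose outputs are concatenated along the channel axis and reduced by the 1×1 convolution, with a residual skip connection. The claim fixes only the constituent layers, and all weights here are placeholders.

```python
import numpy as np

def conv2d(x, w):
    """'Same' 2-D cross-correlation. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    # Windows of shape (C_in, H, W, k, k), contracted over C_in and the k x k patch.
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum('ihwkl,oikl->ohw', win, w)

def multi_scale_residual_block(x, w3a, w3b, w5a, w5b, w1):
    """Two stacked 3x3 convs and two stacked 5x5 convs in parallel branches,
    concatenated and reduced by a 1x1 conv, plus a residual skip."""
    relu = lambda t: np.maximum(t, 0.0)
    b3 = relu(conv2d(relu(conv2d(x, w3a)), w3b))   # 3x3 branch
    b5 = relu(conv2d(relu(conv2d(x, w5a)), w5b))   # 5x5 branch
    fused = np.concatenate([b3, b5], axis=0)       # stack along channels
    return conv2d(fused, w1) + x                   # 1x1 reduction + skip
```

With an input of shape (channels, height, width), the output keeps the same shape, which is what permits the residual addition.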
2. The picture edge information-based multitasking gesture picture super-resolution method of claim 1, characterized by: the data preprocessing method comprises: downsampling the high-resolution gesture picture by a factor of 4 through a bilinear interpolation algorithm to obtain the low-resolution gesture picture.
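The 4× bilinear downsampling of claim 2 can be sketched with a generic numpy resampler; the half-pixel-center coordinate convention used below is an assumption, since interpolation libraries differ on it.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resampling of an (H, W) or (H, W, C) image,
    sampling at half-pixel centers (align_corners=False convention)."""
    h, w = img.shape[:2]
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]
    if img.ndim == 3:          # broadcast the weights over the channel axis
        wy = wy[..., None]
        wx = wx[..., None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

Downsampling a 64×64 high-resolution picture by a factor of 4 then corresponds to `bilinear_resize(img, 16, 16)`.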
3. The picture edge information-based multitasking gesture picture super-resolution method of claim 1, characterized by: the hand skeletal key point information includes, but is not limited to, skeleton key point coordinates, which are obtained as follows:
acquiring the skeleton key point coordinates of the high-resolution gesture picture and the low-resolution gesture picture with the OpenPose tool, wherein positions at which skeleton key point coordinates cannot be acquired are represented by the coordinates (-1, -1).
4. A multitasking gesture picture super-resolution method based on picture edge information as claimed in claim 3, characterized by: acquiring 21 skeleton key point coordinates by the acquisition method; calculating, for each skeleton key point coordinate, the distance to every pixel coordinate on the corresponding low-resolution picture to obtain heat map data for that key point; and stacking the heat maps of the 21 skeleton key point coordinates along the picture depth to obtain the input data of the skeleton key point coordinates.
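Claim 4's per-key-point heat maps can be sketched minimally in numpy. The claim specifies computing a distance from each key point to every pixel and stacking the 21 maps along the depth axis; the Gaussian mapping from distance to heat value and the `sigma` parameter are assumptions, as is representing an undetected (-1, -1) key point by an all-zero map.

```python
import numpy as np

def keypoint_heatmaps(keypoints, h, w, sigma=1.5):
    """One distance-based heat map per key point, stacked along depth.
    keypoints: (21, 2) array of (x, y); (-1, -1) marks an undetected point."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((h, w, len(keypoints)))
    for i, (kx, ky) in enumerate(keypoints):
        if kx < 0 or ky < 0:                       # detection failed: all-zero map
            continue
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2       # squared distance to every pixel
        maps[..., i] = np.exp(-d2 / (2 * sigma ** 2))  # Gaussian of the distance
    return maps
```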
5. The picture edge information-based multitasking gesture picture super-resolution method of claim 1, characterized by: the first edge information graph is an edge information graph with the same pixel number as that of the high-resolution picture.
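The claims leave unspecified which operator produces an edge information graph from a picture; one common choice is the Sobel gradient magnitude, sketched here in numpy (the 3×3 Sobel kernels and zero padding are assumptions, not taken from the patent).

```python
import numpy as np

def sobel_edges(img):
    """Gradient-magnitude edge map of a grayscale (H, W) image via Sobel filters."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # horizontal gradient
    ky = kx.T                                                   # vertical gradient
    p = np.pad(img, 1)
    win = np.lib.stride_tricks.sliding_window_view(p, (3, 3))
    gx = np.einsum('hwkl,kl->hw', win, kx)
    gy = np.einsum('hwkl,kl->hw', win, ky)
    return np.hypot(gx, gy)   # per-pixel gradient magnitude, same size as img
```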
6. The picture edge information-based multitasking gesture picture super-resolution method of claim 1, characterized by: in the model construction, performing one back propagation specifically comprises:
carrying out gradient back propagation on the loss function with an Adam optimizer and updating the parameters of the super-resolution model until the model loss no longer changes, thereby completing model training.
CN202110508733.6A 2021-05-11 2021-05-11 Multitasking gesture picture super-resolution method based on picture edge information Active CN113240584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110508733.6A CN113240584B (en) 2021-05-11 2021-05-11 Multitasking gesture picture super-resolution method based on picture edge information


Publications (2)

Publication Number Publication Date
CN113240584A CN113240584A (en) 2021-08-10
CN113240584B true CN113240584B (en) 2023-04-28

Family

ID=77131340


Country Status (1)

Country Link
CN (1) CN113240584B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083016A (en) * 2022-06-09 2022-09-20 广州紫为云科技有限公司 Monocular camera-based small-target-oriented hand space interaction method and device
CN117037221B (en) * 2023-10-08 2023-12-29 腾讯科技(深圳)有限公司 Living body detection method, living body detection device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610140A (en) * 2017-08-07 2018-01-19 中国科学院自动化研究所 Near edge detection method, device based on depth integration corrective networks
CN111062872A (en) * 2019-12-17 2020-04-24 暨南大学 Image super-resolution reconstruction method and system based on edge detection
CN112767427A (en) * 2021-01-19 2021-05-07 西安邮电大学 Low-resolution image recognition algorithm for compensating edge information

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI624804B (en) * 2016-11-07 2018-05-21 盾心科技股份有限公司 A method and system for providing high resolution image through super-resolution reconstruction
CN109345449B (en) * 2018-07-17 2020-11-10 西安交通大学 Image super-resolution and non-uniform blur removing method based on fusion network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610140A (en) * 2017-08-07 2018-01-19 中国科学院自动化研究所 Near edge detection method, device based on depth integration corrective networks
CN111062872A (en) * 2019-12-17 2020-04-24 暨南大学 Image super-resolution reconstruction method and system based on edge detection
CN112767427A (en) * 2021-01-19 2021-05-07 西安邮电大学 Low-resolution image recognition algorithm for compensating edge information

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Multi-Stage Feature Fusion Network for Video Super-Resolution; Huihui Song et al.; IEEE Transactions on Image Processing; 2021-02-09; vol. 30 *
A super-resolution reconstruction method based on edge-adaptive interpolation; Li Yifan; Fujian Computer; 2010-11-25 (No. 11) *
Joint implementation of image fusion and super-resolution via convolutional sparse representation; Yang Moyuan et al.; Optical Technique; 2020-03-15 (No. 02) *
Edge-preserving image interpolation algorithm based on dyadic wavelet transform; Ma Shexiang et al.; Journal of Optoelectronics·Laser; 2005-07-15 (No. 07) *
Image super-resolution reconstruction based on a multi-scale feature mapping network; Duan Ran et al.; Journal of Zhejiang University (Engineering Science); 2019-07-31; vol. 53 (No. 07) *
Super-resolution reconstruction method based on edge detection; Cai Qiurong et al.; Computer Engineering; 2011-06-05 (No. 11) *
Edge-corrected multi-scale convolutional neural network reconstruction algorithm; Cheng Deqiang et al.; Laser & Optoelectronics Progress; 2018-03-28 (No. 09) *

Also Published As

Publication number Publication date
CN113240584A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN113240584B (en) Multitasking gesture picture super-resolution method based on picture edge information
CN111626927B (en) Binocular image super-resolution method, system and device adopting parallax constraint
Liu et al. Effective image super resolution via hierarchical convolutional neural network
CN114882524A (en) Monocular three-dimensional gesture estimation method based on full convolution neural network
CN112184547A (en) Super-resolution method of infrared image and computer readable storage medium
CN117788296A (en) Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network
CN117612204A (en) Construction method and system of three-dimensional hand gesture estimator
CN111311732B (en) 3D human body grid acquisition method and device
Zoetgnande et al. Edge focused super-resolution of thermal images
Wang et al. Super-resolving face image by facial parsing information
CN117078518A (en) Three-dimensional point cloud superdivision method based on multi-mode iterative fusion
Sun et al. A rapid and accurate infrared image super-resolution method based on zoom mechanism
CN115578260B (en) Attention method and system for directional decoupling of image super-resolution
Bai et al. Restoration of turbulence-degraded images based on deep convolutional network
CN112598581B (en) Training method and image generation method of RDN super-resolution network
Li et al. V-ShadowGAN: generative adversarial networks for removing and generating shadows associated with vehicles based on unpaired data
Song et al. Spatial-aware dynamic lightweight self-supervised monocular depth estimation
Wen et al. Mrft: Multiscale recurrent fusion transformer based prior knowledge for bit-depth enhancement
Liu et al. Remote sensing image super-resolution via dilated convolution network with gradient prior
CN110706167A (en) Fine completion processing method and device for remote sensing image to-be-repaired area
CN115272083B (en) Image super-resolution method, device, equipment and medium
Shi et al. Dual-Branch Multiscale Channel Fusion Unfolding Network for Optical Remote Sensing Image Super-Resolution
Xu et al. Color Guided Depth Map Super-Resolution with Nonlocal Autoregressive Modeling
Wang et al. Unbiased feature position alignment for human pose estimation
CN117496091B (en) Single-view three-dimensional reconstruction method based on local texture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant