CN110533579B - Video style conversion method based on self-coding structure and gradient order preservation


Info

Publication number
CN110533579B
Authority
CN
China
Prior art keywords
video
layer
stylized
network
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910680259.8A
Other languages
Chinese (zh)
Other versions
CN110533579A (en)
Inventor
牛毅
郭博嘉
李甫
李宜烜
石光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910680259.8A priority Critical patent/CN110533579B/en
Publication of CN110533579A publication Critical patent/CN110533579A/en
Application granted granted Critical
Publication of CN110533579B publication Critical patent/CN110533579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/04 - Context-preserving transformations, e.g. by using an importance map
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • G06T 9/002 - Image coding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a video style conversion method based on a self-coding structure and gradient order preservation, which solves the technical problem of existing video style conversion methods that a halo is generated at the edges of foreground objects in the stylized video. The implementation steps are: 1) construct a training sample set and a test sample set; 2) construct a video stylized network model; 3) train the video stylized network model; 4) test the trained video stylized network model; 5) obtain the video style conversion result. By constructing a video stylized network model based on a self-coding structure and a gradient order-preserving loss function, and by redefining the temporal consistency constraint in a more reasonable way, the method effectively eliminates the halo generated at the edges of foreground objects in the stylized video, retains the texture detail information of the original video, and improves the visual experience; it can be used in the post-production of photographic and film and television works.

Description

Video style conversion method based on self-coding structure and gradient order preservation
Technical Field
The invention belongs to the technical field of digital image processing and relates to a video style conversion method, in particular to a video style conversion method based on a self-coding structure and gradient order preservation, which can be used in the post-production of photographic and film and television works.
Background
Image generation is an important branch of computer vision and includes image super-resolution, image colorization, image semantic segmentation, style conversion of images or videos, and the like. Style conversion of images or videos is generally treated as a texture synthesis problem: given a style image, texture is extracted from the source and transferred to the target to generate the corresponding style conversion result.
Image style conversion methods can be divided into two categories: traditional iterative methods and neural-network-based methods. Traditional iterative methods include stroke-based rendering, region-based rendering, and example-based rendering; although such methods can faithfully reproduce a specific style pattern without a CNN, they are limited in flexibility, style diversity, and effective extraction of image structure. Neural-network-based methods, by contrast, extract content features of the input image with a pre-trained convolutional neural network, extract texture features with a Gram matrix, and iteratively optimize the output image so that its feature distribution matches the expected feature distribution in the convolutional neural network.
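As an illustration of the Gram-matrix texture statistics mentioned above, the following minimal sketch (Python with NumPy; the names and shapes are chosen for illustration only) shows how a Gram matrix summarizes the channel correlations of a CNN feature map independently of its spatial layout:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map with shape (H, W, C).

    The (i, j) entry is the inner product between channels i and j,
    which captures texture statistics independent of spatial layout.
    """
    h, w, c = features.shape
    f = features.reshape(h * w, c)   # flatten the spatial dimensions
    return f.T @ f / (h * w * c)     # normalized C x C Gram matrix
```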
After the image style conversion method achieves certain results, many scholars turn the attention to the video style conversion method. The method is characterized in that a video is formed by combining a plurality of frames of images, continuous stylized video frames are obtained by frame splitting and an image style conversion method is used for combining the frames of the videos into a stylized video, and therefore the video style conversion method is improved based on the image style conversion method. The video style conversion method can be applied to post-production processing and treatment of film and television works, and can generate corresponding video style conversion results under the condition of specifying target style images.
At present, the following methods are mainly used for typical video style conversion:
Manuel Ruder et al. published an article entitled "Artistic Style Transfer for Videos" in 2016, which discloses an iterative video style conversion method. It adds a temporal loss on top of image style conversion and introduces the concept of temporal consistency between adjacent video frames in order to penalize deviations between two frames. Temporal consistency between adjacent stylized video frames is thereby ensured and video flicker is effectively prevented. However, because the output is obtained by iterative optimization, video generation is very slow and its time cost is high.
Haozhi Huang et al. published an article entitled "Real-Time Neural Style Transfer for Videos" at Computer Vision and Pattern Recognition in 2017, which discloses a feed-forward video style conversion method that shortens conversion time by training a temporally consistent style-transfer feed-forward neural network. Although this method greatly improves the efficiency of video generation, it still has two drawbacks when converting video style: first, halos appear around foreground objects in the generated stylized video, degrading the visual experience; second, because optical flow estimation is not sufficiently accurate, the temporal consistency loss is computed in an unreasonable way, since optical flow detected on the original video frames is not suitable for constraining temporal consistency between stylized video frames; this introduces training errors and degrades the fluency and continuity of the stylized video.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a video style conversion method based on a self-coding structure and gradient order preservation, which eliminates the halo around foreground objects in the stylized video while maintaining the video style conversion speed, thereby improving the visual experience.
The technical idea of the invention is as follows: first, a video stylized network structure is constructed based on the idea of a self-coding structure to increase the generation speed of the stylized video; a new temporal consistency loss function is defined to suppress jitter and flicker in the stylized video; a reconstruction loss function is added so that the stylized video retains the detail information of the original video; and a gradient order-preserving loss function is added to eliminate the halo around foreground objects in the stylized video and improve the visual experience. The specific steps are as follows:
(1) Constructing a training sample set and a testing sample set:
(1a) Obtaining a target style image s and M_r video data of resolution N_r×N_r, splitting each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and simultaneously extracting the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100, N_f ≥ 25;
(1b) Forming a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to these 4M_r/5 groups of original video frames, forming a training sample set from the target style image s and the training set, and forming a test sample set from the remaining M_r/5 groups of original video frames x;
(2) Constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
Constructing a video stylized network structure comprising an encoder network, a decoder network, and a loss network (here the stylized video frame is written x̂ and the reconstructed video frame x̃), wherein:
the encoder network, including an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the stylized video frame x̂;
the decoder network, including an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the reconstructed video frame x̃;
the loss network, including an input layer, a plurality of convolutional layers and a plurality of pooling layers, is used to extract the high-order features of the target style image s, the stylized video frame x̂ and the original video frame x;
(2b) Defining the total loss function L_total of the video stylized network structure:
Defining the total loss function L_total of the video stylized network structure, comprising a spatial structure loss function L_spatial, a temporal consistency loss function L_temporal, a gradient order-preserving loss function L_gradient and a reconstruction loss function L_reconstruction, as their weighted sum (the explicit formulas of the individual losses are presented as equation images in the patent publication):
L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction
wherein x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t; μ is the weight balancing the content loss function L_content and the style loss function L_style that make up L_spatial; F is an affine transformation (optical-flow warping) operation, and L_temporal is the mean squared error, normalized by D, between adjacent reconstructed video frames after one of them has been warped by F; L_reconstruction(x_t, x̃_t) penalizes the difference between the original video frame and its reconstruction; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function and M is a morphological dilation operation;
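For clarity, the weighted combination defined above can be written as a small helper. This is a minimal sketch; the default weight values are placeholders, since the patent only names α, β, λ and γ without fixing them:

```python
def total_loss(l_spatial, l_temporal, l_gradient, l_reconstruction,
               alpha=1.0, beta=1.0, lam=1.0, gamma=1.0):
    """L_total = alpha*L_spatial + beta*L_temporal + lambda*L_gradient + gamma*L_reconstruction."""
    return (alpha * l_spatial + beta * l_temporal
            + lam * l_gradient + gamma * l_reconstruction)
```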
(3) Training a video stylized network model:
the original video frame x of the target style image s and t in the training sample set t And original video frame x at time t +1 t+1 And x t And x t+1 Taking the optical flow data as the input of the video stylized network model, and performing K times of iterative training on the video stylized network model to obtain a trained video stylized network model, wherein K is more than or equal to 20000;
(4) Testing the trained video stylized network model:
Taking the test sample set as the input of the encoder network in the trained video stylized network model to obtain M_r/5 groups of stylized video frame sequences;
(5) Acquiring a video style conversion result:
Combining each group of stylized video frame sequences into a video at frame rate N_f in temporal order to obtain the style-converted video.
Compared with the prior art, the invention has the following advantages:
1. Based on the idea of a self-coding structure, the invention constructs a video stylized network model containing a gradient order-preserving loss function and a reconstruction loss function. During training, these two losses constrain the intermediate values and output values of the encoder and decoder networks in the self-coding structure, preventing the pixel values at the edges of foreground objects in the stylized video from becoming over-smoothed or gradient-reversed. This effectively eliminates the halo at the edges of foreground objects that appears in the prior art, provides sharp, halo-free foreground boundaries, retains the texture detail information of the original video in the stylized video, and effectively improves the visual experience.
2. Based on the idea of a self-coding structure, the invention constructs a video stylized network model with a redefined temporal consistency loss function, which is computed as the mean square error between the pixels of two adjacent reconstructed video frames. Because the reconstructed frames are essentially similar to the original frames in spatial structure, this avoids the prior-art computation on two adjacent stylized frames, reduces the error introduced by applying optical flow estimated on the original video to video style conversion, effectively suppresses flicker and jitter in the stylized video, improves its smoothness and continuity, and further improves the visual experience.
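The redefined temporal consistency loss described in advantage 2 can be sketched as follows. This is a minimal illustration assuming backward warping with OpenCV's remap; the flow convention and warping direction are assumptions rather than details fixed by the patent:

```python
import cv2
import numpy as np

def temporal_consistency_loss(recon_t, recon_t1, flow):
    """Sketch of the redefined temporal consistency loss.

    recon_t, recon_t1: reconstructed frames at times t and t+1, shape (H, W, C), float32.
    flow: optical flow between the ORIGINAL frames (e.g. from FlowNet2), shape (H, W, 2).
    One reconstructed frame is warped toward the other and the mean squared
    error is taken, which corresponds to normalizing by D = H*W*C.
    """
    h, w, _ = recon_t.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # backward warping: sample recon_t at positions displaced by the flow
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    warped = cv2.remap(recon_t, map_x, map_y, cv2.INTER_LINEAR)
    return np.mean((recon_t1 - warped) ** 2)
```

Because both inputs are reconstructed frames, which share the spatial structure of the original frames, the optical flow estimated on the original video matches them much better than it matches the stylized frames.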
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a video stylized network architecture constructed in accordance with the present invention;
Fig. 3 is a comparison of the video style conversion results of the present invention and the prior art.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Referring to fig. 1, the implementation steps of the invention are as follows:
step 1) constructing a training sample set and a test sample set:
(1a) Obtaining a target style image s and M_r video data of resolution N_r×N_r, splitting each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and simultaneously extracting the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100, N_f ≥ 25;
(1b) Forming a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to these 4M_r/5 groups of original video frames, forming a training sample set from the target style image s and the training set, and forming a test sample set from the remaining M_r/5 groups of original video frames x;
Most existing video datasets have rectangular frames, but the optical flow estimation used here automatically converts rectangular video frames to the square resolution N_r×N_r, and during training optical flow is only estimated and extracted for square video, so the original rectangular videos are adjusted to square videos for the training set. A down-sampling operation is performed during training, and videos with too small a resolution give unsatisfactory results after style conversion, so the resolution of the input video has a lower bound, N_r ≥ 64. In this embodiment, a target style image is obtained and the 124 videos contained in the Sintel and DAVIS video datasets are used: the 124 videos are split into frames and optical flow data are extracted with the FlowNet2 algorithm, giving 124 groups of original video frames with resolution 256×256 and the 124 corresponding groups of optical flow data; 102 groups of video frames and their corresponding optical flow data form the training set, the target style image and the training set form the training sample set, and the remaining 22 groups of video frames are used as the test sample set.
Step 2), constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
Referring to fig. 2, a video stylized network structure is constructed comprising an encoder network, a decoder network and a loss network, where the encoder network and the decoder network together form a self-coding structure that generates the intermediate values and output values on which the respective loss functions are imposed:
the encoder network includes an input layer, four convolutional layers, five residual layers and two deconvolution layers, and is used to generate the stylized video frame x̂:
input layer → first convolutional layer → second convolutional layer → third convolutional layer → first residual layer → second residual layer → third residual layer → fourth residual layer → fifth residual layer → first deconvolution layer → second deconvolution layer → fourth convolutional layer;
the decoder network includes an input layer, three convolutional layers, two residual layers and one deconvolution layer, and is used to generate the reconstructed video frame x̃:
input layer → first convolutional layer → second convolutional layer → first residual layer → second residual layer → first deconvolution layer → third convolutional layer;
the loss network adopts the first sixteen convolutional layers of a pre-trained VGG-19, including an input layer, sixteen convolutional layers and four pooling layers, and is used to extract the high-order features of the target style image s, the stylized video frame x̂ and the original video frame x:
input layer → first convolutional layer → second convolutional layer → first pooling layer → third convolutional layer → fourth convolutional layer → second pooling layer → fifth convolutional layer → sixth convolutional layer → seventh convolutional layer → eighth convolutional layer → third pooling layer → ninth convolutional layer → tenth convolutional layer → eleventh convolutional layer → twelfth convolutional layer → fourth pooling layer → thirteenth convolutional layer → fourteenth convolutional layer → fifteenth convolutional layer → sixteenth convolutional layer.
Wherein the parameter settings of each layer of the decoder network and the encoder network are as follows:
Figure BDA0002144558420000071
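Since the per-layer parameter table is only available as an image, the following Keras sketch of the encoder branch uses assumed filter counts, kernel sizes and strides; only the layer ordering (four convolutional layers, five residual layers, two deconvolution layers) follows the text above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=128):
    """Simple residual block; the filter count is an assumption."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.add([x, y])

def build_encoder(input_shape=(256, 256, 3)):
    """Sketch of the encoder (stylization) branch of the self-coding structure."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 9, strides=1, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    for _ in range(5):
        x = residual_block(x, 128)
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2D(3, 9, strides=1, padding="same", activation="tanh")(x)  # stylized frame
    return tf.keras.Model(inp, out, name="encoder")
```

The decoder branch can be built the same way with three convolutional layers, two residual layers and one deconvolution layer, producing the reconstructed frame.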
(2b) Defining the total loss function L_total of the video stylized network structure:
Defining the total loss function L_total of the video stylized network structure, comprising a spatial structure loss function L_spatial, a temporal consistency loss function L_temporal, a gradient order-preserving loss function L_gradient and a reconstruction loss function L_reconstruction(x_t, x̃_t), as their weighted sum (the explicit formulas of the individual losses are presented as equation images in the patent publication):
L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction
wherein x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t; μ is the weight balancing the content loss function L_content and the style loss function L_style that make up L_spatial; F is an affine transformation (optical-flow warping) operation, and L_temporal is the mean squared error, normalized by D, between adjacent reconstructed video frames after one of them has been warped by F; L_reconstruction(x_t, x̃_t) penalizes the difference between the original video frame and its reconstruction; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function and M is a morphological dilation operation;
wherein the style loss function L_style, the content loss function L_content, the three-dimensional gradient matrices A_t and B_t and the threshold function G_T are defined as follows (their explicit formulas are given as equation images in the patent publication):
the content loss L_content compares the feature maps φ_l(x_t) and φ_l(x̂) extracted by the loss network, and the style loss L_style compares the Gram matrices G_l(x̂) and G_l(s); here x_t is the original video frame at time t, x̂ is the stylized video frame, x̃_t is the reconstructed video frame at time t, φ_l is the feature map extracted by the l-th convolutional layer of the loss network, C_l, H_l and W_l are respectively the number of channels, height and width of φ_l, and G_l is the Gram matrix of the feature map extracted by the l-th convolutional layer of the loss network;
A_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x_t, and B_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x̂; each two-dimensional gradient matrix is obtained by convolving one channel with one convolution kernel K_q, where x_t^p is the p-th channel of x_t, x̂^p is the p-th channel of x̂, p is the channel index, p ∈ {1,2,3}; every pixel of x_t^p and x̂^p is convolved with convolution kernels K_q in the four directions lower-left, lower, lower-right and right, q being the index of the directional kernel K_q, q ∈ {1,2,3,4}; K_q(m,n) is the value of K_q at position (m,n), m and n being the row and column indices of K_q; i and j are the row and column indices of the two-dimensional gradient matrices; k_r is the width of K_q and k_c is the length of K_q; in this embodiment the convolution operation is performed with four kernels K_q, one per direction, whose numerical values are given as images in the patent publication.
G_T, in brief, encodes the gradient order, i.e. the direction information of the gradient, taken in one direction at a time: for example, if the pixel value of the current pixel exceeds that of its lower-right neighbour by more than the threshold a, the corresponding position is assigned 1, so a value of 1 in G_T indicates that the pixel value at that point is greater than the pixel value of the point to its lower right. The value of the threshold a is derived empirically from experiments.
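The gradient order-preserving loss can be sketched as follows. Both the directional kernels and the way G_T, the dilation M and B_t are combined are assumptions made for illustration, since the patent gives the kernels and the loss formula only as images; the sketch only reproduces the described idea of penalizing reversed gradients at positions where the original frame has a clear gradient direction:

```python
import numpy as np
from scipy.ndimage import convolve, binary_dilation

# Hypothetical difference kernels for the four directions
# (lower-left, lower, lower-right, right); the patent's actual values are images.
KERNELS = [
    np.array([[0, 0, 0], [0, 1, 0], [-1, 0, 0]], dtype=np.float32),
    np.array([[0, 0, 0], [0, 1, 0], [0, -1, 0]], dtype=np.float32),
    np.array([[0, 0, 0], [0, 1, 0], [0, 0, -1]], dtype=np.float32),
    np.array([[0, 0, 0], [0, 1, -1], [0, 0, 0]], dtype=np.float32),
]

def directional_gradients(frame):
    """Stack the 12 two-dimensional gradient maps (4 directions x 3 channels)."""
    maps = [convolve(frame[..., p], k) for k in KERNELS for p in range(3)]
    return np.stack(maps, axis=-1)

def gradient_order_loss(original, stylized, a=0.02):
    """Assumed combination of G_T, dilation and the stylized gradients.

    Positions where the original frame has a clear gradient direction
    (difference > a) are dilated into a mask, and the stylized frame is
    penalized wherever its gradient at those positions is reversed,
    which is what produces halos.
    """
    A_t = directional_gradients(original)
    B_t = directional_gradients(stylized)
    mask = binary_dilation(A_t > a)                   # G_T followed by dilation M
    reversed_grad = np.maximum(-B_t, 0.0)             # magnitude of reversed gradients
    return np.sum(mask * reversed_grad) / A_t.size    # normalized by D_m
```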
Step 3) training the video stylized network model:
the original video frame x of the target style image at the time of s and t in the training sample set t And original video frame x at time t +1 t+1 And x t And x t+1 The optical flow data is used as the input of the video stylized network model, K times of iterative training are carried out on the video stylized network model to obtain the trained video stylized network model, wherein K is more than or equal to 20000, and the specific implementation steps are as follows:
(3a) Initializing parameters of an encoder network and a decoder network, loading the parameters of a trained loss network, setting the iteration frequency as T, setting the maximum iteration frequency as K, wherein K is more than or equal to 20000, and enabling T =0 and T =1;
(3b) Original video frame x t And x t+1 Simultaneously inputting the video frames into an encoder network of the video stylized network model to obtain stylized video frames output by the encoder network
Figure BDA0002144558420000101
And
Figure BDA0002144558420000102
(3c) According to gradient order-preserving loss function
Figure BDA0002144558420000103
Calculating the t time x t And
Figure BDA0002144558420000104
gradient order-preserving loss in between, and time x at t +1 t+1 And
Figure BDA0002144558420000105
gradient order preserving loss is obtained, and the sum of the gradient order preserving loss at the time t and the time t +1 is obtained to obtain the trained gradient order preserving loss; the method is used for preventing the pixel value at the edge of the foreground target in the stylized video from being too smooth or gradient reversal, and eliminating the halo at the edge of the foreground target in the stylized video in the prior art;
(3d) X is to be t And x t+1
Figure BDA0002144558420000106
And
Figure BDA0002144558420000107
and simultaneously inputting the target style image s into a loss network of the video stylized network model, and extracting x t 、x t+1
Figure BDA0002144558420000108
S higher order features of sum, and according to spatial structure loss function
Figure BDA0002144558420000109
Calculating the t time x t High-order characteristics of,
Figure BDA00021445584200001010
And the loss of spatial structure between the higher order features of s, and the time t +1 x t+1 High-order characteristics of,
Figure BDA00021445584200001011
The space structure loss between the high-order characteristic of s and the high-order characteristic of s, and the sum of the space structure losses at the time t and the time t +1 is solved to obtain the space structure loss after training; the system comprises a video processing unit, a storage unit and a processing unit, wherein the video processing unit is used for constraining content information, spatial features and texture features of a stylized video;
(3e) Formatting video frames
Figure BDA0002144558420000111
And
Figure BDA0002144558420000112
inputting the video frame into a decoder network of a video stylized network model to obtain a reconstructed video frame
Figure BDA0002144558420000113
And
Figure BDA0002144558420000114
(3f) According to x t And x t+1 Optical flow data between
Figure BDA0002144558420000115
Affine transformation
Figure BDA0002144558420000116
Predicted value at time t
Figure BDA0002144558420000117
And according to a time consistency loss function
Figure BDA0002144558420000118
Computing
Figure BDA0002144558420000119
And
Figure BDA00021445584200001110
loss of time consistency between, i.e. loss of time consistency after training; the method is used for inhibiting the flickering and shaking phenomena of the stylized video and improving the smoothness and smoothness of the stylized video;
(3g) According to a reconstruction loss function
Figure BDA00021445584200001111
Calculating the t time x t And
Figure BDA00021445584200001112
reconstruction loss in between, and t +1 time x t+1 And
Figure BDA00021445584200001113
the sum of reconstruction losses at the t moment and the t +1 moment is obtained to obtain the reconstruction loss after training; the texture detail information used for keeping the original video;
(3h) Substituting the results calculated in steps (3 c), (3 d), (3 f) and (3 g) into the total loss function L of the video stylized network structure total Calculating the total loss of the trained video stylized network structure, and updating the parameters of the encoder network and the decoder network through the total loss of the video stylized network structure by using a gradient descent algorithm;
(3i) Judging whether T is equal to K, if so, obtaining a trained video stylized network model; otherwise, let t = t +1, and perform step (3 b).
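Steps (3a) to (3i) can be organized as a single differentiable training step. The sketch below (TensorFlow, matching the framework named in the experiments) treats the four loss functions as user-supplied callables and uses illustrative weights, so everything beyond the overall structure is an assumption:

```python
import tensorflow as tf

def make_train_step(encoder, decoder, optimizer, losses, weights=(1.0, 1.0, 1.0, 1.0)):
    """Build one training iteration covering steps (3b)-(3h).

    `losses` is a dict of callables {"spatial", "temporal", "gradient",
    "reconstruction"}; their exact formulas follow the patent and are not
    reproduced here. The weight values are illustrative.
    """
    alpha, beta, lam, gamma = weights

    @tf.function
    def train_step(x_t, x_t1, flow, style):
        with tf.GradientTape() as tape:
            s_t, s_t1 = encoder(x_t), encoder(x_t1)        # (3b) stylized frames
            r_t, r_t1 = decoder(s_t), decoder(s_t1)        # (3e) reconstructed frames
            total = (alpha * (losses["spatial"](x_t, s_t, style) +
                              losses["spatial"](x_t1, s_t1, style)) +      # (3d)
                     beta * losses["temporal"](r_t, r_t1, flow) +          # (3f)
                     lam * (losses["gradient"](x_t, s_t) +
                            losses["gradient"](x_t1, s_t1)) +              # (3c)
                     gamma * (losses["reconstruction"](x_t, r_t) +
                              losses["reconstruction"](x_t1, r_t1)))       # (3g)
        variables = encoder.trainable_variables + decoder.trainable_variables
        grads = tape.gradient(total, variables)                            # (3h)
        optimizer.apply_gradients(zip(grads, variables))
        return total

    return train_step
```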
Step 4) testing the trained video stylized network model:
taking a test sample set as a trained video stylized meshThe input of the encoder network in the network model is obtained as 1M r A/5 set of stylized video frame sequences;
step 5) obtaining a video style conversion result:
at frame rate N for each group of stylized video frame sequences f And combining frames according to the time sequence to obtain the video with the converted style.
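Step 5 amounts to re-multiplexing the stylized frames at frame rate N_f. A minimal OpenCV sketch follows; the codec and output path are illustrative choices, not specified by the patent:

```python
import cv2

def frames_to_video(frames, out_path, fps=25):
    """Recombine a stylized frame sequence into a video at frame rate N_f.

    `frames` is a list of HxWx3 uint8 BGR images in temporal order.
    """
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
```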
The effects of the present invention can be further illustrated by the following practical experiments:
1. Conditions of the experiment
The hardware test platform used in the experiment is: an Intel Core i7 CPU with a 3.60 GHz main frequency and 8 GB of memory. The software simulation platform is: Ubuntu 16.04 64-bit operating system and the PyCharm development platform. The simulation language is Python, and the deep learning framework is TensorFlow.
2. Analysis of experimental content and results
The experimental content is as follows: the same video data is style-converted using the method of the present invention and the method of Haozhi Huang et al., and the resulting style conversion results are shown in FIG. 3.
Fig. 3 (a) is a target-style image.
Fig. 3 (b) is the 10th original video frame in the ambush_1 scene of the Sintel dataset.
Fig. 3 (c) is the 9th original video frame in the ambush_1 scene of the Sintel dataset.
Fig. 3 (d) and 3 (e) are stylized video frames after the style conversion of fig. 3 (b) and 3 (c) by the prior art.
Fig. 3 (f) and 3 (g) are stylized video frames after the present invention style-converts fig. 3 (b) and 3 (c).
Comparing the experimental results in fig. 3 (f) and fig. 3 (g) with those in fig. 3 (d) and fig. 3 (e), the regions inside the white boxes show that the video style conversion effect of the method of the present invention is better. The video stylization model based on the self-coding structure and gradient order preservation used in the experiment effectively eliminates the halo generated around foreground objects of the stylized video in the prior art and provides sharp, halo-free foreground boundaries. It can also be seen that the new temporal consistency algorithm provided by the present invention effectively suppresses jitter and flicker in the stylized video, retains the texture detail information of the original video, and effectively improves the visual experience.
The above description is only one specific example of the present invention and does not constitute any limitation of the present invention. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made therein without departing from the principles and arrangements of the invention, but these modifications and changes are still within the scope of the invention as defined in the appended claims.

Claims (4)

1. A video style conversion method based on self-coding structure and gradient order preservation is characterized by comprising the following steps:
(1) Constructing a training sample set and a testing sample set:
(1a) Obtaining a target style image s and M_r video data of resolution N_r×N_r, splitting each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and simultaneously extracting the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100, N_f ≥ 25;
(1b) Forming a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to these 4M_r/5 groups of original video frames, forming a training sample set from the target style image s and the training set, and forming a test sample set from the remaining M_r/5 groups of original video frames x;
(2) Constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
constructing a video stylized network structure comprising an encoder network, a decoder network, and a loss network (the stylized video frame is written x̂ and the reconstructed video frame x̃), wherein:
the encoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the stylized video frame x̂;
the decoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the reconstructed video frame x̃;
the loss network, comprising an input layer, a plurality of convolutional layers and a plurality of pooling layers, is used to extract the high-order features of the target style image s, the stylized video frame x̂ and the original video frame x;
(2b) Defining the total loss function L_total of the video stylized network structure:
Defining the total loss function L_total of the video stylized network structure, comprising a spatial structure loss function L_spatial, a temporal consistency loss function L_temporal, a gradient order-preserving loss function L_gradient and a reconstruction loss function L_reconstruction, as their weighted sum (the explicit formulas of the individual losses are presented as equation images in the patent publication):
L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction
wherein x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t; μ is the weight balancing the content loss function L_content and the style loss function L_style that make up L_spatial; F is an affine transformation (optical-flow warping) operation, and L_temporal is the mean squared error, normalized by D, between adjacent reconstructed video frames after one of them has been warped by F; L_reconstruction(x_t, x̃_t) penalizes the difference between the original video frame and its reconstruction; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function and M is a morphological dilation operation;
(3) Training a video stylized network model:
taking the target style image s, the original video frame x_t at time t and the original video frame x_{t+1} at time t+1 in the training sample set, together with the optical flow data between x_t and x_{t+1}, as the input of the video stylized network model, and performing K iterations of training on the video stylized network model to obtain the trained video stylized network model, where K ≥ 20000;
(4) Testing the trained video stylized network model:
taking the test sample set as the input of the encoder network in the trained video stylized network model to obtain M_r/5 groups of stylized video frame sequences;
(5) Acquiring a video style conversion result:
combining each group of stylized video frame sequences into a video at frame rate N_f in temporal order to obtain the style-converted video.
2. The method for converting video style based on self-coding structure and gradient order preservation according to claim 1, wherein the specific structures of the encoder network, the decoder network and the loss network in step (2a) are respectively:
the encoder network includes an input layer, four convolutional layers, five residual layers, and two deconvolution layers:
input layer → first convolution layer → second convolution layer → third convolution layer → first residual layer → second residual layer → third residual layer → fourth residual layer → fifth residual layer → first deconvolution layer → second deconvolution layer → fourth convolution layer;
the decoder network includes an input layer, three convolutional layers, two residual layers, and one deconvolution layer:
input layer → first convolution layer → second convolution layer → first residual layer → second residual layer → first deconvolution layer → third convolution layer;
the loss network adopts a pre-trained loss network and comprises an input layer, sixteen convolutional layers and four pooling layers:
input layer → first convolutional layer → second convolutional layer → first pooling layer → third convolutional layer → fourth convolutional layer → second pooling layer → fifth convolutional layer → sixth convolutional layer → seventh convolutional layer → eighth convolutional layer → third pooling layer → ninth convolutional layer → tenth convolutional layer → eleventh convolutional layer → twelfth convolutional layer → fourth pooling layer → thirteenth convolutional layer → fourteenth convolutional layer → fifteenth convolutional layer → sixteenth convolutional layer.
3. The method for converting video style based on self-coding structure and gradient order preservation according to claim 2, wherein the training of the video stylization network model in step (3) is implemented by the steps of:
(3a) Initializing the parameters of the encoder network and the decoder network, loading the parameters of the pre-trained loss network, denoting the iteration count by T and the maximum number of iterations by K, where K ≥ 20000, and letting t = 0 and T = 1;
(3b) Inputting the original video frames x_t and x_{t+1} simultaneously into the encoder network of the video stylized network model to obtain the stylized video frames x̂_t and x̂_{t+1} output by the encoder network;
(3c) According to the gradient order-preserving loss function L_gradient, calculating the gradient order-preserving loss between x_t and x̂_t at time t and the gradient order-preserving loss between x_{t+1} and x̂_{t+1} at time t+1, and summing the losses at times t and t+1 to obtain the trained gradient order-preserving loss;
(3d) Inputting x_t, x_{t+1}, x̂_t, x̂_{t+1} and the target style image s simultaneously into the loss network of the video stylized network model, extracting the high-order features of x_t, x_{t+1}, x̂_t, x̂_{t+1} and s, and, according to the spatial structure loss function L_spatial, calculating the spatial structure loss between the high-order features of x_t, x̂_t and s at time t and the spatial structure loss between the high-order features of x_{t+1}, x̂_{t+1} and s at time t+1, and summing the losses at times t and t+1 to obtain the trained spatial structure loss;
(3e) Inputting the stylized video frames x̂_t and x̂_{t+1} into the decoder network of the video stylized network model to obtain the reconstructed video frames x̃_t and x̃_{t+1};
(3f) According to the optical flow data between x_t and x_{t+1}, warping the reconstructed video frame x̃_{t+1} with the affine transformation F to obtain the predicted value at time t, and, according to the temporal consistency loss function L_temporal, calculating the temporal consistency loss between this prediction and x̃_t, i.e. the trained temporal consistency loss;
(3g) According to the reconstruction loss function L_reconstruction, calculating the reconstruction loss between x_t and x̃_t at time t and the reconstruction loss between x_{t+1} and x̃_{t+1} at time t+1, and summing the losses at times t and t+1 to obtain the trained reconstruction loss;
(3h) Substituting the results calculated in steps (3c), (3d), (3f) and (3g) into the total loss function L_total of the video stylized network structure to calculate the total loss of the trained video stylized network structure, and updating the parameters of the encoder network and the decoder network from this total loss using a gradient descent algorithm;
(3i) Judging whether T is equal to K; if so, the trained video stylized network model is obtained; otherwise letting t = t + 1 and returning to step (3b).
4. The method according to claim 1, wherein the style loss function L_style, the content loss function L_content, the three-dimensional gradient matrices A_t and B_t and the threshold function G_T in step (2b) are defined as follows (their explicit formulas are given as equation images in the patent publication):
the content loss L_content compares the feature maps φ_l(x_t) and φ_l(x̂) extracted by the loss network, and the style loss L_style compares the Gram matrices G_l(x̂) and G_l(s); here x_t is the original video frame at time t, x̂ is the stylized video frame, x̃_t is the reconstructed video frame at time t, φ_l is the feature map extracted by the l-th convolutional layer of the loss network, C_l, H_l and W_l are respectively the number of channels, height and width of φ_l, and G_l is the Gram matrix of the feature map extracted by the l-th convolutional layer of the loss network;
A_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x_t, and B_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x̂; each two-dimensional gradient matrix is obtained by convolving one channel with one convolution kernel K_q, where x_t^p is the p-th channel of x_t, x̂^p is the p-th channel of x̂, p is the channel index, p ∈ {1,2,3}; every pixel of x_t^p and x̂^p is convolved with convolution kernels K_q in the four directions lower-left, lower, lower-right and right, q being the index of the directional kernel K_q, q ∈ {1,2,3,4}; K_q(m,n) is the value of K_q at position (m,n), m and n being the row and column indices of K_q; i and j are the row and column indices of the two-dimensional gradient matrices; k_r is the width of K_q and k_c is the length of K_q;
G_T is a threshold function and a is an empirically derived threshold.
CN201910680259.8A 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation Active CN110533579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910680259.8A CN110533579B (en) 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910680259.8A CN110533579B (en) 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation

Publications (2)

Publication Number Publication Date
CN110533579A CN110533579A (en) 2019-12-03
CN110533579B (en) 2022-12-02

Family

ID=68661805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910680259.8A Active CN110533579B (en) 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation

Country Status (1)

Country Link
CN (1) CN110533579B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111263226B (en) * 2020-01-17 2021-10-22 中国科学技术大学 Video processing method, video processing device, electronic equipment and medium
CN111556244B (en) * 2020-04-23 2022-03-11 北京百度网讯科技有限公司 Video style migration method and device
CN112561864B (en) * 2020-12-04 2024-03-29 深圳格瑞健康科技有限公司 Training method, system and storage medium for caries image classification model
CN113128614B (en) * 2021-04-29 2023-06-16 西安微电子技术研究所 Convolution method based on image gradient, neural network based on direction convolution and classification method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152768B2 (en) * 2017-04-14 2018-12-11 Facebook, Inc. Artifact reduction for image style transfer
US10318889B2 (en) * 2017-06-26 2019-06-11 Konica Minolta Laboratory U.S.A., Inc. Targeted data augmentation using neural style transfer
CN107481185A (en) * 2017-08-24 2017-12-15 深圳市唯特视科技有限公司 A kind of style conversion method based on video image optimization
CN108924528B (en) * 2018-06-06 2020-07-28 浙江大学 Binocular stylized real-time rendering method based on deep learning

Also Published As

Publication number Publication date
CN110533579A (en) 2019-12-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant