CN114842104A - Capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors - Google Patents


Info

Publication number
CN114842104A
Authority
CN
China
Prior art keywords
resolution
module
feature
convolution
super
Prior art date
Legal status
Pending
Application number
CN202210539075.1A
Other languages
Chinese (zh)
Inventor
黄胜
陈贤龙
廖星
曹维俊
牟星宇
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210539075.1A
Publication of CN114842104A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/003: Reconstruction from projections, e.g. tomography
    • G06T 11/005: Specific pre-processing for tomographic reconstruction, e.g. calibration, source positioning, rebinning, scatter correction, retrospective gating
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 1/00: Instruments for performing medical examinations of the interior of cavities or tubes of the body by visual or photographical inspection, e.g. endoscopes; Illuminating arrangements therefor
    • A61B 1/04: Instruments for performing medical examinations of the interior of cavities or tubes of the body by visual or photographical inspection, e.g. endoscopes; Illuminating arrangements therefor combined with photographic or television appliances
    • A61B 1/041: Capsule endoscopes for imaging
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention provides a multi-scale-residual-based super-resolution reconstruction method for capsule endoscope images, aimed at the problem that, in capsule endoscopy (CE), low camera resolution and the constraints of the inspection environment prevent the capture of images with clear structure, detail, and other characteristics. The super-resolution reconstruction network consists of two main parts. The first is a feature extraction part composed of a shallow feature extraction module, a multi-scale residual feature extraction module (MRCB), and a hierarchical feature processing module (HFCA); it extracts and exploits information at different scales from the input low-resolution (LR) image and assigns corresponding weights to the feature information while avoiding information loss, in preparation for the subsequent upsampled reconstruction. The second is an image reconstruction module composed of a sub-pixel upsampling layer and a convolutional layer, which generates the final super-resolution (SR) image.

Description

Capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to a capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors.
Background
Capsule endoscopy (CE) is a modality for non-invasive examination of the digestive tract and is currently the first choice for diagnosing digestive tract diseases, particularly small bowel diseases. A capsule endoscope is a capsule-sized electronic device, swallowed by the patient, whose principal component is a CMOS camera; a single examination typically produces about 6 million images, which a physician reviews with a specific gaze pattern, observing pathology-related visual cues to reach a diagnosis from the CE images. In recent years, the rapid development of deep learning has enabled automatic, objective analysis of endoscopic images, including depth estimation, polyp detection, and feature description, which reduces the workload of physicians and improves their efficiency. However, because of poor illumination, limited camera performance, and the complex environment of the intestinal tract, images captured by CE tend to be low resolution (LR), which limits the usefulness of capsule endoscopy. Research shows that low-resolution images strongly affect diagnostic methods, and that enhanced image quality may improve lesion detection, region segmentation, pathological analysis, and other tasks. There is therefore a need for a method that improves the resolution of capsule endoscope images for both subjective and objective analysis.
The most direct way to obtain high-resolution endoscopic images is to improve the hardware and add optical elements, but this is costly, is constrained by the limited space inside the capsule endoscope device, and is therefore impractical. To address the problem, the computer vision community has studied a family of algorithms known as super resolution, which use signal processing techniques to reconstruct a high-resolution image from a low-resolution one. Using super-resolution techniques to overcome the fundamental resolution limit and achieve better image quality is a very challenging reconstruction process.
Super-resolution (SR) reconstruction recovers, from a low-resolution image captured under the limitations of imaging conditions, transmission media, and the like, a high-resolution image with richer texture detail, greater pixel density, and higher reliability. According to the number of input images, image super-resolution reconstruction divides into single-image and multi-image super-resolution reconstruction. In practice, single image super resolution reconstruction (SISR) is the main focus of research.
Single image super resolution reconstruction (SISR) methods fall into two broad categories: traditional methods and deep-learning-based methods. With the rapid development of deep learning and its excellent performance in image processing, SR technology has achieved notable success in computer vision and has been widely applied in medical imaging, security monitoring, satellite remote sensing, video restoration, and other fields; its pace of development and practical results far exceed those of traditional image super-resolution reconstruction methods.
Medical images are highly varied, and medical image datasets are difficult to acquire. Super-resolution reconstruction of medical images has so far focused mainly on MRI, PET, CT, and retinal images; only a few researchers have studied super-resolution for capsule endoscope images, and the results have not been ideal.
The key problem in deep-learning-based single-image super-resolution reconstruction, which this invention addresses, is how to mine and exploit the information in the input LR image. The resolution of the reconstructed capsule endoscope image is greater than that of the input capsule endoscope image.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a capsule endoscope image super-resolution reconstruction method based on multi-scale residuals, realizing super-resolution (SR) reconstruction of capsule endoscope images and generating CE images with clear structure, texture, and other characteristics.
To achieve this purpose, the invention adopts the following solution. A super-resolution reconstruction method for capsule endoscope images comprises the following scheme and steps:
1. acquiring a capsule endoscope image of a patient to be diagnosed and obtaining a pre-trained super-resolution reconstruction network (MRCN), which consists of two parts: a feature extraction part composed of a shallow feature extraction module, a multi-scale residual feature extraction module (MRCB), and a hierarchical feature processing module (HFCA); and an image reconstruction module composed of a sub-pixel upsampling layer and a convolution layer with a 3x3 kernel;
2. inputting a capsule endoscope image to be reconstructed into a shallow feature extraction module to perform shallow feature extraction to obtain a primary feature map;
3. sequentially passing the primary feature map through n MRCB modules to generate and obtain a secondary feature map;
4. extracting the hierarchical feature maps generated in the MRCB modules, inputting the secondary feature map and these hierarchical feature maps into the HFCA module, which fuses them and enhances the important features in the image, and outputting a three-level feature map;
5. sending the three-level feature map into a convolution layer with 3x3 as a convolution kernel for further feature extraction to generate a four-level feature map;
6. and inputting the four-level characteristic diagram into an image reconstruction module, and carrying out amplification reconstruction to obtain a super-resolution capsule endoscope image, wherein the resolution of the super-resolution capsule endoscope image is greater than that of the capsule endoscope image to be reconstructed.
The shallow feature extraction module in the above step is a convolution layer with a convolution kernel size of 3 × 3.
The MRCB module adopts residual blocks as its basic structural unit; each residual block consists of two convolution layers and a ReLU activation function and can be represented by the following mathematical model:
B_1 = φ_2(σ_1(φ_1(B))) + B
B_2 = φ_4(σ_2(φ_3(B))) + B
where B denotes the input to the residual block; φ_1(·) and φ_2(·) denote convolutions with a 3x3 kernel; φ_3(·) and φ_4(·) denote convolutions with a 5x5 kernel; σ_1(·) and σ_2(·) denote ReLU activations; and B_1 and B_2 are the outputs of the residual blocks with 3x3 and 5x5 kernels, respectively;
the MRCB module MRCB is composed of 8 parallel branches, and can be specifically represented by the following mathematical model:
Y 1 =B 11 (X)
Y 2 =B 21 (X)
Y 3 =F 3 (concat(B 12 (F 1 (concat([Y 1 ,Y 2 ]))),B 22 (F 2 (concat([Y 1 ,Y 2 ]))),
B 13 (X),B 23 (X),X))
Y 4 =B 14 (Y 3 )
Y 5 =B 24 (Y 3 )
Y 6 =F 6 (concat(B 15 (F 4 (concat([Y 4 ,Y 5 ]))),B 25 (F 5 (concat([Y 4 ,Y 5 ]))),
B 16 (Y 3 ),B 26 (Y 3 ),B 17 (X),B 27 (X),Y 3 ))
Y=F 7 (Y 6 )+X
wherein X denotes information input to the MRCB module, B 11 (·)、B 12 (·)、B 13 (·)、B 14 (·)、B 15 (·)、B 16 (. and B) 17 (. 2) shows a convolution kernel residual block, B, of 3x3 21 (·)、B 22 (·)、B 23 (·)、B 24 (·)、B 25 (·)、B 26 (. and B) 27 Shown is a residual block convolved with 5x5, concat (a) shows a feature fusion operation, F 1 (·)、F 2 (·)、F 3 (·)、F 4 (·)、F 5 (. and F) 6 (. The) shows a convolution operation with 1x1 as the convolution kernel, F 7 (. cndot.) represents a convolution operation with 3x3 as the convolution kernel, and Y represents the final output of a single multi-scale feature extraction module.
The HFCA module comprises a shuffling module and an enhanced attention module based on mean and standard deviation (COAM). The secondary feature map and the hierarchical feature maps generated in all MRCB modules are concatenated; the concatenated feature maps are input into the shuffling module and channel-compressed by a 1x1 convolution to obtain a fused feature map; the fused feature map is then input into the COAM module to obtain the three-level feature map.
The shuffling module divides the channels of the input feature information into the required number of groups, yielding a per-group channel count (the output channel count is the per-group channel count multiplied by the number of groups). A Reshape operation then converts the channels into matrix form, a transpose operation is applied, and after transposition the result is flattened and output grouped according to the original number of groups.
The COAM module pools the input information in a coordinate-like manner along the horizontal and vertical directions, taking both the mean and the standard deviation, and concatenates the pooled information. A shuffling operation fully fuses the information across channels; a convolution with a 1x1 kernel reduces the dimension of the fused information; the reduced information is passed through a nonlinear activation layer and then processed with the h-swish function to obtain new features. The new features are split along the horizontal and vertical directions into two independent features, each is raised in dimension by a convolution layer with a 1x1 kernel, and each is activated by a Sigmoid function to obtain weight features in the horizontal and vertical directions. Finally, the two weight features are multiplied with the input feature information to obtain the three-level feature map.
Further, to train the network better and enable it to learn richer features, the data set used for training is composed of two parts: pictures from Kvasir that have not been secondarily contaminated or damaged, and pictures from ETIS-LaribPolypDB, provided for the MICCAI 2015 challenge on polyp detection in endoscopy. Specifically, 1000 capsule endoscope images covering different symptoms are used for training, 20 for validation, and 26 for testing, each with a resolution between 1080P and 2K. The low-resolution pictures used throughout the experiments are generated with bicubic (BI) and blur-downsampling (BD) degradation models. Finally, all SR results are evaluated with PSNR and SSIM on the Y channel of the YCbCr space of the output image.
During training, RGB images are used as input; all training images are randomly rotated by 90°, 180°, or 270° and then horizontally flipped for effective data augmentation. The network is trained with the L_1 loss; the initial learning rate is 1x10^-4, and the optimizer is Adam (β_1 = 0.9, β_2 = 0.999, ε = 10^-8). Each training epoch runs 1000 iterations, each batch randomly selects 16 low-resolution patches of size 48x48 as input, and the learning rate is halved every 200 epochs over 1000 epochs of training.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. the feature extraction of the input image adopts a multi-scale feature extraction module that combines the residual idea and fully considers the influence of the network's external characteristics, namely its width and depth, on performance; it extracts feature streams from the image at different scales and mines and exploits the feature information of the input image;
2. when processing the feature information between layers, channel shuffling is adopted so that the inter-layer feature information is fully fused, effectively alleviating the information loss caused by insufficient fusion;
3. after the preliminary processing of the inter-layer feature information, an enhanced attention mechanism based on mean and standard deviation captures inter-channel information in a coordinate-like manner and assigns corresponding weights to the important feature information, so that the final super-resolution (SR) image has better texture and other feature details and the quality of the final reconstructed image is improved.
Drawings
In order to make the object, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for explanation:
FIG. 1 is a schematic view of the overall frame structure proposed by the present invention;
FIG. 2 is a schematic structural diagram of a multi-scale feature extraction module proposed by the present invention;
FIG. 3 is a schematic diagram of a hierarchical feature processing module according to the present invention;
FIG. 4 is a diagram of the channel shuffle used in the present invention.
Detailed Description of the Preferred Embodiments
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors, which specifically comprises the following steps as shown in figure 1:
step 1, acquiring a capsule endoscope image of a patient to be diagnosed and obtaining a pre-trained super-resolution reconstruction network (MRCN), which consists of two parts: a feature extraction part composed of a shallow feature extraction module, a multi-scale residual feature extraction module (MRCB), and a hierarchical feature processing module (HFCA); and an image reconstruction module composed of a sub-pixel upsampling layer and a convolution layer with a 3x3 kernel;
step 2, inputting the capsule endoscope image to be reconstructed into the shallow feature extraction module for shallow feature extraction to obtain a primary feature map, the shallow feature extraction module being a convolution layer with a 3x3 kernel;
step 3, sequentially passing the primary feature map through n MRCB modules to generate and obtain a secondary feature map;
Fig. 2 shows the structure of a single MRCB module. The MRCB module uses residual blocks as its basic structural unit; each residual block consists of two convolution layers and an activation function. RU3 and RU5 in the figure denote residual blocks with 3x3 and 5x5 convolution kernels, respectively, and the structure of a residual block is given by:
B_1 = φ_2(σ_1(φ_1(B))) + B
B_2 = φ_4(σ_2(φ_3(B))) + B
where B denotes the input to the residual block; φ_1(·) and φ_2(·) denote convolutions with a 3x3 kernel; φ_3(·) and φ_4(·) denote convolutions with a 5x5 kernel; σ_1(·) and σ_2(·) denote ReLU activations; and B_1 and B_2 are the outputs of the residual blocks with 3x3 and 5x5 kernels, respectively.
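As a concrete illustration, the two residual units above (RU3 and RU5) can be sketched in PyTorch as follows. This is a minimal sketch under the assumption of a standard PyTorch implementation with 'same' padding; the class and variable names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """B' = conv(ReLU(conv(B))) + B, matching the formulas above.

    kernel_size=3 gives RU3, kernel_size=5 gives RU5.
    """
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2  # 'same' padding preserves the spatial size
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, b):
        # B_1 = phi_2(sigma_1(phi_1(B))) + B
        return self.conv2(self.relu(self.conv1(b))) + b

ru3 = ResidualUnit(channels=64, kernel_size=3)  # RU3 branch
ru5 = ResidualUnit(channels=64, kernel_size=5)  # RU5 branch
```

Because the residual connection adds the input back to the convolution output, both units preserve the channel count and spatial size of their input.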
The operation flow of a single multi-scale feature extraction module (MRCB) can be decomposed correspondingly; the stage before the first conv1x1 in Fig. 2 is given by:
Y_1 = B_11(X)
Y_2 = B_21(X)
Y_3 = F_3(concat(B_12(F_1(concat([Y_1, Y_2]))), B_22(F_2(concat([Y_1, Y_2]))), B_13(X), B_23(X), X))
where X denotes the input to the MRCB module; B_11(·), B_12(·), and B_13(·) denote the RU3 operation in Fig. 2; B_21(·), B_22(·), and B_23(·) denote the RU5 operation in Fig. 2; F_1(·), F_2(·), and F_3(·) denote convolutions with a 1x1 kernel; Y_1 and Y_2 are the outputs of the first RU3 and first RU5 blocks, respectively, at the top left of Fig. 2; and Y_3 is the output of the first conv1x1 in Fig. 2.
The remaining portion of Fig. 2 can be represented by the following four equations:
Y_4 = B_14(Y_3)
Y_5 = B_24(Y_3)
Y_6 = F_6(concat(B_15(F_4(concat([Y_4, Y_5]))), B_25(F_5(concat([Y_4, Y_5]))), B_16(Y_3), B_26(Y_3), B_17(X), B_27(X), Y_3))
Y = F_7(Y_6) + X
where X denotes the input to the MRCB module; B_14(·), B_15(·), B_16(·), and B_17(·) again denote the RU3 operation in Fig. 2; B_24(·), B_25(·), B_26(·), and B_27(·) denote the RU5 operation in Fig. 2; F_4(·), F_5(·), and F_6(·) denote convolutions with a 1x1 kernel; F_7(·) denotes a convolution with a 3x3 kernel; Y_4 and Y_5 are the outputs of the first RU3 and first RU5 blocks, respectively, in the middle portion of Fig. 2; Y_6 is the output of the second conv1x1 block in Fig. 2; and Y is the final output of a single MRCB module.
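Putting the two stages together, one possible PyTorch sketch of a single MRCB module is shown below. It is an interpretation of the formulas above, assuming 64 channels throughout and 1x1 fusion convolutions that restore the channel count after each concatenation; all names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """RU3 / RU5: two convolutions with a ReLU in between, plus a skip."""
    def __init__(self, c=64, k=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, k, padding=k // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(c, c, k, padding=k // 2),
        )
    def forward(self, x):
        return self.body(x) + x

class MRCB(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        # seven 3x3 residual units (B_11..B_17) and seven 5x5 units (B_21..B_27)
        self.ru3 = nn.ModuleList([ResidualUnit(c, 3) for _ in range(7)])
        self.ru5 = nn.ModuleList([ResidualUnit(c, 5) for _ in range(7)])
        self.f1 = nn.Conv2d(2 * c, c, 1)         # F_1
        self.f2 = nn.Conv2d(2 * c, c, 1)         # F_2
        self.f3 = nn.Conv2d(5 * c, c, 1)         # F_3 (5 tensors concatenated)
        self.f4 = nn.Conv2d(2 * c, c, 1)         # F_4
        self.f5 = nn.Conv2d(2 * c, c, 1)         # F_5
        self.f6 = nn.Conv2d(7 * c, c, 1)         # F_6 (7 tensors concatenated)
        self.f7 = nn.Conv2d(c, c, 3, padding=1)  # F_7

    def forward(self, x):
        y1, y2 = self.ru3[0](x), self.ru5[0](x)   # B_11(X), B_21(X)
        c12 = torch.cat([y1, y2], dim=1)
        y3 = self.f3(torch.cat([
            self.ru3[1](self.f1(c12)), self.ru5[1](self.f2(c12)),  # B_12, B_22
            self.ru3[2](x), self.ru5[2](x), x,                     # B_13, B_23, X
        ], dim=1))
        y4, y5 = self.ru3[3](y3), self.ru5[3](y3)  # B_14(Y_3), B_24(Y_3)
        c45 = torch.cat([y4, y5], dim=1)
        y6 = self.f6(torch.cat([
            self.ru3[4](self.f4(c45)), self.ru5[4](self.f5(c45)),  # B_15, B_25
            self.ru3[5](y3), self.ru5[5](y3),                      # B_16, B_26
            self.ru3[6](x), self.ru5[6](x), y3,                    # B_17, B_27, Y_3
        ], dim=1))
        return self.f7(y6) + x                     # Y = F_7(Y_6) + X
```

The 1x1 fusion widths (5c for F_3 and 7c for F_6) follow directly from counting the tensors inside each concat in the equations above.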
Step 4, extracting the hierarchical feature maps generated in the MRCB modules and inputting the secondary feature map together with these hierarchical feature maps into the hierarchical feature processing module (HFCA), as shown in Fig. 3; the HFCA module fuses them and enhances the important features in the image, outputting a three-level feature map;
the HFCA module is provided with a shuffling module and a mean value and standard deviation based enhanced attention module (COAM); cascading the secondary characteristic diagram and the hierarchical characteristic diagrams generated in all MRCB modules, inputting the cascaded characteristic diagrams into a shuffling module, and performing channel compression through convolution of 1x1 to obtain a fusion characteristic diagram; and inputting the fused feature map into a COAM module to obtain the three-level feature map.
The shuffling module divides the channels of the input feature information into the required number of groups, yielding a per-group channel count (the output channel count is the per-group channel count multiplied by the number of groups). A Reshape operation then converts the channels into matrix form, a transpose operation is applied, and after transposition the result is flattened and output grouped according to the original number of groups; the structure of the shuffling module is shown in Fig. 4.
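The group/reshape/transpose/flatten sequence just described is the standard channel-shuffle operation, which can be sketched as follows (a hypothetical implementation; the patent does not publish code):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Group the channels, transpose the group and per-group axes, flatten back."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the group count"
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and per-group axes
    return x.view(n, c, h, w)                 # flatten back to (n, c, h, w)

# with 4 channels and 2 groups, channel order [0, 1, 2, 3] becomes [0, 2, 1, 3]
x = torch.arange(4.0).view(1, 4, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())  # [0.0, 2.0, 1.0, 3.0]
```

The operation is a pure permutation of channels, so it changes no values and adds no parameters; its only effect is to interleave channels from different groups so a following grouped or 1x1 convolution mixes information across groups.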
The COAM module pools the input information in a coordinate-like manner along the horizontal and vertical directions, taking both the mean and the standard deviation, and concatenates the pooled information. A shuffling operation fully fuses the information across channels; a convolution with a 1x1 kernel reduces the dimension of the fused information; the reduced information is passed through a nonlinear activation layer and then processed with the h-swish function to obtain new features. The new features are split along the horizontal and vertical directions into two independent features, each is raised in dimension by a convolution layer with a 1x1 kernel, and each is activated by a Sigmoid function to obtain weight features in the horizontal and vertical directions. Finally, the two weight features are multiplied with the input feature information to obtain the three-level feature map.
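One possible reading of the COAM description, modeled on coordinate attention but with both mean and standard-deviation pooling, is sketched below. The layer widths, reduction ratio, and the exact placement of the shuffle are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class COAM(nn.Module):
    def __init__(self, c=64, reduction=8, groups=2):
        super().__init__()
        mid = max(8, c // reduction)
        self.groups = groups
        self.reduce = nn.Conv2d(2 * c, mid, 1)  # 1x1 dimension reduction
        self.act = nn.Hardswish()               # h-swish activation
        self.conv_h = nn.Conv2d(mid, c, 1)      # 1x1 dimension increase, horizontal
        self.conv_w = nn.Conv2d(mid, c, 1)      # 1x1 dimension increase, vertical

    def forward(self, x):
        n, c, h, w = x.shape
        # coordinate-wise statistics: pool over one spatial axis at a time
        mean_h = x.mean(dim=3, keepdim=True)                   # (n, c, h, 1)
        std_h = x.std(dim=3, keepdim=True)
        mean_w = x.mean(dim=2, keepdim=True).transpose(2, 3)   # (n, c, w, 1)
        std_w = x.std(dim=2, keepdim=True).transpose(2, 3)
        # cascade mean/std along channels and the two directions along space
        y = torch.cat([torch.cat([mean_h, mean_w], dim=2),
                       torch.cat([std_h, std_w], dim=2)], dim=1)  # (n, 2c, h+w, 1)
        y = channel_shuffle(y, self.groups)        # fuse information across channels
        y = self.act(self.reduce(y))               # 1x1 reduction + h-swish
        y_h, y_w = torch.split(y, [h, w], dim=2)   # split the two directions
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w)).transpose(2, 3)  # (n, c, 1, w)
        return x * a_h * a_w                       # reweight the input features
```

The two sigmoid maps broadcast over the width and height axes respectively, so each spatial position (i, j) of the input is scaled by a_h[i] * a_w[j], which is the coordinate-attention reweighting the text describes.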
Step 5, sending the three-level feature map into a convolution layer with a 3x3 kernel for further feature extraction, generating a four-level feature map;
Step 6, inputting the four-level feature map into the image reconstruction module and performing amplification and reconstruction to obtain a super-resolution capsule endoscope image whose resolution is greater than that of the capsule endoscope image to be reconstructed.
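The image reconstruction module (sub-pixel upsampling followed by a convolution layer) can be sketched as follows, assuming a x2 scale factor and a 3-channel RGB output; the patent does not fix these values here, so they are illustrative.

```python
import torch
import torch.nn as nn

class Reconstruct(nn.Module):
    """Sub-pixel (PixelShuffle) upsampling followed by a 3x3 output convolution."""
    def __init__(self, channels=64, scale=2, out_channels=3):
        super().__init__()
        # expand channels by scale^2, then rearrange them into spatial positions
        self.expand = nn.Conv2d(channels, channels * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.out = nn.Conv2d(channels, out_channels, 3, padding=1)

    def forward(self, x):
        return self.out(self.shuffle(self.expand(x)))

rec = Reconstruct(channels=64, scale=2)
sr = rec(torch.randn(1, 64, 48, 48))  # output is (1, 3, 96, 96): side length doubled
```

PixelShuffle trades channels for resolution (C*r^2 x H x W becomes C x rH x rW), which is why the expanding convolution multiplies the channel count by scale squared before the shuffle.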
Further, to train the network better and enable it to learn richer features, the data set used for training is composed of two parts: pictures from Kvasir that have not been secondarily contaminated or damaged, and pictures from ETIS-LaribPolypDB, provided for the MICCAI 2015 challenge on polyp detection in endoscopy. Specifically, 1000 capsule endoscope images covering different symptoms are used for training, 20 for validation, and 26 for testing, each with a resolution between 1080P and 2K. The low-resolution pictures used throughout the experiments are generated with bicubic (BI) and blur-downsampling (BD) degradation models. Finally, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are evaluated for all SR results on the Y channel of the YCbCr space of the output image.
The overall parameters of the network are set as follows: the convolution kernels of the basic blocks combine two sizes, 3x3 and 5x5, with 64 channels and a stride of 1; the convolution layers used for dimension reduction all have 1x1 kernels; the network contains 8 multi-scale residual feature extraction modules; the number of groups for the shuffling operation before the attention mechanism in the hierarchical feature fusion layer is 8, and the number of groups for the shuffling operation inside the attention mechanism is 2.
During training, RGB images are used as input; all training images are randomly rotated by 90°, 180°, or 270° and then horizontally flipped for effective data augmentation. Super-resolution can be optimized with several loss functions, e.g. L_2, L_1, perceptual, and adversarial losses; following EDSR in the natural-image field, the invention optimizes with the L_1 loss. Given N LR-HR pairs as the training set (1000 image pairs are used here), the L_1 loss function can be written in the following form:
L_1(θ) = (1/N) Σ_{i=1}^{N} ||F(I_i^LR; θ) − I_i^HR||_1
where θ represents the set of network parameters and i indexes the i-th pair of training image blocks. The initial learning rate is 1x10^-4, and the optimizer is Adam (β_1 = 0.9, β_2 = 0.999, ε = 10^-8). Each training epoch runs 1000 iterations; each batch randomly selects 16 low-resolution patches of size 48x48 as input; the learning rate is halved every 200 epochs over 1000 epochs of training; and the model with the highest PSNR on the validation set during training is kept as the final result.
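The stated optimization settings (L_1 loss; Adam with β_1 = 0.9, β_2 = 0.999, ε = 10^-8; initial learning rate 1x10^-4 halved every 200 epochs; batches of 16 patches of size 48x48) translate into PyTorch roughly as follows. The model here is a one-layer stand-in for the full MRCN, and random tensors stand in for the real LR/HR patches.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the full MRCN network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# halve the learning rate every 200 epochs over the 1000 epochs of training
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
loss_fn = torch.nn.L1Loss()                   # the L_1 loss from the formula above

# one training step on a batch of 16 random 48x48 "LR" patches
lr_patch = torch.randn(16, 3, 48, 48)
hr_patch = torch.randn(16, 3, 48, 48)         # same size here, since the stand-in
                                              # model does not upsample
optimizer.zero_grad()
loss = loss_fn(model(lr_patch), hr_patch)
loss.backward()
optimizer.step()
```

In a real run, `scheduler.step()` would be called once per epoch, and the checkpoint with the highest validation PSNR would be retained, as the text describes.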

Claims (4)

1. A method for reconstructing a super-resolution capsule endoscope image based on multi-scale residuals, characterized in that an image acquired by a capsule endoscope is processed to obtain a super-resolution capsule endoscope image with clearer structure, texture, and other characteristics, the method comprising the following steps:
step 1, acquiring a capsule endoscope image of a patient to be diagnosed and obtaining a pre-trained super-resolution reconstruction network, which consists of two parts: a feature extraction part composed of a shallow feature extraction module, a multi-scale residual feature extraction module, and a hierarchical feature processing module; and an image reconstruction module composed of a sub-pixel upsampling layer and a convolution layer with a 3x3 kernel;
step 2, inputting the capsule endoscope image to be reconstructed into the shallow feature extraction module for shallow feature extraction to obtain a primary feature map, the shallow feature extraction module being a convolution layer with a 3x3 kernel;
step 3, sequentially passing the primary feature map through n multi-scale residual error feature extraction modules to generate and obtain a secondary feature map;
step 4, extracting the hierarchical feature map generated in each multi-scale residual feature extraction module, inputting the secondary feature map and these hierarchical feature maps into the hierarchical feature processing module, which fuses them and enhances the important features in the image, and outputting a three-level feature map;
step 5, sending the three-level feature graph into a convolution layer formed by taking 3x3 as a convolution kernel, and performing further feature extraction to generate a four-level feature graph;
and 6, inputting the four-level characteristic diagram into an image reconstruction module, and carrying out amplification reconstruction to obtain a super-resolution capsule endoscope image, wherein the resolution of the super-resolution capsule endoscope image is greater than that of the capsule endoscope image to be reconstructed.
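The sub-pixel upsampling used by the image reconstruction module in step 6 can be illustrated with a minimal NumPy sketch (the function name is hypothetical; the patent gives no code):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel upsampling: rearrange a (C*r*r, H, W) feature map into a
    (C, H*r, W*r) map, trading channel depth for spatial resolution."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)     # split the channel axis into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)   # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)
```

For example, a 4-channel 2x2 input with r = 2 becomes a 1-channel 4x4 output in which each output 2x2 cell draws one pixel from each input channel.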
2. The capsule endoscope image super-resolution reconstruction method according to claim 1, wherein the multi-scale residual feature extraction module adopts the residual block as its basic structural unit, each residual block consisting of two convolution layers and a ReLU activation function, and being representable by the following mathematical model:
B₁ = φ₂(σ₁(φ₁(B))) + B
B₂ = φ₄(σ₂(φ₃(B))) + B
where B denotes the information input to the residual block, φ₁(·) and φ₂(·) denote convolution operations with 3x3 kernels, φ₃(·) and φ₄(·) denote convolution operations with 5x5 kernels, σ₁(·) and σ₂(·) denote activation operations with ReLU as the activation function, and B₁ and B₂ denote the outputs of the residual blocks with 3x3 and 5x5 kernels, respectively;
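The residual-block model above can be sketched generically, with the convolutions passed in as callables (a simplification: the patent's blocks use learned 3x3 or 5x5 convolutions, which a stand-in function replaces here):

```python
import numpy as np

def residual_block(x, conv1, conv2, act=lambda v: np.maximum(v, 0.0)):
    """B_out = conv2(ReLU(conv1(B))) + B, the structure of both equations above."""
    return conv2(act(conv1(x))) + x
```

With a stand-in "convolution" that doubles its input, `residual_block(np.array([-1.0, 3.0]), double, double)` gives `[-1., 15.]`: the negative entry passes through only the skip connection, while the positive entry is amplified on both paths.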
the multi-scale residual error feature extraction module consists of 8 parallel branches, and can be specifically represented by the following mathematical model:
Y 1 =B 11 (X)
Y 2 =B 21 (X)
Y 3 =F 3 (concat(B 12 (F 1 (concat([Y 1 ,Y 2 ]))),B 22 (F 2 (concat([Y 1 ,Y 2 ]))),
B 13 (X),B 23 (X),X))
Y 4 =B 14 (Y 3 )
Y 5 =B 24 (Y 3 )
Y 6 =F 6 (concat(B 15 (F 4 (concat([Y 4 ,Y 5 ]))),B 25 (F 5 (concat([Y 4 ,Y 5 ]))),
B 16 (Y 3 ),B 26 (Y 3 ),B 17 (X),B 27 (X),Y 3 ))
Y=F 7 (Y 6 )+X
wherein X represents the information input to the multi-scale residual feature extraction module, B 11 (·)、B 12 (·)、B 13 (·)、B 14 (·)、B 15 (·)、B 16 (. and B) 17 (. 2) shows a convolution kernel residual block, B, of 3x3 21 (·)、B 22 (·)、B 23 (·)、B 24 (·)、B 25 (·)、B 26 (. and B) 27 Shown is a residual block convolved with 5x5, concat (a) shows a feature fusion operation, F 1 (·)、F 2 (·)、F 3 (·)、F 4 (·)、F 5 (. and F) 6 (. The) shows a convolution operation with 1x1 as the convolution kernel, F 7 Shown is the convolution operation with 3x3 as the convolution kernel, and Y is the final output of a single multi-scale feature extraction module.
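The dataflow of the eight-branch module can be traced with the blocks abstracted as callables (a structural sketch only: in the patent each B_ij and F_k is a distinct learned layer, whereas here single callables are reused):

```python
import numpy as np

def multiscale_block(X, B3, B5, F1x1, F3x3):
    """Trace of the equations above: B3/B5 stand in for the 3x3/5x5 residual
    blocks, F1x1 for the 1x1 channel-compression convolutions, F3x3 for F7.
    Concatenation is along the channel axis (axis 0)."""
    cat = lambda ts: np.concatenate(ts, axis=0)
    Y1, Y2 = B3(X), B5(X)
    Y3 = F1x1(cat([B3(F1x1(cat([Y1, Y2]))), B5(F1x1(cat([Y1, Y2]))),
                   B3(X), B5(X), X]))
    Y4, Y5 = B3(Y3), B5(Y3)
    Y6 = F1x1(cat([B3(F1x1(cat([Y4, Y5]))), B5(F1x1(cat([Y4, Y5]))),
                   B3(Y3), B5(Y3), B3(X), B5(X), Y3]))
    return F3x3(Y6) + X   # global residual connection: Y = F7(Y6) + X
```

Substituting identity blocks and a channel-squeezing stand-in for the 1x1 convolutions shows the module preserves spatial shape and reduces, at the output, to the global residual X + F₇(Y₆).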
3. The capsule endoscope image super-resolution reconstruction method according to claim 1, wherein the hierarchical feature processing module comprises a shuffle module and an enhanced attention module based on the mean and standard deviation; the secondary feature map and the hierarchical feature maps generated in the multi-scale residual feature extraction modules are concatenated and input into the shuffle module, and channel compression is then performed by a 1x1 convolution to obtain a fused feature map; the fused feature map is input into the mean-and-standard-deviation-based enhanced attention module to obtain the tertiary feature map.
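The claim does not define the shuffle module further; a common realization is the channel-shuffle operation, sketched here under that assumption:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Channel shuffle: interleave channels across `groups` so that features
    from different hierarchy levels are mixed before the 1x1 compression."""
    c, h, w = x.shape
    x = x.reshape(groups, c // groups, h, w)
    return x.transpose(1, 0, 2, 3).reshape(c, h, w)
```

With 4 channels and 2 groups, channel order [0, 1, 2, 3] becomes [0, 2, 1, 3], so the subsequent 1x1 convolution sees channels from both groups side by side.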
4. The capsule endoscope image super-resolution reconstruction method according to claim 3, wherein the enhanced attention module based on the mean and standard deviation performs mean pooling and standard-deviation pooling on the input information along the horizontal and vertical directions in a coordinate-attention-like manner; after pooling, the information is concatenated and then shuffled so that information between channels is fully fused; the fused information is reduced in dimension by a convolution with a 1x1 kernel, sent to a non-linear activation layer for activation, and then processed by the h-swish function to obtain new features; the new features are split along the horizontal and vertical directions into two independent features, whose dimensions are then raised by convolution layers with 1x1 kernels, and Sigmoid activation functions are applied to the dimension-raised feature information to obtain weight features in the horizontal and vertical directions; finally, the two weight features are multiplied with the input feature information to obtain the tertiary feature map.
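The pooling and gating stages of claim 4 can be sketched in NumPy; the shuffle, 1x1 convolutions and dimension changes are elided, so this is a simplified dataflow rather than the patented module itself:

```python
import numpy as np

def directional_pool(x):
    """Mean and standard-deviation pooling along the width (horizontal) and
    height (vertical) of a (C, H, W) feature map."""
    return x.mean(axis=2), x.std(axis=2), x.mean(axis=1), x.std(axis=1)

def h_swish(v):
    """h-swish activation: v * ReLU6(v + 3) / 6."""
    return v * np.clip(v + 3.0, 0.0, 6.0) / 6.0

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def msa_attention(x):
    """Simplified mean/std coordinate attention: pooled statistics are turned
    directly into per-direction weights that rescale the input."""
    mean_h, std_h, mean_w, std_w = directional_pool(x)
    w_h = sigmoid(h_swish(mean_h + std_h))[:, :, None]   # (C, H, 1) weights
    w_w = sigmoid(h_swish(mean_w + std_w))[:, None, :]   # (C, 1, W) weights
    return x * w_h * w_w
```

The output keeps the input's shape; because both weight maps lie in (0, 1), each position is attenuated according to its row and column statistics.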
CN202210539075.1A 2022-05-18 2022-05-18 Capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors Pending CN114842104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210539075.1A CN114842104A (en) 2022-05-18 2022-05-18 Capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors

Publications (1)

Publication Number Publication Date
CN114842104A (en) 2022-08-02

Family

ID=82569436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210539075.1A Pending CN114842104A (en) 2022-05-18 2022-05-18 Capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors

Country Status (1)

Country Link
CN (1) CN114842104A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523759A (en) * 2023-07-04 2023-08-01 江西财经大学 Image super-resolution reconstruction method and system based on frequency decomposition and restarting mechanism
CN116523759B (en) * 2023-07-04 2023-09-05 江西财经大学 Image super-resolution reconstruction method and system based on frequency decomposition and restarting mechanism

Similar Documents

Publication Publication Date Title
CN110570353B (en) Super-resolution reconstruction method for generating single image of countermeasure network by dense connection
CN111145170B (en) Medical image segmentation method based on deep learning
CN107492071A (en) Medical image processing method and equipment
CN110232653A (en) The quick light-duty intensive residual error network of super-resolution rebuilding
An et al. TR-MISR: Multiimage super-resolution based on feature fusion with transformers
WO2020211530A1 (en) Model training method and apparatus for detection on fundus image, method and apparatus for detection on fundus image, computer device, and medium
CN110570351A (en) Image super-resolution reconstruction method based on convolution sparse coding
CN114187296B (en) Capsule endoscope image focus segmentation method, server and system
CN116664397B (en) TransSR-Net structured image super-resolution reconstruction method
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN111260639A (en) Multi-view information-collaborative breast benign and malignant tumor classification method
Jiang et al. CT image super resolution based on improved SRGAN
CN111696042B (en) Image super-resolution reconstruction method based on sample learning
Zhou et al. High dynamic range imaging with context-aware transformer
CN115375711A (en) Image segmentation method of global context attention network based on multi-scale fusion
Qiu et al. Residual dense attention networks for COVID-19 computed tomography images super-resolution
CN114842104A (en) Capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors
CN113781489B (en) Polyp image semantic segmentation method and device
CN114187181A (en) Double-path lung CT image super-resolution method based on residual information refining
Yang et al. Lesion classification of wireless capsule endoscopy images
CN116797541A (en) Transformer-based lung CT image super-resolution reconstruction method
CN116468887A (en) Method for segmenting colon polyp with universality
CN112598581B (en) Training method and image generation method of RDN super-resolution network
CN114842029A (en) Convolutional neural network polyp segmentation method fusing channel and spatial attention
Tian et al. Retinal fundus image superresolution generated by optical coherence tomography based on a realistic mixed attention GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination