CN117542049A - Image recognition method and system based on deep learning - Google Patents

Image recognition method and system based on deep learning

Info

Publication number
CN117542049A
Authority
CN
China
Prior art keywords
feature map
layer
inputting
image
map
Prior art date
Legal status
Granted
Application number
CN202410026958.1A
Other languages
Chinese (zh)
Other versions
CN117542049B (en)
Inventor
梁爽
闫瑞平
张居奎
张雨婷
王悦宏
李金春子
张豪
金明兰
王博
田佳豪
唐雨微
Current Assignee
Jilin Jianzhu University
Original Assignee
Jilin Jianzhu University
Priority date
Filing date
Publication date
Application filed by Jilin Jianzhu University
Priority to CN202410026958.1A
Publication of CN117542049A
Application granted
Publication of CN117542049B
Legal status: Active
Anticipated expiration

Classifications

    • G06V20/695 Preprocessing, e.g. image segmentation (microscopic objects, e.g. biological cells or cellular parts)
    • G06V20/698 Matching; Classification (microscopic objects, e.g. biological cells or cellular parts)
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The invention provides an image recognition method and system based on deep learning, belonging to the technical field of image recognition. Firstly, a scanning electron microscope image of an environmental sample is acquired; the scanning electron microscope image is preprocessed to obtain an input image; the input image is input into a target recognition network for feature extraction to obtain a recognition result, and statistical analysis is carried out on the target objects to be detected according to the recognition result to obtain a statistical result. Through preprocessing and feature extraction of the scanning electron microscope image, different types of targets in the image are detected and located by the target recognition network, and statistical analysis is performed on the targets to obtain information such as their quantity, area and load capacity. The method can effectively identify and analyze biological and non-biological targets in environmental samples, and provides a basis for environmental monitoring and evaluation.

Description

Image recognition method and system based on deep learning
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to an image recognition method and system based on deep learning.
Background
A scanning electron microscope (SEM) is an instrument that scans the surface of a sample with an electron beam and obtains morphology, composition and structural information by detecting the secondary electrons reflected or emitted by the sample. SEM images are characterized by high resolution, large depth of field and high contrast, can display the microstructure and characteristics of a sample, and are widely used in materials science, biology, medicine, environmental science and other fields.
However, analysis and processing of SEM images is a complex and time-consuming task that requires preprocessing, feature extraction, target detection and classification of the images. This not only consumes manpower and material resources, but is also susceptible to human factors, leading to inaccurate and unstable results. Therefore, how to automatically analyze and process SEM images with a computer is a problem to be solved.
In recent years, deep learning, as a powerful machine learning method, has made remarkable progress and breakthroughs in the field of image recognition; it can automatically learn features and rules from large amounts of data and thereby accomplish complex tasks. However, image recognition methods based on deep learning still face problems and challenges when applied to SEM images. First, the quality of an SEM image is affected by factors such as the parameters of the scanning electron beam, the preparation of the sample and environmental interference, so the image may suffer from noise, blurring and low contrast, which degrades feature extraction and recognition. Second, SEM images contain many kinds of objects with diverse forms, which makes object detection and classification difficult. Therefore, in view of these characteristics and problems of SEM images, it is necessary to design a deep learning image recognition method suited to SEM images.
Disclosure of Invention
To address the above technical problems, an image recognition method and system based on deep learning are provided. Through preprocessing and feature extraction of a scanning electron microscope image, different types of targets in the image are detected and located by a target recognition network, and statistical analysis is performed on the targets to obtain information such as their quantity, area and load capacity, thereby providing an efficient and accurate automated solution for the microstructure analysis of environmental samples.
The invention provides an image recognition method based on deep learning, which comprises the following steps:
step S1: acquiring a scanning electron microscope image of an environmental sample;
step S2: preprocessing the scanning electron microscope image to obtain an input image;
step S3: inputting the input image into a target recognition network for feature extraction to obtain a recognition result, which specifically comprises the following steps:
the target recognition network comprises a multi-scale characteristic network and a detection head network;
the multi-scale feature network includes a first standard convolutional layer, a first activation function layer, a first maximum pooling layer, a second standard convolutional layer, a second activation function layer, a second maximum pooling layer, a third standard convolutional layer, a third activation function layer, a third maximum pooling layer, a fourth standard convolutional layer, a fourth activation function layer, a fourth maximum pooling layer, a fifth maximum pooling layer, a first attention module, a fifth standard convolutional layer, a fifth activation function layer, a second attention module, a sixth standard convolutional layer, a sixth activation function layer, a first tensor splice layer, a third attention module, a seventh standard convolutional layer, a seventh activation function layer, a second tensor splice layer, a fourth attention module, an eighth standard convolutional layer, an eighth activation function layer, a third tensor splice layer, a fifth attention module, a first upsampling layer, a second upsampling layer, and a third upsampling layer;
The detection head network comprises a first standard convolution active layer, a ninth standard convolution layer, a first class prediction remodelling layer, a first boundary regression remodelling layer, a second standard convolution active layer, a tenth standard convolution layer, a second class prediction remodelling layer, a second boundary regression remodelling layer, a third standard convolution active layer, an eleventh standard convolution layer, a third class prediction remodelling layer, a third boundary regression remodelling layer, a fourth standard convolution active layer, a twelfth standard convolution layer, a fourth class prediction remodelling layer, a fourth boundary regression remodelling layer, a fifth standard convolution active layer, a thirteenth standard convolution layer, a fifth class prediction remodelling layer, a fifth boundary regression remodelling layer, a fourth tensor splicing layer and a fifth tensor splicing layer;
step S4: and carrying out statistical analysis on the object to be detected according to the identification result to obtain a statistical result.
Optionally, preprocessing the scanning electron microscope image to obtain an input image specifically includes:
denoising the scanning electron microscope image to obtain a denoised image, wherein the pixel denoising formula is as follows:

Î(p) = (1/C(p)) · Σ_{q∈N(p)} w(p,q) · I(q),  with C(p) = Σ_{q∈N(p)} w(p,q)

where Î(p) is the pixel value of the denoised image at pixel position p; I(q) is the pixel value of the original image at pixel position q, and q ranges over all pixel positions within the neighborhood of p; C(p) is the normalization factor; w(p,q) is the similarity weight between positions p and q; and N(p) is the neighborhood around p;
performing self-adaptive histogram equalization on the denoising image to obtain an equalized image, wherein a pixel equalization formula is as follows:

CDF_clip(k) = Σ_{j=0}^{k} clip(p(j)),  I_eq(x,y) = round( CDF_clip(I(x,y)) · L )

where CDF_clip(k) is the cumulative distribution function value of the equalized histogram at pixel value k; p(j) is the probability density function value of pixel value j in the histogram of the denoised image; clip(·) is the restriction function applied to p(j); I_eq(x,y) is the pixel value of the image after equalization processing; and L is the maximum range of pixel values;
and carrying out edge detection on the equalized image to obtain an edge detection image, wherein the formula is as follows:

G_x = S_x * I_eq,  G_y = S_y * I_eq

where G_x and G_y are respectively the gradients of the image I_eq in the horizontal and vertical directions, and S_x and S_y are the horizontal and vertical gradient (Sobel) convolution kernels;
performing characteristic focusing transformation on the edge detection image to obtain an input image, wherein a pixel characteristic focusing transformation formula is as follows:

F(p) = E(p) · exp(-α · d(p))

where E(p) is the pixel value of the edge detection image at position p; d(p) is the distance from the pixel at position p to the nearest edge; α is the parameter controlling the intensity of the characteristic focusing; and F(p) is the pixel value of the input image at position p.
Optionally, the multi-scale feature network specifically includes:
Inputting an input image into the first standard convolution layer to carry out convolution operation to obtain a feature map P1; inputting the feature map P1 into the first activation function layer for activation operation to obtain a feature map P2; inputting the feature map P2 into the first maximum pooling layer to perform maximum pooling operation to obtain a feature map P3;
inputting the characteristic map P3 into the second standard convolution layer to carry out convolution operation to obtain a characteristic map P4; inputting the feature map P4 into the second activation function layer for activation operation to obtain a feature map P5; inputting the feature map P5 into the second maximum pooling layer to perform maximum pooling operation to obtain a feature map P6;
inputting the characteristic map P6 into the third standard convolution layer to carry out convolution operation to obtain a characteristic map P7; inputting the feature map P7 into the third activation function layer for activation operation to obtain a feature map P8; inputting the feature map P8 into the third maximum pooling layer to perform maximum pooling operation to obtain a feature map P9;
inputting the characteristic map P9 into the fourth standard convolution layer to carry out convolution operation to obtain a characteristic map P10; inputting the feature map P10 to the fourth activation function layer for activation operation, so as to obtain a feature map P11; inputting the feature map P11 to the fourth maximum pooling layer for maximum pooling operation to obtain a feature map P12;
Inputting the feature map P12 to the fifth maximum pooling layer for maximum pooling operation to obtain a feature map P13; inputting the feature map P13 to the first attention module for performing attention operation to obtain a feature map P28;
inputting the characteristic map P12 into the fifth standard convolution layer to carry out convolution operation to obtain a characteristic map P14; inputting the feature map P14 to the fifth activation function layer for activation operation to obtain a feature map P15; inputting the feature map P15 to the second attention module for performing attention operation, so as to obtain a feature map P29;
inputting the characteristic map P12 into the sixth standard convolution layer to carry out convolution operation to obtain a characteristic map P16; inputting the feature map P16 to the sixth activation function layer for activation operation to obtain a feature map P17; performing up-sampling operation on the feature map P15 to obtain a feature map P22; inputting the feature map P17 and the feature map P22 into the first tensor splicing layer for splicing operation to obtain a feature map P23; inputting the feature map P23 to the third attention module for performing attention operation to obtain a feature map P30;
inputting the characteristic map P12 into the seventh standard convolution layer for convolution operation to obtain a characteristic map P18; inputting the feature map P18 to the seventh activation function layer for activation operation to obtain a feature map P19; performing up-sampling operation on the feature map P23 to obtain a feature map P24; inputting the feature map P19 and the feature map P24 to the second tensor stitching layer for stitching operation, so as to obtain a feature map P25; inputting the feature map P25 to the fourth attention module for performing attention operation to obtain a feature map P31;
Inputting the feature map P12 to the eighth standard convolution layer to perform convolution operation, so as to obtain a feature map P20; inputting the feature map P20 to the eighth activation function layer for activation operation, so as to obtain a feature map P21; performing up-sampling operation on the feature map P25 to obtain a feature map P26; inputting the feature map P21 and the feature map P26 into the third tensor splicing layer for splicing operation to obtain a feature map P27; and inputting the characteristic map P27 into the fifth attention module for attention operation to obtain a characteristic map P32.
Optionally, the first attention module specifically includes:
acquiring an input feature map; the input feature map comprises a batch, a width, a height and a channel number;
generating a coordinate attention graph of all channels according to the input feature graph, wherein the coordinate attention graph specifically comprises:
traversing each channel index, and extracting a feature map of each channel as a single feature map;
respectively calculating the minimum value and the maximum value of the single feature map in each channel to obtain two scalar quantities; the scalar represents a range of a single feature map;
generating two arithmetic series according to the range, the width and the height of the single characteristic diagram; the arithmetic series respectively represents a horizontal coordinate value and a vertical coordinate value;
Splicing the two arithmetic difference arrays with the single feature map to obtain two coordinate vectors; the coordinate vectors respectively represent coordinate vectors in the horizontal direction and the vertical direction;
copying the two coordinate vectors along the batch dimension and the channel dimension to obtain two copied coordinate vectors; the duplicate coordinate vector is the same as the batch and channel number of the single feature map;
performing channel dimension expansion on the two copied coordinate vectors, and returning two final coordinate vectors; the final coordinate vector represents position information in the horizontal direction and the vertical direction respectively;
splicing the single feature map and the two final coordinate vectors in a channel dimension to obtain an attention map, and adding the attention map to a list;
repeating the operation until the coordinate attention diagrams of all channels are obtained;
performing convolution operation on the coordinate attention graph twice to obtain an integrated attention graph with the same channel number as the input feature graph;
and performing element-by-element multiplication operation on the input feature map and the integrated attention map to obtain a weighted feature map.
Optionally, according to the identification result, performing statistical analysis on the target object to be tested to obtain a statistical result, which specifically includes:
Counting different classes of objects to be detected, and counting the number of each class;
calculating the areas of different classes of objects to be measured, and counting the area of each class;
and calculating the load capacity of different types of objects to be measured, and counting the quality of each type.
The invention also provides an image recognition system based on deep learning, which comprises:
the image acquisition module is used for acquiring a scanning electron microscope image of the environmental sample;
the image processing module is used for preprocessing the scanning electron microscope image to obtain an input image;
the target recognition module is used for inputting the input image into a target recognition network to perform feature extraction so as to obtain a recognition result; the target recognition network comprises a multi-scale characteristic network and a detection head network;
and the statistical analysis module is used for carrying out statistical analysis on the object to be detected according to the identification result to obtain a statistical result.
Optionally, the image processing module specifically includes:
denoising the scanning electron microscope image to obtain a denoised image, wherein the pixel denoising formula is as follows:

Î(p) = (1/C(p)) · Σ_{q∈N(p)} w(p,q) · I(q),  with C(p) = Σ_{q∈N(p)} w(p,q)

where Î(p) is the pixel value of the denoised image at pixel position p; I(q) is the pixel value of the original image at pixel position q, and q ranges over all pixel positions within the neighborhood of p; C(p) is the normalization factor; w(p,q) is the similarity weight between positions p and q; and N(p) is the neighborhood around p;
performing self-adaptive histogram equalization on the denoising image to obtain an equalized image, wherein a pixel equalization formula is as follows:

CDF_clip(k) = Σ_{j=0}^{k} clip(p(j)),  I_eq(x,y) = round( CDF_clip(I(x,y)) · L )

where CDF_clip(k) is the cumulative distribution function value of the equalized histogram at pixel value k; p(j) is the probability density function value of pixel value j in the histogram of the denoised image; clip(·) is the restriction function applied to p(j); I_eq(x,y) is the pixel value of the image after equalization processing; and L is the maximum range of pixel values;
and carrying out edge detection on the equalized image to obtain an edge detection image, wherein the formula is as follows:

G_x = S_x * I_eq,  G_y = S_y * I_eq

where G_x and G_y are respectively the gradients of the image I_eq in the horizontal and vertical directions, and S_x and S_y are the horizontal and vertical gradient (Sobel) convolution kernels;
performing characteristic focusing transformation on the edge detection image to obtain an input image, wherein a pixel characteristic focusing transformation formula is as follows:

F(p) = E(p) · exp(-α · d(p))

where E(p) is the pixel value of the edge detection image at position p; d(p) is the distance from the pixel at position p to the nearest edge; α is the parameter controlling the intensity of the characteristic focusing; and F(p) is the pixel value of the input image at position p.
Optionally, the multi-scale feature network specifically includes:
the first standard convolution sub-module is used for inputting an input image into the first standard convolution layer to carry out convolution operation to obtain a feature map P1; inputting the feature map P1 into a first activation function layer for activation operation to obtain a feature map P2; inputting the feature map P2 into a first maximum pooling layer for maximum pooling operation to obtain a feature map P3;
the second standard convolution sub-module is used for inputting the characteristic map P3 into a second standard convolution layer to carry out convolution operation to obtain a characteristic map P4; inputting the feature map P4 into a second activation function layer for activation operation to obtain a feature map P5; inputting the feature map P5 into a second maximum pooling layer to perform maximum pooling operation to obtain a feature map P6;
the third standard convolution sub-module is used for inputting the characteristic map P6 into a third standard convolution layer to carry out convolution operation to obtain a characteristic map P7; inputting the feature map P7 into a third activation function layer for activation operation to obtain a feature map P8; inputting the feature map P8 into a third maximum pooling layer for maximum pooling operation to obtain a feature map P9;
the fourth standard convolution sub-module is used for inputting the characteristic map P9 into a fourth standard convolution layer to carry out convolution operation to obtain a characteristic map P10; inputting the feature map P10 into a fourth activation function layer for activation operation to obtain a feature map P11; inputting the feature map P11 to a fourth maximum pooling layer for maximum pooling operation to obtain a feature map P12;
The first attention sub-module is used for inputting the feature map P12 to a fifth maximum pooling layer for maximum pooling operation to obtain a feature map P13; inputting the feature map P13 to a first attention module for performing attention operation to obtain a feature map P28;
the second attention submodule is used for inputting the feature map P12 into a fifth standard convolution layer to carry out convolution operation to obtain a feature map P14; inputting the feature map P14 into a fifth activation function layer for activation operation to obtain a feature map P15; inputting the feature map P15 to a second attention module for attention operation to obtain a feature map P29;
the third attention submodule is used for inputting the feature map P12 into a sixth standard convolution layer to carry out convolution operation to obtain a feature map P16; inputting the feature map P16 to a sixth activation function layer for activation operation to obtain a feature map P17; performing up-sampling operation on the feature map P15 to obtain a feature map P22; inputting the feature map P17 and the feature map P22 into a first tensor splicing layer for splicing operation to obtain a feature map P23; inputting the feature map P23 to a third attention module for performing attention operation to obtain a feature map P30;
The fourth attention sub-module is used for inputting the feature map P12 into a seventh standard convolution layer to carry out convolution operation to obtain a feature map P18; inputting the feature map P18 to a seventh activation function layer for activation operation to obtain a feature map P19; performing up-sampling operation on the feature map P23 to obtain a feature map P24; inputting the feature map P19 and the feature map P24 into a second tensor splicing layer for splicing operation to obtain a feature map P25; inputting the feature map P25 to a fourth attention module for performing attention operation to obtain a feature map P31;
a fifth attention sub-module, configured to input the feature map P12 to an eighth standard convolution layer for convolution operation, to obtain a feature map P20; inputting the feature map P20 to an eighth activation function layer for activation operation to obtain a feature map P21; performing up-sampling operation on the feature map P25 to obtain a feature map P26; inputting the feature map P21 and the feature map P26 into a third tensor splicing layer for splicing operation to obtain a feature map P27; and inputting the characteristic map P27 into a fifth attention module for attention operation to obtain a characteristic map P32.
Optionally, the first attention sub-module specifically includes:
The feature map acquisition unit is used for acquiring an input feature map; the input feature map comprises a batch, a width, a height and a channel number;
the coordinate attention graph generating unit is configured to generate coordinate attention graphs of all channels according to the input feature graph, and specifically includes:
traversing each channel index, and extracting a feature map of each channel as a single feature map;
respectively calculating the minimum value and the maximum value of the single feature map in each channel to obtain two scalar quantities; the scalar represents a range of a single feature map;
generating two arithmetic series according to the range, the width and the height of the single characteristic diagram; the arithmetic series respectively represents a horizontal coordinate value and a vertical coordinate value;
adjusting the two arithmetic series so that they can be spliced with the single feature map, obtaining two coordinate vectors; the coordinate vectors respectively represent coordinate vectors in the horizontal direction and the vertical direction;
copying the two coordinate vectors along the batch dimension and the channel dimension to obtain two copied coordinate vectors; the duplicate coordinate vector is the same as the batch and channel number of the single feature map;
performing channel dimension expansion on the two copied coordinate vectors, and returning two final coordinate vectors; the final coordinate vector represents position information in the horizontal direction and the vertical direction respectively;
Splicing the single feature map and the two final coordinate vectors in a channel dimension to obtain an attention map, and adding the attention map to a list;
repeating the operation until the coordinate attention diagrams of all channels are obtained;
the attention map integrating unit is used for carrying out convolution operation on the coordinate attention map twice to obtain an integrated attention map with the same channel number as the input characteristic map;
and the weighted feature map output unit is used for carrying out element-by-element multiplication operation on the input feature map and the integrated attention map to obtain a weighted feature map.
Optionally, the statistical analysis module specifically includes:
the counting sub-module is used for counting different types of objects to be detected and counting the number of each type;
the area calculation sub-module is used for calculating the areas of different types of objects to be measured and counting the area of each type;
and the load calculation operator module is used for calculating the load of different categories of objects to be measured and counting the quality of each category.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, through automatically processing the SEM image, the efficiency and accuracy of sample analysis are greatly improved; can process various types of targets, including biological and non-biological categories, and has a wide range of applications; not only can the target objects be identified and classified, but also the quantity, coverage area, load capacity and the like can be counted and analyzed; the method is applicable to SEM images with different types and different resolutions, and has good universality and adaptability.
Drawings
FIG. 1 is a flow chart of an image recognition method based on deep learning according to the present invention;
FIG. 2 is a diagram of a target recognition network structure of the image recognition method based on deep learning of the present invention;
FIG. 3 is a block diagram of an image recognition system based on deep learning according to the present invention.
Detailed Description
The invention is further described below in connection with specific embodiments and the accompanying drawings, but the invention is not limited to these embodiments.
Example 1
As shown in fig. 1, the invention discloses an image recognition method based on deep learning, which comprises the following steps:
step S1: scanning electron microscope images of the environmental samples are acquired.
Step S2: and preprocessing the scanning electron microscope image to obtain an input image.
Step S3: and inputting the input image into a target recognition network for feature extraction to obtain a recognition result.
Step S4: and carrying out statistical analysis on the object to be detected according to the identification result to obtain a statistical result.
The steps are discussed in detail below:
step S1: scanning electron microscope images of the environmental samples are acquired.
The step S1 specifically comprises the following steps:
the sample is pre-treated, placed on a sample stage for observation or different sample holders are selected according to the size, shape and number of the samples, such as standard multi-sample SEM holders, electron probe/multi-probe stages, etc., the sample is fixed on the holder, and then the holder is mounted on the sample stage. Sending the sample table into the sample chamber, closing a sample chamber door, starting a vacuum system of the sample chamber, and reducing the air pressure in the sample chamber to a required level; according to different characteristics of the sample, different air pressures and gas types, such as water vapor, nitrogen, oxygen and the like are selected. Starting an electron gun, and adjusting an accelerating voltage, an emitting current and a focusing current to generate a stable electron beam; different acceleration voltages are selected according to different characteristics of the sample, and are generally between 1kV and 30 kV. The appropriate detector such as ESEM SED (GSED), DBS-GAD, ESEM-GAD, etc. is selected, and the inclination angle and working distance of the sample stage are adjusted according to the type and position of the detector, so as to obtain the best signal receiving effect. The sample is positioned and aligned using a navigation camera or scanning electron microscope image, the region of interest is selected, and the focusing, biasing and correction of the electron beam is adjusted to obtain a clear image. Different scan modes and parameters, such as resolution, magnification, scan speed, line pair number, etc., are used to meet different imaging requirements. And the required scanning electron microscope image is stored or output, so that the subsequent analysis of the image is convenient.
Step S2: and preprocessing the scanning electron microscope image to obtain an input image.
The step S2 specifically comprises the following steps:
denoising the scanning electron microscope image to obtain a denoised image, wherein the pixel denoising formula is as follows:

Î(p) = (1/C(p)) · Σ_{q∈N(p)} w(p,q) · I(q),  with C(p) = Σ_{q∈N(p)} w(p,q)

where Î(p) is the pixel value of the denoised image at pixel position p; I(q) is the pixel value of the original image at pixel position q, and q ranges over all pixel positions within the neighborhood of p; C(p) is the normalization factor for pixel position p, calculated so that the weights w(p,q) over all positions q sum to 1 and the denoised pixel value does not exceed the normal range; w(p,q) is the similarity weight between positions p and q, reflecting the similarity between I(p) and I(q): if I(p) and I(q) are very similar, w(p,q) is larger, and if they are dissimilar, w(p,q) is smaller; the similarity is typically computed from the pixel values, using a Gaussian function to measure the difference between two pixel values; N(p) is the neighborhood around p.
Initializing the denoised image: a new image Î of the same size as the original image I is created to store the denoising result.
Traversing each pixel: for each pixel location p in the image, the following steps are performed:
determining the neighborhood N(p) of p, which may be a window of fixed size, e.g. 5×5 or 7×7 pixels; calculating w(p,q) for every pixel position q in N(p); calculating the normalization factor C(p) so that the sum of the weights is 1; and calculating the denoised pixel value Î(p) as the normalized weighted sum of the neighborhood pixels.
Obtaining the denoised image: the value of each pixel p is updated to its denoised value, finally yielding the complete denoised image Î.
In the present embodiment, p and q are pixel positions in the image, i.e. two-dimensional coordinates that can be written as (x1, y1) and (x2, y2), representing respectively the position of a pixel point in the image and the positions of other pixel points in its neighborhood.
Performing self-adaptive histogram equalization on the denoising image to obtain an equalized image, wherein a pixel equalization formula is as follows:
in the method, in the process of the invention,for the equalized histogram, the pixel value +.>Cumulative distribution function value of->Representing a particular pixel value in the image, typically ranging from 0 to 255 (for an 8-bit gray scale image); />For the histogram of the input image (the image after the denoising step), the pixel value +.>Probability density function value, < >>Also the pixel values in the image, and +.>Similarly, again ranging from 0 to 255; />For the original probability density function->An applied limiting (clipping) function, the purpose of which is to prevent some regions in the histogram from being over-enhanced; / >Pixel values of the image after equalization processing; />The 8-bit image is typically 255, which is the maximum possible value of the pixel values; the round function is used to round the result of the calculation to the nearest integer, since the pixel value must be an integer.
First, based on the denoised imageCalculating a restricted cumulative distribution function +.>The method comprises the steps of carrying out a first treatment on the surface of the Will beApplied to +.>To obtain an equalized image +.>The method comprises the steps of carrying out a first treatment on the surface of the Will->The value of each pixel of (a) as +.>Then multiplying the result by the maximum possible value of the pixel value (typically 255), thereby obtaining a new pixel value; this process will result in an enhanced local contrast of the original image, especially in dark areas of the image.
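A minimal sketch of the clipped histogram equalization described above, applied globally for brevity (the adaptive variant works on local tiles); the clip limit and the redistribution of the clipped excess are assumptions.

```python
import numpy as np

def clipped_histogram_equalization(image, clip_limit=0.01, levels=256):
    """Histogram equalization with a clipping function applied to the PDF.
    Applied globally here for brevity; expects an 8-bit gray-scale image."""
    hist, _ = np.histogram(image.ravel(), bins=levels, range=(0, levels))
    pdf = hist / hist.sum()                               # p(j)
    excess = np.maximum(pdf - clip_limit, 0.0).sum()      # mass removed by clip(.)
    pdf = np.minimum(pdf, clip_limit) + excess / levels   # redistribute the excess
    cdf = np.cumsum(pdf)                                  # CDF_clip(k)
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)   # round(CDF_clip(k) * L)
    return lut[image]                                     # I_eq(x, y)
```

In practice the tile-based adaptive variant is usually obtained from a library routine such as OpenCV's CLAHE implementation.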
Edge detection is carried out on the equalized image to obtain an edge detection image, with the formulas:

G_x = S_x * I_eq,  G_y = S_y * I_eq

where G_x and G_y are the gradients of the image I_eq in the horizontal and vertical directions, computed by applying the Sobel convolution kernels S_x and S_y to the image; these kernels highlight intensity variations in the horizontal and vertical directions of the image.
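A short sketch of the Sobel-based edge detection step using OpenCV; combining G_x and G_y into a gradient magnitude map and normalizing it to 0-255 are assumptions, since the description only defines the two directional gradients.

```python
import cv2
import numpy as np

def sobel_edges(equalized):
    """Sobel gradients of the equalized image and their magnitude as the edge map."""
    gx = cv2.Sobel(equalized, cv2.CV_64F, 1, 0, ksize=3)   # G_x = S_x * I_eq
    gy = cv2.Sobel(equalized, cv2.CV_64F, 0, 1, ksize=3)   # G_y = S_y * I_eq
    magnitude = np.sqrt(gx ** 2 + gy ** 2)                 # combined edge strength
    return cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```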
Performing characteristic focusing transformation on the edge detection image to obtain the input image, wherein the pixel characteristic focusing transformation formula is as follows:

F(p) = E(p) · exp(-α · d(p))

where E(p) is the pixel value of the edge detection image at position p; d(p) is the distance from the pixel at position p to the nearest edge, reflecting whether the pixel is close to an edge or feature of the image; α is the parameter controlling the characteristic focusing intensity, and larger α values make areas far from the edges darker while areas near the edges remain bright; F(p) is the pixel value of the input image at position p.
The purpose of the characteristic focusing transformation is to highlight key features in the image, particularly the areas near edges. The transformation emphasizes structures and features in the image by reducing the brightness of pixels far from edges (i.e. regions where features are not apparent) while preserving or enhancing pixels near edges. It plays an important role in the subsequent object recognition task because it reduces the influence of background noise while enhancing the visibility of important features.
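A sketch of the characteristic focusing transformation under the assumption that the attenuation follows exp(-α·d(p)); the threshold used to binarize the edge map before the distance transform is also an assumption.

```python
import cv2
import numpy as np

def feature_focus(edge_image, alpha=0.05, edge_threshold=50):
    """Feature focusing: keep pixels near edges bright, darken pixels far from them.
    The decay exp(-alpha * d) and the binarization threshold are assumptions."""
    edges = (edge_image > edge_threshold).astype(np.uint8)
    # distanceTransform measures the distance to the nearest zero pixel,
    # so the edge mask is inverted first (edge pixels become zeros).
    dist = cv2.distanceTransform(1 - edges, cv2.DIST_L2, 5)   # d(p)
    focused = edge_image.astype(np.float64) * np.exp(-alpha * dist)
    return np.clip(focused, 0, 255).astype(np.uint8)
```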
Step S3: and inputting the input image into a target recognition network for feature extraction to obtain a recognition result.
In fig. 2, Conv2D represents a standard convolution layer with convolution kernel sizes of 3×3 and 1×1; strides represents the step size and takes the value 1; Activation(f) represents the activation function layer, where f takes the values ReLU and Sigmoid; MaxPooling2D represents the maximum pooling layer; Reshape represents the tensor remodelling layer; Concat(a, b) represents tensor splicing of a and b; Pi represents the i-th feature map obtained in the target recognition network, where i is an integer in the range [1, 54].
As shown in fig. 2, step S3 specifically includes:
Inputting an input image (256,256,1) into a first standard convolution layer for convolution operation to obtain a feature map P1, wherein the number of convolution kernels of the first standard convolution layer is 32, the size of the convolution kernels is 3 multiplied by 3, and the step length is 1; the feature map P1 is 256×256 of 32 channels; inputting the feature map P1 into a first activation function layer for activation operation to obtain a feature map P2; the feature map P2 is 256×256 of 32 channels; inputting the feature map P2 into a first maximum pooling layer for maximum pooling operation to obtain a feature map P3; the first maximum pooling layer pooling window size is 2×2; the feature map P3 is 128×128 of 32 channels.
Inputting the feature map P3 into a second standard convolution layer to carry out convolution operation to obtain a feature map P4, wherein the number of convolution kernels of the second standard convolution layer is 64, the size of the convolution kernels is 3 multiplied by 3, and the step length is 1; the feature map P4 is 128×128 of 64 channels; inputting the feature map P4 into a second activation function layer for activation operation to obtain a feature map P5; the feature map P5 is 128×128 of 64 channels; inputting the feature map P5 into a second maximum pooling layer to perform maximum pooling operation to obtain a feature map P6; the size of the pooling window of the second maximum pooling layer is 2 multiplied by 2; the feature map P6 is 64×64 of 64 channels.
Inputting the feature map P6 into a third standard convolution layer for convolution operation to obtain a feature map P7, wherein the number of convolution kernels of the third standard convolution layer is 128, the size of the convolution kernels is 3 multiplied by 3, and the step length is 1; the feature map P7 is 64×64 of 128 channels; inputting the feature map P7 into a third activation function layer for activation operation to obtain a feature map P8; the feature map P8 is 64×64 of 128 channels; inputting the feature map P8 into a third maximum pooling layer for maximum pooling operation to obtain a feature map P9; the size of the pooling window of the third maximum pooling layer is 2 multiplied by 2; the feature map P9 is 32×32 of 128 channels.
Inputting the feature map P9 into a fourth standard convolution layer for convolution operation to obtain a feature map P10, wherein the number of convolution kernels of the fourth standard convolution layer is 256, the size of the convolution kernels is 3 multiplied by 3, and the step length is 1; the feature map P10 is 32×32 of 256 channels; inputting the feature map P10 into a fourth activation function layer for activation operation to obtain a feature map P11; the feature map P11 is 32×32 of 256 channels; inputting the feature map P11 into a fourth maximum pooling layer for maximum pooling operation to obtain a feature map P12; the fourth maximum pooling layer pooling window size is 2×2; the feature map P12 is 16×16 of 256 channels.
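As an illustrative sketch of the backbone stem (feature maps P1 to P12) with the Keras functional API, using the filter counts, kernel sizes and pooling windows stated above; 'same' padding is an assumption needed so that the spatial sizes match the description (256→128→64→32→16).

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_backbone_stem(input_shape=(256, 256, 1)):
    """Backbone stem producing P12: four stages of (Conv2D 3x3, stride 1 -> ReLU ->
    MaxPooling2D 2x2) with 32, 64, 128 and 256 filters, matching P1..P12 above."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)  # P1 / P4 / P7 / P10
        x = layers.Activation("relu")(x)                             # P2 / P5 / P8 / P11
        x = layers.MaxPooling2D(pool_size=2)(x)                      # P3 / P6 / P9 / P12
    return tf.keras.Model(inputs, x, name="multiscale_backbone_stem")

# build_backbone_stem().output_shape -> (None, 16, 16, 256), i.e. the shape of P12
```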
Inputting the feature map P12 into a fifth maximum pooling layer for maximum pooling operation to obtain a feature map P13; the fifth maximum pooling layer pooling window size is 2×2; the feature map P13 is 8×8 for 256 channels; the feature map P13 is input to a first attention module to perform attention operation, so as to obtain a feature map P28, wherein the feature map P28 is 8×8 of 256 channels, and the first attention module specifically comprises:
Acquiring batch size, height, width and channel number of an input feature map P13; generating a coordinate attention graph of all channels according to the input feature graph, wherein the method specifically comprises the following steps:
Traversing each channel index with a loop, and extracting the feature map of each channel as a single feature map; calculating the minimum value and the maximum value of the single feature map with the reduction functions tf.reduce_min and tf.reduce_max, obtaining two scalars that represent the range of the feature map; generating two arithmetic series according to this range and the width and height of the single feature map with tf.linspace (which generates evenly spaced values along a designated axis), representing the coordinate values in the horizontal and vertical directions respectively; adjusting the two coordinate arrays to suitable shapes with tf.reshape so that they can be spliced with the single feature map, where [1, width] and [1, height, 1] are used as the target shapes, representing the coordinate vectors in the horizontal and vertical directions; copying the two coordinate vectors along the batch dimension and the channel dimension with tf.tile to obtain two replicated coordinate vectors that match the batch size and channel number of the single feature map, where [tf.shape(feature map)[0], height, 1] and [tf.shape(feature map)[0], 1, width] are used as the replication multiples, indicating that each batch and each channel has the same coordinate vectors; adding one dimension at the end of the two replicated coordinate vectors with tf.expand_dims (using the index -1 for the last dimension) so that they can be spliced with the feature map in the channel dimension, and returning the two final coordinate vectors, which represent the position information in the horizontal and vertical directions respectively. Splicing the single feature map and the two coordinate vectors in the channel dimension with tf.concat to obtain an attention map containing position information; adding the attention map to a list; and repeating the above operations until the coordinate attention maps of all channels are obtained.
Performing convolution operation on the coordinate attention graph twice to obtain an integrated attention graph with the same channel number as the input feature graph, wherein the method specifically comprises the following steps:
The first convolution compresses the channel dimension: the number of output channels of the first convolution layer is the number of channels of the input feature map divided by the compression ratio, i.e. every compression-ratio channels are compressed into one output channel; the convolution kernel size is 1×1 and the activation function is ReLU. A second convolution operation is then performed on the compressed attention map to restore the channel dimension and obtain an attention map with the same number of channels as the input feature map; the number of output channels of the second convolution layer equals the channel number of the input feature map, the convolution kernel size is 1×1 and the activation function is ReLU.
The restored attention map and the input feature map are multiplied element by element with a multiplication (Multiply) operation, weighting the input feature map to obtain the weighted feature map P28.
In this embodiment, the input feature map includes the batch, width, height and channel number; the scalars represent the range of a single feature map; the arithmetic series represent the horizontal and vertical coordinate values; the coordinate vectors represent the coordinate vectors in the horizontal and vertical directions respectively; the replicated coordinate vectors match the batch and channel number of the single feature map; the final coordinate vectors represent the position information in the horizontal and vertical directions respectively; the attention map contains position information; and a single feature map means the feature map of one channel.
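An illustrative and simplified sketch of the coordinate attention module described above, written with the TensorFlow operations named in the text (tf.reduce_min/max, tf.linspace, tf.reshape, tf.tile, tf.concat); the reduction ratio, the exact tensor shapes and the packaging as a Keras layer are assumptions, not the patented module itself.

```python
import tensorflow as tf
from tensorflow.keras import layers


class CoordinateAttention(layers.Layer):
    """Simplified sketch of the per-channel coordinate attention described above."""

    def __init__(self, channels, reduction=8, **kwargs):
        super().__init__(**kwargs)
        self.channels = channels
        # two 1x1 convolutions: compress the channel dimension, then restore it
        self.squeeze = layers.Conv2D(max(channels // reduction, 1), 1, activation="relu")
        self.restore = layers.Conv2D(channels, 1, activation="relu")

    def call(self, x):
        batch = tf.shape(x)[0]
        height = tf.shape(x)[1]
        width = tf.shape(x)[2]

        attention_maps = []
        for c in range(self.channels):             # traverse every channel index
            single = x[..., c:c + 1]               # single-channel feature map
            lo = tf.reduce_min(single)             # range of the single feature map
            hi = tf.reduce_max(single)
            xs = tf.linspace(lo, hi, width)        # horizontal coordinate values
            ys = tf.linspace(lo, hi, height)       # vertical coordinate values
            xs = tf.tile(tf.reshape(xs, [1, 1, width, 1]), [batch, height, 1, 1])
            ys = tf.tile(tf.reshape(ys, [1, height, 1, 1]), [batch, 1, width, 1])
            # splice the channel with its coordinate maps in the channel dimension
            attention_maps.append(tf.concat([single, xs, ys], axis=-1))

        attn = tf.concat(attention_maps, axis=-1)  # coordinate attention maps, all channels
        attn = self.restore(self.squeeze(attn))    # back to `channels` attention maps
        return x * attn                            # element-wise weighting of the input
```

For example, applying CoordinateAttention(256) to the 8×8×256 feature map P13 would produce a weighted map of the shape of P28; the Python loop mirrors the per-channel traversal in the description but would normally be vectorized for 256 channels.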
Inputting the feature map P12 into a fifth standard convolution layer for convolution operation to obtain a feature map P14, wherein the number of convolution kernels of the fifth standard convolution layer is 256, the size of the convolution kernels is 1 multiplied by 1, and the step length is 1; the feature map P14 is 16×16 of 256 channels; inputting the feature map P14 into a fifth activation function layer for activation operation to obtain a feature map P15; feature map P15 is 16×16 for 256 channels; the feature map P15 is input to the second attention module to perform an attention operation, and a feature map P29 is obtained, where the feature map P29 is 16×16 of 256 channels.
Inputting the feature map P12 into a sixth standard convolution layer for convolution operation to obtain a feature map P16, wherein the number of convolution kernels of the sixth standard convolution layer is 256, the size of the convolution kernels is 1 multiplied by 1, and the step length is 1; the feature map P16 is 16×16 of 256 channels; inputting the feature map P16 into a sixth activation function layer for activation operation to obtain a feature map P17; the feature map P17 is 16×16 of 256 channels; upsampling the feature map P15 to the same size as the feature map P17 by bilinear interpolation using the image resizing function (tf.image.resize), with tf.shape(P17)[1] and tf.shape(P17)[2] giving the height and width of P17 as the upsampling target size, to obtain a feature map P22; splicing P17 and the upsampled P22 along the last (channel) dimension using the tensor splicing function (Concat) to obtain a feature map P23; the feature map P23 is 16×16 of 512 channels; inputting the feature map P23 into the third attention module for attention operation to obtain a feature map P30, where the feature map P30 is 16×16 of 512 channels.
Inputting the feature map P12 into a seventh standard convolution layer for convolution operation to obtain a feature map P18, wherein the number of convolution kernels of the seventh standard convolution layer is 256, the size of the convolution kernels is 1 multiplied by 1, and the step length is 1; the feature map P18 is 16×16 of 256 channels; inputting the feature map P18 into a seventh activation function layer for activation operation to obtain a feature map P19; the feature map P19 is 16×16 of 256 channels; upsampling the feature map P23 to the same size as the feature map P19 by bilinear interpolation using the image resizing function (tf.image.resize) to obtain a feature map P24; splicing P19 and the upsampled P24 along the last (channel) dimension to obtain a feature map P25; the feature map P25 is 16×16 of 768 channels; inputting the feature map P25 into the fourth attention module for attention operation to obtain a feature map P31, where the feature map P31 is 16×16 of 768 channels.
Inputting the feature map P12 into an eighth standard convolution layer for convolution operation to obtain a feature map P20, wherein the number of convolution kernels of the eighth standard convolution layer is 256, the size of the convolution kernels is 1 multiplied by 1, and the step length is 1; the feature map P20 is 16×16 of 256 channels; inputting the feature map P20 into an eighth activation function layer for activation operation to obtain a feature map P21; the feature map P21 is 16×16 of 256 channels; upsampling the feature map P25 to the same size as the feature map P21 by bilinear interpolation using the image resizing function (tf.image.resize) to obtain a feature map P26; splicing P21 and the upsampled P26 along the last (channel) dimension to obtain a feature map P27; the feature map P27 is 16×16 of 1024 channels; inputting the feature map P27 into the fifth attention module for attention operation to obtain a feature map P32, where the feature map P32 is 16×16 of 1024 channels.
In this embodiment, the activation functions are all ReLU; the second attention module, the third attention module, the fourth attention module and the fifth attention module follow the same principle as the first attention module and are not described in detail again.
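A sketch of one of the 1×1-convolution branches (the P12→P16/P17, P15→P22, P23→P30 path); bilinear resizing with tf.image.resize follows the description, while the function packaging and the attention callable (for example the CoordinateAttention sketch above) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def branch_with_upsample(p12, p15, attention, channels=256):
    """One multi-scale branch: 1x1 convolution + ReLU on P12 (giving P16, P17),
    bilinear upsampling of P15 to P17's spatial size (giving P22), channel
    splicing (giving P23) and an attention module (giving P30)."""
    p16 = layers.Conv2D(channels, 1, strides=1)(p12)
    p17 = layers.Activation("relu")(p16)
    target_size = tf.shape(p17)[1:3]                            # height and width of P17
    p22 = tf.image.resize(p15, target_size, method="bilinear")  # upsampled P15
    p23 = tf.concat([p17, p22], axis=-1)                        # tensor splicing layer
    return attention(p23)                                       # attention module -> P30
```

The other branches (P18/P19/P24/P25/P31 and P20/P21/P26/P27/P32) follow the same pattern, with P23 and P25 as the inputs to the upsampling step.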
The detection head network defines the number (10) of the categories of the objects to be detected and the number (9) of the anchor points at each position, and is specifically set according to actual conditions.
Taking the first class prediction remodelling layer as an example, the feature map P28 is input into the first standard convolution activation layer for convolution and activation operations to obtain a feature map P33; the number of convolution kernels of the first standard convolution activation layer is 90 (9 multiplied by 10), i.e. the number of output channels of the convolution layer corresponds to the number of classes for each anchor point; the convolution kernel size is 1 multiplied by 1, the step length is 1, and the activation function is Sigmoid. The batch size of the feature map, i.e. the size of the first dimension, is obtained with tf.shape(P33)[0] and indicates how many images are processed at a time; the size of the second dimension is then computed automatically with -1, i.e. the height, width and channel number of the feature map are multiplied to give the total number of anchor points per image; finally, the number of classes of the target objects to be detected gives the size of the third dimension, i.e. the number of classes per anchor point, representing the confidence that each anchor point belongs to each class. The remodelled feature map is P43 (first class prediction remodelling layer), with shape (batch size, total number of anchor points per image, number of classes per anchor point); P44 (second class prediction remodelling layer), P45 (third class prediction remodelling layer), P46 (fourth class prediction remodelling layer) and P47 (fifth class prediction remodelling layer) are implemented in the same way as P43. The output feature maps are remodelled to conform to the class prediction format, i.e. each anchor point corresponds to a class confidence vector indicating the probability that the anchor point belongs to each class.
Taking the first boundary regression remodelling layer as an example, the feature map P28 is input into the ninth standard convolution layer for convolution operation to obtain a feature map P38; the number of convolution kernels of the ninth standard convolution layer is 36 (9 multiplied by 4), where 4 is the number of coordinates per anchor point, i.e. the number of output channels of the convolution layer corresponds to the number of coordinates of each anchor point; the convolution kernel size is 1 multiplied by 1 and the step length is 1. The batch size of the feature map, i.e. the size of the first dimension, is obtained with tf.shape(P38)[0] and indicates how many images are processed at a time; the size of the second dimension is then computed automatically with -1, i.e. the height, width and channel number of the feature map are multiplied to give the total number of anchor points per image; finally, 4 gives the size of the third dimension, i.e. the number of coordinates per anchor point, representing the position of the bounding box of each anchor point. The remodelled feature map is P48 (first boundary regression remodelling layer), with shape (batch size, total number of anchor points per image, number of coordinates per anchor point); P49 (second boundary regression remodelling layer), P50 (third boundary regression remodelling layer), P51 (fourth boundary regression remodelling layer) and P52 (fifth boundary regression remodelling layer) are implemented in the same way as P48. The output feature maps are remodelled to conform to the bounding box regression format, i.e. each anchor point corresponds to a coordinate vector representing the position of its bounding box.
The tf.concat function is used to splice the list of input feature maps; it connects multiple tensors along a specified dimension to form a new tensor. The axis=1 parameter specifies splicing along the first dimension, i.e. the anchor points of the different scales are added up to obtain the final number of anchor points; the first dimension refers to the total number of anchor points of each image. Finally, the class prediction feature maps (P43, P44, P45, P46 and P47) and the bounding box regression feature maps (P48, P49, P50, P51 and P52) are spliced respectively to obtain two output tensors P53 and P54, whose shapes are (batch size, sum of anchor points over all scales, number of classes of each anchor point) and (batch size, sum of anchor points over all scales, 4), where 4 is the number of coordinates of each anchor point.
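For illustration only, the remodelling and splicing described above can be sketched in TensorFlow as follows; the list names cls_maps and box_maps and the default value num_classes=10 are assumptions for the example rather than values fixed by this description.

```python
import tensorflow as tf

def remodel_and_splice(cls_maps, box_maps, num_classes=10):
    """Remodel per-scale prediction maps and splice them along the anchor dimension.

    cls_maps: list of tensors shaped (batch, H, W, anchors_per_location * num_classes)
    box_maps: list of tensors shaped (batch, H, W, anchors_per_location * 4)
    """
    cls_reshaped, box_reshaped = [], []
    for cls_map, box_map in zip(cls_maps, box_maps):
        batch = tf.shape(cls_map)[0]          # first dimension: batch size
        # -1 lets TensorFlow compute H * W * anchors_per_location,
        # i.e. the total number of anchor points of each image at this scale
        cls_reshaped.append(tf.reshape(cls_map, (batch, -1, num_classes)))
        box_reshaped.append(tf.reshape(box_map, (batch, -1, 4)))
    # axis=1 splices along the anchor dimension, summing the anchors of all scales
    p53 = tf.concat(cls_reshaped, axis=1)     # (batch, total_anchors, num_classes)
    p54 = tf.concat(box_reshaped, axis=1)     # (batch, total_anchors, 4)
    return p53, p54
```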
In this embodiment, the target recognition network includes a multi-scale feature network and a detection head network, where the multi-scale feature network includes a first standard convolution layer, a first activation function layer, a first maximum pooling layer, a second standard convolution layer, a second activation function layer, a second maximum pooling layer, a third standard convolution layer, a third activation function layer, a third maximum pooling layer, a fourth standard convolution layer, a fourth activation function layer, a fourth maximum pooling layer, a fifth maximum pooling layer, a first attention module, a fifth standard convolution layer, a fifth activation function layer, a second attention module, a sixth standard convolution layer, a sixth activation function layer, a first tensor splice layer, a third attention module, a seventh standard convolution layer, a seventh activation function layer, a second tensor splice layer, a fourth attention module, an eighth standard convolution layer, an eighth activation function layer, a third tensor splice layer, a fifth attention module, a first upsampling layer, a second upsampling layer, and a third upsampling layer; the detection head network comprises a first standard convolution active layer, a ninth standard convolution layer, a first class prediction remodelling layer, a first boundary regression remodelling layer, a second standard convolution active layer, a tenth standard convolution layer, a second class prediction remodelling layer, a second boundary regression remodelling layer, a third standard convolution active layer, an eleventh standard convolution layer, a third class prediction remodelling layer, a third boundary regression remodelling layer, a fourth standard convolution active layer, a twelfth standard convolution layer, a fourth class prediction remodelling layer, a fourth boundary regression remodelling layer, a fifth standard convolution active layer, a thirteenth standard convolution layer, a fifth class prediction remodelling layer, a fifth boundary regression remodelling layer, a fourth tensor stitching layer and a fifth tensor stitching layer.
Step S4: and carrying out statistical analysis on the object to be detected according to the identification result to obtain a statistical result.
The step S4 specifically comprises the following steps:
counting different classes of objects to be detected, and counting the number of each class, wherein the method specifically comprises the following steps:
reading an SEM image of the recognition result, i.e. loading the annotated grayscale image, in which the value of each pixel represents brightness in the range 0 to 255; defining the categories already identified by the target detection algorithm, such as bacteria, fungi and protozoa, marked in red, green and yellow respectively; defining a pixel area threshold for target objects and setting a minimum pixel area threshold for noise filtering, so that contours below this threshold are considered invalid and are not included in the statistics; defining the conversion ratio between pixel and actual length, assuming that 1 pixel equals 0.1 micrometer, so that the density of target objects, i.e. the number of target objects per unit area, can be calculated; then initializing a counter and an area list to store the number and area of the objects of each category; then carrying out contour detection for each category, extracting the contours of the target objects with cv2 library functions, calculating the area of each contour, treating an object as a valid target object if its area is larger than the threshold, adding the count and the area of each target object to the counter and the area list respectively, and marking the contour on the original image in the corresponding color for visualization; then displaying the annotated original image with matplotlib library functions, so that the different categories of target objects are distinguished by different colors; then calculating the total number of target objects and the total sample area by summing the counters of all categories and converting the pixel size of the image with the conversion ratio; finally outputting the number, proportion and density of the target objects of each category, calculating and outputting them from the data of the counter and the area list, and evaluating the environmental impact of the different object categories.
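A minimal Python sketch of this counting step, consistent with the description above, is given below; the color codes, the MIN_AREA threshold value and the category names are illustrative assumptions, and only the 0.1 micrometer conversion ratio is taken from the text.

```python
import cv2

PIXEL_TO_UM = 0.1          # assumed conversion ratio: 1 pixel = 0.1 micrometer
MIN_AREA = 20              # assumed minimum pixel area used to filter noise

# Assumed BGR colors used to mark each identified category in the result image
CATEGORY_COLORS = {"bacteria": (0, 0, 255), "fungi": (0, 255, 0), "protozoa": (0, 255, 255)}

def count_objects(marked_image):
    """Count valid contours per color-marked category and report proportion and density."""
    counts, areas = {}, {}
    h, w = marked_image.shape[:2]
    total_area_um2 = h * w * PIXEL_TO_UM ** 2                       # total sample area in µm²
    for name, color in CATEGORY_COLORS.items():
        mask = cv2.inRange(marked_image, color, color)              # pixels marked with this color
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        valid = [c for c in contours if cv2.contourArea(c) > MIN_AREA]  # drop noise contours
        counts[name] = len(valid)
        areas[name] = sum(cv2.contourArea(c) for c in valid) * PIXEL_TO_UM ** 2
    total = sum(counts.values())
    for name in counts:
        proportion = counts[name] / total if total else 0.0
        density = counts[name] / total_area_um2                     # objects per µm²
        print(f"{name}: count={counts[name]}, proportion={proportion:.3f}, density={density:.4f}/µm²")
    return counts, areas
```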
Calculating the areas of different classes of objects to be measured, and counting the area of each class, wherein the method specifically comprises the following steps:
reading an SEM image of the target detection result, i.e. loading a grayscale image in which the value of each pixel represents brightness in the range 0 to 255; assigning a color to each identified target category to facilitate visualization, e.g. plant root system green, insect debris red, organic matter debris blue; assuming, for calculating the coverage area of the target objects, that 1 pixel equals 0.1 micrometer, which helps estimate the biological area per unit area; creating a list to store the coverage area of each target category for subsequent statistics and analysis; calculating the coverage area of each category from the labeling result of the target detection algorithm, i.e. multiplying the number of pixels whose value is not 0 by the square of the conversion ratio; marking the region of each category in the corresponding color, by modifying the pixel color of the marked region, for ease of identification; displaying the image with matplotlib library functions, with the different categories of target objects distinguished by different colors to provide an intuitive visual effect; outputting the coverage area and proportion of the target objects of each category; and evaluating, according to the different categories, their impact on the structure and function of the environment.
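As a short sketch only, the area statistic reduces to counting non-zero pixels per category mask and scaling by the square of the conversion ratio; the dictionary of per-category binary masks is an assumed input format, not something fixed by the text.

```python
import numpy as np

PIXEL_TO_UM = 0.1   # assumed conversion ratio: 1 pixel = 0.1 micrometer

def coverage_areas(category_masks):
    """Compute coverage area (µm²) and proportion per category.

    category_masks: dict mapping category name to a binary mask (non-zero = covered).
    """
    areas = {name: np.count_nonzero(mask) * PIXEL_TO_UM ** 2
             for name, mask in category_masks.items()}
    total = sum(areas.values())
    proportions = {name: (a / total if total else 0.0) for name, a in areas.items()}
    return areas, proportions
```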
Carrying out load calculation on different categories of objects to be detected, and counting the quality of each category, wherein the method specifically comprises the following steps:
reading an SEM image of the recognition result, i.e. loading a grayscale image in which the value of each pixel represents brightness in the range 0 to 255; assigning a color to each identified object category to facilitate visualization, e.g. mineral particles marked in red, sand in green, clay in blue and humus in yellow; setting the conversion ratio between pixel and actual length, assuming that 1 pixel equals 0.1 micrometer, so that the volume of the target objects can be calculated; defining a density, e.g. in grams per cubic micrometer, for each category of object for calculating its mass; initializing a mass list, i.e. creating a list to store the mass of the objects of each category; calculating the volume of each category from the labeling result of the target detection algorithm, i.e. multiplying the number of pixels whose value is not 0 by the cube of the conversion ratio, and then calculating the mass of the category from its density, i.e. volume multiplied by density; marking the region of each object category in the corresponding color to provide an intuitive visual effect; displaying the marked image with matplotlib library functions, with the different categories of objects distinguished by different colors; and calculating and outputting the mass and proportion of each category from the data of the mass list, and evaluating its potential impact on the physical and chemical characteristics of the environment.
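A corresponding sketch of the load calculation is shown below; the density values in the DENSITY table are placeholders for illustration, since the text only states that a density is defined for each category.

```python
import numpy as np

PIXEL_TO_UM = 0.1                      # assumed: 1 pixel = 0.1 micrometer
# Assumed example densities in g/µm³; purely illustrative values
DENSITY = {"mineral": 2.7e-12, "sand": 2.6e-12, "clay": 1.8e-12, "humus": 1.3e-12}

def category_masses(category_masks):
    """Estimate mass per category: pixel count × (conversion ratio)³ × density."""
    masses = {}
    for name, mask in category_masks.items():
        volume = np.count_nonzero(mask) * PIXEL_TO_UM ** 3   # volume in µm³
        masses[name] = volume * DENSITY[name]                # mass in grams
    total = sum(masses.values())
    proportions = {name: (m / total if total else 0.0) for name, m in masses.items()}
    return masses, proportions
```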
In this example, the biological categories include E. coli, bacteria, fungi, plaque, protists, plant root systems, insect debris, organic matter fragments, cells, etc.; the non-biological categories include doped silver, mineral particles, sand, clay, humus, and the like. The SEM image is analyzed according to the recognition result, and one or more categories of target objects are selected for statistics according to the actual situation. For the different categories, both biological and non-biological, statistics such as number, area and load can all be calculated; beyond the statistical analysis of number, area and load, other measurement analyses may also be involved.
Example 2
As shown in fig. 3, the present invention discloses an image recognition system based on deep learning, the system comprising:
an image acquisition module 10 for acquiring a scanning electron microscope image of an environmental sample.
The image processing module 20 is configured to pre-process the scanning electron microscope image to obtain an input image.
The target recognition module 30 is configured to input the input image into a target recognition network for feature extraction, so as to obtain a recognition result; the target recognition network comprises a multi-scale feature network and a detection head network.
The statistical analysis module 40 is configured to perform statistical analysis on the target object to be tested according to the identification result, so as to obtain a statistical result.
As an alternative embodiment, the image processing module 20 of the present invention specifically includes:
denoising the scanning electron microscope image to obtain a denoised image, wherein the pixel denoising formula is as follows:
$$\hat{I}(p) = \frac{1}{C(p)} \sum_{q \in \Omega(p)} w(p,q)\, I(q)$$

where $\hat{I}(p)$ is the pixel value of the denoised image at pixel position $p$; $I(q)$ is the pixel value of the original image at pixel position $q$; $q$ runs over all pixel positions within the neighborhood of $p$; $C(p)$ is the normalization factor; $w(p,q)$ is the similarity weight between positions $p$ and $q$; and $\Omega(p)$ is the neighborhood around $p$.
Performing self-adaptive histogram equalization on the denoising image to obtain an equalized image, wherein a pixel equalization formula is as follows:
$$\mathrm{CDF}(v) = \sum_{k=0}^{v} \mathrm{clip}\big(p(k)\big), \qquad I_{eq}(v) = \mathrm{round}\big((L-1)\cdot \mathrm{CDF}(v)\big)$$

where $\mathrm{CDF}(v)$ is the cumulative distribution function value of pixel value $v$ in the equalized histogram; $p(k)$ is the probability density function value of pixel value $k$ in the histogram of the denoised image; $\mathrm{clip}(\cdot)$ is the restriction function applied to $p(k)$; $I_{eq}(v)$ is the pixel value of the image after equalization processing; and $L-1$ is the maximum range of pixel values;
edge detection is carried out on the equalized image to obtain an edge detection image, and the formula is as follows:
$$G = \sqrt{G_x^{2} + G_y^{2}}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$

where $G_x$ and $G_y$ are the gradients of the image $I$ in the horizontal and vertical directions, respectively; $G$ is the gradient magnitude used as the edge response; and $\theta$ is the gradient direction;
Performing characteristic focusing transformation on the edge detection image to obtain an input image, wherein a pixel characteristic focusing transformation formula is as follows:
$$F(p) = E(p)\,e^{-\alpha\, d(p)}$$

where $E(p)$ is the pixel value of the edge detection image at position $p$; $d(p)$ is the distance from the pixel at position $p$ to the nearest edge; $\alpha$ is the parameter controlling the intensity of the feature focusing; and $F(p)$ is the pixel value of the feature-focus-transformed image at position $p$.
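The preprocessing chain described above can be sketched in Python with OpenCV. This is a rough illustration under stated assumptions: non-local-means denoising, contrast-limited adaptive histogram equalization, Sobel gradients and an exponential distance-based focus weighting are one consistent reading of the formulas, and the parameter values (h, clipLimit, tileGridSize, ksize, alpha) are placeholders rather than values from the patent.

```python
import cv2
import numpy as np

def preprocess(sem_image, alpha=0.05):
    """Denoise, equalize, edge-detect and feature-focus an 8-bit grayscale SEM image."""
    # 1. Non-local-means style denoising (weighted average over similar neighborhoods)
    denoised = cv2.fastNlMeansDenoising(sem_image, h=10)
    # 2. Contrast-limited adaptive histogram equalization
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(denoised)
    # 3. Horizontal/vertical gradients and gradient magnitude (edge detection)
    gx = cv2.Sobel(equalized, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(equalized, cv2.CV_64F, 0, 1, ksize=3)
    edges = np.sqrt(gx ** 2 + gy ** 2)
    edge_mask = (edges > edges.mean()).astype(np.uint8)     # 1 = edge pixel
    # 4. Feature focusing: weight pixels by distance to the nearest detected edge
    dist = cv2.distanceTransform(1 - edge_mask, cv2.DIST_L2, 3)
    focused = edges * np.exp(-alpha * dist)
    return cv2.normalize(focused, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```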
As an alternative embodiment, the multi-scale feature network of the present invention specifically includes:
the first standard convolution sub-module is used for inputting an input image into the first standard convolution layer to carry out convolution operation to obtain a feature map P1; inputting the feature map P1 into a first activation function layer for activation operation to obtain a feature map P2; and inputting the feature map P2 into a first maximum pooling layer to perform maximum pooling operation, and obtaining a feature map P3.
The second standard convolution sub-module is used for inputting the feature map P3 into the second standard convolution layer to carry out convolution operation to obtain a feature map P4; inputting the feature map P4 into a second activation function layer for activation operation to obtain a feature map P5; and inputting the feature map P5 into a second maximum pooling layer to perform maximum pooling operation, and obtaining a feature map P6.
The third standard convolution sub-module is used for inputting the feature map P6 into the third standard convolution layer to carry out convolution operation to obtain a feature map P7; inputting the feature map P7 into a third activation function layer for activation operation to obtain a feature map P8; and inputting the feature map P8 into a third maximum pooling layer to perform maximum pooling operation, and obtaining a feature map P9.
The fourth standard convolution sub-module is used for inputting the feature map P9 into a fourth standard convolution layer to carry out convolution operation to obtain a feature map P10; inputting the feature map P10 into a fourth activation function layer for activation operation to obtain a feature map P11; and inputting the feature map P11 into a fourth maximum pooling layer for maximum pooling operation to obtain a feature map P12.
The first attention sub-module is used for inputting the feature map P12 into a fifth maximum pooling layer to perform maximum pooling operation to obtain a feature map P13; the feature map P13 is input to the first attention module for performing an attention operation, resulting in a feature map P28.
The second attention submodule is used for inputting the feature map P12 into a fifth standard convolution layer to carry out convolution operation to obtain a feature map P14; inputting the feature map P14 into a fifth activation function layer for activation operation to obtain a feature map P15; the feature map P15 is input to the second attention module for performing an attention operation, resulting in a feature map P29.
The third attention submodule is used for inputting the feature map P12 into a sixth standard convolution layer to carry out convolution operation to obtain a feature map P16; inputting the feature map P16 into a sixth activation function layer for activation operation to obtain a feature map P17; performing up-sampling operation on the feature map P15 to obtain a feature map P22; inputting the feature map P17 and the feature map P22 into a first tensor splicing layer for splicing operation to obtain a feature map P23; the feature map P23 is input to the third attention module for performing an attention operation, resulting in a feature map P30.
The fourth attention sub-module is used for inputting the feature map P12 into a seventh standard convolution layer to carry out convolution operation to obtain a feature map P18; inputting the feature map P18 into a seventh activation function layer for activation operation to obtain a feature map P19; performing up-sampling operation on the feature map P23 to obtain a feature map P24; inputting the feature map P19 and the feature map P24 into a second tensor splicing layer for splicing operation to obtain a feature map P25; the feature map P25 is input to the fourth attention module for performing an attention operation, resulting in a feature map P31.
A fifth attention sub-module, configured to input the feature map P12 to an eighth standard convolution layer for performing convolution operation, to obtain a feature map P20; inputting the feature map P20 into an eighth activation function layer for activation operation to obtain a feature map P21; performing up-sampling operation on the feature map P25 to obtain a feature map P26; inputting the feature map P21 and the feature map P26 into a third tensor splicing layer for splicing operation to obtain a feature map P27; the feature map P27 is input to the fifth attention module for performing an attention operation, resulting in a feature map P32.
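As a rough Keras-style sketch only: the repeated pattern above, a standard convolution sub-module (convolution, activation, max pooling) and an attention branch that convolves P12, upsamples the deeper map, splices the two and applies attention, could be expressed as below. The filter counts, kernel sizes, activation choice and the assumption that the upsampled map matches the lateral map's spatial size are placeholders not fixed by the text.

```python
from tensorflow.keras import layers

def conv_submodule(x, filters):
    """Standard convolution sub-module: convolution -> activation -> max pooling."""
    x = layers.Conv2D(filters, 3, padding="same")(x)   # kernel size is a placeholder
    x = layers.Activation("relu")(x)                   # activation choice is a placeholder
    return layers.MaxPooling2D(2)(x)

def attention_branch(p12, deeper_map, filters, attention_module):
    """Attention branch: convolve P12, upsample the deeper map, splice, apply attention.

    Assumes deeper_map has half the spatial size of p12 so the spliced shapes match.
    """
    lateral = layers.Activation("relu")(layers.Conv2D(filters, 3, padding="same")(p12))
    upsampled = layers.UpSampling2D(2)(deeper_map)         # up-sampling layer
    spliced = layers.Concatenate()([lateral, upsampled])   # tensor splicing layer
    return attention_module(spliced)                       # e.g. feature maps P30, P31, P32
```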
As an alternative embodiment, the first attention sub-module of the present invention specifically includes:
the feature map acquisition unit is used for acquiring an input feature map; the input profile includes batch, width, height, and channel number.
The coordinate attention graph generating unit is used for generating coordinate attention graphs of all channels according to the input feature graph, and specifically comprises the following steps:
traversing each channel index, and extracting the characteristic map of each channel as a single characteristic map.
Respectively calculating the minimum value and the maximum value of the single feature map in each channel to obtain two scalars; the two scalars represent the range of the single feature map.
Generating two arithmetic series according to the range, width and height of the single feature map; the two arithmetic series represent the horizontal coordinate values and the vertical coordinate values, respectively.
Splicing the two arithmetic series with the single feature map to obtain two coordinate vectors; the coordinate vectors represent the coordinate vectors in the horizontal direction and the vertical direction, respectively.
Copying the two coordinate vectors along the batch dimension and the channel dimension to obtain two copied coordinate vectors; the duplicate coordinate vector is the same as the lot and lane number of a single feature map.
Performing channel dimension expansion on the two copied coordinate vectors, and returning two final coordinate vectors; the final coordinate vectors represent position information in the horizontal direction and the vertical direction, respectively.
The single feature map and the two final coordinate vectors are stitched in the channel dimension to get an attention map, which is added to the list.
And repeatedly executing the operation until the coordinate attention graphs of all the channels are obtained.
And the attention map integrating unit is used for carrying out convolution operation on the coordinate attention map twice to obtain an integrated attention map with the same channel number as the input characteristic map.
And the weighted feature map output unit is used for carrying out element-by-element multiplication operation on the input feature map and the integrated attention map to obtain a weighted feature map.
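A minimal TensorFlow sketch of the coordinate attention operation described by the units above is given below; the use of tf.linspace to generate the arithmetic series, the two 1 × 1 convolutions and their activations are assumptions consistent with, but not dictated by, this description, and statically known height, width and channel dimensions are assumed.

```python
import tensorflow as tf

def coordinate_attention(feature_map):
    """Coordinate attention: per-channel coordinate maps, two convolutions, reweighting.

    feature_map: tensor of shape (batch, height, width, channels) with static H, W, C.
    """
    batch = tf.shape(feature_map)[0]
    _, h, w, c = feature_map.shape
    attention_maps = []
    for i in range(c):                                    # traverse each channel index
        single = feature_map[..., i:i + 1]                # single-channel feature map
        lo = tf.reduce_min(single)                        # scalar minimum of this channel
        hi = tf.reduce_max(single)                        # scalar maximum (range = [lo, hi])
        xs = tf.linspace(lo, hi, w)                       # horizontal arithmetic series
        ys = tf.linspace(lo, hi, h)                       # vertical arithmetic series
        x_map = tf.tile(tf.reshape(xs, (1, 1, w, 1)), tf.stack([batch, h, 1, 1]))
        y_map = tf.tile(tf.reshape(ys, (1, h, 1, 1)), tf.stack([batch, 1, w, 1]))
        # splice the single feature map with its two coordinate maps in the channel dimension
        attention_maps.append(tf.concat([single, x_map, y_map], axis=-1))
    stacked = tf.concat(attention_maps, axis=-1)          # coordinate attention maps of all channels
    # two convolutions integrate the attention maps back to the input channel count
    integrated = tf.keras.layers.Conv2D(c, 1, activation="relu")(stacked)
    integrated = tf.keras.layers.Conv2D(c, 1, activation="sigmoid")(integrated)
    return feature_map * integrated                       # element-by-element weighting
```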
As an alternative embodiment, the statistical analysis module 40 of the present invention specifically includes:
and the counting sub-module is used for counting different types of objects to be measured and counting the number of each type.
And the area calculation sub-module is used for calculating the areas of different types of objects to be measured and counting the areas of each type.
And the load calculation operator module is used for calculating the load of different categories of objects to be measured and counting the quality of each category.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. An image recognition method based on deep learning, which is characterized by comprising the following steps:
step S1: acquiring a scanning electron microscope image of an environmental sample;
step S2: preprocessing the scanning electron microscope image to obtain an input image;
step S3: inputting the input image into a target recognition network for feature extraction to obtain a recognition result, wherein the method specifically comprises the following steps of:
the target recognition network comprises a multi-scale characteristic network and a detection head network;
the multi-scale feature network includes a first standard convolutional layer, a first activation function layer, a first maximum pooling layer, a second standard convolutional layer, a second activation function layer, a second maximum pooling layer, a third standard convolutional layer, a third activation function layer, a third maximum pooling layer, a fourth standard convolutional layer, a fourth activation function layer, a fourth maximum pooling layer, a fifth maximum pooling layer, a first attention module, a fifth standard convolutional layer, a fifth activation function layer, a second attention module, a sixth standard convolutional layer, a sixth activation function layer, a first tensor splice layer, a third attention module, a seventh standard convolutional layer, a seventh activation function layer, a second tensor splice layer, a fourth attention module, an eighth standard convolutional layer, an eighth activation function layer, a third tensor splice layer, a fifth attention module, a first upsampling layer, a second upsampling layer, and a third upsampling layer;
The detection head network comprises a first standard convolution active layer, a ninth standard convolution layer, a first type prediction remodelling layer, a first boundary regression remodelling layer, a second standard convolution active layer, a tenth standard convolution layer, a second type prediction remodelling layer, a second boundary regression remodelling layer, a third standard convolution active layer, an eleventh standard convolution layer, a third type prediction remodelling layer, a third boundary regression remodelling layer, a fourth standard convolution active layer, a twelfth standard convolution layer, a fourth type prediction remodelling layer, a fourth boundary regression remodelling layer, a fifth standard convolution active layer, a thirteenth standard convolution layer, a fifth type prediction remodelling layer, a fifth boundary regression remodelling layer, a fourth tensor splicing layer and a fifth tensor splicing layer;
step S4: and carrying out statistical analysis on the object to be detected according to the identification result to obtain a statistical result.
2. The image recognition method based on deep learning according to claim 1, wherein the preprocessing the scanning electron microscope image to obtain an input image specifically comprises:
denoising the scanning electron microscope image to obtain a denoised image, wherein the pixel denoising formula is as follows:
$$\hat{I}(p) = \frac{1}{C(p)} \sum_{q \in \Omega(p)} w(p,q)\, I(q)$$

where $\hat{I}(p)$ is the pixel value of the denoised image at pixel position $p$; $I(q)$ is the pixel value of the original image at pixel position $q$; $q$ runs over all pixel positions within the neighborhood of $p$; $C(p)$ is the normalization factor; $w(p,q)$ is the similarity weight between positions $p$ and $q$; and $\Omega(p)$ is the neighborhood around $p$;
performing self-adaptive histogram equalization on the denoising image to obtain an equalized image, wherein a pixel equalization formula is as follows:
$$\mathrm{CDF}(v) = \sum_{k=0}^{v} \mathrm{clip}\big(p(k)\big), \qquad I_{eq}(v) = \mathrm{round}\big((L-1)\cdot \mathrm{CDF}(v)\big)$$

where $\mathrm{CDF}(v)$ is the cumulative distribution function value of pixel value $v$ in the equalized histogram; $p(k)$ is the probability density function value of pixel value $k$ in the histogram of the denoised image; $\mathrm{clip}(\cdot)$ is the restriction function applied to $p(k)$; $I_{eq}(v)$ is the pixel value of the image after equalization processing; and $L-1$ is the maximum range of pixel values;
and carrying out edge detection on the equalized image to obtain an edge detection image, wherein the formula is as follows:
$$G = \sqrt{G_x^{2} + G_y^{2}}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$

where $G_x$ and $G_y$ are the gradients of the image $I$ in the horizontal and vertical directions, respectively; $G$ is the gradient magnitude used as the edge response; and $\theta$ is the gradient direction;
performing characteristic focusing transformation on the edge detection image to obtain an input image, wherein a pixel characteristic focusing transformation formula is as follows:
$$F(p) = E(p)\,e^{-\alpha\, d(p)}$$

where $E(p)$ is the pixel value of the edge detection image at position $p$; $d(p)$ is the distance from the pixel at position $p$ to the nearest edge; $\alpha$ is the parameter controlling the intensity of the feature focusing; and $F(p)$ is the pixel value of the feature-focus-transformed image at position $p$.
3. The deep learning-based image recognition method according to claim 1, wherein the multi-scale feature network specifically comprises:
inputting an input image into the first standard convolution layer to carry out convolution operation to obtain a feature map P1; inputting the feature map P1 into the first activation function layer for activation operation to obtain a feature map P2; inputting the feature map P2 into the first maximum pooling layer to perform maximum pooling operation to obtain a feature map P3;
inputting the characteristic map P3 into the second standard convolution layer to carry out convolution operation to obtain a characteristic map P4; inputting the feature map P4 into the second activation function layer for activation operation to obtain a feature map P5; inputting the feature map P5 into the second maximum pooling layer to perform maximum pooling operation to obtain a feature map P6;
inputting the characteristic map P6 into the third standard convolution layer to carry out convolution operation to obtain a characteristic map P7; inputting the feature map P7 into the third activation function layer for activation operation to obtain a feature map P8; inputting the feature map P8 into the third maximum pooling layer to perform maximum pooling operation to obtain a feature map P9;
Inputting the characteristic map P9 into the fourth standard convolution layer to carry out convolution operation to obtain a characteristic map P10; inputting the feature map P10 to the fourth activation function layer for activation operation, so as to obtain a feature map P11; inputting the feature map P11 to the fourth maximum pooling layer for maximum pooling operation to obtain a feature map P12;
inputting the feature map P12 to the fifth maximum pooling layer for maximum pooling operation to obtain a feature map P13; inputting the feature map P13 to the first attention module for performing attention operation to obtain a feature map P28;
inputting the characteristic map P12 into the fifth standard convolution layer to carry out convolution operation to obtain a characteristic map P14; inputting the feature map P14 to the fifth activation function layer for activation operation to obtain a feature map P15; inputting the feature map P15 to the second attention module for performing attention operation, so as to obtain a feature map P29;
inputting the characteristic map P12 into the sixth standard convolution layer to carry out convolution operation to obtain a characteristic map P16; inputting the feature map P16 to the sixth activation function layer for activation operation to obtain a feature map P17; performing up-sampling operation on the feature map P15 to obtain a feature map P22; inputting the feature map P17 and the feature map P22 into the first tensor splicing layer for splicing operation to obtain a feature map P23; inputting the feature map P23 to the third attention module for performing attention operation to obtain a feature map P30;
Inputting the characteristic map P12 into the seventh standard convolution layer for convolution operation to obtain a characteristic map P18; inputting the feature map P18 to the seventh activation function layer for activation operation to obtain a feature map P19; performing up-sampling operation on the feature map P23 to obtain a feature map P24; inputting the feature map P19 and the feature map P24 to the second tensor stitching layer for stitching operation, so as to obtain a feature map P25; inputting the feature map P25 to the fourth attention module for performing attention operation to obtain a feature map P31;
inputting the feature map P12 to the eighth standard convolution layer to perform convolution operation, so as to obtain a feature map P20; inputting the feature map P20 to the eighth activation function layer for activation operation, so as to obtain a feature map P21; performing up-sampling operation on the feature map P25 to obtain a feature map P26; inputting the feature map P21 and the feature map P26 into the third tensor splicing layer for splicing operation to obtain a feature map P27; and inputting the characteristic map P27 into the fifth attention module for attention operation to obtain a characteristic map P32.
4. The image recognition method based on deep learning according to claim 3, wherein the first attention module specifically comprises:
Acquiring an input feature map; the input feature map comprises a batch, a width, a height and a channel number;
generating a coordinate attention graph of all channels according to the input feature graph, wherein the coordinate attention graph specifically comprises:
traversing each channel index, and extracting a feature map of each channel as a single feature map;
respectively calculating the minimum value and the maximum value of the single feature map in each channel to obtain two scalar quantities; the scalar represents a range of a single feature map;
generating two arithmetic series according to the range, the width and the height of the single characteristic diagram; the arithmetic series respectively represents a horizontal coordinate value and a vertical coordinate value;
splicing the two arithmetic series with the single feature map to obtain two coordinate vectors; the coordinate vectors respectively represent coordinate vectors in the horizontal direction and the vertical direction;
copying the two coordinate vectors along the batch dimension and the channel dimension to obtain two copied coordinate vectors; the duplicate coordinate vector is the same as the batch and channel number of the single feature map;
performing channel dimension expansion on the two copied coordinate vectors, and returning two final coordinate vectors; the final coordinate vector represents position information in the horizontal direction and the vertical direction respectively;
Splicing the single feature map and the two final coordinate vectors in a channel dimension to obtain an attention map, and adding the attention map to a list;
repeating the operation until the coordinate attention diagrams of all channels are obtained;
performing convolution operation on the coordinate attention graph twice to obtain an integrated attention graph with the same channel number as the input feature graph;
and performing element-by-element multiplication operation on the input feature map and the integrated attention map to obtain a weighted feature map.
5. The image recognition method based on deep learning according to claim 1, wherein the performing statistical analysis on the object to be detected according to the recognition result to obtain a statistical result specifically includes:
counting different classes of objects to be detected, and counting the number of each class;
calculating the areas of different classes of objects to be measured, and counting the area of each class;
and calculating the load capacity of different types of objects to be measured, and counting the quality of each type.
6. An image recognition system based on deep learning, the system comprising:
the image acquisition module is used for acquiring a scanning electron microscope image of the environmental sample;
The image processing module is used for preprocessing the scanning electron microscope image to obtain an input image;
the target recognition module is used for inputting the input image into a target recognition network to perform feature extraction so as to obtain a recognition result; the target recognition network comprises a multi-scale characteristic network and a detection head network;
and the statistical analysis module is used for carrying out statistical analysis on the object to be detected according to the identification result to obtain a statistical result.
7. The deep learning based image recognition system of claim 6, wherein the image processing module specifically comprises:
denoising the scanning electron microscope image to obtain a denoised image, wherein the pixel denoising formula is as follows:
$$\hat{I}(p) = \frac{1}{C(p)} \sum_{q \in \Omega(p)} w(p,q)\, I(q)$$

where $\hat{I}(p)$ is the pixel value of the denoised image at pixel position $p$; $I(q)$ is the pixel value of the original image at pixel position $q$; $q$ runs over all pixel positions within the neighborhood of $p$; $C(p)$ is the normalization factor; $w(p,q)$ is the similarity weight between positions $p$ and $q$; and $\Omega(p)$ is the neighborhood around $p$;
performing self-adaptive histogram equalization on the denoising image to obtain an equalized image, wherein a pixel equalization formula is as follows:
$$\mathrm{CDF}(v) = \sum_{k=0}^{v} \mathrm{clip}\big(p(k)\big), \qquad I_{eq}(v) = \mathrm{round}\big((L-1)\cdot \mathrm{CDF}(v)\big)$$

where $\mathrm{CDF}(v)$ is the cumulative distribution function value of pixel value $v$ in the equalized histogram; $p(k)$ is the probability density function value of pixel value $k$ in the histogram of the denoised image; $\mathrm{clip}(\cdot)$ is the restriction function applied to $p(k)$; $I_{eq}(v)$ is the pixel value of the image after equalization processing; and $L-1$ is the maximum range of pixel values;
and carrying out edge detection on the equalized image to obtain an edge detection image, wherein the formula is as follows:
$$G = \sqrt{G_x^{2} + G_y^{2}}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$

where $G_x$ and $G_y$ are the gradients of the image $I$ in the horizontal and vertical directions, respectively; $G$ is the gradient magnitude used as the edge response; and $\theta$ is the gradient direction;
performing characteristic focusing transformation on the edge detection image to obtain an input image, wherein a pixel characteristic focusing transformation formula is as follows:
$$F(p) = E(p)\,e^{-\alpha\, d(p)}$$

where $E(p)$ is the pixel value of the edge detection image at position $p$; $d(p)$ is the distance from the pixel at position $p$ to the nearest edge; $\alpha$ is the parameter controlling the intensity of the feature focusing; and $F(p)$ is the pixel value of the feature-focus-transformed image at position $p$.
8. The deep learning based image recognition system of claim 6, wherein the multi-scale feature network specifically comprises:
the first standard convolution sub-module is used for inputting an input image into the first standard convolution layer to carry out convolution operation to obtain a feature map P1; inputting the feature map P1 into a first activation function layer for activation operation to obtain a feature map P2; inputting the feature map P2 into a first maximum pooling layer for maximum pooling operation to obtain a feature map P3;
The second standard convolution sub-module is used for inputting the characteristic map P3 into a second standard convolution layer to carry out convolution operation to obtain a characteristic map P4; inputting the feature map P4 into a second activation function layer for activation operation to obtain a feature map P5; inputting the feature map P5 into a second maximum pooling layer to perform maximum pooling operation to obtain a feature map P6;
the third standard convolution sub-module is used for inputting the characteristic map P6 into a third standard convolution layer to carry out convolution operation to obtain a characteristic map P7; inputting the feature map P7 into a third activation function layer for activation operation to obtain a feature map P8; inputting the feature map P8 into a third maximum pooling layer for maximum pooling operation to obtain a feature map P9;
the fourth standard convolution sub-module is used for inputting the characteristic map P9 into a fourth standard convolution layer to carry out convolution operation to obtain a characteristic map P10; inputting the feature map P10 into a fourth activation function layer for activation operation to obtain a feature map P11; inputting the feature map P11 to a fourth maximum pooling layer for maximum pooling operation to obtain a feature map P12;
the first attention sub-module is used for inputting the feature map P12 to a fifth maximum pooling layer for maximum pooling operation to obtain a feature map P13; inputting the feature map P13 to a first attention module for performing attention operation to obtain a feature map P28;
The second attention submodule is used for inputting the feature map P12 into a fifth standard convolution layer to carry out convolution operation to obtain a feature map P14; inputting the feature map P14 into a fifth activation function layer for activation operation to obtain a feature map P15; inputting the feature map P15 to a second attention module for attention operation to obtain a feature map P29;
the third attention submodule is used for inputting the feature map P12 into a sixth standard convolution layer to carry out convolution operation to obtain a feature map P16; inputting the feature map P16 to a sixth activation function layer for activation operation to obtain a feature map P17; performing up-sampling operation on the feature map P15 to obtain a feature map P22; inputting the feature map P17 and the feature map P22 into a first tensor splicing layer for splicing operation to obtain a feature map P23; inputting the feature map P23 to a third attention module for performing attention operation to obtain a feature map P30;
the fourth attention sub-module is used for inputting the feature map P12 into a seventh standard convolution layer to carry out convolution operation to obtain a feature map P18; inputting the feature map P18 to a seventh activation function layer for activation operation to obtain a feature map P19; performing up-sampling operation on the feature map P23 to obtain a feature map P24; inputting the feature map P19 and the feature map P24 into a second tensor splicing layer for splicing operation to obtain a feature map P25; inputting the feature map P25 to a fourth attention module for performing attention operation to obtain a feature map P31;
A fifth attention sub-module, configured to input the feature map P12 to an eighth standard convolution layer for convolution operation, to obtain a feature map P20; inputting the feature map P20 to an eighth activation function layer for activation operation to obtain a feature map P21; performing up-sampling operation on the feature map P25 to obtain a feature map P26; inputting the feature map P21 and the feature map P26 into a third tensor splicing layer for splicing operation to obtain a feature map P27; and inputting the characteristic map P27 into a fifth attention module for attention operation to obtain a characteristic map P32.
9. The deep learning based image recognition system of claim 8, wherein the first attention sub-module specifically comprises:
the feature map acquisition unit is used for acquiring an input feature map; the input feature map comprises a batch, a width, a height and a channel number;
the coordinate attention graph generating unit is configured to generate coordinate attention graphs of all channels according to the input feature graph, and specifically includes:
traversing each channel index, and extracting a feature map of each channel as a single feature map;
respectively calculating the minimum value and the maximum value of the single feature map in each channel to obtain two scalar quantities; the scalar represents a range of a single feature map;
Generating two arithmetic series according to the range, the width and the height of the single characteristic diagram; the arithmetic series respectively represents a horizontal coordinate value and a vertical coordinate value;
splicing the two arithmetic series with the single feature map to obtain two coordinate vectors; the coordinate vectors respectively represent coordinate vectors in the horizontal direction and the vertical direction;
copying the two coordinate vectors along the batch dimension and the channel dimension to obtain two copied coordinate vectors; the duplicate coordinate vector is the same as the batch and channel number of the single feature map;
performing channel dimension expansion on the two copied coordinate vectors, and returning two final coordinate vectors; the final coordinate vector represents position information in the horizontal direction and the vertical direction respectively;
splicing the single feature map and the two final coordinate vectors in a channel dimension to obtain an attention map, and adding the attention map to a list;
repeating the operation until the coordinate attention diagrams of all channels are obtained;
the attention map integrating unit is used for carrying out convolution operation on the coordinate attention map twice to obtain an integrated attention map with the same channel number as the input characteristic map;
And the weighted feature map output unit is used for carrying out element-by-element multiplication operation on the input feature map and the integrated attention map to obtain a weighted feature map.
10. The deep learning based image recognition system of claim 6, wherein the statistical analysis module specifically comprises:
the counting sub-module is used for counting different types of objects to be detected and counting the number of each type;
the area calculation sub-module is used for calculating the areas of different types of objects to be measured and counting the area of each type;
and the load calculation operator module is used for calculating the load of different categories of objects to be measured and counting the quality of each category.
CN202410026958.1A 2024-01-09 2024-01-09 Image recognition method and system based on deep learning Active CN117542049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410026958.1A CN117542049B (en) 2024-01-09 2024-01-09 Image recognition method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410026958.1A CN117542049B (en) 2024-01-09 2024-01-09 Image recognition method and system based on deep learning

Publications (2)

Publication Number Publication Date
CN117542049A true CN117542049A (en) 2024-02-09
CN117542049B CN117542049B (en) 2024-03-26

Family

ID=89794168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410026958.1A Active CN117542049B (en) 2024-01-09 2024-01-09 Image recognition method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN117542049B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230343128A1 (en) * 2022-04-24 2023-10-26 Nanjing Agricultural University Juvenile fish limb identification method based on multi-scale cascaded perceptual convolutional neural network
CN115830319A (en) * 2022-11-29 2023-03-21 镇江市高等专科学校 Strabismus iris segmentation method based on attention mechanism and verification method
CN116309806A (en) * 2023-03-27 2023-06-23 中日友好医院(中日友好临床医学研究所) CSAI-Grid RCNN-based thyroid ultrasound image region of interest positioning method
CN117079163A (en) * 2023-08-25 2023-11-17 杭州智元研究院有限公司 Aerial image small target detection method based on improved YOLOX-S

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
潘凯;侯亮;: "基于卷积神经网络的遥感图像降噪", 现代信息科技, no. 12, 25 June 2020 (2020-06-25) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117848515A (en) * 2024-03-07 2024-04-09 国网吉林省电力有限公司长春供电公司 Switch cabinet temperature monitoring method and system
CN117848515B (en) * 2024-03-07 2024-05-07 国网吉林省电力有限公司长春供电公司 Switch cabinet temperature monitoring method and system
CN117975253A (en) * 2024-03-29 2024-05-03 山东省海洋资源与环境研究院(山东省海洋环境监测中心、山东省水产品质量检验中心) Portunus trituberculatus identification method, system and equipment based on coordinate attention

Also Published As

Publication number Publication date
CN117542049B (en) 2024-03-26

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant