CN116824144A - U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves - Google Patents

U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves

Info

Publication number
CN116824144A
Authority
CN
China
Prior art keywords
information
frequency
pixel
segmentation
weights
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310789644.2A
Other languages
Chinese (zh)
Inventor
穆维松
张馨心
郑海颖
范梦杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Agricultural University filed Critical China Agricultural University
Priority to CN202310789644.2A priority Critical patent/CN116824144A/en
Publication of CN116824144A publication Critical patent/CN116824144A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of agricultural information, and particularly relates to a U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves. The method adopts the lightweight convolutional neural network MobileNetV2 and extracts multi-scale feature information through downsampling in a U-shaped pyramid; a context-aware enhancement module extracts a low-frequency global feature map and a high-frequency local feature map; a token aggregation strategy is introduced to reduce the detail information lost when low-frequency and high-frequency features are aggregated directly; and the aggregated tokens are transmitted directly to a lightweight segmentation head to perform the segmentation task. The invention achieves a balance between segmentation performance and speed so as to solve the problem of segmenting small grape leaf spots against the complex background of natural fields.

Description

U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves
Technical Field
The invention belongs to the technical field of agricultural information, and particularly relates to a U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves.
Background
Grape leaf spot is one of the major factors responsible for reduced yield and quality in grape cultivation. Moreover, grape leaf lesions can spread the causal fungus rapidly throughout a plantation and cause field-wide epidemics. Assigning a label to each pixel makes it possible to quickly determine the distribution of disease, and a lightweight segmentation model helps to rapidly diagnose and monitor disease trends on the leaf, so that targeted management measures can improve treatment efficiency and reduce treatment cost. However, heavyweight models are ill-suited to such rapidly spreading plant diseases, which require timely segmentation, and are difficult to deploy on hardware devices with limited resources. To improve segmentation efficiency, the segmentation model must be designed to be accurate, lightweight and fast.
(1) Convolutional neural network
Lightweight visual tasks have long been dominated by convolutional neural networks (CNNs). Their inherent inductive bias and weight-sharing characteristics allow a model to learn representations with fewer parameters. However, they have several problems that limit their performance: 1. the local connectivity of CNNs prevents modeling of long-range dependencies and ignores fine-grained semantic information in complex contexts; 2. the fixed convolution kernels and weights lead to loss of detail information, so the pixel semantics of small-target lesions cannot be extracted. Transformer-based approaches, which learn global visual representations via self-attention instead of convolution, have demonstrated excellent long-range modeling capabilities. However, their heavy weights and time-consuming computation make inference unfriendly in realistic industrial deployment scenarios.
(2) Transformer
To reduce the computational cost of the model, many Transformer variants strive to free the model from the dilemma of time-consuming computation. In addition, much work extracts low-resolution lesion features by introducing convolution operators, downsampling feature maps, employing pyramid hierarchies, and redesigning tokens. When the model is compressed to a size suitable for mobile deployment, the segmentation performance of the corresponding model is sacrificed, which is a serious drawback for a lightweight model. CNNs exhibit impressive performance thanks to the inductive bias of their architectural design, and recent work has therefore attempted to embed the advantages of CNNs into Transformers to achieve an excellent accuracy-efficiency trade-off. MobileFormer combines MobileNet and a Transformer to realize bidirectional fusion of local and global information at low computational cost. Unfortunately, the above approaches focus on capturing low-frequency global information and ignore the importance of high-frequency local information, which helps to extract the features of small lesions.
(3) Feature aggregation strategies
A weakness of directly aggregating high-resolution features with low-frequency feature information is that details are easily overwhelmed or lost. Some pioneering work has explored various aggregation schemes to overcome this problem. ASPP performs dilated convolution with multiple parallel branches having different dilation rates, which enables the model to aggregate local and global context information without a significant increase in computational complexity. The PPM in PSPNet captures multi-scale context information by pyramid pooling of the input feature map, which enhances the model's perception of features at different scales and produces finer segmentation results than ASPP. To obtain more refined context information, subsequent variants such as DAPPM were derived, integrating features through larger convolution kernels and deeper information flows. However, the deep information flows of the above methods are not processed in parallel, and the number of channels at each scale is relatively large, which implies a large amount of computation.
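For reference, a minimal PyTorch sketch of the parallel dilated-convolution idea behind the ASPP module described above is given below; the dilation rates (1, 6, 12, 18), channel widths and layer layout are illustrative assumptions, not values taken from this patent or from DeepLab.
```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Parallel dilated convolutions aggregating context at several rates."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # A 1x1 convolution fuses the concatenated branch outputs back to out_ch channels.
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.branches]          # same spatial size, different receptive fields
        return self.project(torch.cat(feats, dim=1))   # aggregate local and global context

# Example: a 32x32 feature map with 64 channels keeps its spatial size.
y = ASPPSketch(64, 64)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```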
Disclosure of Invention
The invention discloses a U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves, which adopts the lightweight convolutional neural network MobileNetV2 and extracts multi-scale feature information through downsampling in a U-shaped pyramid; a context-aware enhancement module extracts a low-frequency global feature map and a high-frequency local feature map; a token aggregation strategy is introduced to reduce the detail information lost when low-frequency and high-frequency features are aggregated directly; and the aggregated tokens are transmitted directly to a lightweight segmentation head to perform the segmentation task.
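The overall flow described above can be summarized as a short sketch. This is not the patented implementation: the module objects (backbone, enhance, aggregate, head), the pooling to the smallest pyramid resolution and the toy shapes are assumptions used only to show how the stages connect.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def forward_pipeline(image, backbone, enhance, aggregate, head):
    """Hypothetical end-to-end flow: backbone -> U-shaped pyramid tokens ->
    context-aware enhancement -> token aggregation -> lightweight head."""
    feats = backbone(image)                                     # list of multi-scale feature maps
    target = feats[-1].shape[-2:]                               # smallest resolution in the pyramid
    pooled = [F.adaptive_avg_pool2d(f, target) for f in feats]  # downsample every scale
    tokens = torch.cat(pooled, dim=1)                           # splice scales along the channel dim
    tokens = aggregate(enhance(tokens))                         # global/local enhancement, then fusion
    logits = head(tokens)                                       # lightweight segmentation head
    return F.interpolate(logits, size=image.shape[-2:], mode="bilinear", align_corners=False)

# Toy usage with placeholder modules (shapes only, not the patented architecture):
img = torch.randn(1, 3, 512, 512)
backbone = lambda x: [F.avg_pool2d(x, k) for k in (8, 16, 32)]  # three fake scales, 3 channels each
head = nn.Conv2d(9, 2, kernel_size=1)                           # 3 scales x 3 channels -> 2 classes
out = forward_pipeline(img, backbone, nn.Identity(), nn.Identity(), head)
print(out.shape)  # torch.Size([1, 2, 512, 512])
```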
Preferably, the U-shaped pyramid uses MobileNetV2 to extract feature information, reduces the resolution of the original tokens with an average pooling operator, and concatenates the tokens of different scales along the channel dimension to generate new tokens, which serve as the input of the context-aware enhancement module. The extraction of the multi-scale feature information has low computational complexity, because the multi-scale tokens are downsampled to a small resolution even though the new tokens have a large number of channels.
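A hedged sketch of this token construction using torchvision's MobileNetV2 follows; the stage split indices, the 16×16 target resolution and the use of adaptive average pooling are assumptions made for illustration, not the exact configuration of the invention.
```python
import torch
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

# Assumed tap points in torchvision's MobileNetV2 feature extractor.
features = mobilenet_v2().features.eval()
stages = [features[:4], features[4:7], features[7:14], features[14:18]]

def u_pyramid_tokens(x, target_hw=(16, 16)):
    """Extract multi-scale features, average-pool each scale to a small common
    resolution, and concatenate them along the channel dimension as new tokens."""
    tokens = []
    for stage in stages:
        x = stage(x)                                     # progressively downsampled feature maps
        tokens.append(F.adaptive_avg_pool2d(x, target_hw))
    return torch.cat(tokens, dim=1)                      # many channels, small spatial size

with torch.no_grad():
    t = u_pyramid_tokens(torch.randn(1, 3, 512, 512))
print(t.shape)  # torch.Size([1, 472, 16, 16]) with the assumed splits (24+32+96+320 channels)
```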
Preferably, the context-aware enhancement module includes a prototype-aware branch and a pixel-aware broadcast branch.
Preferably, in the prototype-aware branch, K and V are downsampled to reduce matrix operations; a convolution layer exchanges information between tokens along the spatial dimension, reducing the number of reshaping operations; the nonlinear activation layers used are ReLU6 and GELU; batch normalization is added in each convolution, which is faster at inference than layer normalization; and fine-grained semantic information in the tokens is retained through the residual mapping of the Transformer. The prototype-aware branch effectively achieves a global receptive field and enhances the low-frequency representation at a low computational cost.
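A minimal sketch of such a branch is shown below, assuming strided convolutions for the K/V downsampling, four attention heads and ReLU6 in the output projection; these choices, and the omission of the GELU path, are assumptions for illustration rather than the patented design.
```python
import torch
import torch.nn as nn

class PrototypeAwareBranch(nn.Module):
    """Hedged sketch of a prototype-aware branch: K and V are downsampled so the
    attention matrix stays small, convolutions use BatchNorm, and a residual
    connection keeps fine-grained token semantics. All dimensions are assumptions."""
    def __init__(self, dim: int, heads: int = 4, pool: int = 2):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Sequential(nn.Conv2d(dim, dim, 1, bias=False), nn.BatchNorm2d(dim))
        # Strided convolution downsamples K and V, shrinking the attention matrix.
        self.kv = nn.Sequential(nn.Conv2d(dim, 2 * dim, pool, stride=pool, bias=False),
                                nn.BatchNorm2d(2 * dim))
        self.proj = nn.Sequential(nn.Conv2d(dim, dim, 1, bias=False), nn.BatchNorm2d(dim),
                                  nn.ReLU6(inplace=True))

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.q(x).reshape(B, self.heads, C // self.heads, H * W).transpose(-2, -1)
        k, v = self.kv(x).chunk(2, dim=1)                # downsampled keys and values
        k = k.reshape(B, self.heads, C // self.heads, -1)
        v = v.reshape(B, self.heads, C // self.heads, -1).transpose(-2, -1)
        attn = (q @ k * self.scale).softmax(dim=-1)      # global receptive field, small matrix
        out = (attn @ v).transpose(-2, -1).reshape(B, C, H, W)
        return x + self.proj(out)                        # residual keeps fine-grained details

y = PrototypeAwareBranch(64)(torch.randn(1, 64, 16, 16))
print(y.shape)  # torch.Size([1, 64, 16, 16])
```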
Preferably, the pixel-aware broadcast branch adopts a convolution-style attention mechanism and effectively mines contextual weights through shared weights and pixel-aware weights; a linear layer is used to generate the key K, query Q and value V, as follows:
Q, K, V = Linear(X_in)
where X_in represents the features input from the U-shaped pyramid.
Preferably, the pixel-aware broadcast branch is divided into the following steps:
Step 1: local features are extracted by a depth-wise convolution (DWConv) operator, and shared weights are applied to V, as follows:
V = DWConv(V)
Step 2: local enhancement with pixel-aware weights is performed on Q and K; the local information of Q and K is obtained with two convolutions having translational invariance; the values of Q and K are combined by the Hadamard product to serve as the output; the Softmax of traditional attention is replaced by Tanh and Swish to obtain pixel-aware weights between -1 and 1; and a gating mechanism is adopted so that the pixel-aware weights have stronger nonlinearity, which means higher-quality pixel-aware weights. This is expressed as follows:
Q_l = DWConv(Q)
K_l = DWConv(K)
Attn_l = Linear(Swish(Linear(Q_l ⊙ K_l)))
the generated strong nonlinear weight is aggregated with other pixels, and local characteristics are enhanced through Hadamard product operation. The output graph is defined as:
X_local = Attn ⊙ V
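The two steps above can be sketched as follows, realizing the linear layers as 1×1 convolutions and the pixel-aware weights with depth-wise convolutions; the 3×3 kernel size, the SiLU (Swish) activation and the final Tanh used to bound the weights to (-1, 1) are assumptions drawn from the prose, not a verified reproduction of the invention.
```python
import torch
import torch.nn as nn

class PixelAwareBroadcastBranch(nn.Module):
    """Hedged sketch of the pixel-aware broadcast branch described above.
    Linear layers are realized as 1x1 convolutions, and depth-wise convolutions
    (DWConv) provide the shared / pixel-aware weights; sizes are assumptions."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Conv2d(dim, 3 * dim, 1)                    # Q, K, V = Linear(X_in)
        dw = lambda: nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.dw_v, self.dw_q, self.dw_k = dw(), dw(), dw()       # shared / pixel-aware local weights
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, 1),        # Linear
                                  nn.SiLU(),                     # Swish activation
                                  nn.Conv2d(dim, dim, 1),        # Linear
                                  nn.Tanh())                     # bound weights to (-1, 1)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=1)
        v = self.dw_v(v)                                         # step 1: V = DWConv(V)
        attn = self.gate(self.dw_q(q) * self.dw_k(k))            # step 2: gated Hadamard product
        return attn * v                                          # X_local = Attn ⊙ V

y = PixelAwareBroadcastBranch(64)(torch.randn(1, 64, 16, 16))
print(y.shape)  # torch.Size([1, 64, 16, 16])
```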
Preferably, the token aggregation strategy effectively aggregates low-frequency global information and high-frequency local information while reducing the number of channels at each scale. Global average pooling with different kernel sizes and strides is employed to obtain feature maps at different image resolutions. The channel dimensions of the different scales are transformed by a 1×1 convolution, and the feature maps are upsampled. The original features are then aggregated with the context information of the different scales using a 3×3 convolution. Finally, the feature maps are concatenated and compressed using a 1×1 convolution. Furthermore, to ease optimization, a 1×1 residual mapping is also introduced. Assuming x is the input, the feature at each scale is obtained by global average pooling, 1×1 convolution, and upsampling (denoted Up), after which it is fused with the original features by a 3×3 convolution.
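A hedged PyTorch sketch of this aggregation is given below; the pooling kernel/stride pairs, the reduced channel width and the way the original features are fused into each scale are assumptions chosen to match the prose, not the exact formulation of the invention.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenAggregationSketch(nn.Module):
    """Hedged sketch of the token aggregation strategy: pool at several scales,
    reduce channels with 1x1 convs, upsample, fuse with 3x3 convs, concatenate,
    compress, and add a 1x1 residual. Kernel/stride pairs are assumptions."""
    def __init__(self, in_ch: int, mid_ch: int = 32, pools=((5, 2), (9, 4), (17, 8))):
        super().__init__()
        self.pools = nn.ModuleList(nn.AvgPool2d(k, stride=s, padding=k // 2) for k, s in pools)
        self.reduce = nn.ModuleList(nn.Conv2d(in_ch, mid_ch, 1) for _ in pools)
        self.fuse = nn.ModuleList(nn.Conv2d(mid_ch, mid_ch, 3, padding=1) for _ in pools)
        self.base = nn.Conv2d(in_ch, mid_ch, 1)                       # full-resolution branch
        self.compress = nn.Conv2d(mid_ch * (len(pools) + 1), in_ch, 1)
        self.shortcut = nn.Conv2d(in_ch, in_ch, 1)                    # 1x1 residual mapping

    def forward(self, x):
        size = x.shape[-2:]
        outs = [self.base(x)]
        for pool, red, fuse in zip(self.pools, self.reduce, self.fuse):
            y = F.interpolate(red(pool(x)), size=size, mode="bilinear", align_corners=False)
            outs.append(fuse(y + outs[0]))               # aggregate context with original features
        return self.compress(torch.cat(outs, dim=1)) + self.shortcut(x)

y = TokenAggregationSketch(64)(torch.randn(1, 64, 16, 16))
print(y.shape)  # torch.Size([1, 64, 16, 16])
```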
The invention has the advantage of providing a lightweight U-shaped perception Transformer that takes tokens of different scales, downsampled to a small scale, as input and inherits the advantages of both CNNs and Transformers. Its core component, the context-aware enhancement module, adopts a parallel architecture on the small-scale tokens to achieve superior cost-effectiveness. The enhancement module consists of two branches: the prototype-aware branch learns low-frequency global information by downsampling K and V, while the pixel-aware broadcast branch adopts a gating mechanism to enhance nonlinearity and mines high-frequency local information through shared weights and context-aware weights. A token aggregation strategy is designed to compensate for the sacrificed details without increasing the number of parameters. The invention achieves a balance between efficiency and speed so as to solve the problem of segmenting small grape leaf spots against the complex background of natural fields.
Drawings
FIG. 1 is a diagram of the overall architecture of the U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves;
FIG. 2 is a schematic diagram of a pixel aware broadcast branch;
fig. 3 is a schematic diagram of a token aggregation policy.
Detailed Description
The overall framework of the U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves is shown in FIG. 1; FIG. 2 is a schematic diagram of the pixel-aware broadcast branch in the context-aware enhancement module of the present invention; FIG. 3 is a schematic diagram of the token aggregation strategy of the present disclosure.
In the training stage, this experiment and the comparison experiments were implemented in PyTorch with a semantic segmentation library. All models were trained on an NVIDIA Tesla V100 GPU. For fairness of comparison, the invention follows the same training strategy as previous work. Specifically, images are randomly cropped to 512×512. In the training phase, AdamW with a weight decay of 0.01 is used to optimize the model of the invention. The model is trained with the "poly" LR strategy, lr = base_lr × (1 − iter/max_iter)^power, where the power factor is set to 1, the initial learning rate is 6×10⁻⁶, and a total of 160,000 iterations are run.
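The quoted schedule can be written out directly; the sketch below uses the stated base learning rate of 6×10⁻⁶, power 1.0 and 160,000 iterations, and the helper name poly_lr is an illustrative assumption.
```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 1.0) -> float:
    """'Poly' learning-rate policy: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Values stated in the description: base LR 6e-6, power 1.0, 160,000 iterations.
for it in (0, 80_000, 160_000):
    print(it, poly_lr(6e-6, it, 160_000))   # 6e-06 at the start, 3e-06 halfway, 0.0 at the end
```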
The present invention evaluates the designed architecture on three datasets: a Field-PV dataset, the Plant Village dataset, and a synthetic Syn-PV dataset. The Field-PV dataset was collected with an OLYMPUS OM-D camera by the Forestry and Fruit Tree Research Institute of the Beijing Academy of Agriculture and Forestry Sciences, China; a total of 400 original images containing natural scenes of grape gray mold were captured. Plant Village is a public, fair dataset dedicated to crop pest and disease identification. It consists of 54,303 high-resolution images covering 38 categories of diseased and healthy plant leaves, all acquired in a controlled laboratory. We used 1,383 grape black measles images and 1,180 grape black rot images. Syn-PV consists of natural field images synthesized from the Plant Village segmentation images obtained in the controlled laboratory; a background replacement method is adopted to synthesize grape disease images with complex backgrounds. All datasets were manually annotated with disease areas and leaf areas by using the labelme tool on the collected images. The annotated data are saved in JavaScript Object Notation (.json) format and then converted to the PASCAL VOC 2012 data format, which provides semantic labels for foreground and background objects. The invention uses the Augmentor module to perform geometric transformations such as random left/right flipping, random cropping, random sampling, and color and brightness enhancement or reduction. During training, the invention applies the basic and powerful data enhancement methods of the semantic segmentation library.
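As an illustration only, the listed augmentations can be sketched with torchvision transforms; this is a stand-in for the Augmentor module named above, and every parameter value is an assumption not specified in the description.
```python
import torchvision.transforms as T

# Illustrative stand-in for the augmentations listed above; for segmentation the
# geometric transforms (flip, crop) must be applied identically to the mask,
# while the photometric jitter is applied to the image only.
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                  # random left/right flipping
    T.RandomResizedCrop(512, scale=(0.5, 1.0)),                     # random cropping to 512x512
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),    # color/brightness change
    T.ToTensor(),
])
```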
To evaluate the effectiveness of the U-shaped perception Transformer, the model was compared with other segmentation methods: 3 classical segmentation methods (DeepLabv3+, UNet, PSPNet), 4 heavyweight Transformer-based segmentation methods (PVT2, Dual-ViT, SegFormer, SegNeXt), and 7 lightweight Transformer-based segmentation methods (SeaFormer, AFFormer, PoolFormer, EfficientFormer, LVT, NextViT, TopFormer).
Four evaluation indexes, namely accuracy, IoU, recall and Dice, are adopted to measure model performance. Meanwhile, the parameters (Params), giga floating-point operations (GFLOPs), frames per second (FPS) and occupied memory of each model are analyzed.
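For clarity, the four indexes can be computed from pixel-level confusion counts as sketched below for a binary lesion/background mask; the function name and the epsilon smoothing term are assumptions added for illustration.
```python
import torch

def binary_seg_metrics(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Pixel-level metrics named above for a binary mask (1 = lesion, 0 = background)."""
    tp = ((pred == 1) & (target == 1)).sum().float()
    fp = ((pred == 1) & (target == 0)).sum().float()
    fn = ((pred == 0) & (target == 1)).sum().float()
    tn = ((pred == 0) & (target == 0)).sum().float()
    return {
        "accuracy": ((tp + tn) / (tp + tn + fp + fn + eps)).item(),
        "IoU":      (tp / (tp + fp + fn + eps)).item(),
        "recall":   (tp / (tp + fn + eps)).item(),
        "Dice":     (2 * tp / (2 * tp + fp + fn + eps)).item(),
    }

m = binary_seg_metrics(torch.randint(0, 2, (512, 512)), torch.randint(0, 2, (512, 512)))
print(m)
```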
TABLE 1 Quantitative comparison of grape mosaic virus on the Plant Village dataset based on CNN and Transformer methods
TABLE 2 Quantitative comparison of grape leaf and background on the Plant Village dataset based on CNN and Transformer methods
TABLE 3 Quantitative comparison of grape leaf disease and background on the Field-PV dataset based on CNN and Transformer methods
Experimental results show that the segmentation performance of the proposed method is superior to that of current state-of-the-art Transformer methods and other deep-learning-based methods. Taking image segmentation performance and training and inference cost into comprehensive consideration, the invention performs best on the challenging task of segmenting small grape leaf spots and achieves a balance between segmentation performance and speed.

Claims (7)

1. A U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves, characterized in that the lightweight convolutional neural network MobileNetV2 is adopted and multi-scale feature information is extracted through downsampling in a U-shaped pyramid; a context-aware enhancement module extracts a low-frequency global feature map and a high-frequency local feature map; a token aggregation strategy is introduced to reduce the detail information lost when low-frequency and high-frequency features are aggregated directly; and the aggregated tokens are transmitted directly to a lightweight segmentation head to perform the segmentation task.
2. The method of claim 1, wherein the U-shaped pyramid uses MobileNetV2 to extract feature information, uses an average pooling operator to reduce the resolution of the original tokens, and concatenates tokens of different scales along the channel dimension to generate new tokens, which are used as the input of the context-aware enhancement module.
3. The method of claim 1, wherein the context-aware enhancement module comprises a prototype-aware branch and a pixel-aware broadcast branch.
4. The method according to claim 3, wherein, in the prototype-aware branch, K and V are downsampled, a convolution layer exchanges information between tokens along the spatial dimension, the nonlinear activation layers are ReLU6 and GELU, batch normalization is added in each convolution, and fine-grained semantic information in the tokens is retained through the residual mapping of the Transformer.
5. The method according to claim 3, wherein the pixel-aware broadcast branch adopts a convolution-style attention mechanism and effectively mines contextual weights through shared weights and pixel-aware weights; a linear layer is used to generate the key K, query Q and value V, as follows:
Q, K, V = Linear(X_in)
where X_in represents the features input from the U-shaped pyramid.
6. The method according to claim 3, wherein the pixel-aware broadcast branch comprises the following steps:
step 6.1: local features are extracted by a depth-wise convolution (DWConv) operator, and shared weights are applied to V, as follows:
V = DWConv(V)
step 6.2: local enhancement with pixel-aware weights is performed on Q and K; the local information of Q and K is obtained with two convolutions having translational invariance; the values of Q and K are combined by the Hadamard product to serve as the output; the Softmax of traditional attention is replaced by Tanh and Swish; and a gating mechanism is adopted to obtain the pixel-aware weights.
7. The method of claim 1, wherein the token aggregation strategy effectively aggregates low-frequency global information and high-frequency local information while reducing the number of channels at each scale; global average pooling with different kernel sizes and strides is employed to obtain feature maps at different image resolutions.
CN202310789644.2A 2023-06-30 2023-06-30 U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves Pending CN116824144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310789644.2A CN116824144A (en) 2023-06-30 2023-06-30 U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310789644.2A CN116824144A (en) 2023-06-30 2023-06-30 U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves

Publications (1)

Publication Number Publication Date
CN116824144A true CN116824144A (en) 2023-09-29

Family

ID=88142562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310789644.2A Pending CN116824144A (en) 2023-06-30 2023-06-30 U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves

Country Status (1)

Country Link
CN (1) CN116824144A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274607A (en) * 2023-11-23 2023-12-22 吉林大学 Multi-path pyramid-based lightweight medical image segmentation network, method and equipment
CN117274607B (en) * 2023-11-23 2024-02-02 吉林大学 Multi-path pyramid-based lightweight medical image segmentation network, method and equipment

Similar Documents

Publication Publication Date Title
Junos et al. Automatic detection of oil palm fruits from UAV images using an improved YOLO model
Wang et al. NAS-guided lightweight multiscale attention fusion network for hyperspectral image classification
CN108460391B (en) Hyperspectral image unsupervised feature extraction method based on generation countermeasure network
Peng et al. Spatial–spectral transformer with cross-attention for hyperspectral image classification
Su et al. LodgeNet: Improved rice lodging recognition using semantic segmentation of UAV high-resolution remote sensing images
Hao et al. Growing period classification of Gynura bicolor DC using GL-CNN
Ilyas et al. Multi-scale context aggregation for strawberry fruit recognition and disease phenotyping
CN116824144A (en) U-shaped perception lightweight Transformer method for segmenting small lesions of grape leaves
Khan et al. End-to-end semantic leaf segmentation framework for plants disease classification
EP3971767A1 (en) Method for constructing farmland image-based convolutional neural network model, and system thereof
CN115909052A (en) Hyperspectral remote sensing image classification method based on hybrid convolutional neural network
Yang et al. Multi-scale spatial-spectral fusion based on multi-input fusion calculation and coordinate attention for hyperspectral image classification
CN115331104A (en) Crop planting information extraction method based on convolutional neural network
CN113435254A (en) Sentinel second image-based farmland deep learning extraction method
Sun et al. RL-DeepLabv3+: A lightweight rice lodging semantic segmentation model for unmanned rice harvester
Zheng et al. An efficient mobile model for insect image classification in the field pest management
Yeswanth et al. Residual skip network-based super-resolution for leaf disease detection of grape plant
Devisurya et al. Early detection of major diseases in turmeric plant using improved deep learning algorithm
Sharma et al. Multi classification of tomato leaf diseases: A convolutional neural network model
Shi et al. F 3 Net: Fast Fourier filter network for hyperspectral image classification
CN113221913A (en) Agriculture and forestry disease and pest fine-grained identification method and device based on Gaussian probability decision-level fusion
CN116091770A (en) Grape leaf lesion image segmentation method based on cross-resolution transducer model
Shantkumari et al. Machine learning techniques implementation for detection of grape leaf disease
Yuan et al. Impact of dataset on the study of crop disease image recognition
Jia et al. Semantic segmentation of deep learning remote sensing images based on band combination principle: Application in urban planning and land use

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination