CN117173122A - Lightweight ViT-based image leaf density determination method and device - Google Patents


Info

Publication number
CN117173122A
Authority
CN
China
Prior art keywords: enhancement, image, interest, leaf density, leaf
Prior art date
Legal status
Granted
Application number
CN202311123510.3A
Other languages
Chinese (zh)
Other versions
CN117173122B (en)
Inventor
代国威
樊景超
王朝雨
Current Assignee
Agricultural Information Institute of CAAS
Original Assignee
Agricultural Information Institute of CAAS
Priority date
Filing date
Publication date
Application filed by Agricultural Information Institute of CAAS
Priority to CN202311123510.3A
Publication of CN117173122A
Application granted
Publication of CN117173122B
Legal status: Active
Anticipated expiration


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 — Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 — Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides a lightweight ViT-based image leaf density measurement method, device, electronic equipment and storage medium. The method comprises the following steps: acquiring an original image; uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest, and establishing a correspondence between the regions of interest and leaf density; performing data enhancement on the labeled original image using a pixel enhancement space, a spatial enhancement space and weather data enhancement; dividing the enhanced images into a training set, a validation set and a test set, and training a preset leaf density visual perception model with the training and validation sets to obtain a trained leaf density visual perception model; and inputting the test set into the trained leaf density visual perception model to obtain the leaf density of each region of interest. The invention expands the training samples through dual enhancement space expansion and weather data enhancement, which strengthens the generalization capability and robustness of the leaf density visual perception model and improves the leaf density measurement effect.

Description

Lightweight ViT-based image leaf density determination method and device
Technical Field
The invention relates to the technical field of orchard plant protection, in particular to an image leaf density measuring method and device based on a lightweight ViT.
Background
Grape is one of the four major fruit crops in the world, with a wide planting area and high economic benefit. According to the latest statistics of the International Organisation of Vine and Wine, the world grape planting area has stabilized at about 7.4 million hectares in recent years [1]. Manual spraying is a critical operation in grape planting, but because of its inefficiency and high workload, grape growers are turning to innovative mechanized solutions to reduce operating costs. Moreover, precise-dosage spraying helps reduce pesticide usage in precision agriculture and precision fruit planting, because only 30-40% of the pesticide is deposited on the target while most is lost to the surroundings, easily causing pollution that affects the ecology [2]. Organic agriculture is widely practiced worldwide and requires higher levels of control of pathogen and pest infestation in organic vineyards [3]. Therefore, rapid and accurate automatic spraying can reduce the amount of pesticide or fertilizer applied, helps improve the quality and efficiency of grape planting, and protects farmers' health.
The target recognition and positioning vision system of an agricultural plant protection spraying robot provides the state information required for spraying operations, but it is hindered by the grape growing environment and by environmental factors such as trellises, branches and illumination, so measuring grape leaf density is relatively difficult. Convolutional neural networks, as a general machine learning model, have been successfully applied in a variety of agricultural environments [4-7]. Compared with traditional manual feature extraction [8,9], convolutional neural networks are computational models composed of stacked processing layers that can learn feature representations at increasing levels of abstraction. In this way, the learning process autonomously builds hidden representations of the features present in the input data [10,11]. Each layer can therefore progressively represent more complex characterizations at the required level of abstraction, and the network produces low-level and high-level semantic features that stably and accurately reflect the whole data set, avoiding the instability of combining known hand-crafted feature information and effectively overcoming the low working efficiency and low classification accuracy of traditional manual feature extraction [12,13]. In practice, selecting an appropriate sample size is important for good feature classification, but collecting data samples is constrained by environment, time period and cost, so data enhancement techniques are widely adopted in deep learning research at home and abroad to expand data and mitigate overfitting [14-16]. Because the natural agricultural environment is complex and harsh, traditional data enhancement methods can introduce redundant features that negatively affect classification. In addition, agricultural tasks depend more heavily on weather conditions than other tasks, so data enhancement methods need to be adapted to actual agricultural conditions to meet the requirements of vineyard spraying operations. Recently, Vision Transformers (ViT) have proven effective for different agricultural tasks, and the ViT architecture exhibits excellent performance compared with the most advanced convolutional networks [17,18]. However, directly applying ViT to grape leaf densitometry requires more data samples to support ViT's capacity for feature learning [19,20]. Furthermore, applying a deep learning model on an agricultural robot requires consideration of the computational resources the model consumes, especially in intelligent and informatized processing based on Internet of Things platforms. Since ViT has too many parameters and high floating point operations (FLOPs) and is therefore unsuitable for deployment on agricultural equipment, improving performance by exploring lightweight ViT methods is considered a new trend in agricultural robot model development for modern precision agriculture.
Reference to the literature
[1]ZHU M,LIU Z,ZENG Y,et al.Nordihydroguaiaretic acid reduces postharvest berry abscission in grapes[J].Postharvest Biology and Technology,2022,183:111748.
[2]ABBAS I,LIU J,FAHEEM M,et al.Different sensor based intelligent spraying systems in Agriculture[J].Sensors and Actuators A:Physical,2020,316:112265.
[3]BEAUMELLE L,GIFFARD B,TOLLE P,et al.Biodiversity conservation,ecosystem services and organic viticulture:A glass half-full[J].Agriculture,Ecosystems&Environment,2023,351:108474.
[4]YAN J,ZHAO J,CAI Y,et al.Improving multi-scale detection layers in the deep learning network for wheat spike detection based on interpretive analysis[J].Plant Methods,2023,19(1):46.
[5]ATTRI I,AWASTHI L K,SHARMA T P,et al.A review of deep learning techniques used in agriculture[J].Ecological Informatics,2023,77:102217.
[6]ZHANG J,QI C,MECHA P,et al.Pseudo high-frequency boosts the generalization of a convolutional neural network for cassava disease detection[J].Plant Methods,2022,18(1):136.
[7]DAI G,FAN J,DEWI C.ITF-WPI:Image and text based cross-modal feature fusion model for wolfberry pest recognition[J].Computers and Electronics in Agriculture,2023,212:108129.
[8]PINTO D L,SELLI A,TULPAN D,et al.Image feature extraction via local binary patterns for marbling score classification in beef cattle using tree-based algorithms[J].Livestock Science,2023,267:105152.
[9]XU P,FU L,XU K,et al.Investigation into maize seed disease identification based on deep learning and multi-source spectral information fusion techniques[J].Journal of Food Composition and Analysis,2023,119:105254.
[10]GILL H S,MURUGESAN G,MEHBODNIYA A,et al.Fruit type classification using deep learning and feature fusion[J].Computers and Electronics in Agriculture,2023,211:107990.
[11]DAI G,HU L,FAN J,et al.A Deep Learning-Based Object Detection Scheme by Improving YOLOv5 for Sprouted Potatoes Datasets[J].IEEE Access,2022,10:85416–85428.
[12]GAJERA H K,NAYAK D R,ZAVERI M A.A comprehensive analysis of dermoscopy images for melanoma detection via deep CNN features[J].Biomedical Signal Processing and Control,2023,79:104186.
[13]JIANG Y,CHEN Y,YANG F,et al.State of health estimation of lithium-ion battery with automatic feature extraction and self-attention learning mechanism[J].Journal of Power Sources,2023,556:232466.
[14]ZHOU Z,ZHANG J,GONG C,et al.Automatic tunnel lining crack detection via deep learning with generative adversarial network-based data augmentation[J].Underground Space,2023,9:140–154.
[15]DANG F,CHEN D,LU Y,et al.YOLOWeeds:A novel benchmark of YOLO object detectors for multi-class weed detection in cotton production systems[J].Computers and Electronics in Agriculture,2023,205:107655.
[16]ZHANG T,DAI J,SONG W,et al.OSLPNet:A neural network model for street lamp post extraction from street view imagery[J].Expert Systems with Applications,2023,231:120764.
[17]SALAMAI A A,AJABNOOR N,KHALID W E,et al.Lesion-aware visual transformer network for Paddy diseases detection in precision agriculture[J].European Journal of Agronomy,2023,148:126884.
[18]HEO J,SEO S,KANG P.Exploring the differences in adversarial robustness between ViT-and CNN-based models using novel metrics[J].Computer Vision and Image Understanding,2023,235:103800.
[19]GAO Y,LI X,GAO L.A Multi-level spatial feature fusion-based transformer for intelligent defect recognition with small samples toward smart manufacturing system[J].International Journal of Computer Integrated Manufacturing,Taylor&Francis,2023,0(0):1–14.
[20]WENSEL J,ULLAH H,MUNIR A.ViT-ReT:Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos[J].IEEE Access,2023,11:72227–72249.
Disclosure of Invention
The invention aims to provide a lightweight ViT-based image leaf density measuring method, a lightweight ViT-based image leaf density measuring device, electronic equipment and a storage medium, so that the problems in the prior art are solved.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
in one aspect, the invention provides a lightweight ViT-based image leaf density measurement method, which comprises the following steps:
acquiring an original image;
uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest, and establishing a corresponding relation between the regions of interest and leaf density;
performing data enhancement on the labeled original image using a pixel enhancement space, a spatial enhancement space and weather data enhancement, respectively, to obtain enhanced images;
dividing the enhanced image into a training set, a verification set and a test set, and training a preset visual perception model of leaf density by using the training set and the verification set to obtain a trained visual perception model of leaf density; the visual perception model of the preset leaf density is formed by constructing a GLDCNet model comprising a lightweight ViT framework;
and inputting the test set into the trained visual perception model of the leaf density to obtain the leaf density of each region of interest, and evaluating the accuracy of the measurement result.
Illustratively, the dividing the enhanced image into a training set, a validation set, and a test set includes:
dividing the enhanced image into training data and test data according to the proportion of 90% to 10%;
dividing the training data into a training set and a verification set, and taking the test data as a test set.
Illustratively, the pixel enhancement space contains six data enhancement operations: brightness, contrast, hue separation (posterization), Gaussian blur, Gaussian noise, and sharpness. The spatial enhancement space contains eight data enhancement operations: rotation, scaling, translation along the x axis, translation along the y axis, affine transformation along the x axis, affine transformation along the y axis, horizontal flipping and vertical flipping. The weather data enhancement contains six data enhancement operations: shooting blur, device interference, lens occlusion, real fog, sun exposure, and plant-to-plant occlusion.
Performing data enhancement on the labeled original image using the pixel enhancement space, the spatial enhancement space and the weather data enhancement to obtain enhanced images comprises the following steps:
the pixel enhancement space and the spatial enhancement space together contain fourteen data enhancement operations, which are organized into 7 data enhancement branches and 1 original image branch;
each data enhancement branch sequentially executes no more than two data enhancement operations on the original image, while the original image branch performs no operation on the original image;
the sequentially executed enhancement operations are sampled from the two enhancement spaces and, after sampling, are executed in a random order to obtain first enhanced images;
weather data enhancement operations are randomly executed five times on the original image to obtain second enhanced images;
and the first enhanced images and the second enhanced images are combined to form the final enhanced images.
Illustratively, the weather data enhancement includes six data enhancement operations, namely shooting blur, device interference, lens occlusion, real fog, sun exposure, and plant-to-plant occlusion:
shooting blur is simulated with image motion blur, device interference is simulated by downscaling and then upscaling the image, lens occlusion is simulated with an image mud-splash transform, real fog is simulated with a random fog transform, sun exposure is simulated with image illumination, and plant-to-plant occlusion is simulated with an image shadow transform.
Exemplary, the uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest, and establishing a correspondence between the regions of interest and leaf density includes:
if the region of interest has no leaves, marking the leaf density of the region of interest as 0% -10%;
if the leaf coverage area of the region of interest does not exceed half of the area of the region of interest, marking that the leaf density of the region of interest is 15% -33%;
if the leaf coverage area of the region of interest exceeds half of the area of the region of interest but is smaller than the whole area of the region of interest, marking the leaf density of the region of interest as 45% -66%;
and if the leaves of the region of interest cover the whole area of the region of interest, marking that the leaf density of the region of interest is 85% -100%.
Illustratively, the method further comprises:
controlling the amount of the spraying agent or fertilizer according to the leaf density of each region of interest.
Another aspect of the present invention provides a lightweight ViT based image leaf densitometry apparatus, the apparatus comprising:
The acquisition module is used for acquiring an original image;
the labeling module is used for uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest and establishing a corresponding relationship between the regions of interest and leaf density;
the image enhancement module is used for performing data enhancement on the labeled original image using a pixel enhancement space, a spatial enhancement space and weather data enhancement to obtain enhanced images;
the training module is used for dividing the enhanced image into a training set, a verification set and a test set, and training a preset visual perception model of leaf density by utilizing the training set and the verification set to obtain a trained visual perception model of leaf density; the visual perception model of the preset leaf density is formed by constructing a GLDCNet model comprising a lightweight ViT framework;
and the testing module is used for inputting the testing set into the trained visual perception model of the leaf density to obtain the leaf density of each region of interest, and evaluating the precision of the measurement result.
Another aspect of the present invention provides an electronic device, including: one or more processors;
and a storage unit for storing one or more programs that, when executed by the one or more processors, enable the one or more processors to implement the method as described above.
Another aspect of the invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor is capable of implementing a method as described above.
The beneficial effects of the invention are as follows:
According to the invention, the sample capacity of the leaf density images is enlarged by combining dual enhancement space expansion with the weather data enhancement method, thereby strengthening the generalization capability and robustness of the leaf density visual perception model. The leaf density visual perception model obtained by training the lightweight ViT model automatically and effectively extracts high-frequency local feature representations, and uses a dual-branch structure to mix high-frequency and low-frequency information to form the grape leaf density features of the regions of interest. The semantic analysis of the feature extraction layer is performed with t-SNE and histogram methods, improving the transparency of the model from the multidimensional and frequency-domain distribution space. The method can efficiently obtain image leaf density data.
Drawings
FIG. 1 is a schematic flow chart of an image leaf density measurement method based on a lightweight ViT;
FIG. 2 is a schematic diagram of a lightweight ViT-based image leaf density measuring device according to the present invention;
FIG. 3 is a schematic view of four main sample images of the grape leaf density classification dataset in the lightweight ViT-based image leaf density measurement method according to the present invention: (a) grape leaf images with leaf density greater than 90%; (b) grape leaf images with leaf density greater than 50%; (c) grape leaf images with leaf density less than 50%; (d) images with no grape leaves;
FIG. 4 is a schematic view of a sample image of a region of interest division of grape leaf density in a lightweight ViT-based image leaf density measurement method;
fig. 5 is a schematic diagram of a network framework structure of a grape leaf density classification model based on ViT (GLDCNet) in the lightweight ViT-based image leaf density measurement method;
fig. 6 is a schematic diagram of sample images of the dual enhancement space expansion data enhancement method in the lightweight ViT-based image leaf density measurement method, wherein the first image from the right in the second row (the thick-framed image) is the original image branch;
FIG. 7 is a schematic diagram of a sample image of a weather data enhancement method adapted to an agricultural operating environment in a lightweight ViT-based image leaf density measurement method according to the present invention;
FIG. 8 is a histogram analysis chart of semantic feature vectors of different grape leaf density images obtained by adopting the lightweight ViT-based image leaf density determination method;
FIG. 9 is a schematic diagram of a confusion matrix on a grape leaf density dataset test set based on a lightweight ViT image leaf density measurement method;
fig. 10 is a schematic diagram of a clustering visualization result of a grape leaf density image dataset obtained by adopting the lightweight ViT-based image leaf density measurement method under a t-SNE algorithm.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the invention.
Based on the problems existing in the prior art, the invention provides a light-weight ViT-based vineyard accurate leaf density measuring method. Firstly, in order to overcome the limitation of a ViT classification model that a large data set is required, a fusion data enhancement method of dual enhancement space expansion and adaptation to an agricultural operation environment is adopted to expand the existing grape leaf density data set. Grape leaf density was then identified using the GLDCNet model based on the lightweight ViT architecture.
As shown in fig. 1, the invention provides an image leaf density measurement method based on a lightweight ViT, which comprises the following steps:
s1, acquiring an original image.
The basic data for the tests of the invention is a vineyard leaf density image data set for intelligent and precise vineyard spraying (leaf density not yet determined), which contains 475 original images of 640 x 480 pixels in JPG format. The data were collected from a vineyard using an RGB stereo camera mounted on a static robot manipulator and stored in video format. The data were sampled every five frames, and the images were extracted from a single shot in the recording file, which reduces the correlation between images and avoids images with similar content. As shown in fig. 3, the grape leaf density image data are divided into four main parts according to the density composition of the grape leaves; these four main leaf density image types cover the whole image processing process of the robot spraying operation, and accurately identifying the grape leaf density in the image during different operation stages is an important factor for controlling the precise spraying operation. The leaf density values were assigned by human visual inspection.
S2, uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest, and establishing a corresponding relation between the regions of interest and the leaf density.
In a spray irrigation system, sprayer control needs to consider the leaf area index: the larger the leaf area, the more agent or fertilizer is sprayed onto the leaves, which better helps the plant accelerate disease control and nutrient absorption. By classifying the canopy leaf density of each region, the controller adjusts the pesticide dosage to the amount corresponding to the actual leaf area. Therefore, using the Labelme image annotation tool, each image was manually annotated based on the experience of plant protection experts, and each equally divided region is linked to a specific leaf density value. Because the label information generated by Labelme contains considerable redundant data that is not conducive to further development of machine learning methods, the specific label files only retain the leaf density information of the corresponding regions. As shown in fig. 4, the spraying system in the test setup of the invention has three spray heads corresponding to three specific spray coverage areas, so the grape leaf image is defined as three regions of interest. Their sizes should be kept consistent considering the ease with which the control system adjusts the spray amount, and the leaf density values of each region are then divided according to the leaf density requirements.
Exemplary, the uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest, and establishing a correspondence between the regions of interest and leaf density includes:
If the region of interest has no leaves, marking the leaf density of the region of interest as 0% -10%;
if the leaf coverage area of the region of interest does not exceed half of the area of the region of interest, marking that the leaf density of the region of interest is 15% -33%;
if the leaf coverage area of the region of interest exceeds half of the area of the region of interest but is smaller than the whole area of the region of interest, marking the leaf density of the region of interest as 45% -66%;
and if the leaves of the region of interest cover the whole area of the region of interest, marking that the leaf density of the region of interest is 85% -100%.
Specifically, by observing and analyzing the grape leaf image data, the grape leaf density can be divided into four levels, each level corresponding to one leaf density. The classifier selected in the invention distinguishes four types of leaf density: 0%, 33%, 66% and 100%. Specifically, when the region has no leaves, it is considered 0%. When there are some leaves, but only a few never covering more than half of the total area, it is 33%. If the majority of the region is covered by grape leaves, but not completely, it is considered 66%, and when the entire region is filled by grape leaves, it is considered 100%.
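To make the labeling rule concrete, the following is a minimal Python sketch of how an estimated leaf coverage fraction of a region of interest could be mapped to the four density classes described above; the function name and the exact cut-off values are illustrative assumptions, not part of the original label files.

def coverage_to_density_class(coverage: float) -> int:
    """Map the estimated fraction of a region of interest covered by grape
    leaves (0.0-1.0) to one of the four density classes.

    Returns the class label as a percentage: 0, 33, 66 or 100. The boundaries
    (no leaves / below half / above half but not full / completely filled)
    follow the labeling rule in the description; the thresholds are illustrative.
    """
    if coverage <= 0.0:          # no leaves in the region
        return 0
    elif coverage <= 0.5:        # some leaves, never more than half the area
        return 33
    elif coverage < 1.0:         # majority covered, but not completely
        return 66
    else:                        # the whole region is filled by grape leaves
        return 100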
S3, performing data enhancement on the labeled original image using the pixel enhancement space, the spatial enhancement space and weather data enhancement to obtain enhanced images.
Rich data is the foundation of deep learning for solving computer vision problems, and data enhancement is a data processing method commonly used in deep learning that can improve data diversity and thereby strengthen the performance of the trained model. The grape leaf density image data set of the invention contains 475 original images; because the GLDCNet lightweight ViT architecture divides each original image into three components before input to the network, the total number of actual input images is 1425. ViT lacks inductive bias when trained on small datasets, resulting in poor generalization performance, although ViT generally performs better than convolutional neural networks when tested on large datasets [22] (reference [22]: USMAN M, ZIA T, TARIQ A. Analyzing Transfer Learning of Vision Transformers for Interpreting Chest Radiography[J]. Journal of Digital Imaging, 2022, 35(6): 1445-1462.). Since the ViT structure is used for classifying grape leaf density, the original data set needs to be further expanded. Rather than blindly expanding the data, the invention provides a fusion data enhancement method combining dual enhancement space expansion and adaptation to the agricultural operating environment.
Illustratively, the pixel enhancement space contains six data enhancement operations: brightness, contrast, hue separation (posterization), Gaussian blur, Gaussian noise, and sharpness. The spatial enhancement space contains eight data enhancement operations: rotation, scaling, translation along the x axis, translation along the y axis, affine transformation along the x axis, affine transformation along the y axis, horizontal flipping and vertical flipping. The weather data enhancement contains six data enhancement operations: shooting blur, device interference, lens occlusion, real fog, sun exposure, and plant-to-plant occlusion. Performing data enhancement on the labeled original image with the pixel enhancement space, the spatial enhancement space and the weather data enhancement to obtain enhanced images includes: the pixel enhancement space and the spatial enhancement space together contain fourteen data enhancement operations, which are organized into 7 data enhancement branches and 1 original image branch; each data enhancement branch sequentially executes no more than two data enhancement operations on the original image, while the original image branch performs no operation; the sequentially executed enhancement operations are sampled from the two enhancement spaces and executed in a random order after sampling to obtain first enhanced images; weather data enhancement operations are randomly executed five times on the original image to obtain second enhanced images; and the first and second enhanced images are combined to form the final enhanced images.
Specifically, the dual enhancement space expansion applies automatic data enhancement processing to the original image in the pixel enhancement space and the spatial enhancement space. As shown in Table 1, the pixel enhancement space contains six data enhancement operations and the spatial enhancement space contains eight, for a total of 14 data enhancement operations. To better suit the field of agricultural image analysis, a novel operation sampling strategy is proposed. The 14 data enhancement operations are organized into data enhancement branches and an original image branch; each data enhancement branch consists of no more than two sequentially executed data enhancement operations, and the original image branch performs no operation on the original image. The sequentially executed enhancement operations are sampled from the two enhancement spaces and executed in random order after sampling. The magnitude parameters of each data enhancement operation of the dual enhancement space expansion method are shown in Table 1, and the execution probability of each operation is 0.2. The data set is processed into 7 enhancement branches and one original image branch, so the enhanced data set contains 3800 image samples and the total number of actual input images is 11400. The data enhancement operations of the invention are implemented with the Albumentations software package, a commonly used data enhancement library that implements a wide variety of data enhancement operations. Fig. 6 shows the resulting sample images.
Table 1 is a dual enhancement spatial extension data enhancement method.
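A minimal sketch of this branch-and-sample strategy is shown below, assuming a recent version of the Albumentations library as referenced above; the transform classes and magnitude values stand in for the operations and parameters of Table 1 (which is not reproduced here) and are illustrative only.

import random
import albumentations as A

# Pixel enhancement space: six operations (brightness, contrast, posterize,
# Gaussian blur, Gaussian noise, sharpness). Magnitudes are illustrative.
pixel_space = [
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=1.0),
    A.RandomBrightnessContrast(brightness_limit=0.0, contrast_limit=0.2, p=1.0),
    A.Posterize(num_bits=4, p=1.0),
    A.GaussianBlur(blur_limit=(3, 7), p=1.0),
    A.GaussNoise(p=1.0),
    A.Sharpen(p=1.0),
]

# Spatial enhancement space: eight operations (rotation, scaling, x/y translation,
# x/y affine transform, horizontal and vertical flip).
spatial_space = [
    A.Rotate(limit=15, p=1.0),
    A.RandomScale(scale_limit=0.2, p=1.0),
    A.Affine(translate_percent={"x": (-0.1, 0.1)}, p=1.0),
    A.Affine(translate_percent={"y": (-0.1, 0.1)}, p=1.0),
    A.Affine(shear={"x": (-10, 10)}, p=1.0),
    A.Affine(shear={"y": (-10, 10)}, p=1.0),
    A.HorizontalFlip(p=1.0),
    A.VerticalFlip(p=1.0),
]

def augment_branches(image, n_branches=7, max_ops=2, op_prob=0.2):
    """Return one original-image branch plus n_branches augmented copies.

    Each branch samples at most `max_ops` operations from the two enhancement
    spaces, shuffles them into a random order, and applies each operation with
    probability `op_prob` (0.2 in the description)."""
    outputs = [image]                       # original image branch: no operation
    for _ in range(n_branches):
        ops = random.sample(pixel_space + spatial_space, k=max_ops)
        random.shuffle(ops)                 # re-order the sampled operations
        pipeline = A.Compose([A.OneOf([op], p=op_prob) for op in ops])
        outputs.append(pipeline(image=image)["image"])
    return outputs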
The dual enhancement space expansion does some work on expanding the data volume, but it does not consider the data deviation caused by the image background and the specific environment; therefore, a data enhancement method adapted to the agricultural operating environment, called weather data enhancement, is also adopted.
Illustratively, the weather data enhancement includes six data enhancement operations, namely shooting blur, device interference, lens occlusion, real fog, sun exposure, and plant-to-plant occlusion:
shooting blur is simulated with image motion blur, device interference is simulated by downscaling and then upscaling the image, lens occlusion is simulated with an image mud-splash transform, real fog is simulated with a random fog transform, sun exposure is simulated with image illumination, and plant-to-plant occlusion is simulated with an image shadow transform.
As shown in Table 2, the weather data enhancement method consists of six data enhancement operations. Specifically, motion blur is applied to simulate the blurring of camera images caused by the agricultural robot moving during operation. Downscaling and then upscaling the image simulates the degraded image quality caused by interference between the robot's sensors and actuators. The mud-splash transform simulates lens occlusion by mud splashed during the robot's spraying operation. The random fog transform simulates fog produced in the natural environment by high temperature and humidity. Illumination is used to simulate natural sunlight. The shadow transform simulates the occlusion caused by closely spaced grape plants. The actual effect of the six data enhancement operations on an image is shown in fig. 7.
The six data enhancement methods of Table 2 were randomly combined; the execution probability of each method was set to 0.5, and 5 enhancements were performed per image. Processing the 475 grape leaf density images finally yields 2375 image samples. Together with the 3800 image samples of the dual enhancement space expansion method, the total is 6175 image samples, and the total number of images actually input into the network model is 18525.
Table 2 is a data enhancement method adapted to the agricultural work environment.
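A minimal sketch of this weather data enhancement stage, under the same Albumentations assumption, follows; the transform classes chosen to mimic shooting blur, device interference, lens occlusion, fog, sunlight and shadow, and their parameters, are illustrative and do not reproduce the exact settings of Table 2.

import albumentations as A

# Six weather-style operations, each executed with probability 0.5,
# matching the operations described in Table 2 (parameters illustrative).
weather_ops = A.Compose([
    A.MotionBlur(blur_limit=7, p=0.5),     # shooting blur of a moving robot
    A.Downscale(p=0.5),                    # device interference: shrink then enlarge
    A.Spatter(mode="mud", p=0.5),          # lens occlusion by mud splash
    A.RandomFog(p=0.5),                    # real fog from temperature and humidity
    A.RandomSunFlare(p=0.5),               # sun exposure in the natural environment
    A.RandomShadow(p=0.5),                 # plant-to-plant occlusion via shadows
])

def weather_augment(image, n_copies=5):
    """Apply the randomly combined weather operations n_copies times per image."""
    return [weather_ops(image=image)["image"] for _ in range(n_copies)]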
S4, dividing the enhanced images into a training set, a validation set and a test set, and training the preset leaf density visual perception model with the training and validation sets to obtain a trained leaf density visual perception model; the preset leaf density visual perception model is constructed as a GLDCNet model based on a lightweight ViT architecture.
Illustratively, the dividing the enhanced image into a training set, a validation set, and a test set includes:
dividing the enhanced image into training data and test data according to the proportion of 90% to 10%;
dividing the training data into a training set and a verification set, and taking the test data as a test set.
For ease of testing, the enhanced data set was divided into three parts: a training set, a validation set and a held-out test set; the training-plus-validation data and the test data were split 90% and 10%, respectively.
The invention applies computer vision technology to an automatic spraying robot, so both the computational cost and the accuracy of the visual perception model must be considered. As shown in fig. 5, the invention proposes a lightweight ViT architecture called GLDCNet for image processing tasks in complex agricultural environments. The image data are obtained by the agricultural robot photographing specific grape leaf images; before being input into the network, each single grape leaf image is uniformly divided into three parts from top to bottom. The three parts of a single input image are processed independently, the recognition result of the model for each input part corresponds to a specific grape leaf density, and the three obtained grape leaf density values are used by the agricultural robot to control the spraying operation of the different grape leaf canopy areas. Specifically, the visual perception model for grape leaf density consists of four stages; each Stage contains ColBlock and ConvFFN, the initial PatchEmbedding converts the 2-dimensional image into the corresponding 1-dimensional patch embeddings to obtain Token vectors, and the Classifier at the end of the model outputs the grape leaf density classification information.
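A minimal sketch of this inference flow, assuming a PyTorch classification model and a single-image tensor, is given below; the function name, preprocessing choices and output handling are placeholders, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def predict_leaf_density(model: torch.nn.Module, image: torch.Tensor):
    """Split one grape leaf image into three regions of interest from top to
    bottom, resize each to 224x224 and classify it independently.

    `image` is a float tensor of shape (3, H, W); the three returned class
    indices correspond to the densities of the three spray coverage areas."""
    c, h, w = image.shape
    third = h // 3
    regions = [image[:, i * third:(i + 1) * third, :] for i in range(3)]
    densities = []
    model.eval()
    with torch.no_grad():
        for roi in regions:
            roi = F.interpolate(roi.unsqueeze(0), size=(224, 224),
                                mode="bilinear", align_corners=False)
            logits = model(roi)                 # (1, 4): one logit per density class
            densities.append(int(logits.argmax(dim=1)))
    return densities                            # e.g. [0, 2, 3] -> 0%, 66%, 100%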
ColBlock solves the problem of enabling the model to perceive both high-frequency and low-frequency information. It consists of a local branch and a global branch. In the global branch, K and V are first downsampled, and then the standard attention procedure is performed on Q, K and V to extract low-frequency global information. The low-frequency global information can be expressed as:
X_global = Attention(Q_g, Pool(K_g), Pool(V_g))    (1)
where X_global denotes the low-frequency global information, Q_g denotes the low-frequency global query, K_g the key, and V_g the low-frequency global value. Attention denotes standard attention, which captures the interdependencies between inputs and flexibly weights different parts of the input so that the model can focus on the most relevant information. Pool denotes a pooling operation that reduces the number of parameters and implements the downsampling of K and V. However, while the global branch can effectively capture low-frequency global information, its ability to process high-frequency local information is inadequate. To solve this problem, high-frequency local information is processed by introducing a local branch and using local feature aggregation with local self-attention and globally shared weights. Before the local self-attention and local feature aggregation can be performed, the same linear transformation as in standard attention is used to obtain Q, K and V. The high-frequency local information can be expressed as:
X_local = Self-Attn ⊙ V_s    (2)
where X_local denotes the high-frequency local information, Self-Attn denotes the local self-attention, ⊙ denotes the Hadamard product, and V_s denotes the local feature aggregation. For local feature aggregation, as shown in equation (3), depthwise convolution (Depthwise Convolution, DWconv) is employed to aggregate local information with globally shared weights. For local self-attention, as shown in equations (4) and (5), Q, K and V are the different Transformer vectors over the Token channels: DWconv is first applied to Q and K to aggregate local information, the Hadamard product of Q and K is computed, and an FC linear transformation is applied to the product to obtain context-aware weights whose values range between -1 and 1. Finally, the generated weights undergo local feature enhancement through the Swish and Tanh activation functions. Compared with the usual local self-attention generation process, the Swish and Tanh activation functions provide stronger nonlinearity than the Softmax operator used in ordinary self-attention, which leads to higher-quality context-aware weights. Self-Attn_t denotes the self-attention value.
V_s = DWconv(V)    (3)
Self-Attn_t = FC(Swish(FC(DWconv(Q) ⊙ DWconv(K))))    (4)
ColBlock finally needs to fuse the high-frequency and low-frequency information. Specifically, the local branch and the global branch are concatenated along the channel dimension, and a fully connected layer is then applied to the fused channel dimension. Finally, a residual connection similar to that of ResNet connects the data from before the local and global branches to complete the extraction of important feature information.
X_out = FC(Concat(X_local, X_global))    (6)
where X_out denotes the final output result and Concat denotes the concatenation operation.
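To make equations (1)-(4) and (6) concrete, the following is a minimal PyTorch sketch of a ColBlock-style dual branch under simplifying assumptions (token tensors of shape (B, N, C), a square token grid, a single attention head, pooling stride 2); it illustrates the described computation and is not the authors' exact implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColBlockSketch(nn.Module):
    """Dual-branch mixing of low-frequency global and high-frequency local
    information, following equations (1)-(4) and (6)."""

    def __init__(self, dim: int, pool_stride: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)           # shared linear mapping to Q, K, V
        self.pool = nn.AvgPool2d(pool_stride)        # downsampling of K and V (global branch)
        self.dw_q = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # DWconv on Q
        self.dw_k = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # DWconv on K
        self.dw_v = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # V_s = DWconv(V), eq. (3)
        self.fc_in = nn.Linear(dim, dim)
        self.fc_out = nn.Linear(dim, dim)
        self.fuse = nn.Linear(dim * 2, dim)          # FC over the concatenated branches, eq. (6)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        h = w = int(math.sqrt(n))                    # assume a square token grid
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def to_map(t):                               # (B, N, C) -> (B, C, H, W)
            return t.transpose(1, 2).reshape(b, c, h, w)

        def to_seq(t):                               # (B, C, H', W') -> (B, N', C)
            return t.flatten(2).transpose(1, 2)

        # global branch: eq. (1), standard attention with pooled K and V
        k_p = to_seq(self.pool(to_map(k)))
        v_p = to_seq(self.pool(to_map(v)))
        attn = torch.softmax(q @ k_p.transpose(1, 2) / math.sqrt(c), dim=-1)
        x_global = attn @ v_p

        # local branch: eqs. (2)-(4), context-aware weights from DWconv(Q) ⊙ DWconv(K)
        q_l = to_seq(self.dw_q(to_map(q)))
        k_l = to_seq(self.dw_k(to_map(k)))
        v_s = to_seq(self.dw_v(to_map(v)))
        weights = torch.tanh(self.fc_out(F.silu(self.fc_in(q_l * k_l))))  # Swish then Tanh, bounded to [-1, 1]
        x_local = weights * v_s                      # eq. (2): Hadamard product with V_s

        # fusion: eq. (6), channel concatenation, FC and residual connection
        return x + self.fuse(torch.cat([x_local, x_global], dim=-1))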
ConvFFN addresses the inability of the feedforward neural network (Feedforward Neural Network, FFN) to fuse local information. ConvFFN has two structures, in-stage and downsampling, both of which apply depthwise convolution (DWconv) to a portion of the features after the GELU activation function, enabling ConvFFN to better aggregate local information. This design allows ConvFFN to downsample directly within the FFN without introducing a PatchMerge module to integrate local image information for global perception. In-stage ConvFFN: this type of ConvFFN directly uses skip connections. A skip connection is a technique in which input information bypasses some layers of a neural network and is passed directly to deeper layers. The in-stage transfer structure can transfer and integrate local information more efficiently, thereby achieving better information flow in the network. Downsampling ConvFFN: this type of ConvFFN employs DWconv and a fully connected layer in the skip connection for the downsampling and up-projection of the input signal, respectively. DWconv is a form of depthwise convolution that can efficiently process local information, while the fully connected layer helps map features to higher-dimensional representations. Through this skip-connection design, the network can better integrate local information and perform efficient information transfer between different stages of the network.
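Under the same assumptions, a minimal sketch of the in-stage ConvFFN variant (FC, GELU, DWconv, FC with an identity skip connection) might look like the following; the downsampling variant would replace the identity skip with a DWconv-plus-FC path as described above.

import torch
import torch.nn as nn

class ConvFFNSketch(nn.Module):
    """In-stage ConvFFN: a feed-forward block that aggregates local
    information with a depthwise convolution after the GELU activation."""

    def __init__(self, dim: int, hidden: int, h: int, w: int):
        super().__init__()
        self.h, self.w = h, w
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # DWconv
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, C), N = h * w
        b, n, c = x.shape
        y = self.act(self.fc1(x))
        y = y.transpose(1, 2).reshape(b, -1, self.h, self.w)   # tokens -> feature map
        y = self.dwconv(y).flatten(2).transpose(1, 2)          # aggregate local information
        return x + self.fc2(y)                                 # skip connection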
S5, inputting the test set into the trained leaf density visual perception model to obtain the leaf density of each region of interest, and evaluating the accuracy of the measurement results.
During the research, the tests were carried out in a Jupyter Notebook on Kaggle, using the PyTorch 1.13.1+cu117 deep learning framework and the computer vision library OpenCV as the main tools, accelerated by GPU; Matplotlib was chosen for plotting, and the graphics card was an NVIDIA Tesla P100 with 16 GB of memory. Considering the input image size and the computational resource limits of the Kaggle platform, the batch size for network training was 32, the optimizer was AdamW with a weight decay of 0.005 and a momentum parameter of 0.9, and the dynamic learning rate adjustment used a cosine annealing warm restart strategy (Cosine Annealing Warm Restart). By default, the input size of each of the 3 uniformly divided parts of the original image was adjusted to 224x224. In addition, the training process used early stopping to monitor the model output, which effectively prevents overfitting.
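The training configuration described above can be summarised in the following sketch; the learning rate, restart period and dataset objects are illustrative assumptions, while the batch size, weight decay, momentum, and scheduler type follow the description.

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
from torch.utils.data import DataLoader

def build_training(model: nn.Module, train_set, val_set, lr=1e-4):
    """Reproduce the reported training setup: batch size 32, AdamW with weight
    decay 0.005 and momentum (beta1) 0.9, cosine annealing with warm restarts.
    The learning rate and T_0 are illustrative; training would additionally
    monitor validation loss for early stopping."""
    train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=32)
    optimizer = optim.AdamW(model.parameters(), lr=lr,
                            betas=(0.9, 0.999), weight_decay=0.005)
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10)
    return train_loader, val_loader, optimizer, scheduler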
The measurement and classification of grape leaf density can be evaluated with different performance indicators. The tests use Accuracy, Precision and Sensitivity for evaluation. Here, the measure of grape leaf density classification performance depends strongly on the accuracy, i.e. the ratio of the number of correctly predicted grape leaf densities to the total number of samples in the test dataset, also called the detection probability. Precision is defined as the ability of the classifier to accurately classify the grape leaf density. Recall, i.e. sensitivity, reflects the ability of the classifier to detect the grape leaf density and is a primary performance indicator of detection capability. The Matthews correlation coefficient (MCC) is a measure of the quality of binary (two-class) classification that reflects the importance of precision and recall at the same time. It evaluates the performance of the classification algorithm especially when the grape leaf density classes are unbalanced, and can reflect the comprehensive performance of the classifier from another aspect. In addition, the floating point operations (FLOPs) and the parameter count (Parameters) of the model serve as important indicators for evaluating the complexity of the algorithm.
where TP (True Positive) denotes the number of correctly classified positive samples; TN (True Negative) denotes the number of correctly classified negative samples; FP (False Positive) denotes the number of samples incorrectly classified as positive; and FN (False Negative) denotes the number of samples incorrectly classified as negative.
Here the denominator is the number of samples that are actually positive, and the numerator is the number of those samples that are correctly predicted.
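The evaluation formulas themselves are not reproduced in the text above; written out in their standard form, consistent with the TP/TN/FP/FN definitions given, they are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))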
Under the new demands of Agriculture 4.0, automated spraying is a very complex task in precision agriculture: it requires a computer vision based system to distinguish plant leaf density and perform the spraying operation accordingly in real time. For accurate measurement of grape leaf density, the invention provides an image leaf density measurement method based on a lightweight Vision Transformer (ViT) architecture. The method designs a fusion data enhancement method comprising dual enhancement space expansion and weather data enhancement expansion; the former processes the original image with pixel enhancement and spatial enhancement, while the latter realizes data enhancement from the empirical perspective of adapting to the agricultural operating environment. The two are fused to enlarge the sample capacity of the grape leaf density images, strengthening the generalization capability and robustness of the model. The lightweight ViT model automatically and effectively extracts high-frequency local feature representations and uses a dual-branch structure to mix high-frequency and low-frequency information, forming the grape leaf density features of the regions of interest. The semantic analysis of the feature extraction layer is performed with t-SNE and histogram methods, improving the transparency of the model from the multidimensional and frequency-domain distribution space. Test results show that the fusion data enhancement method effectively improves model recognition precision: compared with the individual data enhancement methods it contains, accuracy is improved by 0.55% and 3.46%, respectively. The accuracy for the four identified grape leaf density classes exceeds 94%, and the MCC reaches 90.39%. Furthermore, the proposed lightweight ViT improves accuracy by at least 0.34% over the popular MobileViT, with FLOPs of only 0.6G. The proposed method has high recognition speed and high precision, can provide effective technical support for plant protection spraying robots, and improves growers' income while reducing pesticide residues.
As shown in fig. 2, another aspect of the present invention provides an image leaf density measurement apparatus based on a lightweight ViT, the apparatus comprising:
the acquisition module 100 is configured to acquire an original image. The raw image data is collected from the vineyard using an RGB stereo camera mounted on a static robotic manipulator, and the data is stored in video format.
The labeling module 200 is configured to uniformly divide the original image into a plurality of regions of interest, label the regions of interest, and establish a correspondence between the regions of interest and the leaf density. Each image was manually annotated based on the experience of plant protection experts using the Labelme image annotation tool, linking each equally divided region to a specific leaf density value.
The image enhancement module 300 is configured to perform data enhancement on the labeled original image using the pixel enhancement space, the spatial enhancement space and weather data enhancement to obtain enhanced images. The fusion data enhancement method of dual enhancement space expansion and adaptation to the agricultural operating environment expands the existing grape leaf density data set, overcomes the ViT classification model's need for a large data set, and strengthens the generalization capability of the model.
The training module 400 is configured to divide the enhanced images into a training set, a validation set and a test set, and to train the preset leaf density visual perception model with the training and validation sets to obtain a trained leaf density visual perception model; the preset leaf density visual perception model is constructed as a GLDCNet model based on a lightweight ViT architecture. GLDCNet adopts a design combining convolutional neural networks and Transformers; it has a simple structure and effectively extracts high-frequency local feature representations using shared weights and context-aware weights. In particular, the carefully designed dual-branch structure can mix high-frequency and low-frequency information and perform information fusion. Combining GLDCNet with the fusion data enhancement method gives a clear advantage in accurately measuring grape leaf density while keeping the computational cost under control, and fits well with the new mode of intelligent agricultural production using Internet of Things technology.
And the test module 500 is used for inputting the test set into the trained visual perception model of the leaf density to obtain the leaf density of each region of interest, and evaluating the accuracy of the measurement result. Through experimental result analysis, the average accuracy rate and MCC of the identification of the grape leaf density based on the GLDCNet of Vision Transformer are 95.11% and 90.45%, respectively. The recognition accuracy of the original data set is improved by 7.23% on average after the fusion data enhancement processing, which shows that the method improves the generalization and the robustness of the GLDCNet model.
Another aspect of the present invention provides an electronic device, including: one or more processors;
and a storage unit for storing one or more programs that, when executed by the one or more processors, enable the one or more processors to implement the method as described above.
Another aspect of the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, is capable of implementing a method as described above.
By semantic feature histogram analysis, the three regions of interest composing a grape leaf image have different leaf densities under visual perception. Feature vectors are extracted from the different leaf density regions by the proposed grape leaf density classification method with the classification layer removed beforehand; the feature vectors obtained by the feature extraction layer yield important semantic features under metric learning and dimensionality reduction with the principal component analysis (Principal Component Analysis, PCA) algorithm. In order to analyze the interrelationship between the important feature vectors of different leaf densities, histogram visualization is selected for the semantic feature analysis, with a sampling interval (bin count) of 50; a minimal yet adequate descriptor is chosen to avoid longer descriptors while still displaying the differences in the spatial distribution of all grape leaf density feature vectors.
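A minimal sketch of this semantic feature analysis, assuming scikit-learn and NumPy and a feature extractor whose classification layer has been removed, is given below; the function name and the number of PCA components are illustrative.

import numpy as np
from sklearn.decomposition import PCA

def semantic_histogram(features: np.ndarray, n_components: int = 2, bins: int = 50):
    """Reduce the feature vectors of one leaf density region with PCA and
    build the histogram descriptor (50 bins, as in the description).

    `features` has shape (n_samples, feature_dim) and is produced by the
    feature extraction layer with the classification head removed."""
    reduced = PCA(n_components=n_components).fit_transform(features)
    hist, edges = np.histogram(reduced.ravel(), bins=bins)
    return hist, edges

# Histograms of different leaf density regions can then be compared directly,
# e.g. with a correlation or chi-square distance, to separate density classes
# without the final classifier.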
Fig. 8 shows the results of the histogram statistics of the semantic features of different grape leaf density images; the three components of the top image of fig. 8 (a-c) correspond to the semantic feature histograms of the three grape leaf density regions from top to bottom. It can be seen that the leaf density semantic histogram of the second part in fig. 8 (a) differs significantly from the first and third parts, because the second region visually contains grape leaves, yet these leaves are not actually meaningful within the four leaf density classes defined in this work; moreover, compared with the lowest 33% grape leaf density semantic histogram in fig. 8 (b), it shows some similarity in the semantic feature distribution but also differences in the frequency of the feature distribution. For the 66% and 100% grape leaf density semantic histograms in fig. 8 (b) and fig. 8 (c), the higher the leaf density, the more concentrated the spatial distribution of the semantic histogram is in the middle, while for the 100% grape leaf density semantic histogram the richer high-frequency saliency differences in the spatial distribution are more obvious, which demonstrates, from the semantic histogram perspective, how the model distinguishes images of different leaf density. In conclusion, the 4 types of leaf density images have different semantic histogram spatial distributions, and the important feature vectors extracted by PCA dimensionality reduction are separable for different leaf density images; even without the soft-landing solution of a classifier, different leaf densities can be distinguished directly by histogram matching of the features, which lays a practical foundation for the semantic features as a companion to the visual model.
Performance analysis of leaf density recognition verifies the performance of the proposed method for grape leaf density classification and the stability of the model on an independent test set. All experiments were carried out on the data set processed by the fusion data enhancement method of dual enhancement space expansion and adaptation to the agricultural operating environment. The results for the four grape leaf canopy densities are shown in Table 3: the 0% grape leaf canopy density (Leaf density) achieves the highest accuracy of 97.63%, the lowest is 92.84%, the average accuracy over all leaf densities is 95.30%, and the figure for the 66% canopy density is 90.39%. The 66% and 100% grape leaf canopy densities fall below the average accuracy, which is consistent with the histogram analysis of the semantic feature vectors in fig. 8 (b) and fig. 8 (c): when the continuous background exceeds the grape leaf canopy, the proposed grape leaf density classification model mistakes it for background covered by grape leaves, leading to recognition errors, which occur especially easily when the background consists of the grape leaves of other plants. Next, the performance stability of the model was verified on the data set using the k-fold cross-validation concept; the results of the 4-fold cross-validation are shown in Table 4, with an average accuracy and MCC of 95.11% and 90.45%, respectively, and the fluctuation of accuracy and MCC across folds does not exceed 2%. In addition, 200 grape leaf density samples were randomly drawn from the test set for testing, and the confusion matrix of fig. 9 shows very good performance; notably, the recognition performance of the model is stable for the four types of grape leaf density, with average accuracy between 93.50% and 95.30%.
Table 3 Performance evaluation for each class on the grape leaf density dataset
Table 4 shows the results of the k-fold cross-validation based grape leaf density test
Illustratively, the method further comprises:
controlling the amount of the spraying agent or fertilizer according to the leaf density of each region of interest.
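As a purely illustrative sketch of this control step, the following maps predicted density classes to a relative dose per spray head; the mapping values are hypothetical and not specified in the description.

# Hypothetical mapping from the four leaf density classes to a relative
# spray dose per nozzle; actual dosages would come from the spraying system.
DOSE_BY_DENSITY = {0: 0.0, 33: 0.4, 66: 0.7, 100: 1.0}

def nozzle_doses(region_densities):
    """Return the relative dose for each of the three spray heads, given the
    predicted leaf density (0/33/66/100) of its region of interest."""
    return [DOSE_BY_DENSITY[d] for d in region_densities]

# Example: nozzle_doses([0, 66, 100]) -> [0.0, 0.7, 1.0]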
Agricultural robots and automatic spraying devices emphasize the low cost and light weight of their automatic control systems; therefore, to thoroughly compare the performance of the proposed method on the grape leaf density image dataset, we select popular lightweight network structures for verification. These computer vision models currently fall into two broad categories: models based on convolutional neural networks and hybrid models based on visual attention. To this end, the performance of the SOTA models of these two categories, obtained from the TIMM and Torchvision libraries, is compared, each fine-tuned from a pre-trained model, as shown in Table 5. The convolutional neural network based models are ShuffleNetV2, MobileNetV3, EfficientNet and RegNetY, and the visual attention based hybrid models are MobileViT and MobileViTV2. Among the convolutional baselines, EfficientNet obtains the best performance (accuracy: 96.08%, MCC: 91.23%) and MobileNetV3 the worst (accuracy: 94.12%, MCC: 89.24%), a drop of 1.96% in accuracy and 1.99% in MCC. Among the ViT baselines, the proposed method performs best, obtaining 95.11% accuracy and 90.45% MCC, while MobileViTV2 gives suboptimal performance (accuracy: 94.77%, MCC: 90.33%). Overall, compared with the EfficientNet model the proposed method reduces FLOPs by 14.3% and the number of parameters by 45.8%, and compared with the MobileViT models it improves accuracy by at least 0.34%.
Table 5 Generalization performance of the proposed method relative to lightweight SOTA deep learning models
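The baseline comparison in Table 5 relies on pre-trained lightweight models obtained from the TIMM library; the sketch below shows how such baselines might be instantiated for fine-tuning on the four leaf density classes. The model identifiers are examples of TIMM names and may not be the exact variants evaluated in this work.

```python
# Sketch: instantiate lightweight baselines from TIMM for 4-class fine-tuning.
import timm
import torch

NUM_CLASSES = 4
BASELINES = [
    "mobilenetv3_large_100",  # example TIMM names; actual variants may differ
    "efficientnet_b0",
    "mobilevit_s",
    "mobilevitv2_100",
]

models = {name: timm.create_model(name, pretrained=True, num_classes=NUM_CLASSES)
          for name in BASELINES}

# Quick sanity check of parameter counts and a forward pass on a dummy batch.
x = torch.randn(2, 3, 256, 256)
for name, model in models.items():
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        logits = model(x)
    print(f"{name}: {n_params:.1f}M params, output shape {tuple(logits.shape)}")
```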
Data enhancement performance analysis
The invention further verifies the performance of the proposed fusion data enhancement method on the grape leaf density dataset. The dataset without the fusion data enhancement method is the original dataset; the dataset processed by dual enhancement space expansion is called the stage-1 dataset; the dataset processed by the data enhancement method adapted to the agricultural operation environment is called the stage-2 dataset; and the dataset combining both, i.e. the dataset processed by the fusion data enhancement method, is called the fusion dataset.
Two types of performance comparison are carried out on the enhanced datasets of the fusion data enhancement method, see Table 6. The results show that the dual enhancement space expansion method greatly improves precision: compared with the original dataset, the accuracy over the four grape leaf density classes increases by 6.68% on average, and the accuracy of each of the four classes exceeds 94%. The dataset enhanced only by the weather data enhancement method reaches 92.42% accuracy, an average improvement of 3.77% over the original dataset across the four classes, with per-class accuracy above 90.58%, yet 2.91% lower than the dual enhancement space expansion method. The original dataset processed by the fusion data enhancement method obtains 95.88% accuracy, an improvement of 7.23% over the original dataset and of 0.55% and 3.46% over the stage-1 and stage-2 datasets, respectively. The dual enhancement space expansion method outperforms the weather data enhancement method: it uses fourteen data enhancement operations from the two enhancement spaces to increase the sample size, while the weather data enhancement method strengthens the generalization ability of the model from an empirical perspective, and fusing the two produces a complementary effect. Second, for the original dataset, the small original sample size is the main cause of its lower accuracy. In addition, the vineyard lacks management, growth differences between leaves of different plants are large, and mechanical jitter of the plant protection robot and environmental interference are unavoidable. The dual enhancement space expansion and weather data enhancement methods fully take these problems into account, thereby improving the performance of grape leaf density identification.
Table 6 Ablation comparison of the fusion data enhancement method
The spatial distribution of grape leaves on the vine can be regarded here as a high-dimensional data relationship. High-dimensional data is inconvenient to observe directly, so its hidden structure is hard to discover. Given the multidimensional and nonlinear characteristics of the grape leaf density feature data, the t-distributed stochastic neighbor embedding (t-SNE) algorithm applies nonlinear dimensionality reduction to map the high-dimensional data into a three-dimensional or two-dimensional space, so that the distribution of the feature data can be displayed intuitively in a perspective or plan view. Therefore, two-dimensional t-SNE visualization is performed on the data extracted from the network feature layer, further improving the transparency and interpretability of the proposed grape leaf density recognition model.
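A minimal sketch of such a two-dimensional t-SNE visualization of feature-layer outputs is given below; the feature matrix is a random stand-in for the GLDCNet feature layer, and the perplexity and marker choices are assumptions.

```python
# Sketch: 2-D t-SNE of (stand-in) feature-layer outputs for four leaf-density classes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
features = rng.normal(size=(800, 256))   # stand-in for extracted feature-layer data
labels = rng.integers(0, 4, size=800)    # four leaf-density classes

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=42).fit_transform(features)

class_names = ["0%", "33%", "66%", "100%"]
markers = ["o", "s", "^", "x"]
for cls, (name, marker) in enumerate(zip(class_names, markers)):
    pts = embedded[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], marker=marker, s=8, label=name)
plt.legend(title="leaf density")
plt.title("t-SNE of leaf-density semantic features (illustrative)")
plt.show()
```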
The visualization results are shown in Fig. 10. The two-dimensional result in Fig. 10 (a) clearly displays the clustering of the semantic features of the four grape leaf density classes, with little overlap between sample points of different clusters. Two clusters appear for the 100% grape leaf density represented by triangles; the cluster containing fewer sample points indicates that this part of the semantic features is better distinguished and its class difference is obvious. Meanwhile, the 66% grape leaf density represented by small light circles shows the same kind of semantic samples. Overall, the sample points of the four leaf density classes progress in order, and the semantic sample points where the 66% and 100% leaf densities overlap prove to be the key samples for distinguishing the critical regions of grape leaf density. The three-dimensional visualization result in Fig. 10 explains, from a higher-dimensional space, the semantic feature sample points that overlap in Fig. 10 (a); the isolated or overlapping sample points are seen to be separable, which demonstrates the feature extraction ability of the model for local information in both high-frequency and low-frequency features.
To meet the development requirement of precise spraying by plant protection agricultural robots, the invention proposes the GLDCNet model based on a lightweight ViT architecture for accurate leaf density measurement in vineyards. Specifically, GLDCNet combines a convolutional neural network with a Transformer and contains a simple structure that effectively extracts high-frequency local feature representations using shared weights and context-aware weights; in particular, a carefully designed dual-branch structure mixes high-frequency and low-frequency information and performs information fusion. Next, a fusion data enhancement method combining dual enhancement space expansion with enhancement adapted to the agricultural operation environment is proposed, and experiments are carried out on a grape leaf density image dataset with three regions of interest. Comparative tests and accuracy verification are designed and the results analyzed, with the following conclusions. The average identification accuracy and MCC of the Vision Transformer based GLDCNet for grape leaf density are 95.11% and 90.45%, respectively. After the fusion data enhancement processing, the recognition accuracy on the original dataset improves by 7.23% on average, showing that the method improves the generalization and robustness of the GLDCNet model. Compared with basic traditional data enhancement, the fusion data enhancement improves the expressive power of the features and adapts to efficient operation of spraying equipment in complex agricultural environments, and the model is explained from the two aspects of feature-layer data dimension and sampling frequency-domain interval through the t-SNE and feature semantic histogram methods. Combining GLDCNet with the fusion data enhancement method offers a clear advantage in accurately measuring grape leaf density: compared with the popular lightweight ViT network MobileViT, accuracy improves by at least 0.34%, and the model has the lowest FLOPs, keeping the computational cost under control and better enabling a new mode of intelligent agricultural production using Internet of Things technology. The combined enhancement of dual enhancement space expansion and the weather data enhancement method is mutually complementary. In summary, the identification accuracy for the four grape leaf density classes is at least 93.50%, and the method has obvious advantages for developing a leaf density recognition spraying system.
On the basis of this work, the method can serve as the output of a plant visual perception system and, combined with the relevant technical design of existing plant protection robots, determine references for the specific dosages of air flow, water flow and water tightness, so as to reduce pesticide and fertilizer losses in spraying and achieve precise spraying and the long-term, sustained, high-speed development of green agriculture.
The foregoing is merely a preferred embodiment of the present invention; it should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to be covered by the present invention.

Claims (9)

1. A lightweight ViT-based image leaf density measurement method, the method comprising:
acquiring an original image;
uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest, and establishing a corresponding relation between the regions of interest and leaf density;
performing data enhancement on the labeled original image by using a pixel enhancement space, a spatial enhancement space and weather data enhancement, respectively, to obtain an enhanced image;
dividing the enhanced image into a training set, a verification set and a test set, and training a preset visual perception model of leaf density by using the training set and the verification set to obtain a trained visual perception model of leaf density; wherein the preset visual perception model of leaf density is constructed as a GLDCNet model comprising a lightweight ViT architecture;
inputting the test set into the trained visual perception model of leaf density to obtain the leaf density of each region of interest, and evaluating the accuracy of the measurement result.
2. The method of claim 1, wherein the partitioning the enhanced image into a training set, a validation set, and a test set comprises:
dividing the enhanced image into training data and test data in a ratio of 90% to 10%;
dividing the training data into a training set and a verification set, and taking the test data as a test set.
3. The method of claim 1 or 2, wherein the pixel enhancement space comprises six data enhancement operations, namely brightness, contrast, hue separation, Gaussian blur, Gaussian noise and sharpness; the spatial enhancement space comprises eight data enhancement operations, namely rotation, scaling, translation along the x-axis, translation along the y-axis, affine transformation along the x-axis, affine transformation along the y-axis, horizontal flipping and vertical flipping; and the weather data enhancement comprises six data enhancement operations, namely shooting blur, device interference, lens occlusion, real fog, solar irradiation and plant-to-plant occlusion;
wherein performing data enhancement on the labeled original image by using the pixel enhancement space, the spatial enhancement space and the weather data enhancement to obtain the enhanced image comprises the following steps:
the pixel enhancement space and the spatial enhancement space together comprise fourteen data enhancement operations, and the fourteen data enhancement operations are divided into 7 data enhancement branches and 1 original image branch;
each data enhancement branch sequentially performs no more than two data enhancement operations on the original image, and the original image branch performs no operation on the original image;
the sequentially performed enhancement operations are sampled from the two enhancement spaces and, after sampling, are executed in a random order to obtain a first enhanced image;
randomly performing the weather data enhancement operations on the original image five times to obtain a second enhanced image;
and combining the first enhanced image and the second enhanced image to form a final enhanced image.
4. The method of claim 3, wherein the six data enhancement operations of the weather data enhancement, namely shooting blur, device interference, lens occlusion, real fog, solar irradiation and plant-to-plant occlusion, are implemented as follows:
shooting blur is simulated by image motion blur, device interference is simulated by reducing and then enlarging the image, lens occlusion is simulated by an image mud-splash transformation, real fog is simulated by a random fog transformation of the image, solar irradiation is simulated by an image illumination transformation, and plant-to-plant occlusion is simulated by an image shadow transformation.
5. The method of claim 1, wherein the uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest, and establishing the correspondence between the regions of interest and leaf density comprises:
if the region of interest has no leaves, marking the leaf density of the region of interest as 0% -10%;
if the leaf coverage area of the region of interest does not exceed half of the area of the region of interest, marking that the leaf density of the region of interest is 15% -33%;
if the leaf coverage area of the region of interest exceeds half of the area of the region of interest but is smaller than the whole area of the region of interest, marking the leaf density of the region of interest as 45% -66%;
and if the leaves of the region of interest cover the whole area of the region of interest, marking that the leaf density of the region of interest is 85% -100%.
6. The method according to claim 1 or 2 or 5, further comprising:
controlling the amount of the spraying agent or fertilizer according to the leaf density of each region of interest.
7. An image leaf density measurement device based on lightweight ViT, the device comprising:
the acquisition module is used for acquiring an original image;
the labeling module is used for uniformly dividing the original image into a plurality of regions of interest, labeling the regions of interest and establishing a corresponding relationship between the regions of interest and leaf density;
the image enhancement module is used for performing data enhancement on the labeled original image by using a pixel enhancement space, a spatial enhancement space and weather data enhancement to obtain an enhanced image;
the training module is used for dividing the enhanced image into a training set, a verification set and a test set, and training a preset visual perception model of leaf density by utilizing the training set and the verification set to obtain a trained visual perception model of leaf density; the visual perception model of the preset leaf density is formed by constructing a GLDCNet model comprising a lightweight ViT framework;
and the testing module is used for inputting the testing set into the trained visual perception model of the leaf density to obtain the leaf density of each region of interest, and evaluating the precision of the measurement result.
8. An electronic device, comprising: one or more processors;
a storage unit for storing one or more programs, which when executed by the one or more processors, enable the one or more processors to implement the method of any one of claims 1 to 6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202311123510.3A 2023-09-01 2023-09-01 Lightweight ViT-based image leaf density determination method and device Active CN117173122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311123510.3A CN117173122B (en) 2023-09-01 2023-09-01 Lightweight ViT-based image leaf density determination method and device

Publications (2)

Publication Number Publication Date
CN117173122A true CN117173122A (en) 2023-12-05
CN117173122B CN117173122B (en) 2024-02-13

Family

ID=88929236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311123510.3A Active CN117173122B (en) 2023-09-01 2023-09-01 Lightweight ViT-based image leaf density determination method and device

Country Status (1)

Country Link
CN (1) CN117173122B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723312A (en) * 2021-09-01 2021-11-30 东北农业大学 Visual transform-based rice disease identification method
CN114842343A (en) * 2022-05-17 2022-08-02 武汉理工大学 ViT-based aerial image identification method
CN115205590A (en) * 2022-07-11 2022-10-18 齐齐哈尔大学 Hyperspectral image classification method based on complementary integration Transformer network
CN115359474A (en) * 2022-07-27 2022-11-18 成都信息工程大学 Lightweight three-dimensional target detection method, device and medium suitable for mobile terminal
WO2023126914A2 (en) * 2021-12-27 2023-07-06 Yeda Research And Development Co. Ltd. METHOD AND SYSTEM FOR SEMANTIC APPEARANCE TRANSFER USING SPLICING ViT FEATURES
CN116563707A (en) * 2023-05-08 2023-08-08 中国农业科学院农业信息研究所 Lycium chinense insect pest identification method based on image-text multi-mode feature fusion


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chu Hongxia; Qin Jinping; Xie Zhongyu; Zhang Rongyi: "Multi-target tracking based on fusion of Adaboost detection and hybrid particle filtering", Journal of Huazhong University of Science and Technology (Natural Science Edition), no. 07, 23 July 2013 (2013-07-23) *

Also Published As

Publication number Publication date
CN117173122B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
Li et al. A review of computer vision technologies for plant phenotyping
Sun et al. Northern maize leaf blight detection under complex field environment based on deep learning
Wang et al. Early real-time detection algorithm of tomato diseases and pests in the natural environment
Hao et al. Growing period classification of Gynura bicolor DC using GL-CNN
An et al. Real-time monitoring method of strawberry fruit growth state based on YOLO improved model
Li et al. Detection of powdery mildew on strawberry leaves based on DAC-YOLOv4 model
Devisurya et al. Early detection of major diseases in turmeric plant using improved deep learning algorithm
Lu et al. Citrus green fruit detection via improved feature network extraction
Jia et al. YOLOF-Snake: An efficient segmentation model for green object fruit
Niu et al. YOLO-plum: A high precision and real-time improved algorithm for plum recognition
CN117173122B (en) Lightweight ViT-based image leaf density determination method and device
Agarwal et al. Plant leaf disease classification using deep learning: A survey
Balafas et al. Machine learning and deep learning for plant disease classification and detection
Wang et al. Rapid detection of Yunnan Xiaomila based on lightweight YOLOv7 algorithm
Huang et al. Application of Data Augmentation and Migration Learning in Identification of Diseases and Pests in Tea Trees
Li et al. Production evaluation of citrus fruits based on the yolov5 compressed by knowledge distillation
Yadav et al. Comparative analysis of visual recognition capabilities of CNN architecture enhanced with Gabor filter
Ke et al. Intelligent vineyard blade density measurement method incorporating a lightweight vision transformer
Hou et al. An occluded cherry tomato recognition model based on improved YOLOv7
Dahiya et al. An Effective Detection of Litchi Disease using Deep Learning
Sumarudin et al. Implementation YOLOv3 for symptoms of disease in shallots crop
Liu et al. Enhancing Real-time Detection of Strawberry Diseases: An Optimized SSE-YOLOv5 Model with Improved Accuracy and Small Object Detection
Guo et al. Identifying Rice Field Weeds from Unmanned Aerial Vehicle Remote Sensing Imagery Using Deep Learning
Meng et al. Real-time statistical algorithm for cherry tomatoes with different ripeness based on depth information mapping
Nandini DETECTION OF PLANT DISEASES AND PESTS USING DEEP LEARNING MODELS: A RECENT RESEARCH

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant