CN115439744A - Tea tender shoot lightweight detection method based on target detection in complex environment


Info

Publication number
CN115439744A
CN115439744A
Authority
CN
China
Prior art keywords
tea
shoot
image
tender
data set
Prior art date
Legal status
Pending
Application number
CN202211038343.8A
Other languages
Chinese (zh)
Inventor
吴伟斌
李�杰
赵新
姚焙火
高昌伦
孙顺利
韩重阳
唐婷
吴贤楠
沈梓颖
刘逸飞
Current Assignee
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date
Filing date
Publication date
Application filed by South China Agricultural University
Priority to CN202211038343.8A
Publication of CN115439744A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/188 - Vegetation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 - Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a tea tender shoot lightweight detection method in a complex environment based on target detection, which comprises the following steps: S1, collecting images of tea tender shoots and constructing an original data set; S2, manually marking the one-bud-one-leaf and one-bud-two-leaves tea tender shoots in the original data set with an image marking tool; S3, making a lightweight improvement to YOLOv4, adopting GhostNet as the backbone feature extraction network and adding a channel attention mechanism in the PANet of YOLOv4; S4, performing cluster analysis on the tea tender shoot image data set to obtain anchors suited to the tea data set; S5, setting parameters for the improved YOLOv4, training, and saving the optimal model; and S6, inputting the image to be detected into the optimal model for forward inference and then applying non-maximum suppression post-processing to obtain the tea tender shoots in the image to be detected. The invention improves the backbone feature extraction network of the target detection model YOLOv4 and adds a channel attention mechanism, realizing lightweight, accurate detection of tea tender shoots in a complex tea garden environment.

Description

Tea tender shoot lightweight detection method in complex environment based on target detection
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a tea tender shoot lightweight detection method in a complex environment based on target detection.
Background
The tea consumption market is booming, and consumer demand for famous tea grows by the day. At present, famous tea is picked mainly by hand, which is labor-intensive and inefficient; mechanical picking by a robot can improve picking efficiency and greatly reduce labor cost.
The vision module is one of the important components of a tea-picking robot, and its recognition accuracy and speed determine the picking performance of the downstream mechanical arm. Robot vision systems in agricultural production are designed mainly around traditional image processing methods or deep-learning-based methods. Traditional image processing identifies tea tender shoots chiefly by their color, shape, and texture features, and places high demands on natural conditions such as illumination and viewing angle. Among deep-learning-based tea tender shoot detection methods, most data sets are built from images of single tea shoots; data sets made in a complex tea garden environment are rare.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a tea tender shoot lightweight detection method based on target detection in a complex environment.
In order to achieve the purpose, the invention adopts the following technical scheme:
a tea tender shoot lightweight detection method under a complex environment based on target detection comprises the following steps:
s1, collecting tender shoot images of tea leaves in a tea garden, and constructing an original data set;
s2, manually marking the one-bud-one-leaf and one-bud-two-leaves tea tender shoots in the original data set with an image marking tool;
s3, making a lightweight improvement to the target detection model YOLOv4, adopting GhostNet as the backbone feature extraction network, and adding a channel attention mechanism in the PANet of YOLOv4;
s4, performing cluster analysis on the tea tender shoot image data set to obtain anchors suited to the tea data set;
s5, setting parameters for the improved target detection model YOLOv4, training, and saving the optimal model;
s6, inputting the image to be detected into the saved optimal model for forward inference, then applying non-maximum suppression post-processing to obtain the tea tender shoots in the image to be detected, thereby completing the detection of the tea tender shoots.
Further, step S1 specifically includes:
dividing tea tender shoots into the one-bud-one-leaf and one-bud-two-leaves classes, and shooting images with a camera at 45-90 degrees about 50 cm above the tea tender shoots to construct an original data set; wherein the images are shot in the morning, at noon, in the evening, and at night, with supplementary lighting at night;
the weather for shooting the images includes sunny days and rainy days.
Further, step S2 specifically includes:
the tea tender shoots in the original images are labeled using the image labeling software LabelImg; one bud with one leaf is labeled tea11, and one bud with two leaves is labeled tea12.
Further, in step S3, the improvement of the target detection model YOLOv4 in light weight specifically includes:
GhostNet replaces the original DarkNet53, and the improved model consists of GhostNet, the Neck, and the Head;
GhostNet replaces ordinary convolution with the Ghost Module: the Ghost Module first compresses the number of channels with a 1 × 1 convolution to generate part of the real feature layer, then applies a depthwise separable convolution to each channel of the real feature layer to obtain the ghost feature map, and finally concatenates the real feature layer and the ghost feature layer to obtain the final output feature layer; GhostNet is composed of a plurality of Ghost bottleneck modules and is divided into 5 stages, whose feature map outputs are 208 × 208, 104 × 104, 52 × 52, 26 × 26, and 13 × 13 respectively;
the G-block is a bottleneck structure built from Ghost Modules and is divided into two parts:
the first part is a bottleneck structure with stride 1; to deepen the network without compressing the width and height of the input feature layer, the main branch applies 2 Ghost Modules to extract features from the input feature layer, and finally the output of the main branch and the residual edge are added;
the second part is a bottleneck structure with stride 2; the main branch first extracts features with a Ghost Module, then compresses the width and height of the input feature layer with a depthwise separable convolution of stride 2, and finally extracts features again with a Ghost Module, after which the output and the residual edge are added.
Further, the Neck part consists of a spatial pyramid pooling SPP and a path aggregation network (PANet);
the SPP contains 3 max-pooling layers with pooling kernels of size 5 × 5, 9 × 9, and 13 × 13; the 5 × 5 layer uses padding 2 and stride 1, the 9 × 9 layer uses padding 4 and stride 1, and the 13 × 13 layer uses padding 6 and stride 1;
the PANet is a path aggregation network that fuses the information of the three scale feature maps produced by the backbone feature extraction network GhostNet along two path directions, top-down and bottom-up; the PANet applies convolution and upsampling to the SPP output, concatenates the result with the convolved 26 × 26 feature layer of GhostNet, and applies 5 convolutions to the concatenation;
this result is convolved and upsampled again, concatenated with the convolved 52 × 52 feature layer of GhostNet, and the concatenation is again convolved 5 times; the result of these 5 convolutions has two branches: the first serves as the input of the 52 × 52 feature layer of the Head for detecting small targets, the second is downsampled; the downsampled result is concatenated with the result of the first 5 convolutions and convolved 5 times, and this result again has two branches: the first serves as the input of the 26 × 26 feature layer of the Head for detecting targets, the second is downsampled and concatenated with the output of the SPP, and the concatenation is convolved 5 times and serves as the input of the 13 × 13 feature layer of the Head;
in step S3, adding the channel attention mechanism specifically comprises:
the mechanism is added before the five-convolution blocks of the PANet;
the channel attention mechanism first applies global average pooling to the input feature layer, then two fully connected layers, then a sigmoid to obtain the weight of each channel of the input feature layer, and finally multiplies these weights by the input feature layer.
Further, the Head part divides the whole image into S × S grid cells, performs non-maximum suppression on the 3 feature layers of 13 × 13, 26 × 26, and 52 × 52 using the IOU between predicted and ground-truth values, removes invalid prediction boxes, and finally obtains the identified target categories with their corresponding confidences.
Further, the improved model input size is 416 × 416 × 3, and the model's CIOU loss function is composed of a category loss, a confidence loss, and a position loss:
CIOU = IOU - ρ²(b, b_gt)/c² - αγ
γ = (4/π²)(arctan(w_gt/h_gt) - arctan(w/h))²
α = γ/((1 - IOU) + γ)
L_CIOU = 1 - CIOU
wherein ρ²(b, b_gt) denotes the squared Euclidean distance between the center points of the prediction box and the ground-truth box, c denotes the diagonal distance of the smallest enclosing region that can contain both the prediction box and the ground-truth box, γ is a quantity measuring aspect-ratio consistency, α is the penalty parameter of γ, w_gt and h_gt denote the width and height of the ground-truth box, and w and h denote the width and height of the prediction box.
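By way of illustration only, the loss above can be evaluated for a single pair of boxes as in the following Python sketch; the (cx, cy, w, h) box format and the small stabilizing epsilon are assumptions of the sketch, not part of the claimed method:

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIOU loss for one (prediction, ground-truth) pair, each given as
    (cx, cy, w, h). A real training loop would vectorize this over a batch."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Plain IOU of the two axis-aligned boxes.
    ix1 = max(px - pw / 2, gx - gw / 2)
    iy1 = max(py - ph / 2, gy - gh / 2)
    ix2 = min(px + pw / 2, gx + gw / 2)
    iy2 = min(py + ph / 2, gy + gh / 2)
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # rho^2(b, b_gt): squared distance between the two box centers.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # c^2: squared diagonal of the smallest region enclosing both boxes.
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)
    c2 = cw ** 2 + ch ** 2 + eps

    # gamma measures aspect-ratio consistency; alpha is its penalty weight.
    gamma = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = gamma / ((1 - iou) + gamma + eps)

    return 1 - (iou - rho2 / c2 - alpha * gamma)
```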
Further, step S4 performs cluster analysis on the anchors with K-means to obtain the prior box sizes of the tea tender shoot data set;
wherein the K-means cluster analysis of the anchors comprises the following steps:
s41, determining 9 clusters according to the anchor requirement of YOLOv4;
s42, selecting K samples from the tea tender shoot data set as the initial cluster centers;
s43, assigning each sample in the data set to the nearest cluster center according to its Euclidean distance to the K cluster centers, and taking the mean of each cluster's samples as the new cluster center;
s44, repeating step S43 to update the cluster centers until they no longer change, then stopping the clustering to obtain new anchors belonging to the tea tender shoot samples.
Further, in step S5, the setting of the training parameters specifically includes:
setting the experiment optimizer to SGD, the batch size to 18, and the total number of iterations to 500 epochs; the learning rate follows a cosine annealing schedule with a maximum of 1 × 10⁻² and a minimum of 0.01 times the maximum;
the specific steps of training are as follows:
training proceeds until the set number of iterations is reached; training then ends, the best-performing model is selected, and its structure and weights are saved.
Further, step S6 specifically includes:
inputting the image to be detected into the saved optimal model for forward inference to obtain n initial candidate results, then applying non-maximum suppression filtering with a threshold of 0.5 to the initial candidates to obtain the one-bud-one-leaf and one-bud-two-leaves tea tender shoots;
the image to be detected is an image of tea tender shoots collected in the tea garden.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention builds the data set under different natural environments, improving the generalization ability of the model.
2. The YOLOv4 network is made lightweight: the original DarkNet53 is replaced with GhostNet and a channel attention mechanism is added, which greatly reduces the computation load while preserving recognition accuracy, improves recognition of tea tender shoots in the complex natural environment of the tea garden, and lowers the hardware cost of model deployment.
3. The neural network adopted by the invention learns features on its own; the features of one-bud-one-leaf and one-bud-two-leaves tea tender shoots need not be designed by hand, giving higher robustness and accuracy than traditional techniques.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of an improved object detection model;
FIG. 3a is the first part of the G-block structure;
FIG. 3b is the second part of the G-block structure;
FIG. 4 is a schematic diagram of a channel attention mechanism;
FIG. 5 is an image of tea shoots to be detected;
fig. 6 is a detection effect diagram.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the embodiments of the present invention are not limited thereto.
Examples
As shown in figure 1, the tea tender shoot lightweight detection method based on target detection in a complex environment comprises the following steps:
s1, collecting tea tender shoot images and constructing an original data set; in this embodiment, specifically:
tea tender shoots are divided into the one-bud-one-leaf and one-bud-two-leaves classes, and images are shot about 50 cm above the tea tender shoots at angles of 45-90 degrees; more than 5,000 valid images are collected to construct the data set. To enhance the robustness of the visual recognition model, data are acquired in the morning, at noon, in the evening, and at night, with supplementary lighting at night, so that different illumination conditions are covered. The weather covered includes sunny and rainy days, fully accounting for the different environmental interference factors the tea robot meets at work.
S2, manually marking the one-bud-one-leaf and one-bud-two-leaves tea tender shoots in the original data set with an image marking tool; in this embodiment, specifically:
the tea tender shoots in the images of the original data set are labeled with the open-source image labeling software LabelImg; one bud with one leaf is labeled tea11, and one bud with two leaves is labeled tea12.
During labeling, many interference factors from the dormancy and germination stages must be clearly distinguished, so labeling focuses on strong one-bud-one-leaf and one-bud-two-leaves shoots suitable for picking. Because tea leaves grow in a disordered way, tea tender shoots can occlude one another; shoots occluded by more than 70% are ignored and left unlabeled. A tea tender shoot label mainly records important information such as the image name, the image size, the target category, and the coordinates of the upper-left and lower-right corners. The coordinate information is [x1, y1, x2, y2], where (x1, y1) is the upper-left corner and (x2, y2) is the lower-right corner.
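For illustration, LabelImg stores such labels as Pascal VOC XML files; a minimal Python sketch for reading the category and corner coordinates back out might look as follows (the field names assume LabelImg's default VOC output):

```python
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Parse one LabelImg (Pascal VOC) XML file into a list of
    (class_name, x1, y1, x2, y2) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text           # e.g. "tea11" or "tea12"
        bb = obj.find("bndbox")
        x1 = int(float(bb.find("xmin").text))  # upper-left corner
        y1 = int(float(bb.find("ymin").text))
        x2 = int(float(bb.find("xmax").text))  # lower-right corner
        y2 = int(float(bb.find("ymax").text))
        boxes.append((name, x1, y1, x2, y2))
    return boxes
```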
S3, making a lightweight improvement to the target detection model YOLOv4, adopting GhostNet as the backbone feature extraction network, and adding a channel attention mechanism in the PANet of YOLOv4; specifically:
the target detection model YOLOv4 is made lightweight by replacing the original DarkNet53 with GhostNet, reducing the deployment cost of the vision model; a channel attention mechanism is added to the PANet of YOLOv4 to compensate for the partial feature loss caused by using GhostNet for backbone feature extraction.
The improved model input size is 416 × 416 × 3; as shown in fig. 2, the model is composed of three parts, GhostNet, the Neck, and the Head, whose specific meanings are as follows:
The core idea of GhostNet is to generate the redundant feature maps with operations of lower computational cost, replacing ordinary convolution with the Ghost Module. The Ghost Module first compresses the number of channels with a 1 × 1 convolution to generate part of the real feature layer, then applies a depthwise separable convolution to each channel of the real feature layer to obtain the ghost feature map, and finally concatenates the real feature layer and the ghost feature layer to obtain the final output feature layer. GhostNet is composed of a plurality of Ghost bottleneck modules and is divided into 5 stages, whose feature map outputs are 208 × 208, 104 × 104, 52 × 52, 26 × 26, and 13 × 13 respectively.
The Ghost bottleneck (G-block) is a bottleneck structure built from Ghost Modules and can be divided into two parts:
The first part is a bottleneck structure with stride 1, as shown in fig. 3a; to deepen the network without compressing the width and height of the input feature layer, the main branch applies 2 Ghost Modules to extract features from the input feature layer, and finally the output of the main branch and the residual edge are added.
The second part is a bottleneck structure with stride 2, as shown in fig. 3b; the main branch first extracts features with a Ghost Module, then compresses the width and height of the input feature layer with a depthwise separable convolution of stride 2, and finally extracts features again with a Ghost Module, after which the output and the residual edge are added.
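Continuing the sketch, the two G-block variants might look as follows in PyTorch; this reuses the GhostModule class from the previous sketch, and the exact layers of the stride-2 residual edge are an assumption:

```python
import torch.nn as nn
# Assumes the GhostModule class from the previous sketch is in scope.

class GhostBottleneck(nn.Module):
    """G-block: with stride 1 the residual edge is the identity; with
    stride 2 the main branch inserts a depthwise convolution and the
    residual edge is downsampled to match (a 1x1 projection here)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        layers = [GhostModule(in_ch, mid_ch)]
        if stride == 2:                            # compress width and height
            layers.append(nn.Sequential(
                nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1,
                          groups=mid_ch, bias=False),
                nn.BatchNorm2d(mid_ch)))
        layers.append(GhostModule(mid_ch, out_ch))
        self.main = nn.Sequential(*layers)
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(         # downsampled residual edge
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```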
The Neck consists of spatial pyramid pooling (SPP) and the path aggregation network (PANet). SPP uses max pooling so that feature maps of any size can be converted into a feature vector of fixed size. The SPP contains 3 max-pooling layers with pooling kernels of 5 × 5, 9 × 9, and 13 × 13: the 5 × 5 layer uses padding 2 and stride 1, the 9 × 9 layer uses padding 4 and stride 1, and the 13 × 13 layer uses padding 6 and stride 1. The SPP module effectively enlarges the receptive field of the backbone features and clearly separates out the most important context features.
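For illustration, the SPP block described above might be sketched in PyTorch as follows; because each pooling layer uses stride 1 with matching padding, the three pooled maps keep the input resolution and can be concatenated with the input:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Three max-pooling layers (kernels 5, 9, 13; paddings 2, 4, 6;
    stride 1) whose outputs are concatenated with the input."""
    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
             for k in (5, 9, 13)])

    def forward(self, x):
        # Each pooled map has the same H x W as x, so channel-wise
        # concatenation quadruples the channel count.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```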
The PANet is a path aggregation network that fuses the information of the three scale feature maps generated by the backbone feature extraction network (GhostNet) along two path directions, top-down and bottom-up. The PANet applies convolution and upsampling to the SPP output, concatenates the result with the convolved 26 × 26 feature layer of GhostNet, and applies 5 convolutions to the concatenation (the first 5-convolution module);
this result is convolved and upsampled again, concatenated with the convolved 52 × 52 feature layer of GhostNet, and the concatenation is convolved 5 times (the second 5-convolution module). This result has two branches: the first serves as the input of the 52 × 52 feature layer of the Head, mainly for detecting small targets; the second is downsampled and concatenated with the result of the first 5-convolution module, after which 5 more convolutions are applied (the third 5-convolution module). This result again has two branches: the first serves as the input of the 26 × 26 feature layer of the Head, mainly for detecting targets; the second is downsampled and concatenated with the output of the SPP, and the concatenation is convolved 5 times (the fourth 5-convolution module) and serves as the input of the 13 × 13 feature layer of the Head.
Channel attention is added before the five-convolution blocks of the PANet. The main purpose of the channel attention mechanism is to obtain the weight of each channel of the input feature layer, so that the channels needing more attention receive larger weights. It first applies global average pooling to the input feature layer, then two fully connected layers, then a sigmoid to obtain the weight of each channel, and finally multiplies these weights by the input feature layer. Adding the attention mechanism compensates for the information loss caused by the lightweight backbone feature extraction network and improves the recognition of tea tender shoots. Fig. 4 is a schematic diagram of the channel attention mechanism.
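For illustration, this channel attention might be sketched in PyTorch as follows; the reduction ratio of 4 in the two fully connected layers is an assumption of the sketch:

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling, two fully connected layers, a sigmoid to
    obtain per-channel weights, and a channel-wise rescaling of the input."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # global average pooling -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w                     # reweight the input feature layer
```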
The Head part divides the whole image into S × S grid cells, performs non-maximum suppression on the 3 feature maps of 13 × 13, 26 × 26, and 52 × 52 using the IOU between predicted and ground-truth values, removes invalid prediction boxes, and obtains the identified object categories with their corresponding confidences.
S4, performing cluster analysis on the tea tender shoot image data set to obtain anchors suited to the tea data set;
The prior boxes adopted by YOLOv4 are obtained by clustering the COCO data set, so each feature map contains 3 prior boxes of different scales; the sizes of the prior boxes affect the convergence speed of the model and the accuracy of prediction. The tea shoot targets in the data set produced in this embodiment are mainly small and medium targets, while the targets in the COCO data set vary widely in size, so the initial prior boxes of YOLOv4 are unsuitable for detecting tea shoots; the prior box sizes for the tea shoot data set are therefore obtained with the K-means clustering algorithm. The steps of K-means cluster analysis of the anchors are:
s41, determining 9 clusters according to the anchor requirement of YOLOv4;
s42, selecting K samples from the tea tender shoot data set as the initial cluster centers;
s43, assigning each sample in the data set to the nearest cluster center according to its Euclidean distance to the K cluster centers, and taking the mean of each cluster's samples as the new cluster center;
s44, repeating step S43 to update the cluster centers until they no longer change, then stopping the clustering to obtain new anchors belonging to the tea tender shoot samples.
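For illustration, steps S41-S44 might be sketched in Python/NumPy as follows, clustering the (width, height) pairs of the labeled boxes; the fixed random seed and the iteration cap are assumptions of the sketch:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=300):
    """Plain K-means with Euclidean distance over the (width, height)
    pairs of all labeled tea-shoot boxes. `wh` is an (N, 2) array;
    returns k anchor sizes sorted by area."""
    rng = np.random.default_rng(0)                           # fixed seed
    centers = wh[rng.choice(len(wh), size=k, replace=False)]  # S42
    for _ in range(iters):
        # S43: assign every box to its nearest cluster center.
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            wh[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)])
        if np.allclose(new_centers, centers):   # S44: centers settled
            break
        centers = new_centers
    # Smallest anchors go to the 52 x 52 head, largest to the 13 x 13 head.
    return centers[np.argsort(centers.prod(axis=1))]
```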
S5, setting parameters for the improved target detection model YOLOv4, training and storing the optimal model;
in this embodiment, the training parameters are set as:
the experiment optimizer is set to SGD, the batch size to 18, and the total number of iterations to 500 epochs; the learning rate follows a cosine annealing schedule with a maximum of 1 × 10⁻² and a minimum of 0.01 times the maximum.
Training stops when the set number of iterations is reached; the best-performing model is selected, and its structure and weights are saved.
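For illustration, the cosine-annealed learning rate described above might be computed per epoch as in the following sketch:

```python
import math

def cosine_lr(epoch, total_epochs=500, lr_max=1e-2, lr_min_ratio=0.01):
    """Cosine annealing: starts at lr_max (1e-2) and decays to
    lr_min_ratio * lr_max (0.01 times the maximum) over training."""
    lr_min = lr_max * lr_min_ratio
    cosine = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cosine

# Learning rate at the start, middle, and end of training.
print(cosine_lr(0), cosine_lr(250), cosine_lr(500))  # 0.01, ~0.00505, 0.0001
```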
S6, inputting the image to be detected into the saved optimal model for forward inference, then applying non-maximum suppression post-processing to obtain the tea tender shoots in the image to be detected, completing the detection. In this embodiment, specifically:
Images of tea tender shoots in the tea garden are acquired with image acquisition equipment; the image to be detected is input into the saved model, and forward inference yields N candidate results. During detection the candidate results always contain overlapping detections of the same target, so non-maximum suppression is applied; a threshold of 0.4 gives the best final recognition result.
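For illustration, the non-maximum suppression step might be sketched in NumPy as follows; the [x1, y1, x2, y2] box format matches the label format described in step S2:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.4):
    """Greedy non-maximum suppression. `boxes` is an (N, 4) array of
    [x1, y1, x2, y2] corners, `scores` an (N,) array of confidences;
    returns the indices of the boxes that survive."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # best candidates first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IOU of the current best box with all remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]  # drop overlapping duplicates
    return keep
```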
Fig. 5 and fig. 6 show the tea shoot image to be detected and the detection effect, respectively.
The method expands the original tea tender shoot data set with a data enhancement algorithm, improves the backbone feature extraction network of the existing target detection model YOLOv4, adds an attention mechanism and K-means cluster analysis of anchors suited to the tea tender shoot data set in the complex tea garden environment, and sets training parameters for iterative training of the network, thereby achieving accurate detection of tea tender shoots in the complex tea garden environment.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A tea tender shoot lightweight detection method in a complex environment based on target detection is characterized by comprising the following steps:
s1, collecting tender shoot images of tea leaves in a tea garden, and constructing an original data set;
s2, manually marking the one-bud-one-leaf and one-bud-two-leaves tea tender shoots in the original data set with an image marking tool;
s3, making a lightweight improvement to the target detection model YOLOv4, adopting GhostNet as the backbone feature extraction network, and adding a channel attention mechanism in the PANet of YOLOv4;
s4, performing cluster analysis on the tea tender shoot image data set to obtain anchors suited to the tea data set;
s5, setting parameters for the improved target detection model YOLOv4, training, and saving the optimal model;
s6, inputting the image to be detected into the saved optimal model for forward inference, and applying non-maximum suppression post-processing to obtain the tea tender shoots in the image to be detected, thereby completing the detection of the tea tender shoots.
2. The tea shoot lightweight detection method based on target detection under the complex environment as claimed in claim 1, wherein the step S1 specifically comprises:
dividing tea tender shoots into the one-bud-one-leaf and one-bud-two-leaves classes, and shooting images with a camera at 45-90 degrees about 50 cm above the tea tender shoots to construct an original data set; wherein the images are shot in the morning, at noon, in the evening, and at night, with supplementary lighting at night;
the weather for taking the image includes sunny days and rainy days.
3. The tea shoot lightweight detection method based on the target detection under the complex environment as claimed in claim 1, wherein the step S2 is specifically as follows:
the tea shoots in the original images are labeled using the image labeling software LabelImg; one bud with one leaf is labeled tea11, and one bud with two leaves is labeled tea12.
4. The tea shoot lightweight detection method based on target detection in the complex environment as claimed in claim 1, wherein in the step S3, the lightweight improvement of the target detection model YOLOv4 is specifically as follows:
GhostNet replaces the original DarkNet53, and the improved model consists of GhostNet, the Neck, and the Head;
GhostNet replaces ordinary convolution with the Ghost Module: the Ghost Module first compresses the number of channels with a 1 × 1 convolution to generate part of the real feature layer, then applies a depthwise separable convolution to each channel of the real feature layer to obtain the ghost feature map, and finally concatenates the real feature layer and the ghost feature layer to obtain the final output feature layer; GhostNet is composed of a plurality of Ghost bottleneck modules and is divided into 5 stages, whose feature map outputs are 208 × 208, 104 × 104, 52 × 52, 26 × 26, and 13 × 13 respectively;
the G-block is a bottleneck structure built from Ghost Modules and is divided into two parts:
the first part is a bottleneck structure with stride 1; to deepen the network without compressing the width and height of the input feature layer, the main branch applies 2 Ghost Modules to extract features from the input feature layer, and finally the output of the main branch and the residual edge are added;
the second part is a bottleneck structure with stride 2; the main branch first extracts features with a Ghost Module, then compresses the width and height of the input feature layer with a depthwise separable convolution of stride 2, and finally extracts features again with a Ghost Module, after which the output and the residual edge are added.
5. The tea shoot lightweight detection method based on the target detection in the complex environment as claimed in claim 4, wherein the Neck part is composed of a spatial pyramid pooling SPP and a path aggregation network PANet;
the SPP contains 3 max-pooling layers with pooling kernels of size 5 × 5, 9 × 9, and 13 × 13; the 5 × 5 layer uses padding 2 and stride 1, the 9 × 9 layer uses padding 4 and stride 1, and the 13 × 13 layer uses padding 6 and stride 1;
the PANet is a path aggregation network that fuses the information of the three scale feature maps produced by the backbone feature extraction network GhostNet along two path directions, top-down and bottom-up; the PANet applies convolution and upsampling to the SPP output, concatenates the result with the convolved 26 × 26 feature layer of GhostNet, and applies 5 convolutions to the concatenation;
this result is convolved and upsampled again, concatenated with the convolved 52 × 52 feature layer of GhostNet, and the concatenation is again convolved 5 times; the result of these 5 convolutions has two branches: the first serves as the input of the 52 × 52 feature layer of the Head for detecting small targets, the second is downsampled; the downsampled result is concatenated with the result of the first 5 convolutions and convolved 5 times, and this result again has two branches: the first serves as the input of the 26 × 26 feature layer of the Head for detecting targets, the second is downsampled and concatenated with the output of the SPP, and the concatenation is convolved 5 times and serves as the input of the 13 × 13 feature layer of the Head;
in step S3, adding the channel attention mechanism specifically comprises:
the mechanism is added before the five-convolution blocks of the PANet;
the channel attention mechanism first applies global average pooling to the input feature layer, then two fully connected layers, then a sigmoid to obtain the weight of each channel of the input feature layer, and finally multiplies these weights by the input feature layer.
6. The tea shoot lightweight detection method based on the target detection in the complex environment as claimed in claim 5, wherein the Head part divides the whole image into S × S grid cells, performs non-maximum suppression on the 3 feature layers of 13 × 13, 26 × 26, and 52 × 52 using the IOU between predicted and ground-truth values, removes invalid prediction boxes, and finally obtains the identified target categories with their corresponding confidences.
7. The tea shoot lightweight detection method based on target detection under the complex environment as claimed in claim 4, wherein the improved model input size is 416 × 416 × 3, and the model's CIOU loss function is composed of a category loss, a confidence loss, and a position loss:
CIOU = IOU - ρ²(b, b_gt)/c² - αγ
γ = (4/π²)(arctan(w_gt/h_gt) - arctan(w/h))²
α = γ/((1 - IOU) + γ)
L_CIOU = 1 - CIOU
wherein ρ²(b, b_gt) denotes the squared Euclidean distance between the center points of the prediction box and the ground-truth box, c denotes the diagonal distance of the smallest enclosing region that can contain both the prediction box and the ground-truth box, γ is a quantity measuring aspect-ratio consistency, α is the penalty parameter of γ, w_gt and h_gt denote the width and height of the ground-truth box, and w and h denote the width and height of the prediction box.
8. The tea shoot lightweight detection method based on the target detection under the complex environment as claimed in claim 1, wherein step S4 specifically adopts K-means to perform cluster analysis on the anchors to obtain the prior box sizes of the tea tender shoot data set;
wherein the K-means cluster analysis of the anchors comprises the following steps:
s41, determining 9 clusters according to the anchor requirement of YOLOv4;
s42, selecting K samples from the tea tender shoot data set as the initial cluster centers;
s43, assigning each sample in the data set to the nearest cluster center according to its Euclidean distance to the K cluster centers, and taking the mean of each cluster's samples as the new cluster center;
s44, repeating step S43 to update the cluster centers until they no longer change, then stopping the clustering to obtain new anchors belonging to the tea tender shoot samples.
9. The tea shoot lightweight detection method based on target detection in the complex environment as claimed in claim 1, wherein in step S5, the training parameters are specifically set as follows:
setting the experiment optimizer to SGD, the batch size to 18, and the total number of iterations to 500 epochs; the learning rate follows a cosine annealing schedule with a maximum of 1 × 10⁻² and a minimum of 0.01 times the maximum;
the specific steps of training are as follows:
when training reaches the set number of iterations, training ends; the best-performing model is selected, and its structure and weights are saved.
10. The tea shoot lightweight detection method based on target detection in a complex environment as claimed in claim 1, wherein the step S6 specifically comprises:
inputting the image to be detected into the saved optimal model for forward inference to obtain n initial candidate results, and applying non-maximum suppression filtering with a threshold of 0.5 to the initial candidates to obtain the one-bud-one-leaf and one-bud-two-leaves tea tender shoots;
the image to be detected is an image of tea tender shoots collected in the tea garden.
CN202211038343.8A 2022-08-29 2022-08-29 Tea tender shoot lightweight detection method based on target detection in complex environment Pending CN115439744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211038343.8A CN115439744A (en) 2022-08-29 2022-08-29 Tea tender shoot lightweight detection method based on target detection in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211038343.8A CN115439744A (en) 2022-08-29 2022-08-29 Tea tender shoot lightweight detection method based on target detection in complex environment

Publications (1)

Publication Number Publication Date
CN115439744A true CN115439744A (en) 2022-12-06

Family

ID=84244775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211038343.8A Pending CN115439744A (en) 2022-08-29 2022-08-29 Tea tender shoot lightweight detection method based on target detection in complex environment

Country Status (1)

Country Link
CN (1) CN115439744A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117253050A (en) * 2023-11-20 2023-12-19 华南农业大学 Tea bud and leaf detection method based on self-adaptive feature extraction
CN117253050B (en) * 2023-11-20 2024-03-22 华南农业大学 Tea bud and leaf detection method based on self-adaptive feature extraction


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination