CN117893823A - Apple maturity detection method based on Swin Transformer - Google Patents

Apple maturity detection method based on Swin Transformer

Info

Publication number
CN117893823A
Authority
CN
China
Prior art keywords
maturity
apple
model
yolov8
swin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410080533.9A
Other languages
Chinese (zh)
Inventor
方子睿
周琼
张友华
郑安平
李明
薛佳毅
李紫萱
刘浩楠
彭冠仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Agricultural University AHAU
Original Assignee
Anhui Agricultural University AHAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Agricultural University AHAU filed Critical Anhui Agricultural University AHAU
Priority to CN202410080533.9A priority Critical patent/CN117893823A/en
Publication of CN117893823A publication Critical patent/CN117893823A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an apple maturity detection method based on a Swin Transformer, which comprises the following steps: step 1, acquiring apple images and manually labeling maturity to obtain a data set; dividing the data set into a training set, a verification set and a test set, and then carrying out data enhancement; step 2, generating an improved YOLOv8 model, in which the Backbone network is a Swin Transformer attention mechanism module and the Neck part is a progressive feature pyramid network AFPN; step 3, training the improved YOLOv8 model with the training set to obtain an optimal maturity prediction model; and step 4, inputting the apple image to be detected into the optimal maturity prediction model to obtain the maturity of each apple in the image. The invention improves the accuracy of apple maturity identification in open scenes.

Description

Apple maturity detection method based on Swin Transformer
Technical Field
The invention relates to the field of apple maturity detection methods, in particular to an apple maturity detection method based on a Swin Transformer.
Background
Existing apple maturity assessment relies on manual judgment, which suffers from strong subjectivity, non-uniform standards and high labor intensity. Detecting apple maturity with computer vision enables accurate maturity assessment and timely picking, avoiding the waste of apples caused by late picking.
For computer-vision detection of apple maturity, the existing YOLOv8 provides an advanced technique that can achieve fast and accurate detection. However, as with all techniques, the existing YOLOv8 has several shortcomings and points to be optimized:
First, for images with complex backgrounds and illumination variations, the preprocessing of the existing YOLOv8 is less effective, which degrades subsequent feature extraction and classifier performance.
Second, the existing YOLOv8 may have difficulty with densely distributed apples. Since the algorithm divides the image into relatively coarse grids, accurate detection and localization of densely distributed apples can be difficult, which affects the accuracy of apple maturity evaluation.
Third, the existing YOLOv8 also struggles with occluded apples. When an apple is partially blocked by other objects, its detection accuracy may drop substantially, mainly because the algorithm does not account for occlusion during the feature extraction and classification stages; the algorithm therefore needs further improvement to enhance the detection of occluded apples.
Fourth, although the existing YOLOv8 greatly improves speed and accuracy, its detection of small targets is still unsatisfactory. Because the feature maps extracted by its network have low resolution, detection accuracy on small targets is relatively low. To address this, higher-resolution feature maps or dedicated small-object detection modules can be used to improve small-object detection capability.
Fifth, the existing YOLOv8 is limited in classification accuracy. Although it can classify apples, classification accuracy may suffer for apple varieties with similar appearance, so the algorithm needs further improvement to increase classification accuracy.
Sixth, the existing YOLOv8 does not introduce an attention mechanism and may not adequately focus on different scales and important features; for example, it mainly attends to the shape and color of apples while ignoring other possible cues such as texture and surface grain. In apple maturity detection, the specific maturity features are critical to classification and localization and require special attention to improve accuracy and robustness.
Disclosure of Invention
The invention provides an apple maturity detection method based on a Swin Transformer, which aims to solve the problems of low detection accuracy and precision of existing YOLOv8-based apple maturity detection methods.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a method for detecting apple maturity based on Swin Transformer comprises the following steps:
step 1, acquiring a plurality of apple images, and marking the maturity of each apple image manually based on the maturity characteristics of the apples, thereby acquiring a data set; after dividing the data set into a training set, a verification set and a test set, respectively carrying out data enhancement on the training set, the verification set and the test set;
step 2, generating an improved YOLOv8 model, wherein the improved YOLOv8 model comprises a Backbone network, a Neck part and an output layer, as follows:
the Backbone network of the improved YOLOv8 model is a Swin Transformer attention mechanism module, wherein the Swin Transformer attention mechanism module comprises a patch Partition module, a Linear Embedding layer and an even number of convolution blocks of different sizes cascaded in sequence; each convolution block except the first is preceded by a patch Merging layer, through which it receives downsampled features; in the Swin Transformer attention mechanism module, the patch Partition module divides each input apple image into blocks, each block is flattened in the channel direction, the Linear Embedding layer applies a linear transformation to the flattened channel data and sends it into the first convolution block for maturity feature extraction, and the maturity feature map extracted by the first convolution block is downsampled in turn through the patch Merging layers to the corresponding convolution blocks for further maturity feature extraction, each convolution block extracting a different maturity feature map;
the Neck part of the improved YOLOv8 model is a progressive feature pyramid network AFPN, which performs deep fusion of the different maturity feature maps extracted from each apple image by the Backbone network, by applying a self-attention mechanism across multiple scales and feature layers;
the output layer of the improved YOLOv8 model outputs the deeply fused features obtained by the progressive feature pyramid network AFPN, namely the maturity prediction result;
step 3, training the improved YOLOv8 model with the data-enhanced training set obtained in step 1, adjusting the hyperparameters of the model with the verification set in each training period, and evaluating the generalization of the model obtained in each training period with the test set, thereby obtaining an optimal maturity prediction model;
and step 4, inputting the apple image to be detected into the optimal maturity prediction model obtained in step 3, and detecting the maturity of each apple in the apple image with this model.
Further, in step 1, according to the size, color, shape, texture and surface-grain characteristics of the apples in each apple image, the apples are manually labeled with bounding boxes of three maturity classes, fully mature, semi-mature and immature, thereby obtaining the data set.
Further, when performing data enhancement in step 1, the method is based on the linear-transformation contrast enhancement method: the contrast value parameter and the brightness value parameter of the linear transformation are set as a random contrast value and a random brightness value, both controlled by a random function, thereby completing the enhancement of the data in the training set, test set and verification set.
Further, in step 2, the progressive feature pyramid network AFPN filters features in the multi-level fusion process using an adaptive spatial fusion operation.
Further, in step 3, the loss function of each training period adopts an MPDIoU loss function.
For the first problem in the background art, namely that images with complex backgrounds and illumination changes affect the preprocessing effect, the invention uses image normalization so that images behave consistently under different illumination conditions, thereby solving the problem.
For the second problem in the background art, namely the difficulty of the existing YOLOv8 in accurately detecting and locating densely distributed apples, the invention uses a Swin Transformer attention mechanism module with stronger feature extraction capability as the Backbone part of the improved YOLOv8 model, improving the model's detection and localization of dense targets and thereby solving the problem.
For the third problem in the background art, namely the reduced detection accuracy of the existing YOLOv8 when an apple is partially blocked by other objects, the invention generates training data containing occluded targets using the improved image-contrast data enhancement technique, improving the model's adaptability to occluded targets and thereby solving the problem.
For the fourth problem in the background art, namely the unsatisfactory detection of small targets by the existing YOLOv8, the invention uses AFPN as the feature fusion part to extract feature information at different scales. Such feature information can be used to detect objects of different sizes and improves the detection of small objects, thereby solving the problem.
For the fifth problem in the background art, namely that the classification accuracy of the existing YOLOv8 suffers on apples with similar appearance, the invention fuses two adjacent low-level features through AFPN and brings higher-level features into the fusion process asymptotically, avoiding large semantic gaps between non-adjacent levels. This reduces the semantic gap between features of different layers, improves feature fusion, and raises the classification accuracy for apples with similar features, thereby solving the problem.
For the sixth problem in the background art, namely the insufficient attention of the existing YOLOv8 to different scales and important features, the invention improves the Backbone part by introducing a Swin Transformer attention mechanism module to extract more important maturity features, improving the accuracy and generalization of the model and thereby solving the problem.
In summary, the method can accurately identify and detect the maturity and position of apples in an input image, detect the maturity of apples in their real growth state and locate them for picking. It improves the accuracy of apple maturity identification by the detection model and the feasibility of lossless localized picking in open scenes, reduces the false detection rate of the existing YOLOv8 model and improves generalization.
Drawings
FIG. 1 is a flow chart of the method of an embodiment of the present invention.
FIG. 2 is a structural diagram of the Swin Transformer attention mechanism module according to an embodiment of the present invention.
Fig. 3 shows the detection results of the conventional YOLOv8 model on fully mature apples in a natural scene.
FIG. 4 shows the detection results of the improved YOLOv8 model of an embodiment of the invention on fully mature apples in a natural scene.
Fig. 5 shows the detection results of the conventional YOLOv8 model on mixed-maturity apples in a normal scene.
FIG. 6 shows the detection results of the improved YOLOv8 model of an embodiment of the invention on mixed-maturity apples in a normal scene.
Fig. 7 shows the detection results of the conventional YOLOv8 model on mixed-maturity apples in a partially occluded scene.
FIG. 8 shows the detection results of the improved YOLOv8 model of an embodiment of the invention on mixed-maturity apples in a partially occluded scene.
Fig. 9 shows the detection results of the conventional YOLOv8 model on fully mature apples in a shaded (light-blocked) scene.
FIG. 10 shows the detection results of the improved YOLOv8 model of an embodiment of the invention on fully mature apples in a shaded (light-blocked) scene.
Fig. 11 is a diagram of the progressive structure of the asymptotic feature pyramid network (AFPN) according to an embodiment of the present invention.
Fig. 12 is a diagram of the adaptive spatial fusion structure of the asymptotic feature pyramid network (AFPN) according to an embodiment of the present invention.
Fig. 13 is a diagram of the apple maturity detection architecture of the conventional YOLOv8 model.
FIG. 14 is a diagram of the apple maturity detection architecture of the improved YOLOv8 model of this embodiment.
Detailed Description
The invention will be further described with reference to the drawings and examples.
As shown in fig. 1 and 14, the embodiment discloses an apple maturity detection method based on Swin Transformer, which comprises the following steps:
step 1, acquiring a plurality of apple images, and marking the maturity of each apple image manually based on the maturity characteristics of the apples, thereby acquiring a data set; after the data set is divided into a training set, a verification set and a test set, data enhancement is respectively carried out on the training set, the verification set and the test set.
In this embodiment, a number of apple images are randomly collected in an experimental orchard, and each apple image is normalized to a unified size and resolution. Then, using the LabelImg annotation software, each apple image is labeled on the basis of apple maturity characteristics (including the size, color, shape, texture and surface grain of the apples) with bounding boxes of three classes, fully mature, semi-mature and immature; that is, each apple in an image is classified as fully mature, semi-mature or immature according to its size, color, shape, texture and surface grain, and a corresponding bounding box is drawn, thereby obtaining the data set.
In this embodiment, the resulting data set is divided into a training set, a verification set and a test set at a ratio of 6:2:2.
In this embodiment, data enhancement is performed separately on the training set, the verification set and the test set. During enhancement, the distinguishability of different regions of an image is adjusted or reduced, so that the subsequent model can extract features across varied data and better adapt to different illumination conditions and scene changes. By applying different degrees of contrast enhancement to one image, multiple modified images are generated as enhancement data for the apple images in each data set.
Existing contrast enhancement methods include linear-transformation contrast enhancement, nonlinear-transformation contrast enhancement and others; the most common is linear-transformation contrast enhancement, in which a linear transformation adjusts the contrast and brightness of an image. It is suitable for images whose gray levels are concentrated in a narrow range; applying the linear transformation enhances the contrast and makes the image clearer.
In the existing linear-transformation contrast enhancement method, the two parameters of the contrast linear-transformation formula, the contrast value and the brightness value, are fixed, so the enhancement result is fixed, and the parameters must be adjusted manually each time a different enhanced picture is needed.
In this embodiment, the existing linear-transformation contrast enhancement method is improved by introducing two random variables m and n, where m is a random contrast value and n is a random brightness value. Controlling these two parameters with a random function during enhancement further increases the diversity of the data set. The improved contrast data enhancement formula is as follows:
G(i, j) = m × μ(i, j) + n
wherein i and j are the pixel coordinates of the original image (i the abscissa, j the ordinate), μ(i, j) denotes each pixel traversed in the original image, and the improved contrast data enhancement formula gives the new contrast output value G(i, j) of each pixel.
m and n are controlled by a random function, with m ∈ {0.3, 0.5, 0.7, 1.2, 1.4, 1.6} and n a random integer in the range [8, 12]; the contrast of the data set images is adjusted by controlling m and the brightness by controlling n.
Compared with the existing linear-transformation contrast enhancement method, this embodiment controls the random contrast value m and the random brightness value n with a random function, so that the data enhancement of the data set randomly generates different contrast values during enhancement; the method is therefore more convenient, faster and more general. Subsequent model training is shorter, the model footprint is smaller and lighter, and consistent results can be produced with less augmented data and shorter training time.
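The randomized contrast-brightness enhancement described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's own code; the function name and the clipping of results to [0, 255] are assumptions, while the candidate values of m and the range of n follow the embodiment.

```python
import random
import numpy as np

def random_contrast_brightness(image: np.ndarray) -> np.ndarray:
    """Apply G(i, j) = m * mu(i, j) + n with randomly chosen m and n."""
    m = random.choice([0.3, 0.5, 0.7, 1.2, 1.4, 1.6])  # random contrast value m
    n = random.randint(8, 12)                           # random brightness value n in [8, 12]
    out = m * image.astype(np.float32) + n              # pixel-wise linear transform
    return np.clip(out, 0, 255).astype(np.uint8)        # assumption: keep values in the valid 8-bit range

# Several differently enhanced copies of one apple image can then be generated, e.g.:
# augmented = [random_contrast_brightness(img) for _ in range(4)]
```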
Step 2, generating an improved YOLOv8 model.
The overall architecture of the existing YOLOv8 model comprises a Backbone network, a Neck part and an output layer. The overall architecture of the improved YOLOv8 model of this embodiment likewise comprises a Backbone network, a Neck part and an output layer. The improved YOLOv8 model of this embodiment differs from the existing YOLOv8 model as follows:
(1) The Backbone part of the existing YOLOv8 model is a feature extraction network responsible for extracting features from the input image. The Backbone of YOLOv8 employs a Darknet-53-style network structure containing multiple convolutional layers and pooling layers. The following is a detailed description:
Convolution layer: the convolution layer is the key component for extracting image features; it filters the input image through convolution operations to extract local features. In the Backbone section of YOLOv8, convolutional layers are stacked in a certain order to form a deep network. These layers learn features of different levels and scales, providing rich feature information for subsequent recognition tasks.
Pooling layer: the pooling layer downsamples the output of the convolution layer, reducing the dimensionality and computation of the data while retaining the important feature information. In the Backbone section of YOLOv8, the pooling layer reduces the image resolution for processing in higher layers of the network.
In processing input apple maturity images, the existing YOLOv8 model generally adopts a preprocessing step to unify the sizes and formats of different input images. The method comprises the following steps:
scaling and cropping: since the size of the input image may be different, it is necessary to scale the image to the size required by the model for unified processing. YOLOv8 typically presets an input image to be 640x640 pixels in size. If the original image is smaller than this size, magnification is required; if the original image is larger than this size, cropping is required.
Normalization: normalization converts the pixel values of an image from the integer range 0-255 to floating-point values in the range 0-1, which makes network training more stable and speeds up convergence. The normalization formula is normalized_image = (image − mean) / std, wherein mean and std are the mean and standard deviation of the image pixel values, normalized_image is the normalized image, and image is the pixel value of each pixel before normalization.
Through the preprocessing steps, the existing YOLOv8 model can uniformly process input images with different sizes and formats, extract abundant and effective characteristic information, transmit the extracted characteristic information to a subsequent characteristic fusion module, and then output maturity information by a detection module.
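A simplified sketch of this preprocessing step follows; the helper name and the per-image mean/std computation are assumptions, and only the 640x640 input size is taken from the description above.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray, size: int = 640) -> np.ndarray:
    """Scale an apple image to the model input size and normalize its pixel values."""
    resized = cv2.resize(image, (size, size)).astype(np.float32)   # unify size to 640x640
    mean, std = resized.mean(), resized.std()                      # assumption: per-image statistics
    return (resized - mean) / (std + 1e-7)                         # normalized_image = (image - mean) / std
```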
The Backbone network of the improved YOLOv8 model of this embodiment is a Swin Transformer attention mechanism module. The Swin Transformer attention mechanism module is a deep learning model based on self-attention, which can effectively capture contextual information in the image and improve the model's perception of it. By replacing the backbone network, this embodiment improves the model's ability to capture apple maturity features.
Transformers were originally applied in the field of natural language processing (NLP) and proved to have many advantages: they are powerful at global context modeling and excellent at building long-range dependencies. With the rapid development of Transformers in NLP, they have attracted great interest in computer vision. The Swin Transformer attention mechanism module is considered the first successful attempt to introduce them into computer vision; by applying a hierarchical structure similar to a CNN, it enables the Transformer model to flexibly process images of different scales.
The Swin Transformer attention mechanism module uses window self-attention to reduce computational complexity. Window self-attention is a special self-attention mechanism that considers feature interactions within a local window while ignoring information between different windows. In the Swin Transformer, each input apple image is first divided into several non-overlapping windows, and the features within each window undergo independent self-attention computation. This reduces computational complexity and speeds up model inference; however, because information interaction between different windows is ignored, the model's ability to capture global features can be affected.
To address this limitation of window self-attention, the Swin Transformer attention mechanism module also introduces shifted-window self-attention (Shifted Window Self-Attention), which passes information between different windows by sliding the window positions, improving the model's global feature representation. In the Swin Transformer, window self-attention corresponds to the patch Embedding and Transformer convolution block portions of the model structure (see FIG. 2(a)). The patch Embedding is responsible for dividing the input image into several non-overlapping windows and embedding the features within each window into a vector representation. The Transformer convolution block comprises several self-attention mechanisms and feed-forward network layers, which process the embedded features and perform the deep learning task. Within the Transformer convolution block, each self-attention mechanism computes over feature interactions within a local window, thereby implementing window self-attention.
Because self-attention after a window shift is recomputed in the shifted-window self-attention manner, the convolution blocks in the Swin Transformer attention mechanism module always come in pairs (one W-MSA plus one SW-MSA forms a pair), so the number of Swin Transformer convolution blocks of each size is even and cannot be odd. The Swin Transformer attention mechanism module performs local self-attention computation within non-overlapping window regions, which reduces the computational complexity from quadratic to linear in the number of patches. Information interaction between non-overlapping windows is then achieved with shifted-window multi-head self-attention (SW-MSA).
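A compact PyTorch sketch of the window mechanics described here, i.e. partitioning the feature map into non-overlapping windows for W-MSA and cyclically shifting it for SW-MSA. It is an illustration only, not the full Swin Transformer block; the window size of 7 and the tensor layout are assumptions.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (num_windows * B, ws*ws, C)

def window_reverse(win: torch.Tensor, ws: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition."""
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

# W-MSA: self-attention is computed independently inside each window.
# SW-MSA: the map is first cyclically shifted by half a window, so information
# can cross window boundaries, then shifted back after attention.
x = torch.randn(1, 56, 56, 96)                              # e.g. a stage-1 feature map
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))       # half of the assumed window size 7
windows = window_partition(shifted, ws=7)                   # per-window attention would run here
restored = torch.roll(window_reverse(windows, 7, 56, 56), shifts=(3, 3), dims=(1, 2))
assert torch.equal(restored, x)                             # partition/shift is fully reversible
```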
As shown in fig. 2, in the improved YOLOv8 model of this embodiment, the Swin Transformer attention mechanism module serving as the Backbone network comprises a patch Partition module, a Linear Embedding layer and an even number of convolution blocks of different sizes cascaded in sequence, where every convolution block except the first is preceded by a patch Merging layer for downsampling.
Specifically, each input apple image is first divided into blocks by the patch Partition module, with every 4x4 group of adjacent pixels forming one patch; each patch is then flattened in the channel direction into a one-dimensional vector containing the channel data of all pixels in the patch. Flattening here therefore means flattening the pixel data inside each patch into a one-dimensional vector, which is then passed as input to the Linear Embedding layer. Assuming an RGB three-channel picture is input, each patch has 4 × 4 = 16 pixels and each pixel has R, G, B values, so the flattened length is 16 × 3 = 48; the image shape is thus changed by the patch Partition from [H, W, 3] to [H/4, W/4, 48], where H and W are the height and width of the original apple image, each reduced by a factor of 4.
Then the Linear Embedding layer applies a linear transformation to the channel data of each patch: after the flattened R, G, B values are linearly transformed, they become feature vectors with C channels, so the channel dimension changes from 48 to C and the image shape changes from [H/4, W/4, 48] to [H/4, W/4, C], where C is the number of channels after transformation and is a hyperparameter.
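In common implementations, the patch Partition plus Linear Embedding step is realized as a single stride-4 convolution; the sketch below assumes a 4x4 patch and C = 96 and mirrors the [H, W, 3] → [H/4, W/4, C] shape change described above. The class name and the choice of C are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """4x4 patch partition + linear embedding, expressed as one stride-4 convolution."""
    def __init__(self, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        # Each 4x4x3 patch (48 values) is linearly projected to embed_dim channels.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=4, stride=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)               # (B, C, H/4, W/4)
        return x.permute(0, 2, 3, 1)   # (B, H/4, W/4, C), matching the [H/4, W/4, C] shape

img = torch.randn(1, 3, 640, 640)      # an apple image after preprocessing
print(PatchEmbed()(img).shape)         # torch.Size([1, 160, 160, 96])
```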
Finally, the Linear transformed data output by the Linear embedding layer is sent to a first convolution block for maturity feature extraction, and the maturity feature map extracted by the first convolution block is downsampled to the corresponding convolution block by each patch Merging layer in sequence for maturity feature extraction, which is specifically described as follows:
the first convolution block receives as input data processed by the Linear processing layer. In the first convolution block, the self-attention mechanism and feedforward neural network layer performs a series of transformations on the input data to extract maturity features. The output of the first convolution block is the extracted maturity signature.
The patch Merging layer is responsible for downsampling the output of the previous convolution block to match the input size of the next convolution block. The downsampling halves the output of the previous convolution block in the spatial dimensions while increasing the number of channels, reducing the amount of computation and the number of parameters while preserving the features.
The second convolution block receives as input the output of the first convolution block after downsampling by the patch Merging layer. In the second convolution block, the self-attention mechanism and feed-forward network layers further transform the input data to extract maturity features; its output is the extracted maturity feature map.
The subsequent convolution blocks (third and fourth in fig. 2) repeat the above process, and each convolution block receives the output of the previous convolution block and the data downsampled by the patch Merging layer as inputs, and extracts the corresponding maturity feature map. The output of the last convolution block (fourth in fig. 2) is the finally extracted maturity signature, and these signatures can be used for subsequent apple maturity detection tasks.
Throughout this data processing, the Swin Transformer performs block partitioning and flattening through the patch Partition module and then extracts features with several Transformer convolution blocks. The downsampling of the patch Merging layers gradually reduces the feature map size while increasing the number of channels, reducing computation and parameters while preserving the features. This structure gives the improved YOLOv8 model with the Swin Transformer as Backbone excellent performance and efficient processing speed on the apple maturity task.
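The patch Merging downsampling between convolution blocks can be sketched as follows: the features of every 2x2 group of neighbouring patches are concatenated (quadrupling the channels) and a linear layer reduces them, so the spatial size is halved while the channel count doubles. The layer composition and dimensions are assumptions in the spirit of the Swin Transformer design, not code from the patent.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample a (B, H, W, C) feature map to (B, H/2, W/2, 2C)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = x[:, 0::2, 0::2, :]    # top-left patch of every 2x2 group
        x1 = x[:, 1::2, 0::2, :]    # bottom-left
        x2 = x[:, 0::2, 1::2, :]    # top-right
        x3 = x[:, 1::2, 1::2, :]    # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

feat = torch.randn(1, 160, 160, 96)
print(PatchMerging(96)(feat).shape)                # torch.Size([1, 80, 80, 192])
```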
(2) The Neck of the existing YOLOv8 employs a PANet structure, as shown in the feature fusion part of FIG. 13. As can be seen from FIG. 13, the Backbone finally passes through an SPPF (SPP-Fast, Layer 9 in the figure), after which H and W have been downsampled by 32. Correspondingly, Layer 4 is downsampled by 8 and Layer 6 by 16. With a 640x640 input, the resolutions of Layer 4, Layer 6 and Layer 9 are 80 x 80, 40 x 40 and 20 x 20, respectively.
Layer 4, Layer 6 and Layer 9 are taken as inputs of the PANet structure for upsampling and channel fusion, and the three output branches of the PANet are finally fed to the detection head for loss calculation or result calculation. Unlike the FPN (unidirectional, top-down), PANet is a bidirectional path network: compared with the FPN, it introduces a bottom-up path that makes it easier for low-level information to be passed to higher layers.
In this embodiment, the Neck part of the improved YOLOv8 model is a progressive feature pyramid network AFPN, which connects the Backbone network and the output layer; the progressive feature pyramid network AFPN performs deep fusion of the different maturity feature maps extracted from each apple image by the Backbone network, by applying a self-attention mechanism across multiple scales and feature layers.
In this embodiment, to better integrate feature information of different scales and different feature layers, a new feature integration module is adopted, namely the progressive (asymptotic) feature pyramid network AFPN. The AFPN achieves deep fusion of different features by applying a self-attention mechanism across multiple scales and feature layers. It enables direct feature fusion across non-adjacent levels and suppresses information conflicts between features of different levels, preventing loss or degradation of feature information during transmission and interaction. The AFPN can thus provide richer feature expression and improve the model's recognition accuracy for apple maturity.
Existing feature pyramid networks typically upsample High-Level features generated by the Backbone network into Low-Level features.
In this embodiment, during bottom-up feature extraction in the Backbone, the progressive feature pyramid network AFPN first combines two low-level features of different resolutions in the first fusion stage, then gradually brings the high-level features into the fusion process, and finally fuses the top-level features of the Backbone. This fusion mode avoids large semantic differences between non-adjacent levels. In the process, the low-level features are fused with semantic information from the high-level features, and the high-level features are fused with detail information from the low-level features; because of their direct interaction, information loss or degradation in multi-level transmission is avoided.
Simple element-wise summation is not an effective method throughout the feature fusion of the progressive feature pyramid network AFPN, because different targets may conflict at a given position between the layers. To address this, this embodiment uses an adaptive spatial fusion operation to filter features during multi-level fusion, which preserves the information useful for fusion.
In this embodiment, the data processing process of the progressive feature pyramid network AFPN is:
(1) The AFPN first extracts the multi-level features.
As with many object detection methods based on feature pyramid networks, features of different levels need to be extracted from the Backbone before feature fusion. The framework extracts the last layer of features from each feature layer of the Backbone, resulting in a set of features of different scales denoted {C2, C3, C4, C5}, where C2 denotes the first low-level feature, C3 the second low-level feature, C4 the high-level feature, and C5 the top-level feature. To perform feature fusion, the low-level feature C3 is first input into the feature pyramid network, then C4 is added, and finally C5 is added.
After the feature fusion step, a set of multi-scale features {Q3, Q4, Q5} is generated, where Q3, Q4 and Q5 are the outputs produced by the feature pyramid network corresponding to C3, C4 and C5, respectively. A convolution with stride 2 is applied to Q5, followed by a convolution with stride 1, to generate Q6; that is, Q6 is produced from the multi-scale feature Q5 by a stride-2 convolution followed by a stride-1 convolution. This ensures a uniform output. The final set of multi-scale features is {Q3, Q4, Q5, Q6}, with corresponding feature strides of {8, 16, 32, 64} pixels.
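A minimal sketch of how the extra Q6 level and the stride set {8, 16, 32, 64} can be obtained from Q5; it only illustrates the shape bookkeeping, not the full AFPN, and the channel width of 256 and the kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

channels = 256                                                # assumed common channel width
make_q6 = nn.Sequential(
    nn.Conv2d(channels, channels, 3, stride=2, padding=1),    # stride-2 convolution applied to Q5
    nn.Conv2d(channels, channels, 3, stride=1, padding=1),    # followed by a stride-1 convolution
)

# With a 640x640 input, strides {8, 16, 32, 64} correspond to these spatial sizes:
q3 = torch.randn(1, channels, 80, 80)    # stride 8
q4 = torch.randn(1, channels, 40, 40)    # stride 16
q5 = torch.randn(1, channels, 20, 20)    # stride 32
q6 = make_q6(q5)                         # stride 64
print(q6.shape)                          # torch.Size([1, 256, 10, 10])
```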
(2) Feature fusion.
The new portion of the YOLOv8 modified in this embodiment is shown in fig. 11, and the adaptive spatial fusion structure is shown in fig. 12. The AFPN progressively integrates low-level, high-level and top-level features during the bottom-up feature extraction of the Backbone network: it first fuses the low-level features, then the deeper features, and finally the top-level, most abstract features. The semantic gap between non-adjacent hierarchical features is greater than that between adjacent ones, especially between the bottom and top features, which directly results in poor fusion of non-adjacent hierarchical features.
Because the architecture of the AFPN is progressive, the semantic information of features of different levels is brought closer together during the progressive fusion, alleviating the above problem. For example, feature fusion between C2 and C3 reduces their semantic gap, and since C3 and C4 are adjacent hierarchical features, the semantic gap between C2 and C4 is reduced in turn. To align dimensions and prepare for feature fusion, this embodiment performs downsampling with different convolution kernels and strides depending on the required downsampling rate: 2× downsampling uses a 2 x 2 convolution with stride 2, and 4× downsampling uses a 4 x 4 convolution with stride 4. After feature fusion, this embodiment continues to learn features with 4 residual units similar to those of the ResNet residual network, each residual unit comprising two 3 x 3 convolutions.
In the multi-level feature fusion process, this embodiment uses several ASFF (adaptive spatial feature fusion) modules to assign different spatial weights to features of different levels, enhancing the importance of the maturity features of the key layer and reducing the influence of conflicting information from different targets.
As shown in fig. 12, this embodiment fuses features of 3 levels for the apple maturity detection task. Let x_ij^(m→n) denote the feature vector at position (i, j) passed from level m to level n, and let y_ij^n denote the output feature vector at level n obtained by adaptive spatial fusion of the multi-level features; it is the linear combination of the feature vectors x_ij^(1→n), x_ij^(2→n) and x_ij^(3→n):
y_ij^n = α_ij^n × x_ij^(1→n) + β_ij^n × x_ij^(2→n) + γ_ij^n × x_ij^(3→n)
wherein α_ij^n, β_ij^n and γ_ij^n are the spatial weights of the apple maturity features of the 3 levels at level n, and the three spatial weights satisfy α_ij^n + β_ij^n + γ_ij^n = 1.
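The adaptive spatial fusion above can be sketched as follows: each level predicts a per-position weight map, the three maps are normalized with a softmax so that they sum to 1 at every (i, j), and the output is the weighted sum in the formula. The module name, channel width and use of 1x1 convolutions for the weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpatialFusion(nn.Module):
    """Fuse three same-resolution feature maps with learned per-position weights."""
    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 convolution per level produces a scalar weight logit at each position.
        self.weight_convs = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(3))

    def forward(self, x1, x2, x3):
        logits = torch.cat([conv(x) for conv, x in zip(self.weight_convs, (x1, x2, x3))], dim=1)
        w = F.softmax(logits, dim=1)                   # alpha + beta + gamma = 1 at every (i, j)
        alpha, beta, gamma = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return alpha * x1 + beta * x2 + gamma * x3     # weighted sum of the three levels

fused = AdaptiveSpatialFusion(256)(torch.randn(1, 256, 40, 40),
                                   torch.randn(1, 256, 40, 40),
                                   torch.randn(1, 256, 40, 40))
print(fused.shape)                                     # torch.Size([1, 256, 40, 40])
```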
Therefore, replacing the Neck part of YOLOv8 with the AFPN (asymptotic feature pyramid network) in this embodiment brings the following advantages:
A. stronger feature extraction capability: the AFPN is a novel characteristic pyramid network, and can better extract and fuse characteristic information with different scales and different layers. This helps to improve the accuracy and robustness of the model in identifying apple maturity.
B. Better semantic information extraction: with its asymptotic design, the AFPN can extract features carrying more semantic information at different scales, which helps the model better understand and identify apple maturity.
C. Higher computational efficiency: the AFPN adopts a more effective feature fusion strategy, reducing unnecessary computation and improving the computational efficiency of the model. This helps improve the real-time performance of the model in practical applications.
(3) The detection head portion (i.e., the output layer) of the existing YOLOv8 model adopts an anchor-free approach and a decoupled head (Decoupled-Head): the classification and regression heads are separated, the previous objectness branch is removed, and only decoupled classification and regression branches remain. In addition, the regression branch uses the integral-form representation proposed in Distribution Focal Loss. This design markedly improves the detection accuracy and speed of YOLOv8. For each grid cell, the detection head generates multiple candidate bounding boxes and computes their confidence scores; the generated boxes are then filtered with non-maximum suppression (NMS) to obtain the final accurate detection results.
In this embodiment, the output layer in the improved YOLOv8 model is the same as the detection head part of the existing YOLOv8 model, and the output layer outputs the deep fused features obtained by the progressive feature pyramid network AFPN, that is, the maturity prediction result.
Step 3, training the improved YOLOv8 model with the data-enhanced training set obtained in step 1; the hyperparameters of the model are adjusted with the verification set in each training period, and the generalization of the model obtained in each training period is evaluated with the test set, thereby obtaining the optimal maturity prediction model.
In this embodiment, the conventional bounding-box loss function of the existing YOLOv8 model is replaced by the MPDIoU (minimum point distance IoU) loss function.
The bounding-box regression loss of the existing YOLOv8 model uses CIoU, which is sensitive to outliers and cannot fully reflect prediction accuracy. The MPDIoU used in this embodiment, as a new loss metric, reflects the difference between the prediction box and the ground-truth box more faithfully, allows more accurate regression and evaluation for targets of different scales, densities and shapes, improves the training of bounding-box prediction, raises the accuracy and robustness of apple maturity detection, and improves convergence speed and prediction accuracy.
The loss function MPDIoU minimizes the distance between the corner points on the main diagonal of the prediction box and the real box (i.e. the labeling box in the apple maturity training set). Let the top-left and bottom-right corner coordinates of the prediction box A be (x1^A, y1^A) and (x2^A, y2^A), and those of the real box B be (x1^B, y1^B) and (x2^B, y2^B).
With w and h the width and height of the input apple image, d1 and d2 are defined as d1² = (x1^B − x1^A)² + (y1^B − y1^A)² and d2² = (x2^B − x2^A)² + (y2^B − y2^A)², and the loss metric is MPDIoU = IoU(A, B) − d1²/(w² + h²) − d2²/(w² + h²), with the bounding-box loss taken as 1 − MPDIoU.
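A sketch of the MPDIoU bounding-box loss as reconstructed from the description above (plain IoU penalized by the squared distances between the matching corner points, normalized by the squared image diagonal). The corner-format box layout and the final 1 − MPDIoU loss form are assumptions consistent with the published MPDIoU formulation, not code taken from the patent.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor, w: float, h: float) -> torch.Tensor:
    """pred and target are (..., 4) boxes in (x1, y1, x2, y2) format; w, h is the image size."""
    # Plain IoU between prediction box A and ground-truth box B
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_a = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_b = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_a + area_b - inter + 1e-7)

    # Squared distances between top-left corners (d1) and bottom-right corners (d2)
    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    mpdiou = iou - d1 / (w ** 2 + h ** 2) - d2 / (w ** 2 + h ** 2)
    return (1.0 - mpdiou).mean()          # assumed loss form: L = 1 - MPDIoU

loss = mpdiou_loss(torch.tensor([[50., 60., 150., 180.]]),
                   torch.tensor([[55., 58., 160., 175.]]), 640.0, 640.0)
```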
In this embodiment, after each training period, the performance of the model is evaluated on the verification set in a cross-validation manner and the model parameters are adjusted according to the evaluation result. The generalization of the model obtained in each training period is evaluated on the test set, thereby obtaining the optimal maturity prediction model.
Step 4, inputting the apple image to be detected into the optimal maturity prediction model obtained in step 3, and detecting the maturity of each apple in the apple image with this model.
After training, the improved YOLOv8 model of this embodiment can be deployed independently on hardware; the hardware collects the real-time apple video stream to be detected, the improved YOLOv8 algorithm detects apple maturity in real time, and the result is returned to the front end.
As shown in fig. 3 to fig. 10: fig. 3 shows that the detection accuracy of the existing model on fully mature apples in a natural scene leaves room for improvement, while fig. 4 shows that the model of this embodiment is generally about 3% more accurate on fully mature apples in a natural scene; fig. 5 shows that the existing model's accuracy on mixed-maturity apples in a normal scene needs improvement, while fig. 6 shows that the model of this embodiment improves it; fig. 7 shows that the existing model has low accuracy on mixed-maturity apples in a partially occluded scene, while fig. 8 shows that the model of this embodiment improves detection of apples occluded by leaves or other apples; fig. 9 shows that the existing model is not accurate on fully mature apples in a shaded scene, while fig. 10 shows that the model of this embodiment detects them with markedly better accuracy. In conclusion, the detection accuracy of the improved YOLOv8 model is improved, and it can effectively handle the detection of apple fruits under mixed maturity, occlusion by branches and other apples, and shading.
The preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings, and the examples described herein are merely illustrative of the preferred embodiments of the present invention and are not intended to limit the spirit and scope of the present invention. The individual technical features described in the above-described embodiments may be combined in any suitable manner without contradiction, and such combination should also be regarded as the disclosure of the present disclosure as long as it does not deviate from the idea of the present invention. The various possible combinations of the invention are not described in detail in order to avoid unnecessary repetition.
The present invention is not limited to the specific details of the above embodiments, and various modifications and improvements made by those skilled in the art to the technical solution of the present invention should fall within the protection scope of the present invention without departing from the scope of the technical concept of the present invention, and the technical content of the present invention is fully described in the claims.

Claims (5)

1. The apple maturity detection method based on the Swin Transformer is characterized by comprising the following steps of:
step 1, acquiring a plurality of apple images, and marking the maturity of each apple image manually based on the maturity characteristics of the apples, thereby acquiring a data set; after dividing the data set into a training set, a verification set and a test set, respectively carrying out data enhancement on the training set, the verification set and the test set;
step 2, generating an improved YOLOv8 model, wherein the improved YOLOv8 model comprises a Backbone network, a Neck part and an output layer, as follows:
the Backbone network of the improved YOLOv8 model is a Swin Transformer attention mechanism module, wherein the Swin Transformer attention mechanism module comprises a patch Partition module, a Linear Embedding layer and an even number of convolution blocks of different sizes cascaded in sequence; each convolution block except the first is preceded by a patch Merging layer, through which it receives downsampled features; in the Swin Transformer attention mechanism module, the patch Partition module divides each input apple image into blocks, each block is flattened in the channel direction, the Linear Embedding layer applies a linear transformation to the flattened channel data and sends it into the first convolution block for maturity feature extraction, and the maturity feature map extracted by the first convolution block is downsampled in turn through the patch Merging layers to the corresponding convolution blocks for further maturity feature extraction, each convolution block extracting a different maturity feature map;
the Neck part of the improved YOLOv8 model is a progressive feature pyramid network AFPN, which performs deep fusion of the different maturity feature maps extracted from each apple image by the Backbone network, by applying a self-attention mechanism across multiple scales and feature layers;
the output layer in the improved YOLOv8 model outputs the deep fused features obtained by the progressive feature pyramid network AFPN, namely the maturity prediction result;
step 3, training the improved YOLOv8 model with the data-enhanced training set obtained in step 1, adjusting the hyperparameters of the model with the verification set in each training period, and evaluating the generalization of the model obtained in each training period with the test set, thereby obtaining an optimal maturity prediction model;
and step 4, inputting the apple image to be detected into the optimal maturity prediction model obtained in step 3, and detecting the maturity of each apple in the apple image with this model.
2. The method for detecting apple maturity based on Swin Transformer according to claim 1, wherein in step 1, according to the size, color, shape, texture and surface-grain characteristics of the apples in the apple image, the apples are manually labeled with bounding boxes of three maturity classes, fully mature, semi-mature and immature, thereby obtaining the data set.
3. The method for detecting apple maturity based on Swin Transformer according to claim 1, wherein when performing data enhancement in step 1, based on the linear-transformation contrast enhancement method, the contrast value parameter is set as a random contrast value, the brightness value parameter is set as a random brightness value, and both random parameters are controlled by a random function, thereby completing the enhancement of the data in the training set, test set and verification set.
4. The method for apple maturity detection based on Swin Transformer according to claim 1, wherein in step 2, said progressive feature pyramid network AFPN utilizes adaptive spatial fusion operations to filter features in a multi-level fusion process.
5. The method for apple maturity detection based on Swin Transformer according to claim 1, wherein in step 3, the loss function of each training period is the MPDIoU loss function.
CN202410080533.9A 2024-01-19 2024-01-19 Apple maturity detection method based on Swin Transformer Pending CN117893823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410080533.9A CN117893823A (en) 2024-01-19 2024-01-19 Apple maturity detection method based on Swin Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410080533.9A CN117893823A (en) 2024-01-19 2024-01-19 Apple maturity detection method based on Swin Transformer

Publications (1)

Publication Number Publication Date
CN117893823A (en) 2024-04-16

Family

ID=90648611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410080533.9A Pending CN117893823A (en) 2024-01-19 2024-01-19 Apple maturity detection method based on Swin Transformer

Country Status (1)

Country Link
CN (1) CN117893823A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination