CN117095153A - Multi-mode fruit perception system, device and storage medium
- Publication number
- CN117095153A (application number CN202311057126.8A)
- Authority
- CN
- China
- Prior art keywords
- fruit
- module
- feature map
- feature extraction
- mode
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/10—Image acquisition
- G06V10/12—Details of acquisition arrangements; Constructional details thereof
- G06V10/14—Optical characteristics of the device performing the acquisition or on the illumination arrangements
- G06V10/143—Sensing or illuminating at different wavelengths
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/803—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses a multi-mode fruit perception system, device and storage medium, belonging to the technical field of computer vision. To address the poor perception performance of prior-art fruit perception models in low-light environments, the system acquires fruit plant video stream data from multiple angles and constructs a multi-mode fruit image dataset; a multi-modal visual data fusion encoder is connected into the feature extraction pyramid structure of a target detection model to obtain a multi-modal visual data fusion backbone network; the backbone network is connected to the head of the target detection model and trained to obtain a trained fruit detection module; the fruit detection module predicts on multi-mode fruit image data, and the fruit perception module processes the prediction result to obtain fruit position and category information. The system achieves high-precision detection in real complex environments and meets the requirements of lightweight edge deployment and high-precision fruit perception.
Description
Technical Field
The present invention relates to the field of computer vision, and more particularly, to a multi-modal fruit perception system, apparatus, and storage medium.
Background
Fruit detection is an important research direction in computer vision, aiming to automatically identify and detect different types of fruits and vegetables through computer algorithms and technologies. In modern agricultural production, tomatoes are a widely planted and consumed vegetable crop, and improving picking efficiency is a pressing problem for large-scale cultivation farms. At present, tomato picking and ripeness grading are carried out manually, consuming a great deal of labor. The process is time-consuming, subjective and lacks a uniform selection standard, which affects the quality of agricultural products that depend on tomatoes of different maturity.
With the rapid development of artificial intelligence, classifying fruits of different maturity using computer vision and machine learning has become possible. For example, the lightweight real-time tomato detection method based on improved YOLO and mobile deployment published by Taiheng Zeng et al. in Computers and Electronics in Agriculture proposes a lightweight tomato target detection algorithm based on YOLOv5; by replacing the Focus layer and backbone network of YOLOv5 and combining channel pruning with optimized hyper-parameters, the number of parameters is reduced, but model accuracy decreases compared with the original model. The anchor-free detector for tomato detection published by Guoxu Liu et al. in Frontiers in Plant Science introduces a convolutional block attention module into the backbone network of active-DLA34, but the detection rate decreases when the overlap or occlusion area is high. In addition, picking on overcast and rainy days or in low-light environments is common in practical picking scenarios; RGB images then contain a large amount of noise, and existing tomato detection algorithms that rely only on RGB images can hardly avoid performance degradation.
To solve this problem, recent research has explored using multi-modal data to improve algorithm performance. For example, the work on detecting immature green citrus fruit from color and thermal images published by H. Gan et al. in Computers and Electronics in Agriculture devised a new color-thermal joint probability algorithm that effectively fuses information from color and thermal images, but an ideal thermal image can only be acquired in the early morning. The improved YOLOv5 for tomato cluster detection and counting based on RGB-D fusion published by Jiachong Rong et al. in Computers and Electronics in Agriculture fuses RGB and depth images to reduce false recognition of background tomatoes, but it can only detect all tomato clusters without distinguishing maturity, so its practical value is limited. The real-time defect detection of green coffee beans using near-infrared snapshot hyperspectral imaging published by Shih-Yu Chen et al. in Computers and Electronics in Agriculture develops a deep-learning-based multi-modal real-time coffee bean defect detection algorithm, but hyperspectral imaging places high demands on the environment.
In summary, the difficulties of fruit detection in practical application environments in the prior art include: (1) Fruits vary in size, occlusion and overlap are severe, and background factors interfere in dense planting environments, which degrades detection performance; visible-light visual data are easily disturbed by external factors, making the stability of a fruit detection model hard to guarantee. (2) In selecting multi-modal data, some modalities can effectively improve detection performance but are relatively demanding to acquire, while others are easy to acquire but bring limited improvement. (3) Achieving high-performance multi-modal fusion: current fruit detection work on multi-modal visual data is divided into input fusion, feature fusion and decision fusion, and an appropriate fusion method must be chosen for each fruit perception task to improve model robustness. (4) Meeting the requirement of lightweight edge deployment while achieving high-precision fruit perception: existing fruit perception models either achieve high-precision detection but require high-performance equipment, or are lightweight but insufficiently accurate. Therefore, how to design a feature extraction architecture that efficiently and reasonably extracts and fuses input data, and a high-performance lightweight fruit perception model that performs well in real production environments, are the problems to be solved.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem that fruit perception models in the prior art perform poorly in low-light environments, the invention provides a multi-mode fruit perception system, device and storage medium. Visible light, depth and near infrared multi-modal image data are introduced, and a multi-mode fruit perception system is constructed based on a multi-modal visual data fusion encoder and a lightweight fruit detection module, realizing real-time perception and classification of fruits while supporting lightweight edge deployment and high-precision fruit perception.
2. Technical proposal
The aim of the invention is achieved by the following technical scheme.
A multi-modal fruit perception system, the system comprising a data acquisition module, a fruit detection module, and a fruit perception module;
the data acquisition module acquires video stream data of fruit plants at multiple angles and constructs a multi-mode fruit image data set;
the fruit detection module comprises a multi-mode visual data fusion encoder, a multi-mode visual data fusion backbone network and a target detection model; accessing the multi-mode visual data fusion encoder into a feature extraction pyramid structure of the target detection model to obtain a multi-mode visual data fusion backbone network, then accessing the multi-mode visual data fusion backbone network into the head of the target detection model, and then training to obtain a trained fruit detection module, and predicting multi-mode fruit image data through the fruit detection module;
and the fruit perception module processes the prediction result to obtain fruit position and category information.
Further, the fruit plants are photographed under different illumination conditions using a multi-mode sensing device to obtain visible light image data, depth image data and near infrared image data, and the visible light image data, the depth image data and the near infrared image data form the multi-mode fruit image dataset.
Further, the multi-modal visual data fusion encoder includes a visible light visual data feature extraction path, a depth and near infrared visual data feature extraction path, a residual aggregation network module, and a cross-domain attention module.
Further, the computing function of the cross-domain attention module is:
f2 = (C·SiLU(A·SiLU(f1) + B)^T + D)^T

where f1 represents a first feature map output by the cross-domain attention module, f2 represents a second feature map output by the cross-domain attention module, f2^h represents the second feature map split along the height direction, f2^w represents the second feature map split along the width direction, f3 represents the final feature map output by the cross-domain attention module, α and β each represent an independent learnable vector, δ1, δ2, δ3, δ4 and δ5 each represent an independent convolution calculation layer, W represents the width of the feature map input to the cross-domain attention module, H represents the height of that feature map, i denotes a position along its width, j denotes a position along its height, x denotes a channel of the input feature map, c represents the number of channels of the input feature map, x_c represents the feature map input to the cross-domain attention module, h denotes a column of the input feature map, w denotes a row of the input feature map, A and B each represent learnable parameters of a fully connected mapping with the same dimension as x_c, C and D each represent learnable parameters of a fully connected mapping with the same dimension as x_c^T, and T denotes a transposition of the feature map at the current position.
Further, inputting the visible light image data into a visible light visual data feature extraction path for feature extraction to obtain a first feature map;
inputting the depth image data and the near infrared image data into a depth and near infrared visual data feature extraction path for feature extraction to obtain a second feature map;
merging the first feature map and the second feature map, and inputting the merged feature map into the residual aggregation network module for calculation to obtain a third feature map;
and inputting the third feature map into a cross-domain attention module to calculate to obtain a final feature map.
Further, the multi-mode visual data fusion backbone network comprises a multi-mode visual data fusion encoder, a feature extraction pyramid structure of a target detection model and a cross-domain attention module; the feature extraction pyramid structure of the target detection model comprises a first original feature extraction layer, a second original feature extraction layer, a first maximum pooling layer and a second maximum pooling layer in a backbone network of the target detection model.
Further, after the first original feature extraction layer and the second original feature extraction layer of the target detection model are processed, a cross-domain attention module is respectively added behind the first original feature extraction layer and the second original feature extraction layer, and the first feature extraction layer and the second feature extraction layer are obtained; the multi-mode visual data fusion encoder, the second maximum pooling layer, the second feature extraction layer, the first maximum pooling layer and the first feature extraction layer form a multi-mode visual data fusion backbone network;
and taking the final feature map as a large-scale feature map output by the multi-mode visual data fusion backbone network, and sequentially passing the large-scale feature map through a second maximum pooling layer, a second feature extraction layer, a first maximum pooling layer and a first feature extraction layer to obtain a middle-scale feature map and a small-scale feature map.
Further, the multi-modal fruit image dataset is divided into training set data and verification set data; and (3) accessing the multi-mode visual data fusion backbone network into the head of the target detection model, inputting training set data for training, and inputting verification set data for evaluation to obtain the fruit perception module.
A multi-modal fruit perception device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the multi-modal fruit perception system when executing the program.
A computer readable storage medium storing computer executable instructions that when executed by a processor implement steps of the multimodal fruit perception system applied to a data acquisition module, a fruit detection module, or a fruit perception module.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
(1) In the multi-mode fruit perception system, device and storage medium, combining depth image data and near infrared image data to assist the visible light image data in fruit detection enables the system to achieve excellent detection performance in a variety of real, complex scenes.
(2) In the multi-mode fruit perception system, device and storage medium, the visible light visual data feature extraction path, the depth and near infrared visual data feature extraction path and the residual aggregation network module are combined in the multi-modal visual data fusion encoder, so that multi-modal information is fused efficiently and the encoder is robust to conflicts among multi-modal features.
(3) In the multi-mode fruit perception system, the multi-modal visual data fusion encoder and the cross-domain attention module capture cross-channel, direction-aware and position-aware information, achieving richer feature expression and helping the system locate and identify fruit targets more accurately.
(4) In the multi-mode fruit perception system, device and storage medium, the designed residual aggregation network and cross-domain attention module effectively reduce the parameter count of the fruit detection module, meeting the system's lightweight requirement on edge devices.
Drawings
FIG. 1 is a flow chart of a multi-modal fruit perception system construction in accordance with an embodiment of the present invention;
FIG. 2 is a first schematic diagram of capturing video streaming data of fruit plants according to an embodiment of the present invention;
FIG. 3 is a second schematic diagram of capturing video streaming data of fruit plants according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of training video stream data of fruit plants according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a multi-modal fruit perception system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a multi-modal visual data fusion encoder according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a multi-layer perceptron module in accordance with an embodiment of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and the accompanying specific examples.
Examples
As shown in fig. 1, this embodiment provides a multi-mode fruit perception system comprising a data acquisition module, a fruit detection module and a fruit perception module. The data acquisition module acquires video stream data of fruit plants at multiple angles and constructs a multi-mode fruit image dataset. The fruit detection module comprises a multi-modal visual data fusion encoder, a multi-modal visual data fusion backbone network and a target detection model: the multi-modal visual data fusion encoder is connected into the feature extraction pyramid structure of the target detection model to obtain the multi-modal visual data fusion backbone network, the backbone network is connected to the head of the target detection model and then trained to obtain a trained fruit detection module, and multi-mode fruit image data are predicted through the fruit detection module. The fruit perception module processes the prediction result to obtain fruit position and category information.
In this embodiment, the data acquisition module first acquires video stream data of fruit plants at multiple angles and constructs a multi-mode fruit image dataset. The fruit plants are photographed under different illumination conditions, including front lighting and backlighting on sunny and overcast days, as well as weak light, artificial light sources and sodium light sources at night. In this embodiment, images are extracted from the fruit plant video stream data at an interval of 10 frames to obtain visible light image data, depth image data and near infrared image data; the visible light, depth and near infrared image data of the same frame form one multi-mode fruit image sample, and all samples form the multi-mode fruit image dataset.
In this embodiment, the multi-mode sensing device is hand-held to collect video stream data of fruit plants at multiple angles, including different distances and different heights. The multi-mode sensing device is also fixed on a machine for data collection, so that pictures simulating a fruit picking device are acquired; the interface of the multi-mode sensing device is then used to spatially align the multi-modal data, and the data are manually annotated to construct a multi-view multi-mode fruit image dataset.
Specifically, a camera with visible light, depth and near infrared multi-modal sensing functions is used to collect multi-angle video stream data of different fruit plants in different time periods and under different illumination conditions, and the visible light, depth and near infrared visual data are extracted from the video streams. Fig. 2 shows the shooting positions of the sensing device for a single plant: the Azure Kinect device is held by hand about 1 meter from the fruit plant and moved along a row of fruit plants, shooting at different angles and different distances. The camera is kept stable during hand-held shooting to ensure clear pictures, although a small number of motion-blurred samples may be included. In addition, as shown in fig. 3, the Azure Kinect device is fixed on a mechanical vehicle with a bracket and translated along a row of fruit plants at a constant speed and a fixed height, collecting pictures that simulate the actual detection view of a picking device. The camera is moved back and forth several times to collect multi-view video stream data of several rows of fruit plants, from which a multi-view multi-mode fruit image dataset is constructed. Further, the obtained multi-angle fruit plant video stream data are traversed, the visible light image data, depth image data and near infrared image data of each frame are separated using the ffmpeg tool, and all modality data are merged and stored in RGB, Depth and IR folders in sequence. It should be noted that separating video stream data with the ffmpeg tool is prior art.
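For illustration only, the following is a minimal sketch (not part of the claimed embodiment) of the frame extraction and modality separation step described above. It assumes the Azure Kinect recording exposes the colour, depth and near infrared images as video streams 0, 1 and 2 of the MKV file (a hypothetical mapping; the actual stream order depends on the recording settings) and keeps every 10th frame as stated earlier.

```python
# Sketch: split one multi-modal Azure Kinect recording into RGB/Depth/IR image folders.
import subprocess
from pathlib import Path

def split_modalities(mkv_path: str, out_root: str, frame_interval: int = 10) -> None:
    """Extract every Nth frame of each modality stream into its own folder."""
    streams = {"RGB": 0, "Depth": 1, "IR": 2}  # assumed stream indices in the MKV
    for name, idx in streams.items():
        out_dir = Path(out_root) / name
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = [
            "ffmpeg", "-i", mkv_path,
            "-map", f"0:v:{idx}",                               # pick one modality stream
            "-vf", f"select=not(mod(n\\,{frame_interval}))",    # keep every Nth frame
            "-vsync", "vfr",
            str(out_dir / "frame_%06d.png"),
        ]
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    split_modalities("plant_row_01.mkv", "dataset/plant_row_01")
```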
In this embodiment, the Azure Kinect device is an existing device; its acquisition frame rate is 30 FPS, the resolution is 720P, the depth mode is NFOV_2x2BINNED, the inertial measurement unit is an LSM6DSMUS at 1.6 kHz, and the field of view is 90°x59°/75°x65°.
It is worth noting that, to achieve high-precision fruit detection, the training data should be diverse, and data with as much variability as possible should be collected during acquisition. Thus, the different time periods, illumination conditions and distances in this embodiment are specifically: according to the sunlight intensity at different times, the acquisition periods include 7-8 am, 10-11 am, 2-3 pm and 6-7 pm; according to the illumination conditions, data are collected under front lighting, backlighting and similar conditions on sunny and overcast days, and under weak light, artificial light sources and sodium light sources at night. Collecting such rich and varied data improves the generalization ability and robustness of the multi-mode fruit perception system.
As shown in fig. 4, the obtained multi-modal fruit plant image dataset is cleaned: repeated and invalid multi-modal fruit image samples are removed manually, the remaining samples are re-labeled, and spatial alignment is performed using the camera's built-in interface. The multi-modal fruit image dataset is traversed using the LabelMe tool, and the fruits in each visible light image are manually annotated one by one with segmentation data and maturity data. Finally, the fruit segmentation information is traversed, the circumscribed rectangle of each fruit segmentation polygon is calculated and expanded by 15 pixels upward and by 10 pixels downward, leftward and rightward, generating the fruit target detection annotation and thereby constructing the multi-mode fruit image dataset. In this embodiment, uniformly expanding the target fruit segmentation polygons outward to generate the target detection annotation boxes improves the accuracy and uniformity of the annotation data. Traversing images with the LabelMe tool and labeling fruit segmentation data is prior art.
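For illustration only, the following sketch shows how a detection box could be derived from a LabelMe segmentation polygon in the way described above: compute the circumscribed rectangle, expand it by 15 pixels upward and 10 pixels downward, leftward and rightward, and clip to the image bounds. The function and variable names are illustrative and not taken from the patent.

```python
# Sketch: turn one LabelMe segmentation polygon into an expanded detection box.
from typing import List, Tuple

def polygon_to_detection_box(
    polygon: List[Tuple[float, float]],   # LabelMe points as (x, y)
    img_w: int,
    img_h: int,
) -> Tuple[int, int, int, int]:
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # expand: 15 px up, 10 px down / left / right, then clip to the image
    x_min = max(0, int(x_min - 10))
    y_min = max(0, int(y_min - 15))
    x_max = min(img_w - 1, int(x_max + 10))
    y_max = min(img_h - 1, int(y_max + 10))
    return x_min, y_min, x_max, y_max
```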
It should be noted that, in this embodiment, if the fruit perception system is constructed by using only visible light image data, noise is easily generated due to the influence of ambient light, so that the performance of the fruit perception system is reduced. Therefore, in the embodiment, the fruit detection is performed by combining the depth and the near infrared image data to assist the visible light image data, so that excellent detection performance of the fruit perception system under various real and complex scenes can be realized.
Further, the fruit detection module comprises a multi-mode visual data fusion encoder, a multi-mode visual data fusion backbone network and a target detection model. As shown in fig. 5, the multi-mode visual data fusion encoder is connected to the feature extraction pyramid structure of the target detection model to obtain a multi-mode visual data fusion backbone network, and then the multi-mode visual data fusion backbone network is connected to the head of the target detection model for training to obtain a trained fruit detection module.
Specifically, as shown in fig. 6, the multi-modal visual data fusion encoder includes a visible light visual data feature extraction path, a depth and near infrared visual data feature extraction path, a residual aggregation network module, and a cross-domain attention module.
In this implementation, the calculation function of the cross-domain attention module is:
f2 = (C·SiLU(A·SiLU(f1) + B)^T + D)^T

where f1 represents a first feature map output by the cross-domain attention module, f2 represents a second feature map output by the cross-domain attention module, f2^h represents the second feature map split along the height direction, f2^w represents the second feature map split along the width direction, f3 represents the final feature map output by the cross-domain attention module, α and β each represent an independent learnable vector, δ1, δ2, δ3, δ4 and δ5 each represent an independent convolution calculation layer, W represents the width of the feature map input to the cross-domain attention module, H represents the height of that feature map, i denotes a position along its width, j denotes a position along its height, x denotes a channel of the input feature map, c represents the number of channels of the input feature map, x_c represents the feature map input to the cross-domain attention module, h denotes a column of the input feature map, w denotes a row of the input feature map, A and B each represent learnable parameters of a fully connected mapping with the same dimension as x_c, C and D each represent learnable parameters of a fully connected mapping with the same dimension as x_c^T, and T denotes a transposition of the feature map at the current position.
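For illustration only, the following is a minimal PyTorch sketch of the single equation given above for f2. The remaining operations of the cross-domain attention module (involving α, β, δ1-δ5 and the split feature maps f2^h and f2^w) are not reproduced here. Treating A and B as a fully connected mapping over the width axis and C and D as a fully connected mapping over the height axis, and using right-multiplication, are assumptions about how the formula maps onto a (batch, channel, height, width) tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDomainMixing(nn.Module):
    """Sketch of f2 = (C·SiLU(A·SiLU(f1) + B)^T + D)^T on a (B, C, H, W) map.

    A, B are read as a fully connected mapping over the width axis and
    C, D as a fully connected mapping over the height axis (one possible
    reading of 'same dimension as x_c' / 'x_c^T' in the text).
    """

    def __init__(self, height: int, width: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(width, width) * 0.02)
        self.B = nn.Parameter(torch.zeros(width))
        self.C = nn.Parameter(torch.randn(height, height) * 0.02)
        self.D = nn.Parameter(torch.zeros(height))

    def forward(self, f1: torch.Tensor) -> torch.Tensor:   # f1: (B, C, H, W)
        u = F.silu(F.silu(f1) @ self.A + self.B)            # mix along the width axis
        u = u.transpose(-1, -2)                             # transpose: (B, C, W, H)
        f2 = (u @ self.C + self.D).transpose(-1, -2)        # mix along the height axis, transpose back
        return f2
```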
Further, for the visible light visual data feature extraction path, comprising:
S1: constructing a Conv module using an ordinary convolution, a batch normalization layer (Batch Normalization, BN) and a SiLU (Sigmoid Linear Unit) activation function;
S2: replacing the Conv modules of the visible light visual data feature extraction path in the backbone network of the target detection model (YOLO model) with the constructed Conv module;
S3: inputting visible light image data into the visible light visual data feature extraction path and calculating to obtain a first feature map C0.
In this embodiment, the target detection model (YOLO model) used is the YOLOv7-tiny model. In the prior art, the Conv module consists of an ordinary convolution, a batch normalization layer and a LeakyReLU (Leaky Rectified Linear Unit) activation function. In the constructed Conv module of this embodiment, the LeakyReLU activation is replaced with the SiLU activation function, which multiplies the input by its Sigmoid response (SiLU(x) = x·sigmoid(x)) and has no upper bound, has a lower bound, and is smooth and non-monotonic; on deep models the SiLU activation function performs better than the LeakyReLU activation function. Thus, relative to the original LeakyReLU activation function, using the SiLU activation function improves the overall performance of the model.
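For illustration only, the constructed Conv module described above can be sketched in PyTorch as an ordinary convolution followed by batch normalization and a SiLU activation; the default kernel size and stride below are assumptions.

```python
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Sketch of the constructed Conv module: convolution + BN + SiLU (x * sigmoid(x))."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```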
As shown in fig. 6 and 7, for the depth and near infrared visual data feature extraction path, comprising:
S1: splitting the depth image data and the near infrared image data into 4x4 matrices (patches), arranging and flattening them in order from left to right and from top to bottom to obtain a preprocessed feature map M1;
S2: constructing a fully connected feature extraction residual layer using a cross-block residual layer and a cross-channel residual layer;
S3: constructing a multi-layer perceptron module using three fully connected feature extraction residual layers, inputting the preprocessed feature map M1, and calculating to obtain the second feature map M0 finally output by the multi-layer perceptron module.
In this embodiment, the depth and near infrared visual data feature extraction path is composed of a Conv module and a multi-layer perceptron module. The Conv module is used for splitting the input multi-mode fruit image, so that the computation complexity of the multi-layer perceptron module is reduced. For the multi-layer perceptron module, a full-connection feature extraction residual layer is constructed through a cross-block residual layer and a cross-channel residual layer, and then the multi-layer perceptron module is constructed through three full-connection feature extraction residual layers. Therefore, the depth and near infrared visual data feature extraction path can be used for rapidly and effectively extracting the global features of the depth and near infrared data through the built multi-layer perceptron module.
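For illustration only, the following is a hedged PyTorch sketch of the depth and near infrared feature extraction path described above: a 4x4 patch split and flatten producing M1, followed by a multi-layer perceptron module built from three fully connected feature extraction residual layers, each combining a cross-block residual layer (mixing across patches) and a cross-channel residual layer (mixing across channels). The hidden sizes, the use of a strided convolution for the patch split, and the small default input size (chosen only to keep the example small) are assumptions.

```python
import torch
import torch.nn as nn

class FCResidualLayer(nn.Module):
    """One fully connected feature extraction residual layer (assumed form)."""

    def __init__(self, num_patches: int, channels: int):
        super().__init__()
        self.cross_block = nn.Linear(num_patches, num_patches)   # mix across patches
        self.cross_channel = nn.Linear(channels, channels)       # mix across channels

    def forward(self, x):                                        # x: (B, num_patches, channels)
        x = x + self.cross_block(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.cross_channel(x)
        return x

class DepthNIRPath(nn.Module):
    def __init__(self, img_size: int = 64, in_ch: int = 2, channels: int = 64):
        super().__init__()
        patch = 4
        self.num_patches = (img_size // patch) ** 2
        # split into 4x4 patches and flatten each patch into a channel vector
        self.to_patches = nn.Conv2d(in_ch, channels, kernel_size=patch, stride=patch)
        self.mlp = nn.Sequential(*[FCResidualLayer(self.num_patches, channels)
                                   for _ in range(3)])

    def forward(self, depth_nir):                                # depth_nir: (B, 2, H, W)
        m1 = self.to_patches(depth_nir).flatten(2).transpose(1, 2)   # preprocessed map M1: (B, P, C)
        m0 = self.mlp(m1)                                            # second feature map M0
        return m0
```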
For a residual aggregation network module, comprising:
S1: using the constructed Conv module, inputting the first feature map C0 output by the visible light visual data feature extraction path and the second feature map M0 output by the multi-layer perceptron module, and calculating to obtain a feature map E1;
S2: using the constructed Conv module, inputting the sum of the obtained feature map E1 and the first feature map C0, and calculating to obtain a feature map E2;
S3: repeating the above steps to calculate feature maps E3, E4, ..., E7, E8, where E4 is obtained by adding E2 and E3 and processing the sum with a Conv module, E6 is obtained by adding E4 and E5 and processing the sum with a Conv module, and E8 is obtained by adding E6 and E7 and processing the sum with a Conv module;
S4: inputting the obtained feature maps E1, E2, ..., E7, E8 into the constructed Conv module, forming the residual aggregation network module, which outputs a third feature map E0.
In this embodiment, the residual aggregation network module is composed of eight cascaded Conv modules with the same number of feature processing channels and one Conv module with eight times that number of channels; the feature maps (E1, E2, ..., E7, E8) output by the eight cascaded Conv modules are combined along the channel dimension and input into the Conv module with eight times the number of channels to obtain the third feature map E0 output by the residual aggregation network module. It should be noted that the eight cascaded Conv modules and the Conv module with eight times the number of channels are all the constructed Conv module. Therefore, in this embodiment, using eight Conv modules with the same number of feature processing channels and one Conv module with eight times that number makes the aggregation effect of the residual aggregation network module more excellent.
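For illustration only, the following sketch shows one reading of steps S1-S4 above: eight cascaded Conv modules with the same channel count, in which every second feature map is computed from the sum of the two preceding feature maps, followed by a Conv module with eight times the channels that fuses the concatenation of E1-E8 into E0. The alternating-addition pattern, the residual input used for E2, and the output width of the fusing Conv module are assumptions; the ConvBNSiLU class from the earlier sketch is reused.

```python
import torch
import torch.nn as nn

class ResidualAggregation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.convs = nn.ModuleList([ConvBNSiLU(channels, channels) for _ in range(8)])
        self.fuse = ConvBNSiLU(8 * channels, channels)   # output channel count is assumed

    def forward(self, x):                                # x: merged C0 / M0 feature map
        feats = []
        prev = x
        for i, conv in enumerate(self.convs):
            if i % 2 == 1:
                # even-numbered maps (E2, E4, E6, E8) are built from the sum of the
                # two preceding feature maps, as in steps S2/S3 above
                inp = feats[-1] + (feats[-2] if len(feats) >= 2 else x)
                e = conv(inp)
            else:
                e = conv(prev)                           # odd-numbered maps (E1, E3, E5, E7)
            feats.append(e)
            prev = e
        return self.fuse(torch.cat(feats, dim=1))        # third feature map E0
```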
Thus, the visible light image data is input into the visible light visual data feature extraction path for feature extraction to obtain the first feature map C0; the depth image data and the near infrared image data are input into the depth and near infrared visual data feature extraction path for feature extraction to obtain the second feature map M0; the first feature map C0 and the second feature map M0 are merged and input into the residual aggregation network module to calculate the third feature map E0; and the third feature map E0 is input into the cross-domain attention module to obtain the final feature map X0.
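For illustration only, the flow just described can be tied together as a small module, reusing the sketches above; merging C0 and M0 by element-wise addition is an assumption (the text does not state whether merging is addition or concatenation).

```python
import torch.nn as nn

class MultiModalFusionEncoder(nn.Module):
    def __init__(self, visible_path, depth_nir_path, residual_agg, cross_domain_attn):
        super().__init__()
        self.visible_path = visible_path          # visible-light feature extraction path
        self.depth_nir_path = depth_nir_path      # depth + near-infrared path
        self.residual_agg = residual_agg          # residual aggregation network module
        self.cross_domain_attn = cross_domain_attn

    def forward(self, rgb, depth_nir):
        c0 = self.visible_path(rgb)               # first feature map C0
        m0 = self.depth_nir_path(depth_nir)       # second feature map M0
        e0 = self.residual_agg(c0 + m0)           # third feature map E0 (merge assumed as addition)
        x0 = self.cross_domain_attn(e0)           # final feature map X0
        return x0
```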
Further, the multi-modal visual data fusion backbone network comprises the multi-modal visual data fusion encoder, the cross-domain attention module and the feature extraction pyramid structure of the target detection model. The feature extraction pyramid structure of the target detection model comprises the first original feature extraction layer F0, the second original feature extraction layer F1, the first maximum pooling layer MP1 and the second maximum pooling layer MP2 in the backbone network of the target detection model.
Thus, after the first original feature extraction layer F0 and the second original feature extraction layer F1 of the target detection model are processed, a cross-domain attention module is added after each of them to obtain the first feature extraction layer P0 and the second feature extraction layer P1. It should be noted that processing the first original feature extraction layer F0 and the second original feature extraction layer F1 means replacing the Conv modules in F0 and F1 with the constructed Conv module. The obtained first feature extraction layer P0 and second feature extraction layer P1 can capture cross-channel, direction-aware and position-aware information, achieve richer feature expression, and suppress overfitting of the multi-layer perceptron module in the multi-modal visual data fusion encoder. Thus, the multi-modal visual data fusion encoder, the second maximum pooling layer MP2, the second feature extraction layer P1, the first maximum pooling layer MP1 and the first feature extraction layer P0 form the multi-modal visual data fusion backbone network. In this embodiment, by capturing cross-channel, direction-aware and position-aware information, the multi-modal visual data fusion backbone network achieves richer feature expression and helps the multi-modal fruit perception system locate and identify fruit targets more accurately.
In this embodiment, the cross-domain attention module in the multi-modal visual data fusion encoder serves as the bottom layer P2 of the feature extraction pyramid structure of the target detection model. The final feature map X0 is output as the large-scale feature map N2 of the multi-modal visual data fusion backbone network, and the large-scale feature map N2 is passed sequentially through the second maximum pooling layer MP2 and the second feature extraction layer P1, then through the first maximum pooling layer MP1 and the first feature extraction layer P0, to obtain the mid-scale feature map N1 and the small-scale feature map N0. Thus, the large-scale feature map N2, the mid-scale feature map N1 and the small-scale feature map N0 are the outputs of the multi-modal visual data fusion backbone network.
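For illustration only, the multi-scale forward pass of the multi-modal visual data fusion backbone network described above can be sketched as follows; the concrete definitions of the encoder and of the feature extraction layers P1 and P0 are placeholders passed in from outside, and the pooling kernel size is an assumption.

```python
import torch.nn as nn

class FusionBackbone(nn.Module):
    def __init__(self, encoder, p1, p0, pool_kernel: int = 2):
        super().__init__()
        self.encoder = encoder          # multi-modal visual data fusion encoder
        self.mp2 = nn.MaxPool2d(pool_kernel)
        self.p1 = p1                    # second feature extraction layer (with cross-domain attention)
        self.mp1 = nn.MaxPool2d(pool_kernel)
        self.p0 = p0                    # first feature extraction layer (with cross-domain attention)

    def forward(self, rgb, depth_nir):
        n2 = self.encoder(rgb, depth_nir)   # final feature map X0 used as large-scale N2
        n1 = self.p1(self.mp2(n2))          # mid-scale feature map N1
        n0 = self.p0(self.mp1(n1))          # small-scale feature map N0
        return n2, n1, n0                   # fed to the detection head of the YOLO model
```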
Further, in the fruit detection module, the fruit perception module is trained using the multimodal fruit image dataset. Specifically, after the multi-mode visual data fusion backbone network is connected to the head of the target detection model (YOLO model), training set data is input for training, and it is noted that the Conv module in the head of the target detection model (YOLO model) is replaced by the constructed Conv module, so that the fruit detection module is obtained. In this embodiment, the multimodal fruit image dataset is divided into training set data and verification set data, and the fruit perception module is trained and verified using the training set data and the verification set data.
In this embodiment, three loss functions are used to calculate the training loss, and the calculation formula is:
where Loss_Total represents the total loss function, Loss_box represents the bounding-box loss, i.e. the CIoU loss function, h_input represents the height of the input fruit plant image, w_input represents the width of the input fruit plant image, Loss_conf represents the confidence loss, i.e. a BCEWithLogitsLoss function, class represents the number of classes, and Loss_cls represents the classification loss, i.e. another BCEWithLogitsLoss function.
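The translated text omits the loss formula itself (only the term definitions survive), so the following sketch simply sums the three named components, a CIoU bounding-box loss and two BCEWithLogitsLoss terms; the equal weighting and the omission of any normalization by h_input, w_input or class are assumptions.

```python
import torch.nn as nn
from torchvision.ops import complete_box_iou_loss

bce = nn.BCEWithLogitsLoss()

def total_loss(pred_boxes, target_boxes, pred_conf, target_conf, pred_cls, target_cls):
    # boxes in (x1, y1, x2, y2) format, confidences and class scores as raw logits
    loss_box = complete_box_iou_loss(pred_boxes, target_boxes, reduction="mean")  # CIoU box loss
    loss_conf = bce(pred_conf, target_conf)    # objectness / confidence loss
    loss_cls = bce(pred_cls, target_cls)       # per-class classification loss
    return loss_box + loss_conf + loss_cls     # equal weighting assumed
```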
It should be noted that the threshold α is set to 0.09, and during training the change of the validation-set loss value after each training iteration of the fruit perception module is recorded. If the loss value stays continuously below the threshold α, the fruit perception module has reached an ideal state; if the loss value drops to a low value and then begins to rise, the fruit perception module is overfitting, and the parameters are fine-tuned before retraining.
In this embodiment, the average precision (Average Precision) is calculated by interpolation, and the fruit perception module with the highest average precision is selected. The average precision is the area under the P-R curve and serves as a criterion that balances precision (Precision) against recall (Recall). In this embodiment, the average precision is calculated as:

AP = ∫[0,1] p(r) dr
where AP represents the average precision, p represents the precision, and r represents the recall. In this embodiment, the average precision is used as the evaluation index: the multi-modal fruit image dataset is input and the performance of the fruit perception module is evaluated against a threshold β. If the average precision on the multi-modal fruit image dataset is smaller than β, the key hyper-parameters are optimized and training is run again; the final fruit perception module is obtained when the average precision on the multi-modal fruit image dataset exceeds β. The key hyper-parameters include the choice of optimizer, the learning-rate adjustment function, the number of training iterations, the initial learning rate, the data augmentation ratio, the momentum factor and the input image size. In this embodiment, the optimizer is Adam, the learning-rate adjustment function is a cosine annealing function, the number of training iterations is 200, the initial learning rate is 0.01, the data augmentation ratio is 0.9, the momentum factor is 0.937, and the input image size is 640x640 pixels.
During testing, the average precision AP of the fruit perception module on the multi-modal fruit image dataset is recorded with the intersection-over-union (IoU) threshold set to 0.5. The value of the threshold β is determined by the detection performance of existing target detection models. In this embodiment, calculated on the same data, the average precision of the YOLOv5s model is 0.9627, that of the YOLOv7-tiny model is 0.9573, that of the YOLOX-tiny model is 0.9688, and that of the YOLOv8s model is 0.9716; the threshold β is therefore set to 0.95. If the average precision exceeds the threshold β at IoU = 0.5, the accuracy of the fruit detection module meets the requirement and the ideal fruit perception module is obtained. If the average precision does not reach the threshold β at IoU = 0.5, the accuracy of the fruit perception module does not meet the requirement, and the key hyper-parameters must be reconfigured and the module retrained until the average precision exceeds the threshold β at IoU = 0.5.
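For illustration only, the following sketch computes the average precision as the area under the precision-recall curve by interpolation and compares it against the threshold β = 0.95, as described above; how the precision and recall arrays are obtained (by matching predictions to ground truth at IoU = 0.5) is outside the sketch.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Area under the P-R curve using monotone interpolation of precision."""
    # append sentinel values and make precision monotonically non-increasing
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    # integrate p(r) over the recall points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

BETA = 0.95  # acceptance threshold at IoU = 0.5, as chosen in this embodiment

def meets_requirement(precision, recall) -> bool:
    return average_precision(np.asarray(precision), np.asarray(recall)) > BETA
```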
Thus, in this embodiment, the multi-mode fruit image data is input to the fruit detection module to obtain the prediction result, and the prediction result is processed in the fruit sensing module to finally obtain the spatial position and the category information of the target fruit. In this embodiment, the fruit perception module decodes the prediction result into spatial position and category information of the fruit. It should be noted that, in this embodiment, the processing of the prediction result by the fruit perception module may be calculated by the prior art.
Therefore, the multi-mode fruit perception system provided in this embodiment overcomes the significant performance degradation of traditional detection methods in real environments with fruit overlap, branch and leaf occlusion and weak light, and meets the edge-deployment need for a high-precision lightweight algorithm. It helps improve the precision and stability of machine picking, has strong generalization ability and high detection accuracy, and provides a feasible scheme for the detection and classification of other crops.
In addition, the embodiment also provides a multi-mode fruit sensing device, which is a computer device and comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor. The steps of the multi-modal fruit perception system as in the present embodiment are implemented when the processor executes the program. The computer device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, a tower server, or a rack-mounted server (including a stand-alone server or a server cluster composed of a plurality of servers) that may execute a program, or the like. The computer device of the present embodiment includes at least, but is not limited to: a memory, a processor, and the like, which may be communicatively coupled to each other via a system bus. The memory (i.e., readable storage medium) includes flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random Access Memory (RAM), static Random Access Memory (SRAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), programmable Read Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. The memory may be an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device, or an external storage device of the computer device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the computer device. Of course, the memory may also include both internal storage units of the computer device and external storage devices. In this embodiment, the memory is typically used to store an operating system and various application software installed on the computer device. In addition, the memory can be used to temporarily store various types of data that have been output or are to be output. The processor may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device, in this embodiment, the processor is used to run program code or process data stored in a memory.
The present embodiment also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the multi-modal fruit perception system applied to a data acquisition module, a fruit detection module, or a fruit perception module.
The invention and its embodiments have been described above schematically, and the description is not limiting; the invention may be implemented in other specific forms without departing from its spirit or essential characteristics. The drawings show only one embodiment of the invention, so the actual structure is not limited thereto, and any reference numeral in the claims shall not limit the claims. Therefore, structures and embodiments similar to this technical scheme that are designed, without inventive effort, by a person of ordinary skill in the art informed by this disclosure and without departing from the gist of the invention all fall within the protection scope of this patent. In addition, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude a plurality of such elements. The various elements recited in the product claims may also be implemented in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Claims (10)
1. The multi-mode fruit sensing system is characterized by comprising a data acquisition module, a fruit detection module and a fruit sensing module;
the data acquisition module acquires video stream data of fruit plants at multiple angles and constructs a multi-mode fruit image data set;
the fruit detection module comprises a multi-mode visual data fusion encoder, a multi-mode visual data fusion backbone network and a target detection model; accessing the multi-mode visual data fusion encoder into a feature extraction pyramid structure of the target detection model to obtain a multi-mode visual data fusion backbone network, then accessing the multi-mode visual data fusion backbone network into the head of the target detection model, and then training to obtain a trained fruit detection module, and predicting multi-mode fruit image data through the fruit detection module;
and the fruit perception module processes the prediction result to obtain fruit position and category information.
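To make the three-module structure of claim 1 easier to follow, the following is an illustrative wiring sketch in Python (PyTorch). Every class, method and tensor name here is an assumption made for readability, not the patented implementation; the detector is assumed to bundle the fusion encoder, fusion backbone and detection head.

```python
# Illustrative sketch only: names, signatures and shapes are assumptions.
import torch
from torch import nn

class FruitPerceptionPipeline(nn.Module):
    def __init__(self, detector: nn.Module):
        super().__init__()
        self.detector = detector  # fusion encoder + fusion backbone + detection head

    def forward(self, rgb, depth, nir):
        depth_nir = torch.cat([depth, nir], dim=1)   # stack the non-RGB modalities
        predictions = self.detector(rgb, depth_nir)  # fruit detection module output
        return self.perceive(predictions)            # fruit perception module output

    @staticmethod
    def perceive(predictions):
        # Post-processing (e.g. confidence filtering) that yields fruit position and
        # category information; the claims do not fix the exact procedure.
        return predictions
```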
2. The multi-modal fruit perception system according to claim 1, wherein fruit plants are photographed under different lighting conditions using a multi-modal perception device to obtain visible light image data, depth image data and near-infrared image data, and the visible light image data, the depth image data and the near-infrared image data constitute the multi-modal fruit image dataset.
3. The multi-modal fruit perception system according to claim 2, wherein the multi-modal visual data fusion encoder comprises a visible light visual data feature extraction pathway, a depth and near infrared visual data feature extraction pathway, a residual aggregation network module, and a cross-domain attention module.
4. A multi-modal fruit perception system as claimed in claim 3, wherein the computation function of the cross-domain attention module is:
f_2 = (C·SiLU(A·SiLU(f_1) + B)^T + D)^T
wherein f_1 represents the first feature map output by the cross-domain attention module, f_2 represents the second feature map output by the cross-domain attention module, f_2^h represents the second feature map split along the height direction, f_2^w represents the second feature map split along the width direction, f_3 represents the final feature map output by the cross-domain attention module, α and β each represent an independent learnable vector, δ_1, δ_2, δ_3, δ_4 and δ_5 each represent an independent convolution calculation layer, W represents the width of the feature map input to the cross-domain attention module, H represents the height of the feature map input to the cross-domain attention module, i represents a certain width position of the input feature map, j represents a certain height position of the input feature map, x represents a certain channel of the input feature map, c represents the number of channels of the input feature map, x_c represents the feature map input to the cross-domain attention module, h represents a column of the input feature map, w represents a row of the input feature map, A and B each represent learnable parameters of a fully connected mapping with the same dimension as x_c, C and D each represent learnable parameters of a fully connected mapping with the same dimension as x_c^T, and T represents a transposition operation on the feature map at the current position.
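For readers who want to see the claim-4 expression in executable form, here is a hedged PyTorch sketch. It implements only the quoted formula f_2 = (C·SiLU(A·SiLU(f_1) + B)^T + D)^T; treating A and B as a fully connected mapping along the width axis and C and D as one along the transposed (height) axis is an interpretation, and the remaining symbols of the claim (f_2^h, f_2^w, f_3, α, β, δ_1 through δ_5) are not reproduced because their formulas are not given in this text.

```python
# Hedged sketch of the claim-4 core expression; the axis interpretation is an assumption.
import torch
import torch.nn.functional as F
from torch import nn

class CrossDomainAttentionCore(nn.Module):
    def __init__(self, height: int, width: int):
        super().__init__()
        self.width_fc = nn.Linear(width, width)     # A·(...) + B along the width axis
        self.height_fc = nn.Linear(height, height)  # C·(...) + D along the transposed (height) axis

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        # f1: (batch, channels, H, W)
        inner = self.width_fc(F.silu(f1))                         # A·SiLU(f1) + B
        outer = self.height_fc(F.silu(inner).transpose(-1, -2))   # C·SiLU(...)^T + D
        return outer.transpose(-1, -2)                            # outer transpose yields f2

f2 = CrossDomainAttentionCore(height=32, width=32)(torch.randn(2, 64, 32, 32))
print(f2.shape)  # torch.Size([2, 64, 32, 32])
```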
5. A multi-modal fruit perception system as claimed in claim 4 wherein,
inputting visible light image data into a visible light visual data feature extraction path to perform feature extraction to obtain a first feature map;
inputting the depth image data and the near infrared image data into a depth and near infrared visual data feature extraction path for feature extraction to obtain a second feature map;
merging the first feature map and the second feature map, and inputting the merged feature map into the residual aggregation network module for calculation to obtain a third feature map;
and inputting the third feature map into a cross-domain attention module to calculate to obtain a final feature map.
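A minimal sketch of the claim-5 data flow follows, assuming a (B, 3, H, W) visible-light tensor and a (B, 2, H, W) tensor stacking depth and near-infrared. The plain convolutions standing in for the two feature extraction paths and for the residual aggregation network module, as well as all channel counts and spatial sizes, are illustrative assumptions; the sketch reuses the CrossDomainAttentionCore class from the claim-4 sketch above.

```python
# Minimal sketch of the claim-5 flow under assumed shapes and layer choices.
import torch
from torch import nn

class FusionEncoder(nn.Module):
    def __init__(self, channels: int = 64, feat_h: int = 80, feat_w: int = 80):
        super().__init__()
        self.rgb_path = nn.Conv2d(3, channels, 3, padding=1)        # visible-light feature extraction path
        self.dn_path = nn.Conv2d(2, channels, 3, padding=1)         # depth + near-infrared feature extraction path
        self.residual_agg = nn.Conv2d(2 * channels, channels, 1)    # stand-in for the residual aggregation network module
        self.cross_attn = CrossDomainAttentionCore(feat_h, feat_w)  # claim-4 sketch above

    def forward(self, rgb, depth_nir):
        f_rgb = self.rgb_path(rgb)                # first feature map
        f_dn = self.dn_path(depth_nir)            # second feature map
        merged = torch.cat([f_rgb, f_dn], dim=1)  # merge along the channel dimension
        f3 = self.residual_agg(merged)            # third feature map
        return self.cross_attn(f3)                # final feature map
```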
6. The multi-modal fruit perception system according to claim 5, wherein the multi-modal visual data fusion backbone network comprises a multi-modal visual data fusion encoder, a feature extraction pyramid structure of a target detection model, and a cross-domain attention module; the feature extraction pyramid structure of the target detection model comprises a first original feature extraction layer, a second original feature extraction layer, a first maximum pooling layer and a second maximum pooling layer in a backbone network of the target detection model.
7. A multi-modal fruit perception system as claimed in claim 6 wherein,
a cross-domain attention module is added after each of the first original feature extraction layer and the second original feature extraction layer of the target detection model, yielding the first feature extraction layer and the second feature extraction layer, respectively; the multi-mode visual data fusion encoder, the second maximum pooling layer, the second feature extraction layer, the first maximum pooling layer and the first feature extraction layer form the multi-mode visual data fusion backbone network;
and the final feature map is taken as the large-scale feature map output by the multi-mode visual data fusion backbone network, and the large-scale feature map is passed sequentially through the second maximum pooling layer, the second feature extraction layer, the first maximum pooling layer and the first feature extraction layer to obtain a middle-scale feature map and a small-scale feature map.
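The multi-scale flow of claims 6 and 7 can be sketched as below: the encoder's final feature map is treated as the large-scale output and two (max pooling then feature extraction) stages yield the middle- and small-scale outputs. Using plain convolutions for the two feature extraction layers (each of which, per claim 7, is an original extraction layer followed by a cross-domain attention module) and the channel counts are assumptions for illustration.

```python
# Hedged sketch of the claims 6-7 backbone producing three feature-map scales.
import torch
from torch import nn

class FusionBackbone(nn.Module):
    def __init__(self, encoder: nn.Module, channels: int = 64):
        super().__init__()
        self.encoder = encoder                                       # multi-mode visual data fusion encoder
        self.pool2 = nn.MaxPool2d(2)                                 # second maximum pooling layer
        self.extract2 = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for the second feature extraction layer
        self.pool1 = nn.MaxPool2d(2)                                 # first maximum pooling layer
        self.extract1 = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for the first feature extraction layer

    def forward(self, rgb, depth_nir):
        large = self.encoder(rgb, depth_nir)       # large-scale feature map
        middle = self.extract2(self.pool2(large))  # middle-scale feature map
        small = self.extract1(self.pool1(middle))  # small-scale feature map
        return large, middle, small                # passed to the detection head
```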
8. The multi-modal fruit perception system according to claim 7, wherein the multi-modal fruit image dataset is divided into training set data and verification set data; the multi-mode visual data fusion backbone network is connected to the head of the target detection model, the training set data are input for training, and the verification set data are input for evaluation to obtain the fruit perception module.
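A minimal sketch of the claim-8 training and evaluation procedure follows, under assumed conventions: the split ratio, batch size, optimizer and the assumption that the model returns a training loss when targets are supplied are all illustrative, not specified by the claim.

```python
# Minimal sketch of the claim-8 train/validate procedure under assumed conventions.
import torch
from torch.utils.data import DataLoader, random_split

def train_and_evaluate(model, dataset, epochs: int = 50, val_ratio: float = 0.2):
    n_val = int(len(dataset) * val_ratio)
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    for _ in range(epochs):
        model.train()
        for rgb, depth_nir, targets in train_loader:
            loss = model(rgb, depth_nir, targets)   # assumed: model returns a loss in training mode
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            for rgb, depth_nir, targets in val_loader:
                _ = model(rgb, depth_nir)           # accumulate detection metrics (e.g. mAP) here
    return model
```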
9. A multimodal fruit perception device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the multimodal fruit perception system of any of claims 1-8 when the program is executed.
10. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the multimodal fruit perception system of any of claims 1-8 applied to a data acquisition module, a fruit detection module, or a fruit perception module.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311035733 | 2023-08-15 | ||
CN2023110357334 | 2023-08-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117095153A true CN117095153A (en) | 2023-11-21 |
Family
ID=88779940
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311057126.8A Pending CN117095153A (en) | 2023-08-15 | 2023-08-21 | Multi-mode fruit perception system, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117095153A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117975172A (en) * | 2024-03-29 | 2024-05-03 | 安徽农业大学 | Method and system for constructing and training whole pod recognition model |
CN118038451A (en) * | 2024-04-11 | 2024-05-14 | 安徽农业大学 | Open world fruit detection model construction method, detection method and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Koirala et al. | Deep learning–Method overview and review of use for fruit detection and yield estimation | |
Gai et al. | A detection algorithm for cherry fruits based on the improved YOLO-v4 model | |
Santos et al. | Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association | |
Mou et al. | Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images | |
Tang et al. | Fruit detection and positioning technology for a Camellia oleifera C. Abel orchard based on improved YOLOv4-tiny model and binocular stereo vision | |
Maheswari et al. | Intelligent fruit yield estimation for orchards using deep learning based semantic segmentation techniques—a review | |
Lu et al. | Canopy-attention-YOLOv4-based immature/mature apple fruit detection on dense-foliage tree architectures for early crop load estimation | |
CN113065558A (en) | Lightweight small target detection method combined with attention mechanism | |
CN117095153A (en) | Multi-mode fruit perception system, device and storage medium | |
CN110689021A (en) | Real-time target detection method in low-visibility environment based on deep learning | |
Jia et al. | A fast and efficient green apple object detection model based on Foveabox | |
Shuai et al. | An improved YOLOv5-based method for multi-species tea shoot detection and picking point location in complex backgrounds | |
Pratama et al. | Deep learning-based object detection for crop monitoring in soybean fields | |
Ariza-Sentís et al. | Object detection and tracking on UAV RGB videos for early extraction of grape phenotypic traits | |
Olenskyj et al. | End-to-end deep learning for directly estimating grape yield from ground-based imagery | |
Li et al. | A novel approach for the 3D localization of branch picking points based on deep learning applied to longan harvesting UAVs | |
CN113822198A (en) | Peanut growth monitoring method, system and medium based on UAV-RGB image and deep learning | |
Kumar et al. | Drone-based apple detection: Finding the depth of apples using YOLOv7 architecture with multi-head attention mechanism | |
Charisis et al. | Deep learning-based instance segmentation architectures in agriculture: A review of the scopes and challenges | |
Yu et al. | Maize tassel number and tasseling stage monitoring based on near-ground and UAV RGB images by improved YoloV8 | |
Ban et al. | A lightweight model based on YOLOv8n in wheat spike detection | |
Fang et al. | Classification system study of soybean leaf disease based on deep learning | |
CN114037737B (en) | Neural network-based offshore submarine fish detection and tracking statistical method | |
Shi et al. | Small object detection algorithm incorporating swin transformer for tea buds | |
Meyer et al. | FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||