CN116977862A - Video detection method for plant growth stage - Google Patents

Video detection method for plant growth stage

Info

Publication number
CN116977862A
Authority
CN
China
Prior art keywords
layer
feature
action
plant growth
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311066878.0A
Other languages
Chinese (zh)
Inventor
谷延锋
王远航
高国明
张志琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202311066878.0A priority Critical patent/CN116977862A/en
Publication of CN116977862A publication Critical patent/CN116977862A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/188Vegetation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video detection method for the plant growth stage, and relates to the field of plant growth stage video detection. The invention aims to solve the problem that existing plant growth stage detection algorithms cannot accurately divide the plant growth stages. The process is as follows: 1: constructing a neural network model to obtain a trained neural network model; the process is as follows: 11: acquiring a training set; 12: constructing an I3D network, inputting the plant growth stage video into the I3D network, and extracting the plant growth stage video feature sequence; 13: encoding the feature sequence with a Transformer network encoder, and outputting the action category and the action boundary with the action classification head and the boundary regression head of the decoder, respectively; 14: obtaining the trained neural network model; 2: obtaining the action category and action boundary corresponding to the plant growth stage video to be detected. The invention belongs to the field of temporal action detection.

Description

Video detection method for plant growth stage
Technical Field
The invention belongs to the field of temporal action detection, relates to video detection of the plant growth stage, and in particular relates to a Transformer-based method for detecting the plant growth stage in video.
Background
Agriculture is the foundation of human survival and the source of food and clothing. Since ancient times, research on grain and cash crop production has never stopped, continuously promoting the progress and development of agriculture. Timely and effective detection of the growth and development stage of crops plays an important role in breeding and cultivating crops and in protecting and utilizing agricultural resources. By detecting the growth stage a plant is in, appropriate management measures can be taken to promote healthy growth of the plant and maximize yield. For example, plants have different requirements on environmental conditions at different growth stages; by detecting the growth stage of a plant, the optimal conditions of illumination, temperature, humidity and nutrition required by the plant can be determined, so that a suitable environment is provided to promote its growth and development. Different crops also have different harvesting times; by detecting the growth stage of the plants, their growth speed and development progress can be predicted, helping farmers and researchers accurately predict the harvesting time and make harvesting arrangements.
The method for detecting the plant growth stage in the traditional agriculture is simple in operation, low in cost and suitable for different types of plants to a certain extent by observing the external morphology and characteristics of the plants so as to perform preliminary judgment on the plant growth stage. However, the conventional method relies on subjective judgment of observers, different observers have different understanding and evaluation on the morphology and characteristics of plants, and there are cases of differences and misjudgment between individuals; the traditional method needs to spend a great deal of time and labor to observe and measure the plant samples, and has great workload; conventional methods often fail to provide accurate growth stage information, especially in situations where ambiguous regions exist between different stages of plant development, judgment can be difficult.
Modern agriculture uses Internet of Things technology and advanced image processing algorithms to obtain more accurate plant growth stage information. The growth state and health of a plant can be assessed by measuring its physiological parameters; large farmland areas can be monitored in real time by remote sensing, providing information on the growth state of crops; and by building agricultural data models and algorithms, the growth stage of crops, the risk of plant diseases and insect pests, and optimal agricultural management measures can be predicted. Compared with traditional agriculture, modern agriculture can identify and classify the shape, structure and physiological characteristics of plants more objectively by using modern information technology, reducing the possibility of misjudgment; compared with traditional manual observation and measurement, automated techniques enable high-throughput data collection and processing, improving work efficiency and reducing labor costs.
However, current plant growth stage detection still faces some challenges and problems. For example, deep learning based plant growth stage detection methods require a large amount of marker data for training a network model. Manually labeling plant growth stages is a time consuming and subjective task, and may have inconsistencies and errors, and building a high quality, accurately labeled training set remains a challenge. In addition, plants can have very different morphological and growth characteristics under different varieties and environmental conditions. This makes it difficult to develop detection algorithms suitable for various plants and environments. Apart from that, there are fuzzy transition regions between the growth phases of the plant, which are difficult to divide accurately, and the accuracy of the detection algorithm at these phases is limited.
Disclosure of Invention
The invention aims to solve the problem that the existing plant growth stage detection algorithm cannot accurately divide the plant growth stage, and provides a plant growth stage video detection method.
The video detection method for the plant growth stage comprises the following specific processes:
step 1: constructing a neural network model to obtain a trained neural network model; the specific process is as follows:
step 11: collecting a plant growth stage video and action categories and action boundaries of the corresponding plant growth stage video as a training set;
the action category refers to a germination period, a seedling period, a flowering period and a fruiting period;
the action boundary refers to the moment when the action starts and ends, and the action refers to the germination period, the seedling period, the flowering period and the fruiting period;
step 12: constructing an I3D network, inputting a plant growth stage video into the I3D network, and extracting a plant growth stage video feature sequence;
step 13: encoding the feature sequence with a Transformer network encoder, and outputting the action category and the action boundary with the action classification head and the boundary regression head of the decoder, respectively;
step 14: obtaining a trained neural network model;
step 2: inputting the video of the plant growth stage to be detected into a trained neural network model to obtain the action category and action boundary corresponding to the video of the plant growth stage to be detected.
The beneficial effects of the invention are as follows:
in view of the above problems, the present invention aims to locate when each growth action of a plant (leaf growth, flowering, fruiting, etc.) occurs and then identify the category to which each action belongs. The invention provides a Transformer-based plant growth stage video detection method, which uses an I3D network to extract plant growth stage video features and a Transformer to detect the plant growth stage, improving detection accuracy and realizing accurate detection of the plant growth stage.
The invention provides a plant growth stage video detection method, which uses an I3D network to extract video features of the plant growth stage and a Transformer network to detect the plant growth stage, so as to realize accurate detection of the plant growth stage.
Aiming at the problems of varying time-period scales and fuzzy time-period boundaries faced by plant growth stage video detection, the method uses the multi-scale pyramid structure of the Transformer and a local self-attention mechanism. The method realizes accurate detection of the plant growth stage, alleviates the problem of blurred boundaries between growth stages, and improves detection accuracy.
In order to verify the performance of the proposed algorithm, simulation experiments were carried out on the constructed plant growth stage detection dataset, and the experimental results verify the effectiveness of the Transformer-based plant growth stage video detection method.
Drawings
FIG. 1 is a schematic flow diagram of an implementation of the present invention;
FIG. 2 is a diagram of the body structure of an I3D network;
FIG. 3 is a block diagram of an Inception module of the I3D network;
FIG. 4 is a diagram of the architecture of the Transformer network;
FIG. 5 is a diagram of the Transformer encoder structure;
FIG. 6 is a diagram of the Transformer decoder;
FIG. 7 is a diagram of a visual comparison of timing behavior detection values and action instance truth for the method of the present invention.
Table 1 is a specific composition of the plant growth stage detection dataset;
table 2 compares the temporal action detection accuracy of the method of the present invention with the accuracy of other comparison methods (E2E-TAD, R-C3D, BasicTAD).
Detailed Description
The first embodiment is as follows: the video detection method for the plant growth stage in the embodiment comprises the following specific processes:
step 1: constructing a neural network model to obtain a trained neural network model; the specific process is as follows:
step 11: collecting a plant growth stage video and action categories and action boundaries of the corresponding plant growth stage video as a training set (input step 12);
the action category refers to a germination period, a seedling period, a flowering period and a fruiting period;
the action boundary refers to the moment when the action starts and ends, and the action refers to the germination period, the seedling period, the flowering period and the fruiting period;
step 12: constructing an I3D network, inputting a plant growth stage video into the I3D network, and extracting a plant growth stage video feature sequence;
step 13: encoding the feature sequence with a Transformer network encoder, and outputting the action category and the action boundary with the action classification head and the boundary regression head of the decoder, respectively;
step 14: obtaining a trained neural network model;
step 2: inputting the video of the plant growth stage to be detected into a trained neural network model to obtain the action category and action boundary corresponding to the video of the plant growth stage to be detected.
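To make the two-step pipeline above concrete, the following is a minimal inference sketch in PyTorch. It is not the patent's implementation: `i3d_backbone` and `actionformer_detector` are hypothetical stand-ins for the trained feature extractor of step 12 and the Transformer detector of step 13, and the clip length of 16 frames is an assumption.

```python
import torch

@torch.no_grad()
def detect_growth_stages(video_clips, i3d_backbone, actionformer_detector):
    """video_clips: tensor (T, 3, 16, H, W) -- T short clips of 16 RGB frames each."""
    # Step 12: one feature vector per clip from the I3D backbone.
    feats = torch.stack([i3d_backbone(clip.unsqueeze(0)).squeeze(0)
                         for clip in video_clips])                    # (T, C_feat)
    # Step 13: Transformer encoder plus classification / regression heads.
    cls_prob, d_start, d_end = actionformer_detector(feats.unsqueeze(0))  # (1,T,C), (1,T), (1,T)
    # Step 2: decode each time step t into (start, end, category).
    t = torch.arange(cls_prob.shape[1], dtype=torch.float32)
    starts, ends = t - d_start[0], t + d_end[0]
    labels = cls_prob[0].argmax(dim=-1)
    return starts, ends, labels
```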
The second embodiment: this embodiment differs from the first embodiment in that: in step 11, a plant growth stage video and the action categories and action boundaries of the corresponding plant growth stage video are collected as the training set; the specific process is as follows:
only time steps lying within the interval around an action instance center are considered positive samples, where the length of this center interval (set manually) is proportional to the feature stride of the pyramid level;
positive and negative samples are divided by a center sampling strategy;
given an action instance centered at c, any time step t satisfying |t − c| ≤ α·s^l is considered a positive sample;
where c is the action instance center (the midpoint of the start time and end time); α is a parameter, α = 1.5; s^l is the feature stride (sampling rate) of the feature pyramid at level l; the time steps are t = {1, 2, …, T}, where the total duration T varies from video to video;
each action instance consists of a start time s_i, an end time e_i and an action label a_i, where s_i ∈ [1, T], e_i ∈ [1, T], a_i ∈ {1, …, C}, C is the predefined number of action categories, and s_i < e_i.
Center sampling does not affect the result of model inference; it is used only during training so that higher accuracy can be achieved around the action center.
When selecting negative samples, the IoU value is not used; instead, an input video segment (a temporal proposal, e.g. the video from 3 s to 10 s) is considered a negative sample when its time span overlaps with all annotated action instances by less than 5%. The reason is that incorrectly located samples, e.g. samples covering only a small part of an action, may also have a low IoU value; if they were treated as negative samples, the action classifier could be severely confused and the classification accuracy would degrade, because such samples lie inside an action instance and may still contain a portion of highly discriminative action features. With the modified criterion described above, such samples are kept out of the negative training set, so that the action classifier can focus on distinguishing the actions of interest from the background.
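A minimal sketch of the two sampling rules just described, under the assumption that a positive time step lies within α·s^l of an instance center (and inside the instance), with α = 1.5; the function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def label_time_steps(T, actions, stride, alpha=1.5):
    """actions: list of (s_i, e_i, a_i) with 1 <= s_i < e_i <= T; stride = s^l of one pyramid level."""
    labels = np.zeros(T, dtype=np.int64)                 # 0 = background
    for s, e, a in actions:
        c = 0.5 * (s + e)                                # action-instance center
        for t in range(1, T + 1):
            if abs(t - c) <= alpha * stride and s <= t <= e:
                labels[t - 1] = a                        # positive sample for class a
    return labels

def is_negative_proposal(p_start, p_end, actions, max_overlap=0.05):
    """A temporal proposal counts as negative only if it overlaps every
    annotated instance by less than 5% of its own span."""
    span = float(p_end - p_start)
    for s, e, _ in actions:
        overlap = max(0.0, min(p_end, e) - max(p_start, s))
        if overlap / span >= max_overlap:
            return False
    return True
```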
The third embodiment: this embodiment differs from the first or second embodiment in that: in step 12, an I3D network is constructed, the plant growth stage video is input into the I3D network, and the plant growth stage video feature sequence is extracted; the specific process is as follows:
I3D adopts a relatively high sampling rate, can capture the fine-grained temporal structure of motion, and obtains a higher temporal detection score (mAP) at high IoU thresholds. Therefore, the I3D network is chosen as the backbone of the video encoder to extract plant growth stage features.
There are many classical models (e.g. Inception, ResNet) among two-dimensional convolutional neural networks (CNNs). Building on a 2D CNN model, and to better solve the video understanding problem, the existing 2D convolution kernels are expanded into 3D convolution kernels by adding a time dimension, e.g. a k×k kernel is expanded into k×k×k, thereby extending the 2D CNN into a 3D CNN suitable for video understanding tasks. To ensure that the output response of the convolution kernel does not change, the two-dimensional kernel is repeated k times along the time axis and then divided by the number of repetitions k as a normalization. With this operation, the parameters of a two-dimensional convolutional network can be directly expanded to three dimensions; on this basis, the 3D model parameters are initialized with the 2D model parameters, which effectively avoids the trouble of training the 3D network from scratch and greatly saves manpower and material resources.
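A short sketch of the kernel-inflation step described above, assuming the PyTorch weight layout (out_channels, in_channels, k, k); the function name is illustrative.

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor) -> torch.Tensor:
    """Inflate a pretrained 2-D kernel (O, I, k, k) into a 3-D kernel (O, I, k, k, k):
    repeat it k times along the new temporal axis and divide by k so that the
    response to a temporally constant input is unchanged."""
    k = w2d.shape[-1]
    return w2d.unsqueeze(2).repeat(1, 1, k, 1, 1) / k
```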
To initialize the parameters of the I3D model, an image classification network is pre-trained on the large image dataset ImageNet, the trained network parameters are inflated to three dimensions, and the I3D model is then fine-tuned on the constructed plant growth stage detection dataset to extract the plant video feature sequence. The network body used by I3D is Inception-V1, and the network structure is shown in figure 2.
The I3D network comprises, in order: a first convolution layer, a first maximum pooling layer, a second convolution layer, a third convolution layer, a second maximum pooling layer, a first Inception module, a second Inception module, a third maximum pooling layer, a third Inception module, a fourth Inception module, a fifth Inception module, a sixth Inception module, a seventh Inception module, a fourth maximum pooling layer, an eighth Inception module, a ninth Inception module, a first average pooling layer and a fourth convolution layer;
the connection relation of the I3D network is as follows:
the plant growth stage video is input in turn into the first convolution layer, the first maximum pooling layer, the second convolution layer, the third convolution layer, the second maximum pooling layer, the first Inception module, the second Inception module, the third maximum pooling layer, the third Inception module, the fourth Inception module, the fifth Inception module, the sixth Inception module, the seventh Inception module, the fourth maximum pooling layer, the eighth Inception module, the ninth Inception module, the first average pooling layer and the fourth convolution layer, and the fourth convolution layer outputs the plant growth stage video feature sequence;
Inception-V1 consists of several convolution layers, pooling layers and a series of Inception modules; each Inception module has a fixed output in the time dimension, and non-linear and pooling layers are introduced after the convolution layers. In Inception-V1, the convolution kernel of the first convolution layer has size 7×7×7 with stride 2, followed by four maximum pooling layers with stride 2 and a 2×7×7 average pooling layer before the last linear classification layer. In the first two maximum pooling layers, kernels of size 1×3×3 with a temporal stride of 1 are used, so no temporal pooling is performed, while the other maximum pooling layers have symmetric kernels and strides.
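The layer ordering above can be expressed as the following structural sketch. It is an assumption-laden outline, not the patent's code: the channel widths, paddings and the unit strides of the intermediate convolutions follow the publicly known Inception-V1/I3D design, and `inception3d` stands for any factory returning a 3-D Inception block (one possible block is sketched after the module description below).

```python
import torch.nn as nn

def build_i3d_trunk(inception3d):
    """inception3d(in_ch, out_ch) -> nn.Module; returns the layer sequence of step 12."""
    return nn.Sequential(
        nn.Conv3d(3, 64, kernel_size=7, stride=2, padding=3),                       # conv 1
        nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),   # max pool 1
        nn.Conv3d(64, 64, kernel_size=1),                                           # conv 2
        nn.Conv3d(64, 192, kernel_size=3, padding=1),                               # conv 3
        nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),   # max pool 2
        inception3d(192, 256), inception3d(256, 480),                               # Inception 1-2
        nn.MaxPool3d(kernel_size=3, stride=2, padding=1),                           # max pool 3
        inception3d(480, 512), inception3d(512, 512), inception3d(512, 512),
        inception3d(512, 528), inception3d(528, 832),                               # Inception 3-7
        nn.MaxPool3d(kernel_size=2, stride=2),                                      # max pool 4
        inception3d(832, 832), inception3d(832, 1024),                              # Inception 8-9
        nn.AvgPool3d(kernel_size=(2, 7, 7), stride=1),                              # average pool
        nn.Conv3d(1024, 1024, kernel_size=1),                                       # conv 4 (feature output)
    )
```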
Other steps and parameters are the same as in the first or second embodiment.
The fourth embodiment: this embodiment differs from one of the first to third embodiments in that: the convolution kernel size of the first convolution layer is 7×7×7, with a stride of 2;
the convolution kernel size of the second convolution layer is 1×1×1, with a stride of 2;
the convolution kernel size of the third convolution layer is 3×3×3, with a stride of 2;
the convolution kernel size of the fourth convolution layer is 1×1×1, with a stride of 2.
Other steps and parameters are the same as in one of the first to third embodiments.
The fifth embodiment: this embodiment differs from one of the first to fourth embodiments in that: the size of the first maximum pooling layer is 1×3×3, with a stride of 2 in the x dimension, 2 in the y dimension and 1 in the time dimension;
the input of the invention is video, which is three-dimensional; the three dimensions comprise two spatial dimensions and one time dimension, the x and y dimensions being the spatial ones;
the size of the second maximum pooling layer is 1×3×3, with a stride of 2 in the x dimension, 2 in the y dimension and 1 in the time dimension;
the size of the third maximum pooling layer is 3×3×3, with a stride of 2 in the x, y and time dimensions;
the size of the fourth maximum pooling layer is 2×2×2, with a stride of 2 in the x, y and time dimensions;
the size of the first average pooling layer is 2×7×7, with a stride of 1 in the x, y and time dimensions;
the receptive field size of each pixel on the feature map output by the first maximum pooling layer is 7×11×11; the receptive field is the region of the input picture to which a pixel of the feature map output by a layer of the convolutional neural network is mapped (a worked check of the first figures follows this list);
the receptive field size of each pixel on the feature map output by the second maximum pooling layer is 11×27×27;
the receptive field size of each pixel on the feature map output by the third maximum pooling layer is 23×75×75;
the receptive field size of each pixel on the feature map output by the fourth maximum pooling layer is 59×219×219;
the receptive field size of each pixel on the feature map output by the first average pooling layer is 99×539×539.
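As a worked check, the first two receptive-field figures follow from the standard recursion RF_n = RF_{n−1} + (k_n − 1)·J_{n−1}, J_n = J_{n−1}·s_n, assuming (as in the public I3D design) that the second and third convolution layers use unit stride:

```latex
% receptive field RF and jump J per dimension (time, height, width)
\[
\begin{aligned}
&\text{conv}_1\ (7\times7\times7,\ s{=}2): & RF &= 7\times7\times7, & J &= (2,2,2)\\
&\text{pool}_1\ (1\times3\times3,\ s{=}(1,2,2)): & RF &= 7\times11\times11, & J &= (2,4,4)\\
&\text{conv}_2\ (1\times1\times1),\ \text{conv}_3\ (3\times3\times3): & RF &= 11\times19\times19, & J &= (2,4,4)\\
&\text{pool}_2\ (1\times3\times3,\ s{=}(1,2,2)): & RF &= 11\times27\times27, & J &= (2,8,8)
\end{aligned}
\]
```

This reproduces the 7×11×11 and 11×27×27 figures quoted above.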
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth embodiment: this embodiment differs from one of the first to fifth embodiments in that: each of the first, second, third, fourth, fifth, sixth, seventh, eighth and ninth Inception modules comprises:
a fifth convolution layer, a sixth convolution layer, a fifth maximum pooling layer, a seventh convolution layer, an eighth convolution layer, a ninth convolution layer and a tenth convolution layer;
the connection relation of the Inception module is as follows:
the feature A is input into the fifth convolution layer, the sixth convolution layer, the fifth maximum pooling layer and the seventh convolution layer, respectively;
the output feature of the fifth convolution layer is input into the eighth convolution layer;
the output feature of the sixth convolution layer is input into the ninth convolution layer;
the output feature of the fifth maximum pooling layer is input into the tenth convolution layer;
the output features of the seventh, eighth, ninth and tenth convolution layers are fused, and the fused feature is the output of the Inception module;
the convolution kernel size of the fifth convolution layer is 1×1×1;
the convolution kernel size of the sixth convolution layer is 1×1×1;
the convolution kernel size of the seventh convolution layer is 1×1×1;
the convolution kernel size of the eighth convolution layer is 3×3×3;
the convolution kernel size of the ninth convolution layer is 3×3×3;
the convolution kernel size of the tenth convolution layer is 1×1×1;
the size of the fifth maximum pooling layer is 3×3×3.
The structure of each Inception module is shown in fig. 3. The Inception module does not improve the network simply by stacking more layers; instead it decomposes the convolution: several convolution kernels of different sizes (1×1×1 and 3×3×3) are designed in each layer, the convolution operations are carried out on the input image or feature map separately, and the feature-map results obtained with the different kernel sizes are finally fused together. Operating on the input with several convolution kernels at the same time yields a wider receptive field than a single kernel, while the 1×1×1 convolutions reduce the channel dimension and thus the amount of computation during convolution.
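A minimal PyTorch sketch of the four-branch 3-D Inception module described above; only the kernel sizes and the connection pattern come from the text, while the branch channel widths are free parameters (assumptions).

```python
import torch
import torch.nn as nn

class InceptionModule3D(nn.Module):
    def __init__(self, in_ch, b1, b2_red, b2, b3_red, b3, b4):
        super().__init__()
        self.branch1 = nn.Conv3d(in_ch, b1, kernel_size=1)            # 7th conv (1x1x1)
        self.branch2 = nn.Sequential(                                  # 5th conv -> 8th conv
            nn.Conv3d(in_ch, b2_red, kernel_size=1),
            nn.Conv3d(b2_red, b2, kernel_size=3, padding=1))
        self.branch3 = nn.Sequential(                                  # 6th conv -> 9th conv
            nn.Conv3d(in_ch, b3_red, kernel_size=1),
            nn.Conv3d(b3_red, b3, kernel_size=3, padding=1))
        self.branch4 = nn.Sequential(                                  # 5th max pool -> 10th conv
            nn.MaxPool3d(kernel_size=3, stride=1, padding=1),
            nn.Conv3d(in_ch, b4, kernel_size=1))

    def forward(self, x):
        # Feature fusion: concatenate the four branch outputs on the channel axis.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```

For instance, `InceptionModule3D(192, 64, 96, 128, 16, 32, 32)` gives 64 + 128 + 32 + 32 = 256 output channels, matching the width assumed for the first module in the trunk sketch above.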
Other steps and parameters are the same as in one of the first to fifth embodiments.
The seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that: in step 13, the feature sequence is encoded with a Transformer network encoder, and the action classification head and the boundary regression head of the decoder output the action category and the action boundary, respectively; the specific process is as follows:
given an input video X, assume that a set of feature vectors X = {x_1, x_2, …, x_T}, defined over the discretized time steps t = {1, 2, …, T}, represents the input video X;
where the total duration T varies from video to video and x_t is the time segment at time t;
the I3D network is used to extract the feature-vector representation x_t of each time segment;
the goal of the temporal action detection task is to predict and output, for the input video feature sequence X, the action labels Y = {y_1, y_2, …, y_N}; Y consists of N action instances y_i, and each action instance y_i = (s_i, e_i, a_i) consists of a start time s_i, an end time e_i and an action label a_i, where s_i ∈ [1, T], e_i ∈ [1, T], a_i ∈ {1, …, C} (C is the predefined number of action categories) and s_i < e_i.
The Transformer-based plant growth stage detection method is built on an anchor-free action localization approach: it classifies each time step t into an action category or background and further regresses the distances from that time step to the start and end of the action. The structured output prediction problem (X = {x_1, x_2, …, x_T} → Y = {y_1, y_2, …, y_N}) is thus converted into a sequence labeling problem:
the output of the model at time t, ŷ_t = (p(a_t), d_t^s, d_t^e), is defined as follows: p(a_t) consists of C values, each a binomial variable representing the probability, estimated by the model at time t, of the action category a_t ∈ {1, 2, …, C}; d_t^s and d_t^e are the distances from the current time t to the start and end of the action, respectively;
if the time step t lies within a background period, d_t^s and d_t^e are not defined.
The action localization result is decoded by the following equations: the start time is s_t = t − d_t^s, the end time is e_t = t + d_t^e, and the action category is the one with the largest probability in p(a_t).
Specifically, the ActionFormer model labels the input video sequence X with a function f: X → Ŷ; the encoder–decoder structure decomposes f into two parts, f = h ∘ g, where the encoder g: X → Z is parameterized by the ActionFormer network and encodes the input video sequence into feature vectors Z, and the decoder h uses a lightweight convolutional network to decode Z into sequence labels; the overall structure is shown in figure 4. To capture temporal actions on multiple time scales, a multi-scale feature representation is adopted to form feature pyramids with different resolutions.
Step 131: the plant growth stage video feature sequence X = {x_1, x_2, …, x_T} output by the I3D network is input into the projection function of the ActionFormer model, and each feature x_t is embedded into a 512-dimensional vector space;
where the time steps are t = {1, 2, …, T}, the total duration T varies from video to video, and x_t is the video feature at time t;
step 132: the feature vectors output by the projection function are input into a multi-head Transformer encoder, which further encodes the features and outputs a feature pyramid Z = {Z^1, Z^2, …, Z^L} (a multi-scale feature representation produced by the local self-attention modules);
where Z is the collection of multi-scale pyramid features and Z^1, Z^2, …, Z^L are feature maps of different scales;
step 133: the feature pyramid Z = {Z^1, Z^2, …, Z^L} output in step 132 is input into the decoder of the ActionFormer model, which decodes the feature pyramid into sequence labels Ŷ;
where Ŷ is the set of decoded action instances, each instance comprising a start time, an end time and an action category, and the elements ŷ_t of Ŷ are the action instances at different time steps.
Other steps and parameters are the same as in one of the first to sixth embodiments.
The eighth embodiment: this embodiment differs from one of the first to seventh embodiments in that: in step 131, the plant growth stage video feature sequence X = {x_1, x_2, …, x_T} output by the I3D network is input into the projection function of the ActionFormer model, and each feature x_t is embedded into a 512-dimensional vector space;
where the time steps are t = {1, 2, …, T}, the total duration T varies from video to video, and x_t is the video feature at time t;
the specific process is as follows:
the projection function E is a convolutional network with ReLU as the activation function, defined as:
Z^0 = [E(x_1), E(x_2), …, E(x_T)]^T
where E(x_t) is the embedded feature of x_t. Adding convolution layers before the Transformer backbone helps to better incorporate the local temporal context of the sequence data, and also helps to stabilize the training of the Transformer network and speeds up convergence; Z^0 denotes the set of embedded features at all time steps, and the superscript T denotes the transpose.
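A sketch of the projection function E under the assumption that the I3D features are 1024-dimensional and that a kernel size of 3 is used; only the 512-dimensional embedding and the ReLU activation are taken from the text.

```python
import torch.nn as nn

class ProjectionE(nn.Module):
    def __init__(self, in_dim=1024, embed_dim=512, kernel_size=3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(in_dim, embed_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(inplace=True))

    def forward(self, x):                                    # x: (batch, T, in_dim)
        z0 = self.proj(x.transpose(1, 2)).transpose(1, 2)    # Z^0: (batch, T, embed_dim)
        return z0
```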
Other steps and parameters are the same as those of one of the first to seventh embodiments.
The ninth embodiment: this embodiment differs from one of the first to eighth embodiments in that: in step 132, the feature vectors output by the projection function are input into a multi-head Transformer encoder, which further encodes the features and outputs a feature pyramid Z = {Z^1, Z^2, …, Z^L} (a multi-scale feature representation produced by the local self-attention modules);
the specific process is as follows:
the multi-head Transformer encoder comprises L Transformer encoders; each Transformer encoder consists of a multi-layer perceptron (MLP) and a local multi-head self-attention network (MSA), as shown in fig. 5.
Each Transformer encoder includes a LayerNorm (LN) layer, a local multi-head self-attention network (MSA) and a multi-layer perceptron (MLP);
the connection relation of each Transformer encoder is as follows:
the feature vector B is input into a LayerNorm (LN) layer, whose output is fed to the local multi-head self-attention network (MSA); the MSA output is added to the feature vector B to obtain the feature vector C; C is input into a LayerNorm (LN) layer, whose output is fed to the multi-layer perceptron (MLP); the MLP outputs the feature vector D, which is added to C to obtain the feature vector E; E is downsampled to obtain the feature vector F;
the expression is:
Z̄^l = α^l · MSA(LN(Z^{l-1})) + Z^{l-1}
Ẑ^l = ᾱ^l · MLP(LN(Z̄^l)) + Z̄^l
Z^l = ↓(Ẑ^l)
Adding the LayerNorm (LN) layers to the Transformer block normalizes the data, which helps speed up model convergence, while the residual connections reduce model complexity and prevent vanishing gradients. To capture actions on different time scales, the features are downsampled with a downsampling operator, giving the final output above;
where Z̄^l denotes the feature vector C, Z^{l-1} denotes the output feature vector of the (l−1)-th Transformer encoder, α^l denotes a learnable channel scale factor, Ẑ^l denotes the feature vector E, ᾱ^l denotes a learnable channel scale factor, Z^l denotes the feature vector F, and ↓ denotes downsampling;
each Transformer encoder produces one Z^l, so the L Transformer encoders produce L feature maps Z^l, i.e. they generate the feature pyramid Z = {Z^1, Z^2, …, Z^L};
T^{l−1}/T^l denotes the downsampling rate, α^l and ᾱ^l are learnable channel scale factors, and a one-dimensional strided depthwise convolution is used as the downsampling operator. The ActionFormer model uses a 2× downsampling rate and generates the feature pyramid Z = {Z^1, Z^2, …, Z^L} by stacking multiple Transformer blocks with downsampling in between.
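The block just described can be sketched as follows. This is not the patent's code: plain (global) multi-head attention is used instead of the local/windowed variant, and the MLP width, head count and downsampling kernel are assumptions; only the pre-LayerNorm residual structure, the learnable channel scales and the strided depthwise 1-D convolution for 2× downsampling follow the text.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.alpha1 = nn.Parameter(torch.ones(dim))       # alpha^l
        self.alpha2 = nn.Parameter(torch.ones(dim))       # alpha-bar^l
        # one-dimensional strided depthwise convolution as the downsampling operator
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1, groups=dim)

    def forward(self, z):                                 # z = Z^{l-1}: (batch, T, dim)
        b = z
        c = b + self.alpha1 * self.msa(self.ln1(b), self.ln1(b), self.ln1(b))[0]
        e = c + self.alpha2 * self.mlp(self.ln2(c))
        f = self.down(e.transpose(1, 2)).transpose(1, 2)  # Z^l at half the temporal length
        return f

def feature_pyramid(z0, blocks):
    """Stacking L blocks yields the feature pyramid Z = {Z^1, ..., Z^L}."""
    pyramid, z = [], z0
    for blk in blocks:
        z = blk(z)
        pyramid.append(z)
    return pyramid
```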
Other steps and parameters are the same as in one of the first to eighth embodiments.
The tenth embodiment: this embodiment differs from one of the first to ninth embodiments in that: in step 133, the feature pyramid Z = {Z^1, Z^2, …, Z^L} output in step 132 is input into the decoder of the ActionFormer model, which decodes the feature pyramid into sequence labels Ŷ; the specific process is as follows:
the decoder of the ActionFormer model is a lightweight convolutional network with a classification head and a regression head, as shown in fig. 6; the decoder h decodes the feature pyramid Z produced by the encoder g into the sequence labels Ŷ.
(1) Classification head
Given the feature pyramid Z = {Z^1, Z^2, …, Z^L}, the classification head uses a lightweight convolutional network to classify every time step t on each of the L pyramid levels and predicts the probabilities p(a_t) of the C categories to which each time step t belongs; the convolutional network parameters are shared across all levels of the feature pyramid;
the classification head consists of a 1-dimensional convolution layer (with a kernel size of 3), a normalization layer and a ReLU activation layer, and a Sigmoid function is used to output the predicted action-category probabilities;
the specific process is as follows:
the feature pyramid Z = {Z^1, Z^2, …, Z^L} is passed through the 1-dimensional convolution layer (kernel size 3), the normalization layer and the ReLU activation layer in turn to obtain a vector, from which the action probabilities of the C categories to which each time step t belongs are predicted through the Sigmoid function;
(2) Regression head
Given the feature pyramid Z = {Z^1, Z^2, …, Z^L}, the regression head performs action-boundary regression for every time step t on each of the L pyramid levels;
the difference from the classification head is that the regression head predicts the distances d_t^s and d_t^e from the current time t to the start and end (offset) of an action only when time t lies within an action (an action here meaning a growth segment such as germination);
the regression range of each pyramid level needs to be specified in advance, and the structure of the regression head is consistent with that of the classification head;
the regression head consists of a 1-dimensional convolution layer (with a kernel size of 3), a normalization layer and a ReLU activation layer; the ReLU activation function is used to estimate the distances from the current time to the action boundaries.
The specific process is as follows:
the feature pyramid Z = {Z^1, Z^2, …, Z^L} is passed through the 1-dimensional convolution layer (kernel size 3), the normalization layer and the ReLU activation layer in turn, and the distances d_t^s and d_t^e from the current time t to the action start and offset are output.
The regression network is responsible for generating temporal proposals for the different actions together with action completeness scores. Ideally, the network should not only find the recognized actions and accurate action proposals, but also determine whether a proposal is an action or background. The prediction scores of temporal proposals with a high degree of overlap with ground-truth instances are increased, while those with little overlap are decreased, so that the proposals with higher overlap are more easily output by the network after post-processing.
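A sketch of the two heads, assuming GroupNorm as the unspecified normalization layer and an extra output convolution; the kernel size of 3, the Sigmoid classification output and the ReLU-constrained distances follow the text, the rest are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderHeads(nn.Module):
    def __init__(self, dim=512, num_classes=4):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=3, padding=1),
                nn.GroupNorm(16, dim), nn.ReLU(inplace=True),
                nn.Conv1d(dim, out_ch, kernel_size=3, padding=1))
        self.cls_head = head(num_classes)   # action probabilities p(a_t)
        self.reg_head = head(2)             # distances (d_t^s, d_t^e)

    def forward(self, pyramid):             # pyramid: list of (batch, T_l, dim), weights shared
        cls_out, reg_out = [], []
        for z in pyramid:
            z = z.transpose(1, 2)                                      # (batch, dim, T_l)
            cls_out.append(torch.sigmoid(self.cls_head(z)).transpose(1, 2))
            reg_out.append(F.relu(self.reg_head(z)).transpose(1, 2))
        return cls_out, reg_out
```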
(3) Loss function
The ActionFormer model outputs a triple ŷ_t = (p(a_t), d_t^s, d_t^e) for each time step t of the input sequence, comprising the action-category probabilities p(a_t) and the distances d_t^s and d_t^e from the current time t to the action start and offset.
The design of the loss function is also relatively simple and only comprises two parts:
(1) the classification loss L_cls: a binary focal loss over the C action classes;
(2) the regression loss L_reg: a DIoU loss for distance regression;
the loss of each video X is defined as:
L = (1/T_+) · Σ_t ( L_cls + λ_reg · 1_{c_t} · L_reg )
wherein T is the length of the video sequence, T_+ is the total number of positive samples, 1_{c_t} is an indicator function that determines whether the current time step t is a positive sample, the sum Σ_t runs over the outputs at all levels of the pyramid, and λ_reg is a coefficient balancing the classification and regression losses, λ_reg = 1;
The focal loss (Focal Loss) is used to recognize the C action classes. Taking the image classification task as an example, in order to address the class-imbalance problem that easily arises in practical training sets (the numbers of positive and negative samples differ greatly), Focal Loss improves the standard cross-entropy loss function by adding a modulation factor that shifts the focus of neural-network learning onto the examples that are easily misclassified. Specifically, the higher the probability that a class is correctly classified, the smaller the modulation factor, which gradually approaches zero; the two are inversely related. The modulation factor reduces the contribution of simple examples in the training set and concentrates the model's learning on examples that are difficult to classify.
Formally, Focal Loss defines a modulation factor (1 − p_t)^γ that dynamically scales the cross-entropy loss; with γ > 0 the loss of examples that are easy to classify correctly is reduced, so that the network focuses on examples that are easily misclassified. The classification loss L_cls is computed as:
FL(p_t) = −(1 − p_t)^γ · log(p_t)
wherein FL(p_t) denotes the classification loss L_cls, (1 − p_t)^γ denotes the modulation factor, p_t denotes the predicted probability of the correct class, and γ is a hyper-parameter that is tuned manually;
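A sketch of the focal loss above, applied per class to sigmoid probabilities; γ = 2 is a common default and an assumption here, since the text only states that γ is tuned manually.

```python
import torch

def focal_loss(prob, target, gamma=2.0, eps=1e-8):
    """prob: (N, C) sigmoid class probabilities; target: (N, C) 0/1 labels."""
    p_t = torch.where(target > 0.5, prob, 1.0 - prob)   # probability assigned to the correct label
    return (-(1.0 - p_t) ** gamma * torch.log(p_t + eps)).sum(dim=-1).mean()
```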
The DIoU loss is used for regression, and L_reg is computed only when the current time step t contains a positive sample (1_{c_t} = 1). The specific formula is as follows:
L_DIoU = 1 − IoU + ρ²(b, b_gt) / c²
wherein DIoU denotes the Distance-IoU and IoU denotes the intersection-over-union, b is the center position of the predicted box and b_gt is the center position of the target box, ρ(·) computes the Euclidean distance between the two positions, and c is the diagonal length of the smallest rectangular box that encloses both the predicted box and the target box.
In object detection tasks, an l_n-norm loss is often chosen as the measure for box regression, but it does not match the evaluation metric of object detection (the intersection-over-union) exactly.
Recently, the GIoU loss was proposed as an IoU-based metric, but object detection with the GIoU loss still suffers from low accuracy, slow convergence and large errors in the regressed boxes.
Distance-IoU (DIoU), in contrast, incorporates the normalized distance between the predicted box and the target box in addition to the overlap area and aspect ratio, so the box regression is more stable, converges faster and performs better.
In addition, DIoU can be easily employed in non-maximum suppression (NMS) as a criterion to further facilitate performance improvement.
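A sketch of the DIoU loss specialised to 1-D temporal segments (an adaptation, since the formula above is written for 2-D boxes): the center distance is the difference of interval midpoints, and the enclosing "diagonal" is the length of the smallest interval covering both segments.

```python
import torch

def diou_loss_1d(pred, target, eps=1e-8):
    """pred, target: (N, 2) tensors of (start, end) times with start < end."""
    inter = (torch.min(pred[:, 1], target[:, 1]) -
             torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    iou = inter / (union + eps)
    centre_dist = 0.5 * ((pred[:, 0] + pred[:, 1]) - (target[:, 0] + target[:, 1]))
    enclose = torch.max(pred[:, 1], target[:, 1]) - torch.min(pred[:, 0], target[:, 0])
    diou = iou - (centre_dist ** 2) / (enclose ** 2 + eps)
    return (1.0 - diou).mean()
```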
Other steps and parameters are the same as in one of the first to ninth embodiments.
The following examples are used to verify the benefits of the present invention:
embodiment one:
the data used in the experiments are the constructed plant growth stage video detection dataset. A total of 372 plant videos were collected and classified according to growth stage; the numbers of videos for the germination and seedling stages far exceed those for the flowering and fruiting stages. Videos of different growth stages range in length from tens of seconds to several minutes, the total duration of each video is between one and four minutes, and the number of plant growth days ranges from 55 to 126 days. The specific composition is shown in Table 1.
FIG. 7 shows a visual comparison between the plant growth stages detected by the Transformer-based method and the ground-truth action instances, and Table 2 compares the temporal action detection accuracy of the method of the present invention with that of the other comparison methods (E2E-TAD, R-C3D, BasicTAD). The comparison shows that the plant growth stages produced by the method of the invention are more accurate.
TABLE 1
TABLE 2
The present invention is capable of other and further embodiments and its several details are capable of modification and variation in light of the present invention, as will be apparent to those skilled in the art, without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A plant growth stage video detection method is characterized in that: the method comprises the following specific processes:
step 1: constructing a neural network model to obtain a trained neural network model; the specific process is as follows:
step 11: collecting a plant growth stage video and action categories and action boundaries of the corresponding plant growth stage video as a training set;
the action category refers to a germination period, a seedling period, a flowering period and a fruiting period;
the action boundary refers to the moment when the action starts and ends, and the action refers to the germination period, the seedling period, the flowering period and the fruiting period;
step 12: constructing an I3D network, inputting a plant growth stage video into the I3D network, and extracting a plant growth stage video feature sequence;
step 13: encoding the feature sequence with a Transformer network encoder, and outputting the action category and the action boundary with the action classification head and the boundary regression head of the decoder, respectively;
step 14: obtaining a trained neural network model;
step 2: inputting the video of the plant growth stage to be detected into a trained neural network model to obtain the action category and action boundary corresponding to the video of the plant growth stage to be detected.
2. A plant growth stage video detection method according to claim 1, wherein: in the step 11, acquiring a plant growth stage video and action categories and action boundaries of the corresponding plant growth stage video as a training set; the specific process is as follows:
dividing positive and negative samples by adopting a central sampling strategy;
given an action instance centered at c, any time step t satisfying |t − c| ≤ α·s^l is considered a positive sample;
wherein c is the center of the action instance; α is a parameter, α = 1.5; s^l is the feature stride (sampling rate) of the feature pyramid at level l; the time steps are t = {1, 2, …, T}, where T is the total duration of the video;
each action instance consists of a start time s_i, an end time e_i and an action label a_i, wherein s_i ∈ [1, T], e_i ∈ [1, T], a_i ∈ {1, …, C}, C is a predefined number of action categories, and s_i < e_i;
An input video is considered a negative sample when the time span of the input video overlaps less than 5% with all annotated action instances.
3. A plant growth stage video detection method according to claim 1 or 2, characterized in that: in the step 12, an I3D network is constructed, a plant growth stage video is input into the I3D network, and a plant growth stage video feature sequence is extracted; the specific process is as follows:
the I3D network comprises, in order: a first convolution layer, a first maximum pooling layer, a second convolution layer, a third convolution layer, a second maximum pooling layer, a first Inception module, a second Inception module, a third maximum pooling layer, a third Inception module, a fourth Inception module, a fifth Inception module, a sixth Inception module, a seventh Inception module, a fourth maximum pooling layer, an eighth Inception module, a ninth Inception module, a first average pooling layer and a fourth convolution layer;
the connection relation of the I3D network is as follows:
the plant growth stage video is input in turn into the first convolution layer, the first maximum pooling layer, the second convolution layer, the third convolution layer, the second maximum pooling layer, the first Inception module, the second Inception module, the third maximum pooling layer, the third Inception module, the fourth Inception module, the fifth Inception module, the sixth Inception module, the seventh Inception module, the fourth maximum pooling layer, the eighth Inception module, the ninth Inception module, the first average pooling layer and the fourth convolution layer, and the fourth convolution layer outputs the plant growth stage video feature sequence.
4. The plant growth stage video detection method according to claim 3, characterized in that: the convolution kernel size of the first convolution layer is 7×7×7, with a stride of 2;
the convolution kernel size of the second convolution layer is 1×1×1, with a stride of 2;
the convolution kernel size of the third convolution layer is 3×3×3, with a stride of 2;
the convolution kernel size of the fourth convolution layer is 1×1×1, with a stride of 2.
5. The plant growth stage video detection method according to claim 4, characterized in that: the size of the first maximum pooling layer is 1×3×3, with a stride of 2 in the x dimension, 2 in the y dimension and 1 in the time dimension;
the size of the second maximum pooling layer is 1×3×3, with a stride of 2 in the x dimension, 2 in the y dimension and 1 in the time dimension;
the size of the third maximum pooling layer is 3×3×3, with a stride of 2 in the x, y and time dimensions;
the size of the fourth maximum pooling layer is 2×2×2, with a stride of 2 in the x, y and time dimensions;
the size of the first average pooling layer is 2×7×7, with a stride of 1 in the x, y and time dimensions;
the receptive field size of each pixel on the feature map output by the first maximum pooling layer is 7×11×11;
the receptive field size of each pixel on the feature map output by the second maximum pooling layer is 11×27×27;
the receptive field size of each pixel on the feature map output by the third maximum pooling layer is 23×75×75;
the receptive field size of each pixel on the feature map output by the fourth maximum pooling layer is 59×219×219;
the receptive field size of each pixel on the feature map output by the first average pooling layer is 99×539×539.
6. The plant growth stage video detection method according to claim 5, characterized in that: each of the first, second, third, fourth, fifth, sixth, seventh, eighth and ninth Inception modules comprises:
a fifth convolution layer, a sixth convolution layer, a fifth maximum pooling layer, a seventh convolution layer, an eighth convolution layer, a ninth convolution layer and a tenth convolution layer;
the connection relation of the Inception module is as follows:
the feature A is input into the fifth convolution layer, the sixth convolution layer, the fifth maximum pooling layer and the seventh convolution layer, respectively;
the output feature of the fifth convolution layer is input into the eighth convolution layer;
the output feature of the sixth convolution layer is input into the ninth convolution layer;
the output feature of the fifth maximum pooling layer is input into the tenth convolution layer;
the output features of the seventh, eighth, ninth and tenth convolution layers are fused, and the fused feature is the output of the Inception module;
the convolution kernel size of the fifth convolution layer is 1×1×1;
the convolution kernel size of the sixth convolution layer is 1×1×1;
the convolution kernel size of the seventh convolution layer is 1×1×1;
the convolution kernel size of the eighth convolution layer is 3×3×3;
the convolution kernel size of the ninth convolution layer is 3×3×3;
the convolution kernel size of the tenth convolution layer is 1×1×1;
the size of the fifth maximum pooling layer is 3×3×3.
7. The plant growth stage video detection method according to claim 6, characterized in that: in step 13, the feature sequence is encoded with a Transformer network encoder, and the action classification head and the boundary regression head of the decoder output the action category and the action boundary, respectively; the specific process is as follows:
step 131: the plant growth stage video feature sequence X = {x_1, x_2, …, x_T} output by the I3D network is input into the projection function of the ActionFormer model, and each feature x_t is embedded into a 512-dimensional vector space;
where the time steps are t = {1, 2, …, T}, the total duration T varies from video to video, and x_t is the video feature at time t;
step 132: the feature vectors output by the projection function are input into a multi-head Transformer encoder, which further encodes the features and outputs a feature pyramid Z = {Z^1, Z^2, …, Z^L};
wherein Z is the collection of multi-scale pyramid features and Z^1, Z^2, …, Z^L are feature maps of different scales;
step 133: the feature pyramid Z = {Z^1, Z^2, …, Z^L} output in step 132 is input into the decoder of the ActionFormer model, which decodes the feature pyramid into sequence labels Ŷ;
wherein Ŷ is the set of decoded action instances, each instance comprising a start time, an end time and an action category, and the elements of Ŷ are the action instances at different time steps.
8. The plant growth stage video detection method according to claim 7, characterized in that: in step 131, the plant growth stage video feature sequence X = {x_1, x_2, …, x_T} output by the I3D network is input into the projection function of the ActionFormer model, and each feature x_t is embedded into a 512-dimensional vector space;
where the time steps are t = {1, 2, …, T}, the total duration T varies from video to video, and x_t is the video feature at time t;
the specific process is as follows:
the projection function E is a convolutional network with ReLU as the activation function, defined as:
Z^0 = [E(x_1), E(x_2), …, E(x_T)]^T
wherein E(x_t) is the embedded feature of x_t; Z^0 denotes the set of embedded features at all time steps, and the superscript T denotes the transpose.
9. The plant growth stage video detection method according to claim 8, characterized in that: in step 132, the feature vectors output by the projection function are input into a multi-head Transformer encoder, which further encodes the features and outputs a feature pyramid Z = {Z^1, Z^2, …, Z^L};
the specific process is as follows:
the multi-head Transformer encoder comprises L Transformer encoders;
each Transformer encoder comprises a LayerNorm (LN) layer, a local multi-head self-attention network (MSA) and a multi-layer perceptron (MLP);
the connection relation of each Transformer encoder is as follows:
the feature vector B is input into a LayerNorm layer, whose output is fed to the local multi-head self-attention network; the output of the local multi-head self-attention network is added to the feature vector B to obtain the feature vector C; C is input into a LayerNorm layer, whose output is fed to the multi-layer perceptron; the multi-layer perceptron outputs the feature vector D, which is added to C to obtain the feature vector E; E is downsampled to obtain the feature vector F;
the expression is:
Z̄^l = α^l · MSA(LN(Z^{l-1})) + Z^{l-1}
Ẑ^l = ᾱ^l · MLP(LN(Z̄^l)) + Z̄^l
Z^l = ↓(Ẑ^l)
wherein Z̄^l denotes the feature vector C, Z^{l-1} denotes the output feature vector of the (l−1)-th Transformer encoder, α^l denotes a learnable channel scale factor, Ẑ^l denotes the feature vector E, ᾱ^l denotes a learnable channel scale factor, Z^l denotes the feature vector F, and ↓ denotes downsampling;
each Transformer encoder produces one Z^l, and the L Transformer encoders produce L feature maps Z^l, i.e. they generate the feature pyramid Z = {Z^1, Z^2, …, Z^L}.
10. The plant growth stage video detection method according to claim 9, characterized in that: in step 133, the feature pyramid Z = {Z^1, Z^2, …, Z^L} output in step 132 is input into the decoder of the ActionFormer model, which decodes the feature pyramid into sequence labels Ŷ;
the specific process is as follows:
the decoder of the ActionFormer model is a convolutional network with a classification head and a regression head;
(1) Classification head
given the feature pyramid Z = {Z^1, Z^2, …, Z^L}, the classification head classifies every time step t on each of the L pyramid levels and predicts the probabilities p(a_t) of the C categories to which each time step t belongs;
the classification head consists of a 1-dimensional convolution layer, a normalization layer and a ReLU activation layer, and a Sigmoid function is used to output the predicted action-category probabilities;
the specific process is as follows:
the feature pyramid Z = {Z^1, Z^2, …, Z^L} is passed through the 1-dimensional convolution layer, the normalization layer and the ReLU activation layer in turn to obtain a vector, from which the action probabilities of the C categories to which each time step t belongs are predicted through the Sigmoid function;
(2) Regression head
given the feature pyramid Z = {Z^1, Z^2, …, Z^L}, the regression head performs action-boundary regression for every time step t on each of the L pyramid levels;
only when the time step t lies within an action does the regression head predict the distances d_t^s and d_t^e from the current time t to the action start and offset;
the regression head consists of a 1-dimensional convolution layer, a normalization layer and a ReLU activation layer;
the specific process is as follows:
the feature pyramid Z = {Z^1, Z^2, …, Z^L} is passed through the 1-dimensional convolution layer, the normalization layer and the ReLU activation layer in turn, and the distances d_t^s and d_t^e from the current time t to the action start and offset are output;
(3) Loss function
the ActionFormer model outputs a triple ŷ_t = (p(a_t), d_t^s, d_t^e) for each time step t of the input sequence, comprising the action-category probabilities p(a_t) and the distances d_t^s and d_t^e from the current time t to the action start and offset;
The loss function comprises two parts:
(1) the classification loss L_cls: a binary focal loss over the C action classes;
(2) the regression loss L_reg: a DIoU loss for distance regression;
the loss of each video X is defined as:
L = (1/T_+) · Σ_t ( L_cls + λ_reg · 1_{c_t} · L_reg )
wherein T is the length of the video sequence, T_+ is the total number of positive samples, 1_{c_t} is an indicator function, the sum Σ_t runs over the outputs at all levels of the pyramid, and λ_reg is a coefficient balancing the classification and regression losses, λ_reg = 1;
the classification loss L_cls is computed as:
FL(p_t) = −(1 − p_t)^γ · log(p_t)
wherein FL(p_t) denotes the classification loss L_cls, (1 − p_t)^γ denotes the modulation factor, p_t denotes the predicted probability of the correct class, and γ denotes the hyper-parameter;
the regression loss L_reg is computed as:
L_DIoU = 1 − IoU + ρ²(b, b_gt) / c²
wherein DIoU denotes the Distance-IoU and IoU denotes the intersection-over-union, b is the center position of the predicted box and b_gt is the center position of the target box, ρ(·) computes the Euclidean distance between the two positions, and c is the diagonal length of the smallest rectangular box that encloses both the predicted box and the target box.
CN202311066878.0A 2023-08-23 2023-08-23 Video detection method for plant growth stage Pending CN116977862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311066878.0A CN116977862A (en) 2023-08-23 2023-08-23 Video detection method for plant growth stage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311066878.0A CN116977862A (en) 2023-08-23 2023-08-23 Video detection method for plant growth stage

Publications (1)

Publication Number Publication Date
CN116977862A true CN116977862A (en) 2023-10-31

Family

ID=88476686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311066878.0A Pending CN116977862A (en) 2023-08-23 2023-08-23 Video detection method for plant growth stage

Country Status (1)

Country Link
CN (1) CN116977862A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118038990A (en) * 2024-04-11 2024-05-14 山东大学 Multi-level chromatin topological structure domain identification method and system based on community discovery



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination