CN115223021A

CN115223021A - Visual question-answering-based fruit tree full-growth period farm work decision-making method

Info

Publication number: CN115223021A
Application number: CN202210863967.7A
Authority: CN
Inventors: 邓小玲; 郭雅琦; 陈奇真; 兰玉彬; 陈欣; 林晓晴
Original assignee: South China Agricultural University
Current assignee: South China Agricultural University
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2022-10-21

Abstract

The invention discloses a visual question-answering-based farm work decision method for the full growth period of fruit trees, which comprises the following steps: acquiring an image sample and a first text sample for a growth cycle of a target fruit tree, the first text sample comprising a fruit tree disease treatment problem; extracting features from the image sample and the first text sample respectively to obtain a corresponding image feature vector and a corresponding problem keyword feature vector; introducing a multi-modal fusion model; feeding the image feature vector and the problem keyword feature vector into the multi-modal fusion model, and outputting fused multi-modal features; and inputting the fused multi-modal features into a trained classifier, and outputting the correct answer corresponding to the fruit tree disease treatment problem. The method combines multi-modal data fusion with visual question answering, applies them to a fruit tree image-text data set, and obtains good accuracy, thereby realizing visual question-answering-based farm work decision-making for the full growth period of fruit trees.

Description

Visual question-answering-based fruit tree full-growth period farm work decision method

Technical Field

The invention belongs to the technical field of multi-modal data fusion, the technical field of visual question answering based on the combination of computer vision and natural language processing and the technical field of bilinear fusion among multi-modal data, and particularly relates to a visual question answering-based fruit tree full-growth period farming operation decision method.

Background

With the development of intelligent agriculture, combining artificial intelligence with agriculture has effectively solved many precision-operation problems. At present, most decision information for the fruit tree growth period is based on single-modal data, such as images or a text knowledge base, but single-source information has very limited application: image information alone supports only pest and disease identification, while a text knowledge base alone provides only guidance on fertilization, pesticide application, pest and disease control, and growth-period management. The two modalities can therefore be fused, and the text decision obtained from the fusion can be sent to an unmanned vehicle or an unmanned remote-sensing pesticide-application aircraft in the orchard to make precise decisions for the fruit trees, realizing precise management and control of the intelligent orchard.

Visual question answering takes an image and a question related to that image as input; the model derives an appropriate answer from the relevance between the image and the text and outputs it in natural language. However, current visual question-answering methods suffer from insufficient inter-modal feature extraction and neglect fine-grained interaction between images and text. It is therefore not feasible to apply an existing visual question-answering model directly to a fruit tree image-text data set; the model must be improved.

The patent document closest to the invention has application number 201910647573.6 and publication number CN 110348535A, and is entitled "Visual question-answer model training method and device". The patented method comprises the following steps: acquiring an image and text data set to obtain training samples and labels; performing feature extraction on the images in the data set to obtain visual features, and performing feature extraction on the questions posed about the images to obtain text keyword features; performing interactive processing on the image features and the text features to obtain fused feature vectors in which the image features carry text features; performing bilinear pooling on the fused feature vectors, inputting them into a visual question-answer model, and obtaining a predicted answer from the model; and determining the loss value of a loss function from the annotated correct answers and the predicted answers, and updating the visual question-answer model accordingly. However, that method is trained only on public data sets and has not been applied in practice, and the patent does not describe how to apply visual question answering to decisions about the fruit tree growth period.

Therefore, how to combine multi-modal data fusion and visual question answering to apply to accurate decision of farm work in the whole growth period of fruit trees becomes a key problem of current research.

Disclosure of Invention

Existing visual question-answering models have too many parameters to be applied directly to the agricultural field, and single-source information is insufficient in the smart orchard field. To address these problems, the invention provides a visual question-answering-based fruit tree full-growth period farm work decision method that solves at least part of these technical problems. The method combines multi-modal data fusion with visual question answering, applies them to a fruit tree image-text data set, and obtains good accuracy, thereby realizing visual question-answering-based farm work decision-making for the full growth period of fruit trees.

The embodiment of the invention provides a visual question and answer-based fruit tree full-growth period farm work decision method, which comprises the following steps:

S1. Obtaining an image sample and a first text sample for a growth cycle of a target fruit tree; the first text sample comprises a fruit tree disease treatment problem;

S2. Respectively extracting the features of the image sample and the first text sample to obtain corresponding image feature vectors and corresponding problem keyword feature vectors;

S3. Introducing a multi-modal fusion model; respectively feeding the image feature vectors and the problem keyword feature vectors into the multi-modal fusion model, and outputting fused multi-modal features;

S4. Inputting the fused multi-modal features into a trained classifier, and outputting the correct answer corresponding to the fruit tree disease treatment problem.

Further, the image feature vector comprises a fruit tree image feature vector and a disease image position feature vector.

Further, the S2 specifically includes:

performing feature extraction on the image sample through a target detection algorithm based on the residual network ResNet-152 to obtain an image feature vector;

and performing feature extraction on the first text sample by using a word vector embedding method and a long-short term memory neural network to obtain a problem keyword feature vector.

Further, to form the image feature vector, each image sample is divided into a plurality of regions; each region is represented by a 2048-dimensional vector, which is used as the input of the subsequent network;

wherein a corresponding object detector and attribute classifier are configured for each region; the object bounding box of each object detector has a corresponding attribute class.

Further, the extracting features of the first text sample by using a word vector embedding method and a long-short term memory neural network to obtain a problem keyword feature vector specifically includes:

processing the input fruit tree disease control problem into a plurality of single words, and intercepting N words from the plurality of single words; if the number of the plurality of single words is less than N, filling with 0;

capturing semantic features of the intercepted words by combining a 300-dimensional word vector model (GloVe), and converting the semantic features into a problem feature vector;

and coding the problem feature vector by using a long-short term memory neural network (LSTM), and extracting problem keyword feature information from the problem feature vector to obtain the problem keyword feature vector.
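The word-level preprocessing described above (split the question into words, keep at most N, pad with 0) can be illustrated with a minimal sketch. The vocabulary, the `unk_id` value for unknown words, and the function name `encode_question` are illustrative assumptions and do not come from the patent; the GloVe lookup and LSTM encoding are not shown.

```python
from typing import Dict, List

PAD_ID = 0  # index reserved for padding ("filling with 0")


def encode_question(question: str, vocab: Dict[str, int], n: int,
                    unk_id: int = 1) -> List[int]:
    """Split a question into words, keep at most n of them, pad with 0 up to n.

    vocab and unk_id are hypothetical; the patent does not specify them.
    """
    words = question.lower().replace("?", " ").split()
    ids = [vocab.get(word, unk_id) for word in words[:n]]  # truncate to n words
    return ids + [PAD_ID] * (n - len(ids))                 # pad short questions


# Hypothetical toy vocabulary for illustration only.
vocab = {"how": 2, "to": 3, "treat": 4, "citrus": 5, "canker": 6}
print(encode_question("How to treat citrus canker?", vocab, n=8))
print(encode_question("How to treat citrus canker?", vocab, n=3))
```

The fixed length N keeps every question the same shape, so a batch of questions can be stacked before the embedding layer.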

Further, in S4, the classifier is trained as follows:

acquiring a large number of image samples and second text samples for the growth cycle of the fruit tree; the second text sample comprises a fruit tree disease treatment problem and a real answer corresponding to the fruit tree disease treatment problem;

respectively extracting the features of the image sample and the second text sample to obtain an image feature vector and a problem keyword feature vector;

preprocessing the image feature vector and the problem keyword feature vector;

introducing a multi-modal fusion model; respectively transmitting the preprocessed image feature vectors and the problem keyword feature vectors into the multi-modal fusion model, and outputting fused multi-modal features;

and taking the fused multi-modal characteristics as input, and taking a real answer corresponding to the fruit tree disease control problem as output for training a classifier.

Further, the preprocessing specifically comprises:

extracting fruit tree image characteristic vectors and disease image position characteristic vectors from the image characteristic vectors through a multi-view attention mechanism;

capturing the relation between the fruit tree image characteristic vector and the problem keyword characteristic vector, and performing text representation learning to obtain the correlation between the fruit tree image characteristic of the target fruit tree and the problem keyword characteristic;

and interacting the fruit tree image characteristic vector and the disease image position characteristic vector to obtain the correlation between the fruit tree image characteristic and the disease image position characteristic.

Further, the multi-modal fusion model adopts a multi-view attention mechanism, scores the disease image position region embeddings according to the problem keyword feature vector, and calculates a global visual vector as the score-weighted sum pooling of those embeddings.

Further, the multi-modal fusion model adopts a bilinear fusion mechanism based on tensor decomposition; the correlation between the fruit tree image feature vector and the problem keyword feature vector is modeled by a full tensor; and the full tensor is decomposed by the bilinear fusion method into a structure with three internal mode matrices and a core tensor.

Further, the complexity of the core tensor is controlled by a structural sparsity constraint on the tensor slice matrix.

Compared with the prior art, the visual question-answer-based fruit tree full-growth period farming operation decision method has the following beneficial effects:

Existing intelligent agriculture technologies rely on single-modal data and are difficult to apply intelligently. The fruit tree visual question-answering model automatically extracts the information contained in images taken during the fruit tree growth period and captures the positions of pests and diseases. Fruit tree image data is detected in real time and fused with relevant text knowledge from a fruit tree management knowledge base, so that the fusion of text and image data yields more accurate growth-period decisions. The method can provide orchard managers with near-expert suggestions based on image analysis and helps them select more targeted farm work decisions for the fruit trees.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

Fig. 1 is a schematic flow chart of the visual question-answering-based fruit tree full-growth period farm work decision method according to an embodiment of the invention.

Fig. 2 is a schematic diagram of a format of a text data set of a fruit tree image problem according to an embodiment of the present invention.

Fig. 3 is a schematic flowchart of training a classifier according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a framework of a training classifier according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of the accuracy of the agricultural operation decision method for the fruit tree in the whole growth period based on visual question answering according to the embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Referring to fig. 1, the embodiment of the invention provides a visual question-answer-based decision method for farm work in the whole growth period of fruit trees, which specifically comprises the following steps:

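S1. Obtaining an image sample and a first text sample for a growth cycle of a target fruit tree; the first text sample comprises a fruit tree disease treatment problem;

S2. Respectively extracting the features of the image sample and the first text sample to obtain corresponding image feature vectors and corresponding problem keyword feature vectors;

S3. Introducing a multi-modal fusion model; respectively feeding the image feature vectors and the problem keyword feature vectors into the multi-modal fusion model, and outputting fused multi-modal features;

S4. Inputting the fused multi-modal features into a trained classifier, and outputting the correct answer corresponding to the fruit tree disease treatment problem.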

The above steps will be described in detail below.

In step S1, the image sample and the first text sample can be obtained as follows: visible-light images of the fruit trees are captured at an experimental base, some disease images are downloaded from relevant websites, and question-answer pairs for the images are annotated manually according to fruit tree management books and knowledge bases collected from the internet. The image sample mainly comprises close-range images and unmanned aerial vehicle remote sensing images of the target fruit tree (such as citrus, litchi, longan, grape and other fruit trees); the first text sample mainly comprises a fruit tree disease treatment problem. The format of the fruit tree image question text data set provided by the embodiment of the invention is shown in Fig. 2.

In step S2, when features are extracted from an image sample, the input image is processed by a target detection algorithm based on the residual network ResNet-152 to obtain an image feature vector; the image feature vector comprises a fruit tree image feature vector and a disease image position feature vector; to form the image feature vector, each image sample is divided into a plurality of regions; each region is represented by a 2048-dimensional vector; a corresponding object detector and attribute classifier are configured for each region; the object bounding box of each object detector has a corresponding attribute class, such that a binary description of the object is obtained;

when features are extracted from the first text sample, an English question sentence is input, and feature extraction is performed on the first text sample by using a word vector embedding method and a long short-term memory neural network to obtain a problem keyword feature vector; specifically: the input fruit tree disease control problem is split into single words, and N words are intercepted from them; if there are more than N words, the redundant words are deleted; if there are fewer than N words, the sequence is padded with 0; in this process, a TrimZero function is used to exclude the zero values introduced by the padding; the semantic features of the intercepted words are then captured with a 300-dimensional word vector model (namely the GloVe model) and converted into a problem feature vector; the problem feature vector is encoded by a long short-term memory neural network (namely an LSTM network), and problem keyword feature information is extracted from it to obtain the problem keyword feature vector; meanwhile, a pre-trained BERT-base model is adopted as a text feature extraction model, and its output is used as the input of the subsequent network;

the problem keyword feature vector is acquired with the LSTM network as follows: an L2 distance matrix is calculated from the problem feature vectors; unsupervised sentence-level clustering is carried out according to the distance matrix, dividing the sentences that make up a report into groups corresponding to different pests and diseases; according to the clustering result, the sentences within each cluster are ranked by similarity and the top-ranked sentences are selected; verbs are restored to their base forms with a syntax analysis tool, and the vocabulary in the current sentence group is counted; a threshold is set according to the word analysis result to select the high-frequency words in the group, and the high-frequency words are screened by part-of-speech analysis to obtain the nouns and noun phrases among them; the noun parts of the high-frequency words are set as the core of the question, and the adjective and adverb parts are set as the core content of the answer; the remaining parts of each question-answer pair are completed according to grammatical rules, generating the question-answer data set required by the visual question-answer model.
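Two of the steps in this passage, the L2 distance matrix and the threshold-based selection of high-frequency words, can be sketched as follows. This is only an illustration under stated assumptions: the sample vectors and sentences are invented, the helper names are hypothetical, and the clustering, lemmatization and part-of-speech screening steps are not shown.

```python
import math
from collections import Counter
from typing import List, Sequence


def l2_distance_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise Euclidean (L2) distances between sentence feature vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def high_frequency_words(sentences: Sequence[str], min_count: int) -> List[str]:
    """Words occurring at least min_count times within one sentence group."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return sorted(word for word, c in counts.items() if c >= min_count)


# Invented toy data for illustration only.
vectors = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
group = [
    "spray copper fungicide on canker",
    "canker lesions appear on leaves",
    "prune leaves with canker",
]
print(l2_distance_matrix(vectors)[0])          # distances from the first sentence
print(high_frequency_words(group, min_count=2))
```

In a fuller pipeline the distance matrix would feed a clustering routine, and the frequent words would be filtered by part of speech before being assigned to the question or answer side.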

In step S3, a multi-modal fusion scheme is adopted, mainly to control the number of model parameters by reducing the size of the single-modal embeddings. The correlation between the fruit tree image feature vector and the problem keyword feature vector is modeled by a full tensor T. When applied in the training process, the multi-modal fusion model also acts as a regularizer, which prevents overfitting and improves the flexibility of adjusting the input/output sizes. To further control the number of model parameters, the full tensor is decomposed by the bilinear fusion method into a structure with three internal mode matrices and a core tensor, wherein the complexity of the core tensor is controlled by a structural sparsity constraint on the tensor slice matrix. This is formulated as:

$$T = ((T_c \times_1 W_q) \times_2 W_v) \times_3 W_o$$

where $T$ represents the full tensor; $W_q$, $W_v$ and $W_o$ represent the three internal mode matrices; $T_c$ represents the core tensor; $q$ represents a fruit tree image feature vector; $v$ represents a problem keyword feature vector;

the multi-modal fusion model is an attention mechanism model of H parallel heads, which allows the model to focus on information from different representation subspaces at different positions simultaneously, and calculates an output feature matrix as:

$$F = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}([\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H])\, W_0$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

$F$ represents the output: the $H$ attention heads $\mathrm{head}_i$ are concatenated and transformed by another learnable linear projection $W_0$ to produce the final output.

$\mathrm{head}_i$ denotes the calculation for each attention head.

$W_i^Q$, $W_i^K$ and $W_i^V$ represent learnable parameters, i.e., the weight matrices; $Q$, $K$ and $V$ represent three fixed matrices.
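The multi-head computation can be sketched in plain Python. The patent does not state which attention function each head uses, so scaled dot-product attention is assumed here; the matrices in the example are arbitrary toy values, and all function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Assumed scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(Q: Matrix, K: Matrix, V: Matrix, W_q: List[Matrix],
               W_k: List[Matrix], W_v: List[Matrix], W_0: Matrix) -> Matrix:
    """F = Concat(head_1, ..., head_H) W_0 with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    # Concatenate the per-head outputs row by row.
    concat = [sum((head[i] for head in heads), []) for i in range(len(Q))]
    return matmul(concat, W_0)


# Toy example: H = 2 heads, identity projections, W_0 keeps the first head only.
I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
W_0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
F = multi_head(X, X, X, [I2, I2], [I2, I2], [I2, I2], W_0)
print(F)
```

Each head sees its own projection of the inputs, which is what lets the model attend to different representation subspaces at once.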

After the learned attention features have been weighted, the problem features are output and then fed into a LayerNorm layer; the feed-forward layer comprises two fully connected layers, a ReLU function and a Dropout function; finally, the LayerNorm layer yields the final feature y obtained through self-attention, expressed as:

$$y = ((T_c \times_1 (q^T W_q)) \times_2 (v^T W_v)) \times_3 W_o$$
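A small numeric sketch may make the decomposed fusion concrete: each input vector is projected by its mode matrix, the two projections are contracted with the core tensor, and the result is projected by $W_o$. The dimensions and values below are toy assumptions, not values from the patent, and the sparsity constraint on the core tensor is not shown. The second helper compares parameter counts for a full tensor and for the decomposed form under hypothetical sizes.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]


def project(x: Vector, w: Matrix) -> Vector:
    """Compute x^T W, where the rows of W are indexed like the entries of x."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def tucker_fusion(q: Vector, v: Vector, W_q: Matrix, W_v: Matrix,
                  W_o: Matrix, T_c: Tensor3) -> Vector:
    """y = ((T_c x1 (q^T W_q)) x2 (v^T W_v)) x3 W_o."""
    q_t = project(q, W_q)              # projected first input
    v_t = project(v, W_v)              # projected second input
    t_o = len(T_c[0][0])
    # Contract the core tensor with both projected vectors.
    z = [sum(T_c[i][j][k] * q_t[i] * v_t[j]
             for i in range(len(q_t)) for j in range(len(v_t)))
         for k in range(t_o)]
    return project(z, W_o)             # map to the output space


def parameter_counts(d_q: int, d_v: int, d_o: int,
                     t_q: int, t_v: int, t_o: int) -> Tuple[int, int]:
    """(full tensor size, decomposed size) for the given dimensions."""
    full = d_q * d_v * d_o
    decomposed = d_q * t_q + d_v * t_v + d_o * t_o + t_q * t_v * t_o
    return full, decomposed


# Toy values for illustration only.
I2 = [[1, 0], [0, 1]]
T_c = [[[1], [1]], [[1], [1]]]         # core tensor of shape 2 x 2 x 1
print(tucker_fusion([1, 2], [3, 4], I2, I2, [[1, 2]], T_c))
print(parameter_counts(100, 100, 100, 10, 10, 10))
```

With the hypothetical sizes above, the decomposed form needs far fewer parameters than the full tensor, which is the motivation the text gives for the decomposition.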

In step S4, as shown in Fig. 3 and Fig. 4, the classifier is trained as follows:

acquiring a large number of image samples and second text samples for the growth cycle of the fruit tree; the second text sample comprises a fruit tree disease treatment problem and a real answer corresponding to the fruit tree disease treatment problem; based on which a fruit vision problem dataset can be composed;

respectively extracting the features of the image sample and the second text sample to obtain an image feature vector and a problem keyword feature vector; preprocessing the image feature vector and the problem keyword feature vector; the preprocessing specifically comprises: extracting the fruit tree image feature vector and the disease image position feature vector from the image feature vector through a multi-view attention mechanism; capturing the relation between the fruit tree image feature vector and the problem keyword feature vector, and performing text representation learning to obtain the correlation between the fruit tree image features of the target fruit tree and the problem keyword features; and making the fruit tree image feature vector and the disease image position feature vector interact to obtain the correlation between the fruit tree image features and the disease image position features. In this way the pest and disease images can be better understood, and modeling positional relations (such as front, back, left, right, foreground and background) allows questions about position to be answered effectively, which makes it easier to locate the pest and disease regions and to provide effective diagnoses for fruit growers and experts.

Introducing a multi-modal fusion model based on bilinear interaction between modalities; respectively feeding the preprocessed image feature vectors and the problem keyword feature vectors into the multi-modal fusion model, and outputting the fused multi-modal features f; mapping the fused multi-modal features f to a vector $s \in \mathbb{R}^L$ and applying a sigmoid function, wherein L is the number of the most frequent answers in the training set; expressed as:

s＝Linear(f)

A＝sigmoid(s)

wherein A represents a model predictive answer;

the multi-modal fusion model adopts a multi-view attention mechanism, scores the disease image position region embeddings according to the problem keyword feature vector, and calculates a global visual vector as the score-weighted sum pooling of those embeddings; the multi-modal fusion model adopts a bilinear fusion mechanism based on tensor decomposition; the correlation between the fruit tree image feature vector and the problem keyword feature vector is modeled by a full tensor; the full tensor is decomposed by the bilinear fusion method into a structure with three internal mode matrices and a core tensor; the complexity of the core tensor is controlled by a structural sparsity constraint on the tensor slice matrix.
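The score-weighted pooling of region embeddings can be sketched as below. The patent does not specify the scoring function, so a dot product between each region embedding and the question vector, normalized with a softmax, is assumed; the function name and the sample values are illustrative.

```python
import math
from typing import List


def attention_pool(regions: List[List[float]], question: List[float]) -> List[float]:
    """Global visual vector as a score-weighted sum of region embeddings.

    Scoring by dot product plus softmax is an assumption made for this sketch.
    """
    scores = [sum(r * q for r, q in zip(region, question)) for region in regions]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]            # one weight per region
    dim = len(regions[0])
    return [sum(w * region[d] for w, region in zip(weights, regions))
            for d in range(dim)]


# Two toy region embeddings; the second question vector favors the first region.
regions = [[1.0, 0.0], [0.0, 1.0]]
print(attention_pool(regions, [0.0, 0.0]))   # equal scores give the plain average
print(attention_pool(regions, [10.0, 0.0]))  # dominated by the first region
```

Regions that match the question receive larger weights, so the pooled vector emphasizes the image areas most relevant to what is being asked.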

And finally, taking the fused multi-modal characteristics as input, and taking a real answer corresponding to the fruit tree disease control problem as output for training a classifier.

In the embodiment of the present invention, the method further comprises evaluating the classifier; that is, when the model predicts on a visual question-answering task, the accuracy metric of the VQA data set is as follows:

if the predicted answer matches the answers provided by at least 3 annotators, the prediction accuracy is considered to be 100%. This metric takes the consensus among annotators into account and is adopted by most researchers. The evaluation indices generally used are mean average precision (MAP), precision, recall and F1 score.
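The consensus rule can be written as a one-line function. The text only states the case of three or more matching annotators; the partial credit for fewer matches follows the usual VQA benchmark convention, min(matches / 3, 1), and is an assumption here rather than something the patent spells out. The sample answers are invented.

```python
from typing import List


def vqa_accuracy(predicted: str, annotator_answers: List[str]) -> float:
    """Score 1.0 when at least 3 annotators gave the predicted answer.

    Partial credit below 3 matches (matches / 3) is the common VQA convention
    and is assumed here.
    """
    normalized = predicted.strip().lower()
    matches = sum(1 for a in annotator_answers if a.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


# Invented annotations for illustration only.
answers = ["spray fungicide", "spray fungicide", "Spray fungicide", "prune"]
print(vqa_accuracy("spray fungicide", answers))
print(vqa_accuracy("prune", answers))
```

Averaging this score over all questions in a test set gives the overall accuracy figure.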

The prediction stage can be regarded as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction; a binary cross-entropy function is used as the loss for this regression and prediction; the loss value is determined from the real answer and the predicted answer, and the model is updated according to the loss value. The accuracy of the visual question-answering-based fruit tree full-growth period farm work decision method is shown in Fig. 5.
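The prediction and loss steps can be sketched as follows: a sigmoid turns each candidate answer's score into a probability, the most probable answer is chosen, and binary cross-entropy compares the probabilities with the ground-truth targets. The scores and candidate answers are invented for illustration, and the function names are hypothetical; the model update itself is not shown.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(probs: Sequence[float], targets: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over all candidate answers."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)    # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def predict_answer(scores: Sequence[float],
                   answers: Sequence[str]) -> Tuple[str, List[float]]:
    """Apply the sigmoid to each score and return the most probable answer."""
    probs = [sigmoid(s) for s in scores]
    best = max(range(len(probs)), key=probs.__getitem__)
    return answers[best], probs


# Invented scores and candidate answers for illustration only.
candidates = ["prune", "spray fungicide", "irrigate"]
answer, probs = predict_answer([-1.0, 2.0, 0.5], candidates)
print(answer)
print(binary_cross_entropy(probs, [0.0, 1.0, 0.0]))
```

Treating each candidate answer as its own yes/no decision is what allows a sigmoid output and a binary loss instead of a single softmax over all answers.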

In the embodiment of the invention, image samples and question samples for the growth cycle of the fruit tree are obtained and combined into a fruit tree visual question-answer data set, and image and text features are extracted separately. The method not only extracts the image and question features, but also uses a multi-view attention mechanism to explore the correlations of the words and of the image with the question, thereby effectively mining the important information in the text through the semantic relations between images and text. The fusion adopts a bilinear fusion mechanism based on tensor decomposition, and the model is trained with a multi-head attention mechanism to optimize the method, which finally improves the accuracy of the fruit tree growth-period decision task.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A visual question-answering-based fruit tree full-growth period farming operation decision method is characterized by comprising the following steps:

2. The fruit tree full-growth-period farming work decision method based on visual question answering according to claim 1, wherein in the step S2, the image feature vector comprises a fruit tree image feature vector and a disease image position feature vector.

3. The visual question-answer-based fruit tree full-growth period farming work decision method according to claim 1, wherein the S2 specifically comprises:

4. The visual question-answer based fruit tree full-growth period farming operation decision method according to claim 3, wherein the image feature vectors are that each image sample is divided into a plurality of regions; each region is represented by a vector with 2048 dimensions and is used as the input of a subsequent network;

5. The visual question-answering-based fruit tree full-growth period farming work decision method as claimed in claim 3, wherein the feature extraction is performed on the first text sample by using a word vector embedding method and a long-short term memory neural network to obtain a question keyword feature vector, and the method specifically comprises:

processing the input fruit tree disease control problem into a plurality of single words, and intercepting N words from the plurality of single words; if the number of the single words is less than N, filling with 0;

capturing semantic features of the intercepted words by combining a 300-dimensional word vector model, and converting the semantic features into problem feature vectors;

and coding the problem feature vector by using a long-short term memory neural network, and extracting problem keyword feature information from the problem feature vector to obtain the problem keyword feature vector.

6. The visual question-answer based fruit tree full-growth period farming operation decision method as claimed in claim 1, wherein in S4, the classifier is trained by the following means:

preprocessing the image feature vector and the problem keyword feature vector;

7. The visual question-answering-based fruit tree full-growth period farming work decision method according to claim 6, wherein the preprocessing specifically comprises:

8. The visual question-answering-based fruit tree full-growth period farming work decision method as claimed in claim 7, wherein the multi-modal fusion model employs a multi-view attention mechanism, scores the disease image position region embeddings according to the problem keyword feature vector, and calculates a global visual vector as the score-weighted sum pooling of those embeddings.

9. The visual question-answering-based fruit tree full-growth period farming work decision method as claimed in claim 6, wherein the multi-modal fusion model adopts a bilinear fusion mechanism based on tensor decomposition; the correlation between the fruit tree image feature vector and the problem keyword feature vector is modeled by a full tensor; and the full tensor is decomposed by the bilinear fusion method into a structure with three internal mode matrices and a core tensor.

10. The visual question-answering-based fruit tree full-growth period farming work decision method of claim 9, wherein the complexity of the core tensor is controlled by structural sparsity constraint on a tensor slice matrix.