NL2034409A - A biological invasion identification method based on multi-source data fusion analysis - Google Patents

A biological invasion identification method based on multi-source data fusion analysis

Info

Publication number
NL2034409A
Authority
NL
Netherlands
Prior art keywords
data
text
invasive
matrix
probability matrix
Prior art date
Application number
NL2034409A
Other languages
Dutch (nl)
Inventor
Chen Biyun
Original Assignee
Univ Yancheng Teachers
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Yancheng Teachers filed Critical Univ Yancheng Teachers
Publication of NL2034409A publication Critical patent/NL2034409A/en


Classifications

    • G06F18/253 Fusion techniques of extracted features
    • G06F16/29 Geographical information databases
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/09 Supervised learning
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/454 Integrating biologically inspired filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/764 Image or video recognition using classification, e.g. of video objects
    • G06V10/82 Image or video recognition using neural networks
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present invention discloses a method for identifying biological invasions based on multi-source data fusion analysis, comprising the steps of: acquiring a multi-source data set containing invasive organism data and tagging the invasive organism data, the data set including text data, image data, temporal data and geolocation data; classifying the text data and outputting a text probability matrix with markers; identifying the location of invasive organisms in the image data, determining boundaries and sizes, and training an image probability matrix with markers; one-hot encoding the temporal data and constructing a time-space feature matrix from the encoded data together with the geolocation data; constructing a multi-feature vector from the text probability matrix, the image probability matrix and the time-space feature matrix; assigning weights to the multi-feature vector and training a binary classifier using a machine learning algorithm; and feeding the data to be predicted into the binary classifier to obtain invasive organism data.

Description

A BIOLOGICAL INVASION IDENTIFICATION METHOD BASED ON MULTI-SOURCE
DATA FUSION ANALYSIS
Technical field
The present invention relates to the field of big data and artificial intelligence technology, and in particular to a biological invasion identification method based on multi-source data fusion analysis.
Background technology
As globalisation accelerates and land use patterns change, biological invasions have become a worldwide ecological security problem. The means to combat them at this stage include: establishing appropriate monitoring systems to identify the species, number, distribution and role of alien species; increasing education on the dangers of biological invasions and raising community awareness to prevent them; and actively searching for identification and control techniques for invasive alien species in order to effectively curb the current trend of biological invasion.
At present, artificial intelligence technology is becoming a new engine in the field of ecological resources, but it is worth noting that fusing data from multiple sources raises problems of matching data structure and accuracy: the monitoring information obtained contains many different types of data, and the question is how to use such multi-characteristic data for rapid identification and intelligent diagnosis of alien species and, on that basis, for risk analysis and prognosis.
Summary of the invention
The present invention addresses problems in existing solutions and provides a method for identifying biological invasions based on fusion analysis of data from multiple sources, comprising the steps of:
Obtaining a multi-source data set containing invasive organism data and tagging the invasive organism data; said data set includes: text data, image data, time data, and geographic location data.
Classifying said text data and outputting a text probability matrix with markers.
For said image data, identifying the location of the invasive organisms in the images, determining the boundaries and sizes, and training an image probability matrix with markers.
One-hot encoding said temporal data and constructing a time-space feature matrix from the encoded data together with said geolocation data.
Constructing a multi-feature vector based on said text probability matrix, said image probability matrix and said time-space feature matrix; assigning weights to said multi-feature vector and training a binary classifier using a machine learning algorithm.
Feeding the data to be predicted into the binary classifier to obtain invasive organism data.
Further, classifying said text data specifically comprises: removing stop words from said text data, constructing N-gram features using Fast-Text, subjecting the text content to a sliding window operation of size N in byte order, and finally forming a sequence of byte fragments of length N. The resulting sequence is used as a text feature candidate set from which important features are filtered, and a text probability matrix with markers is output using Softmax.
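The sliding-window N-gram construction just described can be sketched as follows. This is a minimal illustration, assuming character-level trigrams with '<' and '>' as the prefix and suffix markers mentioned later in the description; it is not the patent's implementation:

```python
def char_ngrams(word, n=3):
    """Slide a window of size n over '<word>' to build byte-fragment
    (character n-gram) features, FastText-style; '<' marks the prefix
    and '>' marks the suffix of the word."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

# Example: trigram candidate set for one word
print(char_ngrams("frog"))  # ['<fr', 'fro', 'rog', 'og>']
```

The resulting fragments form the candidate feature set from which important features are then filtered.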
Further, identifying the location of the invasive organism in said image data specifically comprises: determining the location of the invasive organism to be identified by means of an image recognition algorithm such as a CNN, zooming in, determining the boundaries and the image size, and training an image probability matrix with markers using the CNN.
Further, assigning weights to said multi-feature vector and training a binary classifier using a machine learning algorithm specifically comprises: normalising said multi-feature vector, assigning weights using the entropy weighting method, and training a binary classifier using a machine learning algorithm such as SVM.
Further, feeding the data to be predicted into the binary classifier to obtain invasive organism data specifically comprises: inputting the data to be predicted and using the SVM to produce the final marker; when the output marker is 1, the data uploaded by the user at that location and time period is true, meaning that invasive species have been present there and should be dealt with promptly.
Further, said method for identifying biological invasions based on fusion analysis of multiple sources of data further comprises: using a GM model on said time-space feature matrix to predict the migration or reproduction patterns of future invasive organisms.
Further, the classifier in said binary classifier can also be replaced by methods such as random forest, logistic regression or a neural network.
Compared to the prior art, the present invention provides a biological invasion identification method based on multi-source data fusion analysis, which has the following beneficial effects:
This multi-source data fusion analysis-based biological invasion identification method incorporates textual data, image data, temporal data and geographic location data, and is applied to the rapid identification and monitoring of species, while also predicting the development trend of species over time, providing a basis for the formulation of corresponding rational and efficient conservation and management measures.
Description of attached drawings
Fig 1 shows the Fast-Text flowchart;
Fig 2 shows a flowchart specific to the output layer from the fully connected layer;
Fig 3 shows the probability distribution of the 11-100 training sets;
Fig 4 shows the test set recall statistics;
Fig 5 shows a graph of test set accuracy statistics;
Fig 6 shows a plot of the stochastic predictions of the training model;
Fig 7 shows a plot of the rate of change metrics for the training model;
Fig 8 shows a schematic diagram of the actual geographical location of the training model;
Fig 9 shows a map of thermal predictions;
Fig 10 shows the overall flowchart of the biological invasion identification method based on multi-source data fusion analysis.
Specific implementation method
Specific embodiments of the invention are described further below in conjunction with the accompanying drawings. The following embodiments are only intended to illustrate the technical solutions of the invention more clearly and cannot be used to limit the scope of protection of the invention.
Example 1
The present invention provides a biological invasion identi- fication method based on fusion analysis of multiple sources of data, as shown in the overall flowchart in Fig 10, comprising the steps of:
Obtain a multi-source dataset containing invasive organism data and tag the invasive organism data; the dataset includes: text data, image data, temporal data, and geographic location data.
The dataset consists of text, images, time and geographical location, expanding the range of data in the dataset. By dividing the dataset into a training set and a test set, controlled experiments can be carried out to facilitate verification of the technical effectiveness of this algorithm.
Classification of the text data and output of the text probability matrix with markers: classifying the text data specifically comprises removing stop words from the text data, constructing N-gram features using Fast-Text, subjecting the text content to a sliding window operation of size N in byte order, and finally forming a sequence of byte fragments of length N. The resulting sequence is used as a candidate set from which important features are filtered, and a text probability matrix with markers is output using Softmax.
The data provided by people commenting on invasive organisms can be useful for laboratory identification, so the textual data can have a significant impact on the identification of invasive organisms. To preserve the morphological features within each of its 5 words, feature extraction is performed on the word vector of each record. This is a sliding window operation of size N in byte order, resulting in a sequence of byte fragments of length N, where '<' denotes a prefix and '>' denotes a suffix. A trigram built between '<' and '>' can then be used to represent a word, and 5 such vectors can be superimposed to better represent the word vector. The discrete variables are converted into continuous vectors by Embedding to form the word vector of a record, W_j = [w_1j, w_2j, ..., w_ij, ..., w_mj], which denotes the word vector of the j-th record. Fig 1 shows the flowchart of Fast-Text: the Embedding-processed word vectors are used as input features, the hidden layer averages over the multiple word vectors, and the negative log-likelihood is used as the loss function:

    -(1/N) * Σ_{n=1..N} y_n · log(f(B A x_n)),

where N is the number of texts, x_n is the word features of the n-th text, y_n is its label, and A and B are weight matrices: A converts the input to a textual representation and B computes the category by linear transformation. f is the Softmax function used to calculate the probability of the final classification. Softmax uses a Huffman-tree-based hierarchical structure to classify the text data and output the probabilities for each category.
The expression for the hierarchical Softmax (reconstructed here from the garbled source) is:

    p(w_c | w_t) = exp(s(w_t, w_c)) / Σ_{w ∈ V} exp(s(w_t, w)),

where p(w_c | w_t) is the final probability of the text and w denotes the word vector.
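For illustration, this probability amounts to a softmax over scores; the hierarchical (Huffman-tree) variant computes the same distribution as a product of binary decisions along the tree path, reducing the cost from O(V) to O(log V). The function below is a sketch of the flat form only, not the patent's implementation:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities exp(s_c) / sum_w exp(s_w)."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9          # a valid probability distribution
```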
The final probability matrix for the n records is obtained, as shown in Table 1, where T_n denotes the n records, L_k denotes that there are k categories, and p_ij denotes the probability of category j in the i-th record.
Table 1: Text data probability matrix
      L_1   ...   L_j   ...   L_k
T_1   p_11  ...   p_1j  ...   p_1k
T_i   p_i1  ...   p_ij  ...   p_ik
T_n   p_n1  ...   p_nj  ...   p_nk
Identifying the location of invasive organisms in images, determining boundaries and sizes, and training an image probability matrix with markers on the image data: the location of the invasive organism is identified in the image data by means of an image recognition algorithm such as a CNN, zooming in, determining the boundaries and the size of the image, and training the image probability matrix with markers using the CNN.
It is important to extract features from the image data, as the information uploaded by the public has a great influence on the laboratory judgement. First, the image data is preprocessed: data that is not an image is deleted and the file suffix is normalised. A CNN (convolutional neural network) is mainly divided into convolutional layers (CONV), pooling layers (POOL) and fully connected layers (FC). CNNs are mainly used to extract features step by step, from local features to overall features, to perform image recognition and other functions. Processing images with a CNN is divided into four stages: the input layer, the convolutional and downsampling layers, the fully connected layer and the output layer. The input layer takes RGB colour images; the RGB components are convolved with the convolution-layer weights W to obtain the individual C layers, which are then downsampled to obtain the individual S layers. After the activation function, the output of these layers is called a Feature-Map; the fully connected layer expands each element of the Feature-Map in turn, arranging them in a column, and Softmax is used for classification in the output layer.
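The CONV → POOL → FC → Softmax pipeline described above can be sketched in PyTorch (the framework the embodiment later says it uses). The layer counts, channel widths and 64×64 greyscale input size below are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class InvasiveSpeciesCNN(nn.Module):
    """Minimal CONV -> POOL -> FC pipeline: 1-channel greyscale input,
    feature extraction, then class probabilities via Softmax."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # C layer
            nn.ReLU(),
            nn.MaxPool2d(2),                             # S (downsampling) layer
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 64x64 input -> two 2x pools -> 16x16 Feature-Map with 16 channels
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)

    def forward(self, x):
        feat = self.features(x)
        flat = feat.flatten(1)                           # expand the Feature-Map into a column
        return torch.softmax(self.classifier(flat), dim=1)

model = InvasiveSpeciesCNN()
probs = model(torch.randn(4, 1, 64, 64))                 # batch of 4 greyscale images
```

Each row of `probs` sums to 1, giving the per-class probabilities used to build the image probability matrix.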
Fig 2 is a specific flow diagram from the fully connected layer to the output layer, where the invasive Asian organisms are used as the input-layer data. Since the dataset consists of RGB colour images, three separate 2D kernels are used to scale and greyscale the images, quickly converting the 3-channel RGB colour images to 1-channel greyscale. After multiple rounds of convolution, pooling and activation, the features are extracted and passed through a fully connected layer, and the probability of being an invasive Asian organism is output using the Softmax function.
Here X (height × width × channel) is the input pixel matrix and Y is the output matrix: the input is convolved and pooled, the multidimensional data is flattened and connected to a fully connected layer, and the output is the class probability for classification using a traditional Softmax. A vector of size T × L is obtained, where each value represents the probability of the input with respect to the samples. This gives the classification probability for the image file, denoted C.
The final probability matrix for the image data is obtained as shown in Table 2, with q_ij denoting the probability of category j in the i-th record.
Table 2: Picture data probability matrix
      L_1   ...   L_j   ...   L_k
T_1   q_11  ...   q_1j  ...   q_1k
T_i   q_i1  ...   q_ij  ...   q_ik
T_n   q_n1  ...   q_nj  ...   q_nk
One-hot encoding of the temporal data and construction of a time-space feature matrix from the encoded data together with the geolocation data; construction of a multi-feature vector based on the text probability matrix, the image probability matrix and the time-space feature matrix; assigning weights to the multi-feature vector and training a binary classifier using machine learning algorithms. Specifically: the multi-feature vector is normalised, the weight coefficients are assigned using the entropy weight method, and a binary classifier is trained using a machine learning algorithm such as SVM; the classifier in the binary classifier can also be replaced by methods such as random forest, logistic regression or a neural network.
Inputting the data to be predicted into the binary classifier to obtain invasive organism data, specifically: the data to be predicted is input and the final marker is produced using the SVM; when the output marker is 1, the data uploaded by the user at that location and time is true, meaning that invasive species have been present there and should be dealt with promptly.
The method for identifying biological invasions based on multi-source data fusion analysis further comprises: using a GM model on the time-space feature matrix to predict the migration or reproduction patterns of future invasive organisms.
The dataset consists of sighting-report data aggregated by the Washington State Department of Agriculture for December 2020; it contains a spreadsheet of 4440 sighting reports and 3305 user-uploaded images. Sighting reports that have been laboratory-identified are marked with a 1 for invasive organisms and a 0 otherwise. Randomly, 70% of the total data is assigned to the training set and the remaining data to the test set.
For the sighting reports provided by the population, each report is independent of the others; the information is not continuous between reports, nor between individual characteristic values, but discrete and disordered. Features can therefore be digitised using one-hot coding, also known as one-bit-valid coding. This encodes N states using N-bit status registers, each state having its own register bit and only one bit being valid at any given time.
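The N-bit status-register encoding just described can be sketched as follows; the time-period labels are hypothetical examples, not categories from the patent:

```python
def one_hot(states, value):
    """Encode a discrete value with N status-register bits: exactly one bit
    (the matching state) is 1, all others are 0."""
    return [1 if s == value else 0 for s in states]

# Hypothetical discrete time periods for the temporal feature T
periods = ["morning", "afternoon", "evening", "night"]
print(one_hot(periods, "evening"))  # [0, 0, 1, 0]
```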
As invasive organisms are exotic species, there may be few occurrences, and the GM model is well suited to small amounts of incomplete information, calculating the range of variation for the time of occurrence of invasive organisms as well as their geographical location. The GM(1,1) model can be expressed as Y = Bu, with the predicted sample range indicating the regional division. In order to ensure the viability of the GM(1,1) modelling approach, the necessary tests need to be carried out on the known data. Let the original data sequence be x^(0) = (x^(0)(1), x^(0)(2), ..., x^(0)(n)), where x^(0) is the original time-data sequence, and calculate the rank ratios of the series. The series can be modelled with GM(1,1) and grey predictions can be made if all ratios fall within the allowable coverage interval (e^(-2/(n+1)), e^(2/(n+1))); otherwise, apply an appropriate transformation to the data, e.g. the translation y^(0)(k) = x^(0)(k) + c, k = 1, 2, ..., n.
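The rank-ratio test and grey prediction just described can be sketched as follows. This is the standard GM(1,1) formulation (accumulated generating operation, background values, least-squares fit of Y = Bu) offered as an assumption of what the embodiment computes, not the patent's own code:

```python
import math

def gm11(x0, steps=1):
    """GM(1,1) sketch: rank-ratio test, least-squares fit, restored forecast."""
    n = len(x0)
    # Rank-ratio test: every lambda(k) must lie in (e^{-2/(n+1)}, e^{2/(n+1)})
    lo, hi = math.exp(-2 / (n + 1)), math.exp(2 / (n + 1))
    if not all(lo < x0[k - 1] / x0[k] < hi for k in range(1, n)):
        raise ValueError("series fails the rank-ratio test; translate the data first")
    # Accumulated generating operation (AGO) and background values z1
    x1 = [sum(x0[:k + 1]) for k in range(n)]
    z1 = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    # Least-squares solution of Y = B u for u = (a, b), with B = [[-z, 1], ...]
    m = n - 1
    sz, szz = sum(z1), sum(z * z for z in z1)
    sy = sum(x0[1:])
    szy = sum(z * y for z, y in zip(z1, x0[1:]))
    det = szz * m - sz * sz
    a = (sz * sy - m * szy) / det          # development coefficient
    b = (szz * sy - sz * szy) / det        # grey action quantity
    # Restored forecasts x0_hat(k+1) = (x0(1) - b/a)(1 - e^a) e^{-a k}
    c = x0[0] - b / a
    return [c * (1 - math.exp(a)) * math.exp(-a * k) for k in range(n, n + steps)]

print(gm11([100, 105, 110, 116, 122]))     # one-step-ahead forecast, roughly 128
```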
The model is used to derive the range of areas where events occur in the different records, and the planning forecasts collate the range of trends. For the range predicted by the GM, a judgement is first made as to whether a record lies within the range: if it does, the value is 1, otherwise 0. L(Location) ∈ {0, 1} is used to denote the geographic feature value of the record. For the temporal feature T, where event occurrence is uncertain, one-hot coding is used to effectively differentiate between different time periods as discrete values.
The first step is data feature extraction, normalised according to the four different features (text, image, time, location). The text data and image data provided by the public have a high degree of authenticity, whereas the time and geographical ranges are relatively broad and do not carry specific meaning, so the overall values need to be weighted, with the assigned weights summing to 1.
For the extracted probabilistic features F for text data and C for image data, the default value of 1/k is used. Since the events are independent of each other, the probability of k events occurring at the same moment is only 1/k, and the k time probabilities should sum to 1. For missing times and geographical locations, the average of the preceding and following records can be used to fill the gaps.
For multi-characteristic events, which are inherently random, entropy values are calculated to determine the degree of randomness and disorder of the event and to judge the degree of dispersion of each indicator; the greater the dispersion of an indicator, the greater its influence on the overall evaluation.
The eigenvector X = {x_1, ..., x_n} first requires a Z-score normalisation of the values between the individual matrices, z_j = (x_j − μ)/σ, where μ and σ denote the mean and standard deviation of the feature vector X.
Information entropy is used to calculate the entropy weight of each indicator: H_j = −k Σ_{i=1..m} y_ij ln y_ij (j = 1, 2, ..., n), where H_j denotes the information entropy of the j-th indicator.
In order to ensure that 0 ≤ H_j ≤ 1, it is common to take k = 1/ln(m) before calculating the deviation degree d_j = 1 − H_j for each indicator.
The normalised eigenvector matrix is multiplied by the indicator weights w_j to obtain the weighted multi-feature evaluation matrix V.
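The entropy weighting steps above (column-normalise, compute H_j with k = 1/ln m, derive deviation degrees, then normalise the weights to sum to 1) can be sketched as follows; this assumes the Z-scored values have already been shifted positive so the logarithm is defined:

```python
import math

def entropy_weights(matrix):
    """Entropy weight method sketch over an m-by-n indicator matrix
    (m records, n indicators): H_j = -k * sum_i y_ij ln y_ij, k = 1/ln(m),
    d_j = 1 - H_j, weights proportional to d_j."""
    m = len(matrix)
    k = 1.0 / math.log(m)
    weights = []
    for col in zip(*matrix):
        total = sum(col)
        y = [v / total for v in col]                        # column-normalised shares
        h = -k * sum(v * math.log(v) for v in y if v > 0)   # information entropy H_j
        weights.append(1.0 - h)                             # deviation degree d_j
    s = sum(weights)
    return [w / s for w in weights]                         # weights sum to 1

# A constant indicator has maximal entropy, so it receives (near) zero weight
w = entropy_weights([[1.0, 9.0], [1.0, 1.0], [1.0, 2.0]])
```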
As a larger number of feature vectors is provided, there is no need to use an overly complex algorithm in the final multi-feature fusion classification, since probability statistics for the first round of predictive classification have already been computed on the data. A traditional SVM classification model is sufficient.
In the SVM classification problem we are given the input data X = {X_1, ..., X_N} and the learning targets Y = {y_1, ..., y_N}.
The input data here are X = [F, C, L, T]. Put simply, F, C, L and T mean that each sample of the input data contains multiple features, thus forming the feature space: X = [X_1, ..., X_n] ∈ 𝒳.
The learning targets are binary variables representing the negative class and the positive class. If such a feature space X exists for the input data, a hyperplane is sought as the decision boundary, w^T X + b = 0, which separates the positive and negative learning targets and makes the point-to-plane distance of every sample satisfy y_i (w^T X_i + b) ≥ 1, where the parameters w and b are the normal vector and intercept of the hyperplane, respectively.
The decision boundary satisfying this condition in effect constructs two parallel hyperplanes as interval boundaries to discriminate the classification of a sample:

    w^T X_i + b ≥ +1  ⇒  y_i = +1  (positive sample)
    w^T X_i + b ≤ −1  ⇒  y_i = −1  (negative sample)
All samples above the upper interval boundary belong to the positive class and those below the lower interval boundary belong to the negative class. The distance between the two interval boundaries, d = 2/‖w‖, is defined as the margin.
The positive and negative class samples lying on the interval boundaries are the support vectors.
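A hedged sketch of the soft-margin idea above: rather than a full SVM solver, sub-gradient descent on the hinge loss finds w and b satisfying y_i (w·X_i + b) ≥ 1 on separable data. The toy data and hyperparameters are illustrative assumptions, not from the patent:

```python
def train_linear_svm(X, y, epochs=200, lr=0.05, lam=0.01):
    """Sub-gradient descent on the soft-margin hinge loss.
    Labels must be +1 / -1; lam is the regularisation strength."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                       # inside the margin: hinge gradient step
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                                # outside: only the regularisation shrink
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    """Sign of w.x + b: +1 for the positive class, -1 for the negative class."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: positives around (2, 2), negatives around (-2, -2)
X = [[2, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
assert all(predict(w, b, xi) == yi for xi, yi in zip(X, y))
```

The margin width of the fitted boundary is d = 2/‖w‖, as in the derivation above.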
The Fast-Text tool was used to classify the text data. The training set is first subjected to a data equalisation operation to obtain the probability distributions of the 100 training sets, which eventually peak in the course of continuous training. As shown in Fig 3, the probabilities converge to 0.9 or 0.1 from roughly the 10th round onwards; even with the uneven sample, Fast-Text does a good job of balancing the sample and judging the probabilistic events.
The sample data were shuffled for training, and the training results for the Asian invasive organism problem, shown in Figs 4 and 5, give a recall and accuracy of 94.6%.
In practice, the PyTorch framework is used to build the CNN models, with 70% of the dataset used for training and 30% for testing. Due to the uneven data sample, simple oversampling was performed, and the final trained model performed well on the test set. Several images were randomly selected and predicted using the trained model, as shown in Fig 6 (True values are the actual categories, Predictions are the model's predictions; Negative indicates that the image is an invasive Asian organism, Positive indicates that it is not).
Finally, the metrics for evaluating the training model are shown in Table 3 and Fig 7.
Table 3: Metrics for training models
Indicator Score
Precision 93.11%
Recall 100.00%
ACC 96.32%
F1-score 96.43%
AUC 99.99%
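The indicators in Table 3 (precision, recall, ACC, F1) can all be computed from the confusion-matrix counts, as sketched below; the toy labels are hypothetical, not the patent's test data:

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    acc = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, f1

p, r, a, f = binary_metrics([1, 1, 0, 0, 1], [1, 1, 0, 1, 1])
# tp=3, fp=1, fn=0, tn=1 -> precision 0.75, recall 1.0, accuracy 0.8
```

(AUC additionally requires the ranked prediction scores rather than hard labels.)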
Given the small number of invasive organism occurrences, the GM model is well suited to small amounts of incomplete information, calculating the range of variation for the time of invasive organism occurrence as well as the geographical location.
Take c so that all data columns fall within the allowable coverage. After calculating the grade ratio values, it was found that the grade ratio test values for both data series lie within the standard interval [0.857, 1.166], implying that the data are suitable for constructing a GM(1,1) model. After testing the data, the development coefficient a, the grey action b and the posterior error ratio C are calculated for the GM model, as shown in Table 4:
Table 4: Results of the model construction
Index                      Longitude    Latitude
Development coefficient a  0.0001       0.0001
Grey action b              -122.7586    48.9994
Posterior error ratio C    0.0468       0.5956
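The grade ratio test above can be sketched as follows. The admissible band (e^(-2/(n+1)), e^(2/(n+1))) is the standard GM(1,1) admissibility criterion, and the sample series is invented for illustration:

```python
import math

def grade_ratio_test(series):
    """Check that every grade (stepwise) ratio lambda(k) = x(k-1)/x(k)
    lies in the admissible band required before fitting GM(1,1)."""
    n = len(series)
    lo, hi = math.exp(-2.0 / (n + 1)), math.exp(2.0 / (n + 1))
    ratios = [series[k - 1] / series[k] for k in range(1, n)]
    return all(lo < r < hi for r in ratios), (lo, hi)

# Invented, slowly varying series: all ratios fall inside the band.
ok, band = grade_ratio_test([100.0, 102.0, 105.0, 103.0, 106.0])
print(ok)
```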
The posterior error ratio C values for both models are less than 0.65, and the longitude model's value of 0.0468 is below 0.35, indicating that the longitude model is particularly good. Forecasts are therefore made for latitude and longitude, and the residuals are tested after prediction, including relative errors and grade deviations. For both the longitude and latitude data, the maximum relative errors are less than 0.1, and the grade deviation values are also less than 0.1, meaning that the stricter requirements are met and the model fit is at a high level; the relevant ranges are plotted according to their geographical location, as shown in Fig 8.
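A minimal GM(1,1) sketch, assuming the standard formulation (first-order accumulation, background values, least-squares estimates of the development coefficient a and grey action b). It illustrates the grey model described above and is not the patent's exact implementation:

```python
import math

def gm11_forecast(x0, steps=1):
    """Fit GM(1,1) to series x0 and extrapolate `steps` values beyond it."""
    n = len(x0)
    x1 = [sum(x0[:i + 1]) for i in range(n)]                 # 1-AGO accumulation
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]     # background values
    y = x0[1:]
    # Least squares for x0(k) + a*z(k) = b via the 2x2 normal equations
    szz = sum(v * v for v in z)
    sz, sy = sum(z), sum(y)
    szy = sum(zv * yv for zv, yv in zip(z, y))
    det = szz * (n - 1) - sz * sz
    a = (sz * sy - (n - 1) * szy) / det      # development coefficient
    b = (szz * sy - sz * szy) / det          # grey action quantity
    # Time response: x1_hat(k+1) = (x0(1) - b/a) e^{-ak} + b/a, then de-accumulate
    x1_hat = [(x0[0] - b / a) * math.exp(-a * k) + b / a for k in range(n + steps)]
    x0_hat = [x1_hat[0]] + [x1_hat[k] - x1_hat[k - 1] for k in range(1, n + steps)]
    return a, b, x0_hat

# Invented, slowly growing series; forecast one step ahead.
a, b, fit = gm11_forecast([100.0, 103.0, 106.1, 109.3, 112.6], steps=1)
print(round(a, 4), round(fit[5], 1))
```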
After predicting the latitude and longitude, the difference between the latitude and longitude of two points is used to calculate the distance between them. The goodness of fit of the two models was: latitude R-squared 71.45%, longitude R-squared 95.31%. As Figure 8 shows, the samples verified as true invasive Asian organisms fall within the latitude range [48.7775, 49.1494] and the longitude range [-123.9431, -122.4186].
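The patent does not state which distance formula it uses; a common choice for converting latitude/longitude differences into a physical distance is the haversine great-circle formula, sketched here on the corners of the verified range:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

# Opposite corners of the verified latitude/longitude range from Fig 8
d = haversine_km(48.7775, -123.9431, 49.1494, -122.4186)
print(round(d, 1))
```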
For the extracted probabilistic features, F for text data and C for image data, the default value 1/k is used: since the events are mutually independent, the probability of k events occurring at the same moment is only 1/k. For the range predicted by the GM model, a judgement is first made as to whether the record falls within the range: the geographic feature L(Location) of the record is 1 if it does and 0 if it does not. For the temporal feature T, given the uncertainty of event occurrence, one-hot coding is used to differentiate effectively between the discrete values of different time periods.
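One-hot coding of a time period can be sketched as below; the quarterly granularity is an assumption for illustration, since the patent does not fix the period length:

```python
def one_hot_period(month, periods=("Q1", "Q2", "Q3", "Q4")):
    """One-hot encode a month (1-12) into a quarterly time-period vector,
    the kind of discrete temporal feature T described above."""
    idx = (month - 1) // 3
    return [1 if i == idx else 0 for i in range(len(periods))]

print(one_hot_period(7))  # July falls in Q3 -> [0, 0, 1, 0]
```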
Correlation weights need to be determined between the features; the weight assigned to each feature is calculated using the entropy weighting method, as shown in Table 5:
Table 5: Table for the distribution of weights for different features
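The entropy weighting method referred to above can be sketched as follows. This is the standard formulation (features whose values vary more across samples carry more information and receive larger weights); the sample matrix is invented for illustration:

```python
import math

def entropy_weights(matrix):
    """Entropy weighting over a samples-by-features matrix of
    non-negative values: lower column entropy -> larger weight."""
    n = len(matrix)
    m = len(matrix[0])
    entropies = []
    for j in range(m):
        col = [row[j] for row in matrix]
        total = sum(col)
        p = [v / total for v in col]
        # Normalised Shannon entropy of the column
        e = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
        entropies.append(e)
    d = [1 - e for e in entropies]       # degree of divergence per feature
    return [di / sum(d) for di in d]     # weights normalised to sum to 1

# Invented matrix: the second feature varies far more than the first.
w = entropy_weights([[0.9, 0.2], [0.8, 0.9], [0.7, 0.1]])
print([round(v, 3) for v in w])
```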
Several of the eigenvalues are linearly inseparable, i.e. the separating surface in the feature space is a hypersurface. Using a non-linear function, the non-linearly separable problem can be mapped from the original feature space into a higher-dimensional Hilbert space and thereby transformed into a linearly separable problem; linear regression is used to calculate Table 6.
Table 6: Linear regression calculation table
R2 < 0 was found, so the data may not have any linear relationship. For such multi-feature hyperplane problems, the kernel function converges well; a radial basis function kernel is therefore used for classification. SVM multi-feature fusion analysis was performed, and the results are highly accurate compared with the Fast-Text and CNN neural network models alone, and generalise better than predictions from individual data sources.
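The radial basis function kernel named above is K(x, y) = exp(-gamma * ||x - y||^2); a minimal sketch:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2), used to map the
    linearly inseparable features into a higher-dimensional space."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # identical points -> 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # squared distance 2 -> exp(-2)
```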
Training on 300 sets of data and evaluation of 521 sets of data for classification resulted in Table 7:
Table 7: Classification assessment table
The diagonal contains the numbers of correct predictions: across the 521 records, only one invasive organism was misidentified, while the remaining records were correctly identified as invasive, together with the extent of their occurrence. Predictions were then made for population data that had not yet been experimentally judged, and heat maps were plotted over time from the predicted data, as shown in Fig 9. Traces of invasive Asian organisms may still be present in parts of Washington in the second half of the year, and it may not be possible to eliminate this hazard in the short term.
A comprehensive evaluation of the multi-feature data fusion analysis algorithm, given in Table 8, shows that the algorithm combines multiple feature data sources well and makes reasonable judgements about different events.
Table 8: Comprehensive evaluation of multi-feature data fu- sion analysis algorithms
The classifiers in the binary classifier can also be replaced by methods such as random forests, logistic regression and neural networks.
The technical features of the above embodiments can be com- bined in any number of ways. For the sake of brevity, not all pos- sible combinations of the technical features of the above embodi- ments have been described, however, as long as these combinations of technical features are not contradictory, they should be con- sidered to be within the scope of the present specification.

Claims (7)

CLAIMS

1. A method for identifying biological invasions based on multi-source data fusion analysis, characterized in that it comprises the steps of: acquiring multi-source data sets containing invasive organism data and tagging the invasive organism data, said data sets comprising text data, image data, time data and geolocation data; classifying said text data and outputting a text probability matrix with markers; identifying the location of the invasive organism in the image, determining its boundary and size, and training an image probability matrix with markers for said image data; one-hot encoding said temporal data and constructing a time-space feature matrix from the encoded data together with said geolocation data; constructing a multi-feature vector from said text probability matrix, said image probability matrix and said time-space feature matrix; constructing a binary classifier by assigning weights to said multi-feature vectors and training the binary classifier using machine learning algorithms; feeding the data to be predicted into the binary classifier to obtain invasive organism data.

2. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that classifying said text data specifically comprises: said text data is deactivated, Fast-Text is used to construct features with N-grams, and the text content is subjected to a sliding-window operation of size N in byte order, producing a sequence of byte fragments of length N; the resulting sequence is used as a candidate set of text features, important features are filtered out, and a text probability matrix with markers is output using Soft-Max.

3. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that outputting a probability matrix with markers specifically comprises: the location of the invasive organism to be identified is determined for said image data by the CNN image recognition algorithm, the location is zoomed in on, the boundaries and the image size are determined, and an image probability matrix with markers is trained using the CNN.

4. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that assigning weights to said multi-feature vectors and training a binary classifier using a machine learning algorithm specifically comprises: said multi-feature vectors are normalized, weights are assigned using the entropy weighting method, and a binary classifier is trained using the machine learning algorithm SVM.

5. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that feeding the data to be predicted into a binary classifier to obtain invasive organism data specifically comprises: the data to be predicted is input and the final marker is produced using the SVM; when the output marker is 1, the data uploaded by the user is true for that location at that time, meaning that invasive species have been present there and must be addressed immediately.

6. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that the method further comprises: using the GM model to predict future migration or reproduction patterns of invasive organisms from said time-space feature matrix.

7. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that the classifiers in said binary classifier comprise: random forest, logistic regression, and neural network.
NL2034409A 2022-05-25 2023-03-23 A biological invasion identification method based on multi-source data fusion analysis NL2034409A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210575412.2A CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis

Publications (1)

Publication Number Publication Date
NL2034409A true NL2034409A (en) 2023-05-19

Family

ID=82908603

Family Applications (2)

Application Number Title Priority Date Filing Date
NL2034214A NL2034214A (en) 2022-05-25 2023-02-23 A biological invasion identification method based on multi-source data fusion analysis
NL2034409A NL2034409A (en) 2022-05-25 2023-03-23 A biological invasion identification method based on multi-source data fusion analysis

Family Applications Before (1)

Application Number Title Priority Date Filing Date
NL2034214A NL2034214A (en) 2022-05-25 2023-02-23 A biological invasion identification method based on multi-source data fusion analysis

Country Status (2)

Country Link
CN (1) CN114943290B (en)
NL (2) NL2034214A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117109664B (en) * 2023-10-20 2023-12-22 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) Wetland ecological environment monitoring device and system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110110895A (en) * 2010-04-02 2011-10-10 제주대학교 산학협력단 System for fusing realtime image and context data by using position and time information
US9652915B2 (en) * 2014-02-28 2017-05-16 Honeywell International Inc. System and method having biometric identification intrusion and access control
US10846308B2 (en) * 2016-07-27 2020-11-24 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CN107832718B (en) * 2017-11-13 2020-06-05 重庆工商大学 Finger vein anti-counterfeiting identification method and system based on self-encoder
CA2992333C (en) * 2018-01-19 2020-06-02 Nymi Inc. User access authorization system and method, and physiological user sensor and authentication device therefor
CN109165387A (en) * 2018-09-20 2019-01-08 南京信息工程大学 A kind of Chinese comment sentiment analysis method based on GRU neural network
CN109347863B (en) * 2018-11-21 2021-04-06 成都城电电力工程设计有限公司 Improved immune network abnormal behavior detection method
CN109934354A (en) * 2019-03-12 2019-06-25 北京信息科技大学 Abnormal deviation data examination method based on Active Learning
CN111046946B (en) * 2019-12-10 2021-03-02 昆明理工大学 Burma language image text recognition method based on CRNN
CN112990262B (en) * 2021-02-08 2022-11-22 内蒙古大学 Integrated solution system for monitoring and intelligent decision of grassland ecological data
CN113343770B (en) * 2021-05-12 2022-04-29 武汉大学 Face anti-counterfeiting method based on feature screening
CN113537355A (en) * 2021-07-19 2021-10-22 金鹏电子信息机器有限公司 Multi-element heterogeneous data semantic fusion method and system for security monitoring
CN113793405A (en) * 2021-09-15 2021-12-14 杭州睿胜软件有限公司 Method, computer system and storage medium for presenting distribution of plants
CN113822233B (en) * 2021-11-22 2022-03-22 青岛杰瑞工控技术有限公司 Method and system for tracking abnormal fishes cultured in deep sea

Also Published As

Publication number Publication date
NL2034214A (en) 2023-05-19
CN114943290B (en) 2023-08-08
CN114943290A (en) 2022-08-26
